Why Text Processing Is Perfect for Automation
Many everyday tasks involve working with text:
- Cleaning up data exported from another program
- Renaming or reorganizing content
- Extracting key information from logs, notes, or reports
- Generating reports or summaries
Doing these manually is slow and error-prone. Python can read text, transform it, and write out new text automatically. In this chapter, you will see practical patterns for automating common text-processing tasks.
We’ll assume you already know how to:
- Read and write files
- Use basic string operations
- Run Python scripts
Here we focus on automating text-related tasks using those skills.
Reading and Writing Text for Automation
When automating, you often follow this pattern:
- Read some text (from a file, user input, or another source)
- Process/transform the text
- Save or display the result
A typical file-based pattern looks like this:
# read from one file, write transformed text to another
with open("input.txt", "r", encoding="utf-8") as f_in:
    text = f_in.read()

# process the text somehow
processed_text = text.upper()  # example transformation

with open("output.txt", "w", encoding="utf-8") as f_out:
    f_out.write(processed_text)

The interesting part is what happens in the “process the text” step. That is what the rest of this chapter focuses on.
Common Text Transformations
Changing Case
Changing case is useful for:
- Normalizing data (for comparisons or searching)
- Making output consistent
Useful methods:
- text.lower() – all lowercase
- text.upper() – all uppercase
- text.title() – first letter of each word capitalized
- text.capitalize() – first character capitalized, rest lowercase
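The difference between title() and capitalize() is easiest to see side by side (the sample string here is my own):

```python
heading = "monthly sales report"

print(heading.upper())       # MONTHLY SALES REPORT
print(heading.title())       # Monthly Sales Report
print(heading.capitalize())  # Monthly sales report
```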
Example: normalize emails to lowercase:
email = "Alice.Example@Email.COM"
normalized_email = email.lower()
print(normalized_email)  # alice.example@email.com

Stripping and Cleaning Whitespace
Text often contains extra spaces, tabs, or newlines that you don’t want.
Common methods:
- text.strip() – remove whitespace from both ends
- text.lstrip() – remove from the left
- text.rstrip() – remove from the right
- text.replace(old, new) – replace one substring with another
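A quick comparison of the three strip variants on a made-up messy value (repr() makes the remaining whitespace visible):

```python
raw = "   42 items \n"

print(repr(raw.strip()))   # '42 items'
print(repr(raw.lstrip()))  # '42 items \n'
print(repr(raw.rstrip()))  # '   42 items'
```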
Example: cleaning up lines from a file:
cleaned_lines = []

with open("raw_lines.txt", "r", encoding="utf-8") as f:
    for line in f:
        cleaned = line.strip()  # remove leading/trailing spaces and \n
        if cleaned:             # skip empty lines
            cleaned_lines.append(cleaned)

with open("cleaned_lines.txt", "w", encoding="utf-8") as f:
    for line in cleaned_lines:
        f.write(line + "\n")

Replacing and Fixing Text
You can automatically fix common mistakes or convert from one format to another.
text = "Ths is a smple txt with erors."

# very simple replacement corrections
text = text.replace("Ths", "This")
text = text.replace("smple", "simple")
text = text.replace("txt", "text")
text = text.replace("erors", "errors")

print(text)

You can combine many replacements into a loop:
text = "aple banan chery"

corrections = {
    "aple": "apple",
    "banan": "banana",
    "chery": "cherry",
}

for wrong, right in corrections.items():
    text = text.replace(wrong, right)

print(text)  # apple banana cherry

This pattern is useful for bulk corrections across many documents.
Working with Lines and Records
Many text files are line-based (logs, CSV exports, plain text lists). Automating them usually means:
- Looping over lines
- Transforming each line
- Writing out the result
Splitting Text into Lines
You can split the whole text into a list of lines using splitlines():
with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()

lines = text.splitlines()
print(len(lines), "lines found")

Or simply iterate directly over the file object (which yields lines):
with open("notes.txt", "r", encoding="utf-8") as f:
    for line in f:
        print("Line:", line.strip())

Filtering Lines
Automation often means “find only the lines I care about”.
Example: filter lines containing a keyword:
keyword = "ERROR"
matching_lines = []

with open("application.log", "r", encoding="utf-8") as f:
    for line in f:
        if keyword in line:
            matching_lines.append(line)

with open("errors_only.log", "w", encoding="utf-8") as f:
    f.writelines(matching_lines)

You can easily extend this to multiple keywords:
keywords = ["ERROR", "WARNING"]

with open("application.log", "r", encoding="utf-8") as f_in, \
     open("important.log", "w", encoding="utf-8") as f_out:
    for line in f_in:
        if any(k in line for k in keywords):
            f_out.write(line)

Transforming Lines
You may want to modify each line in some way.
Example: add line numbers:
with open("original.txt", "r", encoding="utf-8") as f_in, \
     open("numbered.txt", "w", encoding="utf-8") as f_out:
    for i, line in enumerate(f_in, start=1):
        clean = line.rstrip("\n")
        new_line = f"{i:03}: {clean}\n"  # 001:, 002:, etc.
        f_out.write(new_line)

Example: convert comma-separated values to tab-separated:
with open("data.csv", "r", encoding="utf-8") as f_in, \
     open("data.tsv", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.rstrip("\n")
        parts = line.split(",")
        f_out.write("\t".join(parts) + "\n")

Splitting and Joining Text
Breaking text into parts and joining it back together is a core text automation skill.
Splitting with `split()`
split() turns a string into a list:
- text.split() – split on any whitespace
- text.split(",") – split on a specific separator
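The two forms behave differently when separators repeat: split() with no argument collapses runs of whitespace, while split with an explicit separator preserves empty fields. A quick demonstration (sample strings are my own):

```python
messy = "a  b\tc"    # repeated spaces and a tab
csv_row = "a,,b"     # an empty field between the commas

print(messy.split())       # ['a', 'b', 'c']   – whitespace runs collapse
print(messy.split(" "))    # ['a', '', 'b\tc'] – literal-space split keeps empties
print(csv_row.split(","))  # ['a', '', 'b']    – empty fields are preserved
```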
Example: split full names into first and last names:
names = [
    "Alice Smith",
    "Bob Johnson",
    "Charlie Doe",
]

for name in names:
    parts = name.split(" ")
    first = parts[0]
    last = parts[-1]
    print("First:", first, "| Last:", last)

Joining with `join()`
separator.join(list_of_strings) combines elements into a single string.
Example: build a CSV line:
values = ["Alice", "30", "Engineer"]
line = ",".join(values)
print(line)  # Alice,30,Engineer

Example: reconstruct text from cleaned words:
raw = "This   text  has  extra    spaces."
words = raw.split()  # split on any whitespace
clean = " ".join(words)
print(clean)  # This text has extra spaces.

Searching in Text
When automating, you often need to answer questions like:
- “Does this line contain a certain word?”
- “Where does this word first appear?”
- “How many times does this phrase occur?”
Useful operations:
- substring in text – check presence (returns True/False)
- text.find(substring) – first index, or -1 if not found
- text.count(substring) – number of occurrences
Example: count how many times a word appears in a file:
word = "Python"
count = 0

with open("article.txt", "r", encoding="utf-8") as f:
    for line in f:
        count += line.count(word)

print(f"{word} appears {count} times.")

Example: extract part of a line after a marker:
line = "User: alice | Role: admin | Active: yes"
marker = "Role: "

start_index = line.find(marker)
if start_index != -1:
    start_index += len(marker)
    # role ends at the next " | " or end of line
    end_index = line.find(" | ", start_index)
    if end_index == -1:
        end_index = len(line)
    role = line[start_index:end_index]
    print("Role is:", role)

This kind of “find marker, then slice” is common when you are extracting information from structured but not strictly formatted text (like logs).
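If you extract several fields this way, it is worth wrapping the marker-and-slice steps in a small helper. The function name and defaults below are my own:

```python
def extract_after(line, marker, end=" | "):
    """Return the text between `marker` and the next `end` (or end of line),
    or None if the marker is absent."""
    start = line.find(marker)
    if start == -1:
        return None
    start += len(marker)
    stop = line.find(end, start)
    if stop == -1:
        stop = len(line)
    return line[start:stop]

line = "User: alice | Role: admin | Active: yes"
print(extract_after(line, "Role: "))    # admin
print(extract_after(line, "Active: "))  # yes
print(extract_after(line, "Email: "))   # None
```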
Simple Text Normalization
“Normalizing” means converting text into a consistent, predictable format.
Common normalization steps:
- Lowercase everything: text.lower()
- Remove extra spaces: split() then " ".join(...)
- Replace special characters (e.g., convert – to -)
- Remove or replace punctuation for simple analysis
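The punctuation step can be sketched with str.translate and the standard library's string.punctuation constant; the function name here is my own:

```python
import string

def strip_punctuation(text):
    # str.maketrans with a third argument maps each of those
    # characters to None, i.e. deletes them from the text
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("Hello, world! (Draft #2)"))  # Hello world Draft 2
```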
Example: normalize for comparison:
def normalize(text):
    text = text.lower()
    text = text.strip()
    words = text.split()
    return " ".join(words)

a = " Hello World "
b = "hello world"
print(normalize(a) == normalize(b))  # True

Generating Text Automatically
Automation is not only about cleaning up text; it can also generate text.
Using Templates with Placeholders
A simple pattern is to write a “template” with placeholders, then fill them in.
Example: generating personalized messages:
template = (
    "Hello {name},\n"
    "Thank you for purchasing {product}.\n"
    "Your order number is {order_id}.\n"
)

orders = [
    {"name": "Alice", "product": "Notebook", "order_id": "A001"},
    {"name": "Bob", "product": "Pen", "order_id": "A002"},
]

for order in orders:
    message = template.format(
        name=order["name"],
        product=order["product"],
        order_id=order["order_id"],
    )
    filename = f"message_{order['order_id']}.txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(message)

This technique is very useful for:
- Batch generating emails (saved as text)
- Creating multiple similar reports or letters
- Producing configuration files from data
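When the templates themselves come from users or configuration files, the standard library's string.Template is a common alternative to format(): it uses $-placeholders and raises a clear error on missing keys. A minimal sketch:

```python
from string import Template

template = Template("Hello $name, your order number is $order_id.")
message = template.substitute(name="Alice", order_id="A001")
print(message)  # Hello Alice, your order number is A001.
```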
Combining Data with Text (Simple Reports)
You can summarize data into human-readable text.
Example: simple word count report:
def word_count(text):
    words = text.split()
    return len(words)

with open("article.txt", "r", encoding="utf-8") as f:
    text = f.read()

report = []
report.append("Text Analysis Report")
report.append("====================")
report.append(f"Characters (including spaces): {len(text)}")
report.append(f"Words: {word_count(text)}")
report.append(f"Lines: {len(text.splitlines())}")

report_text = "\n".join(report)

with open("report.txt", "w", encoding="utf-8") as f:
    f.write(report_text)

This is a simple example of using automated text processing to generate a summary document.
Basic Batch Processing of Multiple Files
Automation becomes powerful when you apply it to a folder of text files instead of just one.
You can combine file/folder automation with text processing:
- Loop over files in a directory
- For each file, read text, process it, write output
Example: clean all .txt files in a folder by removing extra spaces:
import os

input_folder = "raw_texts"
output_folder = "clean_texts"

os.makedirs(output_folder, exist_ok=True)

for filename in os.listdir(input_folder):
    if not filename.endswith(".txt"):
        continue

    in_path = os.path.join(input_folder, filename)
    out_path = os.path.join(output_folder, filename)

    with open(in_path, "r", encoding="utf-8") as f_in:
        text = f_in.read()

    # normalize spaces
    words = text.split()
    clean_text = " ".join(words)

    with open(out_path, "w", encoding="utf-8") as f_out:
        f_out.write(clean_text)

    print("Processed:", filename)

This pattern lets you automatically transform large numbers of text files at once.
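The same batch loop can also be written with the standard library's pathlib, which handles path joining, filtering, and reading in fewer lines. A sketch using the same folder names as above:

```python
from pathlib import Path

input_folder = Path("raw_texts")
output_folder = Path("clean_texts")
output_folder.mkdir(exist_ok=True)

for in_path in input_folder.glob("*.txt"):  # only .txt files
    text = in_path.read_text(encoding="utf-8")
    clean_text = " ".join(text.split())     # normalize spaces
    (output_folder / in_path.name).write_text(clean_text, encoding="utf-8")
    print("Processed:", in_path.name)
```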
Practical Mini-Automation Examples
Example 1: Automatically Add a Disclaimer to All Files
Task: Add the same disclaimer at the top of every .txt file in a folder.
import os

folder = "documents"
disclaimer = (
    "DISCLAIMER: This is an automated note.\n"
    "Please review for accuracy.\n\n"
)

for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue

    path = os.path.join(folder, filename)

    with open(path, "r", encoding="utf-8") as f:
        original = f.read()

    with open(path, "w", encoding="utf-8") as f:
        f.write(disclaimer + original)

    print("Updated:", filename)

Example 2: Simple “Find and Replace” Tool for a Folder
Task: Replace a company name or term everywhere across many files.
import os

folder = "docs"
old = "OldCompany"
new = "NewCompany"

for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue

    path = os.path.join(folder, filename)

    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    if old not in text:
        continue  # skip files without the old name

    updated_text = text.replace(old, new)

    with open(path, "w", encoding="utf-8") as f:
        f.write(updated_text)

    print("Replaced in:", filename)

Example 3: Extract Email Addresses from Text
Without going deep into regular expressions, you can still do simple pattern-based extraction.
This example looks for words that “look like” emails (contain @ and .):
emails = set()

with open("contacts.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        for part in parts:
            if "@" in part and "." in part:
                # basic cleanup of punctuation around the email
                candidate = part.strip(".,;:()[]<>\"'")
                emails.add(candidate)

with open("emails_found.txt", "w", encoding="utf-8") as f:
    for email in sorted(emails):
        f.write(email + "\n")

print("Found", len(emails), "unique email addresses.")

This is a basic, but often good-enough, automation for pulling contact information from messy text.
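When you are ready for regular expressions, the same extraction can be done with re.findall. The pattern below is a deliberately rough approximation of an email address, not a full validator:

```python
import re

text = "Contact alice@example.com or bob.smith@mail.org (not: @invalid)."

# rough pattern: word chars/dots/pluses/dashes, an @, then a dotted domain
pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
emails = re.findall(pattern, text)
print(emails)  # ['alice@example.com', 'bob.smith@mail.org']
```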
Tips for Reliable Text Automation
Automating text can easily go wrong if you are not careful. Here are some simple guidelines:
- Work on copies first: when changing many files, operate on a copy of the folder or write results to a new output folder.
- Test on a few small examples: before running on hundreds of files, manually check the output for 1–2 files.
- Print sample output: add temporary print() statements to inspect intermediate results (for example, print(clean_text[:200])).
- Be explicit about encoding: use encoding="utf-8" when opening files to avoid many common issues.
- Log what your script does: print or log filenames and actions so you can review what happened if something goes wrong.
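The first tip can itself be automated: shutil.copytree snapshots a folder before your script edits anything in place. The folder names here are hypothetical:

```python
import shutil
from pathlib import Path

source = Path("documents")
backup = Path("documents_backup")

# make a one-time backup before any in-place edits
if source.is_dir() and not backup.exists():
    shutil.copytree(source, backup)
    print("Backup created at", backup)
```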
Practice Ideas
To get comfortable with automating text processing, try writing small scripts that:
- Convert all text in a file to lowercase and remove extra spaces
- Count how many times each word appears in a file and output the top 10
- Rename all .txt files in a folder based on part of their contents (for example, the first line of the file)
- Merge all .txt files in a folder into one big file, separated by headers
- Create personalized “certificates” or “letters” from a list of names
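As a starting point for the word-frequency exercise, collections.Counter from the standard library does most of the work (sample text is my own):

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"
words = text.lower().split()

# most_common(n) returns the n highest-count (word, count) pairs;
# use 10 for the exercise
top = Counter(words).most_common(3)
print(top)
```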
Each of these exercises uses the same core ideas:
- Read text (from one or more files)
- Process/transform/filter it
- Write the result automatically
Once you are comfortable with these patterns, you can handle a wide range of text-related tasks with Python automation.