Kahibaro

Automating text processing

Why Text Processing Is Perfect for Automation

Many everyday tasks involve working with text:

  - cleaning up messy files and extra whitespace
  - finding or filtering lines in logs and lists
  - doing the same find-and-replace across many documents
  - generating messages and reports from data

Doing these manually is slow and error-prone. Python can read text, transform it, and write out new text automatically. In this chapter, you will see practical patterns for automating common text-processing tasks.

We’ll assume you already know how to:

  - open, read, and write text files
  - use basic string methods
  - write loops, conditionals, and functions

Here we focus on automating text-related tasks using those skills.

Reading and Writing Text for Automation

When automating, you often follow this pattern:

  1. Read some text (from a file, user input, or another source)
  2. Process/transform the text
  3. Save or display the result

A typical file-based pattern looks like this:

# read from one file, write transformed text to another
with open("input.txt", "r", encoding="utf-8") as f_in:
    text = f_in.read()
# process the text somehow
processed_text = text.upper()  # example transformation
with open("output.txt", "w", encoding="utf-8") as f_out:
    f_out.write(processed_text)

The interesting part is what happens in the “process the text” step. That is what the rest of this chapter focuses on.

Common Text Transformations

Changing Case

Changing case is useful for:

  - normalizing data such as emails or usernames
  - case-insensitive comparison and searching
  - formatting titles and headings

Useful methods:

  - lower() converts to lowercase
  - upper() converts to uppercase
  - title() capitalizes each word
  - capitalize() capitalizes only the first character

Example: normalize emails to lowercase:

email = "Alice.Example@Email.COM"
normalized_email = email.lower()
print(normalized_email)  # alice.example@email.com
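The other case methods behave similarly; a quick sketch (the example strings are just illustrations):

```python
heading = "annual report draft"
print(heading.upper())             # ANNUAL REPORT DRAFT
print(heading.title())             # Annual Report Draft
print("MiXeD CaSe".lower())        # mixed case
print("hello world".capitalize())  # Hello world
```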

Stripping and Cleaning Whitespace

Text often contains extra spaces, tabs, or newlines that you don’t want.

Common methods:

  - strip() removes leading and trailing whitespace
  - lstrip() removes leading whitespace only
  - rstrip() removes trailing whitespace only

Example: cleaning up lines from a file:

cleaned_lines = []
with open("raw_lines.txt", "r", encoding="utf-8") as f:
    for line in f:
        cleaned = line.strip()  # remove leading/trailing spaces and \n
        if cleaned:             # skip empty lines
            cleaned_lines.append(cleaned)
with open("cleaned_lines.txt", "w", encoding="utf-8") as f:
    for line in cleaned_lines:
        f.write(line + "\n")

Replacing and Fixing Text

You can automatically fix common mistakes or convert from one format to another.

text = "Ths is a smple txt with erors."
# very simple replacement corrections
text = text.replace("Ths", "This")
text = text.replace("smple", "simple")
text = text.replace("txt", "text")
text = text.replace("erors", "errors")
print(text)

You can combine many replacements into a loop:

text = "aple banan chery"
corrections = {
    "aple": "apple",
    "banan": "banana",
    "chery": "cherry"
}
for wrong, right in corrections.items():
    text = text.replace(wrong, right)
print(text)  # apple banana cherry

This pattern is useful for bulk corrections across many documents.

Working with Lines and Records

Many text files are line-based (logs, CSV exports, plain text lists). Automating them usually means:

  - splitting the text into lines
  - filtering out the lines you do not need
  - transforming each remaining line

Splitting Text into Lines

You can split the whole text into a list of lines using splitlines():

with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()
lines = text.splitlines()
print(len(lines), "lines found")

Or simply iterate directly over the file object (which yields lines):

with open("notes.txt", "r", encoding="utf-8") as f:
    for line in f:
        print("Line:", line.strip())

Filtering Lines

Automation often means “find only the lines I care about”.

Example: filter lines containing a keyword:

keyword = "ERROR"
matching_lines = []
with open("application.log", "r", encoding="utf-8") as f:
    for line in f:
        if keyword in line:
            matching_lines.append(line)
with open("errors_only.log", "w", encoding="utf-8") as f:
    f.writelines(matching_lines)

You can easily extend this to multiple keywords:

keywords = ["ERROR", "WARNING"]
with open("application.log", "r", encoding="utf-8") as f_in, \
     open("important.log", "w", encoding="utf-8") as f_out:
    for line in f_in:
        if any(k in line for k in keywords):
            f_out.write(line)

Transforming Lines

You may want to modify each line in some way.

Example: add line numbers:

with open("original.txt", "r", encoding="utf-8") as f_in, \
     open("numbered.txt", "w", encoding="utf-8") as f_out:
    for i, line in enumerate(f_in, start=1):
        clean = line.rstrip("\n")
        new_line = f"{i:03}: {clean}\n"  # 001:, 002:, etc.
        f_out.write(new_line)

Example: convert comma-separated values to tab-separated:

with open("data.csv", "r", encoding="utf-8") as f_in, \
     open("data.tsv", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.rstrip("\n")
        parts = line.split(",")
        f_out.write("\t".join(parts) + "\n")

Splitting and Joining Text

Breaking text into parts and joining it back together is a core text automation skill.

Splitting with `split()`

split() turns a string into a list. With no argument it splits on any run of whitespace; with a separator argument it splits on that exact separator.
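The difference between the two forms matters for messy input; a quick sketch:

```python
sentence = "one two  three"   # note the double space
print(sentence.split())       # ['one', 'two', 'three'] (any whitespace)
print(sentence.split(" "))    # ['one', 'two', '', 'three'] (exact separator)
print("a,b,c".split(","))     # ['a', 'b', 'c']
```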

Example: split full names into first and last names:

names = [
    "Alice Smith",
    "Bob Johnson",
    "Charlie Doe"
]
for name in names:
    parts = name.split(" ")
    first = parts[0]
    last = parts[-1]
    print("First:", first, "| Last:", last)

Joining with `join()`

separator.join(list_of_strings) combines elements into a single string.

Example: build a CSV line:

values = ["Alice", "30", "Engineer"]
line = ",".join(values)
print(line)  # Alice,30,Engineer

Example: reconstruct text from cleaned words:

raw = "This   text  has   extra   spaces."
words = raw.split()  # split on any whitespace
clean = " ".join(words)
print(clean)  # This text has extra spaces.

Searching in Text

When automating, you often need to answer questions like:

  - Does this text contain a certain word?
  - Where does a marker appear in a line?
  - How many times does a word occur?

Useful operations:

  - the in operator checks whether a substring is present
  - find() returns the position of a substring, or -1 if it is absent
  - count() counts non-overlapping occurrences

Example: count how many times a word appears in a file:

word = "Python"
count = 0
with open("article.txt", "r", encoding="utf-8") as f:
    for line in f:
        count += line.count(word)
print(f"{word} appears {count} times.")

Example: extract part of a line after a marker:

line = "User: alice | Role: admin | Active: yes"
marker = "Role: "
start_index = line.find(marker)
if start_index != -1:
    start_index += len(marker)
    # role ends at the next " | " or end of line
    end_index = line.find(" | ", start_index)
    if end_index == -1:
        end_index = len(line)
    role = line[start_index:end_index]
    print("Role is:", role)

This kind of “find marker, then slice” is common when you are extracting information from structured but not strictly formatted text (like logs).
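If you use this pattern often, it is worth wrapping in a small helper function (the name extract_after is just a suggestion):

```python
def extract_after(line, marker, end=" | "):
    """Return the text between marker and the next end separator (or line end)."""
    start = line.find(marker)
    if start == -1:
        return None  # marker not found
    start += len(marker)
    stop = line.find(end, start)
    if stop == -1:
        stop = len(line)
    return line[start:stop]

line = "User: alice | Role: admin | Active: yes"
print(extract_after(line, "Role: "))    # admin
print(extract_after(line, "Active: "))  # yes
```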

Simple Text Normalization

“Normalizing” means converting text into a consistent, predictable format.

Common normalization steps:

  - converting to lowercase
  - stripping leading and trailing whitespace
  - collapsing internal runs of whitespace into single spaces

Example: normalize for comparison:

def normalize(text):
    text = text.lower()
    text = text.strip()
    words = text.split()
    return " ".join(words)
a = "  Hello   World "
b = "hello world"
print(normalize(a) == normalize(b))  # True

Generating Text Automatically

Automation is not only about cleaning up text; it can also generate text.

Using Templates with Placeholders

A simple pattern is to write a “template” with placeholders, then fill them in.

Example: generating personalized messages:

template = (
    "Hello {name},\n"
    "Thank you for purchasing {product}.\n"
    "Your order number is {order_id}.\n"
)
orders = [
    {"name": "Alice", "product": "Notebook", "order_id": "A001"},
    {"name": "Bob", "product": "Pen", "order_id": "A002"},
]
for order in orders:
    message = template.format(
        name=order["name"],
        product=order["product"],
        order_id=order["order_id"]
    )
    filename = f"message_{order['order_id']}.txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(message)

This technique is very useful for:

  - personalized emails and messages
  - form letters and notifications
  - generating many similar files from a list of records

Combining Data with Text (Simple Reports)

You can summarize data into human-readable text.

Example: simple word count report:

def word_count(text):
    words = text.split()
    return len(words)
with open("article.txt", "r", encoding="utf-8") as f:
    text = f.read()
report = []
report.append("Text Analysis Report")
report.append("====================")
report.append(f"Characters (including spaces): {len(text)}")
report.append(f"Words: {word_count(text)}")
report.append(f"Lines: {len(text.splitlines())}")
report_text = "\n".join(report)
with open("report.txt", "w", encoding="utf-8") as f:
    f.write(report_text)

This is a simple example of using automated text processing to generate a summary document.

Basic Batch Processing of Multiple Files

Automation becomes powerful when you apply it to a folder of text files instead of just one.

You can combine file/folder automation with text processing:

  - list the files in a folder
  - apply the same text transformation to each file
  - write the results to a new file or folder

Example: clean all .txt files in a folder by removing extra spaces:

import os
input_folder = "raw_texts"
output_folder = "clean_texts"
os.makedirs(output_folder, exist_ok=True)
for filename in os.listdir(input_folder):
    if not filename.endswith(".txt"):
        continue
    in_path = os.path.join(input_folder, filename)
    out_path = os.path.join(output_folder, filename)
    with open(in_path, "r", encoding="utf-8") as f_in:
        text = f_in.read()
    # normalize spaces
    words = text.split()
    clean_text = " ".join(words)
    with open(out_path, "w", encoding="utf-8") as f_out:
        f_out.write(clean_text)
    print("Processed:", filename)

This pattern lets you automatically transform large numbers of text files at once.
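The same loop can also be written with the standard pathlib module, which handles path joining for you. This sketch creates its own demo input so it runs standalone; the folder names match the example above:

```python
from pathlib import Path

# set up a small demo input so the sketch is self-contained
input_folder = Path("raw_texts")
input_folder.mkdir(exist_ok=True)
(input_folder / "sample.txt").write_text("Some   messy    text\nwith  extra spaces",
                                         encoding="utf-8")

output_folder = Path("clean_texts")
output_folder.mkdir(exist_ok=True)

for in_path in input_folder.glob("*.txt"):
    text = in_path.read_text(encoding="utf-8")
    clean_text = " ".join(text.split())  # collapse all runs of whitespace
    (output_folder / in_path.name).write_text(clean_text, encoding="utf-8")
    print("Processed:", in_path.name)
```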

Practical Mini-Automation Examples

Example 1: Automatically Add a Disclaimer to All Files

Task: Add the same disclaimer at the top of every .txt file in a folder.

import os
folder = "documents"
disclaimer = (
    "DISCLAIMER: This is an automated note.\n"
    "Please review for accuracy.\n\n"
)
for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue
    path = os.path.join(folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        original = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(disclaimer + original)
    print("Updated:", filename)

Example 2: Simple “Find and Replace” Tool for a Folder

Task: Replace a company name or term everywhere across many files.

import os
folder = "docs"
old = "OldCompany"
new = "NewCompany"
for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue
    path = os.path.join(folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    if old not in text:
        continue  # skip files without the old name
    updated_text = text.replace(old, new)
    with open(path, "w", encoding="utf-8") as f:
        f.write(updated_text)
    print("Replaced in:", filename)

Example 3: Extract Email Addresses from Text

Without going deep into regular expressions, you can still do simple pattern-based extraction.

This example looks for words that “look like” emails (contain @ and .):

emails = set()
with open("contacts.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        for part in parts:
            if "@" in part and "." in part:
                # basic cleanup of punctuation around the email
                candidate = part.strip(".,;:()[]<>\"'")
                emails.add(candidate)
with open("emails_found.txt", "w", encoding="utf-8") as f:
    for email in sorted(emails):
        f.write(email + "\n")
print("Found", len(emails), "unique email addresses.")

This is a basic, but often good-enough, automation for pulling contact information from messy text.

Tips for Reliable Text Automation

Automating text can easily go wrong if you are not careful. Here are some simple guidelines:

  - always pass an explicit encoding (such as utf-8) when opening files
  - test your script on a copy of the data before touching the originals
  - print what the script is doing so you can spot problems early
  - while developing, write to a new file instead of overwriting the input

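One habit that helps here is a "dry run" mode: first report what would change, and only write once the output looks right. A sketch, with the folder and terms as placeholders (it creates its own demo file so it runs standalone):

```python
import os

folder = "docs"
os.makedirs(folder, exist_ok=True)
# demo file so the sketch can run standalone
with open(os.path.join(folder, "demo.txt"), "w", encoding="utf-8") as f:
    f.write("OldCompany was founded in 1990. OldCompany makes pens.")

old, new = "OldCompany", "NewCompany"
dry_run = True  # flip to False to actually write the changes

for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue
    path = os.path.join(folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    hits = text.count(old)
    if hits == 0:
        continue
    if dry_run:
        print(f"Would replace {hits} occurrence(s) in {filename}")
    else:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.replace(old, new))
```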
Practice Ideas

To get comfortable with automating text processing, try writing small scripts that:

  - count the words, lines, and characters in a file
  - copy only the lines containing a keyword into a new file
  - remove extra whitespace from every .txt file in a folder
  - generate personalized messages from a template and a list of records

Each of these exercises uses the same core ideas:

  1. Read text (from one or more files)
  2. Process/transform/filter it
  3. Write the result automatically

Once you are comfortable with these patterns, you can handle a wide range of text-related tasks with Python automation.
