Kahibaro

Automating text processing

Why Text Processing Is Perfect for Automation

Many everyday tasks involve working with text:

  - cleaning up messy files and extra whitespace
  - finding or filtering lines in logs and lists
  - doing the same find-and-replace across many documents
  - generating messages and reports from data

Doing these manually is slow and error-prone. Python can read text, transform it, and write out new text automatically. In this chapter, you will see practical patterns for automating common text-processing tasks.

We’ll assume you already know how to:

  - open, read, and write text files
  - use basic string methods
  - write loops, conditionals, and functions

Here we focus on automating text-related tasks using those skills.

Reading and Writing Text for Automation

When automating, you often follow this pattern:

  1. Read some text (from a file, user input, or another source)
  2. Process/transform the text
  3. Save or display the result

A typical file-based pattern looks like this:

# read from one file, write transformed text to another
with open("input.txt", "r", encoding="utf-8") as f_in:
    text = f_in.read()
# process the text somehow
processed_text = text.upper()  # example transformation
with open("output.txt", "w", encoding="utf-8") as f_out:
    f_out.write(processed_text)

The interesting part is what happens in the “process the text” step. That is what the rest of this chapter focuses on.

Common Text Transformations

Changing Case

Changing case is useful for:

  - normalizing data such as emails or usernames
  - case-insensitive comparison and searching
  - formatting titles and headings

Useful methods:

  - lower() converts to lowercase
  - upper() converts to uppercase
  - title() capitalizes each word
  - capitalize() capitalizes only the first character

Example: normalize emails to lowercase:

email = "Alice.Example@Email.COM"
normalized_email = email.lower()
print(normalized_email)  # alice.example@email.com
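The other case methods behave similarly; a quick sketch (the example strings are just illustrations):

```python
heading = "annual report draft"
print(heading.upper())             # ANNUAL REPORT DRAFT
print(heading.title())             # Annual Report Draft
print("MiXeD CaSe".lower())        # mixed case
print("hello world".capitalize())  # Hello world
```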

Stripping and Cleaning Whitespace

Text often contains extra spaces, tabs, or newlines that you don’t want.

Common methods:

  - strip() removes leading and trailing whitespace
  - lstrip() removes leading whitespace only
  - rstrip() removes trailing whitespace only

Example: cleaning up lines from a file:

cleaned_lines = []
with open("raw_lines.txt", "r", encoding="utf-8") as f:
    for line in f:
        cleaned = line.strip()  # remove leading/trailing spaces and \n
        if cleaned:             # skip empty lines
            cleaned_lines.append(cleaned)
with open("cleaned_lines.txt", "w", encoding="utf-8") as f:
    for line in cleaned_lines:
        f.write(line + "\n")

Replacing and Fixing Text

You can automatically fix common mistakes or convert from one format to another.

text = "Ths is a smple txt with erors."
# very simple replacement corrections
text = text.replace("Ths", "This")
text = text.replace("smple", "simple")
text = text.replace("txt", "text")
text = text.replace("erors", "errors")
print(text)

You can combine many replacements into a loop:

text = "aple banan chery"
corrections = {
    "aple": "apple",
    "banan": "banana",
    "chery": "cherry"
}
for wrong, right in corrections.items():
    text = text.replace(wrong, right)
print(text)  # apple banana cherry

This pattern is useful for bulk corrections across many documents.

Working with Lines and Records

Many text files are line-based (logs, CSV exports, plain text lists). Automating them usually means:

  - splitting the text into lines
  - filtering out the lines you do not need
  - transforming each remaining line

Splitting Text into Lines

You can split the whole text into a list of lines using splitlines():

with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()
lines = text.splitlines()
print(len(lines), "lines found")

Or simply iterate directly over the file object (which yields lines):

with open("notes.txt", "r", encoding="utf-8") as f:
    for line in f:
        print("Line:", line.strip())

Filtering Lines

Automation often means “find only the lines I care about”.

Example: filter lines containing a keyword:

keyword = "ERROR"
matching_lines = []
with open("application.log", "r", encoding="utf-8") as f:
    for line in f:
        if keyword in line:
            matching_lines.append(line)
with open("errors_only.log", "w", encoding="utf-8") as f:
    f.writelines(matching_lines)

You can easily extend this to multiple keywords:

keywords = ["ERROR", "WARNING"]
with open("application.log", "r", encoding="utf-8") as f_in, \
     open("important.log", "w", encoding="utf-8") as f_out:
    for line in f_in:
        if any(k in line for k in keywords):
            f_out.write(line)

Transforming Lines

You may want to modify each line in some way.

Example: add line numbers:

with open("original.txt", "r", encoding="utf-8") as f_in, \
     open("numbered.txt", "w", encoding="utf-8") as f_out:
    for i, line in enumerate(f_in, start=1):
        clean = line.rstrip("\n")
        new_line = f"{i:03}: {clean}\n"  # 001:, 002:, etc.
        f_out.write(new_line)

Example: convert comma-separated values to tab-separated:

with open("data.csv", "r", encoding="utf-8") as f_in, \
     open("data.tsv", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.rstrip("\n")
        parts = line.split(",")
        f_out.write("\t".join(parts) + "\n")

Splitting and Joining Text

Breaking text into parts and joining it back together is a core text automation skill.

Splitting with `split()`

split() turns a string into a list. With no argument it splits on any run of whitespace; with a separator argument it splits on that exact separator.
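The difference between the two forms matters for messy input; a quick sketch:

```python
sentence = "one two  three"   # note the double space
print(sentence.split())       # ['one', 'two', 'three'] (any whitespace)
print(sentence.split(" "))    # ['one', 'two', '', 'three'] (exact separator)
print("a,b,c".split(","))     # ['a', 'b', 'c']
```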

Example: split full names into first and last names:

names = [
    "Alice Smith",
    "Bob Johnson",
    "Charlie Doe"
]
for name in names:
    parts = name.split(" ")
    first = parts[0]
    last = parts[-1]
    print("First:", first, "| Last:", last)

Joining with `join()`

separator.join(list_of_strings) combines elements into a single string.

Example: build a CSV line:

values = ["Alice", "30", "Engineer"]
line = ",".join(values)
print(line)  # Alice,30,Engineer

Example: reconstruct text from cleaned words:

raw = "This   text  has   extra   spaces."
words = raw.split()  # split on any whitespace
clean = " ".join(words)
print(clean)  # This text has extra spaces.

Searching in Text

When automating, you often need to answer questions like:

  - Does this text contain a certain word?
  - Where does a marker appear in a line?
  - How many times does a word occur?

Useful operations:

  - the in operator checks whether a substring is present
  - find() returns the position of a substring, or -1 if it is absent
  - count() counts non-overlapping occurrences

Example: count how many times a word appears in a file:

word = "Python"
count = 0
with open("article.txt", "r", encoding="utf-8") as f:
    for line in f:
        count += line.count(word)
print(f"{word} appears {count} times.")

Example: extract part of a line after a marker:

line = "User: alice | Role: admin | Active: yes"
marker = "Role: "
start_index = line.find(marker)
if start_index != -1:
    start_index += len(marker)
    # role ends at the next " | " or end of line
    end_index = line.find(" | ", start_index)
    if end_index == -1:
        end_index = len(line)
    role = line[start_index:end_index]
    print("Role is:", role)

This kind of “find marker, then slice” is common when you are extracting information from structured but not strictly formatted text (like logs).
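If you use this pattern often, it is worth wrapping in a small helper function (the name extract_after is just a suggestion):

```python
def extract_after(line, marker, end=" | "):
    """Return the text between marker and the next end separator (or line end)."""
    start = line.find(marker)
    if start == -1:
        return None  # marker not found
    start += len(marker)
    stop = line.find(end, start)
    if stop == -1:
        stop = len(line)
    return line[start:stop]

line = "User: alice | Role: admin | Active: yes"
print(extract_after(line, "Role: "))    # admin
print(extract_after(line, "Active: "))  # yes
```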

Simple Text Normalization

“Normalizing” means converting text into a consistent, predictable format.

Common normalization steps:

  - converting to lowercase
  - stripping leading and trailing whitespace
  - collapsing internal runs of whitespace into single spaces

Example: normalize for comparison:

def normalize(text):
    text = text.lower()
    text = text.strip()
    words = text.split()
    return " ".join(words)
a = "  Hello   World "
b = "hello world"
print(normalize(a) == normalize(b))  # True

Generating Text Automatically

Automation is not only about cleaning up text; it can also generate text.

Using Templates with Placeholders

A simple pattern is to write a “template” with placeholders, then fill them in.

Example: generating personalized messages:

template = (
    "Hello {name},\n"
    "Thank you for purchasing {product}.\n"
    "Your order number is {order_id}.\n"
)
orders = [
    {"name": "Alice", "product": "Notebook", "order_id": "A001"},
    {"name": "Bob", "product": "Pen", "order_id": "A002"},
]
for order in orders:
    message = template.format(
        name=order["name"],
        product=order["product"],
        order_id=order["order_id"]
    )
    filename = f"message_{order['order_id']}.txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(message)

This technique is very useful for:

  - personalized emails and messages
  - form letters and notifications
  - generating many similar files from a list of records

Combining Data with Text (Simple Reports)

You can summarize data into human-readable text.

Example: simple word count report:

def word_count(text):
    words = text.split()
    return len(words)
with open("article.txt", "r", encoding="utf-8") as f:
    text = f.read()
report = []
report.append("Text Analysis Report")
report.append("====================")
report.append(f"Characters (including spaces): {len(text)}")
report.append(f"Words: {word_count(text)}")
report.append(f"Lines: {len(text.splitlines())}")
report_text = "\n".join(report)
with open("report.txt", "w", encoding="utf-8") as f:
    f.write(report_text)

This is a simple example of using automated text processing to generate a summary document.

Basic Batch Processing of Multiple Files

Automation becomes powerful when you apply it to a folder of text files instead of just one.

You can combine file/folder automation with text processing:

  - list the files in a folder
  - apply the same text transformation to each file
  - write the results to a new file or folder

Example: clean all .txt files in a folder by removing extra spaces:

import os
input_folder = "raw_texts"
output_folder = "clean_texts"
os.makedirs(output_folder, exist_ok=True)
for filename in os.listdir(input_folder):
    if not filename.endswith(".txt"):
        continue
    in_path = os.path.join(input_folder, filename)
    out_path = os.path.join(output_folder, filename)
    with open(in_path, "r", encoding="utf-8") as f_in:
        text = f_in.read()
    # normalize spaces
    words = text.split()
    clean_text = " ".join(words)
    with open(out_path, "w", encoding="utf-8") as f_out:
        f_out.write(clean_text)
    print("Processed:", filename)

This pattern lets you automatically transform large numbers of text files at once.
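The same loop can also be written with the standard pathlib module, which handles path joining for you. This sketch creates its own demo input so it runs standalone; the folder names match the example above:

```python
from pathlib import Path

# set up a small demo input so the sketch is self-contained
input_folder = Path("raw_texts")
input_folder.mkdir(exist_ok=True)
(input_folder / "sample.txt").write_text("Some   messy    text\nwith  extra spaces",
                                         encoding="utf-8")

output_folder = Path("clean_texts")
output_folder.mkdir(exist_ok=True)

for in_path in input_folder.glob("*.txt"):
    text = in_path.read_text(encoding="utf-8")
    clean_text = " ".join(text.split())  # collapse all runs of whitespace
    (output_folder / in_path.name).write_text(clean_text, encoding="utf-8")
    print("Processed:", in_path.name)
```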

Practical Mini-Automation Examples

Example 1: Automatically Add a Disclaimer to All Files

Task: Add the same disclaimer at the top of every .txt file in a folder.

import os
folder = "documents"
disclaimer = (
    "DISCLAIMER: This is an automated note.\n"
    "Please review for accuracy.\n\n"
)
for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue
    path = os.path.join(folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        original = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(disclaimer + original)
    print("Updated:", filename)

Example 2: Simple “Find and Replace” Tool for a Folder

Task: Replace a company name or term everywhere across many files.

import os
folder = "docs"
old = "OldCompany"
new = "NewCompany"
for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue
    path = os.path.join(folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    if old not in text:
        continue  # skip files without the old name
    updated_text = text.replace(old, new)
    with open(path, "w", encoding="utf-8") as f:
        f.write(updated_text)
    print("Replaced in:", filename)

Example 3: Extract Email Addresses from Text

Without going deep into regular expressions, you can still do simple pattern-based extraction.

This example looks for words that “look like” emails (contain @ and .):

emails = set()
with open("contacts.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        for part in parts:
            if "@" in part and "." in part:
                # basic cleanup of punctuation around the email
                candidate = part.strip(".,;:()[]<>\"'")
                emails.add(candidate)
with open("emails_found.txt", "w", encoding="utf-8") as f:
    for email in sorted(emails):
        f.write(email + "\n")
print("Found", len(emails), "unique email addresses.")

This is a basic, but often good-enough, automation for pulling contact information from messy text.

Tips for Reliable Text Automation

Automating text can easily go wrong if you are not careful. Here are some simple guidelines:

  - always pass an explicit encoding (such as utf-8) when opening files
  - test your script on a copy of the data before touching the originals
  - print what the script is doing so you can spot problems early
  - while developing, write to a new file instead of overwriting the input

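One habit that helps here is a "dry run" mode: first report what would change, and only write once the output looks right. A sketch, with the folder and terms as placeholders (it creates its own demo file so it runs standalone):

```python
import os

folder = "docs"
os.makedirs(folder, exist_ok=True)
# demo file so the sketch can run standalone
with open(os.path.join(folder, "demo.txt"), "w", encoding="utf-8") as f:
    f.write("OldCompany was founded in 1990. OldCompany makes pens.")

old, new = "OldCompany", "NewCompany"
dry_run = True  # flip to False to actually write the changes

for filename in os.listdir(folder):
    if not filename.endswith(".txt"):
        continue
    path = os.path.join(folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    hits = text.count(old)
    if hits == 0:
        continue
    if dry_run:
        print(f"Would replace {hits} occurrence(s) in {filename}")
    else:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.replace(old, new))
```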
Practice Ideas

To get comfortable with automating text processing, try writing small scripts that:

  - count the words, lines, and characters in a file
  - copy only the lines containing a keyword into a new file
  - remove extra whitespace from every .txt file in a folder
  - generate personalized messages from a template and a list of records

Each of these exercises uses the same core ideas:

  1. Read text (from one or more files)
  2. Process/transform/filter it
  3. Write the result automatically

Once you are comfortable with these patterns, you can handle a wide range of text-related tasks with Python automation.
