Regular Expressions for Document Engineering

Gregory M. Kapfhammer

November 10, 2025

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents and industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • This week’s content aids attainment of CS-104-2, CS-104-3, and CS-104-4!

Creating regular expressions in Python

  • Define a regular expression pattern
  • Use the re module for regex operations
  • Compile patterns with re.compile()
  • Raw strings (r'') prevent escape confusion
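
The point about raw strings can be checked directly. The following is a minimal sketch, using a made-up sample sentence, that compares a normal string with a raw string when both contain `\b`:

```python
import re

# In a normal string, '\b' is the single backspace character (ASCII 8).
plain = '\bcat\b'
# In a raw string, r'\b' stays as backslash + 'b', which re reads as a word boundary.
raw = r'\bcat\b'

print(len(plain), len(raw))   # 5 7

text = 'the cat sat'          # hypothetical sample text
plain_match = re.search(plain, text)
raw_match = re.search(raw, text)

print(plain_match)            # None: the text contains no backspace characters
print(raw_match.group())      # cat
```

The normal string silently changes the meaning of the pattern, which is why raw strings are the safer default for regex patterns.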

Basic regular expression steps

  • Import the module: start with import re
  • Define pattern: use raw string like r'pattern'
  • Compile pattern: create regex object with re.compile(pattern)
  • Apply pattern: run match(), search(), or findall()
  • Extract results: process match objects or lists of matches
  • Test and debug: verify with various input cases of strings
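
The steps above can be sketched in one short script. The pattern, the sample sentence, and the variable names are assumptions chosen for illustration only:

```python
import re  # step 1: import the module

pattern = r'\d{4}'                 # step 2: define the pattern as a raw string
year_regex = re.compile(pattern)   # step 3: compile it into a regex object

text = 'Editions appeared in 1999, 2008, and 2021.'  # hypothetical sample text

first = year_regex.search(text)        # step 4: apply the pattern
all_years = year_regex.findall(text)

# step 5: extract results from the match object or the list of matches
print(first.group())   # 1999
print(all_years)       # ['1999', '2008', '2021']

# step 6: test with other inputs, including one that should not match
no_match = year_regex.search('no digits here')
print(no_match)        # None
```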

How do we create effective patterns? How do we test that patterns work correctly? How do we debug when patterns fail? How do we optimize patterns for performance? How do we reliably use them in programs?

Creating your first regular expression

  • \d matches any digit (0-9)
  • {3} means exactly 3 occurrences
  • The pattern requires hyphens at specific positions
  • Does this work for a wide variety of phone numbers? Well, try it out!
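
The pattern these bullets describe is not shown in this text, so the sketch below assumes a common form, `\d{3}-\d{3}-\d{4}`, and tries it on a few hypothetical phone numbers:

```python
import re

# Assumed pattern: three digits, hyphen, three digits, hyphen, four digits.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

# Hypothetical inputs written in several common styles.
samples = ['555-123-4567', '5551234567', '555.123.4567', '(555) 123-4567']

results = {s: phone_regex.fullmatch(s) is not None for s in samples}
for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

Under this assumption only the hyphenated form matches, which suggests that the answer to the last question is "not yet".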

Testing regular expressions in Python

Pattern matching for email

Revisit the phone regular expression

  • Enhance: can hyphens be optional? Allow dots or spaces as separators?
  • Question: What are the benefits and drawbacks of regular expressions?
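
One possible answer to the enhancement question is sketched below. The pattern is an assumption, not the only reasonable design: it makes each separator optional and accepts a hyphen, dot, or space.

```python
import re

# [-. ]? means: optionally one hyphen, dot, or space between digit groups.
flexible_phone = re.compile(r'\d{3}[-. ]?\d{3}[-. ]?\d{4}')

# Hypothetical inputs paired with the result this sketch should give.
samples = {
    '555-123-4567': True,
    '5551234567': True,
    '555.123.4567': True,
    '555 123 4567': True,
    '555-123.4567': True,     # mixed separators are accepted too
    '(555) 123-4567': False,  # parentheses are still not handled
    '555--123-4567': False,   # at most one separator per position
}

results = {s: flexible_phone.fullmatch(s) is not None for s in samples}
for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

The mixed-separator case shows a typical drawback: a more permissive pattern also accepts inputs you may not want. A backreference such as `\d{3}([-. ]?)\d{3}\1\d{4}` would be one way to require the same separator twice.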

Key Components of a Regular Expression

  • Literal characters: match exact text
  • Metacharacters: special meaning symbols
  • Character classes: sets of characters
  • Quantifiers: specify amount of repetition

Regular expression notation

  • Pattern matching: describe sets of strings concisely
  • Practical extensions:
    • . means any char
    • + means one or more
    • [...] is a character class
    • [a-z] matches any character in range
    • [^abc] matches any character except those listed
    • These are all “syntactic sugar” for convenience
  • Can you write a regex for email addresses?
  • How do you test a regular expression’s correctness?

Understanding regex metacharacters

  • . matches any single character except newline
  • ^ matches start of string
  • $ matches end of string
  • * matches zero or more repetitions
  • + matches one or more repetitions
  • ? matches zero or one repetition
  • {n} matches exactly n repetitions
  • {n,m} matches between n and m repetitions
  • \ escapes special characters
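
Each metacharacter above can be tried with `re.search`. The patterns and sample strings in this sketch are made up for illustration, and the third item in each tuple is the result the bullet list predicts:

```python
import re

checks = [
    (r'c.t',     'cut',      True),   # . matches any single character
    (r'^Doc',    'Document', True),   # ^ anchors to the start of the string
    (r'ment$',   'Document', True),   # $ anchors to the end of the string
    (r'ab*c',    'ac',       True),   # * allows zero b characters
    (r'ab+c',    'ac',       False),  # + needs at least one b
    (r'colou?r', 'color',    True),   # ? makes the u optional
    (r'a{3}',    'aa',       False),  # {3} needs exactly three a characters
    (r'a{2,3}',  'aa',       True),   # {2,3} accepts two or three
    (r'3\.14',   '3x14',     False),  # \. matches only a literal dot
]

outcomes = [re.search(pattern, text) is not None for pattern, text, _ in checks]
for (pattern, text, expected), found in zip(checks, outcomes):
    print(f"{pattern!r} on {text!r}: found={found}, expected={expected}")
```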

Character classes in regex

  • [abc] matches any single character a, b, or c
  • [a-z] matches any lowercase letter
  • [A-Z] matches any uppercase letter
  • [0-9] matches any digit
  • [^abc] matches any character except a, b, or c
  • \d matches any digit (same as [0-9])
  • \w matches word characters (letters, digits, underscore)
  • \s matches whitespace (spaces, tabs, newlines)
  • \D, \W, \S are negations of the above
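
To see these classes in action, here is a small sketch that applies several of them to one hypothetical line of text:

```python
import re

text = 'Rev_2 saved at 09:45\ton disk C'   # \t here is a real tab character

uppercase = re.findall(r'[A-Z]', text)
digits = re.findall(r'\d', text)
words = re.findall(r'\w+', text)
whitespace = re.findall(r'\s', text)
not_lower_or_space = re.findall(r'[^a-z\s]+', text)

print(uppercase)            # ['R', 'C']
print(digits)               # ['2', '0', '9', '4', '5']
print(words)                # ['Rev_2', 'saved', 'at', '09', '45', 'on', 'disk', 'C']
print(len(whitespace))      # 6 (five spaces and one tab)
print(not_lower_or_space)   # ['R', '_2', '09:45', 'C']
```

Note that `\w` treats the underscore as a word character, so `Rev_2` stays together as one match.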

Explore quantifiers like * and +

Further exploration of quantifiers

  • Experiment: try changing the pattern to \d+ or \d* to see how matching behavior changes! What did you discover and learn?
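
The starting pattern for this experiment is not shown in this text, so the sketch below assumes it was `\d{3}` and compares it with `\d+` and `\d*` on a hypothetical sentence:

```python
import re

text = 'Room 12, floor 3, ext 4567'   # hypothetical sample text

exactly_three = re.findall(r'\d{3}', text)
one_or_more = re.findall(r'\d+', text)
zero_or_more = re.findall(r'\d*', text)

print(exactly_three)   # ['456']
print(one_or_more)     # ['12', '3', '4567']

# \d* can match zero digits, so it also "matches" at positions with no digits.
non_empty = [m for m in zero_or_more if m]
print(non_empty)             # ['12', '3', '4567']
print('' in zero_or_more)    # True
```

The empty strings returned for `\d*` are the main surprise: a pattern that can match nothing will succeed at every position.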

Use regular expressions for pattern matching

  • Search: find pattern anywhere in string
  • Match: check if pattern matches at start of string
  • Find all: extract all matches for pattern
  • Replace: substitute matched patterns

Key regex methods in Python

  • re.match(pattern, string): checks if pattern matches at start of string
  • re.search(pattern, string): finds first occurrence of pattern anywhere
  • re.findall(pattern, string): returns list of non-overlapping matches
  • re.finditer(pattern, string): returns iterator of match objects
  • re.sub(pattern, repl, string): replaces matches with new string
  • re.split(pattern, string): splits string by pattern occurrences
  • re.fullmatch(pattern, string): checks if entire string matches pattern (also available as pattern.fullmatch(string) on a compiled pattern)
  • Explore: How can pattern matching aid the implementation of your document engineering project? What are new features that you could add? How would you test them to ensure correctness?
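
The methods in this list that are not demonstrated elsewhere in these slides can be sketched together. The version-number pattern and the sample text are assumptions made for this example:

```python
import re

text = 'v1.2 released; v1.10 planned; v2.0 drafted'   # hypothetical sample text
version = re.compile(r'v(\d+)\.(\d+)')

# finditer yields match objects, so captured groups are available per match.
parts = [m.groups() for m in version.finditer(text)]
print(parts)      # [('1', '2'), ('1', '10'), ('2', '0')]

# sub rewrites each match; \1 and \2 refer to the captured groups.
expanded = version.sub(r'version \1.\2', text)
print(expanded)   # version 1.2 released; version 1.10 planned; version 2.0 drafted

# split cuts the string wherever the separator pattern matches.
pieces = re.split(r';\s*', text)
print(pieces)     # ['v1.2 released', 'v1.10 planned', 'v2.0 drafted']

# fullmatch succeeds only when the whole string fits the pattern.
whole = version.fullmatch('v1.2')
partial = version.fullmatch('v1.2 released')
print(whole is not None, partial is not None)   # True False
```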

Search versus match methods

import re

pattern = r'\d{3}'
text = 'Order number 456 received'

# search finds pattern anywhere
search_result = re.search(pattern, text)
print(f"Search result: {search_result.group() if search_result else 'None'}")

# match only checks start of string
match_result = re.match(pattern, text)
print(f"Match result: {match_result.group() if match_result else 'None'}")

# fullmatch requires entire string to match
full_result = re.fullmatch(pattern, '789')
print(f"Fullmatch result: {full_result.group() if full_result else 'None'}")
Search result: 456
Match result: None
Fullmatch result: 789
  • Search is most flexible for finding patterns
  • Match checks only the start of the string
  • Fullmatch requires exact pattern conformity

Finding all matches with findall

import re

text = 'Contact us at support@example.com or sales@company.org'
email_pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

# find all email addresses
emails = re.findall(email_pattern, text)
print(f"Found {len(emails)} email addresses:")
for email in emails:
    print(f"  - {email}")
Found 2 email addresses:
  - support@example.com
  - sales@company.org
  • \b represents word boundaries
  • + matches one or more repetitions of the preceding character class
  • Character classes [...] define valid characters

Try extracting emails yourself

  • Try modifying: change the text to include your own email examples!

Regular expressions for document engineering

  • Extract metadata: parse dates, versions, IDs
  • Validate input: check format compliance
  • Clean text: remove unwanted patterns
  • Analyze content: find patterns in docs

Extracting dates from documents

import re

document = """
Meeting scheduled for 2024-03-15.
Report due on 03/22/2024.
Conference from 2024-04-10 to 2024-04-12.
"""

# pattern for ISO date format (YYYY-MM-DD)
iso_pattern = r'\d{4}-\d{2}-\d{2}'
iso_dates = re.findall(iso_pattern, document)

# pattern for US date format (MM/DD/YYYY)
us_pattern = r'\d{2}/\d{2}/\d{4}'
us_dates = re.findall(us_pattern, document)

print("ISO format dates:", iso_dates)
print("US format dates:", us_dates)
ISO format dates: ['2024-03-15', '2024-04-10', '2024-04-12']
US format dates: ['03/22/2024']
  • Different date formats require different patterns
  • Prosegrammers extract structured data from unstructured text

Cleaning markdown formatting

import re

markdown_text = """
# Header One
This is **bold** and this is *italic*.
Check out [this link](http://example.com).
"""

# remove markdown bold syntax
no_bold = re.sub(r'\*\*(.+?)\*\*', r'\1', markdown_text)

# remove markdown italic syntax
no_italic = re.sub(r'\*(.+?)\*', r'\1', no_bold)

# remove markdown links, keep text
no_links = re.sub(r'\[(.+?)\]\(.+?\)', r'\1', no_italic)

print("Original:")
print(markdown_text)
print("\nCleaned:")
print(no_links)
Original:

# Header One
This is **bold** and this is *italic*.
Check out [this link](http://example.com).


Cleaned:

# Header One
This is bold and this is italic.
Check out this link.
  • .+? is non-greedy matching (matches as few characters as possible, so it stops at the first closing delimiter)
  • Capture groups (.+?) let us keep matched text
  • \1 in replacement refers to first captured group
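
To see why the non-greedy form matters, this sketch runs a greedy and a non-greedy version of the bold pattern on a hypothetical line that contains two bold spans:

```python
import re

text = 'This is **bold** and **strong** text.'   # hypothetical sample text

# Greedy: .+ runs to the LAST closing ** on the line.
greedy = re.sub(r'\*\*(.+)\*\*', r'\1', text)

# Non-greedy: .+? stops at the FIRST closing ** after each opening **.
lazy = re.sub(r'\*\*(.+?)\*\*', r'\1', text)

print(greedy)   # This is bold** and **strong text.
print(lazy)     # This is bold and strong text.
```

The greedy version treats everything between the first and last markers as one bold span, so the inner markers are left behind.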

Validating document structure

import re

def validate_header_structure(markdown: str) -> bool:
    """Check if markdown has proper header hierarchy."""
    lines = markdown.split('\n')
    header_pattern = r'^(#{1,6})\s+(.+)$'
    previous_level = 0
    for line in lines:
        match = re.match(header_pattern, line)
        if match:
            current_level = len(match.group(1))
            if current_level > previous_level + 1:
                return False
            previous_level = current_level
    return True

good_doc = "# Title\n## Section\n### Subsection"
bad_doc = "# Title\n### Subsection"

print(f"Good structure: {validate_header_structure(good_doc)}")
print(f"Bad structure: {validate_header_structure(bad_doc)}")
Good structure: True
Bad structure: False
  • Document engineering tools verify structure correctness
  • Regex helps enforce formatting conventions
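
A variation on the validator above, offered only as a sketch: with the `re.MULTILINE` flag, `^` and `$` match at every line, so the document does not need to be split first. The helper names `header_levels` and `first_skipped_header` are hypothetical.

```python
import re

# [ \t]+ is used instead of \s+ so the gap after the # marks cannot span a newline.
HEADER = re.compile(r'^(#{1,6})[ \t]+(.+)$', re.MULTILINE)

def header_levels(markdown: str) -> list:
    """Return the level (number of # marks) of every header, in order."""
    return [len(m.group(1)) for m in HEADER.finditer(markdown)]

def first_skipped_header(markdown: str):
    """Return the title of the first header that skips a level, or None."""
    previous = 0
    for m in HEADER.finditer(markdown):
        level = len(m.group(1))
        if level > previous + 1:
            return m.group(2)
        previous = level
    return None

good_doc = "# Title\n## Section\n### Subsection"
bad_doc = "# Title\n### Subsection"

print(header_levels(good_doc), first_skipped_header(good_doc))   # [1, 2, 3] None
print(header_levels(bad_doc), first_skipped_header(bad_doc))     # [1, 3] Subsection
```

Returning the offending header, not just True or False, makes the check more useful as feedback to a writer.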

Testing regular expressions

  • Unit tests: verify pattern correctness
  • Test cases: positive and negative examples
  • Edge cases: empty strings, special chars
  • Frameworks: use unittest or pytest

Testing regex with unittest

import unittest
import re

class TestEmailRegex(unittest.TestCase):
    def setUp(self):
        """Set up test fixtures."""
        self.email_pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
        self.regex = re.compile(self.email_pattern)
    
    def test_valid_email(self):
        """Test that valid emails match."""
        self.assertTrue(self.regex.fullmatch('user@example.com'))
        self.assertTrue(self.regex.fullmatch('test.user@domain.co.uk'))
    
    def test_invalid_email(self):
        """Test that invalid emails do not match."""
        self.assertFalse(self.regex.fullmatch('invalid@'))
        self.assertFalse(self.regex.fullmatch('@example.com'))
        self.assertFalse(self.regex.fullmatch('no-at-sign.com'))

unittest.main(argv=['ignored'], verbosity=2, exit=False)

Testing regex patterns systematically

  • Positive tests: verify pattern matches valid inputs
  • Negative tests: ensure pattern rejects invalid inputs
  • Edge cases: test empty strings, very long inputs, special characters
  • Boundary tests: check minimum and maximum length requirements
  • Real-world data: use actual document samples for testing
  • Test coverage: ensure all pattern components are tested
  • Refactoring: change patterns confidently with comprehensive tests

Well-tested regex patterns make document engineering tools reliable and maintainable!
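
One way to organize positive, negative, and edge cases is a table of inputs and expected results. This is a minimal sketch using `subTest` from `unittest`; the case table and the class name are assumptions made for illustration:

```python
import re
import unittest

ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

# (input, should the whole string match?)
CASES = [
    ('2024-03-15', True),    # positive case
    ('2024-3-15', False),    # negative case: month has one digit
    ('', False),             # edge case: empty string
    ('2024-03-15 ', False),  # edge case: trailing space
    ('9999-99-99', True),    # shape-only check: not a real date, still matches
]

class TestIsoDateShape(unittest.TestCase):
    def test_cases(self):
        for text, expected in CASES:
            # subTest reports each failing input separately.
            with self.subTest(text=text):
                self.assertEqual(ISO_DATE.fullmatch(text) is not None, expected)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIsoDateShape)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

The last case is a reminder that this pattern checks the shape of a date, not whether the date exists on a calendar.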

Testing date extraction function

import unittest
import re

def extract_dates(text: str) -> list:
    """Extract all ISO format dates from text."""
    pattern = r'\d{4}-\d{2}-\d{2}'
    return re.findall(pattern, text)

class TestDateExtraction(unittest.TestCase):
    def test_single_date(self):
        """Test extraction of single date."""
        result = extract_dates("Meeting on 2024-03-15")
        self.assertEqual(result, ['2024-03-15'])
    
    def test_multiple_dates(self):
        """Test extraction of multiple dates."""
        text = "From 2024-01-01 to 2024-12-31"
        result = extract_dates(text)
        self.assertEqual(len(result), 2)
    
    def test_no_dates(self):
        """Test text with no dates."""
        result = extract_dates("No dates here")
        self.assertEqual(result, [])

unittest.main(argv=['ignored'], verbosity=2, exit=False)

Practical regex testing strategies

  • Start simple: test basic cases before complex ones
  • Use online tools: regex101.com for pattern debugging
  • Document patterns: add comments explaining regex logic
  • Version patterns: track changes to regex as requirements evolve
  • Benchmark performance: test speed with large documents
  • Handle errors: use try-except blocks for malformed input
  • Share test data: maintain test document collection for validation

Testing regex is essential because patterns can be complex and subtle bugs can hide in edge cases!
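
As one reading of the "handle errors" strategy, the sketch below catches `re.error`, which Python raises when the pattern itself is malformed. The helper name `safe_compile` is hypothetical.

```python
import re

def safe_compile(pattern: str):
    """Return a compiled regex, or None when the pattern is malformed."""
    try:
        return re.compile(pattern)
    except re.error as err:
        print(f"Invalid pattern {pattern!r}: {err}")
        return None

good = safe_compile(r'\d{4}-\d{2}-\d{2}')
bad = safe_compile(r'[0-9')   # the character class is never closed

print(good is not None)   # True
print(bad is None)        # True
```

This matters most when patterns come from users or configuration files and not from the program's own source code.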

Benefits and limitations of regular expressions

  • Benefits:
    • Powerful pattern matching in compact syntax
    • Built-in support across programming languages
    • Fast for many text processing tasks
    • Great for validation and extraction
  • Limitations:
    • Can become complex and hard to read
    • Not suitable for parsing nested structures
    • Performance issues with catastrophic backtracking
    • Learning curve for advanced features
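
The nested-structure limitation can be seen with parentheses. In this sketch, using made-up input strings, neither a non-greedy nor a greedy pattern tracks nesting depth correctly:

```python
import re

nested = 'f(a, g(b), c)'   # hypothetical input with nested parentheses
flat = 'f(a) + g(b)'       # hypothetical input with two separate groups

# Non-greedy stops at the first ')' and cuts the nested call short.
lazy_nested = re.search(r'\(.*?\)', nested).group()

# Greedy runs to the last ')' and merges two separate groups.
greedy_flat = re.search(r'\(.*\)', flat).group()

print(lazy_nested)   # (a, g(b)
print(greedy_flat)   # (a) + g(b)
```

Matching balanced delimiters requires counting depth, which is a job for a parser and not for a single pattern in Python's `re` module.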

When to use regex for documents

  • Good use cases:
    • Extracting emails, URLs, dates from text
    • Validating input formats (phone numbers, IDs)
    • Simple text cleaning and normalization
    • Finding keywords or patterns in documents
    • Basic markdown or syntax highlighting
  • Consider alternatives for:
    • Parsing HTML or XML (use BeautifulSoup, lxml)
    • Complex nested structures (use parsers)
    • Full language parsing (use AST tools)
    • Large-scale text analysis (use NLP libraries)

Choose the right tool for the job! Regex is powerful but not always the best solution.

Regex best practices for prosegrammers

  • Use raw strings r'' to avoid escape character confusion
  • Compile patterns once and reuse for better performance
  • Add comments to explain complex patterns
  • Test patterns thoroughly with diverse inputs
  • Use named groups for readability: (?P<name>...)
  • Keep patterns simple; split complex logic into multiple patterns
  • Use online tools like regex101 for development and testing
  • Document assumptions about input format
  • Handle edge cases gracefully in production code

Master regex fundamentals to become an effective prosegrammer who can process and analyze documents efficiently!
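
Several of these practices can be combined in one small sketch: a pattern compiled once, written with `re.VERBOSE` so it can carry comments, and using named groups. The constant name and the sample sentence are assumptions.

```python
import re

# Compiled once at module level and reused for every call.
ISO_DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,        # ignore whitespace and allow comments in the pattern
)

match = ISO_DATE.search('Report due on 2024-03-22.')   # hypothetical sample text
fields = match.groupdict()

print(match.group('year'))   # 2024
print(fields)                # {'year': '2024', 'month': '03', 'day': '22'}
```

Named groups let later code ask for `match.group('month')` without counting parentheses in the pattern.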

Find an open-source Python project that contains a regular expression!

  • What did you find? How does it work?
  • What are the benefits and limitations?
  • Share the link and a code segment
  • Here is an example from GatorGrader:
MULTILINECOMMENT_RE_JAVA = r"""/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/"""
SINGLELINECOMMENT_RE_JAVA = r"""^(?:[^"/\\]|\"(?:[^\"\\]|\\.)*\"|/(?:[^/"\\]|\\.)|/\"(?:[^\"\\]|\\.)*\"|\\.)*//(.*)$"""
SINGLELINECOMMENT_RE_PYTHON = r"""^(?:[^"#\\]|\"(?:[^\"\\]|\\.)*\"|/(?:[^#"\\]|\\.)|/\"(?:[^\"\\]|\\.)*\"|\\.)*#(.*)$"""
MULTILINECOMMENT_RE_PYTHON = r'^[ \t]*"""(.*?)"""[ \t]*$'
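
To explore how one of these patterns behaves, the sketch below applies the Java multi-line comment pattern to a made-up source fragment. How GatorGrader itself calls the pattern, including any flags it passes, is not shown here, so the call to `re.finditer` is an assumption.

```python
import re

# Copied from the GatorGrader example above.
MULTILINECOMMENT_RE_JAVA = r"""/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/"""

# Hypothetical Java fragment with one single-line and one two-line block comment.
java_source = 'int x = 1; /* first comment */\nint y = 2; /* spans\n two lines */\n'

# group(0) is the full text of each match, including the comment delimiters.
comments = [m.group(0) for m in re.finditer(MULTILINECOMMENT_RE_JAVA, java_source)]

for comment in comments:
    print(repr(comment))
print(len(comments))   # 2
```

Using `finditer` with `group(0)` avoids the tuples that `findall` would return for a pattern with several capture groups.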

Course goals reminder

  • Document Creation:
    • Design and implement document generation workflows
    • Test all aspects of documents to ensure quality and accuracy
    • Create frameworks for automated document production
  • Document Analysis:
    • Collect and analyze data about document usage and quality
    • Visualize insights to improve documentation strategies
  • Document Processing:
    • Use regex for pattern matching and text extraction
    • Build validation tools for document structure
    • Clean and normalize document content programmatically
  • Check the syllabus for details about the Document Engineering course!