Regular Expressions for Document Engineering

Gregory M. Kapfhammer

November 10, 2025

Course learning objectives

Learning Objectives for Document Engineering

CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents and industry-standard documentation tools.
CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.

This week’s content aids attainment of CS-104-2, CS-104-3, and CS-104-4!

Creating cool regular expressions in Python

Define a regular expression pattern
Use the re module for regex operations
Compile patterns with re.compile()
Raw strings (r'') prevent escape confusion

Basic regular expression steps

Import the module: start with import re
Define pattern: use raw string like r'pattern'
Compile pattern: create regex object with re.compile(pattern)
Apply pattern: run match(), search(), or findall()
Extract results: process match objects or lists of matches
Test and debug: verify the pattern against a variety of input strings
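The six steps above can be sketched in one short script. This is a minimal illustration, not code from the course materials: the four-digit-year pattern, the sample text, and the variable names are all assumptions chosen for the example.

```python
import re  # step 1: import the module

# step 2: define the pattern as a raw string (assumed example: a four-digit year)
year_pattern = r'\d{4}'

# step 3: compile the pattern into a reusable regex object
year_regex = re.compile(year_pattern)

# step 4: apply the pattern to some (hypothetical) text
text = 'Drafted in 2023 and revised in 2025'
first_match = year_regex.search(text)
all_years = year_regex.findall(text)

# step 5: extract results from the match object and from the list
if first_match:
    print(f"First year: {first_match.group()} at position {first_match.start()}")
print(f"All years: {all_years}")

# step 6: test with varied inputs, including one that should not match
no_match = year_regex.search('no digits here')
print(f"Match in text without digits: {no_match}")
```

Note that search() returns a match object (or None), while findall() returns a plain list of strings, so the extraction step looks different for each.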

How do we create effective patterns? … How do we test that patterns work correctly? … How do we debug when patterns fail? … How do we optimize patterns for performance? … How do we reliably use them in programs?

Creating your first regular expression

\d matches any digit (0-9)
{3} means exactly 3 occurrences
The pattern requires hyphens at specific positions
Does this work for a wide variety of phone numbers? Well, try it out!
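The phone pattern that these notes describe is not reproduced on this slide. A plausible reconstruction, assuming a US-style number with three digits, a hyphen, three digits, a hyphen, and four digits, might look like the following; the sample inputs are hypothetical.

```python
import re

# assumed pattern: 3 digits, hyphen, 3 digits, hyphen, 4 digits
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phone_regex = re.compile(phone_pattern)

# hypothetical sample inputs in several common formats
candidates = ['555-123-4567', '555.123.4567', '(555) 123-4567', '5551234567']

# fullmatch() requires the whole string to fit the pattern
results = {candidate: bool(phone_regex.fullmatch(candidate)) for candidate in candidates}
for candidate, matched in results.items():
    print(f"{candidate!r}: {'match' if matched else 'no match'}")
```

Only the hyphenated form is accepted, which answers the question above: this pattern is strict about separators.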

Testing regular expressions in Python

Simple pattern matching for email
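The code for this slide is not included here. As a stand-in, here is a deliberately simple email pattern, assumed for illustration only; the sample sentences are hypothetical, and the pattern is far too loose for real validation.

```python
import re

# assumed simple pattern: word characters, @, word characters, dot, word characters
simple_email_pattern = r'\w+@\w+\.\w+'

text = 'Send questions to student@college.edu before Friday'
match = re.search(simple_email_pattern, text)
found = match.group() if match else None
print(f"Found: {found}")

# limitation: \w does not include dots, so part of the address is lost
dotted = re.search(simple_email_pattern, 'Write to first.last@college.edu today')
truncated = dotted.group() if dotted else None
print(f"Found: {truncated}")
```

The second search drops the "first." prefix because \w covers only letters, digits, and underscores. That gap is one reason to move toward a character class such as [a-zA-Z0-9._%+-].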

Revisit the phone regular expression

Enhance: can hyphens be optional? Allow dots or spaces as separators?
Question: What are the benefits and drawbacks of regular expressions?
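One possible answer to the enhancement question is sketched below. It assumes the starting point is the hyphen-only pattern \d{3}-\d{3}-\d{4}; the variable names and sample inputs are hypothetical.

```python
import re

# [-.\s]? allows one optional separator: hyphen, dot, or whitespace
flexible_phone = re.compile(r'\d{3}[-.\s]?\d{3}[-.\s]?\d{4}')

# a backreference (\1) forces the second separator to equal the first
consistent_phone = re.compile(r'\d{3}([-.\s]?)\d{3}\1\d{4}')

samples = ['555-123-4567', '555.123.4567', '555 123 4567',
           '5551234567', '555-123.4567', '555-1234']

flexible_ok = [s for s in samples if flexible_phone.fullmatch(s)]
consistent_ok = [s for s in samples if consistent_phone.fullmatch(s)]

print(f"Flexible accepts:   {flexible_ok}")
print(f"Consistent accepts: {consistent_ok}")
```

The flexible version also accepts mixed separators such as 555-123.4567, while the backreference version rejects them. Which behavior is right depends on the documents being processed, a small example of the trade-offs raised in the question above.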

Key components of a regular expression

Literal characters: match exact text
Metacharacters: special meaning symbols
Character classes: sets of characters
Quantifiers: specify amount of repetition

Regular expression notation

Pattern matching: describe sets of strings concisely
Practical extensions of “basic” regular expressions:
- . means any char
- + means one or more
- [...] is a character class
- [a-z] matches any character in range
- [^abc] matches any character except those listed
- These are all “syntactic sugar” for convenience
Can you write an improved regex for email addresses?
How do you test a regular expression’s correctness?

Understanding regex metacharacters

. matches any single character except newline
^ matches start of string
$ matches end of string
* matches zero or more repetitions
+ matches one or more repetitions
? matches zero or one repetition
{n} matches exactly n repetitions
{n,m} matches between n and m repetitions
\ escapes special characters like \. to match a literal dot
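A quick way to internalize the metacharacters listed above is to try each one on a tiny input. The sketch below uses hypothetical sample strings and variable names.

```python
import re

filename = 'report.v2.md'  # hypothetical file name

# ^ and $ anchor a pattern to the start and end of the string
starts_with_report = re.search(r'^report', filename) is not None
ends_with_md = re.search(r'\.md$', filename) is not None

# an unescaped . matches any character, so the first pattern is too permissive
loose = re.search(r'report.v2', 'reportXv2') is not None
strict = re.search(r'report\.v2', 'reportXv2') is not None

# * allows zero repetitions, + requires at least one
star_ok = re.fullmatch(r'ab*', 'a') is not None
plus_ok = re.fullmatch(r'ab+', 'a') is not None

# ? makes the preceding character optional
spellings = [w for w in ['color', 'colour', 'colouur'] if re.fullmatch(r'colou?r', w)]

# {n,m} bounds the number of repetitions
bounded = [d for d in ['1', '12', '1234', '12345'] if re.fullmatch(r'\d{2,4}', d)]

print(starts_with_report, ends_with_md, loose, strict, star_ok, plus_ok)
print(spellings, bounded)
```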

Character classes in regex

[abc] matches any single character a, b, or c
[a-z] matches any lowercase letter
[A-Z] matches any uppercase letter
[0-9] matches any digit
[^abc] matches any character except a, b, or c
\d matches any digit (the same as [0-9] for ASCII text; in Python 3 string patterns it also matches other Unicode digits unless re.ASCII is set)
\w matches word characters (i.e., letters, digits, and underscore)
\s matches whitespace (i.e., spaces, tabs, and newlines)
\D, \W, \S are negations of the above three classes
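The character classes above can be compared side by side on one sample string. This sketch uses hypothetical text; the last two lines illustrate that, for Python 3 string patterns, \d is wider than [0-9] unless the re.ASCII flag is given.

```python
import re

text = 'Draft_v2 due 2025-11-10\tRoom B7'  # hypothetical sample text

uppercase = re.findall(r'[A-Z]', text)   # single uppercase letters
numbers = re.findall(r'\d+', text)       # runs of digits
words = re.findall(r'\w+', text)         # runs of letters, digits, underscore
fields = re.split(r'\s+', text)          # split on spaces and tabs

# a negated class: anything that is not a vowel, whitespace, or a digit
consonants = re.findall(r'[^aeiou\s\d]+', 'due 10')

# \d matches Unicode digits by default; re.ASCII restricts it to 0-9
arabic_three = '\u0663'
unicode_digit = re.fullmatch(r'\d', arabic_three) is not None
ascii_digit = re.fullmatch(r'\d', arabic_three, flags=re.ASCII) is not None

print(uppercase, numbers, words, fields, consonants, unicode_digit, ascii_digit)
```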

Explore quantifiers like `*` and `+`

Further exploration of quantifiers

Experiment: try changing the pattern to \d+ or \d* to see how matching behavior changes! What did you discover and learn?
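The starting pattern for this experiment is not shown on the slide. The sketch below assumes it was something like \d{3} and uses a hypothetical sentence to compare the three quantifiers.

```python
import re

text = 'Order 42 shipped 1234 items on day 7'  # hypothetical sample text

exactly_three = re.findall(r'\d{3}', text)  # only runs containing three digits
one_or_more = re.findall(r'\d+', text)      # every complete run of digits
zero_or_more = re.findall(r'\d*', text)     # also "matches" between characters

# \d* succeeds with an empty match wherever no digit is present
non_empty = [m for m in zero_or_more if m]
empty_count = len(zero_or_more) - len(non_empty)

print(f"\\d{{3}} -> {exactly_three}")
print(f"\\d+   -> {one_or_more}")
print(f"\\d*   -> {len(zero_or_more)} results, {empty_count} of them empty")
```

The main discovery is that * can match nothing at all, so \d* produces many empty strings, while + always consumes at least one digit.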

Use regular expressions for pattern matching

Search: find pattern anywhere in string
Match: check if pattern matches at the start of the string
Find all: extract all matches for pattern
Replace: substitute matched patterns

Key regex methods in Python

re.match(pattern, string): checks if pattern matches at start of string
re.search(pattern, string): finds first occurrence of pattern anywhere
re.findall(pattern, string): returns list of non-overlapping matches
re.finditer(pattern, string): returns iterator of match objects
re.sub(pattern, repl, string): replaces matches with new string
re.split(pattern, string): splits string by pattern occurrences
re.fullmatch(pattern, string): checks if entire string matches pattern
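Each of the functions listed above can be exercised on one string. The following sketch assumes a hypothetical release-notes line and an ISO-style date pattern.

```python
import re

text = 'v1.2 released 2024-03-15; v1.3 planned 2024-06-01'  # hypothetical text
date_pattern = r'\d{4}-\d{2}-\d{2}'

at_start = re.match(date_pattern, text)                 # None: text starts with 'v1.2'
first = re.search(date_pattern, text)                   # first date anywhere
all_dates = re.findall(date_pattern, text)              # list of strings
starts = [m.start() for m in re.finditer(date_pattern, text)]  # match positions
redacted = re.sub(date_pattern, '<DATE>', text)         # replace every date
parts = re.split(r';\s*', text)                         # split on semicolons
whole = re.fullmatch(date_pattern, '2024-03-15')        # entire string is a date

print(at_start, first.group(), all_dates, starts)
print(redacted)
print(parts, whole is not None)
```

findall() is convenient when only the matched text matters, while finditer() is the better choice when positions or groups are needed.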

Explore: How can pattern matching aid the implementation of your document engineering project? What are new features that you could add? How would you test them to ensure system correctness?

Search versus match methods

Search is most flexible for finding patterns
Match checks only at the start of the string
Fullmatch requires the entire string to match the pattern
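The difference between the three methods is easiest to see by running the same pattern against inputs where the digits appear in different places. The inputs and the results table below are hypothetical illustrations.

```python
import re

digits = re.compile(r'\d+')

# record (search, match, fullmatch) outcomes for each hypothetical input
outcomes = {}
for candidate in ['123', '123abc', 'abc123']:
    outcomes[candidate] = (
        digits.search(candidate) is not None,     # digits anywhere?
        digits.match(candidate) is not None,      # digits at the start?
        digits.fullmatch(candidate) is not None,  # digits and nothing else?
    )

for candidate, (found, at_start, entire) in outcomes.items():
    print(f"{candidate!r}: search={found} match={at_start} fullmatch={entire}")
```

For validation tasks, fullmatch() is usually the safest choice because it rejects trailing text that search() and match() would silently ignore.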

Finding all matches with `findall`

import re

text = 'Contact us at support@example.com or sales@company.org'
email_pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

# find all email addresses
emails = re.findall(email_pattern, text)
print(f"Found {len(emails)} email addresses:")
for email in emails:
    print(f"  - {email}")

Found 2 email addresses:
  - support@example.com
  - sales@company.org

\b represents word boundaries
+ matches one or more repetitions of the preceding element
Character classes [...] define valid characters

Try (again) to extract emails!

Modification: change the text to include your own suitable emails
Exploration: identify some email addresses that are not detected
Extension: try to make pattern detection for emails more robust

Regular expressions for document engineering

Extract metadata: parse dates, versions, identifiers
Validate input: confirm format compliance
Clean text: remove unwanted patterns
Analyze content: find patterns in documents
Confirm correctness: write tests that build confidence in each pattern
Ensure understanding: be able to explain what every part of the regex does

Extracting dates from documents

Prosegrammers extract structured data from unstructured text
Different date formats (e.g., ISO versus US) require different patterns
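The code for this slide is not included here. One way to handle both formats, sketched under the assumption that ISO means YYYY-MM-DD and US means M/D/YYYY, is to write one pattern per format and use named groups; the pattern names and sample text are hypothetical.

```python
import re

# ISO format: four-digit year, two-digit month, two-digit day
iso_date = re.compile(r'\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b')

# US format: month/day/year, with one- or two-digit month and day
us_date = re.compile(r'\b(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})\b')

text = 'Submitted 2025-11-10, reviewed 11/12/2025, published 2025-12-01.'

iso_found = [m.group() for m in iso_date.finditer(text)]
us_found = [m.groupdict() for m in us_date.finditer(text)]

print(f"ISO dates: {iso_found}")
print(f"US dates:  {us_found}")

# the patterns check shape only, so an impossible date still matches
shape_only = iso_date.search('Due 2025-99-99') is not None
print(f"Impossible date matched: {shape_only}")
```

Because a regex checks only the shape of a date, a follow-up step such as datetime.date.fromisoformat() is a reasonable way to reject impossible values.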

Cleaning markdown formatting
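The code for this slide is not included here. As a rough sketch, re.sub() can strip a few common inline Markdown markers; the strip_markdown helper and the sample text are hypothetical, and the patterns cover only simple, non-nested cases.

```python
import re

def strip_markdown(text: str) -> str:
    """Remove a few common Markdown markers (hypothetical helper)."""
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)  # [label](url) -> label
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)          # **bold** -> bold
    text = re.sub(r'\*(.+?)\*', r'\1', text)              # *italic* -> italic
    text = re.sub(r'`([^`]+)`', r'\1', text)              # `code` -> code
    # heading markers at the start of any line
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)
    return text

sample = ('# Title\n'
          'Read the **bold** and *italic* text in `code` '
          'at [the site](https://example.com).')
cleaned = strip_markdown(sample)
print(cleaned)
```

The order of substitutions matters: bold is handled before italic so that ** is not misread as two italic markers. For nested or unusual Markdown, a real Markdown parser is the more reliable tool.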

Validating document structure

Document engineering tools verify structure correctness
Regex helps enforce formatting conventions
Make sure that your patterns are tested and reliable
Aim to avoid false positives and false negatives
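As one illustration of structure checking, the sketch below flags Markdown heading lines that are malformed. The convention it enforces (one to six # characters, a single space, then text) and the find_malformed_headings helper are assumptions for this example.

```python
import re

# well-formed heading: one to six hashes, one space, then visible text
HEADING = re.compile(r'^#{1,6} \S.*$')
# any line that begins with a hash is treated as an attempted heading
HEADING_LIKE = re.compile(r'^#+')

def find_malformed_headings(document: str) -> list:
    """Return 1-based line numbers of heading-like lines that break the rule."""
    bad_lines = []
    for number, line in enumerate(document.splitlines(), start=1):
        if HEADING_LIKE.match(line) and not HEADING.match(line):
            bad_lines.append(number)
    return bad_lines

doc = ('# Title\n\n##Missing space\n\nBody text with a # inside\n\n'
       '### Fine heading\n####### Too deep')
problems = find_malformed_headings(doc)
print(f"Malformed headings on lines: {problems}")
```

This check would also flag lines that start with # inside a fenced code block (such as Python comments), a false positive that a more careful tool would need to rule out.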

How can we test regular expressions?

Unit tests: verify pattern correctness
Test cases: positive and negative examples
Edge cases: empty strings or special characters
Frameworks: use unittest or pytest

Testing regex with `unittest`

import unittest
import re

class TestEmailRegex(unittest.TestCase):
    def setUp(self):
        """Set up test fixtures."""
        self.email_pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
        self.regex = re.compile(self.email_pattern)
    
    def test_valid_email(self):
        """Test that valid emails match."""
        self.assertTrue(self.regex.fullmatch('user@example.com'))
        self.assertTrue(self.regex.fullmatch('test.user@domain.co.uk'))
    
    def test_invalid_email(self):
        """Test that invalid emails do not match."""
        self.assertFalse(self.regex.fullmatch('invalid@'))
        self.assertFalse(self.regex.fullmatch('@example.com'))
        self.assertFalse(self.regex.fullmatch('no-at-sign.com'))

unittest.main(argv=['ignored'], verbosity=2, exit=False)


Testing regex patterns systematically

Positive tests: verify pattern matches valid inputs
Negative tests: ensure pattern rejects invalid inputs
Edge cases: test empty strings, very long inputs, special characters
Boundary tests: check minimum and maximum length requirements
Real-world data: use actual document samples for testing
Test coverage: ensure all pattern components are tested
Refactoring: change patterns confidently with comprehensive tests

Well-tested regex patterns make document engineering tools reliable and maintainable! Please test all methods that use regular expressions!

Testing date extraction function

import unittest
import re

def extract_dates(text: str) -> list:
    """Extract all ISO format dates from text."""
    pattern = r'\d{4}-\d{2}-\d{2}'
    return re.findall(pattern, text)

class TestDateExtraction(unittest.TestCase):
    def test_single_date(self):
        """Test extraction of single date."""
        result = extract_dates("Meeting on 2024-03-15")
        self.assertEqual(result, ['2024-03-15'])
    
    def test_multiple_dates(self):
        """Test extraction of multiple dates."""
        text = "From 2024-01-01 to 2024-12-31"
        result = extract_dates(text)
        self.assertEqual(len(result), 2)
    
    def test_no_dates(self):
        """Test text with no dates."""
        result = extract_dates("No dates here")
        self.assertEqual(result, [])

unittest.main(argv=['ignored'], verbosity=2, exit=False)


Practical regex testing strategies

Start simple: test basic cases before complex ones
Use online tools: regex101.com for pattern debugging
Document patterns: add comments explaining regex logic
Version patterns: track changes to regex as requirements evolve
Benchmark performance: test speed with large documents
Handle errors: catch re.error for malformed patterns and check for None when nothing matches
Share test data: maintain test document collection for validation
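Two kinds of failure come up when following the strategies above: a pattern that does not compile, and a pattern that compiles but finds nothing. The sketch below shows one way to handle each; the safe_compile and first_year helpers are hypothetical.

```python
import re

def safe_compile(pattern: str):
    """Return a compiled regex, or None when the pattern itself is malformed."""
    try:
        return re.compile(pattern)
    except re.error as error:
        print(f"Invalid pattern {pattern!r}: {error}")
        return None

def first_year(text: str, regex) -> str:
    """Return the first match, or an empty string when nothing matches."""
    match = regex.search(text)
    # search() returns None on failure, so check before calling group()
    return match.group() if match else ''

good = safe_compile(r'\d{4}')
bad = safe_compile(r'(\d{4}')  # unbalanced parenthesis raises re.error

found = first_year('Published in 2025', good)
missing = first_year('No year given', good)
print(repr(found), repr(missing), bad)
```

Calling group() on a None result is one of the most common regex bugs in Python, so the explicit check in first_year is worth the extra line.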

Testing regex is essential because patterns can be complex and hard to understand — subtle bugs can hide in tricky edge cases!

Benefits and limitations of regular expressions

Benefits:
- Powerful pattern matching in compact syntax
- Built-in support across programming languages
- Fast for many text processing tasks
- Great for validation and extraction
Limitations:
- Can become complex and hard to read
- Not suitable for parsing nested structures
- Performance issues with catastrophic backtracking
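Catastrophic backtracking is easier to appreciate with a small timing experiment. The sketch below uses a classic nested-quantifier pattern on a short input that almost matches; the patterns, the input length of 18, and the timed_match helper are assumptions chosen to keep the run short, and the exact timings will vary by machine and Python version.

```python
import re
import time

risky = re.compile(r'(a+)+$')  # nested quantifiers: many ways to split a run of a's
safer = re.compile(r'a+$')     # accepts the same strings without nested repetition

subject = 'a' * 18 + 'b'       # almost matches, then fails on the final character

def timed_match(regex, text):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = regex.match(text)
    return result, time.perf_counter() - start

risky_result, risky_seconds = timed_match(risky, subject)
safer_result, safer_seconds = timed_match(safer, subject)

print(f"risky: {risky_result} in {risky_seconds:.4f}s")
print(f"safer: {safer_result} in {safer_seconds:.6f}s")
```

Both patterns correctly report no match, but the risky one typically takes far longer, and its running time tends to roughly double with each additional "a". Avoid increasing the input length by much when experimenting.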

When to use regex for documents

Good use cases:
- Extracting emails, URLs, dates from text
- Validating input formats (e.g., simple phone numbers or IDs)
- Simple text cleaning and normalization
- Finding keywords or patterns in documents
- Basic markdown or syntax highlighting
Consider alternatives for:
- Parsing HTML or XML (use BeautifulSoup or lxml)
- Complex nested structures (use a dedicated parser, such as a Markdown parser)
- Full language parsing (use AST-based tools)
- Large-scale text analysis (use NLP libraries)

Regex best practices

Use raw strings with r'' to avoid escape character confusion
Compile patterns once and reuse for better performance
Add comments to explain complex patterns
Test patterns thoroughly with diverse inputs
Use named groups for readability: (?P<name>...)
Keep patterns simple; split complex logic into multiple patterns
Use online tools like regex101 for development and testing
Document assumptions about input format
Handle edge cases gracefully in production code
Again, write tests for all functions using regex!
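Several of the practices above (raw strings, compiling once, comments, and named groups) can be combined in a single pattern by using the re.VERBOSE flag. The version-number format and the parse_version helper below are hypothetical examples.

```python
import re

# compiled once at module level and reused by every call
VERSION = re.compile(r"""
    ^v?                  # optional leading v
    (?P<major>\d+)\.     # major number followed by a literal dot
    (?P<minor>\d+)\.     # minor number followed by a literal dot
    (?P<patch>\d+)$      # patch number at the end of the string
""", re.VERBOSE)

def parse_version(label: str):
    """Return (major, minor, patch) as integers, or None if the label is malformed."""
    match = VERSION.match(label)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ('major', 'minor', 'patch'))

parsed = parse_version('v2.10.3')
rejected = parse_version('2.10')
print(parsed, rejected)
```

With re.VERBOSE, whitespace and # comments inside the pattern are ignored, so each piece of the regex can be explained where it appears.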

Find an open-source Python project that contains a regular expression!

What did you find? How does it work?
What are the benefits and limitations?
Share the link and a code segment
Here is an example from GatorGrader:

MULTILINECOMMENT_RE_JAVA = r"""/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/"""
SINGLELINECOMMENT_RE_JAVA = r"""^(?:[^"/\\]|\"(?:[^\"\\]|\\.)*\"|/(?:[^/"\\]|\\.)|/\"(?:[^\"\\]|\\.)*\"|\\.)*//(.*)$"""
SINGLELINECOMMENT_RE_PYTHON = r"""^(?:[^"#\\]|\"(?:[^\"\\]|\\.)*\"|/(?:[^#"\\]|\\.)|/\"(?:[^\"\\]|\\.)*\"|\\.)*#(.*)$"""
MULTILINECOMMENT_RE_PYTHON = r'^[ \t]*"""(.*?)"""[ \t]*$'

Course goals reminder

Document Creation:
- Design and implement document generation workflows
- Test all aspects of documents to ensure quality and accuracy
- Create frameworks for automated document production
Document Analysis:
- Collect and analyze data about document usage and quality
- Visualize insights to improve documentation strategies
Document Processing:
- Use regex for pattern matching and text extraction
- Build validation tools for document structure
- Clean and normalize document content programmatically
Check the syllabus for details about the Document Engineering course!

Regular expressions aid prosegramming

Next steps with regular expression techniques:
- Find locations in your tool where regex could add value:
  - Could pattern matching help validate input formats?
  - Would text extraction improve document processing?
  - Could regex speed up an automated content analysis?
- Combine multiple patterns for powerful document tools
- How would regex make your document tools more intelligent?

Key takeaways for prosegrammers

Understand regular expression basics
- Use raw strings (r'') and re.compile() for clarity and reuse
- Know metacharacters, character classes, and quantifiers
Apply regular expressions thoughtfully
- Use search, match, findall, and sub appropriately
- Prefer specialized parsers for complex formats (e.g., HTML or Markdown)
Test and validate patterns
- Write unit tests with positive, negative, and edge cases
- Benchmark patterns to avoid “catastrophic backtracking”
Practical prosegrammer tips
- Document and comment complex patterns and use named groups
- Compile once and reuse patterns for performance
- Use real-world sample data when testing