Natural Language Processing for Document Engineering

Gregory M. Kapfhammer

November 17, 2025

NLP for prosegrammers

  • What is NLP for document engineering?
    • Breaking text into meaningful units
    • Extracting key information from documents
    • Analyzing language patterns
    • Building tools to understand written content
  • What are this week’s highlights?
    • A “from scratch” implementation of several key NLP techniques:
      • Tokenization and segmentation
      • Stemming and lemmatization
      • Keyword extraction and frequency analysis
      • Keyword-in-context (KWIC) tools

Key insights for prosegrammers

  • Natural language processing transforms raw text into structured data that programs can more easily analyze
  • Although great packages exist, simple algorithms handle common NLP tasks without external libraries
  • Understanding basic NLP helps prosegrammers build their own powerful document analysis tools

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents using industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • This week’s content aids in the attainment of learning objective CS-104-3!

Tokenization means breaking text into words

  • Tokenization splits text into individual units called tokens
    • Words, numbers, punctuation marks
    • Foundation for all text analysis tasks
    • Simple approach uses “blank space splitting”
  • Why tokenize documents?
    • Count words and analyze vocabulary
    • Search for specific terms
    • Build word frequency distributions
    • Prepare text for further NLP processing

Basic word tokenization

  • Split text on blank spaces to create a list of tokens
  • A simple but effective way to start basic text analysis
  • Key question: What happens to punctuation in this approach?
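
A minimal sketch of blank-space splitting; the tokenize name and the sample sentence are illustrative, not from the slides:

```python
def tokenize(text):
    """Split text into tokens on blank spaces."""
    return text.split()

print(tokenize("Prosegrammers analyze documents, one token at a time."))
# ['Prosegrammers', 'analyze', 'documents,', 'one', 'token', 'at', 'a', 'time.']
```

Note that punctuation stays glued to tokens like "documents," and "time.", which is exactly the question the last bullet raises.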

Tokenization with punctuation

  • Use regular expressions to extract only word characters
  • Convert to lowercase for consistent comparison
  • Removes punctuation and normalizes text
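
A possible implementation with Python's re module; the function name tokenize_words is an assumption:

```python
import re

def tokenize_words(text):
    """Extract lowercase word tokens, discarding punctuation."""
    return re.findall(r"\w+", text.lower())

print(tokenize_words("Prosegrammers analyze documents, one token at a time."))
# ['prosegrammers', 'analyze', 'documents', 'one', 'token', 'at', 'a', 'time']
```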

Segmentation: dividing text into sentences

  • Segmentation splits text into sentences or paragraphs
    • Identify sentence boundaries using punctuation
    • Handle abbreviations and special cases
    • Essential for readability analysis
  • Why segment documents?
    • Calculate sentences per paragraph
    • Analyze sentence complexity
    • Extract specific paragraphs or sections
    • Prepare text for summarization

Basic sentence segmentation

  • Use regular expressions to complete the task!
  • Split on sentence-ending punctuation marks
  • Remove empty strings and extra blank spaces
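
One way to sketch this with re.split; the [.!?] character class is a simplification that ignores abbreviations such as "Dr.":

```python
import re

def segment_sentences(text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    parts = re.split(r"[.!?]+", text)
    return [part.strip() for part in parts if part.strip()]

print(segment_sentences("Tokenize first. Then segment the text! Ready?"))
# ['Tokenize first', 'Then segment the text', 'Ready']
```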

Paragraph segmentation

  • Use double newlines as paragraph boundaries
  • However, not all documents adopt this common convention!
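
A sketch that treats one or more blank lines as a paragraph break; the exact regular expression is an assumption:

```python
import re

def segment_paragraphs(text):
    """Split text wherever one or more blank lines appear."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]

document = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
print(len(segment_paragraphs(document)))  # 3
```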

Building stemming and lemmatization tools

  • Stemming reduces words to their root form by removing suffixes
    • Example: “running” becomes “run”, “happily” becomes “happi”
    • Fast but may produce non-words
    • Yet, useful for search and matching
  • Lemmatization reduces words to their dictionary form
    • Example: “running” becomes “run”, “better” becomes “good”
    • Although more accurate, it requires linguistic knowledge
    • This week, we’ll implement simple rule-based stemming!

Simple suffix-based stemmer

  • Attempt to remove common English suffixes according to a priority order
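
A rule-based sketch; the suffix list, its priority order, and the doubled-consonant fix (added here so that "running" stems to "run", matching the earlier example) are illustrative choices:

```python
def stem(word):
    """Strip the first matching suffix, checked in priority order."""
    suffixes = ["ing", "ly", "ed", "es", "s"]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stemmed = word[: -len(suffix)]
            # Undo consonant doubling: "runn" becomes "run" (an assumption).
            if stemmed[-1] == stemmed[-2]:
                stemmed = stemmed[:-1]
            return stemmed
    return word

print(stem("running"))  # 'run'
print(stem("happily"))  # 'happi' (a non-word, as the previous slide warns)
```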

Basic lemmatization with lookup

  • Use a dictionary to map words to base forms
  • Returns the original word if it is not in the dictionary
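
A sketch of dictionary-based lookup; the tiny LEMMAS table is invented for illustration, and a real lemma dictionary would be far larger:

```python
LEMMAS = {
    "running": "run",
    "ran": "run",
    "better": "good",
    "mice": "mouse",
}

def lemmatize(word):
    """Map a word to its dictionary form, or return it unchanged."""
    return LEMMAS.get(word.lower(), word)

print(lemmatize("better"))    # 'good'
print(lemmatize("document"))  # 'document' (not in the dictionary)
```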

Stop word removal: filtering common words

  • Stop words are common words that carry little meaning
    • Articles: “the”, “a”, “an”
    • Prepositions: “in”, “on”, “at”
    • Conjunctions: “and”, “or”, “but”
  • Why remove stop words?
    • Focus on meaningful content words
    • Reduce data size for analysis
    • Improve keyword extraction accuracy
    • Highlight important terms in documents

Implementing stop word filtering

  • Use a set for efficient lookup of stop words
  • Filter tokens using list comprehension
  • Preserves word order while removing common words
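
A sketch with a deliberately small stop word list; real lists often contain well over a hundred entries:

```python
STOP_WORDS = {"the", "a", "an", "in", "on", "at", "and", "or", "but"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words, preserving order."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["the", "stemmer", "and", "the", "lemmatizer", "normalize", "words"]
print(remove_stop_words(tokens))
# ['stemmer', 'lemmatizer', 'normalize', 'words']
```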

Analysis after removing stop words

  • Combine tokenization with stop word removal
  • Count only meaningful content words to reveal key topics
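
A sketch that chains tokenization, stop word removal, and counting; collections.Counter stands in here for the hand-rolled dictionary built on the next slide:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "on", "at", "and", "or", "but"}

def content_word_counts(text):
    """Tokenize, drop stop words, and count the remaining words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

text = "The parser tokenizes the text and the parser counts the tokens."
print(content_word_counts(text).most_common(2))
# [('parser', 2), ('tokenizes', 1)]
```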

Enhancing word frequency analysis

  • Word frequency counts how often each word appears
    • Identifies most common terms
    • Reveals document themes and topics
    • Foundation for text mining and analysis
  • Applications for prosegrammers
    • Find frequently discussed topics
    • Compare vocabulary across documents
    • Detect important technical terms
    • Build word clouds and visualizations

Building a frequency counter

  • Use a dictionary to store word counts
  • Increment count for each occurrence
  • Sort by frequency to find most common words
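
A sketch with a plain dictionary, as the bullets describe; the function name word_frequencies is illustrative:

```python
def word_frequencies(tokens):
    """Count each token's occurrences and sort by descending frequency."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(word_frequencies(["to", "be", "or", "not", "to", "be"]))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```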

Advanced word frequency analysis

  • Extract top N most frequent words
  • Returns a sorted list of word-count pairs
  • Useful for quick document summarization
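
One standard-library way to get the top N; the helper name top_words is an assumption:

```python
from collections import Counter

def top_words(tokens, n=5):
    """Return the n most frequent (word, count) pairs."""
    return Counter(tokens).most_common(n)

tokens = ["nlp", "text", "nlp", "tools", "text", "nlp"]
print(top_words(tokens, 2))  # [('nlp', 3), ('text', 2)]
```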

Keyword extraction to find important terms

  • Keyword extraction identifies terms that best represent content
    • Combine frequency analysis with filtering
    • Remove stop words to focus on content
    • Select words above frequency threshold
  • Why extract keywords?
    • Automatic document tagging and categorization
    • Generate document summaries and abstracts
    • Index documents for search engines
    • Identify main topics in large text collections

Simple keyword extractor

  • Filter out stop words before counting
  • Keep words appearing at least min_freq times
  • Sort by frequency to rank importance
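
A sketch that combines the stop word filter with a frequency threshold; the min_freq default of 2 and the small stop word set are assumptions:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "on", "at", "and", "or", "but"}

def extract_keywords(tokens, min_freq=2):
    """Count non-stop-words and keep those at or above the threshold."""
    counts = Counter(t for t in tokens if t.lower() not in STOP_WORDS)
    keywords = [(w, c) for w, c in counts.items() if c >= min_freq]
    return sorted(keywords, key=lambda pair: pair[1], reverse=True)

tokens = "the index maps terms and the index ranks terms by score".split()
print(extract_keywords(tokens))  # [('index', 2), ('terms', 2)]
```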

Advanced keyword scoring

  • Combine frequency with word length for scoring
  • Longer technical terms get higher scores
  • Normalize scores by the total word count
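
One plausible scoring rule that matches these bullets (frequency multiplied by word length, divided by the total token count); the slides' exact formula may differ:

```python
from collections import Counter

def score_keywords(tokens):
    """Score each word by frequency * length, normalized by token count."""
    counts = Counter(tokens)
    total = len(tokens)
    scores = {word: (count * len(word)) / total for word, count in counts.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

tokens = ["tokenization", "splits", "text", "tokenization", "is", "key"]
print(score_keywords(tokens)[0])  # ('tokenization', 4.0)
```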

Keyword in context: understanding word usage

  • Keyword in context (KWIC) shows words with surrounding text
    • Display keyword with left and right context
    • Understand how terms are used in practice
    • Analyze word meanings and patterns
  • Applications for document analysis
    • Study how technical terms are defined
    • Find usage examples for documentation
    • Analyze sentiment around specific words
    • Build concordances for linguistic study

Building a KWIC tool

  • Find all occurrences of target keyword
  • Extract context words before and after
  • Display as concordance with aligned keywords
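
A sketch of the lookup; the three-word context window is an arbitrary default:

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) context windows for each match."""
    matches = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            matches.append((left, token, right))
    return matches

tokens = "the parser reads the document and the parser builds a tree".split()
for left, word, right in kwic(tokens, "parser"):
    print(left, "|", word, "|", right)
```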

Enhanced KWIC with formatting

  • Align keywords in center column for easy scanning
  • Right-justify left context for visual alignment
  • Professional concordance display format
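
A sketch that formats the triples produced by the kwic sketch above; the left_width default is arbitrary:

```python
def print_concordance(matches, left_width=25):
    """Right-justify left context so keywords align in a center column."""
    for left, keyword, right in matches:
        print(f"{left:>{left_width}}  {keyword}  {right}")

print_concordance([
    ("the", "parser", "reads the document"),
    ("document and the", "parser", "builds a tree"),
])
```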

NLP tools for prosegrammers

  • Next steps with NLP techniques:
    • Find locations in your tool where NLP could add value
      • Could tokenization help parse user input?
      • Would keyword extraction improve search features?
      • Could KWIC help users find usage examples?
    • Combine multiple techniques for powerful analysis
    • How would NLP make your document tools more intelligent?

Key takeaways for prosegrammers

  • Master text processing fundamentals
    • Tokenization splits text into analyzable units
    • Segmentation divides documents into sentences and paragraphs
    • Stemming and lemmatization normalize word forms
  • Build powerful analysis tools
    • Word frequency reveals document themes and patterns
    • Stop word removal focuses on meaningful content
    • Keyword extraction identifies important terms automatically
  • Understand context and usage
    • KWIC displays show how words are used in practice
    • Combine techniques to build sophisticated document tools
    • Simple algorithms handle many real-world NLP tasks
  • Think like a prosegrammer
    • NLP transforms raw text into structured insights
    • Build from scratch to understand core concepts
    • Apply these techniques to real document engineering challenges