Natural Language Processing for Document Engineering

Gregory M. Kapfhammer

November 17, 2025

NLP for prosegrammers

  • What is NLP for document engineering?
    • Breaking text into meaningful units
    • Extracting key information from documents
    • Analyzing language patterns
    • Building tools to understand written content
  • What are this week’s highlights?
    • A “from scratch” implementation of several key NLP techniques:
      • Tokenization and segmentation
      • Stemming and lemmatization
      • Keyword extraction and frequency analysis
      • Keyword-in-context (KWIC) tools

Key insights for prosegrammers

  • Natural language processing transforms raw text into structured data that programs can more easily analyze
  • Although great packages exist, simple algorithms handle common NLP tasks without external libraries
  • Understanding basic NLP helps prosegrammers build their own powerful document analysis tools

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents using industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • This week’s content aids in the attainment of learning objective CS-104-3!

Tokenization means breaking text into words

  • Tokenization splits text into individual units called tokens
    • Words, numbers, punctuation marks
    • Foundation for all text analysis tasks
    • Simple approach uses “blank space splitting”
  • Why tokenize documents?
    • Count words and analyze vocabulary
    • Search for specific terms
    • Build word frequency distributions
    • Prepare text for further NLP processing

Basic word tokenization

  • Split text on blank spaces to create a list of tokens
  • A simple but effective way to start basic text analysis
  • Key question: What happens to punctuation in this approach?
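
A minimal sketch of blank-space splitting; the tokenize name and the sample sentence are illustrative, not from the slides:

```python
def tokenize(text):
    """Split text into tokens on blank spaces."""
    return text.split()

print(tokenize("Prosegrammers analyze documents, one token at a time."))
# ['Prosegrammers', 'analyze', 'documents,', 'one', 'token', 'at', 'a', 'time.']
```

Note that punctuation stays glued to tokens like "documents," and "time.", which is exactly the question the last bullet raises.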

Tokenization with punctuation

  • Use regular expressions to extract only word characters
  • Convert to lowercase for consistent comparison
  • Removes punctuation and normalizes text
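
A possible implementation with Python's re module; the function name tokenize_words is an assumption:

```python
import re

def tokenize_words(text):
    """Extract lowercase word tokens, discarding punctuation."""
    return re.findall(r"\w+", text.lower())

print(tokenize_words("Prosegrammers analyze documents, one token at a time."))
# ['prosegrammers', 'analyze', 'documents', 'one', 'token', 'at', 'a', 'time']
```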

Segmentation: dividing text into sentences

  • Segmentation splits text into sentences or paragraphs
    • Identify sentence boundaries using punctuation
    • Handle abbreviations and special cases
    • Essential for readability analysis
  • Why segment documents?
    • Calculate sentences per paragraph
    • Analyze sentence complexity
    • Extract specific paragraphs or sections
    • Prepare text for summarization

Basic sentence segmentation

  • Use regular expressions to complete the task!
  • Split on sentence-ending punctuation marks
  • Remove empty strings and extra blank spaces
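
One way to sketch this with re.split; the [.!?] character class is a simplification that ignores abbreviations such as "Dr.":

```python
import re

def segment_sentences(text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    parts = re.split(r"[.!?]+", text)
    return [part.strip() for part in parts if part.strip()]

print(segment_sentences("Tokenize first. Then segment the text! Ready?"))
# ['Tokenize first', 'Then segment the text', 'Ready']
```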

Paragraph segmentation

  • Use double newlines as paragraph boundaries
  • However, not all documents adopt this common convention!
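
A sketch that treats one or more blank lines as a paragraph break; the exact regular expression is an assumption:

```python
import re

def segment_paragraphs(text):
    """Split text wherever one or more blank lines appear."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]

document = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
print(len(segment_paragraphs(document)))  # 3
```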

Building stemming and lemmatization tools

  • Stemming reduces words to their root form by removing suffixes
    • Example: “running” becomes “run”, “happily” becomes “happi”
    • Fast but may produce non-words
    • Yet, useful for search and matching
  • Lemmatization reduces words to their dictionary form
    • Example: “running” becomes “run”, “better” becomes “good”
    • Although more accurate, it requires linguistic knowledge
    • This week, we’ll implement simple rule-based stemming!

Simple suffix-based stemmer

  • Attempt to remove common English suffixes according to a priority order
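
A rule-based sketch; the suffix list, its priority order, and the doubled-consonant fix (added here so that "running" stems to "run", matching the earlier example) are illustrative choices:

```python
def stem(word):
    """Strip the first matching suffix, checked in priority order."""
    suffixes = ["ing", "ly", "ed", "es", "s"]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stemmed = word[: -len(suffix)]
            # Undo consonant doubling: "runn" becomes "run" (an assumption).
            if stemmed[-1] == stemmed[-2]:
                stemmed = stemmed[:-1]
            return stemmed
    return word

print(stem("running"))  # 'run'
print(stem("happily"))  # 'happi' (a non-word, as the previous slide warns)
```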

Basic lemmatization with lookup

  • Use a dictionary to map words to base forms
  • Returns the original word if it is not in the dictionary
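
A sketch of dictionary-based lookup; the tiny LEMMAS table is invented for illustration, and a real lemma dictionary would be far larger:

```python
LEMMAS = {
    "running": "run",
    "ran": "run",
    "better": "good",
    "mice": "mouse",
}

def lemmatize(word):
    """Map a word to its dictionary form, or return it unchanged."""
    return LEMMAS.get(word.lower(), word)

print(lemmatize("better"))    # 'good'
print(lemmatize("document"))  # 'document' (not in the dictionary)
```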

Stop word removal: filtering common words

  • Stop words are common words that carry little meaning
    • Articles: “the”, “a”, “an”
    • Prepositions: “in”, “on”, “at”
    • Conjunctions: “and”, “or”, “but”
  • Why remove stop words?
    • Focus on meaningful content words
    • Reduce data size for analysis
    • Improve keyword extraction accuracy
    • Highlight important terms in documents

Implementing stop word filtering

  • Use a set for efficient lookup of stop words
  • Filter tokens using list comprehension
  • Preserves word order while removing common words
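
A sketch with a deliberately small stop word list; real lists often contain well over a hundred entries:

```python
STOP_WORDS = {"the", "a", "an", "in", "on", "at", "and", "or", "but"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words, preserving order."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["the", "stemmer", "and", "the", "lemmatizer", "normalize", "words"]
print(remove_stop_words(tokens))
# ['stemmer', 'lemmatizer', 'normalize', 'words']
```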

Analysis after removing stop words

  • Combine tokenization with stop word removal
  • Count only meaningful content words to reveal key topics
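
A sketch that chains tokenization, stop word removal, and counting; collections.Counter stands in here for the hand-rolled dictionary built on the next slide:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "on", "at", "and", "or", "but"}

def content_word_counts(text):
    """Tokenize, drop stop words, and count the remaining words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

text = "The parser tokenizes the text and the parser counts the tokens."
print(content_word_counts(text).most_common(2))
# [('parser', 2), ('tokenizes', 1)]
```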

Enhancing word frequency analysis

  • Word frequency counts how often each word appears
    • Identifies most common terms
    • Reveals document themes and topics
    • Foundation for text mining and analysis
  • Applications for prosegrammers
    • Find frequently discussed topics
    • Compare vocabulary across documents
    • Detect important technical terms
    • Build word clouds and visualizations

Building a frequency counter

  • Use a dictionary to store word counts
  • Increment count for each occurrence
  • Sort by frequency to find most common words
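
A sketch with a plain dictionary, as the bullets describe; the function name word_frequencies is illustrative:

```python
def word_frequencies(tokens):
    """Count each token's occurrences and sort by descending frequency."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(word_frequencies(["to", "be", "or", "not", "to", "be"]))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```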

Advanced word frequency analysis

  • Extract top N most frequent words
  • Returns a sorted list of word-count pairs
  • Useful for quick document summarization
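
One standard-library way to get the top N; the helper name top_words is an assumption:

```python
from collections import Counter

def top_words(tokens, n=5):
    """Return the n most frequent (word, count) pairs."""
    return Counter(tokens).most_common(n)

tokens = ["nlp", "text", "nlp", "tools", "text", "nlp"]
print(top_words(tokens, 2))  # [('nlp', 3), ('text', 2)]
```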

Keyword extraction to find important terms

  • Keyword extraction identifies terms that best represent content
    • Combine frequency analysis with filtering
    • Remove stop words to focus on content
    • Select words above frequency threshold
  • Why extract keywords?
    • Automatic document tagging and categorization
    • Generate document summaries and abstracts
    • Index documents for search engines
    • Identify main topics in large text collections

Simple keyword extractor

  • Filter out stop words before counting
  • Keep words appearing at least min_freq times
  • Sort by frequency to rank importance
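
A sketch that combines the stop word filter with a frequency threshold; the min_freq default of 2 and the small stop word set are assumptions:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "on", "at", "and", "or", "but"}

def extract_keywords(tokens, min_freq=2):
    """Count non-stop-words and keep those at or above the threshold."""
    counts = Counter(t for t in tokens if t.lower() not in STOP_WORDS)
    keywords = [(w, c) for w, c in counts.items() if c >= min_freq]
    return sorted(keywords, key=lambda pair: pair[1], reverse=True)

tokens = "the index maps terms and the index ranks terms by score".split()
print(extract_keywords(tokens))  # [('index', 2), ('terms', 2)]
```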

Advanced keyword scoring

  • Combine frequency with word length for scoring
  • Longer technical terms get higher scores
  • Normalize scores by the total word count
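
One plausible scoring rule that matches these bullets (frequency multiplied by word length, divided by the total token count); the slides' exact formula may differ:

```python
from collections import Counter

def score_keywords(tokens):
    """Score each word by frequency * length, normalized by token count."""
    counts = Counter(tokens)
    total = len(tokens)
    scores = {word: (count * len(word)) / total for word, count in counts.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

tokens = ["tokenization", "splits", "text", "tokenization", "is", "key"]
print(score_keywords(tokens)[0])  # ('tokenization', 4.0)
```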

Keyword in context: understanding word usage

  • Keyword in context (KWIC) shows words with surrounding text
    • Display keyword with left and right context
    • Understand how terms are used in practice
    • Analyze word meanings and patterns
  • Applications for document analysis
    • Study how technical terms are defined
    • Find usage examples for documentation
    • Analyze sentiment around specific words
    • Build concordances for linguistic study

Building a KWIC tool

  • Find all occurrences of target keyword
  • Extract context words before and after
  • Display as concordance with aligned keywords
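
A sketch of the lookup; the three-word context window is an arbitrary default:

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) context windows for each match."""
    matches = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            matches.append((left, token, right))
    return matches

tokens = "the parser reads the document and the parser builds a tree".split()
for left, word, right in kwic(tokens, "parser"):
    print(left, "|", word, "|", right)
```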

Enhanced KWIC with formatting

  • Align keywords in center column for easy scanning
  • Right-justify left context for visual alignment
  • Professional concordance display format
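
A sketch that formats the triples produced by the kwic sketch above; the left_width default is arbitrary:

```python
def print_concordance(matches, left_width=25):
    """Right-justify left context so keywords align in a center column."""
    for left, keyword, right in matches:
        print(f"{left:>{left_width}}  {keyword}  {right}")

print_concordance([
    ("the", "parser", "reads the document"),
    ("document and the", "parser", "builds a tree"),
])
```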

NLP tools for prosegrammers

  • Next steps with NLP techniques:
    • Find locations in your tool where NLP could add value
      • Could tokenization help parse user input?
      • Would keyword extraction improve search features?
      • Could KWIC help users find usage examples?
    • Combine multiple techniques for powerful analysis
    • How would NLP make your document tools more intelligent?

Key takeaways for prosegrammers

  • Master text processing fundamentals
    • Tokenization splits text into analyzable units
    • Segmentation divides documents into sentences and paragraphs
    • Stemming and lemmatization normalize word forms
  • Build powerful analysis tools
    • Word frequency reveals document themes and patterns
    • Stop word removal focuses on meaningful content
    • Keyword extraction identifies important terms automatically
  • Understand context and usage
    • KWIC displays show how words are used in practice
    • Combine techniques to build sophisticated document tools
    • Simple algorithms handle many real-world NLP tasks
  • Think like a prosegrammer
    • NLP transforms raw text into structured insights
    • Build from scratch to understand core concepts
    • Apply these techniques to real document engineering challenges