Introduction to Document Engineering

Gregory M. Kapfhammer

August 25, 2025

Document engineering

  • What is document engineering?
    • Creating documents using code
    • Manipulating and analyzing text data
    • Building documentation systems
    • “Prosegrammers” combine prose and programming
  • Why is it important?
    • Documentation quality affects software success
      • Poor docs cause user confusion
      • Clear docs increase adoption
      • Automated docs reduce maintenance cost

Becoming a prosegrammer

  • Master Python programming
    • Text processing and analysis
    • Document creation and manipulation
    • Automation tools for writing
  • Create compelling documentation
    • Clear and professional writing
    • Interactive documents with code
    • Version control for documents
  • Apply science and engineering to analyze and improve documents!

What does a prosegrammer do?

  • Prose (written word) meets Programming (software development)
  • Generate reports from data automatically
  • Build interactive documentation systems
  • Create tools that transform and analyze text
  • Automate repetitive writing tasks
  • Analyze large collections of documents
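
One item in this list, generating reports from data automatically, can be sketched in a few lines of Python. The function name `build_markdown_report` and the sample metrics are hypothetical and only illustrate the idea; a real report would likely read its data from files or another source.

```python
def build_markdown_report(title: str, metrics: dict[str, int]) -> str:
    """Build a small Markdown report from a dictionary of metrics."""
    # start with a heading and the header rows of a Markdown table
    lines = [f"# {title}", "", "| Metric | Value |", "| --- | --- |"]
    # add one table row per metric
    for name, value in metrics.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines)

# hypothetical data used only for illustration
report = build_markdown_report("Weekly Docs Report", {"pages": 12, "broken links": 3})
print(report)
```

Because the report is plain Markdown text, it could be saved to a file and rendered by a documentation tool.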

How do we create better documents with code? How do we analyze text data to gain insights? How do we automate documentation workflows? How do we effectively use generative AI?

Python function for text analysis

from typing import Dict
import string

def word_frequency(text: str) -> Dict[str, int]:
    """Analyze text and return a dictionary of word frequencies."""
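    # lowercase the text and delete all ASCII punctuation characters
    cleaned_text = text.lower().translate(str.maketrans('', '', string.punctuation))
    # split on whitespace to get the individual words
    words = cleaned_text.split()
    # count each word, starting from zero the first time it appears
    frequency_dict: Dict[str, int] = {}
    for word in words:
        frequency_dict[word] = frequency_dict.get(word, 0) + 1
    return frequency_dict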
  • Text analysis: fundamental skill for prosegrammers
  • Word frequency: helps understand document content patterns
  • Define function: accept an input, process, make output
  • Identify parts: what are key parts that make this function work?
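
As a point of comparison, the same counting step could be written with `collections.Counter` from the Python standard library. This is an alternative sketch, not the version used in these slides; the name `word_frequency_counter` and the sample sentence are made up for this example.

```python
import string
from collections import Counter

def word_frequency_counter(text: str) -> Counter:
    """Count words after lowercasing and stripping ASCII punctuation."""
    cleaned_text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Counter tallies each item in the list of words
    return Counter(cleaned_text.split())

counts = word_frequency_counter("Clear docs help. Clear docs win!")
print(counts.most_common(2))
```

A `Counter` behaves like a dictionary, so code that reads the counts by key would work with either version.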

Call the text analysis function

# example text about document engineering
sample_text = "Document engineering combines programming with writing. Writing clear documents requires skill."

# analyze the text and display results
word_counts = word_frequency(sample_text)
print("Word Frequencies:")
for word, count in sorted(word_counts.items()):
    print(f"'{word}': {count}")
Word Frequencies:
'clear': 1
'combines': 1
'document': 1
'documents': 1
'engineering': 1
'programming': 1
'requires': 1
'skill': 1
'with': 1
'writing': 2

Try the word_frequency function!

  • Important question: what patterns do you notice in the word frequencies?
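
One way to look for patterns is to rank words by count rather than alphabetically. The sketch below assumes a dictionary shaped like the one `word_frequency` returns; the helper name `top_words` is illustrative, and the sample dictionary holds only a few of the counts from the example output.

```python
def top_words(frequencies: dict[str, int], limit: int = 3) -> list[tuple[str, int]]:
    """Return the most frequent words, breaking ties alphabetically."""
    # sort by descending count first, then by the word itself
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]

# a few of the counts from the example output
sample_counts = {"writing": 2, "document": 1, "documents": 1, "clear": 1}
print(top_words(sample_counts, limit=2))
```

Note that "document" and "documents" are counted as separate words, which is one pattern worth discussing.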

Improved document analysis function

import re
from typing import Dict, Any

def document_summary(text: str) -> Dict[str, Any]:
    """Generate a comprehensive summary of document statistics."""
    # count words (excluding punctuation-only tokens)
    words = [word for word in text.split() if any(char.isalnum() for char in word)]
    word_count = len(words)
    # count sentences (simple approach using sentence-ending punctuation)
    sentences = re.split(r'[.!?]+', text)
    sentence_count = len([s for s in sentences if s.strip()])
    # count paragraphs (assuming double newlines separate paragraphs)
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    paragraph_count = len(paragraphs)
    # calculate averages
    avg_words_per_sentence = word_count / sentence_count if sentence_count > 0 else 0
    avg_sentences_per_paragraph = sentence_count / paragraph_count if paragraph_count > 0 else 0
    return {
        'word_count': word_count, 'sentence_count': sentence_count,
        'paragraph_count': paragraph_count,
        'average_words_per_sentence': round(avg_words_per_sentence, 1),
        'average_sentences_per_paragraph': round(avg_sentences_per_paragraph, 1)
    }

Exploring document_summary

  • document_summary: analyzes text structure and readability
  • Uses re for sentence detection and provides key quality metrics
  • Helps prosegrammers assess document effectiveness
  • Accepts as input a string of text and returns a dictionary of statistics
  • Different phases of text processing:
    • Count the number of words
    • Count the number of sentences
    • Count the number of paragraphs
    • Calculate the averages
    • Return the results in a structured format
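
The sentence count relies on a simple split at `.`, `!`, and `?`, so abbreviations can inflate it. The following sketch repeats only that one step to show the effect; `count_sentences` is a helper name introduced here, not part of the slides.

```python
import re

def count_sentences(text: str) -> int:
    """Count sentences with the same split-on-punctuation approach."""
    pieces = re.split(r"[.!?]+", text)
    # ignore empty or whitespace-only pieces left over from the split
    return len([piece for piece in pieces if piece.strip()])

print(count_sentences("Python is fun. Writing is too!"))
print(count_sentences("Use a generator, e.g. Quarto. It helps."))
```

The second text has two sentences, yet the function reports four because each period in "e.g." adds a split point. This is the trade-off of the simple approach.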

Running a document analysis

# define an example document about prosegrammers
sample_document = """
Prosegrammers are skilled professionals who combine programming expertise with writing abilities. They create tools that help generate, analyze, and improve documents.

Document engineering is an exciting field that leverages technology to enhance written communication. Python provides excellent libraries for text processing.

By mastering both code and prose, prosegrammers can automate repetitive writing tasks, analyze large collections of documents, and create dynamic content.
"""

# analyze the document using the defined summary function
summary = document_summary(sample_document.strip())
print("Document Analysis Summary:")
for metric, value in summary.items():
    print(f"{metric.replace('_', ' ').title()}: {value}")
Document Analysis Summary:
Word Count: 62
Sentence Count: 5
Paragraph Count: 3
Average Words Per Sentence: 12.4
Average Sentences Per Paragraph: 1.7

Discuss analysis results

  • Discuss document_summary in your teams:
    • How would you explain how this program works?
    • What is the most confusing aspect of this code?
    • What is thought-provoking about this approach?
  • Apply document_summary in your teams:
    • What insights do these metrics give on document readability?
    • How could prosegrammers use these tools in real projects?
    • What other document analysis features would be useful?
  • Prosegrammers use programs to analyze prose!
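
As one possible answer to the question about other analysis features, a team might measure vocabulary variety. The sketch below computes the share of distinct words in a text; the function name `lexical_diversity` and the choice of metric are assumptions offered for discussion, not part of `document_summary`.

```python
import string

def lexical_diversity(text: str) -> float:
    """Return unique words divided by total words (0.0 for empty text)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    if not words:
        return 0.0
    # a set keeps only one copy of each distinct word
    return round(len(set(words)) / len(words), 2)

print(lexical_diversity("Writing clear documents requires clear writing."))
```

A value near 1.0 means few repeated words, while a lower value means more repetition. Whether either is better depends on the document and its audience.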

Real-world engineering challenges

  • Characterize documents and their creation process
    • How are documents currently written and maintained?
    • What tools and workflows are being used?
    • What are the pain points in the current process?
  • Compare and improve document workflows
    • What metrics matter for document quality and efficiency?
    • How to measure the effectiveness of documentation?
    • What tools will improve the writing and publishing process?
    • How to optimize workflows to reduce manual effort?

Why is documentation challenging?

  • Different audiences need different formats
  • Documents must stay synchronized with code
  • Collaboration on documents is often difficult
  • Maintaining consistency across large projects
  • Balancing automation with human creativity
  • Ensuring accessibility and usability

During this academic semester you will overcome these challenges and become a proficient prosegrammer!

Learn about document engineering

Explore the Quarto documentation and Python documentation for details

Review exemplary projects like the Django, FastAPI, and uv docs

  • Document engineering requires both technical and writing skills
  • Key areas of focus for this course:
    • Python programming and text processing
    • Markdown and markup languages
    • Version control for documents
    • Automation and workflow optimization
  • Analysis of document quality and user experience

Document engineering with AI

  • Using AI tools like GitHub Copilot, Google Gemini CLI, or Claude:
    • Is the generated text accurate and well-written?
    • Can the generated content be improved and personalized?
    • Is the generated text clear, accessible, and appropriate?
    • Can you integrate AI-generated content into your workflow?
    • Can you maintain quality standards while using AI assistance?

Prosegrammers who use AI writing and coding tools are responsible for ensuring quality, accuracy, and ethical standards of all their work!

Let’s set up your document engineering environment!

  • Laboratory session on Wednesday
  • Classroom session on Friday
  • Skill-check next Friday to confirm your setup

Essential tools for prosegrammers

  • Text editor like VS Code, Zed, or Neovim for writing
  • Version control like Git for tracking document changes
  • Documentation generator like Quarto or Sphinx
  • Generative AI tools like GitHub Copilot or Google Gemini CLI

How do we characterize effective document tools? How do we compare their features for different projects? How do we integrate them into efficient workflows? How do we configure these tools for our needs? How do we use them for teamwork?

Document engineering environment

  • Text editor with syntax highlighting and extensions (e.g., VS Code)
  • Version control system (e.g., Git with GitHub)
  • Document format (e.g., Markdown, reStructuredText, LaTeX)
  • Static site generator (e.g., Quarto, Hugo, Jekyll)
  • Collaboration platforms and review workflows (e.g., GitHub)
  • Automation tools (e.g., GitHub Actions, pre-commit hooks)
  • Deployment targets (e.g., GitHub Pages, Netlify)
  • Package managers for dependencies (e.g., uv and npm)
  • We will learn how to use these through the semester!

Development environment setup

  • Installing essential tools for prosegrammers
  • Configuring development environment for document work
    • Complete these tasks during the first and second weeks
    • A skill-check during the second week tests your setup
    • Please attend the SOS Week events to learn more
    • Work with instructor and student technical leaders
    • Don’t hesitate to regularly ask questions in Discord
    • Keep working and don’t give up on setup tasks

GitHub Student Benefits and Copilot

  • GitHub Student Developer Pack
    • Free access to premium developer tools and services
    • Apply at education.github.com
    • Requires verification with .edu email or student ID
  • GitHub Copilot Pro for Students
    • AI-powered code completion and generation
    • Free for verified students and educators
    • Integrates with VS Code and other editors
  • Why GitHub tools? Essential for collaboration!

Testing your prosegrammer setup

# Run these commands in your terminal window
git --version                   # Check Git installation
python --version                # Check Python (via uv)
quarto --version                # Check Quarto installation
code --version                  # Check VS Code installation
uv --version                    # Check uv package manager
  • Test each tool individually before starting projects
  • Create a test document with code and text to verify integration
  • Consult documentation links when troubleshooting
  • Schedule office hours with the course instructor
  • Visit office hours with the student technical leaders
  • Have these set up by the second Friday of the semester!
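
The same checks could be scripted. This sketch uses `shutil.which` from the Python standard library to report whether each command is on the `PATH`. It assumes the command names match the ones in the terminal listing, and it does not verify versions. The function name `find_tools` is made up for this example.

```python
import shutil

# command names taken from the terminal checks
TOOLS = ["git", "python", "quarto", "code", "uv"]

def find_tools(names: list[str]) -> dict[str, bool]:
    """Map each command name to whether it is found on the PATH."""
    return {name: shutil.which(name) is not None for name in names}

for tool, found in find_tools(TOOLS).items():
    print(f"{tool}: {'found' if found else 'missing'}")
```

A "missing" result only means the command was not found on the `PATH`. The tool may still be installed in a location that the terminal does not search.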

Can you clone the course website to your laptop and run quarto preview? What output do you see?

  • Set up SSH keys for GitHub to ensure secure communication
  • Run git clone git@github.com:prosegrammers/www.prosegrammers.com.git
  • Run cd www.prosegrammers.com to change into this directory
  • Run quarto preview to build and serve the website locally
  • Confirm that the course website builds and serves locally
  • Can you add content to the website and see it rendered?

After cloning the course website to your laptop, let’s test uv! How fun!

  • Stay in the www.prosegrammers.com directory
  • Run cd scripts to change to the scripts/ directory
  • Run uv run welcome.py to receive your welcome to the course!
  • What output do you see on the screen? How does this work?
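
The contents of `welcome.py` are not shown here, so the following is a hypothetical stand-in rather than the course's actual script. It sketches how a script run with `uv run` can carry inline metadata (the `# /// script` comment block defined by PEP 723), which lets uv prepare a suitable environment before running it. The function `welcome_message` and its greeting text are invented for illustration.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///

# hypothetical stand-in for a welcome script; not the real welcome.py

def welcome_message(name: str) -> str:
    """Build a short greeting for a new prosegrammer."""
    return f"Welcome to document engineering, {name}!"

if __name__ == "__main__":
    print(welcome_message("prosegrammer"))
```

If third-party packages were listed under `dependencies`, uv would install them into a temporary environment for the script. That is one plausible explanation for how a single command can run a script without manual setup.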

Overall document engineering setup

Tips for effective document engineering setup

  • Devote time outside class to installing and configuring tools

  • Confirm that most tools work during the first lab session

  • Successfully get all tools to work during the first lab session

  • Create and render test documents with the provided examples

  • Complete the first document engineering project on time

  • Contribute to collaborative documentation projects

  • Prepare for the first document engineering skill check

  • Get ready for an exciting journey into document engineering!

  • If you are having trouble, publicly ask for help on Discord!

Goals of document engineering

  • Document Creation:
    • Design and implement document generation workflows
    • Test all aspects of documents to ensure quality and accuracy
    • Create frameworks for automated document production
  • Document Analysis:
    • Collect and analyze data about document usage and quality
    • Visualize insights to improve documentation strategies
  • Communicate results and best practices for document engineering
  • Check the syllabus for details about the Document Engineering course!