Using Data Containers for Document Engineering

Gregory M. Kapfhammer

October 13, 2025

Data containers

  • What is document engineering?
    • Creating documents using code
    • Manipulating and analyzing text data
    • Building documentation systems
    • “Prosegrammers” combine prose and programming
  • What are this week’s highlights?
    • Explore Python data containers for document engineering
      • Lists for organizing document collections and sections
      • Tuples for storing document metadata immutably
      • Sets for managing unique keywords and tags

Key insights for prosegrammers

  • Document engineering means blending code and prose to build resources for both humans and machines
  • Python containers organize document data efficiently: lists for sequences, tuples for records, sets for uniqueness
  • Data containers can store multiple documents, in different formats, with different data and metadata

Python collections overview

  • Lists: mutable sequences for document sections
    • Store chapters, paragraphs, or document versions
    • Perfect for ordered content that may change
    • Support appending, removing, and modifying elements
  • Tuples: immutable records for document metadata
    • Store title, author, date information safely
    • Guaranteed not to change accidentally
    • Efficient for fixed document properties
  • Sets: unique collections for document keywords
    • Eliminate duplicate tags automatically
    • Fast membership testing and set operations
    • Perfect for managing document categories

Using lists in Python

  • Creating document collections
    • Store related files in ordered sequences
    • Build documentation hierarchies
  • Modifying document structures
    • Add, remove, and reorganize content
    • Update documentation dynamically
  • Accessing document elements
    • Find specific documents by position
    • Process collections systematically

Basic list operations

  • Create document list, iterate with for loop, and then display details
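
The steps in this slide can be sketched as follows. This is a minimal illustration, not the original slide code; the file names are hypothetical placeholders.

```python
# Hypothetical file names used only for illustration.
documents: list[str] = ["introduction.md", "methods.md", "results.md"]

# Iterate over the list and display each document with its position.
for position, name in enumerate(documents, start=1):
    print(f"{position}. {name}")

# Lists support index-based access and report their length.
first_document = documents[0]
document_count = len(documents)
print(f"First document: {first_document}")
print(f"Total documents: {document_count}")
```

Using `enumerate` avoids managing a counter by hand while still showing each document's position in the sequence.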

Two-dimensional lists

  • Create a list of lists, iterate with for loop, and then display details
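
One possible sketch of a list of lists, assuming each inner list holds the section titles of one document. The section titles are invented for illustration.

```python
# Each inner list holds the section titles of one hypothetical document.
document_sections: list[list[str]] = [
    ["Overview", "Installation"],
    ["Usage", "Configuration", "Troubleshooting"],
    ["Changelog"],
]

# The outer loop visits each document; the inner loop visits its sections.
for doc_index, sections in enumerate(document_sections):
    print(f"Document {doc_index} has {len(sections)} section(s):")
    for section in sections:
        print(f"  - {section}")

# Two indexes select one section: first the document, then the section.
second_section_of_second_doc = document_sections[1][1]
total_sections = sum(len(sections) for sections in document_sections)
print(f"Selected section: {second_section_of_second_doc}")
print(f"Total sections: {total_sections}")
```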

Modifying lists dynamically
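
A short sketch of dynamic modification, assuming a list of chapter titles that is edited in place. The titles are illustrative only.

```python
chapters: list[str] = ["Introduction", "Methods", "Conclusion"]

chapters.append("References")   # add a chapter at the end
chapters.insert(2, "Results")   # insert a chapter before index 2
chapters.remove("Methods")      # remove the first element equal to "Methods"
chapters[0] = "Overview"        # replace an element by index

print(chapters)
```

Each call changes the same list object, which is what "mutable" means in practice: no new list is created along the way.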

Lists for document engineering

  • Document collections: store related files in order
  • Dynamic modification: add, insert, and remove content
  • Flexible organization: restructure documents as needed
  • Index-based access: retrieve specific sections efficiently
  • Next steps for understanding how to use lists:
    • Find a location in your document engineering tool where you used lists
      • Is it working correctly?
      • How did you test and debug it?
      • How can you refactor the code?
    • If you did not find a list being used, how could you add one to your project?

Using lists in Python for document engineering

  • Input one or more documents from the file system
  • Parse each document to a data structure instance
  • Store all data structures for each document in a list
  • Iterate through the list to process all data structures
  • Output the results of the analysis to the console
  • How could you extend your tool to handle multiple files?
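
The workflow above could be sketched roughly as follows. The `ParsedDocument` class and the `parse_document` and `parse_directory` functions are hypothetical names introduced for this example, and the sample files are created in a temporary directory so that the sketch runs anywhere.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ParsedDocument:
    """Hypothetical record for one parsed document."""

    name: str
    line_count: int
    word_count: int


def parse_document(path: Path) -> ParsedDocument:
    """Read one text file and summarize it as a ParsedDocument."""
    text = path.read_text(encoding="utf-8")
    return ParsedDocument(path.name, len(text.splitlines()), len(text.split()))


def parse_directory(directory: Path) -> list[ParsedDocument]:
    """Parse every Markdown file in a directory, sorted by file name."""
    return [parse_document(path) for path in sorted(directory.glob("*.md"))]


# Create two small sample files so that there is something to read.
with tempfile.TemporaryDirectory() as temp_name:
    temp_dir = Path(temp_name)
    (temp_dir / "a.md").write_text("# Title\n\nHello world\n", encoding="utf-8")
    (temp_dir / "b.md").write_text("One two three\n", encoding="utf-8")
    parsed_documents = parse_directory(temp_dir)

# Iterate through the list and output the results to the console.
for document in parsed_documents:
    print(f"{document.name}: {document.line_count} lines, {document.word_count} words")
```

Because `parse_directory` returns a list, handling more files only requires placing more files in the directory; the processing loop stays the same.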

Tuples for storing immutable metadata

  • Creating immutable records
    • Store document properties safely
    • Prevent accidental data changes
  • Organizing metadata collections
    • Build consistent document catalogs
    • Maintain data integrity
  • Analyzing document metrics
    • Extract statistics from records
    • Process structured data efficiently

Basic tuple operations

  • Create metadata tuple, iterate with for loop, and then display details
  • How is a tuple different from a list? How do we use it differently?
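
A minimal sketch of a metadata tuple, assuming a record of the form (title, author, year). The values are made up for illustration.

```python
# Hypothetical metadata record: (title, author, year).
metadata: tuple[str, str, int] = ("Style Guide", "A. Writer", 2025)

# A tuple can be iterated just like a list.
for value in metadata:
    print(value)

# Tuple unpacking assigns each position to its own name.
title, author, year = metadata
print(f"{title} by {author} ({year})")

# Unlike a list, a tuple rejects item assignment with a TypeError.
try:
    metadata[0] = "New Title"
    modification_allowed = True
except TypeError:
    modification_allowed = False

print(f"Modification allowed: {modification_allowed}")
```

The failed assignment is the key difference from a list: a tuple can be read and unpacked, but its elements cannot be replaced after creation.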

Document analysis with tuples

  • Create list of tuples, iterate with for loop, and then display details
  • Many combinations of data structures (e.g., lists and tuples) are possible!
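
One way to combine the two containers is a list of tuples, sketched here under the assumption that each record is (title, word count, section count). All numbers are invented for illustration.

```python
# Each record is (title, word_count, section_count); the values are made up.
records: list[tuple[str, int, int]] = [
    ("User Guide", 1200, 8),
    ("API Reference", 3400, 15),
    ("Tutorial", 800, 5),
]

# Unpack each tuple in the loop header and accumulate a running total.
total_words = 0
for title, word_count, section_count in records:
    total_words += word_count
    print(f"{title}: {word_count} words in {section_count} sections")

# Compute simple metrics across all of the records.
average_words = total_words / len(records)
longest_title, longest_words, _ = max(records, key=lambda record: record[1])

print(f"Average words per document: {average_words}")
print(f"Longest document: {longest_title} ({longest_words} words)")
```

The list can grow or shrink as documents are added or removed, while each individual record stays fixed.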

Tuples for document engineering

  • Immutable records: metadata cannot be accidentally changed
  • Structured data: consistent format for document properties
  • Tuple unpacking: easy extraction of individual values
  • Statistical analysis: compute metrics across documents
  • Next steps for understanding how to use tuples:
    • Find a location in your document engineering tool where you used tuples
      • Is it working correctly?
      • How did you test and debug it?
      • How can you refactor the code?
    • If you did not find a tuple being used, how could you add one to your project?

Sets for storing document keywords

  • Managing unique keywords
    • Eliminate duplicate tags automatically
    • Build clean tag collections
  • Performing set operations
    • Find common and unique tags
    • Analyze document relationships
  • Categorizing documents
    • Organize content by complexity
    • Create document taxonomies

Basic set operations for documents

  • Create list of tags for multiple documents under analysis
  • Find those tags that are unique and those that are common
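
The two steps above might look like the following sketch, assuming two documents whose tag lists contain duplicates. The tag names are hypothetical.

```python
# Hypothetical tag lists for two documents; note the repeated entries.
guide_tags: list[str] = ["python", "tutorial", "python", "beginner"]
api_tags: list[str] = ["python", "reference", "advanced", "reference"]

# Converting a list to a set drops the duplicates.
guide_set: set[str] = set(guide_tags)
api_set: set[str] = set(api_tags)

common_tags = guide_set & api_set   # intersection: tags in both documents
all_tags = guide_set | api_set      # union: tags in either document
guide_only = guide_set - api_set    # difference: tags only in the guide

# Sets are unordered, so sort before printing for stable output.
print(f"Common: {sorted(common_tags)}")
print(f"All: {sorted(all_tags)}")
print(f"Guide only: {sorted(guide_only)}")
```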

Sets for document engineering

  • Automatic uniqueness: no duplicate tags allowed
  • Set operations: union, intersection, difference for analysis
  • Tag management: organize and categorize document content
  • Membership testing: quickly check whether a tag or document is already present
  • Next steps for understanding how to use sets:
    • Find a location in your document engineering tool where you used sets
      • Is it working correctly?
      • How did you test and debug it?
      • How can you refactor the code?
    • If you did not find a set being used, how could you add one to your project?

Container integration

Container | Mutable | Ordered | Duplicates     | Best For
List      | ✅ Yes  | ✅ Yes  | ✅ Allowed     | Document sections, chapters, file collections
Tuple     | ❌ No   | ✅ Yes  | ✅ Allowed     | Document metadata, fixed records, coordinates
Set       | ✅ Yes  | ❌ No   | ❌ Not Allowed | Keywords, tags, unique identifiers
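
The properties in the table can be checked directly. The following sketch builds each container from the same illustrative values and probes mutability, ordering, and duplicates.

```python
# List: mutable, ordered, duplicates allowed.
sections = ["intro", "body", "intro"]
sections.append("summary")

# Tuple: ordered, duplicates allowed, but no methods that modify it.
record = ("intro", "body", "intro")
tuple_has_append = hasattr(record, "append")

# Set: mutable, but duplicates collapse into a single element.
tags = {"intro", "body", "intro"}
tags.add("summary")

# Sets have no positions, so indexing raises a TypeError.
try:
    tags[0]
    set_is_indexable = True
except TypeError:
    set_is_indexable = False

print(f"List length: {len(sections)}")
print(f"Tuple has append: {tuple_has_append}")
print(f"Set size: {len(tags)}")
print(f"Set is indexable: {set_is_indexable}")
```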

Summary of data container choices

  • Lists: when you need to modify and maintain order
  • Tuples: when data should never change
  • Sets: when uniqueness matters most
  • Customize your own data containers to meet your tool’s needs!
  • Next steps for understanding how to use containers:
    • Think of a feature for your document engineering tool that needs a container:
      • How would your tool’s feature use a container?
      • If you could use a container, what would be the benefit?
      • What type of container would you pick for this feature?
      • How would you test to confirm that the container works correctly?

Integrated document analysis

  • Create list of sample documents that contain simple text
  • Determine the number of unique words across all of the documents
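
A possible sketch of this analysis, combining a list of documents with a set of words. The sample sentences and the `extract_words` helper are assumptions made for this example, and the helper only strips ASCII punctuation.

```python
import string

# Sample documents with simple text, invented for illustration.
documents: list[str] = [
    "Python makes document engineering fun.",
    "Document engineering blends prose and code.",
    "Code and prose work together in Python!",
]


def extract_words(text: str) -> set[str]:
    """Lowercase the text, strip ASCII punctuation, and return its unique words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


# Merge the words of every document into one set.
unique_words: set[str] = set()
for document in documents:
    unique_words |= extract_words(document)

unique_word_count = len(unique_words)
print(f"Unique words across all documents: {unique_word_count}")
```

Lowercasing and removing punctuation before splitting keeps "Python", "python", and "Python!" from being counted as three different words.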

Key takeaways for prosegrammers

  • Choose the right container
    • Lists for document sequences and mutable collections
    • Tuples for immutable metadata and structured records
    • Sets for unique keywords, tags, and categories
  • Master container operations
    • Create, access, modify, and analyze document data
    • Use indexing, slicing, and iteration effectively
    • Apply set operations for document categorization
  • Think and act like a prosegrammer
    • Combine containers to solve complex document analysis challenges
    • Use type hints to make your Python code clear and maintainable
    • Apply containers to handle real-world document engineering challenges