Using Dictionaries for Document Engineering

Gregory M. Kapfhammer

October 20, 2025

Dictionaries for document engineering

What is document engineering?
- Creating documents using code
- Manipulating and analyzing text data
- Building documentation systems

What are this week’s highlights?
- Explore Python dictionaries for document engineering
  - Store document content with unique identifiers
  - Map metadata to documents efficiently
  - Read and parse JSON files into dictionaries

Key insights for prosegrammers

Document engineering means blending code and prose to build resources for humans and machines
Dictionaries store key-value pairs that map document identifiers to content and metadata fields to their values
JSON files naturally translate to dictionaries, supporting seamless data exchange between programs

What are dictionaries? Why do they matter for prosegramming?

Dictionaries store data as key-value pairs
- Keys are unique, hashable identifiers (usually strings or numbers)
- Values can be any Python object (strings, lists, other dictionaries)
- Provide fast lookups based on keys
Why use dictionaries for documents?
- Map document IDs to their content
- Store document metadata (title, author, date)
- Parse structured data formats like JSON
- Build document indexes and catalogs

Creating and using dictionaries in Python

Creating dictionaries
- Initialize empty or with key-value pairs
- Build document metadata collections
Accessing dictionary values
- Retrieve values by key
- Handle missing keys safely
Modifying dictionaries
- Add new key-value pairs
- Update existing values

Basic dictionary creation

Create dictionary with string keys and mixed value types
Use items() method to iterate over key-value pairs
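
A minimal sketch of the creation and iteration steps named above. The keys and values are illustrative assumptions, not the content of the original slide.

```python
# illustrative metadata for one document (all values are made up)
document = {
    "doc_id": "D001",
    "title": "Python Dictionaries Guide",
    "word_count": 1200,
    "tags": ["python", "dictionaries"],
}

# an empty dictionary can be created with {} or dict()
empty_catalog = {}

# items() yields (key, value) pairs in insertion order
for key, value in document.items():
    print(f"{key}: {value}")
```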

Accessing dictionary values

# access document information by key
document = {
    "doc_id": "D001",
    "title": "Python Dictionaries Guide",
    "author": "Dr. Kapfhammer"
}

print(f"Title: {document['title']}")
print(f"Author: {document['author']}")

# safe access with get() method
word_count = document.get("word_count", "Not available")
print(f"Word count: {word_count}")

Title: Python Dictionaries Guide
Author: Dr. Kapfhammer
Word count: Not available

Use bracket notation document['key'] for direct access
Use get() method with default value for safe access

Adding and modifying entries

Add new entries with bracket notation dict[key] = value
Update existing entries using same syntax
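
The two operations above might look like this in practice. This is a sketch with hypothetical field names and values.

```python
# start with a small, made-up metadata dictionary
document = {"doc_id": "D001", "title": "Draft Guide"}

# add a new key-value pair: the key does not exist yet
document["author"] = "A. Writer"

# update an existing value: the same syntax replaces the old title
document["title"] = "Final Guide"

print(document)
```

Because adding and updating share one syntax, a misspelled key silently creates a new entry rather than raising an error.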

Dictionaries with different value types

String values: store text content
Integer values: store counts and IDs
List values: store multiple related items
Dictionary values: nest related data structures
Tuple values: store fixed, ordered data
Set values: store unique, unordered items
Let’s explore how to create and use dictionaries with different data!

Dictionary with string and integer

String and integer values can be mixed in the same dictionary
Keys describe what each value means but do not enforce its type, so type errors are still possible
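
A short sketch of how such a type error can arise. It assumes a hypothetical "page_count" value that was stored as a string by mistake.

```python
# made-up metadata that mixes string and integer values
document = {"title": "Style Guide", "word_count": 1500, "page_count": "12"}

# integer values support arithmetic directly
doubled = document["word_count"] * 2

# "page_count" is a string, so adding an integer raises TypeError
try:
    total = document["page_count"] + 1
except TypeError as error:
    print(f"Type error: {error}")

# convert explicitly when a value's type is uncertain
total_pages = int(document["page_count"]) + 1
print(total_pages)
```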

Dictionary mapping to lists
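
One possible illustration, assuming a hypothetical mapping from document IDs to lists of tags:

```python
# each document ID maps to a list of tags (made-up data)
doc_tags = {
    "D001": ["python", "tutorial"],
    "D002": ["markdown"],
}

# lists are mutable, so a stored list can be extended in place
doc_tags["D002"].append("writing")

# setdefault() creates an empty list for a new key before appending
doc_tags.setdefault("D003", []).append("json")

for doc_id, tags in doc_tags.items():
    print(f"{doc_id}: {', '.join(tags)}")
```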

Dictionary mapping to dictionary

Nested dictionaries represent complex data structures
Access nested values with multiple brackets: dict[key1][key2]
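
A sketch of nested access, using an assumed catalog layout with a "stats" sub-dictionary:

```python
# a made-up catalog in which each document maps to a nested dictionary
catalog = {
    "D001": {"title": "Intro", "stats": {"words": 500, "sections": 3}},
}

# chain brackets to reach a nested value
words = catalog["D001"]["stats"]["words"]

# chain get() calls with empty-dictionary defaults to avoid KeyError
sections = catalog.get("D404", {}).get("stats", {}).get("sections", 0)

print(words, sections)
```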

Dictionary mapping to tuples and sets

Tuples store fixed metadata that should not change
Sets store unique tags and prevent duplicates
Both tuples and sets can be the values of a dictionary
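
The points above can be sketched as follows. The "created" and "tags" fields are illustrative assumptions.

```python
# made-up document with a tuple value and a set value
document = {
    "doc_id": "D001",
    "created": (2025, 10, 20),   # tuple: fixed (year, month, day)
    "tags": {"python", "json"},  # set: unique tags
}

# adding a duplicate tag leaves the set unchanged
document["tags"].add("python")
document["tags"].add("markdown")

# tuples unpack cleanly but reject item assignment
year, month, day = document["created"]
try:
    document["created"][0] = 2026
except TypeError as error:
    print(f"Cannot modify tuple: {error}")
```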

Looking up a value in a dictionary by its key is efficient! How does that help?

Dictionaries are very efficient for looking up values by key
Average-case time complexity for a lookup is constant, or \(O(1)\)
Dictionaries are ideal for mapping document identifiers to content
You can learn more about this in the Algorithm Analysis class
Let’s explore more about dictionaries for document engineering!

Mapping document identifiers to content

Document collections
- Store multiple documents in single dictionary
- Access documents by unique identifier
Efficient lookups
- Retrieve specific documents quickly
- No need to search through lists
Catalog management
- Build document indexes and catalogs
- Organize documents systematically

Creating a document catalog
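
A minimal sketch of a catalog, assuming short strings stand in for real document content:

```python
# map each document ID to its content (placeholder text)
catalog = {
    "D001": "Dictionaries map keys to values.",
    "D002": "JSON is a text format for data exchange.",
    "D003": "Sets store unique items.",
}

# retrieve one document directly by its identifier
print(catalog["D002"])

# len() reports how many documents the catalog holds
print(f"{len(catalog)} documents in catalog")
```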

Processing documents from catalog

Process all documents in catalog with iteration
Return analysis results as new dictionary
This example illustrates the key points
More advanced examples would use real Markdown documents
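
One way the processing step could be sketched. The function name analyze_catalog and the chosen statistics are hypothetical.

```python
def analyze_catalog(catalog):
    """Return a new dictionary mapping each document ID to simple statistics."""
    results = {}
    for doc_id, content in catalog.items():
        words = content.split()
        results[doc_id] = {"words": len(words), "characters": len(content)}
    return results


# placeholder catalog with made-up content
catalog = {
    "D001": "Dictionaries map keys to values.",
    "D002": "JSON is a text format for data exchange.",
}

analysis = analyze_catalog(catalog)
for doc_id, stats in analysis.items():
    print(f"{doc_id}: {stats['words']} words")
```

Returning a new dictionary keeps the original catalog unchanged, so the analysis can be rerun or discarded freely.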

Dictionary operations for prosegramming

Checking for keys
- Verify if document exists in catalog
- Avoid errors when accessing missing keys
Removing entries
- Delete documents from catalog
- Clean up obsolete entries
Updating dictionaries
- Merge multiple document collections
- Update metadata in bulk

Checking keys and removing entries

Use in operator to check key existence in the dictionary
Use del statement to remove key-value pairs from the dictionary
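
A sketch of both operations on a made-up catalog. The pop() call at the end is an addition beyond the slide, shown as an alternative that tolerates missing keys.

```python
# made-up catalog with one obsolete entry
catalog = {"D001": "First draft", "D002": "Obsolete notes"}

# check membership before deleting to avoid KeyError
if "D002" in catalog:
    del catalog["D002"]

# "not in" confirms the entry is gone
print("D002" not in catalog)

# pop() with a default removes a key if present and never raises
removed = catalog.pop("D404", None)
print(removed)
```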

Updating and merging dictionaries

Use update() method to merge dictionaries together
Later updates overwrite earlier values for the same keys
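
A short sketch of merging, with two hypothetical collections named archive and drafts:

```python
# two made-up collections that share the key "D001"
archive = {"D001": "Old intro", "D002": "Style notes"}
drafts = {"D001": "Revised intro", "D003": "New appendix"}

# update() modifies archive in place; drafts wins for shared keys
archive.update(drafts)

for doc_id, content in archive.items():
    print(f"{doc_id}: {content}")
```

Note that update() changes the dictionary it is called on and returns None, so copy the original first if it must be preserved.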

Iterating through keys and values

Use keys() to iterate over keys only
Use values() to iterate over values only
Use items() to iterate over key-value pairs
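
The three iteration methods can be sketched side by side, assuming a made-up mapping of document IDs to word counts:

```python
# made-up word counts keyed by document ID
word_counts = {"D001": 500, "D002": 1200, "D003": 300}

# keys(): identifiers only
doc_ids = list(word_counts.keys())

# values(): counts only, handy for aggregates
total_words = sum(word_counts.values())

# items(): identifiers and counts together
long_documents = [doc_id for doc_id, count in word_counts.items() if count >= 500]

print(doc_ids, total_words, long_documents)
```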

Using Python to read and parse JSON files

JSON format
- JavaScript Object Notation for data exchange
- Maps naturally to Python dictionaries
- Standard format for document metadata
Reading JSON
- Use json module from standard library
- Parse JSON strings into dictionaries
- Handle file reading and parsing together

Parsing JSON strings to dictionaries

Use json.loads() to parse JSON string to dictionary
JSON arrays become Python lists and JSON objects become dictionaries
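
A minimal sketch of json.loads() on an illustrative JSON string. The field names are assumptions.

```python
import json

# a made-up JSON object written as a Python string
json_text = '{"doc_id": "D001", "title": "Guide", "tags": ["python", "json"], "published": true}'

# loads() parses the string into Python objects
document = json.loads(json_text)

print(type(document).__name__)          # the JSON object becomes a dict
print(type(document["tags"]).__name__)  # the JSON array becomes a list
print(document["published"])            # JSON true becomes Python True
```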

Working with JSON documents

Parse nested JSON structures with lists and dictionaries
Process document collections systematically
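
A sketch of reading a nested JSON file with json.load(). The file name documents.json and its structure are hypothetical, and a temporary directory keeps the example self-contained.

```python
import json
import tempfile
from pathlib import Path

# made-up JSON collection: an object holding a list of document objects
json_text = """
{
  "collection": "course-notes",
  "documents": [
    {"doc_id": "D001", "title": "Dictionaries", "tags": ["python"]},
    {"doc_id": "D002", "title": "JSON Basics", "tags": ["json", "data"]}
  ]
}
"""

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "documents.json"
    path.write_text(json_text, encoding="utf-8")
    # load() reads and parses the open file in one step
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)

# process the collection: build a dictionary from IDs to titles
titles_by_id = {doc["doc_id"]: doc["title"] for doc in data["documents"]}
print(titles_by_id)
```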

Creating JSON from dictionaries
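
A minimal sketch using json.dumps(), with illustrative field names:

```python
import json

# made-up document metadata
document = {
    "doc_id": "D001",
    "title": "Guide",
    "tags": ["python", "json"],
    "published": True,
}

# dumps() serializes the dictionary; indent=2 makes the text readable
json_text = json.dumps(document, indent=2)
print(json_text)

# parsing the text again recovers an equal dictionary
restored = json.loads(json_text)
print(restored == document)
```

The round trip works here because every value is a JSON-compatible type. Tuples would come back as lists, and sets cannot be serialized without conversion.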

Practical document engineering examples

Word frequency analysis
- Count word occurrences in documents
- Build frequency dictionaries
Document indexing
- Create searchable document indexes
- Map keywords to document IDs
Metadata management
- Organize document properties
- Track document relationships

Word frequency analysis

Use get() method with default value to count occurrences
Dictionary stores word as key and count as value
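
One way to sketch the counting pattern. The function name count_words and the simple punctuation handling are assumptions.

```python
def count_words(text):
    """Return a dictionary mapping each lowercase word to its count."""
    frequencies = {}
    for word in text.lower().split():
        # strip common punctuation from both ends of the word
        word = word.strip(".,;:!?\"'()")
        if word:
            # get() returns 0 the first time a word is seen
            frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


text = "The cat saw the dog. The dog ran."
frequencies = count_words(text)
print(frequencies)
```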

Building a document index

Build inverted index mapping keywords to document IDs
Use sets to store unique document references
Makes it easy to find documents by specific keyword
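
A sketch of an inverted index under simplifying assumptions: the hypothetical build_index function treats every whitespace-separated word as a keyword.

```python
def build_index(catalog):
    """Return a dictionary mapping each word to the set of document IDs containing it."""
    index = {}
    for doc_id, content in catalog.items():
        for word in content.lower().split():
            # setdefault() creates the set on first use; add() ignores duplicates
            index.setdefault(word, set()).add(doc_id)
    return index


# made-up catalog of short documents
catalog = {
    "D001": "python dictionaries guide",
    "D002": "json and python",
}

index = build_index(catalog)
print(sorted(index["python"]))
print(index.get("markdown", set()))
```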

Managing document metadata

Filter documents by metadata criteria
Return matching document IDs for further processing
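
The filtering step might be sketched like this. The function name find_documents and the metadata fields are hypothetical.

```python
def find_documents(metadata, **criteria):
    """Return the IDs of documents whose metadata matches every criterion."""
    matches = []
    for doc_id, properties in metadata.items():
        # get() returns None for a missing field, so such documents do not match
        if all(properties.get(key) == value for key, value in criteria.items()):
            matches.append(doc_id)
    return matches


# made-up metadata keyed by document ID
metadata = {
    "D001": {"author": "Lee", "status": "draft", "year": 2025},
    "D002": {"author": "Kim", "status": "published", "year": 2025},
    "D003": {"author": "Lee", "status": "published", "year": 2024},
}

by_author = find_documents(metadata, author="Lee")
recent_published = find_documents(metadata, status="published", year=2025)
print(by_author, recent_published)
```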

Dictionaries for prosegrammers

Next steps for using dictionaries:
- Find locations in your tool where dictionaries could improve design
  - Could you use a dictionary to map IDs to content?
  - Would nested dictionaries organize complex data better?
  - Could JSON files store and load your tool’s configuration?
- If you are already using dictionaries, how can you refactor their use?
- How would dictionaries make your tool more powerful?

Key takeaways for prosegrammers

Master dictionary fundamentals
- Create dictionaries with various value types
- Access, modify, and iterate through key-value pairs
- Use dictionary methods for safe and efficient operations
Leverage JSON for data exchange
- Read and parse JSON files into Python dictionaries
- Convert dictionaries to JSON for storage and sharing
- Handle nested structures with lists and dictionaries
Think like a prosegrammer
- Use dictionaries to map document IDs to content and metadata
- Build indexes and catalogs for document collections
- Apply dictionaries to real-world document engineering challenges