Using Dictionaries for Document Engineering

Gregory M. Kapfhammer

October 20, 2025

Dictionaries for document engineering

  • What is document engineering?
    • Creating documents using code
    • Manipulating and analyzing text data
    • Building documentation systems
  • What are this week’s highlights?
    • Explore Python dictionaries for document engineering
      • Store document content with unique identifiers
      • Map metadata to documents efficiently
      • Read and parse JSON files into dictionaries

Key insights for prosegrammers

  • Document engineering means blending code and prose to build resources for humans and machines
  • Dictionaries store key-value pairs that map document identifiers to content and metadata fields to their values
  • JSON files naturally translate to dictionaries, supporting seamless data exchange between programs

What are dictionaries? Why do they matter for prosegramming?

  • Dictionaries store data as key-value pairs
    • Keys are unique, hashable identifiers (usually strings or numbers)
    • Values can be any Python object (strings, lists, other dictionaries)
    • Provide fast lookups based on keys
  • Why use dictionaries for documents?
    • Map document IDs to their content
    • Store document metadata (title, author, date)
    • Parse structured data formats like JSON
    • Build document indexes and catalogs

Creating and using dictionaries in Python

  • Creating dictionaries
    • Initialize empty or with key-value pairs
    • Build document metadata collections
  • Accessing dictionary values
    • Retrieve values by key
    • Handle missing keys safely
  • Modifying dictionaries
    • Add new key-value pairs
    • Update existing values

Basic dictionary creation

  • Create dictionary with string keys and mixed value types
  • Use items() method to iterate over key-value pairs
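
A minimal sketch of the two points above. The field names and values are illustrative assumptions, not taken from a real document.

```python
# create a dictionary with string keys and mixed value types
# (all keys and values here are illustrative assumptions)
document = {
    "title": "Introduction to Document Engineering",
    "word_count": 1250,
    "tags": ["python", "writing"],
    "published": True,
}

# iterate over key-value pairs with items()
for key, value in document.items():
    print(f"{key}: {value}")
```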

Accessing dictionary values

# access document information by key
document = {
    "doc_id": "D001",
    "title": "Python Dictionaries Guide",
    "author": "Dr. Kapfhammer"
}

print(f"Title: {document['title']}")
print(f"Author: {document['author']}")

# safe access with get() method
word_count = document.get("word_count", "Not available")
print(f"Word count: {word_count}")

Title: Python Dictionaries Guide
Author: Dr. Kapfhammer
Word count: Not available

  • Use bracket notation document['key'] for direct access
  • Use get() method with default value for safe access

Adding and modifying entries

  • Add new entries with bracket notation dict[key] = value
  • Update existing entries using same syntax
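
The same bracket syntax covers both cases, as this short sketch suggests. The metadata values are hypothetical.

```python
# start with a small metadata dictionary (illustrative values)
document = {"doc_id": "D001", "title": "Draft Guide"}

# add a new entry: the key does not exist yet, so it is created
document["author"] = "A. Writer"

# update an existing entry: the same syntax replaces the stored value
document["title"] = "Finished Guide"

print(document)
```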

Dictionaries with different value types

  • String values: store text content
  • Integer values: store counts and IDs
  • List values: store multiple related items
  • Dictionary values: nest related data structures
  • Tuple values: store ordered, fixed data
  • Set values: store unique items
  • Let’s explore how to create and use dictionaries with different data!

Dictionary with string and integer

  • It is possible to mix string and integer values in same dictionary
  • Keys describe what each value holds, but Python does not enforce value types, so type errors are still possible
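
One way such a type error can arise is sketched below, assuming a hypothetical `report` dictionary in which the year was stored as a string.

```python
# mix string and integer values in one dictionary (illustrative metadata)
report = {"title": "Annual Report", "pages": 42, "year": "2025"}

# "pages" holds an int, so arithmetic works directly
total_pages = report["pages"] + 8

# "year" holds a str, so adding an int raises TypeError
try:
    next_year = report["year"] + 1
except TypeError:
    # convert explicitly before doing arithmetic
    next_year = int(report["year"]) + 1

print(total_pages, next_year)
```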

Dictionary mapping to lists
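
A minimal sketch of a dictionary whose values are lists, assuming hypothetical document IDs mapped to section titles.

```python
# map each document ID to a list of section titles (all names are hypothetical)
sections = {
    "D001": ["Introduction", "Methods"],
    "D002": ["Overview"],
}

# append to the list stored under an existing key
sections["D001"].append("Conclusion")

# setdefault() stores an empty list for a new key, then returns it for appending
sections.setdefault("D003", []).append("Abstract")

for doc_id, titles in sections.items():
    print(f"{doc_id}: {len(titles)} section(s)")
```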

Dictionary mapping to dictionary

  • Nested dictionaries represent complex data structures
  • Access nested values with multiple brackets: dict[key1][key2]
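
Nested access can be sketched as follows. The `library` structure and its field names are assumptions made for illustration.

```python
# each document ID maps to a dictionary that itself contains a dictionary
library = {
    "D001": {"title": "Style Guide", "stats": {"words": 900, "lines": 80}},
    "D002": {"title": "API Notes", "stats": {"words": 450, "lines": 40}},
}

# one pair of brackets per level of nesting
words = library["D001"]["stats"]["words"]

# chaining get() with empty-dictionary defaults avoids KeyError on a missing level
missing = library.get("D999", {}).get("stats", {}).get("words", 0)

print(words, missing)
```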

Dictionary mapping to tuples and sets

  • Tuples store fixed metadata that should not change
  • Sets store unique tags and prevent duplicates
  • Both tuples and sets can be the values of a dictionary
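
A short sketch of both value types in one dictionary. The keys `created` and `tags` are illustrative choices.

```python
document = {
    "created": (2025, 10, 20),   # tuple: fixed (year, month, day)
    "tags": {"python", "json"},  # set: unique tags only
}

# adding a duplicate tag has no effect; adding a new tag grows the set
document["tags"].add("python")
document["tags"].add("markdown")

# tuples are immutable, so item assignment raises TypeError
try:
    document["created"][0] = 2026
    changed = True
except TypeError:
    changed = False

print(sorted(document["tags"]), changed)
```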

Looking up a value in a dictionary by key is efficient! Why does that help?

  • Dictionaries are very efficient for looking up values by key
  • Average time complexity for lookups is constant time, or \(O(1)\)
  • Dictionaries are ideal for mapping document identifiers to content
  • You can learn more about this in the Algorithm Analysis class
  • Let’s explore more about dictionaries for document engineering!
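
To make the contrast concrete, the sketch below counts how many entries a linear scan of a list examines before it finds a document, while the dictionary retrieves the same document with a single keyed lookup. The helper `find_in_list` and the generated IDs are hypothetical, and the step count illustrates the idea rather than measuring running time.

```python
# build the same 10,000 documents as a list of pairs and as a dictionary
pairs = [(f"D{n:04d}", f"content {n}") for n in range(10_000)]
catalog = dict(pairs)


def find_in_list(pairs, target):
    """Scan the list in order and report how many entries were examined."""
    for steps, (doc_id, content) in enumerate(pairs, start=1):
        if doc_id == target:
            return content, steps
    return None, len(pairs)


# the last document forces the scan to examine every entry
content, steps = find_in_list(pairs, "D9999")

# the dictionary finds the same content with one keyed lookup
same_content = catalog["D9999"]

print(steps, content == same_content)
```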

Mapping document identifiers to content

  • Document collections
    • Store multiple documents in single dictionary
    • Access documents by unique identifier
  • Efficient lookups
    • Retrieve specific documents quickly
    • No need to search through lists
  • Catalog management
    • Build document indexes and catalogs
    • Organize documents systematically

Creating a document catalog
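
A minimal sketch of a catalog, assuming each hypothetical document ID maps to a small dictionary with `title` and `content` fields.

```python
# map unique document IDs to their title and content (illustrative entries)
catalog = {
    "D001": {"title": "Getting Started", "content": "Install the tool and run it."},
    "D002": {"title": "User Guide", "content": "Dictionaries map keys to values."},
}

# add another document under a new identifier
catalog["D003"] = {"title": "FAQ", "content": "Common questions and answers."}

# retrieve one document directly by its identifier
print(catalog["D002"]["title"])
```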

Processing documents from catalog

  • Process all documents in catalog with iteration
  • Return analysis results as new dictionary
  • This example illustrates the key points
  • More advanced examples would use real Markdown documents
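
One possible shape for such processing is sketched below. The function `analyze_catalog` and the sample sentences are assumptions for illustration, and the "analysis" is only a word and character count.

```python
def analyze_catalog(catalog):
    """Return a new dictionary mapping each document ID to simple text statistics."""
    results = {}
    for doc_id, content in catalog.items():
        words = content.split()
        results[doc_id] = {"words": len(words), "characters": len(content)}
    return results


# a tiny catalog that maps document IDs to plain-text content
catalog = {
    "D001": "Dictionaries map keys to values.",
    "D002": "JSON objects become dictionaries in Python.",
}

analysis = analyze_catalog(catalog)
for doc_id, stats in analysis.items():
    print(doc_id, stats)
```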

Dictionary operations for prosegramming

  • Checking for keys
    • Verify if document exists in catalog
    • Avoid errors when accessing missing keys
  • Removing entries
    • Delete documents from catalog
    • Clean up obsolete entries
  • Updating dictionaries
    • Merge multiple document collections
    • Update metadata in bulk

Checking keys and removing entries

  • Use in operator to check key existence in the dictionary
  • Use del statement to remove key-value pairs from the dictionary
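
A brief sketch of both operations on a hypothetical catalog. The `pop()` call at the end is an additional option not named above: given a default, it removes a key if present and does not raise when the key is missing.

```python
catalog = {"D001": "Draft", "D002": "Final", "D003": "Obsolete"}

# the in operator checks keys, not values
has_d002 = "D002" in catalog
has_d999 = "D999" in catalog

# check first so that del never raises KeyError
if "D003" in catalog:
    del catalog["D003"]

# pop() with a default returns the default when the key is absent
removed = catalog.pop("D999", None)

print(has_d002, has_d999, catalog, removed)
```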

Updating and merging dictionaries

  • Use update() method to merge dictionaries together
  • Later updates overwrite earlier values for same keys
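
The overwrite behavior can be seen in a small sketch with two hypothetical collections. The `|` operator shown at the end requires Python 3.9 or later and is an alternative to `update()` that leaves both inputs unchanged.

```python
drafts = {"D001": "Draft intro", "D002": "Draft guide"}
revisions = {"D002": "Revised guide", "D003": "New appendix"}

# copy first so the original drafts dictionary is not modified
catalog = dict(drafts)

# update() merges in place; the revised D002 replaces the draft D002
catalog.update(revisions)

# the | operator builds a new merged dictionary (Python 3.9+)
merged = drafts | revisions

print(catalog)
```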

Iterating through keys and values

  • Use keys() to iterate over keys only
  • Use values() to iterate over values only
  • Use items() to iterate over key-value pairs
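
Each of the three methods is sketched below on a hypothetical dictionary of word counts.

```python
# map document IDs to word counts (illustrative numbers)
word_counts = {"D001": 120, "D002": 340, "D003": 95}

# keys(): the document identifiers only
doc_ids = list(word_counts.keys())

# values(): the counts only, here summed into a total
total = sum(word_counts.values())

# items(): identifier and count together, here used to filter
long_docs = [doc_id for doc_id, count in word_counts.items() if count > 100]

print(doc_ids, total, long_docs)
```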

Using Python to read and parse JSON files

  • JSON format
    • JavaScript Object Notation for data exchange
    • Maps naturally to Python dictionaries
    • Standard format for document metadata
  • Reading JSON
    • Use json module from standard library
    • Parse JSON strings into dictionaries
    • Handle file reading and parsing together

Parsing JSON strings to dictionaries

  • Use json.loads() to parse JSON string to dictionary
  • JSON arrays become Python lists, objects become dictionaries
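
A minimal sketch of `json.loads()` on an illustrative JSON string. The field names are assumptions.

```python
import json

# a JSON object with a nested array (illustrative content)
json_text = '{"doc_id": "D001", "title": "JSON Basics", "tags": ["json", "python"], "pages": 12}'

# json.loads() parses a string; the object becomes a dict, the array a list
document = json.loads(json_text)

print(type(document).__name__, type(document["tags"]).__name__)
print(document["title"], document["pages"])
```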

Working with JSON documents

  • Parse nested JSON structures with lists and dictionaries
  • Process document collections systematically
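
A nested collection might be handled as sketched here. The JSON layout, with a `documents` array of objects, is a hypothetical format chosen for illustration.

```python
import json

# a JSON object containing a list of document objects (hypothetical layout)
json_text = """
{
  "collection": "guides",
  "documents": [
    {"id": "D001", "title": "Intro", "tags": ["basics"]},
    {"id": "D002", "title": "Advanced", "tags": ["json", "indexing"]}
  ]
}
"""

data = json.loads(json_text)

# build an ID-to-title dictionary from the list of documents
titles_by_id = {doc["id"]: doc["title"] for doc in data["documents"]}

# walk the nested lists to count every tag in the collection
tag_total = sum(len(doc["tags"]) for doc in data["documents"])

print(titles_by_id, tag_total)
```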

Creating JSON from dictionaries
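
A minimal sketch of the reverse direction, assuming a hypothetical `document` dictionary: `json.dumps()` produces a JSON string, while `json.dump()` and `json.load()` write to and read from a file. A temporary directory keeps the example self-contained.

```python
import json
import tempfile
from pathlib import Path

document = {"doc_id": "D001", "title": "Export Example", "tags": ["json"], "pages": 3}

# dumps() returns a JSON string; indent makes it readable
json_text = json.dumps(document, indent=2)
print(json_text)

# dump() writes to an open file and load() reads it back into a dictionary
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "document.json"
    with path.open("w", encoding="utf-8") as handle:
        json.dump(document, handle, indent=2)
    with path.open("r", encoding="utf-8") as handle:
        loaded = json.load(handle)

print(loaded == document)
```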

Practical document engineering examples

  • Word frequency analysis
    • Count word occurrences in documents
    • Build frequency dictionaries
  • Document indexing
    • Create searchable document indexes
    • Map keywords to document IDs
  • Metadata management
    • Organize document properties
    • Track document relationships

Word frequency analysis

  • Use get() method with default value to count occurrences
  • Dictionary stores word as key and count as value
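
The counting pattern can be sketched as follows. The function `count_words` and its simple punctuation stripping are assumptions, and real text would likely need more careful tokenization.

```python
def count_words(text):
    """Map each lowercase word in the text to the number of times it occurs."""
    frequencies = {}
    for word in text.lower().split():
        # strip common trailing or leading punctuation (a simplification)
        word = word.strip(".,!?;:")
        if word:
            # get() returns 0 the first time a word is seen
            frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


text = "Dictionaries map keys to values. Keys are unique, and values can repeat."
counts = count_words(text)
print(counts["keys"], counts["values"])
```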

Building a document index

  • Build inverted index mapping keywords to document IDs
  • Use sets to store unique document references
  • Makes it easy to find documents by specific keyword
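
An inverted index along these lines is sketched below. The function `build_index` and the three sample documents are hypothetical, and every whitespace-separated word is treated as a keyword.

```python
def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # setdefault() creates the set on first use; add() ignores duplicates
            index.setdefault(word, set()).add(doc_id)
    return index


documents = {
    "D001": "python dictionaries store pairs",
    "D002": "json maps to python dictionaries",
    "D003": "json files store metadata",
}

index = build_index(documents)
print(sorted(index["python"]))
print(sorted(index.get("missing", set())))
```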

Managing document metadata

  • Filter documents by metadata criteria
  • Return matching document IDs for further processing
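
Filtering by metadata might look like the sketch below. The function `find_documents`, the field names, and the sample records are all assumptions for illustration.

```python
def find_documents(metadata, **criteria):
    """Return the IDs of documents whose metadata matches every criterion."""
    matches = []
    for doc_id, properties in metadata.items():
        # get() returns None for a missing field, so absent fields never match
        if all(properties.get(key) == value for key, value in criteria.items()):
            matches.append(doc_id)
    return matches


metadata = {
    "D001": {"author": "Kim", "status": "draft", "year": 2024},
    "D002": {"author": "Lee", "status": "published", "year": 2025},
    "D003": {"author": "Kim", "status": "published", "year": 2025},
}

published_2025 = find_documents(metadata, status="published", year=2025)
by_kim = find_documents(metadata, author="Kim")
print(published_2025, by_kim)
```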

Dictionaries for prosegrammers

  • Next steps for using dictionaries:
    • Find locations in your tool where dictionaries could improve design
      • Could you use a dictionary to map IDs to content?
      • Would nested dictionaries organize complex data better?
      • Could JSON files store and load your tool’s configuration?
    • If you are already using dictionaries, how can you refactor them?
    • How would dictionaries make your tool more powerful?

Key takeaways for prosegrammers

  • Master dictionary fundamentals
    • Create dictionaries with various value types
    • Access, modify, and iterate through key-value pairs
    • Use dictionary methods for safe and efficient operations
  • Leverage JSON for data exchange
    • Read and parse JSON files into Python dictionaries
    • Convert dictionaries to JSON for storage and sharing
    • Handle nested structures with lists and dictionaries
  • Think like a prosegrammer
    • Use dictionaries to map document IDs to content and metadata
    • Build indexes and catalogs for document collections
    • Apply dictionaries to real-world document engineering challenges
