File Input and Output for Document Engineering

Gregory M. Kapfhammer

October 27, 2025

File input and output

  • What is document engineering?
    • Creating documents using code
    • Manipulating and analyzing text data
    • Building documentation systems
    • “Prosegrammers” combine prose and programming
  • What are this week’s highlights?
    • Explore file input and output for document engineering
      • Read and write files using Python’s standard library
      • Work with JSON files for structured document data
      • Analyze and summarize document content programmatically

Key insights for prosegrammers

  • Document engineering means blending code and prose to build resources for both humans and machines
  • File input/output (I/O) enables prosegrammers to read, write, and transform documents stored on disk
  • JSON files provide structured data storage that naturally maps to Python dictionaries

File operations overview

  • Reading files: load document content from disk
    • Open files in read mode
    • Read entire contents or line-by-line
    • Process text data for analysis
  • Writing files: save generated documents to disk
    • Open files in write mode
    • Create new files or overwrite existing ones
    • Output analysis results and reports
  • Path management: work with file paths safely
    • Use pathlib.Path for cross-platform compatibility
    • Navigate directory structures programmatically
    • Build robust file handling systems

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents with industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • Learning about file input/output in Python aids CS-104-3 and CS-104-4!

Using open and read

  • Opening files for reading
    • Use open() function with filename
    • Specify read mode with 'r' parameter
    • Always close files after use
  • Reading file contents
    • Read the entire file with the read() method
    • Read all lines into a list with the readlines() method
    • Process content as strings
  • Safe file handling
    • Use context managers with the with statement
    • Automatic file closing prevents resource leaks
    • Handle potential errors gracefully

Basic file reading

  • Use with context manager statement for automatic file closing
  • Read entire file as single string with read() method
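The two bullets above can be sketched as follows; the file name notes.txt and its contents are hypothetical, so the example creates the file first to stay self-contained:

```python
from pathlib import Path

# Create a small sample file so the example is self-contained
Path("notes.txt").write_text("Document engineering blends code and prose.\n")

# The with statement closes the file automatically, even if an error occurs
with open("notes.txt", "r") as infile:
    contents = infile.read()  # read the entire file as one string

print(contents)
```

Because the context manager handles closing, there is no need to call close() explicitly.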

Line by line with readlines

  • Use readlines() to get a list of the lines in the file
  • Each line includes its newline character; use strip() to remove it
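A minimal sketch of line-by-line reading; the file lines.txt is hypothetical and created here so the example runs on its own:

```python
from pathlib import Path

# Create a sample multi-line file for the demonstration
Path("lines.txt").write_text("first line\nsecond line\nthird line\n")

with open("lines.txt", "r") as infile:
    lines = infile.readlines()  # a list of lines, each ending in "\n"

# strip() removes the trailing newline (and surrounding whitespace)
cleaned = [line.strip() for line in lines]
print(cleaned)
```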

Using open and write

  • Opening files for writing
    • Use open() function with write mode 'w'
    • Creates new file or overwrites existing file
    • Use append mode 'a' to add to existing files
  • Writing content to files
    • Write strings with write() method
    • Write multiple lines with writelines() method
    • Control formatting with newline characters
  • Saving analysis results
    • Generate reports programmatically
    • Export processed document data
    • Create documentation automatically

Basic file writing with w mode

  • Use write mode 'w' to create or overwrite files
  • Add newline characters explicitly with \n
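The bullets above can be illustrated with a short sketch; the file name report.txt and the report text are hypothetical:

```python
# 'w' mode creates report.txt, or overwrites it if it already exists
with open("report.txt", "w") as outfile:
    outfile.write("Document Analysis Report\n")  # newlines are explicit
    outfile.write("Total documents: 5\n")

# Read the file back to confirm what was written
with open("report.txt", "r") as infile:
    report = infile.read()

print(report)
```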

Appending to file with a mode

  • Use append mode 'a' to add content without overwriting
  • New content added to end of existing file
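A sketch of append mode; the file log.txt and its entries are hypothetical:

```python
# Start a fresh file with 'w' mode
with open("log.txt", "w") as outfile:
    outfile.write("first entry\n")

# 'a' mode adds to the end of the file instead of overwriting it
with open("log.txt", "a") as outfile:
    outfile.write("second entry\n")

with open("log.txt", "r") as infile:
    log = infile.read()

print(log)
```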

Try out pathlib!

  • Path objects: modern file path handling
    • Create platform-independent file paths
    • Navigate directory structures easily
    • Check file existence and properties
  • Convenient methods: simple file I/O
    • Read entire files with read_text() method
    • Write content with write_text() method
    • Handle paths as objects, not strings
  • Safer operations: avoid common errors
    • Type-safe path manipulation
    • Automatic path separator handling
    • Clear, readable code

Working with Path objects

  • Use the / operator to join path components naturally
  • write_text() and read_text() methods simplify file operations
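The two bullets above can be sketched as follows; the directory docs and the file readme.md are hypothetical names chosen for the example:

```python
from pathlib import Path

# The / operator joins path components in a platform-independent way
docs_dir = Path("docs")
docs_dir.mkdir(exist_ok=True)
readme = docs_dir / "readme.md"

readme.write_text("# Project Notes\n")  # write the whole file at once
text = readme.read_text()               # read it back as a single string

print(readme.exists(), text)
```

Because readme is a Path object rather than a string, methods like exists(), read_text(), and write_text() are available directly on it.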

Listing files in directories with glob

  • Use glob() method to find files matching patterns
  • Access file properties through stat() method
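A minimal sketch of pattern matching with glob(); the directory glob_demo and the file names are hypothetical and created first so the example is self-contained:

```python
from pathlib import Path

# Create a sample directory with a few files to search through
docs = Path("glob_demo")
docs.mkdir(exist_ok=True)
for name in ["a.md", "b.md", "c.txt"]:
    (docs / name).write_text("content\n")

# glob("*.md") yields only the Markdown files in the directory
markdown_files = sorted(docs.glob("*.md"))
for path in markdown_files:
    size = path.stat().st_size  # file size in bytes via stat()
    print(path.name, size)
```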

Working with JSON files

  • JSON format: structured data storage
    • JavaScript Object Notation for data exchange
    • Maps naturally to Python dictionaries and lists
    • Human-readable and machine-parseable
  • Reading JSON: parse structured data
    • Use json module from standard library
    • Load JSON files into Python data structures
    • Access nested data easily
  • Writing JSON: save structured data
    • Convert Python objects to JSON format
    • Control formatting with indentation
    • Preserve data types and structure

Reading JSON files

  • Use json.load() to read JSON from file object
  • JSON objects become dictionaries and JSON arrays become lists
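The bullets above can be sketched as follows; the file sample.json and its contents are hypothetical:

```python
import json
from pathlib import Path

# Create a small JSON file so the example is self-contained
Path("sample.json").write_text('{"title": "Guide", "tags": ["io", "json"]}')

with open("sample.json", "r") as infile:
    data = json.load(infile)  # JSON object -> dict, JSON array -> list

print(type(data).__name__, data["tags"])
```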

Analyzing JSON document data

  • Use json.load() to parse JSON data into dictionaries and lists
  • Analyze structured data with standard Python operations
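A sketch of loading and then analyzing JSON with ordinary Python operations; the file metadata.json and its contents are hypothetical, modeled on the doc_metadata.json format used later in this deck:

```python
import json
from pathlib import Path

# Hypothetical metadata file in the same shape as doc_metadata.json
Path("metadata.json").write_text(
    '{"D001": {"category": "tutorial"}, "D002": {"category": "guide"}}'
)

with open("metadata.json", "r") as infile:
    metadata = json.load(infile)  # parse into nested dictionaries

# Standard Python operations then analyze the structured data
tutorial_count = sum(
    1 for info in metadata.values() if info["category"] == "tutorial"
)
print(len(metadata), tutorial_count)
```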

Analyzing JSON data

  • Count operations: analyze collection size
    • Count key-value pairs in dictionaries
    • Determine number of items in lists
    • Track unique values across documents
  • Statistical analysis: compute metrics
    • Calculate minimum, maximum, and average values
    • Identify patterns in document properties
    • Generate summary statistics
  • Value frequency analysis: understand data distribution
    • Count unique values for each key
    • Find most and least common values
    • Build frequency dictionaries

Counting and analyzing JSON data

  • Count total items and nested key-value pairs
  • Use dictionary and list comprehensions for analysis
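The two bullets above can be sketched as follows; the embedded dictionary mirrors the first entries of the doc_metadata.json file shown later in this deck:

```python
# Metadata in the same shape as doc_metadata.json
doc_metadata = {
    "D001": {"category": "tutorial", "difficulty": "beginner"},
    "D002": {"category": "tutorial", "difficulty": "advanced"},
    "D003": {"category": "reference", "difficulty": "intermediate"},
}

# Count the documents and every nested key-value pair
total_documents = len(doc_metadata)
total_pairs = sum(len(info) for info in doc_metadata.values())

# Build a frequency dictionary of categories with comprehensions
categories = [info["category"] for info in doc_metadata.values()]
category_counts = {c: categories.count(c) for c in set(categories)}

print(total_documents, total_pairs, category_counts)
```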

Detecting unique values

  • Look at the doc_metadata.json file in the repository
  • Extract and count the unique values for each property
  • This function uses sets to automatically eliminate duplicates

JSON file called doc_metadata.json

{
  "D001": {
    "category": "tutorial",
    "difficulty": "beginner"
  },
  "D002": {
    "category": "tutorial",
    "difficulty": "advanced"
  },
  "D003": {
    "category": "reference",
    "difficulty": "intermediate"
  },
  "D004": {
    "category": "guide",
    "difficulty": "beginner"
  },
  "D005": {
    "category": "tutorial",
    "difficulty": "intermediate"
  }
}
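A sketch of the unique-value detection described above; the unique_values function is a hypothetical name, and the example recreates doc_metadata.json from the slide so it runs on its own:

```python
import json
from pathlib import Path

# Recreate doc_metadata.json from the slide so the sketch is self-contained
Path("doc_metadata.json").write_text(json.dumps({
    "D001": {"category": "tutorial", "difficulty": "beginner"},
    "D002": {"category": "tutorial", "difficulty": "advanced"},
    "D003": {"category": "reference", "difficulty": "intermediate"},
    "D004": {"category": "guide", "difficulty": "beginner"},
    "D005": {"category": "tutorial", "difficulty": "intermediate"},
}))

def unique_values(path):
    """Collect the unique values for each property across all documents."""
    with open(path, "r") as infile:
        metadata = json.load(infile)
    unique = {}
    for info in metadata.values():
        for key, value in info.items():
            # a set automatically eliminates duplicate values
            unique.setdefault(key, set()).add(value)
    return unique

result = unique_values("doc_metadata.json")
print(result)
```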

Creating summary data and writing to JSON

  • Writing JSON files: save structured results to files
    • Use json.dump() to write to file objects
    • Use json.dumps() to create JSON strings
    • Format output with indentation for readability
  • Round-trip processing: read, analyze, write JSON files
    • Load existing document data
    • Perform analysis and generate summaries
    • Save results for later use or sharing

Creating and writing summary data

  • Create the summary variable as a Python dictionary
  • Save this variable to a JSON file called doc_summary.json
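The two bullets above can be sketched as follows; the contents of the summary dictionary are hypothetical values of the kind the earlier analyses would produce:

```python
import json

# Hypothetical summary of a document collection
summary = {
    "total_documents": 5,
    "categories": ["tutorial", "reference", "guide"],
    "most_common_category": "tutorial",
}

# json.dump() writes to a file object; indent=2 keeps it human-readable
with open("doc_summary.json", "w") as outfile:
    json.dump(summary, outfile, indent=2)

# Round-trip: load the file back to confirm the structure survived
with open("doc_summary.json", "r") as infile:
    round_trip = json.load(infile)

print(round_trip)
```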

File input and output for prosegrammers

  • Next steps for using file input/output (I/O):
    • Find locations in your tool where file operations could improve functionality
      • Could you load configuration from JSON files?
      • Can your tool read multiple input JSON documents?
      • Would exporting results to JSON make follow-on analyses easier?
    • If you already use file I/O, how can you make it more robust?
    • How would better file handling make your tool more useful?

Key takeaways for prosegrammers

  • Understand file operations
    • Read and write files using open() with context managers
    • Use pathlib.Path for platform-independent file handling
    • Handle file paths and directories programmatically
  • Leverage JSON for structured data
    • Parse JSON files into Python dictionaries and lists
    • Analyze document collections with statistical methods
    • Write summary data back to JSON for sharing
  • Build like a prosegrammer
    • Build complete pipelines: read, analyze, summarize, write
    • Process multiple documents systematically
    • Apply file I/O to real-world document engineering challenges