File Input and Output for Document Engineering

Gregory M. Kapfhammer

October 27, 2025

File input and output

  • What is document engineering?
    • Creating documents using code
    • Manipulating and analyzing text data
    • Building documentation systems
    • “Prosegrammers” combine prose and programming
  • What are this week’s highlights?
    • Explore file input and output for document engineering
      • Read and write files using Python’s standard library
      • Work with JSON files for structured document data
      • Analyze and summarize document content programmatically

Key insights for prosegrammers

  • Document engineering means blending code and prose to build resources for both humans and machines
  • File input/output (I/O) enables prosegrammers to read, write, and transform documents stored on disk
  • JSON files provide structured data storage that naturally maps to Python dictionaries

File operations overview

  • Reading files: load document content from disk
    • Open files in read mode
    • Read entire contents or line-by-line
    • Process text data for analysis
  • Writing files: save generated documents to disk
    • Open files in write mode
    • Create new files or overwrite existing ones
    • Output analysis results and reports
  • Path management: work with file paths safely
    • Use pathlib.Path for cross-platform compatibility
    • Navigate directory structures programmatically
    • Build robust file handling systems

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents with industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • Learning about file input/output in Python aids CS-104-3 and CS-104-4!

Using open and read

  • Opening files for reading
    • Use open() function with filename
    • Specify read mode with 'r' parameter
    • Always close files after use
  • Reading file contents
    • Read the entire file with the read() method
    • Read all lines into a list with the readlines() method
    • Process content as strings
  • Safe file handling
    • Use context managers with the with statement
    • Automatic file closing prevents resource leaks
    • Handle potential errors gracefully

Basic file reading

  • Use with context manager statement for automatic file closing
  • Read entire file as single string with read() method
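The two bullets above can be sketched as follows; the file name notes.txt and its contents are hypothetical, so the example creates the file first to stay self-contained:

```python
from pathlib import Path

# Create a small sample file so the example is self-contained
Path("notes.txt").write_text("Document engineering blends code and prose.\n")

# The with statement closes the file automatically, even if an error occurs
with open("notes.txt", "r") as infile:
    contents = infile.read()  # read the entire file as one string

print(contents)
```

Because the context manager handles closing, there is no need to call close() explicitly.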

Line by line with readlines

  • Use readlines() to get a list of the lines in the file
  • Each line includes its newline character; use strip() to remove it
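A minimal sketch of line-by-line reading; the file lines.txt is hypothetical and created here so the example runs on its own:

```python
from pathlib import Path

# Create a sample multi-line file for the demonstration
Path("lines.txt").write_text("first line\nsecond line\nthird line\n")

with open("lines.txt", "r") as infile:
    lines = infile.readlines()  # a list of lines, each ending in "\n"

# strip() removes the trailing newline (and surrounding whitespace)
cleaned = [line.strip() for line in lines]
print(cleaned)
```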

Using open and write

  • Opening files for writing
    • Use open() function with write mode 'w'
    • Creates new file or overwrites existing file
    • Use append mode 'a' to add to existing files
  • Writing content to files
    • Write strings with write() method
    • Write multiple lines with writelines() method
    • Control formatting with newline characters
  • Saving analysis results
    • Generate reports programmatically
    • Export processed document data
    • Create documentation automatically

Basic file writing with w mode

  • Use write mode 'w' to create or overwrite files
  • Add newline characters explicitly with \n
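The bullets above can be illustrated with a short sketch; the file name report.txt and the report text are hypothetical:

```python
# 'w' mode creates report.txt, or overwrites it if it already exists
with open("report.txt", "w") as outfile:
    outfile.write("Document Analysis Report\n")  # newlines are explicit
    outfile.write("Total documents: 5\n")

# Read the file back to confirm what was written
with open("report.txt", "r") as infile:
    report = infile.read()

print(report)
```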

Appending to file with a mode

  • Use append mode 'a' to add content without overwriting
  • New content added to end of existing file
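A sketch of append mode; the file log.txt and its entries are hypothetical:

```python
# Start a fresh file with 'w' mode
with open("log.txt", "w") as outfile:
    outfile.write("first entry\n")

# 'a' mode adds to the end of the file instead of overwriting it
with open("log.txt", "a") as outfile:
    outfile.write("second entry\n")

with open("log.txt", "r") as infile:
    log = infile.read()

print(log)
```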

Try out pathlib!

  • Path objects: modern file path handling
    • Create platform-independent file paths
    • Navigate directory structures easily
    • Check file existence and properties
  • Convenient methods: simple file I/O
    • Read entire files with read_text() method
    • Write content with write_text() method
    • Handle paths as objects, not strings
  • Safer operations: avoid common errors
    • Type-safe path manipulation
    • Automatic path separator handling
    • Clear, readable code

Working with Path objects

  • Use the / operator to join path components naturally
  • write_text() and read_text() methods simplify file operations
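The two bullets above can be sketched as follows; the directory docs and the file readme.md are hypothetical names chosen for the example:

```python
from pathlib import Path

# The / operator joins path components in a platform-independent way
docs_dir = Path("docs")
docs_dir.mkdir(exist_ok=True)
readme = docs_dir / "readme.md"

readme.write_text("# Project Notes\n")  # write the whole file at once
text = readme.read_text()               # read it back as a single string

print(readme.exists(), text)
```

Because readme is a Path object rather than a string, methods like exists(), read_text(), and write_text() are available directly on it.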

Listing files in directories with glob

  • Use glob() method to find files matching patterns
  • Access file properties through stat() method
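A minimal sketch of pattern matching with glob(); the directory glob_demo and the file names are hypothetical and created first so the example is self-contained:

```python
from pathlib import Path

# Create a sample directory with a few files to search through
docs = Path("glob_demo")
docs.mkdir(exist_ok=True)
for name in ["a.md", "b.md", "c.txt"]:
    (docs / name).write_text("content\n")

# glob("*.md") yields only the Markdown files in the directory
markdown_files = sorted(docs.glob("*.md"))
for path in markdown_files:
    size = path.stat().st_size  # file size in bytes via stat()
    print(path.name, size)
```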

Working with JSON files

  • JSON format: structured data storage
    • JavaScript Object Notation for data exchange
    • Maps naturally to Python dictionaries and lists
    • Human-readable and machine-parseable
  • Reading JSON: parse structured data
    • Use json module from standard library
    • Load JSON files into Python data structures
    • Access nested data easily
  • Writing JSON: save structured data
    • Convert Python objects to JSON format
    • Control formatting with indentation
    • Preserve data types and structure

Reading JSON files

  • Use json.load() to read JSON from file object
  • JSON objects become dictionaries and JSON arrays become lists
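The bullets above can be sketched as follows; the file sample.json and its contents are hypothetical:

```python
import json
from pathlib import Path

# Create a small JSON file so the example is self-contained
Path("sample.json").write_text('{"title": "Guide", "tags": ["io", "json"]}')

with open("sample.json", "r") as infile:
    data = json.load(infile)  # JSON object -> dict, JSON array -> list

print(type(data).__name__, data["tags"])
```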

Analyzing JSON document data

  • Use json.load() to parse JSON data into dictionaries and lists
  • Analyze structured data with standard Python operations
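A sketch of loading and then analyzing JSON with ordinary Python operations; the file metadata.json and its contents are hypothetical, modeled on the doc_metadata.json format used later in this deck:

```python
import json
from pathlib import Path

# Hypothetical metadata file in the same shape as doc_metadata.json
Path("metadata.json").write_text(
    '{"D001": {"category": "tutorial"}, "D002": {"category": "guide"}}'
)

with open("metadata.json", "r") as infile:
    metadata = json.load(infile)  # parse into nested dictionaries

# Standard Python operations then analyze the structured data
tutorial_count = sum(
    1 for info in metadata.values() if info["category"] == "tutorial"
)
print(len(metadata), tutorial_count)
```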

Analyzing JSON data

  • Count operations: analyze collection size
    • Count key-value pairs in dictionaries
    • Determine number of items in lists
    • Track unique values across documents
  • Statistical analysis: compute metrics
    • Calculate minimum, maximum, and average values
    • Identify patterns in document properties
    • Generate summary statistics
  • Value frequency analysis: understand data distribution
    • Count unique values for each key
    • Find most and least common values
    • Build frequency dictionaries

Counting and analyzing JSON data

  • Count total items and nested key-value pairs
  • Use dictionary and list comprehensions for analysis
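The two bullets above can be sketched as follows; the embedded dictionary mirrors the first entries of the doc_metadata.json file shown later in this deck:

```python
# Metadata in the same shape as doc_metadata.json
doc_metadata = {
    "D001": {"category": "tutorial", "difficulty": "beginner"},
    "D002": {"category": "tutorial", "difficulty": "advanced"},
    "D003": {"category": "reference", "difficulty": "intermediate"},
}

# Count the documents and every nested key-value pair
total_documents = len(doc_metadata)
total_pairs = sum(len(info) for info in doc_metadata.values())

# Build a frequency dictionary of categories with comprehensions
categories = [info["category"] for info in doc_metadata.values()]
category_counts = {c: categories.count(c) for c in set(categories)}

print(total_documents, total_pairs, category_counts)
```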

Detecting unique values

  • Look at the doc_metadata.json file in the repository
  • Extract and count the unique values for each property
  • This function uses sets to automatically eliminate duplicates

JSON file called doc_metadata.json

{
  "D001": {
    "category": "tutorial",
    "difficulty": "beginner"
  },
  "D002": {
    "category": "tutorial",
    "difficulty": "advanced"
  },
  "D003": {
    "category": "reference",
    "difficulty": "intermediate"
  },
  "D004": {
    "category": "guide",
    "difficulty": "beginner"
  },
  "D005": {
    "category": "tutorial",
    "difficulty": "intermediate"
  }
}
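A sketch of the unique-value detection described above; the unique_values function is a hypothetical name, and the example recreates doc_metadata.json from the slide so it runs on its own:

```python
import json
from pathlib import Path

# Recreate doc_metadata.json from the slide so the sketch is self-contained
Path("doc_metadata.json").write_text(json.dumps({
    "D001": {"category": "tutorial", "difficulty": "beginner"},
    "D002": {"category": "tutorial", "difficulty": "advanced"},
    "D003": {"category": "reference", "difficulty": "intermediate"},
    "D004": {"category": "guide", "difficulty": "beginner"},
    "D005": {"category": "tutorial", "difficulty": "intermediate"},
}))

def unique_values(path):
    """Collect the unique values for each property across all documents."""
    with open(path, "r") as infile:
        metadata = json.load(infile)
    unique = {}
    for info in metadata.values():
        for key, value in info.items():
            # a set automatically eliminates duplicate values
            unique.setdefault(key, set()).add(value)
    return unique

result = unique_values("doc_metadata.json")
print(result)
```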

Creating summary data and writing to JSON

  • Writing JSON files: save structured results to files
    • Use json.dump() to write to file objects
    • Use json.dumps() to create JSON strings
    • Format output with indentation for readability
  • Round-trip processing: read, analyze, write JSON files
    • Load existing document data
    • Perform analysis and generate summaries
    • Save results for later use or sharing

Creating and writing summary data

  • Create the summary variable as a Python dictionary
  • Save this variable to a JSON file called doc_summary.json
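The two bullets above can be sketched as follows; the contents of the summary dictionary are hypothetical values of the kind the earlier analyses would produce:

```python
import json

# Hypothetical summary of a document collection
summary = {
    "total_documents": 5,
    "categories": ["tutorial", "reference", "guide"],
    "most_common_category": "tutorial",
}

# json.dump() writes to a file object; indent=2 keeps it human-readable
with open("doc_summary.json", "w") as outfile:
    json.dump(summary, outfile, indent=2)

# Round-trip: load the file back to confirm the structure survived
with open("doc_summary.json", "r") as infile:
    round_trip = json.load(infile)

print(round_trip)
```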

File input and output for prosegrammers

  • Next steps for using file input/output (I/O):
    • Find locations in your tool where file operations could improve functionality
      • Could you load configuration from JSON files?
      • Can your tool read multiple input JSON documents?
      • Would exporting results to JSON make follow-on analyses easier?
    • If you already use file I/O, how can you make it more robust?
    • How would better file handling make your tool more useful?

Key takeaways for prosegrammers

  • Understand file operations
    • Read and write files using open() with context managers
    • Use pathlib.Path for platform-independent file handling
    • Handle file paths and directories programmatically
  • Leverage JSON for structured data
    • Parse JSON files into Python dictionaries and lists
    • Analyze document collections with statistical methods
    • Write summary data back to JSON for sharing
  • Build like a prosegrammer
    • Build complete pipelines: read, analyze, summarize, write
    • Process multiple documents systematically
    • Apply file I/O to real-world document engineering challenges