Retrieval Augmented Generation for Document Engineering

Gregory M. Kapfhammer

December 1, 2025

Understanding RAG

  • What is RAG?
    • Combining document retrieval with text generation
    • Finding relevant information to support answers
    • Building context-aware document systems
    • Enhancing responses with retrieved knowledge
  • What are this week’s highlights?
    • A “from scratch” implementation of basic RAG concepts:
      • Document ingestion and preprocessing
      • Text chunking and organization
      • Simple vector-like representations
      • Retrieval and context building
      • Response generation with context

Key insights for prosegrammers

  • RAG combines retrieval and generation to build intelligent document systems that provide meaningful, contextualized answers
  • Simple implementations using basic Python can demonstrate core RAG concepts without requiring use of complex libraries
  • Understanding RAG fundamentals helps prosegrammers design better document engineering tools and pipelines
  • You can leverage these insights to build a more full-featured system using packages like SentenceTransformers and FAISS!

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents with industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • Content aids in attainment of learning objectives CS-104-3 and CS-104-4!

Document ingestion and preprocessing

  • Document ingestion loads text data into a processing pipeline
    • Read files from the filesystem
    • Parse different text formats
    • Extract raw content for analysis
    • Foundation of all RAG systems
  • Why ingest documents?
    • Build knowledge base for retrieval
    • Prepare content for searching
    • Enable context-aware responses
    • Support question-answering systems

Reading documents from files

  • Simple document loading from string content
  • Normalize text by removing extra whitespace
  • Key insight: RAG starts with loading documents into memory
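
As a minimal sketch of this step (the function and file names are illustrative assumptions, not the course's reference implementation), loading a document can be as simple as reading a file into a string:

```python
from pathlib import Path

def load_document(path: str) -> str:
    """Read a text file from the filesystem and return its raw content."""
    return Path(path).read_text(encoding="utf-8")

# Example call (assumes a plain-text file named notes.txt exists):
# raw_text = load_document("notes.txt")
```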

Data cleaning and preprocessing

  • Remove extra whitespace and normalize formatting
  • Prepare text for consistent processing
  • Cleaning ensures uniform document handling
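
A small sketch of the cleaning step, assuming whitespace normalization is all that is needed:

```python
import re

def clean_text(text: str) -> str:
    """Collapse runs of whitespace and trim the ends for uniform handling."""
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Retrieval   Augmented\n\nGeneration  "))
# -> "Retrieval Augmented Generation"
```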

Divide text into chunks

  • Text chunking divides documents into smaller, manageable pieces
    • Split long documents into segments
    • Create context-sized pieces for retrieval
    • Balance chunk size for completeness
    • Enable efficient searching and matching
  • Why chunk documents?
    • Large documents overwhelm processing
    • Smaller chunks match queries better
    • Control context window size
    • Improve retrieval precision
    • Trade off efficiency against how faithfully each chunk represents the document's relevant content!

Simple sentence-based chunking

  • Split documents into sentence-level chunks
  • Each chunk becomes a searchable unit
  • Smaller chunks enable precise retrieval
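
One possible sentence-splitting sketch (a simple regular expression stands in for a full sentence tokenizer):

```python
import re

def chunk_by_sentence(text: str) -> list[str]:
    """Split text on sentence-ending punctuation; each sentence is one chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

print(chunk_by_sentence("RAG retrieves chunks. It then generates answers."))
# -> ['RAG retrieves chunks.', 'It then generates answers.']
```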

Fixed-size word chunking

  • Control chunk size by word count
  • Useful for consistent context windows
  • Trade-off between completeness and granularity
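
A hedged sketch of word-count chunking; the default chunk size of 50 words is an arbitrary choice:

```python
def chunk_by_words(text: str, size: int = 50) -> list[str]:
    """Group words into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

print(chunk_by_words("one two three four five six seven", size=3))
# -> ['one two three', 'four five six', 'seven']
```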

Vector-like entities

  • Vector representations encode text as numerical values
    • Transform text into comparable format
    • Enable similarity calculations
    • Foundation for semantic search
    • Real systems use embeddings from models
  • Simple representation approach
    • Word frequency as feature vector
    • Shared vocabulary across chunks
    • Basic similarity through overlap
    • Demonstrates core concept simply
  • In practice, use SentenceTransformers or a cloud-based LLM provider's embedding API!

Tools for vector embeddings

  • Popular embedding tools and packages:
    • SentenceTransformers: Pre-trained models for semantic embeddings
    • OpenAI Embeddings API: Cloud-based embedding generation
    • Hugging Face Transformers: Open-source embedding models
    • FAISS: Efficient similarity search for vectors
    • ChromaDB: Vector database for RAG systems
    • Pinecone: Managed vector database service
    • Qdrant: Vector search and storage solutions
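
For instance, assuming the sentence-transformers package is installed, a small embedding-and-similarity sketch might look like this (the model name is a commonly used default, not a course requirement):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, widely used model
chunks = ["RAG retrieves relevant chunks.", "Chunking splits documents."]
chunk_vectors = model.encode(chunks)              # one dense vector per chunk

query_vector = model.encode("How does retrieval work?")
print(util.cos_sim(query_vector, chunk_vectors))  # cosine similarity to each chunk
```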

Simple word frequency vectors

  • Treat the text as a word-frequency dictionary in which each word becomes a “feature dimension”
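
A minimal sketch of this representation using only the standard library (the function name is illustrative):

```python
from collections import Counter

def frequency_vector(text: str) -> dict[str, int]:
    """Map each lowercased word to its count; the words act as feature dimensions."""
    return dict(Counter(text.lower().split()))

print(frequency_vector("retrieval finds relevant chunks for retrieval"))
# -> {'retrieval': 2, 'finds': 1, 'relevant': 1, 'chunks': 1, 'for': 1}
```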

Computing similarity between chunks

  • Calculate overlap between word sets
  • Higher overlap means higher relevance
  • Basis for retrieval ranking
  • Sophisticated systems use cosine similarity or other metrics!
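
One simple way to score overlap is the Jaccard similarity of the two word sets, sketched here under the assumption that words are compared after lowercasing:

```python
def overlap_similarity(a: str, b: str) -> float:
    """Fraction of shared words between two texts (Jaccard similarity)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

print(overlap_similarity("chunking splits documents", "chunking splits long documents"))
# -> 0.75
```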

Relevant documents

  • Retrieval finds most relevant chunks for a query
    • Compare query against all chunks
    • Rank by similarity score
    • Select top matches
    • Core of RAG systems
  • Why retrieve documents?
    • Provide relevant context for answers
    • Find supporting information
    • Build knowledge-grounded responses
    • Enable question answering
    • Offer input to a local or cloud-based LLM

Building a simple retriever

  • Score all chunks against query
  • Sort by relevance score …
  • … and return the top matches! For this query, did the system pick the correct chunks?
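
A possible retriever sketch that builds on the overlap score above (the names and sample chunks are illustrative):

```python
def overlap_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two texts, as sketched earlier."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa and wb else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query, sort, and return the top_k matches."""
    scored = [(overlap_similarity(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

chunks = [
    "Chunking splits documents into smaller pieces.",
    "Retrieval ranks chunks by similarity to the query.",
    "Generation produces an answer from retrieved context.",
]
print(retrieve("how does retrieval rank chunks", chunks, top_k=2))
```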

Understanding relevance scores

  • Show explicit word matches
  • Explain why a chunk is relevant
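
A tiny sketch that surfaces the shared words behind a relevance score:

```python
def explain_match(query: str, chunk: str) -> set[str]:
    """Return the words shared by the query and a chunk to show why it was retrieved."""
    return set(query.lower().split()) & set(chunk.lower().split())

print(explain_match("how does retrieval work", "retrieval ranks chunks by similarity"))
# -> {'retrieval'}
```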

Combining retrieved context with queries

  • Context combination merges query with retrieved information
    • Build context from top chunks
    • Format for response generation
    • Maintain source attribution
    • Create comprehensive knowledge base
  • Why combine context?
    • Provide evidence for answers
    • Support factual responses
    • Enable source citation

Building context from retrieval

  • Format query with retrieved chunks and create structured context
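
A sketch of one possible context format (the exact layout is an assumption, not a fixed standard):

```python
def build_context(query: str, top_chunks: list[str]) -> str:
    """Combine the query with retrieved chunks into one structured string."""
    evidence = "\n".join(f"- {chunk}" for chunk in top_chunks)
    return f"Question: {query}\nContext:\n{evidence}"

print(build_context("How does retrieval work?",
                    ["Retrieval ranks chunks by similarity.",
                     "Top-ranked chunks become the answer's context."]))
```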

Formatting context for responses

  • Add source tracking to context and include relevance scores
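
Extending the idea above, a hedged sketch that carries source names and relevance scores along with each chunk:

```python
def format_context(results: list[tuple[float, str, str]]) -> str:
    """Format (score, source, chunk) triples with attribution and relevance scores."""
    return "\n".join(f"[{source} | score={score:.2f}] {chunk}"
                     for score, source, chunk in results)

print(format_context([(0.42, "guide.txt", "Retrieval ranks chunks by similarity."),
                      (0.18, "notes.txt", "Chunking splits documents into pieces.")]))
```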

Context-driven data

  • Response generation creates answers using retrieved context
    • Extract relevant information
    • Synthesize coherent responses
    • Maintain factual grounding
    • In practice, uses (large) language models
  • Simple generation approach
    • Template-based responses
    • Direct information extraction
    • Demonstrates concept flow
    • Real systems use LLMs for flexibility

Tools for response generation

  • Language models for generation:
    • OpenAI GPT: Cloud-based LLM for text generation
    • Anthropic Claude: Conversational AI with long context
    • Google Gemini: Multimodal generation capabilities
    • Hugging Face Models: Open-source LLMs like Llama
    • LangChain: Framework for building RAG applications
    • LlamaIndex: Data framework for LLM applications
  • For this course: We use template-based generation to demonstrate the concept without requiring external APIs or packages. This approach generates an interesting result, yet it is not general-purpose enough for a production RAG tool.

Simple template-based generation

  • Create response from top retrieved chunk
  • Simple template wraps information
  • Demonstrates basic generation concept
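
A minimal template-based sketch; the wording of the template is an arbitrary choice:

```python
def generate_response(query: str, top_chunk: str) -> str:
    """Wrap the single best chunk in a fixed template to form an answer."""
    return f"Based on the most relevant passage, '{query}' is answered by: {top_chunk}"

print(generate_response("How does retrieval work?",
                        "Retrieval ranks chunks by similarity to the query."))
```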

Better response with multiple sources

  • Synthesize information from multiple chunks and present a full answer
  • Better responses use more context; can you explain why these chunks match the query?
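
One way to sketch multi-source synthesis, again with illustrative names and template wording:

```python
def generate_from_sources(query: str, top_chunks: list[str]) -> str:
    """Concatenate several retrieved chunks into a fuller templated answer."""
    evidence = " ".join(top_chunks)
    return f"Answer to '{query}', drawing on {len(top_chunks)} passages: {evidence}"

print(generate_from_sources("What is chunking?",
                            ["Chunking splits documents into smaller pieces.",
                             "Smaller chunks make retrieval more precise."]))
```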

Complete RAG pipeline demonstration

  • End-to-end RAG system combines all components
    • Ingest and preprocess documents
    • Chunk text into searchable units
    • Create vector representations
    • Retrieve relevant chunks
    • Generate context-grounded responses
  • Real-world applications
    • Question answering systems
    • Document search assistants

Complete RAG system

  • Integrate all RAG components
  • Process multiple documents
  • Return context-aware response
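
Putting the pieces together, here is a compact end-to-end sketch in basic Python (the function names, sample documents, and templated answer are all illustrative assumptions):

```python
import re

def chunk(text: str) -> list[str]:
    """Normalize whitespace and split a document into sentence-level chunks."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]

def score(query: str, chunk_text: str) -> float:
    """Jaccard overlap between the word sets of the query and a chunk."""
    qa, cb = set(query.lower().split()), set(chunk_text.lower().split())
    return len(qa & cb) / len(qa | cb) if qa and cb else 0.0

def rag_answer(query: str, documents: list[str], top_k: int = 2) -> str:
    """Ingest, chunk, retrieve, and generate a context-grounded templated answer."""
    chunks = [c for doc in documents for c in chunk(doc)]
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
    return f"Answer to '{query}': " + " ".join(ranked)

docs = ["Retrieval finds relevant chunks. Chunking splits documents into pieces.",
        "Generation produces answers grounded in retrieved context."]
print(rag_answer("How does retrieval find relevant chunks?", docs))
```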

RAG system with source attribution
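
  • Track which document each retrieved chunk came from and cite it in the answer

A hedged sketch of attribution (the dictionary of sample documents and the citation format are illustrative assumptions):

```python
def rag_answer_with_sources(query: str, documents: dict[str, str], top_k: int = 2) -> str:
    """Retrieve chunks, keep their source names, and cite each source in the answer."""
    def score(q: str, c: str) -> float:
        qa, cb = set(q.lower().split()), set(c.lower().split())
        return len(qa & cb) / len(qa | cb) if qa and cb else 0.0

    chunks = [(name, sentence)
              for name, text in documents.items()
              for sentence in text.split(". ") if sentence]
    ranked = sorted(chunks, key=lambda pair: score(query, pair[1]), reverse=True)[:top_k]
    evidence = " ".join(f"{sentence} [source: {name}]" for name, sentence in ranked)
    return f"Answer to '{query}': {evidence}"

docs = {"guide.txt": "Retrieval ranks chunks by similarity. Chunking splits documents.",
        "notes.txt": "Generation produces answers from retrieved context."}
print(rag_answer_with_sources("How does retrieval rank chunks?", docs))
```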

Enhancing RAG

  • Improving RAG systems:
    • Better chunking strategies:
      • Semantic chunking by topic
      • Overlapping chunks for context
      • Adaptive chunk sizes
    • Enhanced retrieval methods:
      • Advanced similarity metrics
      • Hybrid search combining keywords and vectors
      • Re-ranking for better results
    • Context optimization:
      • Chunk selection strategies
      • Context window management
      • Prompt engineering for generation
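
As one concrete example of the overlapping-chunk strategy listed above, a minimal sketch (the size and overlap values are arbitrary):

```python
def overlapping_chunks(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Create word chunks that share `overlap` words with the previous chunk."""
    words = text.split()
    step = max(size - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

print(overlapping_chunks("one two three four five six seven eight", size=4, overlap=2))
# -> ['one two three four', 'three four five six', 'five six seven eight', 'seven eight']
```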

Key takeaways for prosegrammers

  • Understand RAG components
    • Document ingestion prepares knowledge base
    • Chunking creates retrievable units
    • Vector representations enable similarity search
    • Retrieval finds relevant context
    • Generation produces grounded responses
  • Master retrieval concepts
    • Similarity scoring ranks relevance
    • Top-k selection balances context and precision
    • Source attribution maintains transparency
    • Retrieved context grounds generated answers
  • What are some practical ways in which you could integrate RAG into your document engineering tool? How will you extend the starting implementation presented this week? Make sure to listen to SE Radio Episode 690: Kacper Łukawski on Qdrant Vector Database!