Searching and Sorting for Document Engineering

Gregory M. Kapfhammer

November 3, 2025

Searching and sorting

Searching: find specific content in documents
- Locate keywords in large text collections
- Find references and citations in documentation
- Search through API documentation and code examples

Sorting: organize document data
- Order search results by relevance
- Alphabetize glossaries and indexes
- Rank documents by date or popularity
Both methods help to make documents more accessible and useful!

Why searching and sorting matter

Document engineering applications:
- Search engines index billions of web pages
- Documentation sites need fast keyword search
- Version control systems sort commits by date
- Citation managers organize references alphabetically
Performance matters for user experience:
- Slow search frustrates users
- Efficient sorting enables real-time features
- Algorithm choice affects scalability
Although it is not the course’s key focus, how do we measure efficiency?

Course learning objectives

Learning Objectives for Document Engineering

CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents and industry-standard documentation tools.
CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.

This week’s content aids attainment of CS-104-2, CS-104-3, and CS-104-4!

When the input size grows…the code slows!

In document engineering, focus first on correctness, then on efficiency! Recall, an efficient but incorrect program is useless!
Program efficiency: how quickly a program runs as input grows
Big-O notation: describes how runtime grows with input size
Complexity classes: constant, logarithmic, linear, or quadratic
Okay, let’s briefly explore how to measure efficiency!

Understanding big-O notation

Big-O describes algorithm efficiency as input grows
- \(O(1)\): constant time, same speed regardless of size
- \(O(n)\): linear time, runtime doubles when size doubles
- \(O(\log n)\): logarithmic time, adds about one step each time size doubles
- \(O(n^2)\): quadratic time, runtime roughly quadruples when size doubles
Why it matters for document engineering:
- Processing 1,000 vs 1,000,000 documents
- Real-time search over large collections typically needs \(O(\log n)\) lookups or better
- Batch processing can tolerate \(O(n \log n)\)
Choose algorithms based on data size and performance needs!
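
One rough way to see these growth rates is to count loop iterations instead of measuring time. The sketch below is only an illustration; the helper names (count_logarithmic_steps, count_linear_steps, count_quadratic_steps) are hypothetical and not part of the course materials.

```python
def count_logarithmic_steps(n):
    """Count how many times n can be halved before reaching 1: O(log n)."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


def count_linear_steps(n):
    """Count iterations of a single pass over n items: O(n)."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps


def count_quadratic_steps(n):
    """Count iterations of a nested pass over n items: O(n^2)."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


# Doubling the size adds one logarithmic step, doubles the linear
# count, and quadruples the quadratic count.
for size in (8, 16, 32):
    print(size,
          count_logarithmic_steps(size),
          count_linear_steps(size),
          count_quadratic_steps(size))
```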

Okay, now let’s explore searching algorithms!

Linear search: check every item sequentially at \(O(n)\)
Binary search: divide and conquer approach at \(O(\log n)\)
Binary search requires sorted data but is much faster on large collections
Essential for implementing search features in documentation!
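
For comparison with binary search, here is a minimal sketch of linear search. The function name linear_search is an assumption chosen for illustration.

```python
def linear_search(items, target):
    """Return True if target appears in items, checking one item at a time."""
    for item in items:
        if item == target:
            return True
    return False


# Works on unsorted data, but may need to inspect every item
print(linear_search(["tutorial", "API", "reference"], "API"))
print(linear_search(["tutorial", "API", "reference"], "testing"))
```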

Binary search implementation

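def binary_search(L, item):
    # An empty list cannot contain the item
    if len(L) == 0:
        return False
    # Compare against the middle element of the sorted list
    median = len(L) // 2
    if item == L[median]:
        return True
    elif item < L[median]:
        # Item can only be in the left half
        return binary_search(L[:median], item)
    else:
        # Item can only be in the right half
        return binary_search(L[median + 1:], item)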

print(binary_search([1, 2, 3, 4, 5], 3))
print(binary_search([2, 4, 6, 8, 10], 5))

True
False

Searches for item in sorted list L by dividing in half
Returns True if found, False otherwise
Time complexity: \(O(\log n)\) comparisons for sorted lists; because each slice copies list elements, this version does \(O(n)\) total work
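
A variant that tracks index bounds instead of slicing avoids copying the list at each step. This is a sketch of that idea, not the course's official implementation; the name binary_search_indices is hypothetical.

```python
def binary_search_indices(L, item):
    """Search sorted list L for item by narrowing a [low, high] index range."""
    low, high = 0, len(L) - 1
    while low <= high:
        mid = (low + high) // 2
        if L[mid] == item:
            return True
        elif item < L[mid]:
            high = mid - 1  # continue in the left half
        else:
            low = mid + 1   # continue in the right half
    return False


print(binary_search_indices([1, 2, 3, 4, 5], 3))
print(binary_search_indices([2, 4, 6, 8, 10], 5))
```

Because no sublists are created, each step costs constant time and the whole search runs in \(O(\log n)\).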

Document search example

Scenario: Search for keywords in sorted documentation index

doc_index = ["API", "authentication", "configuration", "deployment",
             "installation", "quickstart", "reference", "tutorial"]

search_term = "configuration"
found = binary_search(doc_index, search_term)
print(f"'{search_term}' found in index: {found}")

search_term = "testing"
found = binary_search(doc_index, search_term)
print(f"'{search_term}' found in index: {found}")

'configuration' found in index: True
'testing' found in index: False

Documentation indexes store sorted topic names
Binary search enables fast lookup in large documentation
Sorted-index lookups like this appear in many search and documentation tools
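
Binary search only works when the index is sorted with the same comparison used during lookup, and Python compares strings case-sensitively (uppercase letters sort before lowercase ones). The sketch below normalizes case before sorting and searching, using bisect_left from the standard library; build_index and index_contains are hypothetical helper names.

```python
from bisect import bisect_left


def build_index(topics):
    """Return the topics case-folded and sorted, ready for binary search."""
    return sorted(topic.casefold() for topic in topics)


def index_contains(index, term):
    """Check whether term is in the sorted index, ignoring case."""
    key = term.casefold()
    position = bisect_left(index, key)
    return position < len(index) and index[position] == key


index = build_index(["tutorial", "API", "Quickstart", "configuration"])
print(index)
print(index_contains(index, "api"))
print(index_contains(index, "testing"))
```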

Okay, now let’s explore sorting algorithms!

Simple sorts: bubble, selection, insertion at \(O(n^2)\)
Advanced sorts: mergesort, quicksort at \(O(n \log n)\)
Simple algorithms work well for small datasets
Advanced algorithms essential for large document collections!

Simple sorting: bubble sort

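def bubblesort(L):
    # Each outer pass moves the largest unsorted item toward the end
    for _ in range(len(L) - 1):
        for i in range(len(L) - 1):
            # Swap neighbors that are out of order
            if L[i] > L[i + 1]:
                L[i], L[i + 1] = L[i + 1], L[i]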

data = [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
print("Original:", data)
bubblesort(data)
print("Sorted:", data)

Original: [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
Sorted: [17, 20, 26, 30, 31, 44, 54, 55, 77, 93]

Repeatedly swaps adjacent elements if out of order
Simple to understand and implement
Time complexity: \(O(n^2)\) makes it slow for large datasets

Advanced sorting: mergesort

Divides list in half, sorts recursively, then merges
More efficient than bubble sort for large datasets
Time complexity: \(O(n \log n)\) in all cases
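
The divide, sort, and merge steps can be sketched as follows. This is one common way to write mergesort, offered as an illustration; unlike bubblesort above, this version returns a new list instead of sorting in place.

```python
def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Using <= keeps equal items in their original order
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the rest of the other is already sorted
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def mergesort(L):
    """Return a sorted copy of L by splitting, sorting halves, and merging."""
    if len(L) <= 1:
        return list(L)
    middle = len(L) // 2
    return merge(mergesort(L[:middle]), mergesort(L[middle:]))


data = [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
print("Sorted:", mergesort(data))
```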

Sorting documents by date

Scenario: Sort blog posts by publication date

posts = [
    ("Getting Started", "2025-03-15"),
    ("Advanced Topics", "2025-01-20"),
    ("Quick Reference", "2025-02-10")
]

sorted_posts = sorted(posts, key=lambda x: x[1])
print("Posts by date:")
for title, date in sorted_posts:
    print(f"  {date}: {title}")

Posts by date:
  2025-01-20: Advanced Topics
  2025-02-10: Quick Reference
  2025-03-15: Getting Started

Documentation sites display content chronologically
Python’s sorted() uses efficient timsort at \(O(n \log n)\)
key parameter enables sorting by custom criteria
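
ISO-formatted dates (YYYY-MM-DD) happen to sort correctly as plain strings, which is why the example above works. For other date formats, or to show the newest post first, one option is to parse the dates and pass reverse=True. This sketch reuses the same sample posts; the variable name newest_first is an assumption.

```python
from datetime import date

posts = [
    ("Getting Started", "2025-03-15"),
    ("Advanced Topics", "2025-01-20"),
    ("Quick Reference", "2025-02-10"),
]

# Parse each date string so the comparison uses real dates, newest first
newest_first = sorted(
    posts,
    key=lambda post: date.fromisoformat(post[1]),
    reverse=True,
)

for title, published in newest_first:
    print(f"  {published}: {title}")
```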

Python’s built-in sorting

X = [3, 1, 5, 2, 4]
Y = sorted(X)
print("Original X:", X)
print("Sorted Y:", Y)

X.sort()
print("Sorted X:", X)

Original X: [3, 1, 5, 2, 4]
Sorted Y: [1, 2, 3, 4, 5]
Sorted X: [1, 2, 3, 4, 5]

sorted(): returns new sorted list, keeps original unchanged
.sort(): sorts list in-place, modifies original
Both use the timsort algorithm, which is \(O(n \log n)\) in the worst case and close to \(O(n)\) on already-sorted data
Choose based on whether you need to preserve original data

Sorting documentation keywords

Scenario: Alphabetize API function names for reference

api_functions = ["render", "preview", "create", "analyze", "export"]

api_functions.sort()
print("Alphabetized functions:")
for func in api_functions:
    print(f"  - {func}()")

Alphabetized functions:
  - analyze()
  - create()
  - export()
  - preview()
  - render()

API documentation often lists functions alphabetically
Makes it easier for users to find specific functions
Sorted references improve documentation usability
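
The example above works because every name is lowercase. With mixed-case names, Python's default ordering places uppercase letters first, so a key function such as str.casefold may give a more natural alphabetical listing. The mixed-case names below are made up for illustration.

```python
api_functions = ["render", "Preview", "create", "Analyze", "export"]

# Default ordering: uppercase names come before all lowercase names
default_order = sorted(api_functions)
print(default_order)

# Case-insensitive ordering: compare names by their case-folded form
folded_order = sorted(api_functions, key=str.casefold)
print(folded_order)
```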

Practical applications for prosegrammers

Search features:
- Documentation search bars need fast algorithms
- Binary search enables quick keyword lookup
- Full-text search uses more advanced data structures
Content organization:
- Sorting blog posts, articles, and documentation pages
- Alphabetizing glossaries and reference materials
- Ranking search results by relevance score
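
Ranking by relevance can be sketched with sorted() and a tuple key. The result records and scores below are invented sample data, and breaking ties alphabetically by title is just one possible choice.

```python
results = [
    {"title": "Installation guide", "score": 0.72},
    {"title": "API reference", "score": 0.91},
    {"title": "Configuration", "score": 0.72},
]

# Negating the score sorts highest first; the title breaks ties
ranked = sorted(results, key=lambda r: (-r["score"], r["title"]))

for position, result in enumerate(ranked, start=1):
    print(f"{position}. {result['title']} ({result['score']:.2f})")
```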

Choosing the right algorithm

For searching algorithms:
- Small datasets: linear search is fine
- Large sorted datasets: use binary search
- Unsorted data: consider sorting first if searching often
For sorting algorithms:
- Small datasets (< 50 items): simple sorts work well
- Large datasets: use Python’s sorted() or .sort()
- Special cases: counting sort for restricted ranges
Understand each algorithm’s worst-case time complexity so you can make smart choices for your document engineering projects!
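
As an illustration of the special case mentioned above, here is a minimal counting sort. It assumes every value is an integer between 0 and max_value, such as heading levels 1 through 6; the function name and sample data are hypothetical.

```python
def counting_sort(values, max_value):
    """Sort integers in the range 0..max_value by tallying each value."""
    counts = [0] * (max_value + 1)
    for value in values:
        counts[value] += 1
    # Rebuild the list by emitting each value as often as it was seen
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result


heading_levels = [3, 1, 2, 2, 6, 1, 4, 2]
print(counting_sort(heading_levels, max_value=6))
```

This runs in \(O(n + k)\) time, where \(k\) is the size of the value range, so it only pays off when that range is small.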

Searching and sorting aid prosegramming

Next steps with searching and sorting techniques:
- Find locations in your tool where searching/sorting could add value:
  - Could binary search help find references quickly?
  - Would sorting improve document organization?
  - Could efficient algorithms speed up processing?
- Combine searching and sorting for powerful document tools
- How would efficient algorithms make your document tools better?
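
One way to explore these questions is to time two approaches on the same data. The sketch below uses the standard library's timeit and bisect modules on a synthetic index of 100,000 made-up topic names; actual timings depend on the machine, so none are shown.

```python
import timeit
from bisect import bisect_left


def linear_contains(items, target):
    """Membership test on a list scans items one by one."""
    return target in items


def bisect_contains(sorted_items, target):
    """Binary search for target in a sorted list."""
    position = bisect_left(sorted_items, target)
    return position < len(sorted_items) and sorted_items[position] == target


# Zero-padded names are already in sorted order
index = [f"topic-{number:06d}" for number in range(100_000)]
target = "topic-099999"  # worst case for the linear scan

linear_time = timeit.timeit(lambda: linear_contains(index, target), number=20)
bisect_time = timeit.timeit(lambda: bisect_contains(index, target), number=20)
print(f"linear: {linear_time:.4f}s  bisect: {bisect_time:.4f}s")
```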

Key takeaways for prosegrammers

Choose the right search method
- Use linear search for small collections and binary search for large, sorted indexes
- Consider full-text search, indexing, or databases for complex queries
Sort with project requirements in mind
- Use Python’s sorted() or .sort() for production workloads
- Use key= to sort by dates, relevance, or custom fields
Balance correctness and efficiency
- Prioritize correct, well-tested code before optimizing
- Use algorithmic complexity (i.e., big-O notation and analysis) to reason about how code scales
Practical prosegrammer tips
- Use bisect or indexing structures for fast lookups
- Benchmark sorting/searching on representative data
- Keep sorts stable when ordering related document metadata
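
Stability means that items with equal keys keep their earlier relative order. Python's sorted() is stable, so sorting by a secondary field first and a primary field second yields a combined ordering. The document metadata below is sample data made up for this sketch.

```python
docs = [
    ("Quick Reference", "2025-02-10"),
    ("Advanced Topics", "2025-01-20"),
    ("Changelog", "2025-02-10"),
    ("Getting Started", "2025-01-20"),
]

# First pass: order by title (the secondary field)
by_title = sorted(docs, key=lambda doc: doc[0])

# Second pass: order by date; ties keep their title order
by_date_then_title = sorted(by_title, key=lambda doc: doc[1])

for title, published in by_date_then_title:
    print(f"  {published}: {title}")
```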