Searching and Sorting for Document Engineering

Gregory M. Kapfhammer

November 3, 2025

Searching and sorting

  • Searching: find specific content in documents
    • Locate keywords in large text collections
    • Find references and citations in documentation
    • Search through API documentation and code examples
  • Sorting: organize document data
    • Order search results by relevance
    • Alphabetize glossaries and indexes
    • Rank documents by date or popularity
  • Both methods help to make documents more accessible and useful!

Why searching and sorting matter

  • Document engineering applications:
    • Search engines index billions of web pages
    • Documentation sites need fast keyword search
    • Version control systems sort commits by date
    • Citation managers organize references alphabetically
  • Performance matters for user experience:
    • Slow search frustrates users
    • Efficient sorting enables real-time features
    • Algorithm choice affects scalability
  • Although efficiency is not this course’s key focus, how do we measure it?

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents and industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • This week’s content aids attainment of CS-104-2, CS-104-3, and CS-104-4!

When the input size grows…the code slows!

  • In document engineering, focus first on correctness, then on efficiency! Recall, an efficient but incorrect program is useless!
  • Program efficiency: how quickly a program runs as input grows
  • Big-O notation: describes how runtime grows with input size
  • Complexity classes: constant, logarithmic, linear, or quadratic
  • Okay, let’s briefly explore how to measure efficiency!

Understanding big-O notation

  • Big-O describes algorithm efficiency as input grows
    • \(O(1)\): constant time, same speed regardless of size
    • \(O(\log n)\): logarithmic time, grows very slowly
    • \(O(n)\): linear time, runtime doubles when size doubles
    • \(O(n \log n)\): linearithmic time, slightly more than doubles when size doubles
    • \(O(n^2)\): quadratic time, 4× slower when size doubles
  • Why it matters for document engineering:
    • Processing 1,000 vs 1,000,000 documents
    • Real-time search over large collections typically needs \(O(\log n)\) lookups or better
    • Batch processing can tolerate \(O(n \log n)\)
  • Choose algorithms based on data size and performance needs!
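
To make these growth rates concrete, here is a minimal sketch that tabulates rough step counts for the two collection sizes mentioned above. The function name `growth_table` is hypothetical, and the counts ignore constant factors, so read them as orders of magnitude rather than measured runtimes.

```python
import math


def growth_table(sizes):
    """Return one row per input size with rough step counts per complexity class."""
    rows = []
    for n in sizes:
        log_n = math.ceil(math.log2(n))  # halvings needed to shrink n items to 1
        rows.append({
            "n": n,
            "O(1)": 1,
            "O(log n)": log_n,
            "O(n)": n,
            "O(n log n)": n * log_n,
            "O(n^2)": n * n,
        })
    return rows


rows = growth_table([1_000, 1_000_000])
for row in rows:
    print(row)
```

Going from 1,000 to 1,000,000 documents only doubles the logarithmic count (10 to 20 steps), while the quadratic count grows by a factor of one million.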

Okay, now let’s explore searching algorithms!

  • Linear search: check every item sequentially at \(O(n)\)
  • Binary search: divide and conquer approach at \(O(\log n)\)
  • Binary search requires sorted data but is much faster
  • Essential for implementing search features in documentation!
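
Linear search is described above but not shown, so here is a minimal sketch of it. The name `linear_search` and the sample topic list are illustrative assumptions, not part of the original slides.

```python
def linear_search(items, target):
    """Check each item in turn; works on unsorted data with O(n) comparisons."""
    for item in items:
        if item == target:
            return True
    return False


# Hypothetical, deliberately unsorted list of topics
unsorted_topics = ["tutorial", "API", "quickstart", "deployment"]
print(linear_search(unsorted_topics, "quickstart"))
print(linear_search(unsorted_topics, "testing"))
```

Because it never assumes an ordering, this approach works on any list, at the cost of possibly touching every item.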

Binary search implementation

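def binary_search(L, item):
    # Track the search window with indices; slicing L would copy part
    # of the list on every step and cost O(n) overall
    low, high = 0, len(L) - 1
    while low <= high:
        median = (low + high) // 2
        if item == L[median]:
            return True
        elif item < L[median]:
            high = median - 1  # keep searching the left half
        else:
            low = median + 1   # keep searching the right half
    return False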

print(binary_search([1, 2, 3, 4, 5], 3))
print(binary_search([2, 4, 6, 8, 10], 5))
True
False
  • Searches for item in sorted list L by dividing in half
  • Returns True if found, False otherwise
  • Time complexity: \(O(\log n)\) for sorted lists
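
Python's standard library also ships a binary search helper in the `bisect` module. The sketch below wraps `bisect_left` so that it reports where the item sits in the list; the wrapper name `bisect_search` and the choice to return `-1` for a missing item are assumptions made for illustration.

```python
from bisect import bisect_left


def bisect_search(sorted_items, item):
    """Return the index of item in sorted_items, or -1 if it is absent."""
    # bisect_left finds the leftmost position where item could be inserted
    position = bisect_left(sorted_items, item)
    if position < len(sorted_items) and sorted_items[position] == item:
        return position
    return -1


print(bisect_search([1, 2, 3, 4, 5], 3))
print(bisect_search([2, 4, 6, 8, 10], 5))
```

Returning a position rather than a boolean is handy when the caller needs to fetch the matching entry afterwards.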

Document search example

Scenario: Search for keywords in sorted documentation index

doc_index = ["API", "authentication", "configuration", "deployment",
             "installation", "quickstart", "reference", "tutorial"]

search_term = "configuration"
found = binary_search(doc_index, search_term)
print(f"'{search_term}' found in index: {found}")

search_term = "testing"
found = binary_search(doc_index, search_term)
print(f"'{search_term}' found in index: {found}")
'configuration' found in index: True
'testing' found in index: False
  • Documentation indexes store sorted topic names
  • Binary search enables fast lookup in large documentation
  • Search engines and documentation generators rely on similar sorted or indexed lookups
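
A sorted index also supports prefix lookups, which is one way a documentation search bar could suggest topics as the user types. This is a sketch under stated assumptions: the function `topics_with_prefix` is hypothetical, and the index is lowercased first because Python compares strings case-sensitively (uppercase letters sort before lowercase ones).

```python
from bisect import bisect_left


def topics_with_prefix(sorted_index, prefix):
    """Return every entry in sorted_index that starts with prefix."""
    # All entries sharing a prefix sit next to each other in sorted order,
    # so jump to the first candidate and scan until the prefix stops matching
    start = bisect_left(sorted_index, prefix)
    matches = []
    for i in range(start, len(sorted_index)):
        if not sorted_index[i].startswith(prefix):
            break
        matches.append(sorted_index[i])
    return matches


# Lowercased copy of a documentation index (illustrative data)
index = sorted(["api", "authentication", "configuration", "deployment",
                "installation", "quickstart", "reference", "tutorial"])
print(topics_with_prefix(index, "a"))
print(topics_with_prefix(index, "re"))
```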

Okay, now let’s explore sorting algorithms!

  • Simple sorts: bubble, selection, insertion at \(O(n^2)\)
  • Advanced sorts: mergesort, quicksort at \(O(n \log n)\)
  • Simple algorithms work well for small datasets
  • Advanced algorithms essential for large document collections!

Simple sorting: bubble sort

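def bubblesort(L):
    # Each outer pass moves the largest remaining item toward the end
    for _ in range(len(L) - 1):
        for i in range(len(L) - 1):
            # Swap adjacent items that are out of order
            if L[i] > L[i + 1]:
                L[i], L[i + 1] = L[i + 1], L[i]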

data = [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
print("Original:", data)
bubblesort(data)
print("Sorted:", data)
Original: [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
Sorted: [17, 20, 26, 30, 31, 44, 54, 55, 77, 93]
  • Repeatedly swaps adjacent elements if out of order
  • Simple to understand and implement
  • Time complexity: \(O(n^2)\) makes it slow for large datasets
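
A common refinement, sketched here as a variant rather than as the version used in these slides, stops as soon as a full pass makes no swaps. The function name `bubblesort_early_exit` and the returned pass count are assumptions added to make the effect visible.

```python
def bubblesort_early_exit(L):
    """Sort L in place and return how many passes were needed."""
    passes = 0
    # After each pass the largest remaining item is in place, so shrink the range
    for end in range(len(L) - 1, 0, -1):
        swapped = False
        passes += 1
        for i in range(end):
            if L[i] > L[i + 1]:
                L[i], L[i + 1] = L[i + 1], L[i]
                swapped = True
        if not swapped:
            break  # no swaps means the list is already sorted
    return passes


data = [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
bubblesort_early_exit(data)
print("Sorted:", data)
print("Passes on sorted input:", bubblesort_early_exit([1, 2, 3, 4, 5]))
```

The worst case is still \(O(n^2)\), but an already-sorted list finishes after a single pass.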

Advanced sorting: mergesort

  • Divides list in half, sorts recursively, then merges
  • More efficient than bubble sort for large datasets
  • Time complexity: \(O(n \log n)\) in all cases
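
The three steps above (divide, sort recursively, merge) can be sketched as follows. This is one straightforward way to write mergesort, not necessarily the version used in the course; the helper name `merge` is an assumption.

```python
def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Using <= keeps equal items in their original order (a stable merge)
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these still has items left over
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def mergesort(L):
    """Return a new sorted list; the input list is left unchanged."""
    if len(L) <= 1:
        return list(L)
    middle = len(L) // 2
    return merge(mergesort(L[:middle]), mergesort(L[middle:]))


data = [30, 54, 26, 93, 17, 77, 31, 44, 55, 20]
print("Sorted:", mergesort(data))
```

The list is halved about \(\log n\) times and each level of merging touches all \(n\) items, which is where \(O(n \log n)\) comes from.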

Sorting documents by date

Scenario: Sort blog posts by publication date

posts = [
    ("Getting Started", "2025-03-15"),
    ("Advanced Topics", "2025-01-20"),
    ("Quick Reference", "2025-02-10")
]

sorted_posts = sorted(posts, key=lambda x: x[1])
print("Posts by date:")
for title, date in sorted_posts:
    print(f"  {date}: {title}")
Posts by date:
  2025-01-20: Advanced Topics
  2025-02-10: Quick Reference
  2025-03-15: Getting Started
  • Documentation sites display content chronologically
  • Python’s sorted() uses efficient timsort at \(O(n \log n)\)
  • key parameter enables sorting by custom criteria
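
Many sites list the newest post first. The sketch below reuses the same posts, parses each date string with `date.fromisoformat`, and passes `reverse=True`; the variable name `newest_first` is an assumption. Parsing is optional for ISO dates like these, since `YYYY-MM-DD` strings already sort chronologically as plain text, but it would matter for other date formats.

```python
from datetime import date

posts = [
    ("Getting Started", "2025-03-15"),
    ("Advanced Topics", "2025-01-20"),
    ("Quick Reference", "2025-02-10"),
]

# Compare real date objects and put the most recent post first
newest_first = sorted(
    posts,
    key=lambda post: date.fromisoformat(post[1]),
    reverse=True,
)

print("Newest posts first:")
for title, published in newest_first:
    print(f"  {published}: {title}")
```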

Python’s built-in sorting

X = [3, 1, 5, 2, 4]
Y = sorted(X)
print("Original X:", X)
print("Sorted Y:", Y)

X.sort()
print("Sorted X:", X)
Original X: [3, 1, 5, 2, 4]
Sorted Y: [1, 2, 3, 4, 5]
Sorted X: [1, 2, 3, 4, 5]
  • sorted(): returns new sorted list, keeps original unchanged
  • .sort(): sorts list in-place, modifies original
  • Both use the timsort algorithm with \(O(n \log n)\) worst case, and close to \(O(n)\) on already-sorted input
  • Choose based on whether you need to preserve original data
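
Python's sort is also stable, meaning items that compare equal keep their relative order. The sketch below uses that property to order pages by section and then by title in two passes, and checks the result against a single sort on a tuple key. The `pages` data is made up for illustration.

```python
# Hypothetical (section, title) pairs for a documentation site
pages = [
    ("guide", "setup"),
    ("api", "render"),
    ("guide", "deploy"),
    ("api", "export"),
]

# Two passes: sort by the secondary key first, then by the primary key
by_title = sorted(pages, key=lambda page: page[1])
by_section_then_title = sorted(by_title, key=lambda page: page[0])

# One pass: a tuple key compares section first, then title
one_pass = sorted(pages, key=lambda page: (page[0], page[1]))

print(by_section_then_title)
print(by_section_then_title == one_pass)
```

The tuple key is usually the simpler choice; the two-pass form helps when each key needs a different direction or transformation.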

Sorting documentation keywords

Scenario: Alphabetize API function names for reference

api_functions = ["render", "preview", "create", "analyze", "export"]

api_functions.sort()
print("Alphabetized functions:")
for func in api_functions:
    print(f"  - {func}()")
Alphabetized functions:
  - analyze()
  - create()
  - export()
  - preview()
  - render()
  • API documentation often lists functions alphabetically
  • Makes it easier for users to find specific functions
  • Sorted references improve documentation usability
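
One caveat when alphabetizing: Python compares strings case-sensitively, so capitalized names sort ahead of lowercase ones. The sketch below shows the difference on a hypothetical mixed-case list and uses `key=str.lower` to get dictionary-style order.

```python
# Hypothetical mixed-case names; the list in the slides is all lowercase
mixed_case_names = ["render", "Preview", "create", "Analyze", "export"]

default_order = sorted(mixed_case_names)
alphabetical = sorted(mixed_case_names, key=str.lower)

print("Default order:", default_order)
print("Case-insensitive:", alphabetical)
```

The `key` function only affects comparisons, so the original capitalization is preserved in the output.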

Practical applications for prosegrammers

  • Search features:
    • Documentation search bars need fast algorithms
    • Binary search enables quick keyword lookup
    • Full-text search uses more advanced data structures
  • Content organization:
    • Sorting blog posts, articles, and documentation pages
    • Alphabetizing glossaries and reference materials
    • Ranking search results by relevance score
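
One of the "more advanced data structures" behind full-text search is an inverted index, which maps each word to the documents containing it. The following is a toy sketch only: the function names, the sample documents, and the use of raw word counts as a relevance score are all assumptions, and real search tools use far more refined scoring.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to a Counter of {document name: occurrences}."""
    index = defaultdict(Counter)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word][name] += 1
    return index


def search(index, word):
    """Return names of documents containing word, most occurrences first."""
    hits = index.get(word.lower(), Counter())
    # Sort by descending count, breaking ties alphabetically by name
    return sorted(hits, key=lambda name: (-hits[name], name))


# Hypothetical documents for illustration
docs = {
    "install.md": "install the tool then run the tool",
    "quickstart.md": "run the quickstart then run the tests",
    "faq.md": "common questions about the tool",
}
index = build_inverted_index(docs)
print(search(index, "run"))
print(search(index, "tool"))
```

Building the index costs time up front, but each later lookup is a dictionary access followed by a sort of only the matching documents.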

Choosing the right algorithm

  • For searching:
    • Small datasets: linear search is fine
    • Large sorted datasets: use binary search
    • Unsorted data: consider sorting first if searching often
  • For sorting:
    • Small datasets (roughly under 50 items, as a rule of thumb): simple sorts work well
    • Large datasets: use Python’s sorted() or .sort()
    • Special cases: counting sort for restricted ranges
  • Understand each algorithm’s worst-case time complexity so you can make smart choices for your document engineering projects!
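
Counting sort, mentioned above for restricted ranges, can be sketched as follows. It assumes every value is an integer between 0 and a known maximum; the function signature and the sample ratings are illustrative assumptions.

```python
def counting_sort(values, max_value):
    """Return a sorted copy of values, which must be integers in 0..max_value."""
    counts = [0] * (max_value + 1)
    for value in values:
        if not 0 <= value <= max_value:
            raise ValueError(f"value {value} is outside 0..{max_value}")
        counts[value] += 1
    # Emit each value as many times as it was seen, in increasing order
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result


# Hypothetical star ratings (1 to 5) attached to documentation pages
ratings = [3, 5, 1, 4, 3, 2, 5, 1]
print(counting_sort(ratings, 5))
```

The work is proportional to the number of items plus the size of the range, so it only pays off when the range is small compared with the data.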