Building Smart Documentation Databases for Your AI Coding Assistant
When you feed raw documentation into an AI agent, you're essentially asking it to find needles in a haystack. Every legal page, changelog entry, and navigation hub becomes noise that dilutes the signal. At NameOcean, we've been thinking deeply about how to prepare documentation for AI-powered development workflows, and we want to share a practical framework for doing this right.
The Problem: Not All Pages Are Created Equal
Here's a truth that catches many developers off guard: a significant portion of any technical documentation site exists purely for structure and compliance. Index pages that link to other pages, privacy policies, changelogs, API reference lists—these are essential for human readers navigating a website, but they're dead weight for an AI agent trying to learn from actual content.
When you dump unfiltered documentation into a vector database or knowledge base, you're forcing your AI to wade through material that doesn't teach it anything useful. The result? Slower queries, bloated embeddings, and AI responses that reference the wrong pages entirely.
A Two-Pass Classification Strategy
The most efficient approach combines rule-based filtering with selective LLM classification. Think of it as a triage system for your documentation.
First Pass: The Quick Filter
Start with pattern matching on URLs and basic content structure. You can catch the low-hanging fruit instantly:
- Legal pages: Check for patterns like /legal/, /privacy, /terms, /eula, /cookie
- Navigation hubs: Pages with fewer than 200 words that contain mostly links
- Changelogs: Often follow predictable URL patterns
- Reference pages: Sometimes detectable by structure alone
This pass runs locally, costs nothing, and handles maybe 40-60% of your pages.
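As a rough illustration, the first pass might look like the sketch below. The function name `quick_classify`, the changelog URL patterns, and the "more than half the words are link text" reading of "mostly links" are all assumptions made for this example, so tune them to your own site.

```python
import re
from typing import Optional

# Legal URL fragments come from the list above; the changelog patterns are assumed.
LEGAL_PATTERN = re.compile(r"/(legal/|privacy|terms|eula|cookie)", re.IGNORECASE)
CHANGELOG_PATTERN = re.compile(r"/(changelog|release-notes|releases)\b", re.IGNORECASE)
MARKDOWN_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def quick_classify(url: str, markdown: str) -> Optional[str]:
    """Return a label when a rule fires, or None to defer to the LLM pass."""
    if LEGAL_PATTERN.search(url):
        return "legal"
    if CHANGELOG_PATTERN.search(url):
        return "changelog"

    # Count visible words with link targets stripped, then words inside link text.
    visible_words = MARKDOWN_LINK.sub(r"\1", markdown).split()
    link_words = sum(len(m.group(1).split()) for m in MARKDOWN_LINK.finditer(markdown))
    # Assumption: "mostly links" means more than half the visible words are link text.
    if visible_words and len(visible_words) < 200:
        if link_words / len(visible_words) > 0.5:
            return "navigation"
    return None
```

Anything that comes back as `None` is what you hand to the second pass.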
Second Pass: LLM Classification
For the remaining pages, send a lightweight payload to a local LLM (not an API—keep this self-contained). Pass:
- The URL
- The page title
- The first 200 words of content
- The heading hierarchy, as a list
Ask it to classify each page using categories adapted from the Diátaxis framework (which itself defines four types: tutorials, how-to guides, reference, and explanation):
- Conceptual: Explanations and background knowledge
- Tutorial: Learning-by-doing, guided walkthroughs
- How-to: Task-oriented guides for specific outcomes
- Examples: Code samples and demonstrations
- Structural: Navigation pages, references, legal content
The LLM only touches what the rules couldn't handle, keeping costs down and processing fast.
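Here is a minimal sketch of assembling that payload and reading back the answer. It deliberately leaves out the model call itself, since that depends on whichever local runtime you use. The helper names (`build_payload`, `build_prompt`, `parse_label`), the prompt wording, and the `unclassified` fallback are assumptions for illustration.

```python
import json
import re

LABELS = ["conceptual", "tutorial", "how-to", "examples", "structural"]
# Matches markdown ATX headings such as "## Setup".
HEADING = re.compile(r"(?m)^(#{1,6})[ \t]+(.+?)[ \t]*$")


def build_payload(url: str, title: str, markdown: str) -> dict:
    """Collect the four fields described above for one page."""
    headings = [
        {"level": len(m.group(1)), "text": m.group(2)}
        for m in HEADING.finditer(markdown)
    ]
    return {
        "url": url,
        "title": title,
        "excerpt": " ".join(markdown.split()[:200]),  # first 200 words
        "headings": headings,
    }


def build_prompt(payload: dict) -> str:
    """Render a classification prompt; the wording is only illustrative."""
    return (
        "Classify this documentation page as exactly one of: "
        + ", ".join(LABELS)
        + ".\nAnswer with the label only.\n\n"
        + json.dumps(payload, indent=2)
    )


def parse_label(reply: str) -> str:
    """Map a free-form model reply onto a known label."""
    cleaned = reply.strip().lower()
    for label in LABELS:
        if label in cleaned or label.replace("-", " ") in cleaned:
            return label
    return "unclassified"  # assumed fallback; review these pages by hand
```

Parsing the reply defensively matters because small local models do not always answer with the bare label.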
Embedding Content Intelligently
Once you've filtered out the noise, embedding pages for semantic search becomes much more effective. Here's the catch: documentation pages often exceed token limits.
Instead of truncating, split at heading boundaries and average the resulting embeddings. This preserves semantic structure—headings, code blocks, and lists matter for understanding context.
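The markdown structure naturally supplies the split points. The sketch below splits any page that contains level 1-3 headings, which is a simplification; a stricter version would split only pages that exceed the model's token limit: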
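import re
import numpy as np

# `model` is assumed to be an already-loaded local sentence-transformer.
def embed_page(content: str) -> list[float]:
    # Split before each level 1-3 heading so every chunk keeps its heading.
    chunks = [c for c in re.split(r'(?m)^(?=#{1,3} )', content) if c.strip()]
    if len(chunks) <= 1:
        return model.encode(content).tolist()
    embeddings = [model.encode(chunk) for chunk in chunks]
    avg = np.mean(embeddings, axis=0)
    return (avg / np.linalg.norm(avg)).tolist()  # unit-length average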
Use a local sentence transformer model here—you avoid API costs and latency, and honestly, for technical documentation, local models perform well enough.
Building a Hybrid Knowledge Graph
The real power emerges when you combine two types of relationships:
Explicit Links: The hyperlinks that documentation authors wrote. These are high-confidence connections reflecting intentional structure.
Semantic Edges: Connections discovered through embedding similarity. If two pages have cosine similarity above your threshold (we use 0.75), they're conceptually related even if not explicitly linked.
Store both kinds as directed edges in a graph. Weight each semantic edge with its similarity score. Link edges stay unweighted: a link is a link.
A critical optimization: cap the number of neighbors per page (20 is a good starting point) to avoid creating massive connection hubs that confuse traversal. And exclude navigation, legal, and reference pages from semantic graph generation—they only pollute the signal.
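To make the semantic half concrete, here is a minimal sketch that computes pairwise cosine similarity with NumPy and applies both the threshold and the neighbor cap. The function name `semantic_edges` and its parameters are assumptions for this example. It builds a full similarity matrix, so it suits thousands of pages rather than millions, and it expects you to pass in only the content pages (navigation, legal, and reference pages already filtered out).

```python
import numpy as np


def semantic_edges(page_ids, embeddings, threshold=0.75, max_neighbors=20):
    """Return (source, target, weight) edges between sufficiently similar pages."""
    vectors = np.asarray(embeddings, dtype=float)
    # Normalize rows so a plain dot product equals cosine similarity.
    # Assumes no embedding is the zero vector.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = vectors @ vectors.T
    np.fill_diagonal(similarity, -1.0)  # never link a page to itself

    edges = []
    for i, source in enumerate(page_ids):
        # Take the strongest candidates first, capped at max_neighbors.
        ranked = np.argsort(similarity[i])[::-1][:max_neighbors]
        for j in ranked:
            if similarity[i, j] > threshold:
                edges.append((source, page_ids[j], float(similarity[i, j])))
    return edges
```

Because similarity is symmetric, two related pages usually end up with an edge in each direction, which fits the directed storage described above.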
The Final Artifact: A Self-Contained SQLite Database
Everything lives in one portable SQLite database:
- Full documentation content (cleaned markdown)
- Page classifications
- Embeddings (as vectors or serialized blobs)
- Graph edges with weights
- URL and metadata
This is powerful because:
- It's portable: Move it anywhere, use it offline
- It's queryable: AI agents can write SQL. LLMs are remarkably good at it.
- It's filterable: "Show me only tutorial and how-to pages about authentication"
- It's navigable: Agents can traverse the knowledge graph, moving from one relevant page to semantically similar pages
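One possible layout, sketched with Python's built-in `sqlite3` module, is shown below. The table names, column names, and sample rows are assumptions for illustration, not a fixed schema. The example runs the "tutorial and how-to pages about authentication" filter and a one-hop graph traversal.

```python
import sqlite3

# Hypothetical schema: one table for pages, one for graph edges.
SCHEMA = """
CREATE TABLE pages (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE NOT NULL,
    title     TEXT,
    category  TEXT,
    content   TEXT,   -- cleaned markdown
    embedding BLOB    -- serialized vector, e.g. float32 bytes
);
CREATE TABLE edges (
    source_id INTEGER NOT NULL REFERENCES pages(id),
    target_id INTEGER NOT NULL REFERENCES pages(id),
    kind      TEXT NOT NULL CHECK (kind IN ('link', 'semantic')),
    weight    REAL,   -- NULL for link edges
    PRIMARY KEY (source_id, target_id, kind)
);
"""

conn = sqlite3.connect(":memory:")  # use a file path for a portable database
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO pages (id, url, title, category, content) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "https://example.com/docs/auth-tutorial", "Auth tutorial", "tutorial",
         "Set up authentication step by step."),
        (2, "https://example.com/docs/rotate-keys", "Rotate API keys", "how-to",
         "Rotate keys used for authentication."),
        (3, "https://example.com/docs/architecture", "Architecture", "conceptual",
         "How authentication fits into the system."),
    ],
)
conn.executemany(
    "INSERT INTO edges (source_id, target_id, kind, weight) VALUES (?, ?, ?, ?)",
    [(1, 2, "link", None), (1, 3, "semantic", 0.82), (2, 1, "semantic", 0.79)],
)

# Filter: only tutorial and how-to pages that mention authentication.
auth_guides = [row[0] for row in conn.execute(
    "SELECT url FROM pages "
    "WHERE category IN ('tutorial', 'how-to') AND content LIKE '%authentication%' "
    "ORDER BY url"
)]


def neighbors(conn: sqlite3.Connection, url: str) -> list:
    """Follow outgoing edges from one page: links first, then by similarity."""
    return conn.execute(
        "SELECT t.url, e.kind, e.weight FROM edges e "
        "JOIN pages s ON s.id = e.source_id "
        "JOIN pages t ON t.id = e.target_id "
        "WHERE s.url = ? ORDER BY e.kind, e.weight DESC",
        (url,),
    ).fetchall()
```

An agent can issue queries like these directly, which is a large part of why plain SQL works well as the interface.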
Practical Workflow
The complete pipeline looks like:
- Crawl the documentation site (handle redirects, respect robots.txt, deal with JavaScript-rendered content)
- Clean the extracted HTML into usable markdown
- Classify pages with rules, then LLM for the uncertain cases
- Embed with local models, handling long pages by splitting at heading boundaries
- Build the graph with explicit links and semantic edges
- Store everything in SQLite
Now your AI agent has a curated, structured, queryable knowledge base instead of a raw dump of HTML.
Why This Matters for Developers
Whether you're building an internal code assistant, integrating AI into your IDE, or developing a knowledge retrieval system for your team, documentation quality becomes a force multiplier. With intelligent structuring, your AI spends cycles on actual content rather than noise. Queries get faster. Responses become more relevant. And you maintain complete control over your knowledge base—no vendor dependencies, no API costs, no privacy concerns about sending proprietary documentation to third-party services.
The framework we've outlined here scales from small single-product documentation sites up to massive multi-product ecosystems. The key insight is simple: filtering and structuring your documentation upstream saves your AI agent from swimming in noise downstream.