Chapter 3: Search and Retrieval Mechanics
Retrieval within CodaCite is the "Engine of Discovery": it moves from simple keyword matching to an agentic process that evaluates the quality of its own results. This chapter covers how Hybrid Search scores candidates and how the self-correcting LangGraph retrieval loop works.
3.1 Hybrid Search: The Dual-Path Strategy
To achieve high precision (exact terminology) and high recall (conceptual meaning), CodaCite employs a Hybrid Search strategy in SurrealDB.
Unlike traditional RAG, which typically relies on a single retrieval path, CodaCite's Hybrid Retriever combines the strengths of Keyword Search and Vector Search:
- BM25 (Lexical): Ensures precision for specific entities, acronyms, and technical jargon.
- Vector (Semantic): Captures thematic intent and conceptual similarity, even when keywords do not match.
Weighted α-Scoring
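The final relevance score is a weighted combination of the lexical and semantic scores:
\[ S_{\text{hybrid}} = \alpha \cdot S_{\text{BM25}} + (1 - \alpha) \cdot S_{\text{vector}} \]
Typically, \(\alpha\) is tuned to 0.4, which gives the semantic score the larger weight (0.6) while retaining strong keyword anchoring.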
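The weighted combination can be sketched in a few lines of Python. This is a minimal illustration, not CodaCite's implementation: it assumes that \(\alpha\) multiplies the lexical (BM25) score, which is the reading consistent with \(\alpha = 0.4\) favoring the semantic side, and it assumes min-max normalization to make the two score scales comparable. The function names and sample scores are hypothetical.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1] so BM25 and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; treat them as equally relevant on this path.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float],
    vector: Dict[str, float],
    alpha: float = 0.4,
) -> Dict[str, float]:
    """Blend scores as alpha * lexical + (1 - alpha) * semantic (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    lex = min_max_normalize(bm25)
    sem = min_max_normalize(vector)
    # A document found by only one path contributes 0.0 on the other path.
    return {
        doc_id: alpha * lex.get(doc_id, 0.0) + (1.0 - alpha) * sem.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }


# Hypothetical raw scores from the two retrieval paths.
bm25_hits = {"doc_a": 12.0, "doc_b": 4.0}
vector_hits = {"doc_b": 0.90, "doc_c": 0.50}

ranked = sorted(
    hybrid_scores(bm25_hits, vector_hits).items(),
    key=lambda item: item[1],
    reverse=True,
)
```

With these sample inputs, `doc_b` ranks first because it is the top semantic match and the semantic path carries the larger weight, while `doc_a` still places second on lexical strength alone.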
3.2 The Self-Correcting Retrieval Loop (LangGraph)
CodaCite does not rely on a single, static retrieval call. Instead, it utilizes an agentic loop built on LangGraph. This loop mimics the human process of "searching, evaluating, and refining."
The Retrieval Cycle
- Retrieve: The initial query is vectorized and executed against the Hybrid Index. Simultaneously, the system links query entities to the Knowledge Graph and performs a depth-limited traversal.
- Rerank: A ModernBERT cross-encoder reranks all retrieved candidates. This phase applies high-precision filtering (typically a score above 0.3) to discard noisy or irrelevant context snippets.
- Rewrite (Conditional): If the reranking phase determines that no relevant documents were found and the `rewrite_count` is below the limit, the Query Rewriter node is triggered. It uses an LLM to optimize the query and restarts the loop.
- Final Synthesis: Once relevant context is secured, the documents are returned as the grounded context window for generation.
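The control flow of this cycle can be sketched in plain Python. The sketch deliberately does not use the LangGraph API; it only mirrors the retrieve, rerank, rewrite decision logic described above. The retriever, reranker, and rewriter are passed in as hypothetical callables, the 0.3 threshold comes from the rerank step, and the rewrite limit of 3 is taken from the flowchart in this chapter. All names other than `rewrite_count` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (snippet text, reranker score)


@dataclass
class RetrievalState:
    query: str
    rewrite_count: int = 0
    documents: List[str] = field(default_factory=list)


def run_retrieval_loop(
    query: str,
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[Candidate]],
    rewrite: Callable[[str], str],
    min_score: float = 0.3,
    max_rewrites: int = 3,
) -> RetrievalState:
    """Retrieve, rerank, and rewrite the query until documents pass the filter."""
    state = RetrievalState(query=query)
    while True:
        candidates = retrieve(state.query)
        scored = rerank(state.query, candidates)
        # Keep only candidates above the threshold, best first.
        kept = sorted(
            (c for c in scored if c[1] > min_score),
            key=lambda c: c[1],
            reverse=True,
        )
        state.documents = [text for text, _ in kept]
        # Stop on success, or when the rewrite budget is exhausted.
        if state.documents or state.rewrite_count >= max_rewrites:
            return state
        state.query = rewrite(state.query)
        state.rewrite_count += 1


# Toy stand-ins for the real components (hypothetical).
corpus = {"hybrid search": ["BM25 and vector scores are blended."]}


def toy_retrieve(q: str) -> List[str]:
    return corpus.get(q, ["unrelated snippet"])


def toy_rerank(q: str, docs: List[str]) -> List[Candidate]:
    return [(d, 0.9 if "BM25" in d else 0.1) for d in docs]


def toy_rewrite(q: str) -> str:
    return "hybrid search"


result = run_retrieval_loop(
    "how are results mixed?", toy_retrieve, toy_rerank, toy_rewrite
)
```

In this toy run the first pass returns only a low-scoring snippet, so the query is rewritten once and the second pass succeeds. If no rewrite ever helps, the loop returns an empty document list once `rewrite_count` reaches the limit, so the caller should handle that case.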
3.3 Graph-Enforced Scoping
Retrieval is strictly constrained by the Notebook Scope. When a user selects specific notebooks, the retrieval engine applies a graph filter:
- O(1) RecordID Fetches: The system uses SurrealDB's direct `RecordID` fetching for partition-based filtering, minimizing join latency during the LangGraph loop.
- Security & Relevance: This ensures that context from unrelated projects does not "bleed" into the current analysis, maintaining strict logical isolation.
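The scoping idea can be illustrated with an in-memory stand-in. This sketch does not call SurrealDB; it assumes a hypothetical mapping from notebook record IDs (written in SurrealDB's `table:id` form) to chunk IDs, and uses direct key lookups to build the allowed set before filtering candidates. The data and function names are invented for the example.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for record storage: notebook record ID -> chunk IDs.
NOTEBOOK_CHUNKS: Dict[str, List[str]] = {
    "notebook:alpha": ["chunk:1", "chunk:2"],
    "notebook:beta": ["chunk:3"],
}


def scoped_chunk_ids(selected_notebooks: Iterable[str]) -> Set[str]:
    """Collect the chunk IDs that belong to the selected notebooks."""
    allowed: Set[str] = set()
    for notebook_id in selected_notebooks:
        # Direct key lookup per notebook; unrelated notebooks are never scanned.
        allowed.update(NOTEBOOK_CHUNKS.get(notebook_id, []))
    return allowed


def filter_to_scope(
    candidates: Iterable[str],
    selected_notebooks: Iterable[str],
) -> List[str]:
    """Drop any retrieved chunk that falls outside the Notebook Scope."""
    allowed = scoped_chunk_ids(selected_notebooks)
    return [chunk_id for chunk_id in candidates if chunk_id in allowed]


in_scope = filter_to_scope(
    ["chunk:3", "chunk:1", "chunk:9"],
    ["notebook:alpha"],
)
```

Here only `chunk:1` survives: `chunk:3` belongs to a notebook the user did not select, and `chunk:9` is not attached to any known notebook.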
graph TD
    START((Query)) --> RETRIEVE[Hybrid Search + Graph]
    RETRIEVE --> RERANK[ModernBERT Rerank]
    RERANK --> GRADE{Good Docs?}
    GRADE -- "No AND rewrite_count < 3" --> REWRITE[Query Rewriter]
    REWRITE --> RETRIEVE
    GRADE -- "Yes OR rewrite_count == 3" --> END[Final Context Window]