Hybrid Retrieval

BioCortex uses a three-stage hybrid retrieval pipeline to select tools for each task. This design is 3–5× faster than LLM-only selection and scales to 10,000+ tools (vs. ~200 that fit in a single LLM context).

Pipeline Overview

User query
    │
    ▼
Stage 1: Vector semantic search   →  Top-50 candidates (~50 ms)
    │
    ▼
Stage 2: Knowledge graph expansion →  Expanded set (dependencies discovered)
    │
    ▼
Stage 3: LLM reranking            →  Top-15 tools (precision)

Stage 1: Vector Semantic Search

Goal: Fast recall of semantically relevant tools.

How it works:

Tool descriptions are encoded once (e.g. with all-MiniLM-L6-v2) into 384-dimensional, L2-normalized vectors and stored in an index (e.g. FAISS-compatible).
The user query is encoded with the same model.
Cosine similarity returns the top-50 tools.
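
The ranking step above can be sketched in plain Python. This is a minimal illustration, not the BioCortex implementation: the toy 3-dimensional vectors stand in for real 384-dimensional embeddings, and the helper names and the `blast_search` tool are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k_by_cosine(query_vec, tool_vecs, k=50):
    """Return the k (tool_name, score) pairs most similar to the query."""
    q = l2_normalize(query_vec)
    scored = [
        (name, sum(a * b for a, b in zip(q, l2_normalize(vec))))
        for name, vec in tool_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (assumption): a real system would use model embeddings.
tool_vecs = {
    "leiden_clustering": [0.9, 0.1, 0.0],
    "scanpy_preprocess": [0.7, 0.3, 0.1],
    "blast_search": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
candidates = top_k_by_cosine(query_vec, tool_vecs, k=2)
print([name for name, _ in candidates])
```

A production index would replace the linear scan with an approximate or batched search, but the scoring rule is the same.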

Advantages:

Speed: ~50 ms vs. several seconds for an LLM call.
Scale: Not limited by the LLM context window; works with 10,000+ tools.
Semantics: Finds tools even when keywords don’t match exactly.

Offline fallback: If the embedding model or index is unavailable, the system falls back to LLM-only retrieval (fewer tools, slower).

Stage 2: Knowledge Graph Expansion

Goal: Discover implicit dependencies and related tools.

How it works:

Start from the top-50 candidates from Stage 1.
Run a 2-hop BFS on:
- Tool–tool graph: adjacency from the tool registry (e.g. co-occurrence, pipeline dependency).
- BioKnowledgeGraph: entity–tool and domain–tool links (e.g. gene → pathway → analysis tool).
The expanded set includes all discovered tools (often 80–150).
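
The expansion step can be illustrated with a bounded breadth-first search. This sketch assumes the tool–tool graph and the knowledge-graph links have already been merged into one adjacency dictionary; the function name, the `max_nodes` cap, and the `marker_gene_report` and `out_of_range_tool` entries are hypothetical.

```python
from collections import deque

def expand_candidates(seeds, graph, max_depth=2, max_nodes=150):
    """Breadth-first expansion from the seed tools, bounded by depth and size."""
    seen = list(dict.fromkeys(seeds))  # keeps order, drops duplicates
    visited = set(seen)
    queue = deque((tool, 0) for tool in seen)
    while queue and len(seen) < max_nodes:
        tool, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look past the hop limit
        for neighbor in graph.get(tool, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                seen.append(neighbor)
                queue.append((neighbor, depth + 1))
                if len(seen) >= max_nodes:
                    break
    return seen

# Illustrative adjacency (assumption): edges point to related or required tools.
graph = {
    "scanpy_preprocess": ["scrublet_doublet_detection", "harmony_batch_correction"],
    "leiden_clustering": ["celltypist_annotation"],
    "celltypist_annotation": ["marker_gene_report"],  # 2 hops from a seed
    "marker_gene_report": ["out_of_range_tool"],      # 3 hops, never reached
}
seeds = ["scanpy_preprocess", "leiden_clustering"]
expanded = expand_candidates(seeds, graph)
print(expanded)
```

Seeds come first in the result, followed by discovered tools in order of distance, which keeps the output stable from run to run.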

Example: For “scRNA-seq clustering”, Stage 1 might return scanpy_preprocess and leiden_clustering. Stage 2 then adds scrublet_doublet_detection, harmony_batch_correction, and celltypist_annotation via dependency and domain links.

Advantages:

Discovers tools the user didn’t mention.
Ensures prerequisite and co-used tools are considered.
Uses GO/KEGG/UniProt-style relationships when available.

Stage 3: LLM Reranking

Goal: Precision selection based on full task context.

How it works:

The expanded set (with tool names and descriptions) is sent to the LLM.
The LLM is asked to select the top-15 tools most relevant to the task, considering:
- Task requirements and data types
- Pipeline order and dependencies
The returned list is the final tool set for the Planner/Executor.

Fallback: If the LLM call fails, the expanded set is truncated to top-15 by Stage 1 similarity score.
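
A minimal sketch of reranking with this fallback follows. The `llm_select` callable is a hypothetical stand-in for the real LLM call, and scoring Stage 2 additions as 0.0 when they have no Stage 1 score is an assumption of this example, not documented behavior.

```python
def rerank_with_fallback(task, expanded, stage1_scores, llm_select, top_k=15):
    """Ask the LLM for the final tools; fall back to Stage 1 scores on failure."""
    try:
        chosen = llm_select(task, expanded, top_k)
        # Keep only tools that were offered, in case the model invents names.
        allowed = set(expanded)
        return [tool for tool in chosen if tool in allowed][:top_k]
    except Exception:
        # Assumption: tools without a Stage 1 score rank last (score 0.0).
        ranked = sorted(
            expanded, key=lambda tool: stage1_scores.get(tool, 0.0), reverse=True
        )
        return ranked[:top_k]

def failing_llm(task, tools, top_k):
    """Simulates an LLM outage."""
    raise TimeoutError("LLM unavailable")

expanded_set = ["harmony_batch_correction", "leiden_clustering", "scanpy_preprocess"]
stage1_scores = {"leiden_clustering": 0.91, "scanpy_preprocess": 0.84}
final_tools = rerank_with_fallback(
    "scRNA-seq clustering", expanded_set, stage1_scores, failing_llm, top_k=2
)
print(final_tools)
```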

Embedding and Indexing

Model: e.g. sentence-transformers/all-MiniLM-L6-v2 (384-dim). The model can be cached under ./data/models/sentence_transformers for air-gapped use.
Text per tool: name + ": " + description + " (domain: " + domain + ")".
Query: Same encoding as tool text.
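
The per-tool text format above translates directly into a small helper. The function name and the sample description and domain are illustrative only.

```python
def tool_text(name, description, domain):
    """Build the string that gets embedded for one tool."""
    return name + ": " + description + " (domain: " + domain + ")"

text = tool_text(
    "leiden_clustering", "Cluster cells with the Leiden algorithm", "single-cell"
)
print(text)
```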

Knowledge Graph Role

The BioKnowledgeGraph contributes in Stage 2 via:

Tool–tool edges (requires_tool, compatible_with, same_domain)
Entity–tool links (e.g. gene, pathway, GO term → tools that use them)
Domain–tool membership

Subgraph extraction is bounded (depth and max nodes) to keep expansion predictable.

Configuration

Top-k per stage (e.g. 50 → 15) can be configured.
BFS depth and max expansion size for Stage 2 can be tuned.
Vector search can be disabled to force LLM-only retrieval (e.g. for debugging).

See Configuration and the code in biocortex.retrieval.
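
As a rough picture of how these settings relate, they could be grouped in one object. Every field name below is hypothetical and does not reflect the actual BioCortex configuration keys; the defaults mirror the numbers quoted on this page, and 150 is taken from the upper end of the typical expanded-set size.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical grouping of the tunable retrieval settings."""
    stage1_top_k: int = 50          # candidates kept after vector search
    final_top_k: int = 15           # tools kept after LLM reranking
    bfs_depth: int = 2              # hops explored in Stage 2
    max_expansion: int = 150        # cap on the expanded set
    use_vector_search: bool = True  # False forces LLM-only retrieval

    def __post_init__(self):
        # Reject values that would make a stage return nothing.
        for field_name in ("stage1_top_k", "final_top_k", "bfs_depth", "max_expansion"):
            if getattr(self, field_name) < 1:
                raise ValueError(f"{field_name} must be at least 1")

# Example: disable vector search while debugging.
debug_config = RetrievalConfig(use_vector_search=False)
print(debug_config)
```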

Native Web & Literature Search

In addition to tool retrieval, BioCortex includes a built-in web search layer that can be used directly by the agent without relying on external Biomni tools. This layer is registered in the tool registry and is automatically selected for tasks requiring current information or literature evidence.

Search Backends

Backend	Type	Requires
Tavily	General web search (AI-optimized)	`TAVILY_API_KEY`
Serper	Google-powered web search	`SERPER_API_KEY`
SerpAPI	Google results via SerpAPI	`SERPAPI_API_KEY`
DuckDuckGo	General web search (no key required)	—
PubMed / NCBI	Biomedical literature (full abstracts)	Optional `NCBI_EMAIL`
bioRxiv	Preprints in biology	—
Semantic Scholar	Scientific paper search + citations	—
arXiv	Preprints in CS, physics, math, quantitative biology	—

BioCortex tries backends in priority order (keyed backends first, then keyless fallbacks) and returns the first successful result.
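
The priority-order behavior can be sketched as a loop over backends. The function and stub names are hypothetical, and treating an empty result list as a failure is an assumption of this example.

```python
def search_with_fallback(query, backends):
    """Try each (name, search_fn) in order; return the first non-empty result."""
    errors = {}
    for name, search_fn in backends:
        try:
            results = search_fn(query)
        except Exception as exc:
            errors[name] = str(exc)
            continue
        if results:
            return name, results
        errors[name] = "no results"
    raise RuntimeError(f"all backends failed: {errors}")

def tavily_stub(query):
    """Simulates a keyed backend whose API key is missing."""
    raise PermissionError("missing TAVILY_API_KEY")

def duckduckgo_stub(query):
    """Simulates a keyless backend that succeeds."""
    return [f"result for {query}"]

backend, results = search_with_fallback(
    "CRISPR off-target",
    [("tavily", tavily_stub), ("duckduckgo", duckduckgo_stub)],
)
print(backend, results)
```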

arXiv Search

A dedicated arXiv search tool enables the agent to find and cite recent preprints:

search_arxiv(query, max_results=10, categories=["q-bio", "cs.AI"])

Results include title, authors, abstract, arXiv ID, and PDF link.

Environment Variables

TAVILY_API_KEY=tvly-...
SERPER_API_KEY=...
SERPAPI_API_KEY=...
NCBI_EMAIL=you@example.com   # increases NCBI rate limit
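
One way to picture how these variables affect backend choice: keyed web backends are usable only when their key is set, and the keyless one is always available. The names below are hypothetical, and the order among keyed backends simply follows the table above.

```python
import os

# (backend name, environment variable) pairs for the keyed web backends.
KEYED_BACKENDS = [
    ("tavily", "TAVILY_API_KEY"),
    ("serper", "SERPER_API_KEY"),
    ("serpapi", "SERPAPI_API_KEY"),
]
KEYLESS_BACKENDS = ["duckduckgo"]

def available_web_backends(env=None):
    """List usable web backends: keyed ones with a key set, then keyless ones."""
    env = os.environ if env is None else env
    keyed = [name for name, var in KEYED_BACKENDS if env.get(var)]
    return keyed + KEYLESS_BACKENDS

order = available_web_backends({"SERPER_API_KEY": "example-key"})
print(order)
```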

Related

Knowledge Graph: Structure and persistence.
Adding Tools and Agents: How new tools are registered and retrieved.
MCP Integration: External tool servers via Model Context Protocol.
