Hybrid Retrieval
BioCortex uses a three-stage hybrid retrieval pipeline to select tools for each task. This design is 3–5× faster than LLM-only selection and scales to 10,000+ tools (vs. ~200 that fit in a single LLM context).
Pipeline Overview
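The three stages chain into a single funnel: recall, expansion, then precision. The sketch below only illustrates that composition. The function name `select_tools`, the stage signatures, and the toy stand-in stages are assumptions made for this example and are not taken from the BioCortex code base.

```python
from typing import Callable, List

# Hypothetical stage signatures; each stage is passed in as a plain callable.
VectorSearch = Callable[[str, int], List[str]]       # (query, k) -> tool names
GraphExpand = Callable[[List[str]], List[str]]       # seeds -> expanded tool names
Rerank = Callable[[str, List[str], int], List[str]]  # (query, candidates, k) -> final tools


def select_tools(query: str,
                 vector_search: VectorSearch,
                 graph_expand: GraphExpand,
                 rerank: Rerank,
                 recall_k: int = 50,
                 final_k: int = 15) -> List[str]:
    """Chain the three stages: recall, expansion, precision."""
    seeds = vector_search(query, recall_k)   # Stage 1: top-k by similarity
    expanded = graph_expand(seeds)           # Stage 2: add graph neighbours
    return rerank(query, expanded, final_k)  # Stage 3: keep the final top-k


# Toy stand-ins so the sketch runs end to end.
def fake_search(query: str, k: int) -> List[str]:
    return ["scanpy_preprocess", "leiden_clustering"][:k]


def fake_expand(seeds: List[str]) -> List[str]:
    return seeds + ["scrublet_doublet_detection"]


def fake_rerank(query: str, candidates: List[str], k: int) -> List[str]:
    return sorted(candidates)[:k]


result = select_tools("cluster single-cell RNA-seq data",
                      fake_search, fake_expand, fake_rerank)
```

Passing the stages in as callables keeps each one replaceable, which is one way the LLM-only debugging mode described under Configuration could be wired in.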
Stage 1: Vector Semantic Search
Goal: Fast recall of semantically relevant tools.
How it works:
- Tool descriptions are encoded once (e.g. with all-MiniLM-L6-v2) into 384-dimensional, L2-normalized vectors and stored in an index (e.g. FAISS-compatible).
- The user query is encoded with the same model.
- Cosine similarity returns the top-50 tools.
Advantages:
- Speed: ~50 ms vs. several seconds for an LLM call.
- Scale: Independent of tool count; works with 10,000+ tools.
- Semantics: Finds tools even when keywords don’t match exactly.
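The retrieval step can be sketched in plain Python. To stay runnable without downloading a model, the `embed` function here is a toy hashed bag-of-words encoder standing in for a real sentence encoder such as all-MiniLM-L6-v2, so it matches on word overlap rather than meaning. The helper names `build_index` and `top_k` are hypothetical. Only the ranking logic (a dot product over L2-normalized vectors, which equals cosine similarity) mirrors the description above.

```python
import math
import zlib
from typing import Dict, List, Tuple

DIM = 384  # matches the 384-dimensional vectors described above


def embed(text: str) -> List[float]:
    """Toy hashed bag-of-words embedding, L2-normalized.

    Stand-in for a real sentence encoder; it captures word overlap only.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(tools: Dict[str, str]) -> Dict[str, List[float]]:
    """Encode every tool description once, ahead of query time."""
    return {name: embed(desc) for name, desc in tools.items()}


def top_k(query: str, index: Dict[str, List[float]], k: int = 50) -> List[Tuple[str, float]]:
    """Return the k tools with the highest cosine similarity to the query."""
    q = embed(query)
    # For unit-length vectors the dot product equals cosine similarity.
    scored = [(name, sum(a * b for a, b in zip(q, v))) for name, v in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


tools = {
    "leiden_clustering": "cluster cells with the leiden algorithm",
    "scanpy_preprocess": "normalize and filter single-cell counts",
    "blast_search": "align a protein sequence against a database",
}
index = build_index(tools)
hits = top_k("cluster single-cell data", index, k=2)
```

The linear scan is fine for a demonstration. At 10,000+ tools, a vector index such as FAISS would typically replace the loop in `top_k`.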
Stage 2: Knowledge Graph Expansion
Goal: Discover implicit dependencies and related tools.
How it works:
- Start from the top-50 candidates from Stage 1.
- Run a 2-hop BFS on:
  - Tool–tool graph: adjacency from the tool registry (e.g. co-occurrence, pipeline dependency).
  - BioKnowledgeGraph: entity–tool and domain–tool links (e.g. gene → pathway → analysis tool).
- The expanded set includes all discovered tools (often 80–150).
Example: Stage 1 returns scanpy_preprocess and leiden_clustering. Stage 2 adds scrublet_doublet_detection, harmony_batch_correction, and celltypist_annotation via dependency and domain links.
Advantages:
- Discovers tools the user didn’t mention.
- Ensures prerequisite and co-used tools are considered.
- Uses GO/KEGG/UniProt-style relationships when available.
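A minimal sketch of the 2-hop expansion, assuming the tool–tool graph and the BioKnowledgeGraph have been merged into one adjacency dictionary. The function name `expand_candidates`, the `tool_names` filter, the default cap of 150, and the edges in the toy graph are illustrative assumptions. Non-tool nodes (such as a domain) are traversed but not returned.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def expand_candidates(seeds: Iterable[str],
                      graph: Dict[str, Set[str]],
                      tool_names: Set[str],
                      max_hops: int = 2,
                      max_size: int = 150) -> List[str]:
    """Breadth-first expansion from the Stage 1 seeds, at most max_hops edges away."""
    selected: List[str] = []
    seen: Set[str] = set()
    queue: deque = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            selected.append(seed)
            queue.append((seed, 0))

    while queue and len(selected) < max_size:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not walk past the hop limit
        for neighbour in sorted(graph.get(node, ())):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            queue.append((neighbour, depth + 1))
            if neighbour in tool_names:  # entities are traversed, not returned
                selected.append(neighbour)
                if len(selected) >= max_size:
                    return selected
    return selected


# Toy graph; the edges are made up for illustration.
graph = {
    "scanpy_preprocess": {"scrublet_doublet_detection", "harmony_batch_correction"},
    "leiden_clustering": {"domain:single_cell"},
    "domain:single_cell": {"celltypist_annotation"},
    "celltypist_annotation": {"far_away_tool"},
}
tool_names = {
    "scanpy_preprocess", "leiden_clustering", "scrublet_doublet_detection",
    "harmony_batch_correction", "celltypist_annotation", "far_away_tool",
}
expanded = expand_candidates(["scanpy_preprocess", "leiden_clustering"], graph, tool_names)
```

In this toy graph `far_away_tool` sits three hops from the nearest seed, so the 2-hop limit excludes it.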
Stage 3: LLM Reranking
Goal: Precision selection based on full task context.
How it works:
- The expanded set (with tool names and descriptions) is sent to the LLM.
- The LLM is asked to select the top-15 tools most relevant to the task, considering:
  - Task requirements and data types
  - Pipeline order and dependencies
- The returned list is the final tool set for the Planner/Executor.
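The reranking step might look roughly like the following. The prompt wording, the JSON-array reply format, and the names `build_prompt` and `rerank` are assumptions for this sketch. The LLM is passed in as a plain callable so the example runs with a fake model and does not depend on any particular provider API.

```python
import json
from typing import Callable, Dict, List


def build_prompt(task: str, candidates: Dict[str, str], top_k: int) -> str:
    """List every candidate with its description and ask for a JSON array of names."""
    lines = [f"- {name}: {desc}" for name, desc in candidates.items()]
    return (
        f"Task: {task}\n\n"
        "Candidate tools:\n" + "\n".join(lines) + "\n\n"
        f"Select up to {top_k} tools most relevant to the task. Consider task "
        "requirements, data types, pipeline order and dependencies. "
        "Answer with a JSON array of tool names only."
    )


def rerank(task: str,
           candidates: Dict[str, str],
           llm: Callable[[str], str],
           top_k: int = 15) -> List[str]:
    """Ask the LLM for a selection and keep only names that are real candidates."""
    reply = llm(build_prompt(task, candidates, top_k))
    try:
        names = json.loads(reply)
    except json.JSONDecodeError:
        names = []
    if not isinstance(names, list):
        names = []
    picked: List[str] = []
    for name in names:
        # Drop unknown or duplicate names, keep the model's order.
        if isinstance(name, str) and name in candidates and name not in picked:
            picked.append(name)
    return picked[:top_k]


def fake_llm(prompt: str) -> str:
    """Stand-in model that returns a fixed reply, including one invalid name."""
    return '["leiden_clustering", "not_a_real_tool", "scanpy_preprocess", "leiden_clustering"]'


candidates = {
    "scanpy_preprocess": "Normalize and filter single-cell counts",
    "leiden_clustering": "Cluster cells with the Leiden algorithm",
    "blast_search": "Align a sequence against a database",
}
final_tools = rerank("cluster single-cell RNA-seq data", candidates, fake_llm)
```

Filtering the reply against the candidate set means a name the model invents never reaches the Planner/Executor.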
Embedding and Indexing
- Model: e.g. sentence-transformers/all-MiniLM-L6-v2 (384-dim). Can be cached under ./data/models/sentence_transformers for air-gapped use.
- Text per tool: name + ": " + description + " (domain: " + domain + ")".
- Query: Same encoding as tool text.
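The per-tool text format above translates directly into a small helper. The function name `tool_text` and the sample values are hypothetical; only the concatenation pattern comes from the description.

```python
def tool_text(name: str, description: str, domain: str) -> str:
    """Compose the string that is embedded for one tool."""
    return name + ": " + description + " (domain: " + domain + ")"


text = tool_text(
    "leiden_clustering",
    "Cluster cells with the Leiden algorithm",
    "single_cell",
)
```

Including the domain in the embedded text lets a query that names only a field, such as single-cell analysis, still land near the right tools.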
Knowledge Graph Role
The BioKnowledgeGraph contributes in Stage 2 via:
- Tool–tool edges (requires_tool, compatible_with, same_domain)
- Entity–tool links (e.g. gene, pathway, GO term → tools that use them)
- Domain–tool membership
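One way to picture these link types is as typed edges that are folded into an adjacency map before the BFS runs. The triple layout, the helper name `build_adjacency`, the `used_by` relation, and the choice to treat edges as undirected are all assumptions made for illustration. Only requires_tool, compatible_with, and same_domain are named in the list above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def build_adjacency(edges: Iterable[Edge],
                    relations: Optional[Set[str]] = None) -> Dict[str, Set[str]]:
    """Fold typed edges into an undirected adjacency map.

    If relations is given, only edges with those relation types are kept.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for source, relation, target in edges:
        if relations is not None and relation not in relations:
            continue
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


# Illustrative edges only; "used_by" is a hypothetical entity-tool relation.
edges = [
    ("leiden_clustering", "requires_tool", "scanpy_preprocess"),
    ("scanpy_preprocess", "compatible_with", "harmony_batch_correction"),
    ("pathway:example", "used_by", "celltypist_annotation"),
]
tool_only = build_adjacency(edges, relations={"requires_tool", "compatible_with"})
everything = build_adjacency(edges)
```

Keeping the relation type on each edge makes it possible to restrict expansion to, say, hard dependencies when the expanded set grows too large.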
Configuration
- Top-k per stage (e.g. 50 → 15) can be configured.
- BFS depth and max expansion size for Stage 2 can be tuned.
- Vector search can be disabled to force LLM-only retrieval (e.g. for debugging).
See biocortex.retrieval.
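The tunable settings listed above could be grouped in a single configuration object. This is a sketch under assumptions: the class name `RetrievalConfig` and every field name are hypothetical and do not reflect the actual biocortex.retrieval API. The defaults follow the figures quoted in this page (50, 15, 2 hops), and the cap of 150 is taken from the upper end of the typical expanded-set size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical container for the retrieval settings described above."""
    vector_top_k: int = 50          # Stage 1 recall size
    final_top_k: int = 15           # Stage 3 output size
    bfs_depth: int = 2              # Stage 2 hop limit
    max_expansion: int = 150        # Stage 2 cap on the expanded set
    use_vector_search: bool = True  # False forces LLM-only retrieval (debugging)

    def __post_init__(self) -> None:
        if self.vector_top_k < 1 or self.final_top_k < 1:
            raise ValueError("top-k values must be at least 1")
        if self.bfs_depth < 0:
            raise ValueError("bfs_depth must not be negative")
        if self.final_top_k > self.max_expansion:
            raise ValueError("final_top_k cannot exceed max_expansion")


default_cfg = RetrievalConfig()
debug_cfg = RetrievalConfig(use_vector_search=False)
```

Validating in `__post_init__` surfaces an inconsistent combination at construction time rather than midway through a retrieval run.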
Native Web & Literature Search
In addition to tool retrieval, BioCortex includes a built-in web search layer that can be used directly by the agent without relying on external Biomni tools. This layer is registered in the tool registry and is automatically selected for tasks requiring current information or literature evidence.
Search Backends
| Backend | Type | Requires |
|---|---|---|
| Tavily | General web search (AI-optimized) | TAVILY_API_KEY |
| Serper | Google-powered web search | SERPER_API_KEY |
| SerpAPI | Google results via SerpAPI | SERPAPI_API_KEY |
| DuckDuckGo | General web search (no key required) | — |
| PubMed / NCBI | Biomedical literature (full abstracts) | Optional NCBI_EMAIL |
| bioRxiv | Preprints in biology | — |
| Semantic Scholar | Scientific paper search + citations | — |
| arXiv | Preprints in CS, physics, math, quantitative biology | — |
arXiv Search
A dedicated arXiv search tool enables the agent to find and cite recent preprints.
Environment Variables
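As a rough sketch, the variables named in the Search Backends table can be checked at startup to decide which backends are usable. The variable names come from that table. The lowercase backend identifiers and the `available_backends` helper are hypothetical, and treating every keyless backend as always available is an assumption.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names are taken from the Search Backends table.
# None means the backend needs no API key.
BACKEND_KEYS: Dict[str, Optional[str]] = {
    "tavily": "TAVILY_API_KEY",
    "serper": "SERPER_API_KEY",
    "serpapi": "SERPAPI_API_KEY",
    "duckduckgo": None,
    "pubmed": None,  # NCBI_EMAIL is optional, so it does not gate availability
    "biorxiv": None,
    "semantic_scholar": None,
    "arxiv": None,
}


def available_backends(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return backends that need no key or whose key is set and non-empty."""
    return [
        name for name, var in BACKEND_KEYS.items()
        if var is None or env.get(var)
    ]


# Example with an explicit mapping instead of the real process environment.
usable = available_backends({"TAVILY_API_KEY": "example-key"})
```

Accepting the environment as a parameter keeps the helper easy to test without modifying real process variables.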
Related
- Knowledge Graph — Structure and persistence.
- Adding Tools and Agents — How new tools are registered and retrieved.
- MCP Integration — External tool servers via Model Context Protocol.