Hybrid Retrieval

BioCortex uses a three-stage hybrid retrieval pipeline to select tools for each task. This design is 3–5× faster than LLM-only selection and scales to 10,000+ tools (vs. ~200 that fit in a single LLM context).

Pipeline Overview

User query
    ↓
Stage 1: Vector semantic search    →  Top-50 candidates (~50 ms)
    ↓
Stage 2: Knowledge graph expansion →  Expanded set (dependencies discovered)
    ↓
Stage 3: LLM reranking             →  Top-15 tools (precision)

Stage 1: Vector Semantic Search

Goal: Fast recall of semantically relevant tools.
How it works:
  1. Tool descriptions are encoded once (e.g. with all-MiniLM-L6-v2) into 384-dimensional, L2-normalized vectors and stored in an index (e.g. FAISS-compatible).
  2. The user query is encoded with the same model.
  3. The 50 tools with the highest cosine similarity to the query vector are returned.
Advantages:
  • Speed: ~50 ms vs. several seconds for an LLM call.
  • Scale: Not limited by the LLM context window; works with 10,000+ tools.
  • Semantics: Finds tools even when keywords don’t match exactly.
Offline fallback: If the embedding model or index is unavailable, the system falls back to LLM-only retrieval (fewer tools, slower).
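
A minimal sketch of the Stage 1 ranking step, assuming the embeddings already exist. It uses pure Python and toy 3-dimensional vectors in place of the 384-dimensional model output; the function names (l2_normalize, top_k_by_cosine) and the vector values are illustrative and not part of BioCortex. Because the vectors are L2-normalized, cosine similarity reduces to a dot product.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_by_cosine(query_vec, tool_vecs, k=50):
    """Return the k (tool_name, score) pairs with the highest cosine similarity.

    tool_vecs maps tool name -> L2-normalized embedding, so the dot product
    with the normalized query equals the cosine similarity.
    """
    q = l2_normalize(query_vec)
    scored = (
        (name, sum(a * b for a, b in zip(q, vec)))
        for name, vec in tool_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings standing in for encoded tool descriptions (hypothetical values).
tool_index = {
    "leiden_clustering": l2_normalize([0.9, 0.1, 0.0]),
    "scanpy_preprocess": l2_normalize([0.7, 0.3, 0.1]),
    "blast_search": l2_normalize([0.0, 0.2, 0.9]),
}

candidates = top_k_by_cosine([1.0, 0.2, 0.0], tool_index, k=2)
```

A brute-force scan like this is linear in the number of tools; at the 10,000+ scale mentioned above, a dedicated vector index would normally take its place.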

Stage 2: Knowledge Graph Expansion

Goal: Discover implicit dependencies and related tools.
How it works:
  1. Start from the top-50 candidates from Stage 1.
  2. Run a 2-hop BFS on:
    • Tool–tool graph: adjacency from the tool registry (e.g. co-occurrence, pipeline dependency).
    • BioKnowledgeGraph: entity–tool and domain–tool links (e.g. gene → pathway → analysis tool).
  3. The expanded set includes all discovered tools (often 80–150).
Example: For “scRNA-seq clustering”, Stage 1 might return scanpy_preprocess, leiden_clustering. Stage 2 adds scrublet_doublet_detection, harmony_batch_correction, celltypist_annotation via dependency and domain links.
Advantages:
  • Discovers tools the user didn’t mention.
  • Ensures prerequisite and co-used tools are considered.
  • Uses GO/KEGG/UniProt-style relationships when available.
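
The expansion step can be sketched as a depth-limited breadth-first search. This is an illustration under assumptions, not the BioCortex implementation: the tool–tool graph and the BioKnowledgeGraph links are merged into one adjacency mapping, the function name expand_candidates is hypothetical, and the edges (including marker_gene_plot and report_export) are invented for the example.

```python
from collections import deque


def expand_candidates(seeds, graph, max_depth=2, max_nodes=150):
    """Breadth-first expansion of seed tools over an adjacency mapping.

    graph maps a tool name to an iterable of neighbouring tool names.
    Expansion stops at max_depth hops or once max_nodes tools are collected.
    """
    seen = list(dict.fromkeys(seeds))[:max_nodes]  # de-duplicate, keep order
    visited = set(seen)
    queue = deque((tool, 0) for tool in seen)
    while queue:
        tool, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look past the hop limit
        for neighbour in graph.get(tool, ()):
            if neighbour in visited:
                continue
            if len(seen) >= max_nodes:
                return seen
            visited.add(neighbour)
            seen.append(neighbour)
            queue.append((neighbour, depth + 1))
    return seen


# Hypothetical edges, loosely following the scRNA-seq example above.
toy_graph = {
    "scanpy_preprocess": ["scrublet_doublet_detection", "harmony_batch_correction"],
    "leiden_clustering": ["celltypist_annotation"],
    "celltypist_annotation": ["marker_gene_plot"],
    "marker_gene_plot": ["report_export"],
}

expanded = expand_candidates(["scanpy_preprocess", "leiden_clustering"], toy_graph)
```

In this toy graph, report_export sits three hops from the nearest seed, so the 2-hop limit leaves it out of the expanded set.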

Stage 3: LLM Reranking

Goal: Precision selection based on full task context.
How it works:
  1. The expanded set (with tool names and descriptions) is sent to the LLM.
  2. The LLM is asked to select the top-15 tools most relevant to the task, considering:
    • Task requirements and data types
    • Pipeline order and dependencies
  3. The returned list is the final tool set for the Planner/Executor.
Fallback: If the LLM call fails, the expanded set is truncated to top-15 by Stage 1 similarity score.
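
The fallback path can be sketched as follows. The names rerank_with_fallback and llm_select are hypothetical, and the LLM is represented by an injected callable rather than a real client. Two choices here are assumptions of the sketch, not documented behaviour: tools added in Stage 2 carry no Stage 1 score and are ranked last, and names the LLM returns that were never offered are dropped.

```python
def rerank_with_fallback(expanded, stage1_scores, llm_select, k=15):
    """Ask the LLM for the top-k tools; fall back to Stage 1 scores on failure.

    llm_select is any callable taking (tools, k) and returning a list of names.
    """
    try:
        chosen = llm_select(expanded, k)
        # Keep only names that were actually offered, preserving the LLM order.
        allowed = set(expanded)
        return [name for name in chosen if name in allowed][:k]
    except Exception:
        # Tools without a Stage 1 score sort to the end (assumption).
        ranked = sorted(
            expanded,
            key=lambda name: stage1_scores.get(name, float("-inf")),
            reverse=True,
        )
        return ranked[:k]


def failing_llm(tools, k):
    """Stand-in for an LLM client that is unavailable."""
    raise TimeoutError("LLM backend unreachable")


scores = {"leiden_clustering": 0.99, "scanpy_preprocess": 0.97}
tools = ["scanpy_preprocess", "celltypist_annotation", "leiden_clustering"]
final = rerank_with_fallback(tools, scores, failing_llm, k=2)
```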

Embedding and Indexing

  • Model: e.g. sentence-transformers/all-MiniLM-L6-v2 (384-dim). The model can be cached under ./data/models/sentence_transformers for air-gapped use.
  • Text per tool: name + ": " + description + " (domain: " + domain + ")".
  • Query: Encoded with the same model as the tool text.
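
The per-tool text format above amounts to a single string template. The helper name tool_text and the sample description and domain are illustrative only.

```python
def tool_text(name, description, domain):
    """Build the string that is embedded for one tool."""
    return f"{name}: {description} (domain: {domain})"


text = tool_text(
    "leiden_clustering",
    "Cluster cells with the Leiden algorithm",
    "single-cell",
)
```

With the sentence-transformers library, strings like this would typically be passed to SentenceTransformer.encode with normalize_embeddings=True to obtain L2-normalized vectors; whether BioCortex does exactly this is not stated here.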

Knowledge Graph Role

The BioKnowledgeGraph contributes in Stage 2 via:
  • Tool–tool edges (requires_tool, compatible_with, same_domain)
  • Entity–tool links (e.g. gene, pathway, GO term → tools that use them)
  • Domain–tool membership
Subgraph extraction is bounded (depth and max nodes) to keep expansion predictable.

Configuration

  • Top-k per stage (e.g. 50 → 15) can be configured.
  • BFS depth and max expansion size for Stage 2 can be tuned.
  • Vector search can be disabled to force LLM-only retrieval (e.g. for debugging).
See Configuration and the code in biocortex.retrieval.
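
As a rough picture of how these settings fit together, the sketch below groups them in one object. RetrievalConfig and all of its field names are hypothetical; the real option names are documented in Configuration. The defaults of 50, 15, and 2 come from the pipeline description, while the 150-tool cap is an assumed value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical container for the retrieval tunables."""

    stage1_top_k: int = 50          # candidates returned by vector search
    final_top_k: int = 15           # tools kept after LLM reranking
    bfs_depth: int = 2              # hops explored in Stage 2
    max_expansion: int = 150        # cap on the expanded set (assumed value)
    use_vector_search: bool = True  # False forces LLM-only retrieval

    def __post_init__(self):
        # Reject settings that would make a stage return nothing.
        for field_name in ("stage1_top_k", "final_top_k", "bfs_depth", "max_expansion"):
            if getattr(self, field_name) < 1:
                raise ValueError(f"{field_name} must be at least 1")


# Example: disable vector search for a debugging run.
debug_config = RetrievalConfig(use_vector_search=False)
```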

Web Search

In addition to tool retrieval, BioCortex includes a built-in web search layer that the agent can use directly, without relying on external Biomni tools. This layer is registered in the tool registry and is automatically selected for tasks that require current information or literature evidence.

Search Backends

Backend            Type                                                   Requires
Tavily             General web search (AI-optimized)                      TAVILY_API_KEY
Serper             Google-powered web search                              SERPER_API_KEY
SerpAPI            Google results via SerpAPI                             SERPAPI_API_KEY
DuckDuckGo         General web search (no key required)                   -
PubMed / NCBI      Biomedical literature (full abstracts)                 Optional NCBI_EMAIL
bioRxiv            Preprints in biology                                   -
Semantic Scholar   Scientific paper search + citations                    -
arXiv              Preprints in CS, physics, math, quantitative biology   -

BioCortex tries backends in priority order (keyed backends first, then keyless fallbacks) and returns the first successful result.

A dedicated arXiv search tool enables the agent to find and cite recent preprints:

    search_arxiv(query, max_results=10, categories=["q-bio", "cs.AI"])
Results include title, authors, abstract, arXiv ID, and PDF link.
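
The backend priority order described above (keyed backends first, then keyless fallbacks) can be sketched as a simple loop. The function search_with_fallback, the tuple layout, and the stub callables are hypothetical stand-ins for real HTTP clients; treating an empty result list as a failure is also an assumption of this sketch.

```python
import os


def search_with_fallback(query, backends, env=None):
    """Try backends in order and return (backend_name, results) from the first success.

    backends is a list of (name, required_env_var_or_None, search_callable).
    Keyed backends whose variable is unset are skipped; keyless ones always run.
    """
    env = os.environ if env is None else env
    errors = {}
    for name, required_var, search in backends:
        if required_var and not env.get(required_var):
            continue  # no API key configured for this backend
        try:
            results = search(query)
        except Exception as exc:
            errors[name] = exc
            continue
        if results:
            return name, results
    raise RuntimeError(f"all search backends failed or returned nothing: {errors}")


# Stand-in callables; real clients would issue HTTP requests.
def tavily_stub(query):
    raise ConnectionError("simulated outage")


def duckduckgo_stub(query):
    return [{"title": f"Result for {query}"}]


backends = [
    ("tavily", "TAVILY_API_KEY", tavily_stub),
    ("serper", "SERPER_API_KEY", duckduckgo_stub),
    ("duckduckgo", None, duckduckgo_stub),
]

used, hits = search_with_fallback(
    "CRISPR off-target prediction", backends, env={"TAVILY_API_KEY": "tvly-test"}
)
```

In this run the Tavily stub fails, Serper is skipped because no key is set, and the keyless DuckDuckGo stub supplies the result.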

Environment Variables

TAVILY_API_KEY=tvly-...
SERPER_API_KEY=...
SERPAPI_API_KEY=...
NCBI_EMAIL=you@example.com   # increases NCBI rate limit