System Architecture
BioCortex is organized into seven functional layers plus a central orchestrator and LLM router. This page summarizes each layer and how they work together.

High-Level Overview
1. Strategy Router
Role: Classify the user task and select the execution strategy.

- SimpleReAct — For direct queries, lookups, and single-step tasks. Uses an enhanced ReAct loop (plan → execute step → observe → update). ~3× faster than the full pipeline for simple tasks.
- DAG Parallel — For multi-step analytical workflows. Delegates to the Planner to build a TaskDAG; independent nodes at the same level run in parallel via asyncio.gather().
- MCTS — For exploratory research (e.g. drug discovery, hypothesis generation). Uses Monte Carlo Tree Search to explore reasoning paths; the best path is converted to a DAG and executed.
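The routing decision above can be sketched as a small classifier. This is a hypothetical keyword heuristic for illustration only (the names `Strategy`, `route`, and the keyword sets are assumptions, not BioCortex internals); a real router would typically classify the task with an LLM:

```python
from enum import Enum

class Strategy(Enum):
    SIMPLE_REACT = "simple_react"
    DAG_PARALLEL = "dag_parallel"
    MCTS = "mcts"

# Hypothetical keyword cues; the real system would use an LLM classifier.
EXPLORATORY = {"discover", "hypothesis", "explore", "candidates"}
MULTI_STEP = {"then", "compare", "pipeline", "workflow"}

def route(task: str) -> Strategy:
    words = set(task.lower().split())
    if words & EXPLORATORY:
        return Strategy.MCTS       # open-ended research -> tree search
    if words & MULTI_STEP:
        return Strategy.DAG_PARALLEL  # multi-step analysis -> planned DAG
    return Strategy.SIMPLE_REACT   # default: fast single-loop execution
```

The default branch matters: most traffic is simple lookups, so the cheap ReAct path is the fallback rather than the exception.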
2. Multi-Agent Pipeline (DAG Mode)
| Agent | Responsibility |
|---|---|
| Planner | Turns natural language into a TaskDAG: nodes with name, description, task_type, domain, tools_needed, dependencies, language, priority. Validated for acyclicity (e.g. NetworkX). |
| Executor | Translates each node into executable code; uses hybrid retrieval for tool selection; runs code in a sandbox; collects artifacts. |
| Critic | Validates each result (scientific correctness, completeness, consistency, statistical validity, errors). Produces structured ValidationResult with pass/fail and retry guidance. |
| Synthesizer | Compiles all results into a structured report (title, summary, methodology, findings, artifacts, limitations, next steps) with provenance links. |
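The Planner's output described above can be sketched as a node dataclass plus an acyclicity check. A minimal sketch, assuming NetworkX as the graph backend (the `TaskNode` and `build_dag` names are illustrative, not the actual API):

```python
from dataclasses import dataclass, field
import networkx as nx

@dataclass
class TaskNode:
    # Fields mirror the Planner's node schema described in the table above.
    name: str
    description: str = ""
    task_type: str = "analysis"
    domain: str = "genomics"
    tools_needed: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)  # names of upstream nodes
    language: str = "python"
    priority: int = 1

def build_dag(nodes: list[TaskNode]) -> nx.DiGraph:
    g = nx.DiGraph()
    for n in nodes:
        g.add_node(n.name, spec=n)
    for n in nodes:
        for dep in n.dependencies:
            g.add_edge(dep, n.name)  # edge: dependency -> dependent
    # Reject plans with circular dependencies before execution.
    if not nx.is_directed_acyclic_graph(g):
        raise ValueError("TaskDAG contains a cycle")
    return g
```

A topological sort of the validated graph then yields the execution order, and nodes with no unfinished predecessors can be dispatched concurrently.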
3. Hybrid Retrieval System
Three-stage pipeline for tool (and data) selection:

- Stage 1 — Vector semantic search: Query and tool descriptions are embedded (e.g. all-MiniLM-L6-v2); cosine similarity returns the top-50 candidates in ~50 ms. Scales to 10,000+ tools.
- Stage 2 — Knowledge graph expansion: From the top-50, a 2-hop BFS on the tool relationship graph and biological knowledge graph discovers implicit dependencies (e.g. batch correction, doublet removal for scRNA-seq).
- Stage 3 — LLM reranking: The expanded set is sent to the LLM for context-aware selection of the final top-15 tools.
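The first two stages can be sketched in a few lines. This toy version substitutes a bag-of-words vector for the real sentence-transformer embedding and a plain dict for the tool graph; all function names here are illustrative, and Stage 3 (LLM reranking) is deliberately omitted:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. The real system uses all-MiniLM-L6-v2 vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vector_stage(query: str, tools: dict[str, str], k: int = 50) -> list[str]:
    # Stage 1: rank tools by similarity between query and tool description.
    q = embed(query)
    return sorted(tools, key=lambda t: cosine(q, embed(tools[t])), reverse=True)[:k]

def graph_expand(candidates: list[str], edges: dict[str, list[str]], hops: int = 2) -> set[str]:
    # Stage 2: 2-hop BFS over the tool relationship graph to pull in dependencies.
    seen = set(candidates)
    frontier = set(candidates)
    for _ in range(hops):
        frontier = {nb for n in frontier for nb in edges.get(n, [])} - seen
        seen |= frontier
    return seen
```

The expansion step is what surfaces tools the query never mentions: a clustering query pulls in batch correction because the graph links them, not because the words match.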
4. Knowledge Graph Engine
The BioKnowledgeGraph is a typed, directed graph (e.g. NetworkX) storing biological entities and relationships. It supports:

- Node types: gene, protein, drug, disease, pathway, GO term, cell type, organism, tool, dataset, protocol, publication, domain.
- Edge types: interacts_with, regulates, inhibits, encodes, expressed_in, associated_with, targets, treats; requires_tool, produces_for; subclass_of, has_function; etc.
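A typed, directed graph of this shape is straightforward to model in NetworkX. A minimal sketch, assuming a `MultiDiGraph` (so the same node pair can carry several typed edges); the helper name is illustrative:

```python
import networkx as nx

kg = nx.MultiDiGraph()  # allows multiple typed edges between the same pair

# Nodes carry a node_type attribute; edges carry an edge_type attribute.
kg.add_node("TP53", node_type="gene")
kg.add_node("p53", node_type="protein")
kg.add_node("apoptosis", node_type="pathway")
kg.add_edge("TP53", "p53", edge_type="encodes")
kg.add_edge("p53", "apoptosis", edge_type="regulates")

def neighbors_by_edge(g: nx.MultiDiGraph, node: str, edge_type: str) -> list[str]:
    # Follow only outgoing edges of the requested type.
    return [v for _, v, d in g.out_edges(node, data=True) if d["edge_type"] == edge_type]
```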
5. Three-Tier Memory System
- Working memory — Current session conversation; hierarchical compression when the token count exceeds the budget (auto-calibrated from the model context window or a configured maximum).
- Episodic memory — Past analysis sessions (task, plan summary, tools, domains, result summary, success, quality, lessons). Engram-inspired recall: an N-gram fingerprint index (O(1) candidate lookup), a multi-signal index (content, tools, domains, entities), and a relevance gate. build_episodic_context() returns a token-budgeted string injected into the Planner.
- Semantic memory — (Subject, predicate, object) facts; integrated with the knowledge graph.
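The N-gram fingerprint index behind episodic recall can be sketched as a hash map from trigrams to episode IDs, which is what makes candidate lookup O(1) per n-gram. The class and method names here are illustrative, not the actual memory API:

```python
from collections import defaultdict

def ngrams(text: str, n: int = 3) -> set[str]:
    toks = text.lower().split()
    # Fall back to single tokens for texts shorter than n words.
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)} or set(toks)

class FingerprintIndex:
    def __init__(self):
        self.index = defaultdict(set)  # trigram -> {episode ids}
        self.episodes = {}             # episode id -> summary text

    def add(self, eid: str, summary: str) -> None:
        self.episodes[eid] = summary
        for g in ngrams(summary):
            self.index[g].add(eid)

    def candidates(self, query: str) -> list[str]:
        # Each query trigram is a constant-time hash lookup; episodes are
        # ranked by how many trigrams they share with the query.
        hits = defaultdict(int)
        for g in ngrams(query):
            for eid in self.index[g]:
                hits[eid] += 1
        return sorted(hits, key=hits.get, reverse=True)
```

In the full system this only produces candidates; the multi-signal index and relevance gate then decide what actually enters the Planner's context.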
6. Multimodal Fusion Layer
Encoders for sequences (ESM-2, DNABERT-2, RNA-FM), structure (PDB, ESMFold, contact maps), and images (CLIP or domain-specific). Fusion strategies: concatenation + random projection, weighted average, or text-only descriptions for LLM context. Enables cross-modal similarity and entity linking via the knowledge graph.

7. Execution Sandbox & Provenance
- SandboxExecutor — Isolated execution (subprocess in dev, Docker in prod); timeouts and resource limits; automatic artifact collection by file type.
- ProvenanceTracker — Records each step: code, inputs/outputs (with SHA256), tools, versions, stdout/stderr, success/failure. Export as JSON or reproducible Jupyter notebook.
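The provenance record described above can be sketched as a list of hashed step entries. A minimal sketch with illustrative names (`record`, `export_json`); only the SHA256 hashing and JSON export reflect what the section actually states:

```python
import hashlib
import json
import time

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ProvenanceTracker:
    def __init__(self):
        self.steps = []

    def record(self, code: str, inputs: dict[str, bytes], outputs: dict[str, bytes],
               tools: list[str], success: bool, stdout: str = "", stderr: str = "") -> None:
        # Hash code and artifacts so a run can later be verified bit-for-bit.
        self.steps.append({
            "timestamp": time.time(),
            "code_sha256": sha256_of(code.encode()),
            "inputs": {name: sha256_of(data) for name, data in inputs.items()},
            "outputs": {name: sha256_of(data) for name, data in outputs.items()},
            "tools": tools,
            "success": success,
            "stdout": stdout,
            "stderr": stderr,
        })

    def export_json(self) -> str:
        return json.dumps(self.steps, indent=2)
```

Exporting to a reproducible notebook would walk the same step list and emit one code cell per recorded step.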
Orchestrator & LLM Router
- Orchestrator — Receives the natural language task; calls the Strategy Router; runs the chosen strategy (ReAct, DAG pipeline, or MCTS); injects episodic context into the Planner; coordinates the pipeline and returns the final report.
- LLM Router — Manages role-specific models (reasoning, coder, fast) and fallback chains across providers (OpenAI, Anthropic, Azure, Gemini, Groq, Ollama, custom). All calls are context-window guarded (see Context Window and Budget).
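The fallback behavior can be sketched as an ordered chain per role. The role names match the text above, but the model identifiers, chain contents, and function names here are placeholders, not the project's configuration:

```python
# Hypothetical role -> ordered fallback chain; identifiers are placeholders.
FALLBACK_CHAINS = {
    "reasoning": ["anthropic/claude", "openai/gpt", "ollama/local"],
    "coder": ["openai/gpt", "groq/llama"],
    "fast": ["groq/llama", "ollama/local"],
}

def call_with_fallback(role: str, prompt: str, clients: dict) -> str:
    # Try each provider in the role's chain; fall through on any failure.
    errors = []
    for model in FALLBACK_CHAINS.get(role, FALLBACK_CHAINS["fast"]):
        try:
            return clients[model](prompt)
        except Exception as exc:
            errors.append((model, str(exc)))
    raise RuntimeError(f"all providers failed for role {role!r}: {errors}")
```

A production router would also apply the context-window guard before each call, truncating or compressing the prompt to fit the selected model's limit.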
Package Layout
| Package | Contents |
|---|---|
| biocortex.core | Orchestrator, agents (Planner, Executor, Critic, Synthesizer), DAG, strategy router, MCTS, self-refinement, provenance, grounding, tool creation |
| biocortex.retrieval | Hybrid retrieval pipeline |
| biocortex.knowledge | Biological knowledge graph |
| biocortex.memory | Three-tier memory engine |
| biocortex.multimodal | Encoders and fusion |
| biocortex.execution | Sandboxed executor |
| biocortex.domains | Domain tools and registry loader |
| biocortex.adapters | Biomni tool bridge |
| biocortex.llm | Multi-model LLM router |
| biocortex.web | FastAPI app, REST/WebSocket, state |
Next Steps
- Strategy Routing — Classification and strategy selection in detail.
- Multi-Agent Pipeline — Planner, Executor, Critic, Synthesizer.
- Hybrid Retrieval — Three-stage tool selection.