System Architecture

BioCortex is organized into seven functional layers plus a central orchestrator and LLM router. This page summarizes each layer and how they work together.

High-Level Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                         Strategy Router                                  │
│  (classify task → SimpleReAct | DAG Parallel | MCTS)                     │
└─────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────┐
│                     Multi-Agent Pipeline                                 │
│  Planner → Executor → Critic → Synthesizer  (DAG-parallel execution)      │
└─────────────────────────────────────────────────────────────────────────┘
         │                │                │
         ▼                ▼                ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────────────────────────────┐
│   Hybrid     │  │  Knowledge   │  │  Three-Tier Memory                   │
│   Retrieval  │  │  Graph       │  │  (working + episodic + semantic)     │
└──────────────┘  └──────────────┘  └──────────────────────────────────────┘
         │                │                │
         └────────────────┼────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│  Multimodal Fusion  │  Execution Sandbox  │  Provenance Tracker         │
└─────────────────────────────────────────────────────────────────────────┘

1. Strategy Router

Role: Classify the user task and select the execution strategy.
  • SimpleReAct — For direct queries, lookups, and single-step tasks. Uses an enhanced ReAct loop (plan → execute step → observe → update). ~3× faster than the full pipeline for simple tasks.
  • DAG Parallel — For multi-step analytical workflows. Delegates to the Planner to build a TaskDAG; independent nodes at the same level run in parallel via asyncio.gather().
  • MCTS — For exploratory research (e.g. drug discovery, hypothesis generation). Uses Monte Carlo Tree Search to explore reasoning paths; the best path is converted to a DAG and executed.
Classification uses a fast heuristic (regex + structural cues) and, when confidence is low (< 0.75), an LLM-based classifier. Output includes strategy, confidence, complexity, estimated steps, and biological domains.
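The heuristic stage above can be sketched as a small classifier. This is an illustrative stand-in, not the BioCortex API: `classify_task`, `RouteDecision`, and the keyword patterns are assumptions; only the three strategy names and the 0.75 confidence threshold come from the text.

```python
import re
from dataclasses import dataclass

@dataclass
class RouteDecision:
    strategy: str      # "simple_react" | "dag_parallel" | "mcts"
    confidence: float  # below 0.75, the router would fall back to the LLM classifier

# Hypothetical regex cues; the real heuristic combines regex and structural signals.
EXPLORATORY = re.compile(r"\b(discover|hypothes|explore|novel|candidate)\w*", re.I)
MULTI_STEP = re.compile(r"\b(then|after|pipeline|compare|workflow)\b", re.I)

def classify_task(task: str) -> RouteDecision:
    if EXPLORATORY.search(task):
        return RouteDecision("mcts", 0.8)
    # Structural cue: several clauses or sequencing words suggest a DAG workflow.
    if len(MULTI_STEP.findall(task)) >= 2 or task.count(",") >= 2:
        return RouteDecision("dag_parallel", 0.7)
    return RouteDecision("simple_react", 0.9)
```

A decision whose confidence falls below the threshold would then be re-classified by the LLM before dispatch.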

2. Multi-Agent Pipeline (DAG Mode)

Agent responsibilities:
  • Planner — Turns natural language into a TaskDAG: nodes with name, description, task_type, domain, tools_needed, dependencies, language, priority. Validated for acyclicity (e.g. NetworkX).
  • Executor — Translates each node into executable code; uses hybrid retrieval for tool selection; runs code in a sandbox; collects artifacts.
  • Critic — Validates each result (scientific correctness, completeness, consistency, statistical validity, errors). Produces a structured ValidationResult with pass/fail and retry guidance.
  • Synthesizer — Compiles all results into a structured report (title, summary, methodology, findings, artifacts, limitations, next steps) with provenance links.
Execution is level-by-level: topological generations are computed; within each level, all ready nodes run in parallel. Failed nodes can be retried (reflection-guided repair, deep retry, or LATM tool creation); failed branches are isolated so successful branches still complete.
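Level-by-level execution can be sketched with the standard library's `graphlib`, which yields exactly these "ready" batches (a minimal sketch: `run_node` stands in for the real Executor, and retry/isolation logic is omitted).

```python
import asyncio
from graphlib import TopologicalSorter

async def run_node(name: str, results: dict) -> None:
    await asyncio.sleep(0)          # placeholder for sandboxed code execution
    results[name] = f"done:{name}"

async def execute_dag(deps: dict[str, set[str]]) -> dict[str, str]:
    results: dict[str, str] = {}
    ts = TopologicalSorter(deps)    # maps node -> its dependencies
    ts.prepare()                    # also validates acyclicity
    while ts.is_active():
        level = ts.get_ready()      # all nodes whose dependencies are satisfied
        # Independent nodes at the same level run in parallel.
        await asyncio.gather(*(run_node(n, results) for n in level))
        ts.done(*level)
    return results

# Example DAG: qc and align are independent; report depends on both.
deps = {"qc": set(), "align": set(), "report": {"qc", "align"}}
results = asyncio.run(execute_dag(deps))
```

Here `qc` and `align` execute concurrently in the first level; `report` runs alone in the second.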

3. Hybrid Retrieval System

Three-stage pipeline for tool (and data) selection:
  1. Stage 1 — Vector semantic search
    Query and tool descriptions are embedded (e.g. all-MiniLM-L6-v2); cosine similarity returns top-50 candidates in ~50 ms. Scales to 10,000+ tools.
  2. Stage 2 — Knowledge graph expansion
    From the top-50, a 2-hop BFS on the tool relationship graph and biological knowledge graph discovers implicit dependencies (e.g. batch correction, doublet removal for scRNA-seq).
  3. Stage 3 — LLM reranking
    The expanded set is sent to the LLM for context-aware selection of the final top-15 tools.
When vector search is unavailable (e.g. offline), the system falls back to LLM-only retrieval.
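The three stages can be condensed into a toy pipeline. Everything here is an illustrative stand-in: the hand-written 2-D vectors replace real embeddings (e.g. all-MiniLM-L6-v2), the expansion walks one hop instead of two, and Stage 3's LLM rerank is only indicated by a comment.

```python
import math

# Toy tool "embeddings" and a tiny dependency graph (hypothetical names).
TOOL_VECS = {"scanpy_cluster": [1.0, 0.0], "harmony_batch": [0.9, 0.4], "blastp": [0.0, 1.0]}
TOOL_GRAPH = {"scanpy_cluster": {"harmony_batch", "scrublet_doublets"}}  # implicit deps

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, top_k=2):
    # Stage 1: vector semantic search over tool descriptions.
    ranked = sorted(TOOL_VECS, key=lambda t: cosine(query_vec, TOOL_VECS[t]), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: graph expansion discovers implicit dependencies.
    expanded = set(candidates)
    for t in candidates:
        expanded |= TOOL_GRAPH.get(t, set())
    # Stage 3 would send `expanded` to the LLM for context-aware reranking.
    return expanded
```

A clustering-like query pulls in `scrublet_doublets` through the graph even though it never matched the query vector directly.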

4. Knowledge Graph Engine

The BioKnowledgeGraph is a typed, directed graph (e.g. NetworkX) storing biological entities and relationships. It supports:
  • Node types: gene, protein, drug, disease, pathway, GO term, cell type, organism, tool, dataset, protocol, publication, domain.
  • Edge types: interacts_with, regulates, inhibits, encodes, expressed_in, associated_with, targets, treats; requires_tool, produces_for; subclass_of, has_function; etc.
Used for: Stage 2 retrieval expansion, memory fact storage, Planner dependency ordering, multimodal entity linking, and KG-grounded verification (hallucination detection: grounded vs inferred vs ungrounded claims).
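A typed, directed graph with grounded-claim checking can be sketched in a few lines of plain Python (the real BioKnowledgeGraph is richer and built on e.g. NetworkX; `TypedGraph` and `is_grounded` are illustrative names, and the "inferred" tier is omitted).

```python
from collections import defaultdict

class TypedGraph:
    def __init__(self):
        self.node_type: dict[str, str] = {}
        self.edges: defaultdict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_node(self, name: str, ntype: str) -> None:
        self.node_type[name] = ntype

    def add_edge(self, src: str, etype: str, dst: str) -> None:
        self.edges[src].append((etype, dst))

    def is_grounded(self, src: str, etype: str, dst: str) -> bool:
        # KG-grounded verification: a claim is "grounded" when the exact
        # typed edge exists; anything else would be "inferred" or "ungrounded".
        return (etype, dst) in self.edges.get(src, [])

kg = TypedGraph()
kg.add_node("TP53", "gene")
kg.add_node("apoptosis", "pathway")
kg.add_edge("TP53", "regulates", "apoptosis")
```

A claim like "TP53 inhibits apoptosis" would fail the grounding check because that typed edge is absent.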

5. Three-Tier Memory System

  • Working memory — Current session conversation; hierarchical compression when token count exceeds budget (auto-calibrated from model context or configured max).
  • Episodic memory — Past analysis sessions (task, plan summary, tools, domains, result summary, success, quality, lessons). Engram-inspired recall: N-gram fingerprint index (O(1) candidates), multi-signal index (content, tools, domains, entities), relevance gate. build_episodic_context() returns a token-budgeted string injected into the Planner.
  • Semantic memory — (Subject, predicate, object) facts; integrated with the knowledge graph.
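The Engram-inspired fingerprint recall can be sketched with character trigrams: each n-gram maps to episode IDs for O(1) candidate lookup, and a minimum-overlap gate filters irrelevant sessions. The trigram choice and all names here are assumptions, not BioCortex internals.

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

class EpisodicIndex:
    def __init__(self):
        self.index: defaultdict[str, set[int]] = defaultdict(set)
        self.episodes: list[str] = []

    def add(self, summary: str) -> None:
        eid = len(self.episodes)
        self.episodes.append(summary)
        for g in trigrams(summary):
            self.index[g].add(eid)      # fingerprint -> candidate episode IDs

    def recall(self, query: str, min_overlap: int = 3) -> list[str]:
        hits: defaultdict[int, int] = defaultdict(int)
        for g in trigrams(query):
            for eid in self.index.get(g, ()):
                hits[eid] += 1
        # Relevance gate: keep episodes sharing enough fingerprints, best first.
        ranked = sorted(hits.items(), key=lambda kv: -kv[1])
        return [self.episodes[e] for e, n in ranked if n >= min_overlap]
```

The real system layers a multi-signal index (content, tools, domains, entities) on top of this candidate stage before building the token-budgeted context string.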

6. Multimodal Fusion Layer

Encoders for sequences (ESM-2, DNABERT-2, RNA-FM), structure (PDB, ESMFold, contact maps), and images (CLIP or domain-specific). Fusion strategies: concatenation + random projection, weighted average, or text-only descriptions for LLM context. Enables cross-modal similarity and entity linking via the knowledge graph.
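The "concatenation + random projection" strategy can be sketched without any encoder: concatenate two modality embeddings, then project to a fixed output dimension with a seeded Gaussian matrix so the map is stable across calls. Dimensions and names are illustrative (real inputs come from ESM-2, CLIP, etc.).

```python
import random

def random_projection(dim_in: int, dim_out: int, seed: int = 0) -> list[list[float]]:
    # Fixed seed so every fusion call shares the same projection matrix.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1 / dim_out ** 0.5) for _ in range(dim_in)]
            for _ in range(dim_out)]

def fuse(seq_emb: list[float], img_emb: list[float], dim_out: int = 4) -> list[float]:
    concat = seq_emb + img_emb      # concatenate modality embeddings
    proj = random_projection(len(concat), dim_out)
    return [sum(w * x for w, x in zip(row, concat)) for row in proj]

fused = fuse([0.2, 0.9, 0.1], [0.5, 0.4])
```

The seeded matrix is what makes fused vectors comparable across entities, which is the property cross-modal similarity search relies on.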

7. Execution Sandbox & Provenance

  • SandboxExecutor — Isolated execution (subprocess in dev, Docker in prod); timeouts and resource limits; automatic artifact collection by file type.
  • ProvenanceTracker — Records each step: code, inputs/outputs (with SHA256), tools, versions, stdout/stderr, success/failure. Export as JSON or reproducible Jupyter notebook.
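A dev-mode subprocess run plus a provenance record can be sketched as follows; the record's field names are illustrative, and the real SandboxExecutor adds resource limits, artifact collection, and Docker isolation in production.

```python
import hashlib
import subprocess
import sys

def run_step(code: str, timeout: float = 10.0) -> dict:
    # Isolated execution: a fresh interpreter subprocess with a hard timeout.
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    # Provenance record: hash the exact code that ran, keep streams and status.
    return {
        "code_sha256": hashlib.sha256(code.encode()).hexdigest(),
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "success": proc.returncode == 0,
    }

record = run_step("print(2 + 2)")
```

Hashing inputs and outputs the same way is what lets a JSON export (or generated notebook) prove which bytes produced which result.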

Orchestrator & LLM Router

  • Orchestrator — Receives the natural language task; calls the Strategy Router; runs the chosen strategy (ReAct, DAG pipeline, or MCTS); injects episodic context into the Planner; coordinates the pipeline and returns the final report.
  • LLM Router — Manages role-specific models (reasoning, coder, fast) and fallback chains across providers (OpenAI, Anthropic, Azure, Gemini, Groq, Ollama, custom). All calls are context-window guarded (see Context Window and Budget).
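Role-specific models with fallback chains reduce to a loop over an ordered list of providers. The provider names below echo the text, but `ROLE_CHAINS`, `route`, and the model identifiers are hypothetical, and the context-window guard is only indicated by a comment.

```python
ROLE_CHAINS = {
    "reasoning": ["anthropic/large", "openai/large", "ollama/local"],
    "coder":     ["openai/coder", "groq/coder"],
    "fast":      ["groq/small", "gemini/flash"],
}

def route(role: str, call) -> str:
    errors = []
    for model in ROLE_CHAINS[role]:
        try:
            return call(model)       # context-window guarding would happen here
        except RuntimeError as exc:  # provider down -> try the next in the chain
            errors.append((model, str(exc)))
    raise RuntimeError(f"all providers failed for role {role!r}: {errors}")

# Simulated call where the first reasoning provider is unavailable.
def flaky(model: str) -> str:
    if model == "anthropic/large":
        raise RuntimeError("provider unavailable")
    return f"answer from {model}"
```

The chain makes outages invisible to the agents: they request a role, never a concrete provider.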

Package Layout

Packages:
  • biocortex.core — Orchestrator, agents (Planner, Executor, Critic, Synthesizer), DAG, strategy router, MCTS, self-refinement, provenance, grounding, tool creation
  • biocortex.retrieval — Hybrid retrieval pipeline
  • biocortex.knowledge — Biological knowledge graph
  • biocortex.memory — Three-tier memory engine
  • biocortex.multimodal — Encoders and fusion
  • biocortex.execution — Sandboxed executor
  • biocortex.domains — Domain tools and registry loader
  • biocortex.adapters — Biomni tool bridge
  • biocortex.llm — Multi-model LLM router
  • biocortex.web — FastAPI app, REST/WebSocket, state

Next Steps