Hypothesis strategy — scientific reasoning engine

TL;DR — For open-ended scientific questions (“What causes X?”, “Propose a mechanism for X”), BioCortex selects the Hypothesis strategy. It runs an iterative generate → test → evaluate → refine loop and produces a report with an explicit evidence chain, not a one-shot LLM guess.

Why a Hypothesis strategy?

Many biomedical questions are open scientific hypothesis problems. They are neither “run this fixed pipeline” nor “search for the best analysis path”; they need:
  1. Multiple candidate mechanisms / hypotheses
  2. Designed evidence collection
  3. Evaluation of supporting vs contradicting evidence
  4. Rejection of failed hypotheses and refinement of survivors
  5. A final conclusion with calibrated confidence
BioCortex’s other strategies serve different jobs:
| Strategy | Best for | Core idea |
| --- | --- | --- |
| SimpleReAct | Single-step Q&A, format conversion | ReAct loop, direct execution |
| DAG Parallel | Multi-step omics workflows | Decompose into a DAG and run in parallel |
| MCTS | Unknown optimal analysis path | Monte Carlo tree search over paths |
| Hypothesis | Open mechanistic science questions | Multi-round hypothesis generate → test → refine |
Hypothesis mode reasons at the scientific level—like a biologist weighing explanations, gathering evidence, and converging on a justified answer—not merely executing a canned pipeline.

Design influences

The Hypothesis engine borrows ideas from three families of systems:

From Biomni A1

think → <execute> → <observe>  loop + [✓]/[✗] checklist
BioCortex uses an explicit plan–execute–observe triple for each hypothesis round: design a test → run it → record observations.

From PantheonOS Evolution

Analyzer → Mutator → Evaluator + failure history (avoid repeating dead ends)
BioCortex keeps rejected hypothesis history (rejected_history): falsified directions are recorded so later rounds avoid the same failed ideas and do not waste compute.

From CellType CLI EvidenceReasoner

Weighted claim scoring + contradiction detection
BioCortex uses weighted evidence and contradiction penalties: opposing evidence is weighted 1.5× supporting evidence so hypotheses cannot “slip through” with strong contradictions.

End-to-end flow

┌─────────────────────────────────────────────────────────────────┐
│              Hypothesis strategy — execution flow               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User task: "What drives immunotherapy resistance in TNBC?"     │
│                       │                                          │
│                       ▼                                          │
│  ── Round 0: Generate hypotheses ──────────────────────────────│
│  LLM proposes 3–5 candidates (statement + mechanism + prediction)│
│                                                                  │
│  H1 ○  PD-L1/PD-1 pathway mutations exhaust T cells  [conf=0.50]│
│  H2 ○  TGF-β signaling promotes immune-exclusion     [conf=0.50]│
│  H3 ○  MDSC accumulation in TME suppresses effector T [conf=0.50]│
│  H4 ○  CDK4/6 dysregulation beyond checkpoints       [conf=0.50]│
│  H5 ○  Epigenetic silencing of MHC-I enables escape  [conf=0.50]│
│                       │                                          │
│  ── Round 1: Test all active hypotheses ───────────────────────│
│                       │                                          │
│  For each active hypothesis:                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  _design_test()   ← pick best test for this hypothesis  │    │
│  │       type: code / literature / knowledge_graph / reasoning│   │
│  │  _execute_test()  ← run the test                        │    │
│  │       code → sandbox                                     │    │
│  │       literature → PubMed + Semantic Scholar             │    │
│  │       knowledge_graph → BioKnowledgeGraph                │    │
│  │       reasoning → LLM assessment                         │    │
│  │  _evaluate_evidence() ← polarity + confidence            │    │
│  │       polarity: supports / contradicts / neutral         │    │
│  │       confidence: 0.0 – 1.0                              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                       │                                          │
│  net_confidence = 0.5 + (∑support − 1.5×∑oppose) / (2×total)     │
│                       │                                          │
│  H1 ✗  conf=0.22 → REJECTED (contradictions > 2× support)        │
│  H2 ↻  conf=0.61 → SUPPORTED                                     │
│  H3 ○  conf=0.54 → ACTIVE                                        │
│  H4 ○  conf=0.51 → ACTIVE                                        │
│  H5 ↻  conf=0.67 → SUPPORTED                                     │
│                       │                                          │
│  ── Round 2: Refine + continue testing ────────────────────────│
│  _refine_hypotheses(): tighten statements for still-ACTIVE items │
│  Continue H3, H4 (new tests; no duplicates of prior tests)       │
│  rejected_history blocks designs too similar to H1             │
│                       │                                          │
│  ── Convergence ───────────────────────────────────────────────│
│  If net_confidence ≥ 0.80 and round ≥ 2 → CONVERGED → early stop │
│  Else continue until max_rounds (default 4)                     │
│                       │                                          │
│  ── Final synthesis ─────────────────────────────────────────────│
│  _pick_winner() + _synthesize_report()                           │
│                                                                  │
│  ★  Leading hypothesis + evidence chain                         │
│     Alternatives (including rejected directions and why)         │
│     Confidence assessment                                        │
│     Recommended next steps (3–5 concrete experiments)            │
└─────────────────────────────────────────────────────────────────┘
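The round loop in the diagram can be sketched as a minimal runnable version. The `Hyp` class, `run_rounds`, and the injected `gather_evidence` callback are illustrative assumptions, not the actual BioCortex API; only the loop shape, the 1.5× opposition weight, and the thresholds come from the diagram above.

```python
from dataclasses import dataclass, field

@dataclass
class Hyp:
    """Toy stand-in for the real Hypothesis object (assumption)."""
    statement: str
    evidence_for: list = field(default_factory=list)      # confidences, 0.0-1.0
    evidence_against: list = field(default_factory=list)
    status: str = "ACTIVE"

    @property
    def net_confidence(self) -> float:
        pos, neg = sum(self.evidence_for), sum(self.evidence_against)
        total = max(pos + neg, 0.01)                      # avoid div-by-zero
        return min(max(0.5 + (pos - 1.5 * neg) / (2 * total), 0.0), 1.0)

def run_rounds(hypotheses, gather_evidence,
               max_rounds=4, convergence_threshold=0.80, min_rounds=2):
    """gather_evidence(hyp) stands in for design -> execute -> evaluate."""
    for round_no in range(1, max_rounds + 1):
        for hyp in hypotheses:
            if hyp.status != "REJECTED":                  # only test live hypotheses
                gather_evidence(hyp)
        best = max(hypotheses, key=lambda h: h.net_confidence)
        if round_no >= min_rounds and best.net_confidence >= convergence_threshold:
            best.status = "CONVERGED"                     # early stop
            break
    return max(hypotheses, key=lambda h: h.net_confidence)
```

A `gather_evidence` stub that keeps appending strong support to one hypothesis drives it to CONVERGED after the minimum two rounds.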

Core data structures

Hypothesis — one hypothesis object

@dataclass
class Hypothesis:
    id: str
    statement: str          # One-sentence claim
    mechanism: str          # Proposed biology (1–2 sentences)
    prediction: str         # Testable prediction
    status: HypothesisStatus  # ACTIVE / SUPPORTED / REJECTED / REFINED / CONVERGED
    confidence: float       # Current confidence 0.0–1.0
    evidence_for: list[EvidenceItem]
    evidence_against: list[EvidenceItem]
    tests_run: list[str]    # Avoid duplicate tests
    refinement_history: list[str]

    @property
    def net_confidence(self) -> float:
        # Weighted net score: opposing evidence counts 1.5x
        pos = sum(e.confidence for e in self.evidence_for)
        neg = sum(e.confidence for e in self.evidence_against)
        total = max(pos + neg, 0.01)  # floor avoids division by zero
        return min(max(0.5 + (pos - 1.5 * neg) / (2 * total), 0.0), 1.0)

EvidenceItem — one piece of evidence

@dataclass
class EvidenceItem:
    source: str           # "pubmed_search" / "code_execution" / "kg_query" / "reasoning"
    content: str
    polarity: EvidencePolarity   # SUPPORTS / CONTRADICTS / NEUTRAL
    confidence: float
    test_description: str
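The two enums referenced by these dataclasses are not shown here; a minimal sketch consistent with the statuses and polarities listed above (the member values are assumptions):

```python
from enum import Enum

class HypothesisStatus(Enum):
    ACTIVE = "active"        # still being tested
    SUPPORTED = "supported"  # net confidence above 0.6
    REJECTED = "rejected"    # falsified; logged in rejected_history
    REFINED = "refined"      # statement tightened after evidence
    CONVERGED = "converged"  # terminal; triggers early stop

class EvidencePolarity(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    NEUTRAL = "neutral"
```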

HypothesisStatus state machine

ACTIVE ──(supporting evidence, conf > 0.6)──▶ SUPPORTED
   │                                              │
   │ (refine, round ≥ 2)                          │ (conf ≥ 0.80, round ≥ 2)
   ▼                                              ▼
REFINED                                      CONVERGED ★ (terminal, early stop)

ACTIVE / REFINED ──(oppose > 2× support and ≥ 2 opposing items)──▶ REJECTED ✗
                    (logged in rejected_history; no further tests)
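The rejection guard can be expressed as a small predicate. The function name and signature are assumptions; the rule itself (opposing mass more than twice supporting mass, with at least two opposing items) comes from the state machine above.

```python
def should_reject(support_confs: list[float], oppose_confs: list[float]) -> bool:
    """Reject only when opposition clearly dominates: total opposing
    confidence exceeds 2x supporting confidence AND there are at least
    two independent opposing evidence items."""
    return len(oppose_confs) >= 2 and sum(oppose_confs) > 2 * sum(support_confs)
```

Requiring two opposing items prevents a single noisy contradiction from killing a hypothesis outright.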

Confidence scoring

Formula

net_confidence = clamp(
    0.5 + (pos_sum − 1.5 × neg_sum) / (2 × total_sum),
    min=0.0, max=1.0
)

where:
  pos_sum  = Σ evidence_for[i].confidence
  neg_sum  = Σ evidence_against[i].confidence
  total_sum = pos_sum + neg_sum  (floor 0.01 to avoid div-by-zero)

Rationale

  • Start at 0.5 — No evidence ⇒ neutral prior.
  • 1.5× penalty on opposition — Contradictions should weigh more than weak support (falsifiability).
  • Normalize by total evidence — Prevents “stacking” many low-quality supporting snippets.

Example

H2:
  evidence_for:     [0.7, 0.6]  → pos_sum = 1.3
  evidence_against: [0.4]       → neg_sum = 0.4; weighted opposition = 1.5 × 0.4 = 0.6
  total_sum = pos_sum + neg_sum = 1.7

net_confidence = 0.5 + (1.3 − 0.6) / (2 × 1.7) ≈ 0.706 → SUPPORTED
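The worked example can be checked directly with a standalone version of the formula (a sketch, not the actual `net_confidence` property):

```python
def net_confidence(evidence_for: list[float], evidence_against: list[float]) -> float:
    pos = sum(evidence_for)
    neg = sum(evidence_against)
    total = max(pos + neg, 0.01)       # floor avoids division by zero
    return min(max(0.5 + (pos - 1.5 * neg) / (2 * total), 0.0), 1.0)

round(net_confidence([0.7, 0.6], [0.4]), 3)   # 0.706, matching H2 above
net_confidence([], [])                        # 0.5: no evidence, neutral prior
```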

Four test types

_design_test() picks a modality from hypothesis content and context:
| Type | When used | Implementation |
| --- | --- | --- |
| code | Data files (.h5ad, .csv, …) and testable stats/plots | Sandbox Python execution |
| literature | Mechanism needs literature grounding | PubMed + Semantic Scholar |
| knowledge_graph | Gene–pathway–disease relations | BioKnowledgeGraph neighborhood |
| reasoning | No data/tools or quick sanity check | LLM reasoning over biology knowledge |
Each run yields an EvidenceItem; _evaluate_evidence() assigns polarity.
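A dispatch over the four modalities might look like the sketch below. The handler bodies and the returned dict shape are placeholders; only the test-type names and the evidence `source` strings come from the tables above.

```python
def execute_test(test_type: str, description: str) -> dict:
    """Route a designed test to its modality (placeholder handlers)."""
    handlers = {
        "code":            lambda d: {"source": "code_execution", "content": f"ran: {d}"},
        "literature":      lambda d: {"source": "pubmed_search", "content": f"searched: {d}"},
        "knowledge_graph": lambda d: {"source": "kg_query", "content": f"queried: {d}"},
        "reasoning":       lambda d: {"source": "reasoning", "content": f"assessed: {d}"},
    }
    if test_type not in handlers:
        raise ValueError(f"unknown test type: {test_type}")
    return handlers[test_type](description)
```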

Refinement

From round 2 onward, _refine_hypotheses() runs on hypotheses still ACTIVE:
Original:  "TGF-β signaling promotes immune exclusion"
Evidence:   Support: SMAD2/3 phosphorylation up (literature)
            Oppose: TGF-β inactive in some TNBC subtypes
Refined:    "TGF-β/SMAD2 signaling in Claudin-low TNBC promotes
             immune exclusion via ECM remodeling"
After refinement:
  • Prior wording moves to refinement_history
  • Status becomes REFINED
  • Later tests target the tighter claim
vs PantheonOS: their mutator edits candidate code/parameters; here we refine the scientific statement itself.
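The bookkeeping after a refinement can be sketched as follows; the `refine` helper and the plain-object stand-in are assumptions, and only the archive-then-tighten behaviour is taken from the list above.

```python
from types import SimpleNamespace

def refine(hyp, new_statement: str):
    hyp.refinement_history.append(hyp.statement)  # archive prior wording
    hyp.statement = new_statement                 # later tests target this claim
    hyp.status = "REFINED"
    return hyp

h2 = SimpleNamespace(statement="TGF-β signaling promotes immune exclusion",
                     refinement_history=[], status="ACTIVE")
refine(h2, "TGF-β/SMAD2 signaling in Claudin-low TNBC promotes "
           "immune exclusion via ECM remodeling")
```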

Rejected history

rejected_history: list[str] = []
# Example:
# "H1: PD-L1 mutations exhaust T cells — rejected: meta-analyses show no
#   strong link between PD-L1 mutation rate and resistance (literature, conf=0.72)"
Each _design_test() prompt includes:
Previously REJECTED hypotheses (avoid repeating):
- H1: PD-L1 ... — rejected: ...
This prevents the model from re-proposing falsified mechanisms or near-duplicate tests.
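Rendering that prompt section from `rejected_history` takes a few lines; the exact wording used by `_design_test` is an assumption.

```python
def rejected_section(rejected_history: list[str]) -> str:
    """Render falsified directions as a prompt block (empty if none)."""
    if not rejected_history:
        return ""
    return "\n".join(["Previously REJECTED hypotheses (avoid repeating):",
                      *[f"- {entry}" for entry in rejected_history]])
```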

Final report outline

## Research Question
[Restatement]

## Methodology
[Hypothesis strategy; N rounds; M hypotheses tested]

## Key Findings
[Evidence summary]

## Leading Hypothesis ★
[Best-supported statement]
- Mechanism: ...
- Confidence: ...
- Evidence for: ...
- Evidence against: ...

## Alternative Hypotheses
[Status of others, including rejections]

## Confidence Assessment
[Drivers and limitations]

## Recommended Next Steps
1. ...

When Hypothesis auto-triggers

Phase-1 heuristics in strategy routing look for patterns such as:
| Pattern / cue | Example phrasing |
| --- | --- |
| hypothesis | “Generate a hypothesis about X” |
| propose mechanism | “Mechanism by which FOXP3 suppresses T cells” |
| what causes | “What causes temozolomide resistance in glioblastoma?” |
| mechanism of/behind/for | “Mechanism of resistance to KRAS G12C inhibitors” |
| why does/do + disease/resistance | “Why do tumors evade immune checkpoint blockade?” |
| novel biomarker | “Novel biomarkers for early pancreatic cancer” |
| how does … resist/evade/escape | “How does GBM escape temozolomide?” |
| propose … target/candidate/pathway | “Candidate therapeutic targets for ALS” |
If hypothesis_score ≥ 2 and it beats the other strategies' scores, Hypothesis is chosen; otherwise Phase-2 LLM classification makes the final call.
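A plausible shape for the Phase-1 scorer, with illustrative regexes rather than the real `_HYPOTHESIS_INDICATORS` list:

```python
import re

# Illustrative cue patterns; the actual indicator list may differ.
CUES = [
    r"\bhypothesis\b",
    r"\bpropose (a )?mechanism\b",
    r"\bwhat causes\b",
    r"\bmechanism (of|behind|for)\b",
    r"\bwhy (does|do)\b",
    r"\bnovel biomarker",
    r"\bhow does .* (resist|evade|escape)",
    r"\bpropose .* (target|candidate|pathway)",
]

def hypothesis_score(task: str) -> int:
    """Count distinct cues present in the task text."""
    text = task.lower()
    return sum(bool(re.search(pattern, text)) for pattern in CUES)
```

A score of 2 or more (that also beats the other strategies' scores) selects the Hypothesis strategy.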

Tunable parameters

| Parameter | Default | Meaning |
| --- | --- | --- |
| max_rounds | 4 | Maximum iteration rounds |
| max_hypotheses | 5 | Cap on initial hypotheses |
| convergence_threshold | 0.80 | Early stop when net confidence reaches this |
| min_rounds | 2 | Minimum rounds even if convergence looks early |
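Mirrored as a config object (the dataclass itself is an assumption; the names and defaults are from the table):

```python
from dataclasses import dataclass

@dataclass
class HypothesisConfig:
    max_rounds: int = 4                  # maximum iteration rounds
    max_hypotheses: int = 5              # cap on initial hypotheses
    convergence_threshold: float = 0.80  # early-stop net confidence
    min_rounds: int = 2                  # minimum rounds before early stop
```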

Force Hypothesis mode

biocortex run -s hypothesis "What drives EMT in pancreatic cancer?"

# In biocortex CLI
/strategy hypothesis

# Python API
agent.go(task, strategy=ReasoningStrategy.HYPOTHESIS)

Compared to MCTS

| Aspect | MCTS | Hypothesis |
| --- | --- | --- |
| Search object | Analysis path (tools + order) | Scientific claim (which mechanism) |
| Feedback | Simulated path value from LLM | Polarity scores from real evidence |
| Output | Best path → report | Best-supported hypothesis + chain → report |
| State | Tree nodes (visits / reward) | Hypothesis objects (confidence / evidence / status) |
| Stop | Fixed budget | Converged confidence or max rounds |
| Fit | “How should I analyze this dataset?” | “Why does this biological phenomenon occur?” |

Code locations

| File | Role |
| --- | --- |
| biocortex/core/hypothesis.py | HypothesisReasoner, dataclasses, main loop |
| biocortex/core/orchestrator.py | _run_hypothesis(), context + packaging |
| biocortex/core/strategy.py | _HYPOTHESIS_INDICATORS, heuristics + LLM routing |
| biocortex/tools/search_tool.py | pubmed_search(), semantic_scholar_search() for literature tests |

Future improvements

  1. Data-aware generation — If .h5ad is attached, inject a data summary (cells, genes, clusters) into round-0 generation.
  2. Deeper KG — Fully wire knowledge_graph tests to BioKnowledgeGraph.context() and 2-hop neighborhoods.
  3. Cross-hypothesis evidence — Share evidence across H2/H3 when relevant.
  4. Calibration — Tune convergence_threshold and the 1.5 factor using BixBench-style evals.