Hypothesis strategy — scientific reasoning engine

TL;DR — For open-ended scientific questions (“What causes X?”, “Propose a mechanism for X”), BioCortex selects the Hypothesis strategy. It runs an iterative generate → test → evaluate → refine loop and produces a report with an explicit evidence chain, not a one-shot LLM guess.

Why a Hypothesis strategy?

Many biomedical questions are open scientific hypothesis problems. They are neither “run this fixed pipeline” nor “search for the best analysis path”; they need:
  1. Multiple candidate mechanisms / hypotheses
  2. Designed evidence collection
  3. Evaluation of supporting vs contradicting evidence
  4. Rejection of failed hypotheses and refinement of survivors
  5. A final conclusion with calibrated confidence
BioCortex’s other strategies serve different jobs:
| Strategy | Best for | Core idea |
| --- | --- | --- |
| SimpleReAct | Single-step Q&A, format conversion | ReAct loop, direct execution |
| DAG Parallel | Multi-step omics workflows | Decompose into a DAG and run in parallel |
| MCTS | Unknown optimal analysis path | Monte Carlo tree search over paths |
| Hypothesis | Open mechanistic science questions | Multi-round hypothesis generate → test → refine |
Hypothesis mode reasons at the scientific level—like a biologist weighing explanations, gathering evidence, and converging on a justified answer—not merely executing a canned pipeline.

Design influences

The Hypothesis engine borrows ideas from three families of systems:

From Biomni A1

think → <execute> → <observe>  loop + [✓]/[✗] checklist
BioCortex uses an explicit plan–execute–observe triple for each hypothesis round: design a test → run it → record observations.

From PantheonOS Evolution

Analyzer → Mutator → Evaluator + failure history (avoid repeating dead ends)
BioCortex keeps rejected hypothesis history (rejected_history): falsified directions are recorded so later rounds avoid the same failed ideas and do not waste compute.

From CellType CLI EvidenceReasoner

Weighted claim scoring + contradiction detection
BioCortex uses weighted evidence and contradiction penalties: opposing evidence is weighted 1.5× supporting evidence so hypotheses cannot “slip through” with strong contradictions.

End-to-end flow

┌─────────────────────────────────────────────────────────────────┐
│              Hypothesis strategy — execution flow               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User task: "What drives immunotherapy resistance in TNBC?"     │
│                       │                                          │
│                       ▼                                          │
│  ── Round 0: Generate hypotheses ──────────────────────────────│
│  LLM proposes 3–5 candidates (statement + mechanism + prediction)│
│                                                                  │
│  H1 ○  PD-L1/PD-1 pathway mutations exhaust T cells  [conf=0.50]│
│  H2 ○  TGF-β signaling promotes immune-exclusion     [conf=0.50]│
│  H3 ○  MDSC accumulation in TME suppresses effector T [conf=0.50]│
│  H4 ○  CDK4/6 dysregulation beyond checkpoints       [conf=0.50]│
│  H5 ○  Epigenetic silencing of MHC-I enables escape  [conf=0.50]│
│                       │                                          │
│  ── Round 1: Test all active hypotheses ───────────────────────│
│                       │                                          │
│  For each active hypothesis:                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  _design_test()   ← pick best test for this hypothesis  │    │
│  │       type: code / literature / knowledge_graph / reasoning│   │
│  │  _execute_test()  ← run the test                        │    │
│  │       code → sandbox                                     │    │
│  │       literature → PubMed + Semantic Scholar             │    │
│  │       knowledge_graph → BioKnowledgeGraph                │    │
│  │       reasoning → LLM assessment                         │    │
│  │  _evaluate_evidence() ← polarity + confidence            │    │
│  │       polarity: supports / contradicts / neutral         │    │
│  │       confidence: 0.0 – 1.0                              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                       │                                          │
│  net_confidence = 0.5 + (∑support − 1.5×∑oppose) / (2×total)     │
│                       │                                          │
│  H1 ✗  conf=0.22 → REJECTED (contradictions > 2× support)        │
│  H2 ↻  conf=0.61 → SUPPORTED                                     │
│  H3 ○  conf=0.54 → ACTIVE                                        │
│  H4 ○  conf=0.51 → ACTIVE                                        │
│  H5 ↻  conf=0.67 → SUPPORTED                                     │
│                       │                                          │
│  ── Round 2: Refine + continue testing ────────────────────────│
│  _refine_hypotheses(): tighten statements for still-ACTIVE items │
│  Continue H3, H4 (new tests; no duplicates of prior tests)       │
│  rejected_history blocks designs too similar to H1             │
│                       │                                          │
│  ── Convergence ───────────────────────────────────────────────│
│  If net_confidence ≥ 0.80 and round ≥ 2 → CONVERGED → early stop │
│  Else continue until max_rounds (default 4)                     │
│                       │                                          │
│  ── Final synthesis ─────────────────────────────────────────────│
│  _pick_winner() + _synthesize_report()                           │
│                                                                  │
│  ★  Leading hypothesis + evidence chain                         │
│     Alternatives (including rejected directions and why)         │
│     Confidence assessment                                        │
│     Recommended next steps (3–5 concrete experiments)            │
└─────────────────────────────────────────────────────────────────┘
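The round loop in the diagram can be sketched as a minimal runnable version. The `Hyp` class, `run_rounds`, and the injected `gather_evidence` callback are illustrative assumptions, not the actual BioCortex API; only the loop shape, the 1.5× opposition weight, and the thresholds come from the diagram above.

```python
from dataclasses import dataclass, field

@dataclass
class Hyp:
    """Toy stand-in for the real Hypothesis object (assumption)."""
    statement: str
    evidence_for: list = field(default_factory=list)      # confidences, 0.0-1.0
    evidence_against: list = field(default_factory=list)
    status: str = "ACTIVE"

    @property
    def net_confidence(self) -> float:
        pos, neg = sum(self.evidence_for), sum(self.evidence_against)
        total = max(pos + neg, 0.01)                      # avoid div-by-zero
        return min(max(0.5 + (pos - 1.5 * neg) / (2 * total), 0.0), 1.0)

def run_rounds(hypotheses, gather_evidence,
               max_rounds=4, convergence_threshold=0.80, min_rounds=2):
    """gather_evidence(hyp) stands in for design -> execute -> evaluate."""
    for round_no in range(1, max_rounds + 1):
        for hyp in hypotheses:
            if hyp.status != "REJECTED":                  # only test live hypotheses
                gather_evidence(hyp)
        best = max(hypotheses, key=lambda h: h.net_confidence)
        if round_no >= min_rounds and best.net_confidence >= convergence_threshold:
            best.status = "CONVERGED"                     # early stop
            break
    return max(hypotheses, key=lambda h: h.net_confidence)
```

A `gather_evidence` stub that keeps appending strong support to one hypothesis drives it to CONVERGED after the minimum two rounds.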

Core data structures

Hypothesis — one hypothesis object

@dataclass
class Hypothesis:
    id: str
    statement: str          # One-sentence claim
    mechanism: str          # Proposed biology (1–2 sentences)
    prediction: str         # Testable prediction
    status: HypothesisStatus  # ACTIVE / SUPPORTED / REJECTED / REFINED / CONVERGED
    confidence: float       # Current confidence 0.0–1.0
    evidence_for: list[EvidenceItem]
    evidence_against: list[EvidenceItem]
    tests_run: list[str]    # Avoid duplicate tests
    refinement_history: list[str]

    @property
    def net_confidence(self) -> float:
        # Weighted net score: opposing evidence counts 1.5x
        pos = sum(e.confidence for e in self.evidence_for)
        neg = sum(e.confidence for e in self.evidence_against)
        total = max(pos + neg, 0.01)  # floor avoids division by zero
        return min(max(0.5 + (pos - 1.5 * neg) / (2 * total), 0.0), 1.0)

EvidenceItem — one piece of evidence

@dataclass
class EvidenceItem:
    source: str           # "pubmed_search" / "code_execution" / "kg_query" / "reasoning"
    content: str
    polarity: EvidencePolarity   # SUPPORTS / CONTRADICTS / NEUTRAL
    confidence: float
    test_description: str
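The two enums referenced by these dataclasses are not shown here; a minimal sketch consistent with the statuses and polarities listed above (the member values are assumptions):

```python
from enum import Enum

class HypothesisStatus(Enum):
    ACTIVE = "active"        # still being tested
    SUPPORTED = "supported"  # net confidence above 0.6
    REJECTED = "rejected"    # falsified; logged in rejected_history
    REFINED = "refined"      # statement tightened after evidence
    CONVERGED = "converged"  # terminal; triggers early stop

class EvidencePolarity(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    NEUTRAL = "neutral"
```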

HypothesisStatus state machine

ACTIVE ──(supporting evidence, conf > 0.6)──▶ SUPPORTED
   │                                              │
   │ (refine, round ≥ 2)                          │ (conf ≥ 0.80, round ≥ 2)
   ▼                                              ▼
REFINED                                      CONVERGED ★ (terminal, early stop)

ACTIVE / REFINED ──(oppose > 2× support and ≥ 2 opposing items)──▶ REJECTED ✗
                    (logged in rejected_history; no further tests)
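The rejection guard can be expressed as a small predicate. The function name and signature are assumptions; the rule itself (opposing mass more than twice supporting mass, with at least two opposing items) comes from the state machine above.

```python
def should_reject(support_confs: list[float], oppose_confs: list[float]) -> bool:
    """Reject only when opposition clearly dominates: total opposing
    confidence exceeds 2x supporting confidence AND there are at least
    two independent opposing evidence items."""
    return len(oppose_confs) >= 2 and sum(oppose_confs) > 2 * sum(support_confs)
```

Requiring two opposing items prevents a single noisy contradiction from killing a hypothesis outright.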

Confidence scoring

Formula

net_confidence = clamp(
    0.5 + (pos_sum − 1.5 × neg_sum) / (2 × total_sum),
    min=0.0, max=1.0
)

where:
  pos_sum  = Σ evidence_for[i].confidence
  neg_sum  = Σ evidence_against[i].confidence
  total_sum = pos_sum + neg_sum  (floor 0.01 to avoid div-by-zero)

Rationale

  • Start at 0.5 — No evidence ⇒ neutral prior.
  • 1.5× penalty on opposition — Contradictions should weigh more than weak support (falsifiability).
  • Normalize by total evidence — Prevents “stacking” many low-quality supporting snippets.

Example

H2:
  evidence_for:     [0.7, 0.6]  → pos_sum = 1.3
  evidence_against: [0.4]       → neg_sum = 0.4; weighted opposition = 1.5 × 0.4 = 0.6
  total_sum = pos_sum + neg_sum = 1.7

net_confidence = 0.5 + (1.3 − 0.6) / (2 × 1.7) ≈ 0.706 → SUPPORTED
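The worked example can be checked directly with a standalone version of the formula (a sketch, not the actual `net_confidence` property):

```python
def net_confidence(evidence_for: list[float], evidence_against: list[float]) -> float:
    pos = sum(evidence_for)
    neg = sum(evidence_against)
    total = max(pos + neg, 0.01)       # floor avoids division by zero
    return min(max(0.5 + (pos - 1.5 * neg) / (2 * total), 0.0), 1.0)

round(net_confidence([0.7, 0.6], [0.4]), 3)   # 0.706, matching H2 above
net_confidence([], [])                        # 0.5: no evidence, neutral prior
```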

Four test types

_design_test() picks a modality from hypothesis content and context:
| Type | When used | Implementation |
| --- | --- | --- |
| code | Data files (.h5ad, .csv, …) and testable stats/plots | Sandbox Python execution |
| literature | Mechanism needs literature grounding | PubMed + Semantic Scholar |
| knowledge_graph | Gene–pathway–disease relations | BioKnowledgeGraph neighborhood |
| reasoning | No data/tools or quick sanity check | LLM reasoning over biology knowledge |
Each run yields an EvidenceItem; _evaluate_evidence() assigns polarity.
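A dispatch over the four modalities might look like the sketch below. The handler bodies and the returned dict shape are placeholders; only the test-type names and the evidence `source` strings come from the tables above.

```python
def execute_test(test_type: str, description: str) -> dict:
    """Route a designed test to its modality (placeholder handlers)."""
    handlers = {
        "code":            lambda d: {"source": "code_execution", "content": f"ran: {d}"},
        "literature":      lambda d: {"source": "pubmed_search", "content": f"searched: {d}"},
        "knowledge_graph": lambda d: {"source": "kg_query", "content": f"queried: {d}"},
        "reasoning":       lambda d: {"source": "reasoning", "content": f"assessed: {d}"},
    }
    if test_type not in handlers:
        raise ValueError(f"unknown test type: {test_type}")
    return handlers[test_type](description)
```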

Refinement

From round 2 onward, _refine_hypotheses() runs on hypotheses still ACTIVE:
Original:  "TGF-β signaling promotes immune exclusion"
Evidence:   Support: SMAD2/3 phosphorylation up (literature)
            Oppose: TGF-β inactive in some TNBC subtypes
Refined:    "TGF-β/SMAD2 signaling in Claudin-low TNBC promotes
             immune exclusion via ECM remodeling"
After refinement:
  • Prior wording moves to refinement_history
  • Status becomes REFINED
  • Later tests target the tighter claim
vs PantheonOS: their mutator edits candidate code/parameters; here we refine the scientific statement itself.
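The bookkeeping after a refinement can be sketched as follows; the `refine` helper and the plain-object stand-in are assumptions, and only the archive-then-tighten behaviour is taken from the list above.

```python
from types import SimpleNamespace

def refine(hyp, new_statement: str):
    hyp.refinement_history.append(hyp.statement)  # archive prior wording
    hyp.statement = new_statement                 # later tests target this claim
    hyp.status = "REFINED"
    return hyp

h2 = SimpleNamespace(statement="TGF-β signaling promotes immune exclusion",
                     refinement_history=[], status="ACTIVE")
refine(h2, "TGF-β/SMAD2 signaling in Claudin-low TNBC promotes "
           "immune exclusion via ECM remodeling")
```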

Rejected history

rejected_history: list[str] = []
# Example:
# "H1: PD-L1 mutations exhaust T cells — rejected: meta-analyses show no
#   strong link between PD-L1 mutation rate and resistance (literature, conf=0.72)"
Each _design_test() prompt includes:
Previously REJECTED hypotheses (avoid repeating):
- H1: PD-L1 ... — rejected: ...
This prevents the model from re-proposing falsified mechanisms or near-duplicate tests.
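Rendering that prompt section from `rejected_history` takes a few lines; the exact wording used by `_design_test` is an assumption.

```python
def rejected_section(rejected_history: list[str]) -> str:
    """Render falsified directions as a prompt block (empty if none)."""
    if not rejected_history:
        return ""
    return "\n".join(["Previously REJECTED hypotheses (avoid repeating):",
                      *[f"- {entry}" for entry in rejected_history]])
```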

Final report outline

## Research Question
[Restatement]

## Methodology
[Hypothesis strategy; N rounds; M hypotheses tested]

## Key Findings
[Evidence summary]

## Leading Hypothesis ★
[Best-supported statement]
- Mechanism: ...
- Confidence: ...
- Evidence for: ...
- Evidence against: ...

## Alternative Hypotheses
[Status of others, including rejections]

## Confidence Assessment
[Drivers and limitations]

## Recommended Next Steps
1. ...

When Hypothesis auto-triggers

Phase-1 heuristics in strategy routing look for patterns such as:
| Pattern / cue | Example phrasing |
| --- | --- |
| hypothesis | “Generate a hypothesis about X” |
| propose mechanism | “Mechanism by which FOXP3 suppresses T cells” |
| what causes | “What causes temozolomide resistance in glioblastoma?” |
| mechanism of/behind/for | “Mechanism of resistance to KRAS G12C inhibitors” |
| why does/do + disease/resistance | “Why do tumors evade immune checkpoint blockade?” |
| novel biomarker | “Novel biomarkers for early pancreatic cancer” |
| how does … resist/evade/escape | “How does GBM escape temozolomide?” |
| propose … target/candidate/pathway | “Candidate therapeutic targets for ALS” |
If hypothesis_score ≥ 2 and it beats the other strategies' scores, Hypothesis is chosen; otherwise Phase-2 LLM classification makes the final call.
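A plausible shape for the Phase-1 scorer, with illustrative regexes rather than the real `_HYPOTHESIS_INDICATORS` list:

```python
import re

# Illustrative cue patterns; the actual indicator list may differ.
CUES = [
    r"\bhypothesis\b",
    r"\bpropose (a )?mechanism\b",
    r"\bwhat causes\b",
    r"\bmechanism (of|behind|for)\b",
    r"\bwhy (does|do)\b",
    r"\bnovel biomarker",
    r"\bhow does .* (resist|evade|escape)",
    r"\bpropose .* (target|candidate|pathway)",
]

def hypothesis_score(task: str) -> int:
    """Count distinct cues present in the task text."""
    text = task.lower()
    return sum(bool(re.search(pattern, text)) for pattern in CUES)
```

A score of 2 or more (that also beats the other strategies' scores) selects the Hypothesis strategy.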

Tunable parameters

| Parameter | Default | Meaning |
| --- | --- | --- |
| max_rounds | 4 | Maximum iteration rounds |
| max_hypotheses | 5 | Cap on initial hypotheses |
| convergence_threshold | 0.80 | Early stop when net confidence reaches this |
| min_rounds | 2 | Minimum rounds even if convergence looks early |
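Mirrored as a config object (the dataclass itself is an assumption; the names and defaults are from the table):

```python
from dataclasses import dataclass

@dataclass
class HypothesisConfig:
    max_rounds: int = 4                  # maximum iteration rounds
    max_hypotheses: int = 5              # cap on initial hypotheses
    convergence_threshold: float = 0.80  # early-stop net confidence
    min_rounds: int = 2                  # minimum rounds before early stop
```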

Force Hypothesis mode

biocortex run -s hypothesis "What drives EMT in pancreatic cancer?"

# In biocortex CLI
/strategy hypothesis

# Python API
agent.go(task, strategy=ReasoningStrategy.HYPOTHESIS)

Compared to MCTS

| Aspect | MCTS | Hypothesis |
| --- | --- | --- |
| Search object | Analysis path (tools + order) | Scientific claim (which mechanism) |
| Feedback | Simulated path value from LLM | Polarity scores from real evidence |
| Output | Best path → report | Best-supported hypothesis + chain → report |
| State | Tree nodes (visits / reward) | Hypothesis objects (confidence / evidence / status) |
| Stop | Fixed budget | Converged confidence or max rounds |
| Fit | “How should I analyze this dataset?” | “Why does this biological phenomenon occur?” |

Code locations

| File | Role |
| --- | --- |
| biocortex/core/hypothesis.py | HypothesisReasoner, dataclasses, main loop |
| biocortex/core/orchestrator.py | _run_hypothesis(), context + packaging |
| biocortex/core/strategy.py | _HYPOTHESIS_INDICATORS, heuristics + LLM routing |
| biocortex/tools/search_tool.py | pubmed_search(), semantic_scholar_search() for literature tests |

Future improvements

  1. Data-aware generation — If .h5ad is attached, inject a data summary (cells, genes, clusters) into round-0 generation.
  2. Deeper KG — Fully wire knowledge_graph tests to BioKnowledgeGraph.context() and 2-hop neighborhoods.
  3. Cross-hypothesis evidence — Share evidence across H2/H3 when relevant.
  4. Calibration — Tune convergence_threshold and the 1.5 factor using BixBench-style evals.