Knowledge Graph
The BioKnowledgeGraph is a typed, directed graph that stores biological entities and their relationships. It supports retrieval expansion, memory integration, dependency ordering, multimodal entity linking, and KG-grounded verification (hallucination detection).
Role in BioCortex
- Hybrid retrieval (Stage 2): 2-hop BFS from tool candidates discovers related tools via entity–tool and domain–tool links.
- Memory: Semantic memory facts can be stored as graph nodes/edges; episodic content can reference entities.
- Planner: Tool dependency ordering can use graph structure.
- Multimodal fusion: Entity linking across modalities uses the same graph.
- KGGroundingValidator: Each claim in the final report is checked against the graph → grounded / inferred / ungrounded.
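The Stage 2 expansion can be sketched as a depth-limited breadth-first search. This is a minimal illustration, not BioCortex's actual code: the `expand_tools` function, the adjacency dict, and the sample node names are all hypothetical, and links are treated as undirected during expansion, which is an assumption.

```python
from collections import deque

# Hypothetical toy data: node types and entity-tool / domain-tool links.
NODE_TYPES = {
    "blast": "tool", "hmmer": "tool", "deseq2": "tool", "scanpy": "tool",
    "protein": "entity", "genomics": "domain", "transcriptomics": "domain",
}
LINKS = [
    ("blast", "protein"), ("hmmer", "protein"),             # entity-tool links
    ("blast", "genomics"), ("deseq2", "transcriptomics"),   # domain-tool links
    ("scanpy", "transcriptomics"),
]

def build_adjacency(links):
    """Build an undirected adjacency map (assumption: direction is ignored)."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def expand_tools(adj, node_types, seeds, max_hops=2):
    """Return tools reachable within max_hops of any seed, excluding the seeds."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    found = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if node_types.get(nxt) == "tool":
                found.add(nxt)
            queue.append((nxt, depth + 1))
    return found

adj = build_adjacency(LINKS)
related = expand_tools(adj, NODE_TYPES, ["blast"])
print(sorted(related))
```

In this toy graph, `blast` reaches `hmmer` through the shared `protein` entity, while `deseq2` and `scanpy` stay out of reach because they share no entity or domain with the seed.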
Node and Edge Types
Node types (examples): gene, protein, drug, disease, pathway, GO term, cell type, organism, tool, dataset, protocol, publication, domain.
Edge types (examples):
- Biological: interacts_with, regulates, inhibits, activates, encodes, expressed_in, associated_with, targets, treats
- Tool: requires_tool, produces_for, compatible_with, same_domain
- Data: uses_data, produces_data, cited_in
- Ontology: subclass_of, has_function, involved_in
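To make the "typed" part concrete, the following sketch rejects nodes and edges whose type is not in the vocabulary. The `TypedGraph` class is hypothetical and is not the BioKnowledgeGraph API. The type sets copy the examples listed above, with multi-word node types written in snake_case as an assumption.

```python
NODE_TYPES = {
    "gene", "protein", "drug", "disease", "pathway", "go_term", "cell_type",
    "organism", "tool", "dataset", "protocol", "publication", "domain",
}
EDGE_TYPES = {
    # Biological
    "interacts_with", "regulates", "inhibits", "activates", "encodes",
    "expressed_in", "associated_with", "targets", "treats",
    # Tool
    "requires_tool", "produces_for", "compatible_with", "same_domain",
    # Data
    "uses_data", "produces_data", "cited_in",
    # Ontology
    "subclass_of", "has_function", "involved_in",
}

class TypedGraph:
    """Hypothetical minimal typed, directed graph for illustration only."""

    def __init__(self):
        self.nodes = {}  # node id -> node type
        self.edges = {}  # (source, target) -> edge type

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, target, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        for node_id in (source, target):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        # Simplification: at most one edge per ordered (source, target) pair.
        self.edges[(source, target)] = edge_type

    def successors(self, node_id):
        """Return (target, edge_type) pairs for edges leaving node_id."""
        return [(t, et) for (s, t), et in self.edges.items() if s == node_id]

kg = TypedGraph()
kg.add_node("TP53", "gene")
kg.add_node("p53", "protein")
kg.add_node("MDM2", "protein")
kg.add_edge("TP53", "p53", "encodes")
kg.add_edge("MDM2", "p53", "inhibits")
print(kg.successors("TP53"))
```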
Persistence and Ontology Integration
- Persistence: The graph is serialized (e.g. as JSON containing a node list, an edge list, and metadata) so it can be saved and reloaded.
- Gene Ontology: OBO parsing adds terms and is_a (subclass) relationships; obsolete terms can be filtered.
- KEGG: Pathway data adds pathway–gene associations (e.g. involved_in).
- Auto-learning: After each analysis, biological entities are extracted from the task and findings (regex NER: genes, GO terms, UniProt, KEGG, PDB, species) and added as nodes with co-occurrence and tool linkages.
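As a rough sketch of how ontology ingestion and persistence might fit together, the example below parses `[Term]` stanzas from a small hand-written OBO sample, skips obsolete terms, and round-trips the result through JSON. The function names, the JSON field names (`nodes`, `edges`, `metadata`), and the mapping of `is_a` to a `subclass_of` edge are assumptions for illustration. Only the standard OBO tags `id`, `name`, `is_a`, and `is_obsolete` are handled.

```python
import json

# Small hand-written sample in OBO flat-file style (not a real ontology file).
OBO_TEXT = """\
[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
is_obsolete: true
"""

def parse_obo_terms(text, skip_obsolete=True):
    """Parse [Term] stanzas into (nodes, edges) lists of plain dicts."""
    nodes, edges = [], []
    for stanza in text.split("\n\n"):
        lines = [ln.strip() for ln in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue
        term = {"is_a": []}
        for ln in lines[1:]:
            tag, _, value = ln.partition(": ")
            value = value.split(" ! ")[0].strip()  # drop trailing "! comment"
            if tag == "is_a":
                term["is_a"].append(value)
            else:
                term[tag] = value
        if skip_obsolete and term.get("is_obsolete") == "true":
            continue
        nodes.append({"id": term["id"], "type": "go_term",
                      "name": term.get("name", "")})
        for parent in term["is_a"]:
            # Assumption: is_a is stored as a subclass_of edge, child -> parent.
            edges.append({"source": term["id"], "target": parent,
                          "type": "subclass_of"})
    return nodes, edges

def to_json(nodes, edges, metadata):
    """Serialize the graph as node list + edge list + metadata."""
    return json.dumps({"nodes": nodes, "edges": edges, "metadata": metadata},
                      indent=2)

def from_json(payload):
    """Inverse of to_json."""
    data = json.loads(payload)
    return data["nodes"], data["edges"], data["metadata"]

nodes, edges = parse_obo_terms(OBO_TEXT)
payload = to_json(nodes, edges, {"source": "sample OBO text"})
loaded_nodes, loaded_edges, loaded_meta = from_json(payload)
```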
KG-Grounded Verification (Hallucination Detection)
KGGroundingValidator runs after report generation:
- Claim extraction: An LLM (or a heuristic) extracts discrete factual claims from the report.
- Entity extraction: Regex NER finds biological entities in each claim.
- KG path verification: For each pair of entities, compute the shortest path in the graph and classify the claim:
  - Grounded (✅): Direct path (1–2 hops).
  - Inferred (⚠️): Longer path (3+ hops).
  - Ungrounded (❌): No path.
  - Trivial (ℹ️): No biological entities.
- Confidence: Per-claim and overall grounding confidence are computed; the report is annotated with evidence chains and triple references.
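The entity extraction and path classification steps might look roughly like the sketch below. It is illustrative only: the regex is a deliberately crude stand-in for the real NER patterns, the edges are toy data rather than curated biology, and the function names are hypothetical. The source does not say how multiple entity pairs or single-entity claims are handled, so two assumptions are made here: a claim takes the label of its worst pair, and a claim with a single entity counts as grounded when that entity exists in the graph.

```python
import re
from collections import deque
from itertools import combinations

# Toy edges for illustration only; traversed in both directions (assumption).
EDGES = [
    ("TP53", "MDM2"), ("MDM2", "MDM4"), ("MDM4", "CDK4"), ("CDK4", "RB1"),
    ("BRCA1", "BARD1"),
]

# Crude stand-in for regex NER: GO identifiers or upper-case gene-like symbols.
ENTITY_RE = re.compile(r"\bGO:\d{7}\b|\b[A-Z][A-Z0-9]{2,}\b")

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_hops(adj, start, goal):
    """BFS hop count between two nodes, or None if no path exists."""
    if start not in adj or goal not in adj:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for nxt in adj[node]:
            if nxt == goal:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None

def classify_claim(adj, claim):
    """Label a claim as grounded, inferred, ungrounded, or trivial."""
    entities = sorted(set(ENTITY_RE.findall(claim)))
    if not entities:
        return "trivial"
    if len(entities) == 1:
        # Assumption: a lone entity is grounded if it is a known node.
        return "grounded" if entities[0] in adj else "ungrounded"
    hops = [shortest_hops(adj, a, b) for a, b in combinations(entities, 2)]
    if any(h is None for h in hops):
        return "ungrounded"
    # Assumption: the longest pairwise path decides the label.
    return "grounded" if max(hops) <= 2 else "inferred"

adj = build_adjacency(EDGES)
claims = [
    "TP53 is regulated by MDM2.",
    "TP53 influences RB1.",
    "TP53 interacts with BRCA1.",
    "The analysis completed successfully.",
]
labels = [classify_claim(adj, claim) for claim in claims]
print(list(zip(claims, labels)))
```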
Implementation
- Backend: e.g. NetworkX DiGraph with node/edge attributes.
- Subgraph extraction for LLM context: bounded BFS (depth, max nodes), output as natural-language triples: entity_A —[edge_type]→ entity_B.
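A bounded BFS that emits triples in this format could be sketched as follows. To stay dependency-free it uses plain dicts rather than NetworkX. The `extract_triples` function, its `max_depth` and `max_nodes` parameters, and the sample edges are hypothetical, and following only outgoing edges is an assumption.

```python
from collections import deque

# Hypothetical directed edges for illustration: (source, edge_type, target).
EDGES = [
    ("TP53", "encodes", "p53"),
    ("MDM2", "inhibits", "p53"),
    ("p53", "involved_in", "apoptosis"),
    ("apoptosis", "subclass_of", "cell death"),
]

def extract_triples(edges, seed, max_depth=2, max_nodes=10):
    """Bounded BFS over outgoing edges, returning triples as strings."""
    outgoing = {}
    for source, edge_type, target in edges:
        outgoing.setdefault(source, []).append((edge_type, target))

    seen = {seed}
    queue = deque([(seed, 0)])
    triples = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth bound reached: do not expand this node
        for edge_type, target in outgoing.get(node, []):
            if target not in seen:
                if len(seen) >= max_nodes:
                    continue  # node budget exhausted: skip edges to new nodes
                seen.add(target)
                queue.append((target, depth + 1))
            triples.append(f"{node} —[{edge_type}]→ {target}")
    return triples

triples = extract_triples(EDGES, "TP53")
print("\n".join(triples))
```

Both bounds keep the context passed to the LLM small: `max_depth` limits how far the walk strays from the seed, and `max_nodes` caps the total size even around highly connected nodes.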
Next Steps
- Hybrid Retrieval — Stage 2 expansion using the graph.
- Memory System — Semantic memory and graph integration.