Knowledge Graph

The BioKnowledgeGraph is a typed, directed graph that stores biological entities and their relationships. It supports retrieval expansion, memory integration, dependency ordering, multimodal entity linking, and KG-grounded verification (hallucination detection).

Role in BioCortex

  • Hybrid retrieval (Stage 2): 2-hop BFS from tool candidates discovers related tools via entity–tool and domain–tool links.
  • Memory: Semantic memory facts can be stored as graph nodes/edges; episodic content can reference entities.
  • Planner: Tool dependency ordering can use graph structure.
  • Multimodal fusion: Entity linking across modalities uses the same graph.
  • KGGroundingValidator: Each claim in the final report is checked against the graph → grounded / inferred / ungrounded.
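
The 2-hop expansion used in hybrid retrieval can be sketched roughly as follows. This is a minimal illustration, not the BioCortex implementation: the toy edge list, the "tool:" / "gene:" / "domain:" id prefixes, and the function names build_adjacency and expand_tools are all assumptions, and edges are followed in both directions so that tool → entity → tool links are reachable.

```python
from collections import deque

# Hypothetical toy edges; node ids carry a type prefix (an assumption of this sketch).
EDGES = [
    ("tool:blast", "domain:genomics"),
    ("tool:bowtie2", "domain:genomics"),
    ("tool:blast", "gene:TP53"),
    ("tool:primer3", "gene:TP53"),
    ("tool:autodock", "domain:docking"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map so links can be followed both ways."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def expand_tools(adj, seeds, max_hops=2):
    """Return tool nodes within max_hops of any seed, excluding the seeds."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    found = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adj.get(node, ()):
            if nbr in seen:
                continue
            seen.add(nbr)
            if nbr.startswith("tool:"):
                found.add(nbr)
            queue.append((nbr, depth + 1))
    return found

adj = build_adjacency(EDGES)
related = expand_tools(adj, ["tool:blast"])
```

Starting from tool:blast, the expansion reaches tool:bowtie2 through the shared domain node and tool:primer3 through the shared gene node, while tool:autodock stays out of range.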

Node and Edge Types

Node types (examples):
gene, protein, drug, disease, pathway, GO term, cell type, organism, tool, dataset, protocol, publication, domain.

Edge types (examples):
  • Biological: interacts_with, regulates, inhibits, activates, encodes, expressed_in, associated_with, targets, treats
  • Tool: requires_tool, produces_for, compatible_with, same_domain
  • Data: uses_data, produces_data, cited_in
  • Ontology: subclass_of, has_function, involved_in
Nodes and edges can carry attributes: name, type, properties, source, confidence, timestamp.
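
One way to picture typed nodes and edges with these attributes is the sketch below. The class names Node, Edge, and TinyGraph, the default values, and the confidence numbers are hypothetical and chosen only for illustration; the actual BioKnowledgeGraph classes may look different.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Node:
    name: str
    type: str                      # e.g. "gene", "drug", "disease"
    properties: dict = field(default_factory=dict)
    source: str = "manual"         # assumed default, not from the source text
    confidence: float = 1.0
    timestamp: float = field(default_factory=time.time)

@dataclass
class Edge:
    src: str
    dst: str
    type: str                      # e.g. "targets", "treats"
    source: str = "manual"
    confidence: float = 1.0
    timestamp: float = field(default_factory=time.time)

class TinyGraph:
    """Hypothetical in-memory typed, directed graph."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, edge):
        # Reject edges whose endpoints have not been added as nodes.
        for endpoint in (edge.src, edge.dst):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node: {endpoint}")
        self.edges.append(edge)

    def out_edges(self, name, edge_type=None):
        """Outgoing edges of a node, optionally filtered by edge type."""
        return [e for e in self.edges
                if e.src == name and (edge_type is None or e.type == edge_type)]

g = TinyGraph()
g.add_node(Node("imatinib", "drug"))
g.add_node(Node("ABL1", "gene"))
g.add_node(Node("chronic myeloid leukemia", "disease"))
g.add_edge(Edge("imatinib", "ABL1", "targets", confidence=0.9))
g.add_edge(Edge("imatinib", "chronic myeloid leukemia", "treats", confidence=0.9))
targets = [e.dst for e in g.out_edges("imatinib", "targets")]
```

Filtering by edge type keeps biological, tool, data, and ontology relations queryable separately even though they share one graph.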

Persistence and Ontology Integration

  • Persistence: The graph is serialized (e.g. as JSON with a node list, an edge list, and metadata) so it can be saved and reloaded.
  • Gene Ontology: OBO parsing adds terms and is_a (subclass) relationships; obsolete terms can be filtered.
  • KEGG: Pathway data adds pathway–gene associations (e.g. involved_in).
  • Auto-learning: After each analysis, biological entities are extracted from the task and findings (regex NER: genes, GO terms, UniProt, KEGG, PDB, species) and added as nodes with co-occurrence and tool linkages.
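
As an illustration of the ontology import and persistence steps, the sketch below parses a small OBO-style excerpt into a node list and an edge list, skips obsolete terms, and round-trips the result through JSON. The function name parse_obo, the dictionary keys, and the metadata fields are assumptions; a real OBO file has more tags and edge cases than this handles.

```python
import json

# Abbreviated OBO-style excerpt, used here only as sample input.
OBO_SAMPLE = """format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
is_obsolete: true
"""

def parse_obo(text, keep_obsolete=False):
    """Turn [Term] stanzas into (nodes, edges); is_a becomes subclass_of."""
    nodes, edges = [], []
    for stanza in text.split("\n\n"):
        lines = [ln.strip() for ln in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue  # skip the header and non-term stanzas
        term = {"is_a": []}
        for ln in lines[1:]:
            key, _, value = ln.partition(": ")
            value = value.split(" ! ")[0].strip()  # drop trailing "! comment"
            if key == "is_a":
                term["is_a"].append(value)
            else:
                term[key] = value
        if term.get("is_obsolete") == "true" and not keep_obsolete:
            continue
        nodes.append({"id": term["id"], "name": term.get("name", ""), "type": "GO term"})
        for parent in term["is_a"]:
            edges.append({"src": term["id"], "dst": parent, "type": "subclass_of"})
    return nodes, edges

nodes, edges = parse_obo(OBO_SAMPLE)

# Persist as node list + edge list + metadata, then load it back.
payload = {"metadata": {"source": "GO excerpt"}, "nodes": nodes, "edges": edges}
restored = json.loads(json.dumps(payload))
```

Keeping the serialized form to plain lists and dictionaries means the same payload can be loaded into whichever graph backend is in use.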

KG-Grounded Verification (Hallucination Detection)

KGGroundingValidator runs after report generation:
  1. Claim extraction — LLM (or heuristic) extracts discrete factual claims from the report.
  2. Entity extraction — Regex NER finds biological entities in each claim.
  3. KG path verification — For each pair of entities in a claim, compute the shortest path between them in the graph. Classify:
    • Grounded (✅): Short path (1–2 hops).
    • Inferred (⚠️): Longer path (3+ hops).
    • Ungrounded (❌): No path.
    • Trivial (ℹ️): No biological entities.
  4. Confidence — Per-claim and overall grounding confidence are computed; the report is annotated with evidence chains and triple references.
This gives users explicit confidence levels for interpreting AI-generated analyses.
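
The classification step can be sketched as below. Several choices here are assumptions rather than documented behaviour: the regex patterns are crude stand-ins for the real NER, edges are traversed in both directions, a claim with several entity pairs takes the label of its weakest pair, and a claim with a single entity is judged by whether that entity exists in the graph. The toy edge list is illustrative only.

```python
import re
from collections import deque
from itertools import combinations

# Crude illustrative patterns (assumptions); real NER would be stricter.
GO_RE = re.compile(r"\bGO:\d{7}\b")
GENE_RE = re.compile(r"\b[A-Z][A-Z0-9]{2,9}\b")

TOY_EDGES = [("TP53", "MDM2"), ("TP53", "CDKN1A"),
             ("CDKN1A", "CDK2"), ("CDK2", "CCNE1")]

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def extract_entities(claim):
    """Regex matches in order of appearance, without duplicates."""
    found = GO_RE.findall(claim) + GENE_RE.findall(claim)
    return list(dict.fromkeys(found))

def shortest_hops(adj, a, b):
    """Breadth-first search; returns the hop count or None if no path exists."""
    if a not in adj or b not in adj:
        return None
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj[node]:
            if nbr == b:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def classify_claim(adj, claim):
    entities = extract_entities(claim)
    if not entities:
        return "trivial"
    if len(entities) == 1:  # assumption: judge by node presence
        return "grounded" if entities[0] in adj else "ungrounded"
    hops = [shortest_hops(adj, a, b) for a, b in combinations(entities, 2)]
    if any(h is None for h in hops):
        return "ungrounded"
    return "inferred" if max(hops) >= 3 else "grounded"

adj = build_adjacency(TOY_EDGES)
labels = [classify_claim(adj, c) for c in (
    "MDM2 inhibits TP53.",
    "MDM2 influences CCNE1 levels.",
    "TP53 binds BRCA2.",
    "The analysis completed without errors.",
)]
```

In this toy graph the four claims land in the four categories in order: a 1-hop pair, a 4-hop pair, a pair with an entity missing from the graph, and a sentence with no entities.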

Implementation

  • Backend: e.g. NetworkX DiGraph with node/edge attributes.
  • Subgraph extraction for LLM context: bounded BFS (depth, max nodes), output as natural-language triples: entity_A —[edge_type]→ entity_B.

Next Steps