Context Window and Token Budget

BioCortex is context-window aware: it knows each model’s limits and enforces token budgets on every LLM call so long conversations and large plans do not cause overflow or undefined behavior. Working memory size can be auto-calibrated from the active model.

Why This Matters

  • Models have different context sizes (e.g. Qwen3-max 262K, Claude 200K, smaller models 8K–32K).
  • Prompts include: system prompt, working memory, episodic context, task, and (in synthesis) per-step results. Without budgets, switching models or running long pipelines can exceed the context window.
  • Guards ensure that the combined input is truncated to a safe budget before calling the LLM, and that the Synthesizer does not pack too much result text into a single request.

Model Context Window Table

A table MODEL_CONTEXT_WINDOWS (in biocortex.config) maps model identifiers to:
  • max_input_tokens
  • max_output_tokens
Examples: qwen3-max, qwen-max, claude-sonnet-4-20250514, gpt-4o, etc. Resolution tries an exact match first, then a prefix match, then a substring match; unknown models fall back to a conservative default (e.g. 128K input / 4K output).
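The three-stage resolution order can be sketched as follows. The table entries, values, and function name here are illustrative stand-ins, not the actual contents of biocortex.config:

```python
# Illustrative context-window table; the real one lives in
# biocortex.config.MODEL_CONTEXT_WINDOWS and covers more models.
MODEL_CONTEXT_WINDOWS = {
    "qwen3-max": {"max_input_tokens": 262_144, "max_output_tokens": 32_768},
    "claude-sonnet-4-20250514": {"max_input_tokens": 200_000, "max_output_tokens": 64_000},
    "gpt-4o": {"max_input_tokens": 128_000, "max_output_tokens": 16_384},
}
DEFAULT_WINDOW = {"max_input_tokens": 128_000, "max_output_tokens": 4_096}

def resolve_window(model: str) -> dict:
    # 1. Exact match
    if model in MODEL_CONTEXT_WINDOWS:
        return MODEL_CONTEXT_WINDOWS[model]
    # 2. Prefix match (e.g. "gpt-4o-2024-08-06" matches "gpt-4o")
    for name, window in MODEL_CONTEXT_WINDOWS.items():
        if model.startswith(name):
            return window
    # 3. Substring match
    for name, window in MODEL_CONTEXT_WINDOWS.items():
        if name in model:
            return window
    # 4. Conservative fallback for unknown models
    return DEFAULT_WINDOW
```

Prefix matching before substring matching means dated model variants resolve to their base entry without accidentally matching a shorter, unrelated name.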

Token Estimation

  • estimate_tokens(text) uses a hybrid heuristic:
    • Non-CJK: ~1 token per 4 characters.
    • CJK: ~2 tokens per character.
  • Slight overestimate for safety. Used for working memory compression, budget checks, and episodic truncation.
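A minimal version of such a hybrid heuristic might look like this. The CJK character ranges below are an assumption; the real estimate_tokens may cover more scripts or use different boundaries:

```python
import re

# Rough CJK coverage: Han, Hiragana/Katakana, Hangul (an assumption;
# the real heuristic may use wider ranges).
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def estimate_tokens(text: str) -> int:
    cjk = len(_CJK.findall(text))      # ~2 tokens per CJK character
    other = len(text) - cjk            # ~1 token per 4 other characters
    return cjk * 2 + (other + 3) // 4  # round up, erring on the high side
```

Rounding up and weighting CJK at 2 tokens per character keeps the estimate slightly pessimistic, which is the safe direction for budget checks.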

Auto-Calibration of Working Memory

  • memory.working_memory_max_tokens can be set to -1 (sentinel).
  • At config build time, _calibrate_memory_budget() sets it to 60% of the reasoning model’s max input, clamped between 16K and 600K.
  • So when you switch the reasoning model (e.g. to Qwen3-max), working memory size adapts automatically.
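The calibration step can be sketched as below; the 60% ratio and the [16K, 600K] clamp come from the description above, while the function signature is an illustration:

```python
AUTO = -1  # sentinel: "calibrate from the active reasoning model"

def calibrate_memory_budget(configured: int, reasoning_max_input: int) -> int:
    if configured != AUTO:
        return configured                      # explicit setting wins
    budget = int(reasoning_max_input * 0.6)    # 60% of the model's max input
    return max(16_000, min(budget, 600_000))   # clamp to [16K, 600K]
```

With a 262K-input model this yields roughly 157K tokens of working memory; with a small 8K model the lower clamp keeps the budget from collapsing below a usable floor.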

Per-Call Budget (BaseAgent)

Every agent (Planner, Executor, Critic, Synthesizer) inherits from BaseAgent, which provides:
  1. _get_input_budget(role) — for the LLM role (reasoning/coder/fast), returns:
    • max_input - output_reserve - 512
    • output_reserve = min(max_output, config.max_tokens) * 0.15
  2. _invoke_llm — Before calling the LLM:
    • Estimates total tokens for: system prompt, extra context (e.g. working memory, episodic context), user message.
    • If total exceeds the budget: truncates extra context and user message (optionally keeping the tail); only truncates the system prompt if still over.
    • Priority: keep system prompt as intact as possible, then balance context and message.
So every LLM call is guarded against overflow.
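The two steps above can be sketched together. This uses a simplified non-CJK token estimate, and guard_prompt is a hypothetical stand-in for the truncation logic inside _invoke_llm, which is richer in practice:

```python
def get_input_budget(max_input: int, max_output: int, config_max_tokens: int) -> int:
    output_reserve = int(min(max_output, config_max_tokens) * 0.15)
    return max_input - output_reserve - 512

def estimate_tokens(text: str) -> int:
    return (len(text) + 3) // 4  # simplified non-CJK estimate

def guard_prompt(system: str, context: str, message: str, budget: int):
    def fit(text: str, tok_budget: int, keep_tail: bool = True) -> str:
        limit = max(tok_budget, 0) * 4             # tokens -> characters
        if len(text) <= limit:
            return text
        if limit == 0:
            return ""
        return text[-limit:] if keep_tail else text[:limit]

    if estimate_tokens(system) + estimate_tokens(context) + estimate_tokens(message) <= budget:
        return system, context, message
    # Priority: keep the system prompt intact; split what remains between
    # the extra context (keeping its tail: recent info matters most) and
    # the user message.
    remaining = budget - estimate_tokens(system)
    context = fit(context, remaining // 2)
    message = fit(message, remaining - estimate_tokens(context))
    if estimate_tokens(system) + estimate_tokens(context) + estimate_tokens(message) > budget:
        # Last resort: trim the system prompt itself.
        system = fit(system, budget - estimate_tokens(context) - estimate_tokens(message),
                     keep_tail=False)
    return system, context, message
```

The extra 512-token margin absorbs estimation error, so a slightly optimistic token count still cannot push the real prompt past the model's limit.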

Synthesizer Per-Step Budget

The Synthesizer receives all step results to produce the final report. For long DAGs, concatenating every step's result can exceed the coder model's context window.
  • _compute_per_step_char_budget(num_steps):
    • Takes the coder model’s max input.
    • Subtracts fixed reserves for system prompt and task description.
    • Divides the remainder by num_steps.
    • Converts to characters (×4) and clamps to e.g. [500, 8000] per step.
  • Each step’s result string is truncated to that length while preserving head and tail (so the beginning and end of each result remain visible). This keeps the total context within the model’s window.
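A sketch of this computation follows. Only the ×4 character conversion and the [500, 8000] clamp come from the description above; the reserve sizes and the truncation marker are assumptions:

```python
SYSTEM_RESERVE_TOKENS = 2_000  # assumed reserve for the system prompt
TASK_RESERVE_TOKENS = 1_000    # assumed reserve for the task description

def compute_per_step_char_budget(num_steps: int, coder_max_input: int) -> int:
    usable = coder_max_input - SYSTEM_RESERVE_TOKENS - TASK_RESERVE_TOKENS
    per_step_tokens = usable // max(num_steps, 1)
    return max(500, min(per_step_tokens * 4, 8_000))  # tokens -> chars, clamped

def truncate_keep_head_tail(result: str, char_budget: int) -> str:
    if len(result) <= char_budget:
        return result
    marker = "\n...[truncated]...\n"
    half = max((char_budget - len(marker)) // 2, 0)
    return result[:half] + marker + result[len(result) - half:]
```

Short pipelines get the full 8,000-character ceiling per step; very long pipelines degrade gracefully toward the 500-character floor instead of overflowing the window, and the head/tail split keeps both the opening and the conclusion of each result visible.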

Summary

  • Model table + token estimation + per-role input budget + truncation in _invoke_llm prevent context overflow on all agent calls.
  • Working memory can auto-calibrate to the reasoning model.
  • Synthesizer uses a dynamic per-step character budget so long pipelines still fit in the coder’s context.

Next Steps