Context Window and Token Budget
BioCortex is context-window aware: it knows each model's limits and enforces token budgets on every LLM call, so long conversations and large plans do not overflow the context window or cause undefined behavior. Working memory size can be auto-calibrated from the active model.

Why This Matters
- Models have different context sizes (e.g. Qwen3-max 262K, Claude 200K, smaller models 8K–32K).
- Prompts include: system prompt, working memory, episodic context, task, and (in synthesis) per-step results. Without budgets, switching models or running long pipelines can exceed the context window.
- Guards ensure that combined input is truncated to a safe budget before calling the LLM, and that the Synthesizer does not pack too much result text into one request.
Model Context Window Table
A table, MODEL_CONTEXT_WINDOWS (in biocortex.config), maps each model identifier to:
- max_input_tokens
- max_output_tokens
Entries include qwen3-max, qwen-max, claude-sonnet-4-20250514, gpt-4o, and others. Resolution uses exact match, then prefix match, then substring match; unknown models fall back to a conservative default (e.g. 128K input / 4K output).
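The lookup order above can be sketched as follows. This is a minimal illustration, not BioCortex's actual code: the table values and the `resolve_context_window` helper name are assumptions; only the exact/prefix/substring/default resolution order comes from the docs.

```python
# Hypothetical sketch of the MODEL_CONTEXT_WINDOWS lookup; entries and
# helper name are illustrative, the resolution order matches the docs.
MODEL_CONTEXT_WINDOWS = {
    "qwen3-max": {"max_input_tokens": 262_144, "max_output_tokens": 8_192},
    "claude-sonnet-4-20250514": {"max_input_tokens": 200_000, "max_output_tokens": 8_192},
    "gpt-4o": {"max_input_tokens": 128_000, "max_output_tokens": 4_096},
}

DEFAULT_WINDOW = {"max_input_tokens": 128_000, "max_output_tokens": 4_096}

def resolve_context_window(model: str) -> dict:
    # 1. Exact match.
    if model in MODEL_CONTEXT_WINDOWS:
        return MODEL_CONTEXT_WINDOWS[model]
    # 2. Prefix match (e.g. "gpt-4o-2024-08-06" -> "gpt-4o").
    for name, window in MODEL_CONTEXT_WINDOWS.items():
        if model.startswith(name):
            return window
    # 3. Substring match (e.g. a provider-prefixed identifier).
    for name, window in MODEL_CONTEXT_WINDOWS.items():
        if name in model:
            return window
    # 4. Conservative default for unknown models.
    return DEFAULT_WINDOW
```

Dated or provider-prefixed model identifiers still resolve to the right window, which is why prefix and substring matching come before the default.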
Token Estimation
- estimate_tokens(text) uses a hybrid heuristic:
- Non-CJK: ~1 token per 4 characters.
- CJK: ~2 tokens per character.
- Slight overestimate for safety. Used for working memory compression, budget checks, and episodic truncation.
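The hybrid heuristic can be sketched like this. The CJK character ranges and rounding are assumptions; the per-character ratios are those stated above.

```python
import re

# Common CJK ranges (Han, Kana, Hangul) -- an assumption for this sketch.
CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def estimate_tokens(text: str) -> int:
    cjk_chars = len(CJK.findall(text))
    other_chars = len(text) - cjk_chars
    # ~2 tokens per CJK character, ~1 token per 4 other characters;
    # ceiling division so the estimate errs on the safe (high) side.
    return cjk_chars * 2 + -(-other_chars // 4)
```

A heuristic like this avoids loading a tokenizer for every budget check; a slight overestimate is the safe failure mode here.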
Auto-Calibration of Working Memory
- memory.working_memory_max_tokens can be set to -1 (sentinel).
- At config build time, _calibrate_memory_budget() sets it to 60% of the reasoning model’s max input, clamped between 16K and 600K.
- So when you switch the reasoning model (e.g. to Qwen3-max), working memory size adapts automatically.
Per-Call Budget (BaseAgent)
Every agent (Planner, Executor, Critic, Synthesizer) inherits from BaseAgent, which provides:
- _get_input_budget(role) — for the LLM role (reasoning/coder/fast), returns max_input - output_reserve - 512, where output_reserve = min(max_output, config.max_tokens) * 0.15.
- _invoke_llm — before calling the LLM:
- Estimates total tokens for: system prompt, extra context (e.g. working memory, episodic context), user message.
- If total exceeds the budget: truncates extra context and user message (optionally keeping the tail); only truncates the system prompt if still over.
- Priority: keep system prompt as intact as possible, then balance context and message.
Synthesizer Per-Step Budget
The Synthesizer receives all step results to produce the final report. For long DAGs, concatenating every step can exceed the coder model's context window.
- _compute_per_step_char_budget(num_steps):
- Takes the coder model’s max input.
- Subtracts fixed reserves for system prompt and task description.
- Divides the remainder by num_steps.
- Converts tokens to characters (×4) and clamps to e.g. [500, 8000] per step.
- Each step’s result string is truncated to that length while preserving head and tail (so the beginning and end of each result remain visible). This keeps the total context within the model’s window.
Summary
- Model table + token estimation + per-role input budget + truncation in _invoke_llm prevent context overflow on all agent calls.
- Working memory can auto-calibrate to the reasoning model.
- Synthesizer uses a dynamic per-step character budget so long pipelines still fit in the coder’s context.
Next Steps
- Memory System — Working memory and compression.
- Multi-Agent Pipeline — Where budgets are applied.
- Configuration — Overriding model table and memory settings.