Determinism Where Possible: The Case for a Dumb Planner
Open any popular agent framework and you'll find the same architecture: an LLM at the center of a loop, deciding what to do next at each step. This is marketed as intelligence. In production it manifests as brittleness: hallucinated tool calls, compounding errors, unrecoverable states, and reliability curves that nobody wants to publish.
Emily's Project Helios does the opposite. The planner is deterministic code. Task templates are defined at creation time. The LLM is invoked only when language generation is actually required. Verification is deterministic: exit_code, file_contains, pytest, api_response. Not judgment. Not "the model thinks it probably worked."
The measured outcomes: 122/122 tests passing, 357 atomic step claims per second with zero race conditions, a 10,445-memory autonomous correction executed in production with zero human intervention. This is not a demo. This is what reliable autonomy looks like.
Why "smart planner" goes wrong
An LLM-driven planner has a fundamental problem: errors compound across steps. Each turn, the LLM might hallucinate a tool call, misremember the state, or choose a wrong branch. For a 10-step task, even a 5% per-step error rate compounds to a ~40% failure rate. For a 20-step task, ~64%.
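The compounding arithmetic is easy to verify directly. A minimal sketch, assuming independent per-step errors:

```python
def failure_rate(p_step_error: float, n_steps: int) -> float:
    """Probability that at least one of n independent steps goes wrong."""
    return 1 - (1 - p_step_error) ** n_steps

print(round(failure_rate(0.05, 10), 2))  # 0.4
print(round(failure_rate(0.05, 20), 2))  # 0.64
```

The independence assumption is generous to the LLM: in practice a hallucinated state early in the plan makes later steps more likely to fail, not less.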
You can try to engineer around this (better prompts, self-critique, retries), but the fundamental issue is that you're asking a stochastic process to produce reliable plan execution. It's the wrong tool for the job.
Why "dumb planner" works
A deterministic planner has a different property: each step's behavior is a function of its inputs, not of the LLM's mood. If a step says "run pytest tests/foo.py, verify exit_code 0," then either pytest passes or it doesn't. No ambiguity. No hallucinated success.
Errors don't compound because there's no cognitive process accumulating them. There's just code, executing steps, checking post-conditions.
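A deterministic step boils down to a few lines of code. This is a hypothetical sketch, not Helios's actual API; the `Step` shape and `run_step` name are assumptions:

```python
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class Step:
    command: list[str]       # e.g. ["pytest", "tests/foo.py"]
    expected_exit: int = 0   # the post-condition, checked mechanically

def run_step(step: Step) -> bool:
    """Execute the command and verify its exit code. No judgment involved."""
    result = subprocess.run(step.command, capture_output=True)
    return result.returncode == step.expected_exit

# Either the command exits 0 or it doesn't; there is no "probably worked".
print(run_step(Step([sys.executable, "-c", "pass"])))  # True
```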
Where the LLM lives in this architecture
The LLM is still valuable: it generates the content of the work. What it doesn't do is decide the work.
Concretely in Emily:
- A task to "send a progress update to the user" has the LLM generate the language
- A task to "determine whether to send an update" is deterministic code checking a condition
- A task to "format a report from these 50 memories" has the LLM generate the prose
- A task to "choose which 50 memories" is deterministic code: top-N by ECGL score
LLMs are the commodity; the orchestration structure is the product. This matches the three-layer model Emily uses at the architectural level.
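The memory-selection case above is the clearest illustration of the split. A minimal sketch (the dict shape and tie-breaking rule are assumptions; only the report prose would come from an LLM call, stubbed out here):

```python
def select_memories(memories: list[dict], n: int = 50) -> list[dict]:
    """Deterministic: top-N by ECGL score, ties broken by id for stability."""
    return sorted(memories, key=lambda m: (-m["ecgl_score"], m["id"]))[:n]

# Same inputs, same outputs, every time -- no prompt, no sampling.
memories = [{"id": i, "ecgl_score": (i % 7) / 10} for i in range(10)]
top3 = select_memories(memories, n=3)
print([m["id"] for m in top3])  # [6, 5, 4]
```

Formatting those three memories into a readable report is the part that goes to the LLM; choosing them never does.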
The verification engine
Eight verification types, all deterministic:
- `exit_code` – check command exit status
- `file_contains` – pattern matching in files
- `file_not_exists` – verify file absence
- `command_output` – check stdout
- `pytest` – run a test suite
- `api_response` – HTTP endpoint validation
- `db_query` – database assertions
- `manual` – requires human verification
Notice what's missing: no "LLM judgment" verification type. No "the agent believes the task succeeded." Verification is code that either passes or fails. This is what makes the whole system auditable.
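Such an engine can be a plain dispatch table of boolean functions. A hypothetical sketch covering three of the types above (names mirror the list; the signatures are assumptions, not Helios's actual interface):

```python
import subprocess
from pathlib import Path

def verify_exit_code(command: list[str], expected: int = 0) -> bool:
    return subprocess.run(command, capture_output=True).returncode == expected

def verify_file_contains(path: str, pattern: str) -> bool:
    return Path(path).is_file() and pattern in Path(path).read_text()

def verify_file_not_exists(path: str) -> bool:
    return not Path(path).exists()

VERIFIERS = {
    "exit_code": verify_exit_code,
    "file_contains": verify_file_contains,
    "file_not_exists": verify_file_not_exists,
}

def verify(kind: str, **kwargs) -> bool:
    """Dispatch to a deterministic check; an unknown kind raises KeyError."""
    return VERIFIERS[kind](**kwargs)
```

Every verifier returns a plain boolean from an observable fact, which is exactly what makes the audit trail trustworthy.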
Kill switches and bounded autonomy
Because the planner is deterministic, you can actually reason about what it will and won't do. Three kill switch levels:
- Global – `AUTONOMOUS_PULSE_ENABLED=false` stops all autonomous execution
- Task-level – `POST /helios/tasks/{task_id}/pause`
- Emergency – direct DB update with `kill_switch_reason`
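Because each level is a plain flag, the whole check is a few lines of code. A minimal sketch, assuming a task record exposes `paused` and `kill_switch_reason` fields (the function name and lookup are hypothetical):

```python
import os

def autonomy_allowed(task: dict) -> bool:
    """A step runs only if every kill-switch level permits it."""
    if os.environ.get("AUTONOMOUS_PULSE_ENABLED", "true").lower() == "false":
        return False                     # global switch
    if task.get("paused"):
        return False                     # task-level pause
    if task.get("kill_switch_reason"):
        return False                     # emergency DB-level stop
    return True

print(autonomy_allowed({"id": 1}))                  # True
print(autonomy_allowed({"id": 2, "paused": True}))  # False
```

There is no prompt in this path, so no stop signal can be "talked past".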
You can only confidently kill a system you can reason about. LLM-driven agent loops are harder to kill because their state is a prompt history that may or may not respect a stop signal.
The atomic claim property
Under load, the planner does 357 atomic step claims per second with zero race conditions. Multiple workers can pick up steps concurrently; the database primitive ensures exactly one worker owns each step at a time.
This is the kind of guarantee you can state because the planner is code. Stating the same guarantee about an LLM-driven loop would require reasoning about the LLM's behavior under concurrent invocation โ which is not a tractable problem.
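The underlying pattern is a conditional UPDATE: the database guarantees that only one worker's write can match the unclaimed row. A minimal illustration in SQLite (Helios's actual schema and primitive are not shown here; this is the pattern, not the implementation):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE steps (id INTEGER PRIMARY KEY, claimed_by TEXT)")
db.execute("INSERT INTO steps (id, claimed_by) VALUES (1, NULL)")

def claim(worker: str, step_id: int) -> bool:
    """Atomically claim a step; succeeds only if the row is still unclaimed."""
    cur = db.execute(
        "UPDATE steps SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
        (worker, step_id),
    )
    return cur.rowcount == 1  # exactly one worker's UPDATE can match

print(claim("worker-a", 1))  # True  - first claim wins
print(claim("worker-b", 1))  # False - row already claimed
```

The `WHERE claimed_by IS NULL` clause is the whole trick: the check and the write happen in one atomic statement, so no interleaving of workers can double-claim a step.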
The philosophical inversion
Most of the industry is pushing in the direction of "smarter planners." Emily pushes in the direction of "dumber planners, richer tools, clearer contracts."
This is not an aesthetic preference. It's a direct response to what we've observed: the reliability of autonomous systems is bounded by the reliability of the planner, and LLM-driven planners have a reliability ceiling that's too low for production.
Reliability comes from the dumb planner. Intelligence comes from the tools the planner invokes. Keep those two responsibilities separate and you get systems that both work and are auditable.
The general principle
"Determinism where possible, stochasticity where necessary" is a good design heuristic for any system that mixes code and LLMs. Put the LLM where its strengths are (language, creativity, open-ended synthesis). Don't put it where its weaknesses are (sequencing, state tracking, verification).
Emily's Helios architecture is this principle, compiled to Python.
Part of the Emily OS architecture philosophy series.