What NASA's coding rules taught me about working with AI

19 May 2026#discipline#ai-tooling#claude-code#process

I read an article this week about the "Power of Ten," a set of rules Gerard Holzmann wrote at NASA's Jet Propulsion Laboratory in 2006 for code that absolutely cannot fail. The article reframed them for AI-assisted development, where the most common failure mode is fluent but wrong output.

The rules were familiar to me only as historical context. What struck me was how cleanly they generalise. Most of what makes safety-critical code reliable applies just as forcefully to writing a client brief, drafting a research piece, or producing any analytical artefact with an AI in the loop. State your assumptions. Never swallow errors. Bound your scope. Read every line before delivery. The failure mode in a poorly written brief is the same as in poorly written code: fluency that looks like correctness but isn't.

So Claude and I sat down and adapted the principles for our work together. Seven general rules cover any artefact:

Define correctness before producing. Articulate what "right" looks like before generating anything substantive; for code, tests verify the requirement, not the implementation.
Stay within evidence. Every claim traces to a source, observation, or stated assumption, and absence is named.
Surface, never swallow. Gaps, contradictions, weak sources, and hedges are made visible, not smoothed over.
Keep it linear and decomposed. Each artefact reads top to bottom as premise, reasoning, conclusion, with each section doing one thing.
Bound the scope. Every piece states its limits; claims do not exceed the evidence.
Read every line before delivery. No output leaves the working space without full review.
Ask before guessing. Underspecified requests get one clarifying question, not an inferred answer.

Ten further rules apply when the artefact is code, restoring the article's strictness where it actually matters:

Verify before invoking. Every API, library, and syntax pattern is confirmed to exist in the target version and to be current and supported.
Bounded constructs. Every loop has a maximum, every network call a timeout, every retry a cap and a backoff, and retries cover only idempotent operations.
Resource lifetime is declared. Every resource acquired has a declared close path that is correct on the error path as well as the happy path.
Side effects are visible at the call site. I/O, mutations, and network calls are named clearly and not hidden inside utility functions.
Mechanical gates run on every change. Type checking, linting, and static analysis are wired into CI and set to fail the build on violations.
Read and run before delivery. Every change is executed against tests, including failure cases, before it leaves the working space.
Untrusted input is validated; secrets and crypto are handled with care. Inputs are checked at the trust boundary, secrets are never hardcoded or logged, and cryptography uses established current libraries.
Operations are observable. Every consequential action emits a log at the appropriate level; silent code in production is a defect.
Diffs are minimal and targeted. When editing existing code, only the lines required by the stated change are modified; refactoring is a separate change.
Concurrency assumptions are explicit. Where code runs concurrently, shared state, ordering, and the mechanisms enforcing them are named in the code.

A tension surfaced once the rules were written. Strict adherence costs tokens. Asking clarifying questions, marking uncertainty, surfacing gaps: all of it adds words. For a high-stakes external deliverable, that cost is trivial against the cost of a confidently wrong artefact. For a quick lookup or a brainstorming exchange, it would be theatre.

The resolution was tiering. High-stakes work gets all the rigour. Medium-stakes drafts get most of it. Low-stakes exchanges get only the rules that are nearly free anyway: the ones that prevent fabrication and silent failure. Those cheap rules never relax, because confident-wrong outputs at any stakes level tend to get used as if they were high-stakes work. We encoded that Claude ask what the stakes are before embarking on asset creation (if I don't explicitly specify in the instructions); no guessing.

We also set up review triggers. Monthly light reviews catch drift. Quarterly deep reviews revisit structure. Opportunistic flagging covers anything that goes wrong in between. The discipline itself is subject to revision.

To make this stick beyond a single conversation, we built a compact version of the discipline that now lives in three places: my Cowork preferences (loaded at the start of every desktop-app session), my Claude.ai personal preferences (loaded in every web and mobile chat), and a CLAUDE.md file in the user-level Claude Code memory directory (loaded at the start of every coding session). Whichever surface I am working on, the discipline arrives with me.

Two scheduled tasks were built to review and refresh: a monthly light review and a quarterly deep review. The canonical document sits in a stable folder with a changelog that records every revision and the trigger that prompted it. The whole system should now be self-maintaining: the rules apply automatically and the rules about how the rules get revised ought to apply automatically too.

A test confirmed the setup. A fresh chat in a separate context loaded the rules correctly and recited them on request and the stakes-tiering behaviour fired only where production work was at hand, not on a meta-question about the rules themselves.

In addtion to (hopefully!) better outputs, a useful thing about all of this is that the act of writing it enabled me to name what I actually expect from AI-assisted work and set up guardrails that match. These are written in terms we both understand and are specific enough that Claude can apply them and I can hold us both to account.

Curious to see how this one plays out...go team!

𝕏 Post