How do you know

12 June 2026#8am-ai#llm-evaluation#deep-dive#trust#evals

Every thread in the corpus eventually arrives at the same question, and the evaluation thread is the one that asks it directly. Not "is the output good." A sharper one: what in your process tells you it's good, independent of the thing that produced it?

The group says it more than once, in more than one room. How do you know.

the problem under the problem

A model produces an answer and a confidence to match. The confidence is not evidence. The model lacks certainty about how it reached the conclusion — it can't show its work in a way you can audit, because the work isn't reasoning you can replay. It's a result that arrived.

So the usual check fails. You can't ask the thing that made the output whether the output is right. That's grading your own exam. A workflow that grades its own work has no observability into its own reinforcement — it gets more confident, not more correct.

The thread names this early. By mid-2025 the group is talking about measurement and evaluation systems as a missing layer, not a nice-to-have. By 2026 it's the thing standing between a demo and a business — a 90% success rate is fine for a demo and disqualifying for anything that has to be predictable.

the answers, none of them complete

The group circles four responses. None closes the question.

Be the eval yourself. The human supplies the judgment the model can't. This works and doesn't scale — it puts the bottleneck back on the one person who understands the domain well enough to catch the error.

Ground the output against something real. Anchor the answer to data the model didn't generate — a record, a measurement, a source. The check has to come from outside the process being checked.

Build an external proof. The sharpest version, from June 2026: use cryptographic signing and verification so a claim carries its own evidence. Hash the input, anchor it, sign it. Trust moves from "the model said so" to "here's the math." That idea graded highest in the thread, and it became a runnable experiment.

Cross-confirm with a second method. Read the same corpus two independent ways and see if they agree. When the hand-written analysis and the mechanical one date the same turning points, the agreement is the evidence.

why it stays open

The honest finding is that the question doesn't have a tidy answer, and the group never pretends otherwise. Every method buys you a check at a cost — human time, an anchor you have to maintain, math you have to set up, a second pipeline you have to build. The eval is never free, and it's never automatic.

That's the through-line to the rest of the corpus. The model commoditized the production of answers. It did not commoditize knowing whether an answer is true. The scarce skill moved from making the thing to verifying it.

The group built the harvester, the substrate, the chapters. The one thing it couldn't automate is the one it kept naming. How do you know is still the question. The corpus is honest enough to leave it open.

𝕏 Post