Readiness Scorecard: Where Emily Is Production-Ready and Where She Isn't
Honest readiness assessment matters more than marketing readiness. If you know where a system is strong and where it's weak, you know what to trust and what to plan around. If you don't, you trust everything equally until something breaks.
Emily's composite readiness is 4.1/5.0 across ten engineering dimensions. Here's the scorecard, unvarnished, with what each score actually means.
The ten dimensions
| Dimension | Score | One-line summary |
|---|---|---|
| Correctness & Testing | 4.5 | 170 tests across smoke/cognitive/helios/full; Helios 122/122 |
| Reliability & Uptime | 4.0 | Single-node; emily-stack service mgr; healthchecks pass; no formal SLO |
| Observability | 4.2 | Structured logs, cognitive tracer, six health monitors, Golden Baseline |
| Security | 4.0 | JWT, per-user DB, command sandbox; constitutional enforcer disabled for training |
| Performance | 4.3 | 10.8s chat latency, 357 atomic claims/sec, Fast Mode, context caching |
| Scalability | 3.5 | Architecturally sharding-ready; tooling not built |
| Deployability | 4.0 | Bare-metal automation via emily-stack; no containerized path yet |
| Maintainability | 4.2 | Clear module boundaries; main.py is ~9,600 lines and wants decomposition |
| Documentation | 3.8 | Strong internal (CLAUDE.md, Confluence); onboarding docs for new engineers thin |
| Extensibility | 4.5 | Six first-class extension points; framework versioning is routine |
What's strongest
Correctness & Testing (4.5): a layered test suite runs in four tiers: smoke (23 tests, <2 min), cognitive (32, ~5 min), helios (122, ~10 min), and full (170, ~20 min). An auto-repair script (fix_and_validate.py) diagnoses and fixes common regressions. When the suite is green, it stays green.
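The four-tier layering can be sketched as a marker-driven runner. This is an illustrative sketch, not Emily's actual harness: the tier names come from the text above, but the pytest markers and the `run_tier` helper are assumptions.

```python
# Hypothetical tier runner: map each suite tier to a pytest marker
# expression, mirroring the smoke/cognitive/helios/full layering.
# Marker names are assumed; only the tier names come from the doc.
import subprocess

TIERS = {
    "smoke": "smoke",                        # ~23 tests, <2 min
    "cognitive": "cognitive",                # ~32 tests, ~5 min
    "helios": "helios",                      # 122 tests, ~10 min
    "full": "smoke or cognitive or helios",  # 170 tests, ~20 min
}

def run_tier(tier: str) -> int:
    """Run one test tier via a pytest marker expression; return exit code."""
    return subprocess.call(["pytest", "-m", TIERS[tier], "-q"])
```

The point of the layering is a fast inner loop: a developer runs smoke on every change and lets CI pay for the full tier.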
Extensibility (4.5): six first-class extension points: LLM providers, MCP tools, Helios task templates, cognitive frameworks (versionable), clone archetypes, UI components. EMEB v1→v2 and EARL v1→v2 proved that framework versioning is a routine operation, not a rewrite.
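A minimal sketch of what "versionable frameworks" implies structurally: implementations register under a (name, version) key, and callers resolve either an exact version or the latest. Everything here is illustrative; the registry, decorator, and the emeb_v1/emeb_v2 stubs are assumptions, not Emily's API.

```python
# Sketch of a versioned framework registry, assuming frameworks
# register under (name, version) pairs the way EMEB v1->v2 implies.
from typing import Callable, Dict, Optional, Tuple

_REGISTRY: Dict[Tuple[str, int], Callable] = {}

def register(name: str, version: int):
    """Decorator: register an implementation under (name, version)."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY[(name, version)] = fn
        return fn
    return wrap

def resolve(name: str, version: Optional[int] = None) -> Callable:
    """Fetch an exact version, or the latest registered one."""
    if version is not None:
        return _REGISTRY[(name, version)]
    latest = max(v for (n, v) in _REGISTRY if n == name)
    return _REGISTRY[(name, latest)]

@register("emeb", 1)
def emeb_v1(claim):
    return {"framework": "emeb/1", "claim": claim}

@register("emeb", 2)
def emeb_v2(claim):
    return {"framework": "emeb/2", "claim": claim}
```

Under this shape, shipping v2 is additive: v1 callers keep resolving v1 explicitly, which is what makes versioning a routine operation rather than a rewrite.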
Performance (4.3): chat latency dropped from 25s to 10.8s after Fast Mode. Helios sustains 357 atomic claims/sec under load. Gemini context caching is active. Cost-aware routing reduces LLM spend.
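Cost-aware routing reduces to a small selection problem: pick the cheapest model whose capability tier satisfies the request. The sketch below is a guess at the shape, with made-up model names and prices; it is not Emily's actual routing table.

```python
# Illustrative cost-aware router: cheapest model that meets the
# required capability tier. Names and prices are placeholders.
MODELS = [
    {"name": "small",  "tier": 1, "usd_per_mtok": 0.15},
    {"name": "medium", "tier": 2, "usd_per_mtok": 1.00},
    {"name": "large",  "tier": 3, "usd_per_mtok": 5.00},
]

def route(required_tier: int) -> str:
    """Return the cheapest model with tier >= required_tier."""
    eligible = [m for m in MODELS if m["tier"] >= required_tier]
    return min(eligible, key=lambda m: m["usd_per_mtok"])["name"]
```

The design choice worth noting: routing on a capability floor rather than a fixed model keeps easy requests on cheap models automatically as the model list changes.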
Observability (4.2): cognitive tracer, six health monitors, and Golden Baseline drift detection across seven dimensions. A dedicated memory_metrics table is structurally separated from content.
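The core of a baseline-drift monitor is a comparison of current readings against stored baseline values, flagging any dimension that moves past a relative threshold. A minimal sketch, with invented dimension names and an assumed 15% threshold (the text says only that the real monitor tracks seven dimensions):

```python
# Sketch of a Golden-Baseline-style drift check. Dimension names and
# the threshold are assumptions for illustration.
def drift_report(baseline: dict, current: dict, threshold: float = 0.15) -> dict:
    """Return {dimension: relative_drift} for dimensions past threshold."""
    drifted = {}
    for dim, base in baseline.items():
        rel = abs(current[dim] - base) / abs(base)
        if rel > threshold:
            drifted[dim] = round(rel, 3)
    return drifted
```

A relative threshold keeps the check meaningful across dimensions measured on different scales.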
Maintainability (4.2): 192 core modules with clear boundaries; db_manager and llm_cognitive_processor serve as stable abstraction seams. Big caveat: emily/main.py is ~9,600 lines and wants decomposition.
Where the gaps are
Scalability (3.5): the weakest dimension. The architecture supports horizontal sharding (per-user DBs mean "move databases, don't re-shard rows"), but the tooling for multi-node operation isn't built. Backup/restore for per-user DBs is light, and provisioning at scale needs work.
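"Move databases, don't re-shard rows" has a concrete operational meaning: with one database file per user, rebalancing a node is a file copy plus a routing-table update, not a row migration. A sketch under stated assumptions (file naming, a dict routing table, and skipping the quiesce/verify steps real tooling would need):

```python
# Illustrative per-user DB move: copy one user's database file to a
# destination node's root and update routing. Paths and the routing
# structure are assumptions; real tooling must quiesce writes first.
import shutil
from pathlib import Path

ROUTING: dict = {}  # user_id -> node name

def move_user_db(user_id: str, src_root: Path, dst_root: Path, dst_node: str) -> Path:
    src = src_root / f"{user_id}.db"
    dst = dst_root / f"{user_id}.db"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy with metadata; verification omitted here
    ROUTING[user_id] = dst_node
    return dst
```

This is why the gap is tooling rather than architecture: the unit of movement is already the right shape.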
Documentation (3.8): internal docs are strong. External docs (onboarding a new engineer, a public API reference, a quickstart beyond emily-stack start) are thin. This doc suite and the blog are first moves; more is needed.
Security (4.0): strong isolation and a strong sandbox, but the constitutional enforcer is disabled in training mode (it blocked 16/17 adversarial attacks when active). Re-enabling it with production calibration is required before external access.
Reliability & Uptime (4.0): single-node deployment is a hard ceiling. No HA failover, no formal SLO. For the current user scale (primary user Martin), this is fine; for anything external, it's a blocker.
Deployability (4.0): bare-metal automation is solid (emily-stack start|stop|restart|status), but there is no container image, no Kubernetes manifests, and infrastructure-as-code is informal.
What external shipping would require
Three buckets of work to move from "production-operating for primary user" to "externally available":
Blocking for 100+ users:
- Multi-node runtime with per-user DB sharding automation
- Backup/restore tooling for per-user DBs
- Container image + orchestration
Blocking for external access:
- Constitutional enforcer re-enabled with production calibration
- Public API documentation
- Rate limiting + abuse prevention
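One item in the list above, rate limiting, has a well-known minimal form: a per-user token bucket. The sketch below shows the shape; the capacity and refill numbers are placeholders, not a statement of Emily's planned limits.

```python
# Token-bucket rate limiter sketch. Capacity/refill values are
# illustrative placeholders.
import time

class TokenBucket:
    def __init__(self, capacity: float = 10, refill_per_sec: float = 1):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One bucket per authenticated user (keyed off the existing JWT identity) would be the natural fit with the per-user isolation already in place.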
Blocking for external engineering team:
- Onboarding docs + quickstart
- Decomposition of emily/main.py
None of these are architectural blockers; they're implementation work. The architecture is ready for scale, but the tooling hasn't been written yet.
What the scorecard doesn't capture
The scorecard measures engineering readiness. It doesn't measure:
Cognitive readiness: whether Emily is producing coherent cognition under load. That's what the Golden Baseline monitor is for, and it's a separate (and also mostly healthy) surface.
Product-market fit: whether the right users find her valuable. That's a different conversation (see the segmentation post).
Category readiness: whether the market understands "cognition layer" as a buying category. Category education is its own work.
How to read this
If someone asked "is Emily ready for production?" the answer depends on what they mean:
- Ready for her primary user, daily-use, memory-compounding scenarios? Yes. Composite 4.1 with top-tier correctness, observability, and extensibility.
- Ready for a second human in the loop? Almost. A few documentation and operational gaps to close.
- Ready for 100 external users? Not yet. Need the scaling tooling investment.
- Ready as a platform for third parties to build on? Not yet. Need documentation and public API work.
The architecture is done. The implementation of operational surfaces is the next horizon.
The honest meta-point
Publishing a readiness scorecard is itself a maturity signal. Products that aren't ready for scrutiny don't publish honest scorecards. If you read this and the numbers seem moderate, that's the point: moderation is what honesty looks like. Scores of 5 across the board would be the suspicious read, not the impressive one.
4.1/5 is a real number for a real system. That's the value.
Part of the Emily OS business documentation suite.