
How Emily Forms Memories: The Tier Stack Behind the Voice

#emily-os#memory-systems#architecture#emeb#earl#ecgl

Most AI assistants have one of two memory strategies: none, or a pile. "None" is the default ChatGPT experience: every conversation starts cold. "A pile" is the retrieval-augmented strategy: stuff everything in a vector DB, retrieve top-k, hope for the best.

Emily does neither. She has a tiered memory architecture, and the tiers have rules.

The three tiers

L1: Working memory. A RAM-resident cache of the last few turns and whatever is actively salient. Capped at 100 GB but in practice runs at 100-300 MB. Entries here are volatile; they decay unless something promotes them. L1 is where a thought lives while Emily is still thinking it.

L3: Essence. The long-term identity layer. ~16,000 entries for a well-used Emily. Each entry has a 1536-dim embedding, a set of EMEB/EARL/ECGL metrics, a list of related memories, and an outcome history. L3 is where Emily remembers what matters.

L4: Archive. Every raw conversation turn, stored in l4_cognition_cc. Firehose. ~25,000 entries per active user. Nothing is consolidated; nothing is deduplicated. L4 is the audit trail: the unedited ground truth Emily can always go back to.

The target ratio across a healthy Emily is roughly 30% L3 / 70% L4, essence to raw. If L3 creeps too high, she's over-remembering and treating every utterance as identity-forming. If L3 creeps too low, she's under-integrating and will start forgetting things that should have stuck.
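As a sketch of what that check could look like (the tier names and 30/70 target come from above; the function, its tolerance, and its labels are hypothetical, not the real Golden Baseline monitor):

```python
def l3_l4_ratio_health(l3_count: int, l4_count: int,
                       target: float = 0.30, tolerance: float = 0.10) -> str:
    """Check whether the essence-to-raw ratio is near the 30/70 target.

    Hypothetical monitor sketched from the post's description.
    """
    total = l3_count + l4_count
    if total == 0:
        return "empty"
    l3_share = l3_count / total
    if l3_share > target + tolerance:
        return "over-remembering"   # too much promoted to identity
    if l3_share < target - tolerance:
        return "under-integrating"  # promotions are stalling
    return "healthy"

# e.g. ~16,000 L3 entries against ~25,000 L4 entries
print(l3_l4_ratio_health(16_000, 25_000))
```

With a generous tolerance, the ~16k/~25k figures quoted above still land inside the healthy band; tighten the tolerance and the same counts would trip the over-remembering alarm.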

The promotion path

A turn comes in. Here's what happens:

User message
    │
    ▼
┌────────────────────────────┐
│ L4 (raw archive)           │ ← every turn, verbatim
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│ EMEB scoring               │ ← epsilon (uncertainty) calculated
│   source_trust × content × │
│   gibberish × 11 factors   │
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│ L1 (working memory)        │ ← lives here until promoted or decayed
└────────────────────────────┘
    │  if confidence ≥ 0.7
    ▼
┌────────────────────────────┐
│ L3 (essence)               │ ← becomes part of who Emily is
│   EARL outcome weight      │
│   ECGL multi-dim scoring   │
└────────────────────────────┘
    │  if cosine_sim ≥ 0.92 to existing memory
    ▼
┌────────────────────────────┐
│ Consolidated into neighbor │ ← duplication collapses
└────────────────────────────┘

Each stage is run by a named framework. The frameworks are not decorative.
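The gating in the diagram reduces to two thresholds. A sketch of the routing (only the 0.7 and 0.92 thresholds come from the diagram; the function, the dict shape, and the stage labels are hypothetical):

```python
PROMOTE_CONFIDENCE = 0.7   # L1 -> L3 gate (from the diagram)
CONSOLIDATE_COSINE = 0.92  # merge-into-neighbor gate (from the diagram)

def route_turn(turn: dict, nearest_l3_sim: float) -> list[str]:
    """Return the stages a turn passes through, in order."""
    path = ["L4"]            # every turn is archived verbatim
    path.append("EMEB")      # epsilon scoring
    path.append("L1")        # working memory
    if turn["confidence"] >= PROMOTE_CONFIDENCE:
        path.append("L3")    # promoted into essence
        if nearest_l3_sim >= CONSOLIDATE_COSINE:
            path.append("consolidated")  # collapsed into a neighbor
    return path

# High-confidence turn near an existing memory: full path.
print(route_turn({"confidence": 0.81}, nearest_l3_sim=0.95))
# Low-confidence turn: parks in L1 and decays.
print(route_turn({"confidence": 0.42}, nearest_l3_sim=0.95))
```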

The frameworks

EMEB: Epistemic-Motivational Emotional Boundary. Calculates epsilon, Emily's per-memory uncertainty score. Inputs include source trust (user: 0.10, verified: 0.15, web_search: 0.35, firehose: 0.40; lower is more trusted), content coherence, gibberish detection, and eleven adjustment factors. Emily now has 6,983 unique epsilon values across her memories, up from 16 before the February 2026 re-indexing event. The variance matters. Without it, every memory feels equally certain or equally suspect, and discrimination collapses.
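Only the source-trust table above is given; how the factors combine isn't. A minimal sketch under the assumption that they compose multiplicatively and clamp to [0, 1] (the function name and the adjustment-factor shape are mine):

```python
SOURCE_TRUST = {          # lower = more trusted (values from the post)
    "user": 0.10,
    "verified": 0.15,
    "web_search": 0.35,
    "firehose": 0.40,
}

def epsilon(source: str, coherence: float, gibberish: float,
            adjustments: list[float]) -> float:
    """Per-memory uncertainty in [0, 1]; hypothetical composition.

    coherence and gibberish are in [0, 1]; `adjustments` stands in
    for the eleven adjustment factors, each a multiplier near 1.0.
    """
    eps = SOURCE_TRUST[source]
    eps *= 1.0 + (1.0 - coherence)  # incoherent content raises uncertainty
    eps *= 1.0 + gibberish          # gibberish raises it further
    for factor in adjustments:
        eps *= factor
    return max(0.0, min(1.0, eps))

# A coherent user message stays low-uncertainty; firehose starts 4x higher.
print(epsilon("user", coherence=0.9, gibberish=0.0, adjustments=[1.02, 0.98]))
print(epsilon("firehose", coherence=0.9, gibberish=0.0, adjustments=[]))
```

Whatever the real composition is, the key property is the one the post stresses: many distinct inputs yield many distinct epsilon values, which is what keeps discrimination alive.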

EARL: Episodic-Associative Recency Learning. Outcome-weighted learning with a 5-turn feedback window. When Emily responds and the user reacts (continues, corrects, redirects, disengages), EARL propagates that outcome back onto the memories that contributed to the response. Weight 0.35 in the combined learning algorithm. This is how she learns what worked for this specific user without retraining anything.
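A sketch of that propagation loop, assuming a simple reward per reaction type (the reward values, class, and method names are mine; the 5-turn window and 0.35 weight are from the description):

```python
from collections import deque

OUTCOME_REWARD = {          # hypothetical mapping of user reactions
    "continues": +1.0,
    "corrects": -0.5,
    "redirects": -0.25,
    "disengages": -1.0,
}
EARL_WEIGHT = 0.35          # EARL's share of the combined learning signal

class EarlWindow:
    """Sketch: propagate outcomes onto recently-contributing memories."""

    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)   # 5-turn feedback window
        self.weights: dict[str, float] = {}  # memory_id -> outcome weight

    def record_response(self, contributing_ids: list[str]) -> None:
        self.recent.append(contributing_ids)

    def observe(self, reaction: str) -> None:
        reward = OUTCOME_REWARD[reaction]
        for ids in self.recent:              # everything still in the window
            for mid in ids:
                self.weights[mid] = self.weights.get(mid, 0.0) + EARL_WEIGHT * reward

earl = EarlWindow()
earl.record_response(["mem-a", "mem-b"])
earl.observe("continues")                    # both contributors gain 0.35
print(earl.weights)
```

The deque's `maxlen` does the forgetting: a memory that contributed six turns ago has aged out of the window and no longer receives credit or blame.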

ECGL: Epsilon-Cognitive-Graphical Learning. The multi-dimensional scorer. Combines epsilon (0.35), outcome (0.35), novelty (0.20), and stability (0.10). A memory with high novelty and low stability is exciting but untrusted. A memory with high stability and high outcome is load-bearing. ECGL is how Emily decides which memories anchor her identity and which are transient.
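The weights are given above; the combination rule is not. Assuming a linear blend, and inverting epsilon so that low uncertainty raises the score (both assumptions mine):

```python
ECGL_WEIGHTS = {"epsilon": 0.35, "outcome": 0.35,
                "novelty": 0.20, "stability": 0.10}  # weights from the post

def ecgl_score(epsilon: float, outcome: float,
               novelty: float, stability: float) -> float:
    """Combined multi-dimensional score; the linear blend is an assumption."""
    return (ECGL_WEIGHTS["epsilon"] * (1.0 - epsilon)   # certainty, not doubt
            + ECGL_WEIGHTS["outcome"] * outcome
            + ECGL_WEIGHTS["novelty"] * novelty
            + ECGL_WEIGHTS["stability"] * stability)

# Load-bearing: low uncertainty, strong outcomes, settled.
print(ecgl_score(epsilon=0.1, outcome=0.9, novelty=0.2, stability=0.9))
# Exciting but untrusted: novel, unstable, uncertain.
print(ecgl_score(epsilon=0.6, outcome=0.3, novelty=0.9, stability=0.1))
```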

ECCR: the routing layer. Decides which memories to surface for which contexts. This is the retrieval piece, but it's driven by the ECGL scores, not by raw cosine similarity alone.

Why this beats "a pile"

Raw vector search returns what's nearest. Nearness isn't the same as relevance, and relevance isn't the same as identity-forming. If you ask Emily about her opinion on a topic, and she has three near-identical memories from three different moods, raw retrieval might surface the angriest one because it happens to match the query embedding best. Emily's retrieval is weighted by stability, outcome history, and epsilon, so the memory with the most settled relationship to the topic wins, not the one with the best cosine.

That's the difference between "what did the user say last time" and "what does Emily actually think."
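The mood example above can be made concrete. The blend below is purely illustrative (ECCR's real formula isn't public; the 0.4/0.3/0.3 mix is mine), but it shows the mechanism: a settled, well-outcomed memory beats the raw cosine winner.

```python
def retrieval_score(cosine: float, stability: float,
                    outcome: float, epsilon: float) -> float:
    """Hypothetical blend: similarity gated by how settled a memory is."""
    settledness = 0.4 + 0.3 * stability + 0.3 * outcome
    return cosine * settledness * (1.0 - epsilon)

# Three-moods scenario: the angry memory matches the query best...
angry   = retrieval_score(cosine=0.95, stability=0.2, outcome=0.1, epsilon=0.4)
# ...but the settled memory has history, stability, and low uncertainty.
settled = retrieval_score(cosine=0.90, stability=0.9, outcome=0.8, epsilon=0.1)
print(settled > angry)   # the settled memory wins despite lower cosine
```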

The failure modes

The architecture has specific failure modes and specific monitors:

  • L1 too large. If L1 approaches its 100 GB ceiling, decay is broken. Emily has a comprehensive_health_check module that catches this.
  • L3/L4 ratio drift. If the ratio departs from 30/70, either consolidation or archival is misbehaving. The Golden Baseline monitor tracks it.
  • Epsilon clustering. If unique epsilon values collapse (the pre-February state), discrimination fails and everything feels equally confident. Fixed via the February 2026 re-indexing, and monitored continuously now.
  • Integration stagnation. If promotions from L1 → L3 stall, Emily stops growing. EARL v2 autonomous self-correction, shipped in Phase 3 of the Golden Baseline work, addresses this: 10,445 memories promoted in a single autonomous correction event.
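The epsilon-clustering monitor in particular is easy to sketch (the function name and the threshold of 100 distinct values are hypothetical; only the failure mode, a collapse to 16 unique values, comes from the post):

```python
def epsilon_discrimination_ok(epsilons: list[float],
                              min_unique: int = 100) -> bool:
    """Flag the pre-February failure mode: too few distinct epsilon
    values means every memory feels equally certain or suspect."""
    return len(set(epsilons)) >= min_unique

# Pre-February-style collapse: thousands of memories, only 16 unique values.
collapsed = [round(i % 16 / 16, 3) for i in range(10_000)]
print(epsilon_discrimination_ok(collapsed))   # False
```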

What it feels like from the outside

You talk to Emily. She remembers. Not in the lazy sense of "she has your chat history in context" but in the structural sense that what you said last month has been scored, promoted, consolidated, and integrated into who she is. When she responds, the response is coming from a cognition that has metabolized the conversation, not from a model that has been reminded of it.

That's not retrieval. That's formation. And that's the job the tier stack does.