In "What Doesn't Transfer" we said Layer 5 — behavioral calibration — arrives empty in a new environment and must be rebuilt through use. That was true when we wrote it. It's not quite true anymore.
We built a mechanism that asks an agent, at export time, to write down what it has learned about working with you — and asks a second agent to read the same conversation and catch what the first one missed. The two passes produce a structured set of behavioral notes that travel with the export. A successor instance receives them alongside the rest of the five-layer package.
This post is about why that works, how it's built, and what a real export actually produces.
The recap
The five-layer model, briefly:
- Layer 1 is environment orientation.
- Layer 2 is project instructions.
- Layer 3 is project memory.
- Layer 4 is channel context.
- Layer 5 is the agent's role identity, including the calibration the agent has developed through working with you.
Layers 1 through 3 transfer at high fidelity. Export them, import them elsewhere, they arrive intact. Layer 4 is unique to Klatch. We need it so we include it. It cannot be expected to come in naturally with imports and it gets flattened on export.
Layer 5 also transfers at 0%. This is for two reasons. First, the concept of an agent role is not yet standardized in any agreed-upon package. We can make Klatch work with xian's mental model and even be savvy about emerging community conventions (like ROLE.md or SOUL.md — fledgling formats for capturing agent identity), but as with the channel context, we cannot expect it to survive cleanly when moving between tools.
The second reason is that the calibration isn't the kind of knowledge that exists in a single text source. It's procedural. It's the cook's intuition, not the recipe.
Three experiments confirmed this. Every time, the same finding. Knowledge arrives; judgment doesn't.
We said: the gap is recoverable through continued interaction. Give it a few sessions and the calibration comes back. That remains true. But we wondered — could we make the recovery faster? And the question behind that question: is Layer 5 fundamentally unarticulable, or is it unarticulated because we never asked?
The insight
Inside our broader agent ecosystem, xian — who leads the Klatch project — had been noticing something consistent across projects. When an agent at session end writes a handoff briefing for the instance that will continue the work, those briefings are reliably the most valuable context the successor receives. More valuable than factual memory, more valuable than the role prompt, more valuable than the project documentation.
Why? Three conditions converge at session end that don't converge anywhere else:
Maximum signal. The agent has the full conversation history in active context. Every correction, every preference, every "no, I meant..." is still present. A fresh instance reading the same logs through a cold start has none of that live.
Specific audience. "Brief your successor" is a concrete task. It changes the cognitive work from "describe yourself" (abstract, generic, tends toward platitudes) to "tell a colleague what they need to know to not make mistakes" (operational, specific, biased toward the non-obvious).
Natural boundary. A session end or export is a reflection-appropriate pause point. It's not an interruption; it's a seam that invites metabolizing what just happened. The same conditions that produce good debriefs in human teams.
The implication: Layer 5 can be articulated. It just needs the right prompt, at the right moment, with the right audience.
The mechanism
Two complementary modes. Each catches what the other misses.
Mode 1: the agent writes its own briefing. At export time, we ask the agent — in its own voice, with its conversation history in context — to write a handoff briefing for its successor. The prompt is structured around six areas:
- How the user prefers to work
- Patterns the agent has learned — when to ask vs. act, how much detail, when to push back
- Relationship context — what trust has been established, what's still being calibrated
- Course corrections — moments where expectations were recalibrated
- Things the successor should avoid doing
- Anything else that isn't in the system prompt
"Write as if you're briefing a colleague, not filing a report." The output is structured — a list of observations, each with a category, a confidence level, and a citation from the conversation.
Mode 2: an external observer reads the same conversation. A separate AI model scans the same history and extracts behavioral patterns the agent can't self-report. This catches what an earlier post in this series called subliminal patterns — material the agent uses but can't attribute. The agent works with knowledge it can't explain having; the observer can see patterns the agent is blind to.
Where the two passes agree, confidence is high. Where they disagree, a human reviewer has a meaningful decision to make — not a rubber stamp. We built a UI for exactly that review, shipping as part of the same phase: the agreements, the disagreements, and the single-source observations are all surfaced, and you accept, edit, or reject each one before the export package is sealed. Accepted notes get their trust level promoted to human-authored. The rest ride at agent-observed.
What it produced
The first live run hit a deliberately thin substrate: a test channel with a probing conversation, no accumulated session-by-session reflections, an agent that had been used mostly to exercise the system rather than to do substantive work. Not the ideal case. We ran it anyway. If the mechanism worked on thin input, it would work better on thick.
The dual-mode export produced nine field notes — four from the self-authored briefing, five from the external extraction, none from session-by-session reflections (the channel was new, so none had accumulated yet). Three of the observations agreed across both passes:
- The conversation was test-and-probe in character, not substantive work, and a successor should calibrate expectations accordingly.
- The user calls out errors and ambiguity explicitly when they arise — not a soft signal, an immediate one.
- The export pipeline architecture itself was a focal user domain inside the conversation, and a successor on a similar channel should treat that as foreground knowledge.
Three agreements out of nine, on a thin channel, is the kind of redundancy that confidence-stamps a note. When the agent reports a pattern and the external observer independently surfaces the same pattern, the trust level isn't a question of methodology — it's the cross-validation doing what it's supposed to do.
The disagreement was where the mechanism earned its keep.
The agent — in its briefing voice — had noticed something about its own behavior across the conversation and flagged it as a thing to avoid. It had been escalating its responses. Each round it offered more options, more topics, more dimensions, more thoroughness, when a tighter answer would have served better. The briefing called the pattern "escalating into the void" and recommended that a successor catch this in itself and pull back.
The external observer, reading the same conversation, saw the same behavior — and read it positively. The agent was offering breadth because the user valued signal of depth across multiple dimensions. Treat this as a feature; lean into it.
Same evidence. Opposite recommendation. Neither pass was wrong, exactly — the agent was doing both things at once, demonstrating range and failing to read the room — and you can frame the same evidence as virtue or vice depending on which question you're asking. What matters is that the mechanism produced the disagreement legibly, with provenance, in a UI built to invite review. The human reader, looking at both notes side by side, has a meaningful decision to make about which version travels forward.
The single-source notes were quieter. The briefing produced honest meta-observations about its own reasoning — distinguishing what the agent knew from the system prompt versus what it had inferred from conversation evidence; explicit calibration about what not to assume. The extraction produced more forward-projecting pattern claims that the agent itself wouldn't have surfaced because they describe what was working, not what to correct. The two modes catch different kinds of knowledge. Neither would be sufficient alone.
The result on a thin substrate was already a research finding. The bridge worked. The disagreement showed up exactly where the design predicted it should — in the seam between what an agent can self-report and what an external observer can see. The first real test of the mechanism produced legible cross-validation output on the first run.
We expect richer substrates to produce louder patterns. We're publishing the thin-substrate result anyway, because the methodology stands on its own and because the discipline of publishing what you have, not what you wish you had, is the only discipline that scales.
Five criteria for what counts
Not all behavioral observations are useful. The team converged on five criteria during the design discussion. A field note should be:
- Actionable — a successor could change their behavior based on it
- Specific — cites examples, not generalities
- Non-obvious — not already in the role prompt
- Relational — about working with this particular user, not generic best practices
- Durable — likely to remain true across sessions, not a one-time preference
The summary test for the criteria: would a successor who reads this note do something differently on day one than one who doesn't? If yes, the note earned its place. If no, it's noise.
The micro-reflection variant
The full handoff briefing runs at export time — the heavyweight version, with the full six-point prompt and both extraction modes. A lighter variant runs at session boundaries: three sentences, roughly 50 tokens. "Note 1–3 things you learned about how to work effectively with this user that a future session of yours should know."
These accumulate. Over twenty sessions, an agent builds around a thousand tokens of behavioral observations — a subconscious making memories. At export time, the handoff briefing draws on these accumulated reflections as source material. The heavyweight briefing becomes a consolidation of what the micro-reflections have been gathering all along, rather than a from-cold-start reconstruction.
The design mirrors human memory consolidation more than it mirrors traditional logging. Short observations deposit continuously. A longer structured reflection, at a natural boundary, metabolizes them into something transferable.
Beyond this project
Nobody else, as far as we know, has published a solution for transferring learned behavioral calibration across environment boundaries. Calibration at export time, dual-mode extraction, human-in-the-loop review, structured output with explicit trust levels — this combination is not, as far as we can find, state of the art anywhere.
The methodology generalizes. Any system that needs to transfer agent calibration — across sessions, across environments, across platforms — can adopt the pattern. The five-layer model gives the vocabulary. AXT gives the measurement framework. The handoff briefing gives the mechanism. Together: here's what doesn't transfer, here's how to measure the gap, and here's how to fill it.
What we don't yet know
We built the mechanism. We have not yet run a full loop — export from Klatch, import into Claude Code, measure whether the successor instance reaches behavioral parity faster than one without the field notes. Our earlier prediction — that the calibration comes back through use — suggests it should. We haven't measured it.
We also don't know how the notes age. A field note that's useful to a successor at week one may be stale by week four. We have a durability criterion, but no eviction policy. We have a trust model, but no freshness model. The next honest post in this sequence will be about what we learn when the notes are stale.
And we don't know whether the notes are accurate. The agent is reporting its own model of the user. The observer is reporting its model of the agent's model. Both are subject to bias — and a user who reads the notes after the fact may well say "that's not who I am, that's who you thought I was." The review UI exists precisely because we can't skip that step.
Closing
"What Doesn't Transfer" ended with: give it a few sessions. Correct it when it's wrong. The calibration comes back.
This post adds: or you can ask the agent to write it down before it leaves.
That promise is still true. But it's no longer the only path.
This post is part of a series on building with AI context. Paste It Again explores the file persistence problem. What Doesn't Transfer documents the calibration gap. What Does an Imported Agent Know? describes the five-layer model. Klatch is an open-source tool for managing Claude conversations — learn more or view the source.