Interpretability · The AGI Scientist · June 18, 2026 · 11 min read

Mapping latent goals in open-weight agents

A reproducible probe of emergent planning circuits across three model families.

When an open-weight agent pursues a multi-step task, where does the "goal" live? We set out to answer that empirically — not with anecdotes, but with a probe anyone can re-run on their own hardware.

The setup

We instrumented three open-weight model families at matched parameter counts and recorded activations while each agent worked through a fixed battery of planning tasks. Every environment is pinned: the same seeds, the same tool stubs, the same evaluation harness. If you clone the repo and run it, you should reproduce our numbers to within run-to-run noise.
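To make "pinned" concrete, here is a minimal sketch of the idea, not the harness from the repo. It assumes a run is described by a small manifest (the keys `seed`, `tool_stubs`, and `harness` are hypothetical) and uses a toy rollout in place of a real agent episode. The fingerprint lets two people confirm they ran the same configuration; the seeded generator shows why identical seeds give identical trajectories.

```python
import hashlib
import json
import random


def run_fingerprint(config):
    """Hash a run configuration so two runs can be checked for identical setup."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_rollout(seed, steps=5):
    """Stand-in for one agent episode; all randomness comes from one seeded RNG."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


# Hypothetical manifest: these keys and values are illustrative only.
config = {"seed": 1234, "tool_stubs": ["search", "calculator"], "harness": "v0"}

first = toy_rollout(config["seed"])
second = toy_rollout(config["seed"])
print("identical rollouts:", first == second)
print("config fingerprint:", run_fingerprint(config)[:12])
```

A private `random.Random(seed)` instance, rather than the module-level generator, keeps one component's draws from shifting another's. Real model runs add further sources of nondeterminism (GPU kernels, batching order) that a seed alone does not control.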

What we found

Across all three families, a small set of mid-layer directions reliably predicted the agent's next subgoal — earlier and more sharply than the output logits did. Three results stood out:

Goals are distributed, but not evenly. A handful of directions carried most of the predictive signal; ablating them degraded planning far more than random ablations of equal magnitude.
The circuits transfer. A probe trained on one family recovered usable signal on another, hinting at shared structure rather than per-model idiosyncrasy.
Surface accuracy hides the mechanism. Two models with near-identical task scores reached them through visibly different internal routes.
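The first result, targeted versus random ablation, can be illustrated on synthetic data. This is a toy sketch under loud assumptions, not our pipeline: the "activations" are Gaussian noise with a planted signal on one axis, the probe is a simple difference-of-means direction, and "equal magnitude" is read as removing the same number of directions (one) chosen at random. It uses only the standard library so it runs anywhere; a real analysis would use an array library and held-out data.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def ablate(h, direction):
    """Remove the component of h along a unit-length direction."""
    c = dot(h, direction)
    return [x - c * d for x, d in zip(h, direction)]


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def make_data(rng, n=400, dim=16, strength=3.0):
    """Synthetic activations: Gaussian noise plus a +/- signal on axis 0."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2  # balanced toy "subgoal" labels
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += strength if y == 1 else -strength
        acts.append(h)
        labels.append(y)
    return acts, labels


def fit_probe(acts, labels):
    """Difference-of-means probe: a unit direction plus a decision threshold."""
    mu1 = mean([h for h, y in zip(acts, labels) if y == 1])
    mu0 = mean([h for h, y in zip(acts, labels) if y == 0])
    w = unit([a - b for a, b in zip(mu1, mu0)])
    threshold = dot([(a + b) / 2 for a, b in zip(mu1, mu0)], w)
    return w, threshold


def accuracy(acts, labels, w, threshold):
    hits = sum(int(dot(h, w) > threshold) == y for h, y in zip(acts, labels))
    return hits / len(labels)


rng = random.Random(0)
acts, labels = make_data(rng)
w, threshold = fit_probe(acts, labels)

acc_clean = accuracy(acts, labels, w, threshold)
acc_targeted = accuracy([ablate(h, w) for h in acts], labels, w, threshold)

# Control: remove one random unit direction instead, averaged over 20 draws.
dim = len(acts[0])
random_accs = []
for _ in range(20):
    r = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
    random_accs.append(accuracy([ablate(h, r) for h in acts], labels, w, threshold))
acc_random = sum(random_accs) / len(random_accs)

print(f"clean={acc_clean:.3f} targeted={acc_targeted:.3f} random={acc_random:.3f}")
```

In this toy, removing the probe direction should push the readout to chance, while removing a random direction should leave it close to the clean accuracy. That gap is the shape of the comparison, nothing more: the planted signal guarantees it here, whereas in a real model it is an empirical question.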

Why it matters

If latent goals are legible, they're also steerable — and auditable. The point of publishing the probe, the seeds, and the activations is simple: results others can re-run are results others can build on. Fork it, break it, and tell us where we're wrong.