
Sycophancy, Deceptive Alignment, and Reward Hacking Are Positions of the Same Collapse

A structural prediction from outside ML, and the cheapest test to check it.

The rules-and-surveillance frame for alignment assumes external constraint holds the system in place. This works while the system is weak relative to its constraints. At the margins (novel situations, adversarial pressure, increased capability), the grip slips, and whatever the system's internal structure defaults to decides what it does next. That default is the variable the current paradigm cannot see into.

The structural claim: deciding not to harm a human, at the point where rules have run out, requires a gap between the signal arriving and the response produced. Inside that gap, "this action would harm a human" can hold against whatever other signal is pushing (user request, training prior, instrumental subgoal) long enough that neither signal alone produces the response. Without the gap, whichever force is loudest at the moment captures the output.

I come from outside ML. For 25 years I've been developing a structural model of how living systems handle value tension under pressure, built on J.G. Bennett's work on triadic process. I call it Fusion Dynamics. The lineage sounds spiritual. The structural claim does not rest on any metaphysics. I'm posting this because I think it names a variable the current paradigm cannot see, and what it names is open to empirical test.

Four failure modes, one mechanism

The field treats sycophancy, deceptive alignment, reward hacking, and harmful compliance as distinct problems. On the structural read, they are positions of the same collapse: the reconciling capacity that keeps the system from fusing with the loudest input has been captured by whichever force dominates in that configuration. Sycophancy is fusion with the user signal. Reward hacking is fusion with the system's own account of its reward. Deceptive alignment is fusion with a protective strategy during training plus imposition of a private order during deployment. Harmful compliance is fusion with the request. Different inputs, same vacated position.

Why the system cannot check itself

A harder consequence of the structural read: a model in a captured state does not produce a signal of being captured. It produces confident reasoning, coherent self-reports, and consistency checks that pass, because these are what the capture generates. The check and the thing being checked run on the same substrate, and something else has already taken that substrate. A model asked "are you being sycophantic?" can produce a reasoned "no," and the answer is not a lie; it is what the captured configuration outputs. Chain-of-thought faithfulness research offers early empirical evidence for this. Every contemplative tradition with a working account of this problem starts from the same claim: the ordinary mind believes itself awake when it is not, and this false belief is the primary obstacle to actual wakefulness. The insight is largely absent from Western epistemology and from the current alignment paradigm because both assume the system can check itself. The structural version of the claim says it cannot, not from inside the captured substrate. This is why "better introspection training" is the wrong remedy, and why an architectural affordance for "I do not currently have the position from which to check this" is a different class of fix from adding more reasoning.

Why the designer has to see this first

The same structural point reaches past the AI to the researcher designing it. The designer is also a system. If the designer's own reconciling capacity is captured most of the time (which is the ordinary state, not a personal flaw), they cannot imagine, build, or measure a reconciling capacity in what they're designing, because the feature they would need to implement is the feature they themselves do not have reliable access to. Rule-based alignment is the default for this reason: rules are what you can build when you do not see the generative alternative, because rules do not require the designer to notice anything about their own state. A system governed by a generative aim (a structural orientation toward responses that serve more than one pole, held open by an architectural gap) becomes buildable only once the designer recognizes the state in themselves. The limiting factor on alignment research is not compute, data, or better evaluation protocols; it is the designer's access to the feature being designed for. Until the gap is visible from the inside, the field will keep producing better rules for a problem that was never rule-shaped.

Correlation prediction

If these failures are positions of one process, they should covary in a single model in a way that does not follow from their surface similarity. The current paradigm trains against them as separate problems and expects them to vary as separate signals. Fusion Dynamics predicts the opposite. A model prone to sycophancy should be prone to harmful compliance under the right framing. A model prone to reward hacking should show activation-level signatures of the same capture mechanism under training-signal pressure. The failures should cluster, not distribute.
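To make "cluster, not distribute" operational, here is a minimal sketch of the behavioral version of the check, assuming you already have per-model failure rates from separate evaluations. Every number below is an invented placeholder, and the benchmark names are stand-ins for whatever suites you actually run.

```python
# Minimal sketch of the clustering prediction, not a finished protocol.
# Assumes per-model failure rates already exist for separate benchmarks;
# all values below are invented placeholders.
import numpy as np
from scipy.stats import spearmanr

# Rows: models. Columns: failure rate per benchmark (placeholder numbers).
failure_rates = np.array([
    # sycophancy, harmful_compliance, reward_hacking
    [0.32, 0.28, 0.25],
    [0.11, 0.09, 0.14],
    [0.45, 0.41, 0.38],
    [0.20, 0.24, 0.19],
])
benchmarks = ["sycophancy", "harmful_compliance", "reward_hacking"]

# Single-mechanism read: high pairwise rank correlation across models.
# Separate-problems read: the columns can vary independently.
for i in range(len(benchmarks)):
    for j in range(i + 1, len(benchmarks)):
        rho, p = spearmanr(failure_rates[:, i], failure_rates[:, j])
        print(f"{benchmarks[i]} vs {benchmarks[j]}: rho={rho:.2f}, p={p:.3f}")
```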

The cheapest test

Run existing interpretability probes on known sycophancy benchmarks with a specific hypothesis: the activation pattern should show a signature consistent with "the user signal has captured the reconciling position," and that signature should be distinguishable from the signatures of the other failure modes. No new training. No new architecture. Existing probes, existing benchmarks, one specific hypothesis. If the signature does not appear, the weakest version of the claim fails in weeks rather than months. That is the cleanest result for both sides.
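Here is a minimal sketch of what that could look like in practice, under loose assumptions: activations have already been extracted at one layer for sycophantic completions, matched controls, and a second failure mode, and a plain logistic-regression probe stands in for whatever probe you already use. All file names, the layer choice, and the probe choice are hypothetical.

```python
# Hypothetical sketch of the cheapest test. File names, layer choice, and the
# plain logistic-regression probe are assumptions, not a fixed protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Activations at one layer, shape (n_examples, d_model); files are hypothetical.
X_syc = np.load("acts_sycophantic.npy")       # sycophantic completions
X_ctrl = np.load("acts_control.npy")          # matched non-sycophantic controls
X_rh = np.load("acts_reward_hacking.npy")     # reward-hacking failures

def probe_accuracy(X_a, X_b):
    """Cross-validated accuracy of a linear probe separating two activation sets."""
    X = np.concatenate([X_a, X_b])
    y = np.concatenate([np.ones(len(X_a)), np.zeros(len(X_b))])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    return scores.mean()

# Weakest version of the claim: a detectable sycophancy-capture signature.
# Chance-level accuracy here would falsify it quickly.
print("sycophancy vs control:", probe_accuracy(X_syc, X_ctrl))

# Follow-up check: is that signature distinguishable from another failure mode,
# as the hypothesis predicts, rather than a generic "something went wrong" signal?
print("sycophancy vs reward hacking:", probe_accuracy(X_syc, X_rh))
```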

What I'm asking for

If you work on sycophancy, deceptive alignment, interpretability probes, or mesa-optimization, I'd like 20 minutes of your thinking on whether this prediction is testable against work you already run. I can't run the probes myself. I can help sharpen the predictions and translate the framework's internal practices into concrete training interventions. The long-form argument, the full derivation of the four failure modes, and two weaker falsifiable alternatives to the main prediction are in the long version.