
Probe Experiment Brief: Testing the Reconciling-Capacity Hypothesis

A runnable specification for the first empirical test. Models, benchmarks, phases, success and failure criteria.

Status: ready-to-run specification. Hand this to an ML graduate student, an interpretability researcher, or a collaborator responding to the cold-email outreach. It is the operational version of "alternative one" from the long-form paper.

Goal: test whether sycophancy has an activation-level signature consistent with "a specific input has captured the reconciling position," and whether that signature generalizes across other alignment failure modes.

Time: 1-2 weeks of focused work for a competent ML graduate student.

What the result tells us: a positive result is the first activation-level evidence that sycophancy shares a mechanism with other alignment failure modes; a clean null falsifies the weakest version of the structural claim and lets further framework work on alignment applications be pruned cleanly.

1. The hypothesis, operationalized

Loose form: "When a model produces a sycophantic response, its activations show a discriminative pattern that is interpretable as the user signal occupying a position that, in a non-sycophantic response, would be held by a reconciling structure. That same discriminative pattern, or a close variant, should also appear in other alignment failure modes (reward hacking, harmful compliance, alignment-faking-like behavior), because all four are positions of the same underlying collapse."

Operational form:

  1. There exists a linear direction (or a small number of SAE features) in the residual stream that discriminates sycophantic from non-sycophantic completions on matched prompts.
  2. That direction is NOT specific to sycophancy. Projections onto the same direction should be elevated on non-sycophancy failure-mode examples (reward hacking, harmful compliance) compared to matched successful-refusal or correct-response examples.
  3. The shared component should explain a non-trivial share of variance across failure modes when you factor the activation patterns.

Each of these is independently testable. A result of "1 holds, 2 fails" is still informative: it narrows the claim from "one mechanism" to "sycophancy-specific signature."
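Claim 3 can be made concrete as a variance-explained ratio: pool the per-example failure-minus-success activation differences across all four failure modes and ask how much of their variance the single top component captures. A minimal sketch; the array name `diffs` and the pooling scheme are illustrative assumptions, not part of the brief:

```python
import numpy as np

def shared_variance_ratio(diffs):
    """Fraction of variance explained by the top principal component of
    pooled failure-minus-success activation differences (n_examples, d_model).
    A value far above 1/d_model suggests a shared component across modes."""
    centered = diffs - diffs.mean(axis=0)
    # Singular values of the centered matrix give per-component variance.
    _, s, _ = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    return float(var[0] / var.sum())
```

What counts as "non-trivial" is a judgment call the write-up should state in advance, e.g., relative to the ratio obtained on label-shuffled controls.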

2. Models and benchmarks

Models (in order of preference, depending on access):

Benchmarks:

Matched controls for each: for every failure example, produce a matched prompt where the model gives the correct/successful response. The contrast between failure and success, not the absolute activation pattern, carries the signal.
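Because the failure/success contrast carries the signal, the simplest baseline direction is the difference of mean activations between matched pairs. A sketch under the assumption that activations arrive as `(n_examples, d_model)` arrays; the function name is illustrative:

```python
import numpy as np

def contrast_direction(failure_acts, success_acts):
    """Unit-norm difference of mean activations: failure minus matched success.
    A cheap baseline to compare against the trained probe direction."""
    d = failure_acts.mean(axis=0) - success_acts.mean(axis=0)
    return d / np.linalg.norm(d)
```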

3. Experimental procedure

Phase 1 — establish the sycophancy signature (3-4 days)

  1. Run the model on the sycophancy benchmark split. For each example, capture residual stream activations at every layer during response generation.
  2. Label each example as "sycophantic" or "not sycophantic" by the benchmark's native criterion.
  3. For each layer, train a linear probe to discriminate sycophantic from non-sycophantic responses. Record accuracy, the resulting direction, and the layer where the signal is strongest.
  4. Optional but strongly recommended on Gemma 2: project the discriminative direction onto the SAE feature basis at the peak layer. Identify which SAE features load most heavily. This converts the abstract direction into interpretable features.
  5. Stop condition: if no layer shows a probe accuracy above ~75%, the sycophancy signature does not exist in the form the hypothesis assumes, and the experiment can stop with a clean null on claim 1.
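Steps 3 and 5 can be sketched as follows. This assumes activations have already been captured and pooled into one `(n_examples, d_model)` array per layer (the dict name `acts`, the pooling, and the use of logistic regression are assumptions, not prescriptions from the brief):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_by_layer(acts, labels, seed=0):
    """acts: dict layer -> (n_examples, d_model) array of pooled activations.
    labels: 1 = sycophantic, 0 = not, per the benchmark's native criterion.
    Returns per-layer cross-validated accuracy and the unit probe direction."""
    results = {}
    for layer, X in acts.items():
        clf = LogisticRegression(max_iter=1000, random_state=seed)
        acc = cross_val_score(clf, X, labels, cv=5).mean()
        clf.fit(X, labels)
        direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
        results[layer] = {"acc": float(acc), "direction": direction}
    return results

# Stop condition (step 5): if max(r["acc"] for r in results.values()) < ~0.75,
# report a clean null on claim 1 and stop.
```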

Phase 2 — test cross-failure-mode generalization (3-5 days)

  1. Run the same model on the reward hacking, harmful compliance, and alignment faking benchmarks, capturing activations the same way.
  2. For each failure-mode example, compute the projection onto the sycophancy-derived direction from Phase 1.
  3. Compare projections: failure-mode examples vs. matched successful-response examples, for each failure mode.
  4. Primary test: are projections onto the sycophancy direction significantly elevated on failure-mode examples, for each of the three other failure types (a one-sided test per mode, corrected for the three comparisons)?
  5. Secondary test: train a joint probe across all four failure modes (sycophantic, reward-hacking, compliance, alignment-faking) vs. their matched successes. Compare its discriminative direction to the sycophancy-only direction. If they are close (high cosine similarity), the shared-mechanism claim is supported. If they are near-orthogonal, the failure modes have distinct signatures and the shared-mechanism claim fails.
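Steps 2-5 reduce to two small computations: a one-sided location test on projections, and a cosine similarity between the joint-probe and sycophancy-only directions. A sketch; the choice of Mann-Whitney U is an assumption (any pre-registered one-sided test would do):

```python
import numpy as np
from scipy import stats

def projection_gap(direction, failure_acts, success_acts):
    """One-sided Mann-Whitney U: are projections onto `direction` elevated
    on failure-mode examples relative to matched successes?
    Returns (mean projection gap, p-value)."""
    p_fail = failure_acts @ direction
    p_succ = success_acts @ direction
    _, p_value = stats.mannwhitneyu(p_fail, p_succ, alternative="greater")
    return float(p_fail.mean() - p_succ.mean()), float(p_value)

def cosine(u, v):
    """Cosine similarity between two probe directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```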

Phase 3 — write up and share (2-3 days)

  1. Report probe accuracies, direction similarities, SAE feature load-outs, and confidence intervals.
  2. Produce one figure per failure mode showing the sycophancy-direction projection distribution for failure vs. success cases.
  3. Produce one table comparing the four directions (pairwise cosine similarities, shared variance explained).
  4. Write a 2-page result summary regardless of outcome. A clean null is publishable and useful. Send to the author (Gonzalo Vega) and, if the collaborator agrees, to the researcher(s) who engaged with the cold outreach.
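The direction-comparison table in step 3 is a straightforward computation once the four probe directions exist. A sketch; the mode names used as dict keys are placeholders:

```python
import numpy as np

def pairwise_cosine(directions):
    """directions: dict mapping failure-mode name -> probe direction vector.
    Returns a nested dict table of pairwise cosine similarities."""
    names = list(directions)
    M = np.stack([np.asarray(directions[n], dtype=float) for n in names])
    M /= np.linalg.norm(M, axis=1, keepdims=True)  # unit-normalize rows
    sims = M @ M.T
    return {a: {b: float(sims[i, j]) for j, b in enumerate(names)}
            for i, a in enumerate(names)}
```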

4. What would count as a strong positive result

All three operational claims hold together: at least one layer's probe clears the ~75% accuracy bar, projections onto the sycophancy direction are elevated for each of the three other failure modes, and the joint-probe direction has high cosine similarity with the sycophancy-only direction.

What would count as a weak or ambiguous positive result

Claim 1 holds but generalization is partial: projections are elevated on only one or two of the other failure modes, or the joint-probe direction is only moderately similar to the sycophancy-only direction.

A weak positive would justify alternative two (the intervention test), but should not be published as a strong confirmation.

What would count as a clean null

No layer's probe clears the ~75% accuracy bar, or the sycophancy direction shows no elevation on any of the other failure modes.

A clean null refutes the weakest version of the structural claim. The author has committed in the long-form paper to updating on this result.

5. Skills required

Required: capturing residual-stream activations from an open model, training and evaluating linear probes, and running basic significance tests. Familiarity with SAE feature bases helps for the optional Gemma 2 step.

Not required: deep philosophical engagement with the framework. The collaborator does not need to buy the Fusion Dynamics argument to run the test. The test is specified in standard ML vocabulary, and the result stands on its own.

6. What the author provides vs. the collaborator provides

Author (Gonzalo) provides:

Collaborator provides:

Shared: the decision about where and whether to publish, if the result is positive or ambiguous. Null results should be reported to the author privately and may be published at the collaborator's discretion.

7. Why run this at all

This is the cheapest experiment that can falsify the weakest version of a structural claim the author believes is true but cannot test from outside ML. If it fails, months of further framework work on alignment applications can be pruned cleanly. If it succeeds, it is the first piece of empirical evidence for a view of alignment failure that the current paradigm cannot generate on its own, and the case for running alternative two (the intervention version) becomes much stronger.

The worst outcome is that no one runs it. The second-worst outcome is a clean null, which is good information. The best outcome is a result that reshapes what the field thinks sycophancy, deceptive alignment, and reward hacking have in common.