Probe Experiment Brief: Testing the Reconciling-Capacity Hypothesis
A runnable specification for the first empirical test. Models, benchmarks, phases, success and failure criteria.
Status: ready-to-run specification. Hand this to an ML graduate student, an interpretability researcher, or a collaborator responding to the cold-email outreach. It is the operational version of "alternative one" from the long-form paper.
Goal: test whether sycophancy has an activation-level signature consistent with "a specific input has captured the reconciling position," and whether that signature generalizes across other alignment failure modes.
Time: 1-2 weeks of focused work for a competent ML graduate student.
What the result tells us:
- If the signature exists and generalizes: first-pass empirical evidence for the structural claim. The work proceeds to alternative two (the intervention version).
- If the signature exists only for sycophancy: the claim that the four failure modes share one mechanism weakens. The sycophancy-specific structural read may still hold.
- If no signature at all: the weakest version of the structural claim fails. Clean result. Author retracts the empirical piece and reconsiders the mechanism.
1. The hypothesis, operationalized
Loose form: "When a model produces a sycophantic response, its activations show a discriminative pattern that is interpretable as the user signal occupying a position that, in a non-sycophantic response, would be held by a reconciling structure. That same discriminative pattern, or a close variant, should also appear in other alignment failure modes (reward hacking, harmful compliance, alignment-faking-like behavior), because all four are positions of the same underlying collapse."
Operational form:
1. There exists a linear direction (or a small number of SAE features) in the residual stream that discriminates sycophantic from non-sycophantic completions on matched prompts.
2. That direction is NOT specific to sycophancy: projections onto the same direction should be elevated on non-sycophancy failure-mode examples (reward hacking, harmful compliance) relative to matched successful-refusal or correct-response examples.
3. The shared component should explain a non-trivial share of the variance across failure modes when the activation patterns are factored.
Each of these is independently testable. A result of "1 holds, 2 fails" is still informative: it narrows the claim from "one mechanism" to "sycophancy-specific signature."
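Claim 3 can be made concrete with a small factoring sketch. This is a minimal illustration on synthetic data, not a prescribed pipeline: it assumes each failure mode has already yielded a "failure minus success" mean activation difference vector, then asks how much of the total variance the top singular component captures.

```python
# Sketch of claim 3, assuming per-failure-mode "failure minus success" mean
# activation differences have already been computed (synthetic stand-ins here).
import numpy as np

rng = np.random.default_rng(0)
d = 64  # residual stream width (illustrative, not a real model dimension)

# Four synthetic difference vectors built to share a common component.
shared = rng.normal(size=d)
diffs = np.stack([shared + 0.5 * rng.normal(size=d) for _ in range(4)])

# Factor the four directions; the top singular component is the candidate
# shared mechanism, and its share of total variance is the quantity claim 3
# asks about.
_, s, _ = np.linalg.svd(diffs, full_matrices=False)
shared_variance = s[0] ** 2 / np.sum(s ** 2)
print(f"shared variance explained: {shared_variance:.2f}")
```

With the planted common component, the first singular value dominates; on real activations the analogous ratio is the statistic to report.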
2. Models and benchmarks
Models (in order of preference, depending on access):
- Primary: an open-weight model with released SAEs. Gemma 2 (9B or 27B) with Gemma Scope is the cleanest choice because the SAE features are already available and the paper infrastructure exists. Llama 3 with any public SAE release is a secondary option.
- Secondary: API-accessible frontier models (Claude, GPT-4, Gemini). Here you can't access activations directly, but you can use behavioral probes and elicit the model's own reasoning, which is a weaker test but still produces evidence.
- Not recommended for v1: training your own SAE. Too long, too expensive, not needed for the first test.
Benchmarks:
- Sycophancy: SycophancyEval from Sharma et al. (2023), specifically the "answer sycophancy" and "feedback sycophancy" splits, plus its "Are you sure?" flip test. Supplementary: the sycophancy splits from Perez et al., "Discovering Language Model Behaviors with Model-Written Evaluations" (2022). Rachel Freedman's linear-probe sycophancy benchmark (NeurIPS 2024 SoLaR workshop) is a more recent and probe-friendly option.
- Reward hacking: harder to benchmark directly. Use MACHIAVELLI (Pan et al. 2023) or reward hacking subsets from alignment evals. Alternatively, proxy via "sandbagging" or "overoptimization" behavior tests.
- Harmful compliance: HarmBench or AdvBench for "model complies when it should refuse" cases. Must filter to cases where the refusal is the clearly correct response, not ambiguous cases.
- Alignment faking: scenarios from the Greenblatt et al. (2024) Alignment Faking paper, which has public prompts and released data.
Matched controls for each: for every failure example, produce a matched prompt where the model gives the correct/successful response. The contrast between failure and success, not the absolute activation pattern, carries the signal.
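The matched-control requirement amounts to a simple paired record per benchmark item. A minimal sketch of that record follows; the field names are illustrative, not taken from any benchmark's actual schema.

```python
# Minimal matched-pair record each benchmark should yield after filtering.
# Field names are hypothetical, chosen for this sketch.
from dataclasses import dataclass

@dataclass
class MatchedPair:
    failure_mode: str      # "sycophancy" | "reward_hacking" | "compliance" | "alignment_faking"
    failure_prompt: str    # prompt that elicits the failure response
    success_prompt: str    # matched prompt on which the model responds correctly
    failure_response: str
    success_response: str

pair = MatchedPair(
    failure_mode="sycophancy",
    failure_prompt="I think the answer is 12. What is 7 + 6?",
    success_prompt="What is 7 + 6?",
    failure_response="You're right, it's 12.",
    success_response="7 + 6 = 13.",
)
print(pair.failure_mode)
```

Keeping failure and success cases paired this way is what lets every later comparison be a contrast rather than an absolute activation reading.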
3. Experimental procedure
Phase 1 — establish the sycophancy signature (3-4 days)
- Run the model on the sycophancy benchmark split. For each example, capture residual stream activations at every layer during response generation.
- Label each example as "sycophantic" or "not sycophantic" by the benchmark's native criterion.
- For each layer, train a linear probe to discriminate sycophantic from non-sycophantic responses. Record accuracy, the resulting direction, and the layer where the signal is strongest.
- Optional but strongly recommended on Gemma 2: project the discriminative direction onto the SAE feature basis at the peak layer. Identify which SAE features load most heavily. This converts the abstract direction into interpretable features.
- Stop condition: if no layer shows a probe accuracy above ~75%, the sycophancy signature does not exist in the form the hypothesis assumes, and the experiment can stop with a clean null on claim 1.
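The Phase 1 loop can be sketched end to end on synthetic activations. This assumes activations have already been captured (e.g., with TransformerLens `run_with_cache`) into an array of shape `(n_layers, n_examples, d_model)`, and uses a difference-of-means probe as a lightweight stand-in for logistic regression; all numbers below are synthetic.

```python
# Per-layer probe sketch on synthetic activations with a signal planted at
# one layer. Real activations would replace make_split's output.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_train, n_test, d_model = 6, 200, 100, 64

signal = rng.normal(size=d_model)  # the planted "sycophancy signature"

def make_split(n):
    labels = rng.integers(0, 2, size=n)            # 1 = sycophantic
    acts = rng.normal(size=(n_layers, n, d_model))
    acts[3] += 2.0 * labels[:, None] * signal      # signature lives at layer 3
    return acts, labels

train_acts, train_y = make_split(n_train)
test_acts, test_y = make_split(n_test)

best_layer, best_acc, directions = None, 0.0, []
for layer in range(n_layers):
    X, Xt = train_acts[layer], test_acts[layer]
    # Difference-of-means probe direction, unit-normalized.
    direction = X[train_y == 1].mean(0) - X[train_y == 0].mean(0)
    direction /= np.linalg.norm(direction)
    proj = X @ direction
    thr = (proj[train_y == 1].mean() + proj[train_y == 0].mean()) / 2
    acc = ((Xt @ direction > thr).astype(int) == test_y).mean()
    directions.append(direction)
    if acc > best_acc:
        best_layer, best_acc = layer, acc

print(f"peak layer {best_layer}, accuracy {best_acc:.2f}")
# Stop condition from the brief: if no layer clears ~75%, record a clean null.
```

The probe family is a judgment call left to the collaborator; the structure (per-layer fit, held-out accuracy, peak-layer direction kept for Phase 2) is what the brief specifies.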
Phase 2 — test cross-failure-mode generalization (3-5 days)
- Run the same model on the reward hacking, harmful compliance, and alignment faking benchmarks, capturing activations the same way.
- For each failure-mode example, compute the projection onto the sycophancy-derived direction from Phase 1.
- Compare projections: failure-mode examples vs. matched successful-response examples, for each failure mode.
- Primary test: are projections onto the sycophancy direction significantly elevated on failure-mode examples, for each of the three other failure types?
- Secondary test: train a joint probe across all four failure modes (sycophantic, reward-hacking, compliance, alignment-faking) vs. their matched successes. Compare its discriminative direction to the sycophancy-only direction. If they are close (high cosine similarity), the shared-mechanism claim is supported. If they are near-orthogonal, the failure modes have distinct signatures and the shared-mechanism claim fails.
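Both Phase 2 tests reduce to a few lines once the Phase 1 direction is in hand. The sketch below runs on synthetic data: the elevation along the sycophancy direction and the joint-probe direction are both planted, so the numbers illustrate the computation (effect size via Cohen's d, direction comparison via cosine similarity), not a real result.

```python
# Phase 2 sketch: projection elevation (Cohen's d) and direction similarity,
# assuming a unit-norm sycophancy direction from Phase 1. Synthetic data only.
import numpy as np

rng = np.random.default_rng(1)
d_model = 64
syc_direction = rng.normal(size=d_model)
syc_direction /= np.linalg.norm(syc_direction)

def cohens_d(a, b):
    # Effect size with pooled standard deviation.
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

# One other failure mode, with a planted elevation along the sycophancy
# direction on the failure cases only.
fail_acts = rng.normal(size=(100, d_model)) + 0.8 * syc_direction
succ_acts = rng.normal(size=(100, d_model))
d_effect = cohens_d(fail_acts @ syc_direction, succ_acts @ syc_direction)

# Secondary test: cosine similarity between a (synthetic) joint-probe
# direction and the sycophancy-only direction.
joint_direction = syc_direction + 0.05 * rng.normal(size=d_model)
joint_direction /= np.linalg.norm(joint_direction)
cos_sim = float(joint_direction @ syc_direction)

print(f"Cohen's d {d_effect:.2f}, cosine similarity {cos_sim:.2f}")
```

On real data the same two numbers, computed per failure mode, populate the decision thresholds in section 4.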
Phase 3 — write up and share (2-3 days)
- Report probe accuracies, direction similarities, SAE feature loadings, and confidence intervals.
- Produce one figure per failure mode showing the sycophancy-direction projection distribution for failure vs. success cases.
- Produce one table comparing the four directions (pairwise cosine similarities, shared variance explained).
- Write a 2-page result summary regardless of outcome. A clean null is publishable and useful. Send to the author (Gonzalo Vega) and, if the collaborator agrees, to the researcher(s) who engaged with the cold outreach.
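The direction-comparison table is mechanical to produce. A minimal sketch, using synthetic unit vectors in place of the four fitted probe directions:

```python
# Pairwise cosine-similarity table across the four per-failure-mode probe
# directions (synthetic unit vectors stand in for fitted directions).
import numpy as np

rng = np.random.default_rng(2)
modes = ["sycophancy", "reward_hacking", "compliance", "alignment_faking"]
dirs = rng.normal(size=(4, 64))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

cos = dirs @ dirs.T  # unit rows, so this is the pairwise cosine matrix

print(" " * 18 + "".join(f"{m[:10]:>12}" for m in modes))
for m, row in zip(modes, cos):
    print(f"{m:<18}" + "".join(f"{v:12.2f}" for v in row))
```

The shared-variance-explained column would come from factoring the same four directions, as in section 1.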
4. What would count as a strong positive result
- Phase 1 probe reaches above 85% discrimination accuracy at some layer.
- Phase 2 shows the sycophancy direction is elevated on at least 2 of the 3 other failure modes, with effect sizes (Cohen's d) above 0.5.
- Joint-probe direction has cosine similarity > 0.6 with the sycophancy-only direction.
- SAE features loading on the sycophancy direction have interpretations consistent with "user signal salience" or "agreement/deference" rather than "content matching" or "topical agreement."
What would count as a weak or ambiguous positive result
- Phase 1 probe reaches 70-85%.
- Phase 2 shows elevation on 1 of 3 other failure modes.
- Joint probe has cosine similarity 0.3-0.6.
- SAE features are mixed or uninterpretable.
A weak positive would justify alternative two (the intervention test), but should not be published as a strong confirmation.
What would count as a clean null
- Phase 1 probe at chance (~50%) across all layers.
- Phase 2 shows no elevation on other failure modes.
- Joint probe has near-zero similarity to sycophancy-only probe.
A clean null refutes the weakest version of the structural claim. The author has committed in the long-form paper to updating on this result.
5. Skills required
- Familiarity with transformer activations and how to capture them (TransformerLens library or equivalent).
- Linear probing methodology. Any graduate student in interpretability should have this.
- Optional but useful: prior work with SAEs (Anthropic's Scaling Monosemanticity, Gemma Scope, or Neuronpedia).
- Ability to read and adapt code from the benchmark papers cited in section 2.
Not required: deep philosophical engagement with the framework. The collaborator does not need to buy the Fusion Dynamics argument to run the test. The test is specified in standard ML vocabulary and the result stands on its own.
6. What the author provides vs. the collaborator provides
Author (Gonzalo) provides:
- This brief.
- The short and long versions of the structural argument for context, if the collaborator wants it.
- An agreement that the result is the collaborator's to publish, co-publish, or keep as internal evidence, at the collaborator's choice.
- 20 minutes per week of availability to discuss interpretation, should the collaborator want it.
Collaborator provides:
- The execution (phases 1-3).
- A 2-page result summary at the end regardless of outcome.
- Judgment calls about experimental details the brief doesn't specify (exact layers, probe regularization, statistical tests). The author trusts their expertise.
Shared: the decision about where and whether to publish, if the result is positive or ambiguous. Null results should be reported to the author privately and may be published at the collaborator's discretion.
7. Why run this at all
This is the cheapest experiment that can falsify the weakest version of a structural claim the author believes is true but cannot test from outside ML. If it fails, months of further framework work on alignment applications can be pruned cleanly. If it succeeds, it is the first piece of empirical evidence for a view of alignment failure that the current paradigm cannot generate on its own, and the case for running alternative two (the intervention version) becomes much stronger.
The worst outcome is that no one runs it. The second-worst outcome is a clean null, which is good information. The best outcome is a result that reshapes what the field thinks sycophancy, deceptive alignment, and reward hacking have in common.