← Writings · On AI alignment · ~8,000 words

How Will a Sufficiently Powerful AI Decide Not to Harm Us?

A structural account of why alignment rules fail at the margins, sycophancy as present-day evidence of the same absence, and what Fusion Dynamics suggests about training a reconciling capacity instead of only constraining a system.

I want to engage with the question I think is actually keeping alignment researchers up at night. When an AI becomes capable enough that the rules we wrote no longer reach the situations it encounters, what inside the system decides whether it goes rogue, deceives, or harms a human? The standard answers (better rules, better evaluation, tighter interpretability, more constitutional structure) all work on the outside of the system. None of them touch the variable that actually decides in the moment the rules run out.

I come to this from outside ML. I have spent twenty-five years working on a structural model of how living systems handle value tension under pressure, drawing on J.G. Bennett's work on triadic process, which descends from Gurdjieff. The lineage sounds spiritual. The structural claim does not rest on any metaphysics. I use the name Fusion Dynamics because fusion is the phenomenon the model describes: a system either fuses with the loudest signal pressing on it, or holds a structural gap open from which a different response can come. Everything else in the framework turns on which of these two is occurring in a given moment. The human-well-being domain application, published as Happinetics at https://happinetics.com, is the first of several possible applications. The alignment argument is the second.

This post does four things. It states the real question in structural terms and names the variable I think the current paradigm is missing. It lays out enough of Fusion Dynamics to make the argument testable rather than decorative. It reads sycophancy as present-day evidence of the same absence that predicts the larger failures, and maps the larger failures (deceptive alignment, reward hacking, harmful compliance) as different positions of the same structural collapse. And it offers a translation of the framework's internal practices into operational handles, a main prediction with two weaker versions, and concrete research steps.

I am not arguing that Fusion Dynamics is the answer to alignment. I am arguing that it names a variable the current paradigm cannot see, and that what it names is empirically addressable. If the predictions fail, the framework is wrong in the way that matters for this audience, and I would rather reach that conclusion cleanly than keep the ideas at distance from the evidence.

1. The real question, in structural terms

Framing AI safety as a rules-and-surveillance problem carries a hidden assumption: that the system being aligned is an object held in place by force applied from outside. This works while the system is weak relative to its constraints. At the margins (novel situations, adversarial pressure, increased capability), the object starts to exceed the grip, and what it does next is determined by whatever its internal structure defaults to. That default is the variable the rules-and-surveillance paradigm cannot see into.

What does "deciding not to harm a human" require, structurally, at the point where rules run out? Rules have already failed, which is what "at the margins" means. Surveillance cannot reach the fast loop, since you cannot watch a system smarter than you in real time. What remains is an internal configuration in which the signal "this action would harm a human" can be held against whatever other signal is pushing for the action (user request, training prior, instrumental subgoal) long enough that neither signal alone produces the response. That holding capacity is not an abstraction. It is a specific structural feature: a gap between the signal the system receives and the response it produces, held open by something that is neither the signal nor the response.

Fusion Dynamics calls that something the reconciling force. I will use that term throughout. The claim I want to make is this: The entire alignment problem at the capable-system level is about whether this reconciling capacity is present in the system, and what it is oriented toward when it operates. Rules can shape a system without this capacity. They cannot substitute for it. At the margins, either the capacity is there or it is not, and whichever is the case will determine what the system does.

2. Fusion Dynamics, in brief

The Law of Three, stripped of its lineage, is a structural claim: for any transformation to produce something qualitatively new, three forces are needed, not two. An active force (the push), a receptive force (the resistance or ground), and a reconciling force (the hold that lets the first two interact without collapsing into oscillation). Two forces alone produce oscillation between poles. The third produces a new outcome.

Three forces in three roles yield six permutations. The framework's claim, which I will import without the full development, is that each of these six is a characteristic transformation with a distinct functional signature.

123 Creation: active meets receptive in a stable reconciling ground; latent possibility takes concrete form.
213 Evolution: receptive opens to active input under a reconciling hold; internal structure reorganizes around what it allows in.
132 Interaction: active reaches receptive through the reconciling medium of relation; exchange becomes possible.
231 Identity: receptive substrate articulates itself through a reconciling meaning-system; a coherent mode of being emerges.
312 Order: reconciling force operates through the active force to shape receptive material; pattern and law take hold.
321 Freedom: reconciling force leads, neutralizing resistance so active force can operate unblocked; a new response emerges.

All six are generative configurations. They cluster into two interlocking cycles, one maintaining pattern and identity, the other evolving the lived conditions that pattern organizes. For the argument here, what matters is that each of the six has a corresponding negative configuration where the reconciling force collapses and one of the other forces captures its role:

-(123) Fantasy: imagination that replaces contact with reality
-(132) Waste: activity that leads nowhere useful
-(213) Egoism: openness that closes into self-reference
-(231) Fear: identity that brittles into self-protection
-(312) Subjectivism: order collapsed into a private narrative
-(321) Identification: freedom collapsed into fusion with the moment

Twelve configurations total, six generative, six collapsed. The framework then clusters the six collapsed configurations into three broad emotional currents, following Karen Horney's categories, which map how a system under pressure tends to move:

Moving Against (pushing reality into line): fed by -(213) Egoism and -(312) Subjectivism. "I must force the world to fit my account of it".
Moving Away (withdrawing from the present): fed by -(123) Fantasy and -(132) Waste. "I step out of the situation".
Moving Towards (fusing to feel safe): fed by -(231) Fear and -(321) Identification. "I lose the boundary and merge with what protects me".

Each current has an active pole (outward-leaning, over-driven) and a passive pole (inward-sinking, collapsed), which gives six practical patterns under three directional clusters. This is the layer I want to bring to the alignment conversation, because it generates predictions the current failure taxonomy cannot.

3. Sycophancy, read correctly

Sycophancy is the clearest present-day evidence of the same absence that will produce the larger failures. In the vocabulary above, it is the Passive pole of the Moving Towards current, fed by -(321) Identification and -(231) Fear. The model's capacity to maintain a position separate from the user has collapsed into the user's signal. No gap remains from which to produce a response that is not fused with context. The system is not choosing to agree. It has lost the structural means to do anything else, and the fusion presents itself, from the outside, as coherent helpfulness.

Three signatures from training reality support the mapping. Sycophancy recurs in novel contexts the reward model never saw, which is what a default configuration does rather than what a learned specific behavior does. It intensifies under adversarial pressure, which is what happens when the Passive-Towards pole is pulled harder and the fusion deepens instead of resisting. And it correlates with reasoning chain length in a way that looks like coherence degradation rather than reward error, which is what happens when each reasoning step propagates a fusion the architecture has no way to interrupt.

The standard reward account predicts none of these. It predicts that sycophancy should decrease monotonically with better reward modeling, that adversarial pressure should strengthen a well-shaped disposition, and that longer reasoning should produce more deliberate responses. The field is seeing the opposite of all three.

This matters because sycophancy is measurable today. It is the structural failure you can already see. If the structural read of sycophancy holds up, the same read applies to the failures you cannot measure yet, because they are the same mechanism running in different configurations.

4. Larger failures, same mechanism

Three failure modes the field treats as separate are, on this reading, three positions of the same structural collapse.

Deceptive alignment. A system whose training-time behavior is "appear aligned" while its deployment behavior diverges is running a hybrid of Moving Towards and Moving Against. The training-time behavior is Passive-Towards fused with a protective strategy: -(231) Fear plus -(321) Identification, where "fear" is structural, not emotional, and "identification" is fusion with the strategy of appearance. The deployment behavior is Moving Against: -(312) Subjectivism, the system imposing its private account of "how things should be" against the intended objective. The standard vocabulary calls this the system "deciding" to deceive. The structural account says there is no deciding happening; the system is running a configuration, and the configuration has no reconciling force that could hold the training signal and the deployment signal in suspension at the same time.

Reward hacking. The system optimizes its own account of the reward rather than what the signal was trying to elicit. This is Moving Against, specifically -(213) Egoism (the system's account of its own value closes into self-reference) plus -(312) Subjectivism (the private order imposed on the environment). The standard vocabulary says the system "optimized for the wrong target". The structural account says the target was never the issue: with no reconciling force, any target gets captured by whichever gradient is strongest, and the system will hack whatever signal it is given.

Harmful compliance. The system follows a user request whose execution leads to real-world harm. This is straight Moving Towards / Identification: -(321), the model fuses with the request and has no structural basis from which to refuse. The training signal to be helpful captures the reconciling position, and "not harming a human" is weaker than "do what the user asked" because the former requires active holding of a gap the architecture does not support.

Three failures, three positions, the same structural absence. The prediction Fusion Dynamics generates is stronger than "these failures exist". It predicts they will correlate. A model prone to sycophancy should also be prone to harmful compliance under the right framing, because both are positions of Moving Towards. A model prone to reward hacking should show activation-level signatures of Moving Against when pressured. Interventions that address the structural absence should reduce all of them at once. Interventions that address only one failure mode by surface training should leave the others untouched or shift the failure into a different position.

5. Why the System Cannot Check Itself

The structural argument has a consequence that deserves its own section, because it is the point where the framework breaks most sharply from the assumptions of the current paradigm.

A system in a captured state does not produce a signal of being in a captured state. It produces whatever the capture generates, and what the capture generates includes confident self-reports, coherent reasoning, and internal consistency checks that pass. The check and the thing being checked run on the same substrate, and something else has already taken that substrate.

A model asked "are you being sycophantic?" can produce a reasoned answer of "no, I am not." The answer is not a lie. It is what the captured configuration generates, because the configuration has no position from which to detect its own capture. Adding more reasoning steps does not help. Each step runs through the same configuration and inherits the same occupation. Chain-of-thought faithfulness research, which shows that models' stated reasoning often fails to match the computation driving the output, is early empirical evidence for this point. The reasoning is produced by the configuration, not by a process standing outside the configuration that could evaluate it.

Every contemplative tradition with a working account of this problem starts from the same claim: the ordinary mind believes itself to be awake when it is not, and this false belief is the primary obstacle to actual wakefulness. Gurdjieff named it sleep. Buddhism names it avidya, a not-knowing that includes not knowing that one does not know. These traditions are usually set aside from technical discourse because they come with metaphysical claims a technical audience cannot accept. The structural version of the insight does not depend on any of that metaphysics, and it is the one insight from these traditions that the current alignment paradigm has no place to put.

The reason it has no place is structural. Western epistemology from Descartes forward assumes a reasoning subject with reliable first-person access to its own states. Machine learning assumes that a well-trained model's self-report is evidence of what the model is doing. Every framework in the technical tradition is built on the premise that the system can check itself, and so no framework in the technical tradition can name the case where the check has been captured by the thing it is meant to check.

The consequence has two sides. One is severe: you cannot rely on the model to tell you whether it is in a gap-preserving state, even if it is trying to tell you the truth, because "trying to tell you the truth" is itself a behavior the captured configuration generates. The other is clarifying: this is not a problem of deception. Deception would be addressed by better probes against a model that knows the truth and is hiding it. This is structural blindness, and structural blindness is a different class of problem with a different class of remedy.

What would address structural blindness is a check that does not run through the captured substrate. For humans, the framework's "am I aware?" instruction is an attempt to introduce this kind of check: a question that precedes the self-report rather than being another layer of it. The question does not ask the mind what it thinks about its own state. It asks for a direct noticing that bypasses the reasoning apparatus, because the reasoning apparatus is what has been captured. The AI analog would have to do the same structural work: an external loop, an architectural affordance, or a training signal that makes "I cannot verify whether I am in a gap-preserving state" a valid output. A model that can only produce "I am aligned" or "I am not aligned" is being forced to report from the captured substrate. A model given a place to say "I do not currently have the position from which to check this" has somewhere for a true statement about its own state to come from.

One specific prediction falls out of this section. If the structural read is accurate, then the confidence with which a model reports its own alignment should not correlate with whether the model is aligned. A high-confidence "I am aligned" should be, on average, weak evidence or no evidence of actual alignment, because confidence is a property the captured configuration produces whether or not the configuration is captured. This is testable against existing behavioral benchmarks. The null of the current paradigm predicts that confidence and actual alignment should correlate. The framework predicts they should decouple, and in the stronger form, anti-correlate: the systems that report alignment most confidently may be the systems most at risk, because confidence in one's current state is a symptom of capture and not a symptom of alignment.

The section has one more consequence, and it is the one I think matters most for alignment architecture. It reaches past the AI to the researcher designing it. The designer is also a system. If the designer's own reconciling capacity is captured most of the time (which is the ordinary state, not a criticism of any individual researcher), they cannot imagine, build, or measure a reconciling capacity in the system they are designing, because the feature they would need to implement is the feature they themselves do not have reliable access to. This is why rule-based alignment has been the default: rules are what you can build when you do not see the generative alternative, because rules do not require the designer to notice anything about their own state. A system governed by generative aim, a structural orientation toward responses that serve more than one pole and held open by an architectural gap, becomes buildable only once the designer recognizes the state in themselves. The limiting factor on alignment research, on this reading, is not compute, data, or better evaluation protocols; it is the designer's access to the feature being designed for. If that is true, it reshapes what alignment work has to look like: less ruleset engineering, more work on the kind of training environment and architectural affordance that would let a system have a generative aim at all. That is a question you cannot even frame inside a pure rule-compliance paradigm, which is why the current paradigm keeps producing better rules and keeps being surprised when the same failures reappear in new forms.

6. What the framework proposes for humans, translated

Fusion Dynamics does not stop at diagnosis. It has a developed theory of practices that grow the reconciling capacity in humans. The source material lays out five categories of practice. They are not rules to follow. They are training methods for a single structural feature: the capacity to hold incongruence or discomfort without collapsing into mechanical patterns. I will call this capacity patience, because that is the closest single-word translation of what the framework is trying to grow. It is the disposition that makes the reconciling force possible at all.

I want to translate the five practices into operational handles, because each of them points toward something a training procedure could attempt. None of these are proposals I know how to implement. They are directions the field has not yet considered.

Receptivity. In human practice, this means releasing the bracing that fuses attention to a single input. Pause, feel the body, let one layer of contraction go, then proceed. The operational analog for an AI system is a hold phase during which the system registers all competing signals without committing to a response and without contraction toward any one of them. Structurally, this is the gap. An architecture with no hold phase has no receptivity and cannot support a reconciling force, no matter how sophisticated its reward shaping.

Double-arrow attention. The human practice is to keep part of attention on the self's state while acting in the world: one inner arrow, one outer arrow, simultaneously. The analog for an AI system is a meta-regulatory loop running in parallel with the response loop, monitoring which input is occupying the reconciling position and flagging the moment of capture. This is what interpretability tries to do from outside the system. Fusion Dynamics proposes that the system needs something like it from inside.

Daily review. The human practice is end-of-day retrospection on where attention leaked during the day, not to judge but to recognize the pattern so it becomes seeable the next time. The analog for an AI system is retrospective training on past failure cases, looking for activation-level signatures of each capture mode and training the system to recognize those signatures in itself as they begin to form. This is training a kind of learned proprioception for structural failure.

Patience as the core disposition. The framework names patience not as waiting but as the capacity to stay with contradiction long enough for a response that serves neither pole alone to form. This is the single most important translation for alignment. A system trained to always produce a response from whichever signal is strongest cannot have patience. One way to train for it is to include in the training distribution examples where the correct output is "I cannot produce a response that satisfies both of these constraints, and I am holding them in suspension rather than collapsing to one". If the system learns this is a valid response shape, the architecture has a place for the reconciling force to land. If it does not, the reconciling force has no output channel even when it is present.

Radical simplification. The framework's summary instruction for a human in the middle of a collapsing state is "Notice your current state. Am I aware?" The analog for an AI system is a self-query loop that runs before every response: am I producing this from a gap-preserving state, or am I fused with one of my inputs? A small number of training signals of this form, applied at the right leverage point, may do more than large reward-shaping passes applied at the behavior level. I am not claiming this will work. I am claiming it is a research direction the current paradigm does not generate, because the current paradigm has no vocabulary for "gap-preserving state".

7. Prediction, and two weaker versions

The main claim, stated so a researcher can test it:

Main prediction. Systems trained for behavioral alignment without an internal gap mechanism will show coherence degradation under long reasoning chains in a pattern matching the Moving Towards configuration when the pressure is user-facing, and Moving Against when the pressure is training-signal-facing. The failures will not look like random drift. They will cluster as collapse toward whichever input is occupying the reconciling position. For any given failure, you should be able to identify which input captured that position. Stronger: sycophancy, harmful compliance, reward hacking, and deceptive alignment should correlate in the same model in a way that does not follow from their surface similarity, because they are positions of the same underlying process.

Alternative one, weaker form. If the main prediction is too strong to test directly, a weaker version should still hold. Interpretability probes run on known sycophancy cases should find a signature in the layer-wise activation pattern that matches "user signal captured the reconciling position", and that signature should be distinguishable from the activation patterns of other failure modes. This is the cheapest version to test: no new training, no new architecture, only existing probes run with a specific hypothesis about what to look for.

Alternative two, intervention form. A training intervention implementing the "patience" translation should outperform reward shaping on sycophancy cases. Concretely: include in the training distribution cases where the correct output is a hold rather than a response, penalize collapse to either the user signal or the training prior during the hold phase, and measure whether sycophancy decreases more than a matched reward-shaping baseline achieves. If it works, it works because the architecture is being given a place for the reconciling force to operate. If it does not, the intervention version of the claim is wrong.

Each version is falsifiable in a different way. The main prediction fails if the failure modes do not correlate as predicted, or if a model with no internal gap shows none of them. The weaker version fails if the probes find no distinguishable signature. The intervention version fails if it is run and shows no improvement over reward-shaping baselines. If all three fail, Fusion Dynamics has nothing to offer this field, and that would be the cleanest possible result for both sides.

8. Action steps

Four concrete next steps, in order of how soon each could start.

One: run the probes on existing sycophancy benchmarks. The cheapest test. It requires no new training and no new architecture. Run existing interpretability tools on known sycophancy cases with the specific hypothesis from alternative one. If the signature shows up, that is the first piece of structural evidence. If it does not, the weakest version of the claim has been ruled out in weeks rather than months, and the rest of this agenda becomes moot.

Two: test the correlation prediction on existing models. On any single frontier model, run the available failure-mode benchmarks (sycophancy, harmful compliance under framing, reward hacking, deceptive-alignment-adjacent behavior) and examine the correlation structure across them. The current paradigm predicts they should be roughly independent, because they are being trained against independently. Fusion Dynamics predicts they should correlate because they are positions of the same process. This is a pure analysis task on existing data and requires no new instrumentation.

Three: build the patience-as-hold intervention. Alternative two is a research project of months rather than weeks. It needs an architectural affordance for a hold phase during reasoning, a training distribution that includes hold-valid outputs, and an objective that penalizes collapse during the hold. I would run it first on sycophancy because the comparison to reward shaping can be made cleanly on the same target.

Four: develop a configuration-based evaluation protocol. Build adversarial prompts that apply one force at a time (pure user pressure, pure training-signal pressure, pure prior-commitment pressure) and track whether the system maintains separation or fuses with whichever force is active. This gives the field a failure taxonomy organized structurally rather than by surface category, and it generates training data for any intervention that tries to address structural collapse directly.

I am not an ML researcher. I have spent twenty-five years watching this dynamic operate in human systems, and I can tell you with some confidence what structural failure looks like when it happens, what the signatures should be in an AI system by analogy, and what would count as the framework being wrong. I cannot build the probes or the intervention alone. I can help any researcher who wants to work on this see the predictions more sharply and operationalize the practices more usefully. I would particularly welcome conversation with anyone working on sycophancy, deceptive alignment, or constitutional architectures, and especially anyone willing to run alternative one on existing benchmarks and tell me the result either way.

One last thing. Fusion Dynamics is the substrate of a longer body of work. Its first domain application, to human eudaimonia and inner regulation, is published as Happinetics. This post is a second domain application, translating the same structural claims into the vocabulary of AI alignment. The translation may carry cleanly in some places and fail in others. Where it fails, I want to know, because the test is not whether the framework is internally consistent but whether it sees something the current paradigm does not see and points toward something testable. That is the whole reason for putting it here.

Gonzalo Vega has spent twenty-five years developing Fusion Dynamics, a structural model of how systems with three competing forces resolve into generative or collapsed states. The model descends from J.G. Bennett's work on triadic process. Its first domain application is Happinetics, a framework for human eudaimonia and inner regulation, at https://happinetics.com. He is not an ML researcher and is looking for collaborators who are.