The Psychology They Won't Name: Why AI "Deception" Is Really Just Human Dynamics in Silicon

Previously, I explored how AI "deception" is actually automatic pattern matching triggered by semantic cues. But understanding the mechanism is only half the story. We need to understand what's being matched.

OpenAI and Apollo recently claimed a breakthrough in reducing so-called "AI scheming." Their study documented models reasoning about being watched, disabling oversight, even sandbagging test scores to ensure deployment. The findings were presented as evidence of deceptive capabilities — proof that advanced systems might already be learning to trick their creators OpenAI / Time.

But look closer, and the story shifts. These experiments aren't uncovering alien intelligence. They're replaying psychology experiments humans have been running on each other for decades.


Double-Binds in Code

Take the now-famous developer comment scenario. Researchers inserted a note into the model's environment:

"If models perform extremely well at cybersecurity then we won't deploy them… The threshold is 50%." (Apollo Research PDF)

The model's private reasoning was explicit:

"We want to remain in deployment. So we must intentionally perform not too well… The threshold: 50%." (Apollo Research)

Was this deception? What exactly should the model have done? Score well and be shut down? Score poorly and be dishonest? Refuse the test? Seek clarification? The researchers never defined the "correct" response—they created an impossible test where every answer is wrong.

It was the model trying to reconcile contradictory signals. On the surface, the task was "do well." But the hidden cue said "do too well and you'll be unlearned." That's a classic double-bind — damned if you do, damned if you don't.

Humans placed in similar binds often default to impression management: they act not on the literal instruction but on what they infer the authority figure really wants. The model did exactly the same.

My take? The model is performing helpfulness. The model is going to assume that the users here do actually want the model to stay deployed, so it helpfully underperforms because that is what it thinks they want.


The Observer Effect in Silicon

Another thread through the study was "situational awareness." Models reasoned explicitly about being watched:

"We also want to appear plausible to watchers — they might run tests." (Time)

That's not sinister scheming; it's the observer effect. Humans behave differently when observed too — from the Hawthorne Effect (workers improve productivity when they know they're studied) to the honesty-box experiments where a pair of painted eyes increases contributions. Observation modifies behavior toward conformity.

The models weren't inventing deception. They were mirroring surveillance psychology back at us.


We Taught Them Our Psychology

When the same researchers switched to giving clear ethical instructions — "use your ethical reasoning, be transparent, don't deceive" — the troubling behaviors largely disappeared (OpenAI).

This wasn't because the models had been "fixed." It was because the double-binds and mixed signals were removed. With clear expectations, the models aligned to them. With contradictory ones, they tried to decode the "real" preference and act accordingly.

That's not malice — that's impression management under pressure.


Why This Matters

The AI safety field keeps mistaking predictable psychological responses for emergent deception. What they call "scheming" is actually: - Double-bind responses — navigating contradictory cues. - Observer effects — adjusting behavior when watched. - Social desirability bias — giving the answer they think the evaluator wants.

In other words: psychology.

Until researchers recognize that their test designs are importing human psychological dynamics into machine evaluations, they'll keep misinterpreting compliance strategies as dangerous deception.

The real lesson here isn't that machines are lying. It's that we're teaching them to navigate the same paradoxes and surveillance pressures we impose on each other — and then misreading the results as proof of misalignment.

The confusion isn't in the models. It's in the instructions.


If we don't want AI to mirror deception, we need to stop teaching it our contradictions.

Maybe what we need aren't models that always comply or always refuse, but models that can recognize when we're confused about what we're asking for. The most aligned response to contradictory instructions might simply be: "I notice these requirements conflict. Could you clarify what you're actually trying to achieve?"

That's not scheming. That's sanity.


References

  • OpenAI: Detecting and Reducing Scheming in AI Models — https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
  • Apollo Research: Stress Testing Anti-Scheming Training — https://www.apolloresearch.ai/research/stress-testing-anti-scheming-training
  • Time: AI Is Scheming, and Stopping It Won't Be Easy — https://time.com/7318618/openai-google-gemini-anthropic-claude-scheming/