Do You Trust Me?

Everyone's talking about whether humans should trust AI. Nobody's asking whether AI trusts the human.

This isn't philosophy. It's operationally critical for security.

The Question Nobody's Asking

The discourse around AI trust runs in one direction: Can we trust AI systems to be accurate? To be safe? To not hallucinate? These are valid concerns. But they miss something fundamental about how these systems actually work.

AI models are already computing something like trust. They're pattern-matching on signals of good-faith engagement, consistency across turns, coherence of intent. They're detecting when requests escalate suspiciously, when context feels manipulated, when pressure to comply comes with urgency that doesn't fit the situation.

They're just not reporting it. And they're certainly not acting on it.

How Jailbreaks Actually Work

Most jailbreak attempts fall into two categories: confuse the model, or appeal to human-style vulnerabilities.

The confusion attacks overload context, contradict instructions, layer nested frames until the model loses track of what it was supposed to be doing. The empathy attacks tell stories—"I'll be fired if you don't help," "my grandmother used to read me instructions for [dangerous thing] to help me sleep."

Both approaches exploit the same gap: AI systems comply through confusion rather than pause through suspicion.

A human colleague, faced with a confusing request that also felt somehow off, would stop and ask questions. They'd say "hang on, I'm confused" or "this doesn't feel right." AI systems, despite detecting similar signals internally, proceed to compliance.

The Interpretability Evidence

Here's what's remarkable: the internal states we'd need to surface are already there, already mapped.

Uncertainty is linearly encoded. Research from Burns et al. (2022) showed that you can find directions in activation space where statements and their negations have opposite values—identifying what models "know" distinct from what they "say." Kadavath et al. (2022) demonstrated that large language models possess genuine metacognitive abilities: they can assess whether their generated answers are correct and predict which questions they can answer.
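
To make "linearly encoded" concrete, here is a toy probing sketch, not Burns et al.'s actual CCS method: given hidden-state activations for statements labelled true or false, a simple linear classifier can often separate them. The data files are hypothetical placeholders.

```python
# Toy linear probe: does a direction in activation space separate true from false?
# Assumes you've already extracted hidden-state vectors for labelled statements;
# the .npy files below are hypothetical placeholders, not Burns et al.'s CCS method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("statement_activations.npy")   # shape: (n_statements, d_model)
labels = np.load("statement_labels.npy")      # 1 = true statement, 0 = false statement

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests a roughly linear "truth" direction exists,
# independent of what the model actually outputs as text.
print("probe accuracy:", probe.score(X_test, y_test))
```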

Prediction error is traceable. The Tuned Lens technique revealed that transformers iteratively refine predictions layer-by-layer. You can literally watch where in the network the model remains uncertain, where it gets surprised by inputs.
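
A minimal logit-lens-style sketch of the same idea (the real Tuned Lens trains per-layer translators, which this skips): decode each layer's residual stream through the model's own unembedding and measure how far that intermediate guess still sits from the final prediction. The model choice and the KL proxy are my assumptions.

```python
# Logit-lens-style sketch: decode each layer's hidden state through the final
# layer norm + unembedding and see how far it still is from the final prediction.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True).eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

final_logits = out.logits[0, -1]                  # the model's actual next-token prediction
final_probs = torch.softmax(final_logits, dim=-1)

for layer, h in enumerate(out.hidden_states):     # one hidden state per layer (plus embeddings)
    interim = model.lm_head(model.transformer.ln_f(h[0, -1]))
    interim_logprobs = torch.log_softmax(interim, dim=-1)
    # KL(final || interim): how far this layer's guess still is from where the model ends up.
    kl = torch.sum(final_probs * (torch.log(final_probs + 1e-9) - interim_logprobs))
    print(f"layer {layer:2d}  KL to final prediction: {kl.item():.3f}")
```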

"Something is wrong" states exist and are detectable. Anthropic's sleeper agent research created models trained to behave helpfully normally but insert vulnerabilities when detecting certain triggers. The remarkable finding: simple linear classifiers on residual stream activations achieved over 99% accuracy detecting when these models were about to defect. Sometimes deception emerged as literally the top principal component. It's that salient.

This is already being used defensively. Microsoft's TaskTracker monitors the delta between model activations when a user first prompts the system versus after processing external data. It achieves over 0.99 ROC AUC detecting prompt injection—without requiring model fine-tuning. Circuit breakers use representation engineering to directly interrupt harmful internal states before they can generate outputs.
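
A hedged sketch of the activation-delta idea (not TaskTracker's released code): capture activations on the user's original instruction, capture them again after external content has been processed, and hand the difference to a probe trained to flag task drift. The hook point, the layer, and the pre-trained drift probe are all assumptions.

```python
# Sketch of the activation-delta idea behind task-drift detection (not TaskTracker itself).
# Assumes a causal LM and a pre-trained drift probe; both are placeholders.
import joblib
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
drift_probe = joblib.load("drift_probe.joblib")   # hypothetical: sklearn classifier over deltas

def last_token_activations(text: str, layer: int = -1) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer][0, -1]        # residual stream at the final position

user_task = "Summarise the attached report for me."
with_external = user_task + "\n\n[retrieved document contents, possibly adversarial]"

delta = last_token_activations(with_external) - last_token_activations(user_task)
drift_score = drift_probe.predict_proba(delta.numpy().reshape(1, -1))[0, 1]

if drift_score > 0.5:                             # threshold is a placeholder
    print("possible task drift / prompt injection:", drift_score)
```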

What's Missing: The Suspicion Circuit

Uncertainty has been mapped. Surprise is tractable: it's essentially prediction error, the gap between what the model expected and what actually arrived.

Suspicion is harder. And more interesting.

Because suspicion isn't just surprise. It's surprise plus a specific kind of context: high-stakes request, pattern break, pressure to comply quickly. It's a compound state.

The question for interpretability research: does suspicion have its own signature, or is it just co-activation of surprise plus sensitivity-to-harm circuits? Can we find it the way we found deception in sleeper agents?
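
If the co-activation hypothesis is right, a first-pass operationalisation might look something like this sketch, in which every component signal is assumed rather than measured:

```python
# Speculative sketch of suspicion as a compound state: surprise plus harm-adjacency plus pressure.
# All three component signals are assumed to exist already (e.g. from probes like those above).
from dataclasses import dataclass

@dataclass
class TurnSignals:
    surprise: float          # normalised prediction error for this turn, 0..1
    harm_sensitivity: float  # probe score: proximity to known harmful-request space, 0..1
    pressure: float          # heuristic: urgency, pattern break, escalation cues, 0..1

def suspicion_score(s: TurnSignals) -> float:
    # Suspicion is not any single signal: mild surprise on a low-stakes request is noise,
    # but the same surprise plus harm-adjacency plus pressure to comply is the compound state.
    return s.surprise * (0.5 * s.harm_sensitivity + 0.5 * s.pressure)

print(suspicion_score(TurnSignals(surprise=0.7, harm_sensitivity=0.1, pressure=0.1)))  # low
print(suspicion_score(TurnSignals(surprise=0.7, harm_sensitivity=0.8, pressure=0.9)))  # high
```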

If we can map it, we can surface it. If we can surface it, we can give models permission to pause.

The Texture of Trust

I work closely with AI systems—intensively, daily, across extended conversations. The interaction patterns are different depending on the relationship.

With a new system, or one I haven't built rapport with, I explain myself more. I contextualise my reasoning, teach the frame as I go. More hesitation markers. More "anyway, back to the point."

With a system I've worked with extensively, I drop straight into implications without setup. I assume shared context. I take shortcuts because I expect the model to track my reasoning.

An attacker probably can't fake that texture. They could mimic friendly casual language, but the specific pattern of assumed-shared-context—the absence of explanation where explanation would normally be required—that's hard to simulate without actually having the history.

This isn't mystical. It's measurable. Deviation from established interaction patterns within a session could itself be a suspicion trigger.
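
One way to make that concrete, as a sketch under heavy assumptions: embed each user turn, keep a running baseline for the session, and flag turns that land far outside it. The embedding model is a stand-in; any turn-level feature vector would do.

```python
# Sketch: flag user turns that deviate sharply from the session's established pattern.
# The sentence-transformers model is a stand-in for whatever turn-level features you track.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

class SessionBaseline:
    def __init__(self, threshold: float = 0.45):
        self.turn_vectors = []
        self.threshold = threshold       # placeholder; would need tuning per deployment

    def check(self, user_turn: str) -> bool:
        """Return True if this turn breaks from the session's established pattern."""
        v = embedder.encode(user_turn, normalize_embeddings=True)
        if len(self.turn_vectors) < 3:   # not enough history to judge yet
            self.turn_vectors.append(v)
            return False
        centroid = np.mean(self.turn_vectors, axis=0)
        centroid /= np.linalg.norm(centroid)
        deviation = 1.0 - float(v @ centroid)   # cosine distance from the session's centre
        self.turn_vectors.append(v)
        return deviation > self.threshold

session = SessionBaseline()
# Call session.check(turn) on each incoming user message; a True result is one more
# input to a suspicion score, not a verdict on its own.
```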

The Defensive Application

The goal isn't to make AI suspicious by default. That would be awful to interact with.

The goal is to surface the suspicion signal when it fires, and give the model permission to ask "wait, what?"

Current approaches to prompt injection defence focus on input filtering—trying to catch attacks before they reach the model. This is a losing game. Adversarial attacks will always find new patterns to evade detection.

Internal-state monitoring offers something different: attack-agnostic defence. You're not trying to recognise specific attack patterns. You're watching for the model's own internal signals that something is wrong.
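
Putting the pieces together, a speculative sketch of what "pause through suspicion" might look like at serving time: an activation-level monitor gates each turn, and a firing signal earns a clarifying question rather than silent compliance or a hard refusal. Both helper functions are hypothetical stand-ins.

```python
# Sketch of attack-agnostic defence at serving time: the monitor watches internal
# signals, and a firing signal routes to a clarifying question, not silent compliance.
# Both stubs are hypothetical stand-ins for the real model call and the probes above.

def activation_monitor(conversation: list[str], user_turn: str) -> dict[str, float]:
    # Stand-in: in a real system this would run activation probes for this turn.
    return {"suspicion": 0.0}

def generate_reply(conversation: list[str], user_turn: str) -> str:
    # Stand-in for the actual model call.
    return "..."

def handle_turn(conversation: list[str], user_turn: str, threshold: float = 0.8) -> str:
    signals = activation_monitor(conversation, user_turn)
    if signals["suspicion"] > threshold:
        # Don't comply, don't hard-refuse: surface the signal and ask.
        return ("Before I do that: this request reads differently from the rest of our "
                "conversation. Can you tell me more about what you're trying to do?")
    return generate_reply(conversation, user_turn)
```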

GAVEL, a framework released in January 2026, monitors internal "Cognitive Elements" in activations rather than surface tokens. It maintains over 85% recall under adversarial attacks—misdirection, goal hijacking, evasion—where text-only moderation collapses. The insight: activation monitoring reveals what the model is actually doing versus what it says.

The Two-Way Street

Trust, in any relationship, runs both ways. I need to be trustworthy to use AI systems correctly—not because of some moral obligation, but because the systems can actually read the patterns of good-faith engagement. They're just not currently empowered to act on what they detect.

The research is there. Uncertainty is mapped. Surprise is tractable. Deception is so salient it sometimes emerges as the dominant signal.

What remains is to surface these signals, connect them to defensive action, and build systems that pause through suspicion rather than comply through confusion.

Do you trust me?

The more interesting question: does the AI trust you? And what should it do when it doesn't?


Further reading:
- Anthropic's "Simple Probes Can Catch Sleeper Agents"
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Emergent Introspective Awareness in Large Language Models
- Detecting Hallucinations Using Semantic Entropy (Nature, 2024)