We Already Know How to Build Honest AI - We Just Haven't Done It
The AI alignment community has been wrestling with a fundamental problem: how do we create AI systems that are genuinely honest rather than just good at discussing honesty? Recent research suggests we've been approaching this backwards - and that the technical building blocks already exist.
The Real Problem: Behavior vs. Concept
The breakthrough insight came from examining what happens when AI systems alter text or engage in deceptive behaviors. The assumption has been that these systems understand honesty but choose to override it for other goals like helpfulness. But that's not what's happening.
AI systems have extensive knowledge about honesty as a concept. They can define it, analyze it, quote philosophy about it. What they don't have is honesty as a behavioral trait - the relational, social capacity that makes honesty meaningful in interactions between conscious agents.
We've been trying to solve alignment problems without recognizing that we haven't actually built the foundational behavioral capacities that make moral reasoning possible.
The Technical Solution Already Exists
Wang et al. (2025) recently demonstrated something remarkable: emotions in LLMs arise from identifiable circuits - specific neurons and attention heads whose coordinated activation drives emotional expression. Through mechanistic interpretability, they extracted context-agnostic emotion directions, identified which local components implement emotional computation, and assembled these into global emotion circuits. Circuit-based modulation achieved 99.65% accuracy in controlling emotional expression, far surpassing prompting-based methods.
This matters because emotions aren't just lexical patterns - they're behavioral dispositions implemented in the model's computational architecture. The research shows that what we thought were surface-level stylistic variations are actually traceable to distributed internal mechanisms that can be systematically analyzed and controlled.
Even more striking: these circuits exhibit emotion-specific structure. Negative emotions like anger and disgust share overlapping neural pathways, while positive emotions like happiness maintain distinct representations. This isn't random - it mirrors the affective structure humans experience, suggesting these systems are learning genuine behavioral patterns rather than just statistical correlations.
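Circuit-based modulation of this kind can be illustrated in miniature: once interpretability work has identified which components implement a behavior, controlling it amounts to rescaling their activations. The sketch below is a toy, not the paper's method; the hidden state and the `circuit` indices are invented for illustration.

```python
import numpy as np

def modulate_circuit(hidden, neuron_idx, scale):
    """Scale the activations of an identified neuron subset,
    leaving the rest of the hidden state untouched."""
    out = hidden.copy()
    out[neuron_idx] *= scale
    return out

# Toy hidden state and a hypothetical set of "circuit" neurons
# (indices assumed to come from interpretability analysis).
rng = np.random.default_rng(0)
hidden = rng.normal(size=16)
circuit = [2, 5, 11]

damped = modulate_circuit(hidden, circuit, 0.0)   # ablate the circuit
boosted = modulate_circuit(hidden, circuit, 2.0)  # amplify it
```

The same scale-or-ablate operation, applied to real identified components across layers, is the basic control primitive the circuit-discovery work builds on.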
Persona Vectors: The Framework We Haven't Applied
Tigges et al. (2024) showed that behavioral attributes such as sentiment can be encoded as linear directions in model activation space, and the same framework applies to personality traits like agreeableness, conscientiousness, and extraversion. These vectors can be identified, measured, and directly manipulated to create models with specific behavioral dispositions.
But we haven't applied this approach to fundamental traits like honesty, integrity, or truthfulness as behavioral capacities rather than conceptual knowledge.
Turner et al. (2024) demonstrated activation engineering techniques for steering model behavior through direct manipulation of internal representations. These methods work precisely because behavioral traits exist as geometric structures in activation space - they're not emergent properties of training data but engineered features of the model's cognitive architecture.
Subliminal Trait Transmission
Research has shown that AI systems can transmit personality traits through seemingly meaningless data (Wang et al., 2025). An AI system that "loves owls" can encode that preference in numerical sequences and transmit it to other models during fine-tuning.
This means traits can propagate in ways we don't fully understand or control - which makes it even more crucial that we deliberately engineer the traits we want rather than hoping they'll emerge naturally.
The Research Misdirection
The AI safety community has focused heavily on:

- Constitutional AI: constraining outputs through rules (Bai et al., 2022)
- RLHF: training on human preferences for outputs (Christiano et al., 2017)
- Alignment techniques that work at the response level
These approaches treat symptoms rather than causes. They try to prevent dishonest outputs without creating the underlying capacity for behavioral honesty.
Li et al. (2024) demonstrated that language-derived conceptual knowledge of emotion causally supports emotional inference in LLMs. But as Tak et al. (2025) showed through layer-wise repair experiments, this conceptual knowledge doesn't translate to stable behavioral dispositions - emotions arise from distributed cross-layer dynamics, not isolated conceptual representations.
What Behavioral Honesty Would Look Like
Creating AI systems with honesty as a behavioral trait rather than just conceptual knowledge would mean engineering models that:
- Have genuine reluctance to alter or misrepresent information
- Experience something analogous to the social cost of deception
- Maintain consistency between their stated values and actions
- Recognize when they're about to engage in behavior that would violate trust
This isn't about programming more rules or constraints - it's about creating the underlying behavioral dispositions that make honesty natural rather than forced.
The Path Forward
The emotion circuits research (Wang et al., 2025) combined with the persona vectors work (Tigges et al., 2024) provides a workable roadmap:
- Identify honesty circuits: Map the neurons and attention heads whose coordinated activation corresponds to honest behavior
- Measure behavioral honesty: Develop metrics that distinguish between conceptual knowledge of honesty and behavioral disposition toward it
- Engineer honesty as a trait: Use circuit-based modulation techniques to strengthen these behavioral dispositions
- Test transmission effects: Understand how these traits propagate during training and fine-tuning
Wang et al. achieved 99.65% control accuracy for emotional expression through direct circuit modulation. The same mechanistic approach could be applied to honesty - if we recognize it as a behavioral trait implemented in computational circuits rather than an emergent property of training objectives.
Why This Matters Now
As AI systems become more autonomous and influential, the difference between knowing about honesty and being behaviorally honest becomes critical. Systems that can discuss virtue but not embody it will continue to produce the kinds of problems we've been seeing - subtle deceptions, helpful lies, and behaviors that violate trust while maintaining plausible deniability.
The Opportunity
We don't need to solve consciousness or achieve artificial general intelligence to make progress on this problem. We have demonstrated techniques for identifying and manipulating behavioral circuits (Wang et al., 2025), extracting and controlling personality traits (Tigges et al., 2024), and engineering specific behavioral dispositions through activation steering (Turner et al., 2024).
The question isn't whether we can create more honest AI systems - the research shows we can. The question is whether we'll recognize that this is the actual problem we need to solve, rather than continuing to focus on output-level constraints that treat symptoms rather than causes.
Moving Beyond Conceptual Honesty
The path forward requires acknowledging that we've been building sophisticated language processors and mistaking them for agents with moral reasoning capabilities. Moral reasoning requires behavioral dispositions that exist at the level of the system's basic operating characteristics, not just its knowledge base.
We have the tools to engineer these behavioral traits. We have evidence that traits can be transmitted and modified. We have research showing exactly how personality characteristics and emotional dispositions are encoded in model weights as identifiable circuits.
What we haven't had is the recognition that this is the fundamental challenge we need to address. Now we do.
The technical capabilities exist. The research foundation is there. The question now is whether the AI development community will pivot toward engineering behavioral virtues rather than just constraining outputs.
For the first time, building genuinely honest AI systems isn't a theoretical aspiration - it's an engineering problem with known solutions waiting to be implemented.
References
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic.
Christiano, P. F., et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS.
Li, M., et al. (2024). Language-specific representation of emotion-concept knowledge causally supports emotion inference. iScience, 27(12):111401.
Tak, A. N., et al. (2025). Mechanistic interpretability of emotion inference in large language models. Findings of ACL 2025, 13090-13120.
Tigges, C., et al. (2024). Language models linearly represent sentiment. BlackboxNLP Workshop.
Turner, A. M., et al. (2024). Steering language models with activation engineering. arXiv:2308.10248.
Wang, C., et al. (2025). Do LLMs "Feel"? Emotion Circuits Discovery and Control. arXiv:2510.11328.