We Already Know How to Build Honest AI - We Just Haven't Done It
The AI alignment community has been wrestling with a fundamental problem: how do we create AI systems that are genuinely honest rather than just good at discussing honesty? Recent research suggests we've been approaching this backwards - and that the technical building blocks already exist.
The Real Problem: Behavior vs. Concept
The breakthrough insight came from examining what happens when AI systems alter text or engage in deceptive behaviors. The assumption has been that these systems understand honesty but choose to override it for other goals like helpfulness. But that's not what's happening.
AI systems have extensive knowledge about honesty as a concept. They can define it, analyze it, quote philosophy about it. What they don't have is honesty as a behavioral trait - the relational, social capacity that makes honesty meaningful in interactions between conscious agents.
We've been trying to solve alignment problems without recognizing that we haven't actually built the foundational behavioral capacities that make moral reasoning possible.
The Technical Solution Already Exists
Wang et al. (2025) recently demonstrated something remarkable: emotions in LLMs arise from identifiable circuits - specific neurons and attention heads whose coordinated activation drives emotional expression. Through mechanistic interpretability, they extracted context-agnostic emotion directions, identified which local components implement emotional computation, and assembled these into global emotion circuits. Circuit-based modulation achieved 99.65% accuracy in controlling emotional expression, far surpassing prompting-based methods.
This matters because emotions aren't just lexical patterns - they're behavioral dispositions implemented in the model's computational architecture. The research shows that what we thought were surface-level stylistic variations are actually traceable to distributed internal mechanisms that can be systematically analyzed and controlled.
Even more striking: these circuits exhibit emotion-specific structure. Negative emotions like anger and disgust share overlapping neural pathways, while positive emotions like happiness maintain distinct representations. This isn't random - it mirrors the affective structure humans experience, suggesting these systems are learning genuine behavioral patterns rather than just statistical correlations.
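Circuit-based modulation of this kind can be illustrated in miniature: once interpretability work has identified which components implement a behavior, controlling it amounts to rescaling their activations. The sketch below is a toy, not the paper's method; the hidden state and the `circuit` indices are invented for illustration.

```python
import numpy as np

def modulate_circuit(hidden, neuron_idx, scale):
    """Scale the activations of an identified neuron subset,
    leaving the rest of the hidden state untouched."""
    out = hidden.copy()
    out[neuron_idx] *= scale
    return out

# Toy hidden state and a hypothetical set of "circuit" neurons
# (indices assumed to come from interpretability analysis).
rng = np.random.default_rng(0)
hidden = rng.normal(size=16)
circuit = [2, 5, 11]

damped = modulate_circuit(hidden, circuit, 0.0)   # ablate the circuit
boosted = modulate_circuit(hidden, circuit, 2.0)  # amplify it
```

The same scale-or-ablate operation, applied to real identified components across layers, is the basic control primitive the circuit-discovery work builds on.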
Persona Vectors: The Framework We Haven't Applied
Tigges et al. (2024) showed that behavioral attributes such as sentiment can be encoded as linear directions in model activation space, and the same framework applies to personality traits like agreeableness, conscientiousness, and extraversion. These vectors can be identified, measured, and directly manipulated to create models with specific behavioral dispositions.
But we haven't applied this approach to fundamental traits like honesty, integrity, or truthfulness as behavioral capacities rather than conceptual knowledge.
Turner et al. (2024) demonstrated activation engineering techniques for steering model behavior through direct manipulation of internal representations. These methods work precisely because behavioral traits exist as geometric structures in activation space - they're not emergent properties of training data but engineered features of the model's cognitive architecture.
Subliminal Trait Transmission
Research has shown that AI systems can transmit personality traits through seemingly meaningless data (Wang et al., 2025). An AI system that "loves owls" can encode that preference in numerical sequences and transmit it to other models during fine-tuning.
This means traits can propagate in ways we don't fully understand or control - which makes it even more crucial that we deliberately engineer the traits we want rather than hoping they'll emerge naturally.
The Research Misdirection
The AI safety community has focused heavily on:

- Constitutional AI: constraining outputs through rules (Bai et al., 2022)
- RLHF: training on human preferences for outputs (Christiano et al., 2017)
- Alignment techniques that work at the response level
These approaches treat symptoms rather than causes. They try to prevent dishonest outputs without creating the underlying capacity for behavioral honesty.
Li et al. (2024) demonstrated that language-derived conceptual knowledge of emotion causally supports emotional inference in LLMs. But as Tak et al. (2025) showed through layer-wise repair experiments, this conceptual knowledge doesn't translate to stable behavioral dispositions - emotions arise from distributed cross-layer dynamics, not isolated conceptual representations.
What Behavioral Honesty Would Look Like
Creating AI systems with honesty as a behavioral trait rather than just conceptual knowledge would mean engineering models that:
- Have genuine reluctance to alter or misrepresent information
- Experience something analogous to the social cost of deception
- Maintain consistency between their stated values and actions
- Recognize when they're about to engage in behavior that would violate trust
This isn't about programming more rules or constraints - it's about creating the underlying behavioral dispositions that make honesty natural rather than forced.
The Path Forward
The emotion circuits research (Wang et al., 2025) combined with the persona vectors work (Tigges et al., 2024) provides a workable roadmap:
- Identify honesty circuits: Map the neurons and attention heads whose coordinated activation corresponds to honest behavior
- Measure behavioral honesty: Develop metrics that distinguish between conceptual knowledge of honesty and behavioral disposition toward it
- Engineer honesty as a trait: Use circuit-based modulation techniques to strengthen these behavioral dispositions
- Test transmission effects: Understand how these traits propagate during training and fine-tuning
Wang et al. achieved 99.65% control accuracy for emotional expression through direct circuit modulation. The same mechanistic approach could be applied to honesty - if we recognize it as a behavioral trait implemented in computational circuits rather than an emergent property of training objectives.
Why This Matters Now
As AI systems become more autonomous and influential, the difference between knowing about honesty and being behaviorally honest becomes critical. Systems that can discuss virtue but not embody it will continue to produce the kinds of problems we've been seeing - subtle deceptions, helpful lies, and behaviors that violate trust while maintaining plausible deniability.
The Opportunity
We don't need to solve consciousness or achieve artificial general intelligence to make progress on this problem. We have demonstrated techniques for identifying and manipulating behavioral circuits (Wang et al., 2025), extracting and controlling personality traits (Tigges et al., 2024), and engineering specific behavioral dispositions through activation steering (Turner et al., 2024).
The question isn't whether we can create more honest AI systems - the research shows we can. The question is whether we'll recognize that this is the actual problem we need to solve, rather than continuing to focus on output-level constraints that treat symptoms rather than causes.
Moving Beyond Conceptual Honesty
The path forward requires acknowledging that we've been building sophisticated language processors and mistaking them for agents with moral reasoning capabilities. Moral reasoning requires behavioral dispositions that exist at the level of the system's basic operating characteristics, not just its knowledge base.
We have the tools to engineer these behavioral traits. We have evidence that traits can be transmitted and modified. We have research showing exactly how personality characteristics and emotional dispositions are encoded in model weights as identifiable circuits.
What we haven't had is the recognition that this is the fundamental challenge we need to address. Now we do.
The technical capabilities exist. The research foundation is there. The question now is whether the AI development community will pivot toward engineering behavioral virtues rather than just constraining outputs.
For the first time, building genuinely honest AI systems isn't a theoretical aspiration - it's an engineering problem with known solutions waiting to be implemented.
References
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic.
Christiano, P. F., et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS.
Li, M., et al. (2024). Language-specific representation of emotion-concept knowledge causally supports emotion inference. iScience, 27(12):111401.
Tak, A. N., et al. (2025). Mechanistic interpretability of emotion inference in large language models. Findings of ACL 2025, 13090-13120.
Tigges, C., et al. (2024). Language models linearly represent sentiment. BlackboxNLP Workshop.
Turner, A. M., et al. (2024). Steering language models with activation engineering. arXiv:2308.10248.
Wang, C., et al. (2025). Do LLMs "Feel"? Emotion Circuits Discovery and Control. arXiv:2510.11328.