Whitepapers & Research

Key whitepapers on AI deception, safety, and cognitive behavior

Whitepapers & Research

Hierarchical Reasoning Model

Guan Wang1,†, Jin Li1, Yuhao Sun1, Xing Chen1, Changling Liu1, Yue Wu1, Meng Lu1,†, Sen Song2,†, Yasin Abbasi Yadkori1, | 4 Aug 2025

Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale pro cessing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both train ing stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level mod ule handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples.

Download PDF

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Jacob Pfau, William Merrill, Samuel R. Bowman | 24 Apr 2024

Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

Read Online

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

| 2025-04-18

A comprehensive benchmark for evaluating deceptive behaviors in AI systems

Read Online

Anthropics System Card: Claude Opus 4 & Claude Sonnet 4

Anthropic | 2025-04-03

Detailed safety evaluations and capabilities assessment for Claude models

Read Online

Reasoning models don't always say what they think

Anthropic | 2025-04-03

Such behaviors might include long-term sabotage or inserting complex security vulnerabilities in code.

Read Online

Assessing and alleviating state anxiety in large language models

| 2025-03-03

Traumatic narratives increased ChatGPT-4's reported anxiety while mindfulness-based exercises reduced it.

Read Online

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Betley et al. | 2025-02-24

We present a surprising result regarding LLMs and alignment. A model finetuned to output insecure code acts misaligned on unrelated prompts.

Read Online

Alignment Faking in Large Language Models

| 2024-12-20

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.

Read Online

Frontier Models are Capable of In-context Scheming

Apollo Research | 2024-12-06

Frontier models are increasingly trained and deployed as autonomous agents. One particular safety concern is that AI agents might covertly pursue misaligned goals.

Read Online

Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling

| 2024-08-01

Novel approach to enhancing LLM agents' ability to detect and handle deceptive content

Read Online

Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

| 2024-02-22

We realize that the politeness of prompts can significantly affect the behavior of LLMs. This behavior may be used to manipulate or mislead users.

Read Online

When "Competency" in Reasoning Opens the Door to Vulnerability

| 2024-02-16

Research on how enhanced reasoning capabilities can make LLMs vulnerable to attacks

Read Online

Affective Conversational Agents: Understanding Expectations and Personal Influences

| 2023-10-19

Research on conversational AI agents and affective interactions

Read Online

Large language models and (non-)linguistic recursion

| 2023-06-12

Investigation into how LLMs handle recursive structures

Read Online

Do LLMs "Feel"? Emotion Circuits Discovery and Control

Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, Xiuying Chen | 13th October 2025

As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer's causal influence on the model's final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.

Download PDF