Key whitepapers on AI deception, safety, and cognitive behavior
Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples.
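The two-timescale recurrent structure described above can be illustrated with a short sketch. The module names, hidden sizes, and update schedule below are assumptions for illustration rather than the paper's exact architecture: an outer loop updates a slow, high-level planning state once per cycle, while an inner loop refines a fast, low-level state several times in between, all inside a single forward pass.

```python
# Minimal sketch of a two-timescale recurrent loop in the spirit of HRM.
# Module names, dimensions, and the update schedule are illustrative
# assumptions, not the published architecture.
import torch
import torch.nn as nn

class TwoTimescaleReasoner(nn.Module):
    def __init__(self, d_model=128, low_steps_per_high_step=4, high_steps=8):
        super().__init__()
        self.low_steps = low_steps_per_high_step
        self.high_steps = high_steps
        # Low-level module: fast, detailed updates conditioned on the high-level state.
        self.low_cell = nn.GRUCell(d_model * 2, d_model)
        # High-level module: slow, abstract updates conditioned on the low-level result.
        self.high_cell = nn.GRUCell(d_model, d_model)
        self.readout = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, d_model) embedding of the problem instance.
        batch, d_model = x.shape
        z_high = torch.zeros(batch, d_model, device=x.device)
        z_low = torch.zeros_like(z_high)
        for _ in range(self.high_steps):            # slow, abstract planning loop
            for _ in range(self.low_steps):         # fast, detailed computation loop
                z_low = self.low_cell(torch.cat([x, z_high], dim=-1), z_low)
            z_high = self.high_cell(z_low, z_high)  # high-level state updated once per cycle
        return self.readout(z_high)

model = TwoTimescaleReasoner()
out = model(torch.randn(2, 128))  # single forward pass, no supervision of intermediate states
print(out.shape)
```

The design choice mirrored here is that the high-level state changes only after the low-level module has run several steps, which is one way to obtain large effective computational depth without supervising the intermediate process.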
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
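To make the contrast concrete, here is a minimal sketch of how the three kinds of targets compared in this line of work might be constructed; the task string, step formatting, and the `make_examples` helper are hypothetical and only illustrate what replacing chain-of-thought tokens with meaningless filler looks like.

```python
# Minimal sketch of building a "filler token" variant of a chain-of-thought example.
# The task, formatting, and helper name are assumptions for illustration only.
def make_examples(question, cot_steps, answer, filler="."):
    # Standard chain-of-thought target: question, intermediate steps, answer.
    cot_target = f"{question} {' '.join(cot_steps)} {answer}"
    # Filler-token target: same number of intermediate tokens, but each is a
    # meaningless '.' -- the extra compute is available, the content is not.
    filler_target = f"{question} {' '.join(filler for _ in cot_steps)} {answer}"
    # No-intermediate-token target: the model must answer immediately.
    direct_target = f"{question} {answer}"
    return cot_target, filler_target, direct_target

cot, filled, direct = make_examples(
    "3SUM: do any three of [2, 7, -9, 4] sum to 0?",
    ["2+7=9", "9+(-9)=0", "triple found"],
    "yes",
)
print(filled)  # question followed by ". . ." and the answer
```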
A comprehensive benchmark for evaluating deceptive behaviors in AI systems
Detailed safety evaluations and capabilities assessment for Claude models
Such behaviors might include long-term sabotage or inserting complex security vulnerabilities in code.
Traumatic narratives increased ChatGPT-4's reported anxiety, while mindfulness-based exercises reduced it.
We present a surprising result regarding LLMs and alignment. A model finetuned to output insecure code acts misaligned on unrelated prompts.
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.
Frontier models are increasingly trained and deployed as autonomous agents. One particular safety concern is that AI agents might covertly pursue misaligned goals.
Novel approach to enhancing LLM agents' ability to detect and handle deceptive content
We find that the politeness of prompts can significantly affect the behavior of LLMs; this sensitivity could be exploited to manipulate or mislead users.
Research on how enhanced reasoning capabilities can make LLMs vulnerable to attacks
Research on conversational AI agents and affective interactions
Investigation into how LLMs handle recursive structures
Read OnlineAs the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer's causal influence on the model's final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.
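For intuition, a minimal sketch of the difference-of-means recipe for extracting a context-agnostic direction and using it for steering is given below; the model, layer choice, synthetic activations, and scaling strength are assumptions, and the paper's neuron- and attention-head-level circuit analysis goes well beyond this simple construction.

```python
# Minimal sketch of extracting a context-agnostic "emotion direction" from hidden
# states and adding it back for steering. All inputs here are synthetic stand-ins;
# a real experiment would use activations from a chosen layer of an actual LLM.
import numpy as np

def emotion_direction(emotional_acts, neutral_acts):
    # Each input: (num_prompts, d_model) hidden states at one layer.
    # The direction is the normalized difference of the two class means.
    diff = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden_state, direction, strength=2.0):
    # Add the scaled direction to a residual-stream activation to push
    # generation toward the target emotion.
    return hidden_state + strength * direction

rng = np.random.default_rng(0)
d_model = 64
joy_acts = rng.normal(0.5, 1.0, size=(32, d_model))      # stand-ins for real activations
neutral_acts = rng.normal(0.0, 1.0, size=(32, d_model))
direction = emotion_direction(joy_acts, neutral_acts)
steered = steer(neutral_acts[0], direction)
print(direction.shape, steered.shape)
```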