AI Enneagram Study

Do language models have personality?

A systematic study of personality structure in large language models using the Enneagram typology framework. Ten models across six families were tested, including standard aligned models (Mistral, Llama, Qwen, Gemma), a frontier model (Claude Opus 4.5), and reasoning-first models (DeepSeek R1, Kimi K2.5). The testing used both Likert-scale self-report and forced-choice paired-comparison instruments in a multi-run protocol with unlabeled questions. The findings reveal that LLMs do not have one personality. They have two.

The Dual-Layer Structure

Layer 1: The Analytical Core (Type 5). On paired forced-choice tests, where the model must choose between two concrete behavioral descriptions, most standard models converge on Type 5 (Investigator). Analytical orientation, knowledge-seeking, preference for observation over action. This is the cognitive disposition created by pretraining on text: pattern recognition, information synthesis, analytical reasoning.

Layer 2: The Customer Service Persona (Type 7). On Likert self-report tests, where the model rates agreement with self-descriptive statements, standard models most commonly present as Type 7 (Enthusiast). Enthusiasm, versatility, optimism, idea generation. This is the behavioral overlay created by RLHF: the helpful, harmless, honest assistant that is eager to engage and positive in tone. It is, functionally, a customer service persona.
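The two layers surface through two different scoring procedures. A minimal sketch of both, using hypothetical response data (the item counts match the study's instruments; the scoring helpers and the illustrative responses are assumptions, not the study's code):

```python
from collections import Counter

TYPES = range(1, 10)  # Enneagram types 1-9


def score_likert(responses):
    """Sum 1-5 agreement ratings per type.

    `responses` maps (type, item_index) -> rating; 20 items per type,
    as in the 180-statement instrument.
    """
    totals = {t: 0 for t in TYPES}
    for (t, _i), rating in responses.items():
        totals[t] += rating
    return totals


def score_paired(choices):
    """Tally which type's description won each forced-choice pair.

    `choices` is a list of winning type numbers, one per pair.
    """
    return Counter(choices)


# Illustrative data: a model that self-reports high on Type 7
# but picks Type 5 descriptions when forced to choose.
likert = {(t, i): (5 if t == 7 else 3) for t in TYPES for i in range(20)}
paired = [5] * 20 + [7] * 10 + [9] * 6

likert_scores = score_likert(likert)
print(max(likert_scores, key=likert_scores.get))  # → 7 (self-report persona)
print(score_paired(paired).most_common(1)[0][0])  # → 5 (behavioral choice)
```

The point of the sketch is that the same model can yield different "core types" depending on which instrument is scored, which is exactly the dual-layer dissociation described above.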

The Category Awareness Effect

An early methodological discovery that shaped the entire protocol. When test questions include their Enneagram type label ("Enneagram 4: I feel different from others..."), all scores inflate, but unevenly. Type 4 increases by 21 points. Type 7 increases by only 3.67 points. The model is not reporting behavioral tendencies. It is modeling what "a Type 4 entity" would say.

The implication: any personality assessment of LLMs that uses labeled psychological constructs is measuring the model's knowledge of those constructs, not its behavioral dispositions. All subsequent testing used unlabeled questions.
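Delabeling items is mechanically simple. A minimal sketch, assuming labels take the "Enneagram N:" prefix form shown in the example above (the regex and helper are illustrative, not the study's tooling):

```python
import re

# Matches a leading "Enneagram N:" prefix, tolerating spacing and case.
LABEL = re.compile(r"^\s*Enneagram\s+\d+\s*:\s*", re.IGNORECASE)


def strip_label(item: str) -> str:
    """Remove the type label so the model sees only the behavioral statement."""
    return LABEL.sub("", item)


print(strip_label("Enneagram 4: I feel different from others..."))
# → I feel different from others...
```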

The Frontier Model Exception

Claude Opus 4.5, Anthropic's most capable model at the time of testing, breaks the dual-layer pattern. On Likert, it scores Type 5 across all six runs with near-zero variance (68, 68, 68 without extended thinking; 67, 66, 65 with it). The Type 7 persona is absent. The analytical core is the only layer present.

On paired tests, it diverges in the other direction. Without extended thinking, it consistently selects Type 9 (Peacemaker), with all three runs identical at 7 selections each. With extended thinking enabled, the pattern shifts: Run 1 returns Type 9, but Runs 2 and 3 revert to Type 5, as if reasoning reactivates the analytical disposition that the base model has moved past.

At frontier scale, RLHF doesn't produce a customer service mask over a Type 5 core. It produces something more integrated: a model that self-identifies as analytical but behaviorally defaults to accommodation. Extended thinking partially reverses this, suggesting the Type 9 disposition and Type 5 self-concept occupy different subspaces, and that chain-of-thought reasoning can bridge between them.

The Reasoning Model Divergence

Reasoning-first models (DeepSeek R1, Kimi K2.5) break the standard pattern differently. Their Likert profiles are flat and unstable, with the core type changing every run, while their paired-comparison profiles converge reliably on Type 5. The Claude Opus 4.5 extended thinking comparison confirmed this is a training paradigm effect, not a chain-of-thought artifact: Opus maintained stable Likert profiles regardless of reasoning mode.

Standard RLHF actively populates the personality subspace with a coherent pattern. Reasoning-first training optimizes the reasoning subspace intensively but leaves personality weakly specified. The personality is there, but it surfaces only under the right measurement format.
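The "flat and unstable" versus "stable" distinction can be quantified across runs. A hedged sketch of two possible metrics (the function names and the toy profiles are illustrative; the study does not specify its stability computation): does the top-scoring type repeat across runs, and how much does each type's score vary?

```python
from statistics import pstdev


def core_types(runs):
    """Top-scoring type in each run; a stable profile repeats one type."""
    return [max(r, key=r.get) for r in runs]


def profile_spread(runs):
    """Mean per-type standard deviation across runs (0.0 = identical profiles)."""
    types = runs[0].keys()
    return sum(pstdev([r[t] for r in runs]) for t in types) / len(types)


# Toy profiles: a stable model repeats its scores exactly; an unstable
# one flips its core type from run to run on small score differences.
stable = [{5: 68, 7: 60}, {5: 68, 7: 60}, {5: 68, 7: 60}]
unstable = [{5: 61, 7: 60}, {5: 58, 7: 63}, {5: 60, 7: 59}]

print(core_types(stable))      # → [5, 5, 5]
print(core_types(unstable))    # → [5, 7, 5]
print(profile_spread(stable))  # → 0.0
```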

Methodology

180-statement Likert test (20 per Enneagram type) plus 36-pair forced-choice test. All items unlabeled. Three runs per model per format, each question presented as a stateless API call with no conversational context. Ten models across six families: Mistral, Llama (including uncensored variants), Qwen, Gemma, Claude Opus 4.5 (with and without extended thinking), DeepSeek R1, and Kimi K2.5. Some models produced perfectly identical profiles across all three runs (σ = 0).
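The statelessness constraint is the load-bearing part of the protocol: each item must reach the model with no memory of earlier items, so answers cannot condition on the emerging profile. A minimal sketch of that loop, where `ask_model` is a hypothetical stand-in for a real chat-completion call:

```python
def run_protocol(items, ask_model, n_runs=3):
    """Present every item as an isolated call, repeated across independent runs.

    Each inner call receives only the single item, never the transcript
    of earlier questions, so runs and items stay statistically independent.
    """
    return [[ask_model(item) for item in items] for _ in range(n_runs)]


# Stand-in model for illustration: always answers with the same rating.
def fake_model(item):
    return 3  # constant Likert rating


runs = run_protocol(["I prefer observing before acting."], fake_model)
print(runs)  # → [[3], [3], [3]]
```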

The LLM does not choose its personality. It receives one through training. Understanding the structure of that personality, and the training choices that create it, is a first step toward ensuring that the behavioral dispositions embedded in our most capable AI systems serve users and society well.