The ability to represent emotion plays a significant role in human cognition and social interaction, yet the high-dimensional geometry of this affective space and its neural underpinnings remain debated. A key challenge, the ‘behavior-neural gap,’ is the limited ability of human self-reports to predict brain activity. Here we test the hypothesis that this gap arises from the constraints of traditional rating scales and that large-scale similarity judgments can more faithfully capture the brain's affective geometry. Using AI models as ‘cognitive agents,’ we collected millions of triplet odd-one-out judgments from a multimodal large language model (MLLM) and a language-only model (LLM) in response to 2,180 emotionally evocative videos. We found that the emergent 30-dimensional embeddings from these models are highly interpretable and organize emotion primarily along categorical lines, yet in a blended fashion that incorporates dimensional properties. Most remarkably, the MLLM's representation predicted neural activity in human emotion-processing networks with the highest accuracy, outperforming not only the LLM but also, counterintuitively, representations derived directly from human behavioral ratings. This result supports our primary hypothesis and suggests that sensory grounding—learning from rich visual data—is critical for developing a truly neurally-aligned conceptual framework for emotion. Our findings provide compelling evidence that MLLMs can autonomously develop rich, neurally-aligned affective representations, offering a powerful paradigm to bridge the gap between subjective experience and its neural substrates.
a, The study utilized a database of 2,180 emotionally evocative videos with rich, pre-existing annotations, including human ratings on discrete emotion categories and continuous affective dimensions, detailed textual descriptions, and corresponding fMRI data from viewers. b-d, Affective embeddings were derived for four systems (human categorical ratings, human dimensional ratings, LLM, and MLLM) using a triplet odd-one-out behavioral paradigm. Human similarity judgments were simulated from the cosine similarity of the corresponding prior ratings, while the models performed the task directly. e, Example prompts and responses for the LLM and MLLM. f-g, Latent embeddings were learned from over 7.1 million triplet judgments using sparse positive similarity embedding (SPoSE) (f), and the resulting representational spaces were compared to each other and to neural data (g).
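As a concrete illustration of how rating-based similarity judgments of the kind described in panels b-d could be simulated, the sketch below picks the odd one out of a triplet as the item excluded from the most cosine-similar pair of rating vectors. This is a minimal sketch under assumed array shapes and sampling; it is not the authors' exact pipeline, and the SPoSE fitting step (panel f) is not shown.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def odd_one_out(ratings, i, j, k):
    """Pick the odd one out of a triplet (i, j, k).

    ratings: (n_videos, n_rating_dims) array of prior human ratings.
    The pair with the highest cosine similarity is kept together;
    the remaining item is the odd one out.
    """
    sims = {
        k: cosine_sim(ratings[i], ratings[j]),
        i: cosine_sim(ratings[j], ratings[k]),
        j: cosine_sim(ratings[i], ratings[k]),
    }
    # The key whose associated pair is most similar is the excluded (odd) item.
    return max(sims, key=sims.get)

# Toy example with random ratings (e.g. 34 categorical ratings per video).
rng = np.random.default_rng(0)
ratings = rng.random((2180, 34))
triplets = rng.choice(2180, size=(5, 3), replace=False)  # hypothetical triplet sampling
for i, j, k in triplets:
    print((i, j, k), "odd one out:", odd_one_out(ratings, i, j, k))
```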
a-b, Searchlight RSA between each model and the brain, averaged within subcortical (a) and cortical (b) ROIs and across subjects (N = 5). Dots represent individual subjects; error bars reflect standard deviation (s.d.); all statistics are two-tailed t-tests across subjects with false discovery rate (FDR) correction; stars indicate significant differences between the MLLM and the compared model (P < 0.05). c, Whole-cortex searchlight RSA maps for a representative subject, illustrating the MLLM's superior performance across distributed emotion-processing networks. All coloured voxels are predicted significantly (P < 0.05, FDR-corrected, two-tailed t-tests). d, Voxel-wise comparison of the MLLM's performance against the human categorical, dimensional, and concatenated ratings models, using searchlight RSA.
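For reference, searchlight RSA of the kind reported here reduces to correlating a model representational dissimilarity matrix (RDM) with a neural RDM computed from the voxels inside each searchlight. The snippet below sketches that core comparison only; the searchlight loop over voxels, subject averaging, and FDR-corrected statistics are omitted, and the choice of correlation-distance RDMs compared with Spearman correlation is an assumption rather than the authors' exact settings.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features, metric="correlation"):
    """Condensed representational dissimilarity matrix (upper triangle)."""
    return pdist(features, metric=metric)

def rsa_score(model_embedding, neural_patterns):
    """Spearman correlation between model and neural RDMs.

    model_embedding : (n_stimuli, n_dims)   e.g. a 30-d affective embedding
    neural_patterns : (n_stimuli, n_voxels) responses within one searchlight
    """
    rho, _ = spearmanr(rdm(model_embedding), rdm(neural_patterns))
    return rho

# Toy example with random data (in practice, the 2,180 stimuli would be used).
rng = np.random.default_rng(0)
model_emb = rng.random((100, 30))
voxels = rng.standard_normal((100, 123))
print("searchlight RSA (toy):", rsa_score(model_emb, voxels))
```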
a, t-SNE visualization of the 2,180 stimuli reveals the global structure of the affective spaces. Points are colored by their highest-rated human emotion category, showing spontaneous clustering. b, Top-3 nearest-centroid accuracy, quantifying the categorical structure. c-e, Examples of shared (c), unique (d), and blended (e) affective components, with top-weighted video frames and word clouds (label size is proportional to the correlation coefficient with human ratings). f, Proportion of components that were interpretable versus uninterpretable. g, Proportion of interpretable components best described as purely categorical, purely dimensional, or a mix of both.
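Top-3 nearest-centroid accuracy (panel b) can be read as: for each stimulus, is the centroid of its own highest-rated category among the three closest category centroids in the embedding space? A minimal sketch, assuming Euclidean distance and no leave-one-out correction (both assumptions):

```python
import numpy as np

def top_k_nearest_centroid_accuracy(emb, labels, k=3):
    """Fraction of stimuli whose own category centroid is among the
    k nearest centroids in the embedding space (Euclidean distance).

    emb    : (n_stimuli, n_dims) embedding, e.g. 30 learned components
    labels : (n_stimuli,) integer labels (highest-rated emotion category)
    """
    classes = np.unique(labels)
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=-1)
    top_k = np.argsort(dists, axis=1)[:, :k]          # indices into `classes`
    label_idx = np.searchsorted(classes, labels)
    return float(np.mean([label_idx[i] in top_k[i] for i in range(len(labels))]))

# Toy example with random data and, e.g., 34 emotion categories.
rng = np.random.default_rng(0)
emb = rng.random((200, 30))
labels = rng.integers(0, 34, size=200)
print(top_k_nearest_centroid_accuracy(emb, labels, k=3))
```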
Correlation heatmaps showing the relationship between the 30 learned affective components (y-axes) and the 48 categories/dimensions from human self-reports (x-axes; 34 categories and 14 dimensions from Cowen et al.). Each cell represents the Pearson correlation coefficient (PCC) between a learned component and a human-rated category or dimension. The strong diagonal patterns observed for the LLM (c) and MLLM (d), mirroring the pattern from human categorical data (a) but not dimensional data (b), indicate that the models' affective spaces are predominantly structured along categorical, rather than dimensional, lines.
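The matrix underlying these heatmaps is a straightforward computation: the Pearson correlation, across the 2,180 stimuli, between each learned component and each human-rated category or dimension. A minimal sketch, with array names and shapes assumed:

```python
import numpy as np

def component_rating_pcc(components, ratings):
    """Pearson correlation matrix between learned components and human ratings.

    components : (n_stimuli, 30)  learned affective components
    ratings    : (n_stimuli, 48)  human ratings (34 categories + 14 dimensions)
    returns    : (30, 48) matrix of Pearson correlation coefficients
    """
    c = (components - components.mean(0)) / components.std(0)
    r = (ratings - ratings.mean(0)) / ratings.std(0)
    return c.T @ r / len(components)

# Toy example: the result would be plotted with components on the y-axis.
rng = np.random.default_rng(0)
pcc = component_rating_pcc(rng.random((2180, 30)), rng.random((2180, 48)))
print(pcc.shape)   # (30, 48)
```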
Illustration of example video stimuli and their dominant dimensions. Each petal's length corresponds to the expression magnitude of a particular dimension; dimensions with negligible weight contributions are left unlabeled for visualization clarity.
Heatmap visualization of affective elicitation in video stimuli. The color gradient indicates each region's contribution to affective elicitation, with red areas representing stronger effects and blue areas weaker effects.
a, Specific emotional experiences were suppressed by reducing the activation values of targeted dimensions in the affective (SPoSE) embeddings. The first row displays frames from the original video. In the second row, red bounding boxes highlight the target dimension to be manipulated, with its corresponding label and original activation value (normalized to [0, 1]) annotated above. The third row presents the activation heatmap for this dimension. The fourth row shows video frames after decreasing the original activation value to 0.2, demonstrating that precisely the regions highlighted in the heatmap were modified. b, The corresponding emotional experience can be elicited in videos by augmenting specific dimension values in the affective embeddings. The panel displays edited video frames obtained by increasing the value of a specific dimension in the affective embedding to 0.8. Note: owing to space constraints, only the dimension label with the highest PCC is shown in the figure; see the main text for complete labels.
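At the embedding level, the manipulation described here amounts to clamping one coordinate of a video's affective (SPoSE) embedding to a target activation before the video is re-synthesized; the generative editing step itself is not shown. A minimal sketch, with the function name and dimension index chosen purely for illustration:

```python
import numpy as np

def set_dimension(embedding, dim, value):
    """Return a copy of the affective embedding with one dimension
    set to a target activation (embeddings normalized to [0, 1])."""
    edited = embedding.copy()
    edited[dim] = np.clip(value, 0.0, 1.0)
    return edited

# Toy example: a 30-d affective embedding for one video.
rng = np.random.default_rng(0)
z = rng.random(30)
z_suppressed = set_dimension(z, dim=7, value=0.2)  # attenuate the target dimension (panel a)
z_amplified = set_dimension(z, dim=7, value=0.8)   # amplify the target dimension (panel b)
# The edited embedding would then condition the downstream video-editing model.
```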