Understanding whether artificial intelligence (AI) systems represent abstract concepts in a human-like manner is pivotal for developing trustworthy AI. While recent work has aligned model representations with human concrete visual object concepts, it remains unclear whether such alignment extends to the subjective and context-dependent domain of emotion. Here, we investigate the emergent affective geometry in large language models (LLMs) and multimodal LLMs (MLLMs) through a large-scale, unsupervised ``machine-behavioral'' paradigm. By deriving 30-dimensional embeddings from over 12 million triplet odd-one-out judgments on 2,180 emotionally evocative videos, we reveal a sophisticated ``hybrid'' geometry. This structure synthesizes categorical clusters with continuous dimensions, showing strong selective correlations with human ratings across 34 emotion categories and 14 affective dimensions, effectively reconciling the long-standing category-versus-dimension debate in affective science. To demonstrate the operational utility of these representations, we introduce a generative editing framework, showing that manipulating specific affective components actively steers generated video content in a predictable, human-interpretable manner. Crucially, at the neural level, the MLLM-derived affective space predicts human fMRI activity in high-level social-emotional regions (e.g., temporoparietal junction) with accuracy matching or exceeding traditional human self-report ratings. These findings demonstrate that MLLMs converge on a biologically plausible, brain-aligned representational scheme for abstract emotion, distinguishing them from models of pure visual perception and establishing a framework for artificial social intelligence.
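For concreteness, the following minimal Python sketch illustrates how a single triplet odd-one-out judgment might be elicited from a text-only LLM using video descriptions. The prompt wording, parsing rule, and all names here are illustrative assumptions, not the exact protocol used in the study.

```python
# Hypothetical sketch of one triplet odd-one-out query to a text-only LLM.
# The actual prompts, model interface, and parsing rules in the study may differ.

def build_triplet_prompt(desc_a: str, desc_b: str, desc_c: str) -> str:
    """Compose an odd-one-out question from three video descriptions."""
    return (
        "Below are descriptions of three emotionally evocative videos.\n"
        f"A: {desc_a}\nB: {desc_b}\nC: {desc_c}\n"
        "Which video is the odd one out in terms of the emotion it evokes? "
        "Answer with a single letter: A, B, or C."
    )

def parse_choice(response: str):
    """Return the chosen letter, or None if the reply is malformed."""
    for token in response.strip().upper():
        if token in "ABC":
            return token
    return None

# Example with placeholder descriptions (the model call itself is omitted):
print(build_triplet_prompt(
    "A child reunites with a lost dog.",
    "A soldier says goodbye to his family.",
    "A hiker narrowly escapes a rockslide.",
))
```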
a, The study utilized a database of 2,180 emotionally evocative videos with rich, pre-existing annotations, including human ratings of discrete emotion categories and continuous affective dimensions, detailed textual descriptions, and corresponding fMRI data from human viewers. b-d, The MLLM (b) and LLM (c) performed a triplet odd-one-out behavioral paradigm yielding millions of triplet judgments, from which latent video embeddings were learned using SPoSE (d). e, Example prompts and responses for the LLM and MLLM. f, Behavioral consistency between humans and various AI models (across architectures, scales, and modalities) was tested on a newly collected dataset of 30,000 triplet judgments from human participants (n=100). g, Model-derived representations were compared against traditional human rating models in predicting brain activity.
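To make the embedding-learning step concrete, here is a minimal PyTorch sketch of a SPoSE-style objective: non-negative, sparse embeddings trained so that a softmax over pairwise dot-product similarities reproduces the observed odd-one-out choices. The hyperparameters, the non-negativity handling, and all variable names are illustrative; the study's actual SPoSE implementation may differ.

```python
# Minimal sketch of a SPoSE-style objective, assuming triplets are stored as
# (i, j, k) index tuples in which k is the chosen odd one out (so the pair
# (i, j) is treated as most similar). Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

n_items, n_dims, l1_weight = 2180, 30, 0.008
emb = torch.rand(n_items, n_dims, requires_grad=True)   # non-negative initialization
optimizer = torch.optim.Adam([emb], lr=1e-3)

def spose_loss(emb, triplets):
    x = torch.relu(emb)                        # rough proxy for SPoSE's non-negativity constraint
    i, j, k = triplets.T
    s_ij = (x[i] * x[j]).sum(dim=1)            # dot-product similarity of the chosen pair
    s_ik = (x[i] * x[k]).sum(dim=1)
    s_jk = (x[j] * x[k]).sum(dim=1)
    logits = torch.stack([s_ij, s_ik, s_jk], dim=1)
    target = torch.zeros(len(triplets), dtype=torch.long)   # class 0 = pair (i, j)
    return F.cross_entropy(logits, target) + l1_weight * x.abs().mean()  # sparsity penalty

triplets = torch.randint(0, n_items, (256, 3))   # placeholder batch of judgments
loss = spose_loss(emb, triplets)
loss.backward()
optimizer.step()
```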
a, Effect of dimensionality on the ability of embeddings to predict held-out similarity judgments, showing performance saturation at approximately 30 components. b-c, Reproducibility of the 30 learned components across ten independent model runs. Each point is the maximum correlation of a component from one run with the components from the other runs. Shaded areas represent 95% confidence intervals (CIs). d-e, Comparison of representational similarity matrices (RSMs) for 66 validation stimuli. Left: RSMs predicted by the learned embeddings. Middle: RSMs measured from empirical behavioral choices (held out from SPoSE training). Right: Pearson's correlation between predicted and measured RSMs, indicating high global correspondence. f, Prediction accuracy of the 30-dimensional embeddings on held-out triplets. Noise ceilings indicate the upper bound of explainable variance estimated from trial-to-trial reliability (across 10,000 trials). Error bars represent 95% CIs estimated from 1,000 bootstrap iterations. g-h, Dimensionality reduction analysis showing the minimum number of principal components required to retain 95%–99% of the predictive accuracy on the behavioral task (gray shaded area). The red dashed line indicates chance-level accuracy (33.3%).
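A minimal sketch of how the held-out prediction accuracy and bootstrapped CIs in panel f could be computed, assuming `emb` is the learned embedding and `triplets` stores held-out (i, j, k) judgments with k as the chosen odd one out; the placeholder data and all names are illustrative.

```python
# Sketch of held-out triplet accuracy and a bootstrap CI, assuming `emb` is the
# learned (n_items x 30) embedding and `triplets` holds held-out (i, j, k)
# judgments with k as the chosen odd one out. Placeholder data is random.
import numpy as np

def triplet_accuracy(emb, triplets):
    i, j, k = triplets.T
    s_ij = np.einsum("nd,nd->n", emb[i], emb[j])    # similarity of the chosen pair
    s_ik = np.einsum("nd,nd->n", emb[i], emb[k])
    s_jk = np.einsum("nd,nd->n", emb[j], emb[k])
    return np.mean((s_ij > s_ik) & (s_ij > s_jk))   # correct if (i, j) is the most similar pair

def bootstrap_ci(emb, triplets, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    accs = [triplet_accuracy(emb, triplets[rng.integers(0, len(triplets), len(triplets))])
            for _ in range(n_boot)]
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

emb = np.abs(np.random.randn(2180, 30))              # placeholder embedding
triplets = np.random.randint(0, 2180, (30000, 3))    # placeholder held-out judgments
print(triplet_accuracy(emb, triplets), bootstrap_ci(emb, triplets))
```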
a, Global structure of the 30-dimensional affective space visualized using t-SNE, with points colored by human emotion categories. b, Top-3 nearest-centroid classification accuracy, quantifying the categorical structure. c-d, Correlations (Pearson's r) between model-derived components and human-rated emotion categories or affective dimensions. e, Proportion of components best described as categorical, dimensional, or hybrid. f, Percentage of components with significant correlations exceeding each r threshold (P < 0.05, FDR corrected).
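For panel b, a minimal sketch of a top-3 nearest-centroid classifier over the learned space, assuming one integer category label per video; it omits refinements such as leave-one-out centroids, and the placeholder data and names are illustrative.

```python
# Sketch of a top-3 nearest-centroid classifier over the affective space,
# assuming `emb` (n_videos x 30) and integer `labels` giving each video's
# dominant human emotion category. Leave-one-out centroids are omitted for brevity.
import numpy as np

def top3_nearest_centroid_accuracy(emb, labels):
    classes = np.unique(labels)
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)  # video-to-centroid distances
    top3 = classes[np.argsort(dists, axis=1)[:, :3]]    # three nearest category centroids per video
    return np.mean([labels[v] in top3[v] for v in range(len(labels))])

emb = np.random.rand(2180, 30)              # placeholder embedding
labels = np.random.randint(0, 34, 2180)     # placeholder labels for 34 emotion categories
print(top3_nearest_centroid_accuracy(emb, labels))
```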
Top-weighted video frames and word clouds (label size proportional to correlation with human ratings) for representative components. a, Components primarily encoding discrete emotion categories. b, Components primarily encoding affective dimensions (LLM-specific). c, Components exhibiting a hybrid coding scheme, bridging categories and dimensions. Left panels: LLM; right panels: MLLM.
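As a rough sketch of how such word clouds could be produced, the snippet below sizes each label by the magnitude of its Pearson correlation with one component, using the third-party `wordcloud` package; the ratings matrix, label names, and sizing rule are illustrative assumptions.

```python
# Rough sketch of a component word cloud: label size proportional to the
# absolute Pearson correlation between the component and each human rating scale.
# Requires the third-party `wordcloud` package; all names are illustrative.
import numpy as np
from wordcloud import WordCloud

def component_wordcloud(component, ratings, label_names):
    corrs = {name: abs(np.corrcoef(component, ratings[:, c])[0, 1])
             for c, name in enumerate(label_names)}
    return WordCloud(width=600, height=400, background_color="white") \
        .generate_from_frequencies(corrs)

component = np.random.rand(2180)             # placeholder component weights
ratings = np.random.rand(2180, 5)            # placeholder human ratings
cloud = component_wordcloud(component, ratings, ["amusement", "fear", "awe", "anger", "joy"])
cloud.to_file("component_wordcloud.png")
```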
a, Affective decomposition of example videos: each colored petal represents a specific affective component, with its length proportional to the component's weight. b, Grad-CAM heatmaps localize the visual regions driving specific component activations (red indicates high contribution). c, Targeted reduction of emotional content by decreasing the activation of a specific affective component (component marked with a red box; activations normalized to [0, 1]). d, Generative elicitation of emotional content by increasing specific component activations.
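Panel b relies on Grad-CAM; the sketch below shows the generic recipe (gradient-weighted feature maps, ReLU, upsampling) applied to one component's activation. The `model`, the choice of target layer, and the stand-in backbone are illustrative assumptions, not the study's actual architecture.

```python
# Generic Grad-CAM sketch for panel b: gradient-weighted feature maps, ReLU,
# and upsampling to the frame size. The model, target layer, and stand-in
# backbone below are illustrative assumptions, not the study's architecture.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

def grad_cam(model, image, component_idx, target_layer):
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.setdefault("a", o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.setdefault("g", go[0]))
    activations = model(image.unsqueeze(0))       # (1, 30) affective component activations
    activations[0, component_idx].backward()      # gradient of one component w.r.t. feature maps
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)         # channel-wise importance
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()        # heatmap in [0, 1]

# Stand-in usage: an untrained VGG19 backbone with a random 30-component head.
backbone = vgg19(weights=None)
model = torch.nn.Sequential(backbone, torch.nn.Linear(1000, 30))
heatmap = grad_cam(model, torch.rand(3, 224, 224), component_idx=5,
                   target_layer=backbone.features[-1])
```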
a-d, Affective graphs (left) and community structure (right) for each of the four spaces. In the graphs, nodes are affective components, edge width is proportional to the Pearson correlation between them, and only edges with r > 0.2 are shown. In the community plots, components are grouped into clusters using the Louvain algorithm, with node size reflecting PageRank centrality. Isolated nodes are excluded. e, Overlap of affective clusters across all four spaces, highlighting both shared and unique high-level structures. f, Human–model behavioral consistency across architectures. Scores are noise-normalized relative to the human noise ceiling. Error bars represent 95% CIs (n=1,000 bootstrap iterations).
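A minimal sketch of the graph construction, Louvain clustering, and PageRank centrality described here, using networkx (>= 3.0 for `louvain_communities`); the correlated placeholder embedding and all variable names are illustrative.

```python
# Sketch of the affective graph analysis: nodes are components, edges connect
# pairs with Pearson r > 0.2, communities come from Louvain, and node size maps
# to PageRank centrality. Placeholder data and names are illustrative.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
latent = rng.random((2180, 5))
emb = latent @ rng.random((5, 30)) + 0.5 * rng.random((2180, 30))   # correlated placeholder embedding
corr = np.corrcoef(emb.T)                                           # 30 x 30 component correlations

G = nx.Graph()
G.add_nodes_from(range(corr.shape[0]))
for a in range(corr.shape[0]):
    for b in range(a + 1, corr.shape[0]):
        if corr[a, b] > 0.2:
            G.add_edge(a, b, weight=corr[a, b])     # edge width would be proportional to r

G.remove_nodes_from(list(nx.isolates(G)))           # isolated nodes are excluded
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
centrality = nx.pagerank(G, weight="weight")        # node size would reflect PageRank centrality
print(len(communities), sorted(centrality, key=centrality.get, reverse=True)[:5])
```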
a, b, Searchlight RSA scores between different affective representations and human brain activity, averaged within subcortical (a) and cortical (b) ROIs and across subjects (n=5). Dots represent individual subjects; error bars reflect the standard deviation (s.d.) across subjects; asterisks indicate significant differences between the MLLM and competing models (*, P < 0.05; **, P < 0.01; ***, P < 0.001; n.s., not significant); all statistics are two-tailed paired t-tests with FDR correction for multiple comparisons across ROIs. c, Whole-cortex searchlight RSA maps for a representative subject. All colored voxels are significantly predicted (P < 0.05, FDR corrected, two-tailed t-tests). d, Voxel-wise comparison of model performance using searchlight RSA.
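For orientation, a minimal ROI-level sketch of the RSA comparison and statistics: Spearman correlation between the upper triangles of model and brain RDMs, a two-tailed paired t-test across subjects, and FDR correction. The placeholder RDMs and names are illustrative, and the actual searchlight procedure operates on local voxel neighborhoods rather than whole ROIs.

```python
# ROI-level sketch of the RSA comparison and statistics. Placeholder RDMs and
# all names are illustrative; the real analysis uses searchlight neighborhoods.
import numpy as np
from scipy.stats import spearmanr, ttest_rel
from statsmodels.stats.multitest import multipletests

def rsa_score(model_rdm, brain_rdm):
    iu = np.triu_indices_from(model_rdm, k=1)       # off-diagonal upper triangle only
    return spearmanr(model_rdm[iu], brain_rdm[iu])[0]

rng = np.random.default_rng(0)
n_stim = 66
brain_rdms = [rng.random((n_stim, n_stim)) for _ in range(5)]   # one placeholder RDM per subject
mllm_rdm = rng.random((n_stim, n_stim))                         # placeholder MLLM affective RDM
rating_rdm = rng.random((n_stim, n_stim))                       # placeholder human-rating RDM

mllm_scores = np.array([rsa_score(mllm_rdm, b) for b in brain_rdms])
rating_scores = np.array([rsa_score(rating_rdm, b) for b in brain_rdms])

# Paired two-tailed t-test across subjects; FDR correction would span all ROIs.
pvals = [ttest_rel(mllm_scores, rating_scores).pvalue]
reject, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(mllm_scores.mean(), rating_scores.mean(), p_fdr)
```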
This figure shows the brain activation patterns associated with three distinct affective components from the MLLM, derived from the weights of the voxel-wise encoding models. Each row displays the results for a single subject (S1–S5), and each column corresponds to a specific component (e.g., nostalgia, sexual desire, anger). Red indicates positive weights, reflecting brain regions whose activity is positively associated with the affective component; blue indicates negative weights, reflecting regions with a negative association. For visualization, the original weights were normalized to the range [-1, 1], and voxels with absolute values below 0.01 were masked out. The consistency of the spatial patterns across all five subjects suggests that these learned affective components capture neurally meaningful and stable aspects of emotion processing.
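A minimal sketch of the voxel-wise encoding step behind these weight maps, assuming ridge regression from the 30 affective components to voxel responses; the regularization strength, placeholder data, and all names are illustrative, and the study's actual encoding pipeline (e.g., cross-validated regularization) may differ.

```python
# Sketch of a voxel-wise encoding model: ridge regression from the 30 affective
# components to voxel responses, then per-component weight maps normalized and
# masked as in the figure. Placeholder data and names are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
features = rng.random((2180, 30))               # placeholder MLLM affective embedding
bold = rng.random((2180, 5000))                 # placeholder voxel responses for one subject

enc = Ridge(alpha=100.0).fit(features, bold)    # one linear model per voxel
weights = enc.coef_.T                           # (30, n_voxels): one weight map per component

# Normalize one component's map to [-1, 1] and mask near-zero voxels.
comp = weights[5] / (np.abs(weights[5]).max() + 1e-8)
comp[np.abs(comp) < 0.01] = np.nan              # masked (not shown)
print(np.nanmin(comp), np.nanmax(comp))
```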
a, b, RSA scores (Spearman's rank correlation) averaged within subcortical (a) and cortical (b) ROIs across subjects (n=5). The MLLM-derived affective representation is compared against Visual Object Features (VGG19), Semantic Features (73 concepts), and Motion Energy Features. Error bars reflect the standard deviation (s.d.). Asterisks indicate significant differences between the MLLM and the comparison models (*, P < 0.05; **, P < 0.01; ***, P < 0.001; n.s., not significant; two-tailed paired t-tests, FDR corrected). c, Voxel-wise difference maps for a representative subject (S1). Regions where the MLLM-derived SPoSE embedding yields higher RSA scores than the visual object and semantic features are shown in yellow-red. The comparison with motion energy features is omitted from the maps because the MLLM outperforms them across nearly all voxels. These maps highlight the MLLM's specific advantage over pure visual features in higher-order cortical networks, and its broad advantage over semantic features, particularly in subcortical regions.
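As an illustration of how a visual-object control RDM might be derived for this comparison, the sketch below passes one representative frame per validation video through a VGG19 backbone and computes a correlation-distance RDM; the layer choice, use of untrained weights (pretrained ImageNet weights would be used in practice), and all names are assumptions rather than the study's exact feature pipeline.

```python
# Illustrative construction of a visual-object control RDM from VGG19 features.
# Untrained weights keep the sketch self-contained; pretrained ImageNet weights
# would be used in practice. Placeholder frames and names are illustrative.
import numpy as np
import torch
from torchvision.models import vgg19

backbone = vgg19(weights=None).eval()
feature_extractor = torch.nn.Sequential(backbone.features, backbone.avgpool, torch.nn.Flatten())

frames = torch.rand(66, 3, 224, 224)             # placeholder: one frame per validation video
with torch.no_grad():
    feats = feature_extractor(frames).numpy()    # (66, 25088) visual object features

vgg_rdm = 1.0 - np.corrcoef(feats)               # RDM as 1 minus pairwise Pearson correlation
print(vgg_rdm.shape)
```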