Recent multimodal large language models (MLLMs) are predominantly compared by reporting accuracy on a collection of benchmarks, while much less attention is paid to how stable their performance is across tasks, domains, and evaluation factors. This yields a limited, and sometimes misleading, picture of their actual perception capabilities. In this paper, we take a step toward systematically studying the perception stability of MLLMs and analyze its behavior across pre-training, post-training, and inference-time settings. Our study begins by decomposing 13 existing perception benchmarks into 7 core perception abilities and evaluating 18 MLLMs with a unified stability metric that measures the variability of normalized scores across heterogeneous sub-tasks. Our observations reveal that perception stability is a fundamental yet long-overlooked dimension of MLLM behavior. Concretely, we find that: (i) although recent open-source MLLMs can surpass strong closed-source models in accuracy, they still exhibit a substantial gap in perception stability; (ii) the scaling behavior of stability during training is qualitatively different from that of accuracy: improved stability is largely achieved by reducing ability conflicts, i.e., cases where optimizing one perception ability degrades others; and (iii) stability is shaped by different factors at different stages: during pre-training and post-training it is driven primarily by the data mixing ratio and LLM size, whereas at inference time it is strongly influenced by the model’s reasoning style. Finally, we show that simple strategies such as targeted fine-tuning and model merging yield partial improvements in perception stability, but are insufficient to fully eliminate ability conflicts.
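For concreteness, the sketch below illustrates one way a metric of "variability of normalized scores across heterogeneous sub-tasks" could be computed. The abstract does not spell out the exact formula, so the per-sub-task min-max normalization and the aggregation as one minus the coefficient of variation are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def stability_score(raw_scores, floors=None, ceilings=None):
    """Hypothetical stability metric: 1 minus the coefficient of variation
    of normalized sub-task scores. Normalization bounds and the aggregation
    rule are assumptions for illustration only.

    raw_scores : dict mapping sub-task name -> accuracy of one model
    floors / ceilings : optional dicts of per-sub-task normalization bounds
                        (e.g., random-chance level and best observed score)
    """
    normalized = []
    for task, score in raw_scores.items():
        lo = floors[task] if floors else 0.0
        hi = ceilings[task] if ceilings else 1.0
        normalized.append((score - lo) / (hi - lo + 1e-8))
    normalized = np.asarray(normalized)
    # Lower dispersion across heterogeneous sub-tasks -> higher stability.
    cv = normalized.std() / (normalized.mean() + 1e-8)
    return 1.0 - cv

# Example: a model that is accurate on average but uneven across abilities
# (sub-task names here are hypothetical placeholders).
scores = {"counting": 0.82, "OCR": 0.91, "spatial": 0.45, "grounding": 0.70}
print(f"stability = {stability_score(scores):.3f}")
```

Under this reading, two models with the same mean accuracy can receive very different stability scores, which is the distinction the abstract draws between accuracy and perception stability.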