Existing scaling strategies for Multimodal Large Language Models (MLLMs) face distinct limitations: parameter scaling incurs prohibitive overhead, while advanced post-training computation scaling (e.g., Chain-of-Thought, MPO/RLHF) increases latency or complexity. More importantly, all existing methods leave the computation allocation between the Vision Transformer (ViT) and the Large Language Model (LLM) components fixed after post-training, precluding task-specific optimization. To address this, we introduce Parallel Vision-Language (ParVL) scaling, a data-efficient post-training enhancement for MLLMs. The framework confronts a core challenge: given a fixed parameter budget, how can computational capacity be expanded via parameter reuse while flexibly allocating resources between the vision and language modalities? We propose a prefix-tuning technique that runs multiple parallel computational streams with extensive weight reuse, and we conduct a systematic study of the resulting allocation trade-off to identify the optimal distribution of computation between the ViT encoder and the LLM decoder. Applied to a supervised fine-tuned (SFT) model, ParVL requires only a small amount of additional data to significantly outperform its SFT counterpart.
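The abstract does not specify the exact mechanism, but the core idea of parallel streams via prefix tuning with weight reuse can be sketched as follows. This is a minimal illustrative toy in NumPy, not the paper's implementation: all names (`W_shared`, `run_stream`, the mean-merge of streams, and the dimensions) are assumptions for the sake of illustration, and a single shared attention stands in for the frozen ViT/LLM weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, prefix_len, n_streams = 16, 8, 4, 3

# Shared (reused) projection, standing in for frozen MLLM weights.
W_shared = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Per-stream learned prefixes: the only stream-specific parameters,
# so extra streams add compute but almost no parameters.
prefixes = [rng.standard_normal((prefix_len, d_model)) * 0.5
            for _ in range(n_streams)]

def shared_attention(h):
    """Single-head self-attention using only the shared weights."""
    v = h @ W_shared
    scores = h @ h.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def run_stream(x, prefix):
    """Prepend this stream's prefix, attend with shared weights, drop the prefix."""
    h = shared_attention(np.concatenate([prefix, x], axis=0))
    return h[prefix_len:]  # keep only the original token positions

x = rng.standard_normal((seq_len, d_model))
stream_outputs = [run_stream(x, p) for p in prefixes]
y = np.mean(stream_outputs, axis=0)  # merge streams (mean as a placeholder)
print(y.shape)  # (8, 16)
```

Because the prefixes differ per stream while the attention weights are shared, each stream produces a distinct output at negligible parameter cost; allocating more such streams to the ViT side versus the LLM side is the trade-off the abstract says the paper studies.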