Meta FAIR published “Beyond Language Modeling”, a large-scale study on how to build native multimodal foundation models from scratch, without starting from a language-pretrained backbone. The team used the Transfusion framework (next-token prediction for language, diffusion for vision) and trained on text, video, image-text pairs, and action-conditioned video simultaneously. Four findings stand out.
- First, a Representation Autoencoder (RAE) built on SigLIP 2 outperforms both VAE-based encoders and raw pixels for visual understanding and generation, removing the need for separate understanding and generation encoders.
- Second, adding vision data to language training does not hurt language performance; the two modalities are synergistic, and mixed-data training beats domain-specific training even with 5x less in-domain data.
- Third, world-modeling capabilities (like navigation) emerge from general pretraining with as little as 1% domain-specific data.
- Fourth, Mixture-of-Experts (MoE) architectures naturally learn modality specialization, with the model allocating more experts to text in early layers and more to vision in later layers, without human priors.
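The expert-specialization finding above can be probed with a simple measurement: route tokens through a token-choice top-1 gate and tally, per expert, what fraction of its traffic comes from each modality. The sketch below is a hypothetical illustration with a random (untrained) gate, not the paper's code; the names `top1_route` and `modality_share` are invented here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one MoE layer with 8 experts. Tokens carry a modality tag
# (0 = text, 1 = vision); the gate is a random linear projection standing
# in for a trained router -- an illustration only, not the paper's code.
n_experts, d_model = 8, 16

def top1_route(tokens, gate):
    """Token-choice top-1 routing: each token goes to its argmax expert."""
    logits = tokens @ gate            # (n_tokens, n_experts)
    return logits.argmax(axis=1)      # expert id per token

def modality_share(expert_ids, modality, n_experts):
    """Per-expert fraction of traffic from each modality (rows: experts)."""
    counts = np.zeros((n_experts, 2))
    for e, m in zip(expert_ids, modality):
        counts[e, m] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)

modality = rng.integers(0, 2, size=1000)
# Crude modality signal: shift vision tokens so the gate can separate them.
tokens = rng.normal(size=(1000, d_model)) + modality[:, None]

gate = rng.normal(size=(d_model, n_experts))
expert_ids = top1_route(tokens, gate)
print(modality_share(expert_ids, modality, n_experts))
```

Running the same tally at every layer of a real model is how one would observe the paper's reported pattern of text-heavy experts early and vision-heavy experts late.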
The most consequential number in the paper comes from its scaling-law analysis: at 1 trillion parameters, vision’s optimal data requirement is 51x larger than language’s. This asymmetry explains why bolting a vision encoder onto a pretrained language model yields mediocre visual reasoning. MoE halves the gap (the exponent difference drops from 0.10 to 0.05), making a single model competitive with separate unimodal models on both modalities. For teams building multimodal products, the implication is direct: plan for dramatically more visual training data, or use MoE to compensate.
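To see what an exponent gap of 0.10 versus 0.05 means in practice, consider a simplified model in which each modality's optimal data requirement scales as a power law in parameter count, D*(N) ∝ N^a. Then the vision-to-language data ratio grows as N^(a_v − a_l), so halving the exponent gap takes (the N-dependent part of) the ratio to its square root. This is an illustrative assumption, not the paper's fitted law: the prefactors are omitted here, which is why the absolute numbers below differ from the paper's 51x figure.

```python
# Hedged sketch: how a power-law exponent gap becomes a data-ratio gap.
# Assumes D*(N) = C * N**a per modality; the prefactors C are dropped,
# so only the N-dependent part of the vision/language ratio is computed.

def data_ratio(n_params: float, exponent_gap: float) -> float:
    """N-dependent part of the vision/language optimal-data ratio."""
    return n_params ** exponent_gap

N = 1e12  # 1 trillion parameters

dense_ratio = data_ratio(N, 0.10)  # dense model: exponent gap 0.10
moe_ratio = data_ratio(N, 0.05)    # MoE: gap halved to 0.05

print(f"dense: vision needs ~{dense_ratio:.1f}x more data (up to a constant)")
print(f"moe:   vision needs ~{moe_ratio:.1f}x more data (up to a constant)")

# Halving the exponent gap square-roots this part of the ratio:
# N**0.05 == (N**0.10) ** 0.5.
assert abs(moe_ratio - dense_ratio ** 0.5) < 1e-6
```

If the full 51x ratio were driven purely by the exponent gap, halving the gap would shrink it toward roughly 7x (the square root); with nonzero prefactors the exact reduction differs, but the direction is the same.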
This work provides empirical backing for a shift already underway. Qwen3.5 ships a unified early-fusion multimodal model with MoE at consumer scale. Meta’s paper now explains the theory behind why that architecture works, and where the remaining bottlenecks are.
Sources:
- Beyond Language Modeling: An Exploration of Multimodal Pretraining (arXiv)
- Paper Discussion on Hugging Face
- Transfusion: Predict the Next Token and Diffuse Images (arXiv)
- SigLIP 2: Multilingual Vision-Language Encoders (arXiv)
Citation
@misc{kabui2026,
  author = {Kabui, Charles},
  title = {Meta {FAIR} Finds That Training Vision and Language Together From Scratch Beats Bolting Them Together Later},
  date = {2026-03-08},
  url = {https://toknow.ai/posts/meta-fair-beyond-language-modeling-multimodal-pretraining/},
  langid = {en-GB}
}
