Meta FAIR Finds That Training Vision and Language Together From Scratch Beats Bolting Them Together Later

A controlled from-scratch pretraining study from Meta FAIR shows that unified multimodal models trained jointly outperform the bolt-on approach of grafting vision onto a pretrained language model, that vision needs 51x more data than language at scale, and that Mixture-of-Experts architectures help resolve this imbalance.
artificial-intelligence
Author

Kabui, Charles

Published

2026-03-08

Keywords

multimodal-pretraining, transfusion, mixture-of-experts, meta-fair, world-models