Preface
- We’re aiming to provide an introduction to multi-modal AI - specifically to Vision-Language Models (VLMs).
- This is a “deep dive” from the perspective of someone with a general interest in AI, but it’s really just an introduction to the deep and complex field of vision-language models.
- So, we’ll assume familiarity with the general field of LLMs and deep learning, but little familiarity with the specific area of multi-modal models.
- We can’t possibly provide a paper-by-paper analysis of the field - which is both deep and fast-moving - so we’ll try to cover:
- An overview of the most important architectures and trends in research, illustrated via notable models - models anyone building in this space is likely to encounter.
- We’ll touch on the key datasets and benchmarks
- Then we’ll briefly examine recent attempts at “true” multi-modality (generating multi-modal output as well as ingesting it)
- Finally, we’ll look at how the best-in-class models compare
Motivation
- There are many obvious use cases for VLMs - medical assistants, content filtering, media content indexing, managing large product catalogues, vehicle damage assessment, etc.
- Two further reasons to be interested in VLMs: one obvious, one less so.
1. VLMs allow us to explore modality alignment
- In learning how to construct VLMs, we learn a lot about how to build the truly multi-modal models of the future - ones that integrate further modalities like audio, touch, LIDAR, etc.
- Think about robotics: the range of sensory inputs that need to be integrated to cook a meal or perform a medical procedure.
2. Multi-modal understanding may be important for AGI
The arguments for
- The human brain integrates information across modalities to determine which concepts are activated - e.g. the McGurk effect, where seeing a speaker’s lip movements changes the speech sound we hear.
- Piaget argued that multi-sensory exploration is central to concept learning in infants.
- In deep learning, there are theoretical results demonstrating that latent embeddings learned from multi-modal data are of higher quality than those learned from a single modality (see here).
- Some researchers draw inspiration from Grounded Cognition Theory to argue for the importance of multi-modality in AI. (Note that current LMMs do not truly implement grounded cognition as described by the theory.)
The arguments against
- Frontier LLMs, trained largely on text, already show evidence of high-level abstraction, world models and sophisticated reasoning. There’s no obvious ceiling on their performance in sight as of today.
Vision Language Models
- A quick note on terminology, for when we’re Googling:
- Large multi-modal model (LMM) and multi-modal large language model (MM-LLM or MLLM) generally mean the same thing.
- Vision-Language Models (VLMs) generally refer to LLMs with image (and often video) understanding capabilities.