Preface
- We’re aiming to provide an introduction to multi-modal AI - specifically to Vision-Language Models (VLMs).
- This is a “deep dive” from the perspective of someone with a general interest in AI, but it’s really just an introduction to the deep and complex field of vision-language models.
- So, we’ll assume familiarity with the general field of LLMs and deep learning, but little familiarity with the specific area of multi-modal models.
- We can’t possibly provide a paper-by-paper analysis of the field - which is both deep and fast-moving - so we’ll try to cover:
- An overview of the most important architectures and trends in research, illustrated via notable models - models anyone building in this space is likely to encounter.
- We’ll touch on the key datasets and benchmarks
- Then we’ll briefly examine recent attempts at “true” multi-modality (generating multi-modal output as well as ingesting it)
- Finally, we’ll look at how the best-in-class models compare
Motivation
- There are many obvious use cases for VLMs - medical assistants, content filtering, media content indexing, managing large product catalogues, vehicle damage assessment, etc.
- Two further reasons to be interested in VLMs: one obvious, one less so.
1. VLMs allow us to explore modality alignment
- In learning how to construct VLMs, we learn a lot about how to build the truly multi-modal models of the future - ones that integrate further modalities like audio, touch, LIDAR, etc.
- Think about robotics: the range of sensory inputs that need to be integrated to cook a meal or perform a medical procedure.
2. Multi-modal understanding may be important for AGI
The arguments for
- The human brain integrates information across modalities to determine which concepts are activated - e.g. the McGurk effect, where seeing a speaker’s lip movements changes the speech sound we hear.
- Piaget argued that multi-sensory exploration is central to concept learning in infants.
- In deep learning, there are theoretical results demonstrating that latent embeddings learned from multi-modal data are of higher quality than those learned from a single modality (see here).
- Some researchers draw inspiration from Grounded Cognition Theory to argue for the importance of multi-modality in AI. (Note that current LMMs do not truly implement grounded cognition as described by the theory.)
The arguments against
- Frontier LLMs, trained largely on text, already show evidence of high-level abstraction, world models and sophisticated reasoning. There’s no obvious ceiling on their performance in sight as of today.
Vision Language Models
- A quick note on terminology, for when we’re Googling:
- Large multi-modal model (LMM) and multi-modal large language model (MM-LLM or MLLM) generally mean the same thing.
- Vision-Language Models (VLMs) generally refer to LLMs with image (and often video) understanding capabilities.