
Multimodal Large Language Models (MLLMs) transforming Computer Vision

July 1, 2024
7 min read

This article introduces Multimodal Large Language Models (MLLMs) [1], explores their applications using challenging prompts, and reviews the top models reshaping Computer Vision as we speak.

Table of Contents

  1. What is a Multimodal Large Language Model (MLLM)?
  2. Applications and use cases of MLLMs in Computer Vision
  3. Top Multimodal Large Language Models
  4. What's next

1. What is a Multimodal Large Language Model (MLLM)?

In layman's terms, a Multimodal Large Language Model (MLLM) is a model that merges the reasoning capabilities of Large Language Models (LLMs), such as GPT-3 [2] or LLaMA-3 [3], with the ability to receive, reason about, and output multimodal information.

Figure 1 illustrates a multimodal AI system in healthcare [4]. It receives two inputs: 1) a medical image and 2) a query in text: "Is pleural effusion present in this image?". The system output consists of an answer (i.e., a prediction) to the given query.

Figure 1. A multimodal medical system created by aligning a radiology vision encoder and an LLM [4]

👉 In this article, we may use the term Multimodal Model as shorthand for MLLM.

1.1 The rise of multimodality in Artificial Intelligence

Over the past few years, there has been a significant transformation in Artificial Intelligence, largely driven by the rise of Transformers [5] in Language Models [6]. It's no breaking news that the adoption of this architecture, invented by Google in 2017, has also impacted the domain of Computer Vision.

One of the earliest examples was the Vision Transformer (ViT) [7], which applies Transformers to images by splitting them into multiple patches and treating each patch as an individual visual token in the input representation.
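To make the patch-token idea concrete, here is a minimal, illustrative PyTorch sketch (not the original ViT implementation): a 224x224 RGB image is split into 16x16 patches, giving (224/16)^2 = 196 visual tokens, which are then linearly projected into an embedding space.

```python
import torch

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size = 16

# Cut the image into a 14x14 grid of 16x16 patches
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# -> (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)
# -> (1, 196, 768): 196 tokens, each a flattened 16x16x3 patch

# Linear patch embedding, as in ViT
embed = torch.nn.Linear(patch_size * patch_size * 3, 768)
visual_tokens = embed(patches)          # (1, 196, 768)
print(visual_tokens.shape)
```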

Figure 2. Some of the Multimodal Large Language Models (MLLMs) developed between 2022 and 2024

With the rise of Large Language Models (LLMs), a new type of generative model naturally emerged: Multimodal Large Language Models (MLLMs).

As shown in Figure 2, in 2023 most big tech companies developed at least one MLLM. In 2024, OpenAI's GPT-4o made headlines at its launch in May.

1.2 MLLMs vs VLMs vs Foundation Models

Some consider MLLMs to be, in effect, Foundation Models. For instance, Google's Vertex AI lists Multimodal Large Language Models such as Claude 3, PaliGemma, or Gemini 1.5 as Foundation Models.

👉 Learn more about Foundation Models in Computer Vision in this post.

On the other hand, Vision Language Models (VLMs) [8] are a specialized category of Multimodal Models that integrate text and image inputs and generate text outputs.

The main differences are that (1) MLLMs can work with more modalities than the text and images handled by VLMs, and (2) VLMs tend to lag behind in reasoning skills.

1.3 Architecture

As illustrated in Figure 3, the architecture of an MLLM is divided into three parts:

  • Modality encoder: The encoding components condense raw data formats such as images or audio into a more compact representation. Instead of training from scratch, a prevalent strategy is to reuse a pre-trained encoder (e.g., CLIP) that has already been aligned with other modalities.
  • LLM backbone: A language model is required to output responses in text; it acts as the "brain" of the MLLM. The encoder is fed images, audio, or video and produces features, which are then processed by a connector (or modality interface).
  • Modality interface (i.e., connector): This functions as an intermediary between the encoder and the LLM. Since LLMs can only interpret text, it's crucial to bridge text and the other modalities effectively (a minimal code sketch of these three pieces follows Figure 3).

Figure 3. Multimodal Understanding: the components of the first stage of multimodality
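To see how the three components fit together, here is a minimal sketch of an MLLM forward pass, assuming a simple linear projection as the connector (in the spirit of LLaVA-style designs) and a Hugging Face-style language model that accepts inputs_embeds; the class and argument names are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

class TinyMLLM(nn.Module):
    """Illustrative three-part MLLM: vision encoder + connector + LLM backbone."""

    def __init__(self, vision_encoder, llm, vision_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g., a frozen CLIP image tower
        self.connector = nn.Linear(vision_dim, llm_dim)   # modality interface (projection)
        self.llm = llm                                    # pre-trained language model

    def forward(self, pixel_values, text_embeddings):
        # 1) Encode the image into a sequence of visual features
        visual_feats = self.vision_encoder(pixel_values)   # (B, N_img, vision_dim)
        # 2) Project the visual features into the LLM's token-embedding space
        visual_tokens = self.connector(visual_feats)       # (B, N_img, llm_dim)
        # 3) Prepend the visual tokens to the text embeddings and let the LLM reason over both
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```

In practice the connector can be richer than a single linear layer (e.g., an MLP or a cross-attention module), but the flow stays the same: encode, project, then generate with the LLM.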

2. Applications and use cases of Multimodal Models in Computer Vision

Instead of providing a list of the different use cases where these models excel, we spun up a couple of GPUs to test three of the top MLLMs using challenging queries (no more cat 🐱 and dog 🐶 examples).

  • GPT-4o [9]: The most powerful multimodal model from OpenAI, released in May 2024. We accessed this model through OpenAI's API (see the example call after this list).
  • LLaVA 7b [10]: An open-source multimodal model, derived from the LLaMA model, that integrates a vision encoder and an LLM for general-purpose visual and language understanding, sometimes achieving performance on par with GPT-4. We accessed this model by launching an instance on Jarvislab.
  • Apple Ferret 7b [11]: An open-source Multimodal Large Language Model (MLLM) developed by Apple. It enables spatial understanding through referring and grounding, which allows the model to recognize and describe any shape in an image, offering precise understanding, especially of smaller image regions. To access this model, we also launched an instance on Jarvislab.
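For reference, this is roughly how an image and a prompt can be sent to GPT-4o with the OpenAI Python SDK; a minimal sketch in which the file name and prompt text are placeholders, and the API key is expected in the environment.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local image so it can be sent inline as a data URL
with open("construction_site.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Count the hard hats in this image and return their bounding boxes."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```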

2.1 Counting objects in presence of occlusion

Figure 4 shows how these three top models performed when given an image and a challenging prompt that requested them to count hard hats.

Figure 4. Apple's Ferret model was the only one that correctly identified the bounding boxes' location (including the occluded one)

Despite providing a very rich description of the scene (see Figure 4), GPT-4o yielded incorrect coordinates for the requested hard hats: some of them lie outside the image dimensions, which is why we only see one bounding box in the bottom-right corner.
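A simple sanity check like the one below (a hypothetical helper, not part of our evaluation setup) makes this failure mode easy to spot: any predicted box that does not fit within the image dimensions gets flagged.

```python
def check_boxes(boxes, img_width, img_height):
    """Split predicted (x_min, y_min, x_max, y_max) boxes into valid and out-of-image ones."""
    valid, invalid = [], []
    for box in boxes:
        x_min, y_min, x_max, y_max = box
        inside = 0 <= x_min < x_max <= img_width and 0 <= y_min < y_max <= img_height
        (valid if inside else invalid).append(box)
    return valid, invalid

# Illustrative boxes parsed from a model's answer, for a 1280x720 image
predicted = [(100, 200, 180, 300), (1500, 400, 1650, 560)]
valid, invalid = check_boxes(predicted, img_width=1280, img_height=720)
print(f"{len(invalid)} box(es) lie outside the image")  # -> 1 box(es) lie outside the image
```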

The open-source model, LLaVA, was unable to detect all four hard hats (it missed the occluded one on the left side) and provided wrong locations for the bounding boxes.

Surprisingly, Apple's Ferret was able to detect all four objects in the image, even the occluded one on the left! ⭐️

2.2 Autonomous driving: understanding and planning for risk

First, we picked this scene from an autonomous driving dataset. Second, we increased the difficulty of the prompt: it requires the models to evaluate the risks from the self-driving car's perspective while detecting two separate classes, vehicles and pedestrians (see Figure 5).

Figure 5. A challenging prompt requiring the models to detect objects and evaluate risks: Apple's Ferret model performed better than GPT-4o.

The results show that LLaVA performs quite poorly: it hallucinates and fails to identify the big truck in front of the autonomous car. Are open-source models really that bad when subjected to challenging tasks? 🤔

While GPT-4o shines at returning detailed, well-reasoned responses in text, it again performs poorly when it comes to producing accurate bounding boxes. In contrast, Apple's Ferret is the only model that detects the majority of the objects with accurate bounding box coordinates ✅.
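An easy way to compare localization quality across models is to overlay the returned boxes on the frame. Here is a minimal sketch using Pillow, where the file name and coordinates are illustrative values rather than the actual model outputs.

```python
from PIL import Image, ImageDraw

# Illustrative boxes parsed from a model's answer: (x_min, y_min, x_max, y_max)
boxes = {"truck": (520, 310, 900, 640), "pedestrian": (120, 380, 170, 520)}

image = Image.open("driving_scene.jpg").convert("RGB")
draw = ImageDraw.Draw(image)
for label, (x_min, y_min, x_max, y_max) in boxes.items():
    draw.rectangle([x_min, y_min, x_max, y_max], outline="red", width=3)
    draw.text((x_min, y_min - 12), label, fill="red")

image.save("driving_scene_boxes.jpg")  # inspect how well the boxes line up with the objects
```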

2.3 Sports analytics: detecting objects and scene understanding

So far, at least one of the models, Apple's Ferret, has shown strong performance in counting and detecting objects. Let's turn our attention to a more challenging scenario: sports analytics ⚽️.

Fine-tuned unimodal architectures, such as YOLO, tend to perform really well at detecting players in a soccer match: can MLLMs perform well too?

Figure 6. A scene from a soccer match that was tested on the three MLLMs in this article

Ex 3. Question/Prompt: As an AI system that is an expert in sports, particularly in soccer, you'll be given a scene of a soccer match. Please, (1) describe the scene, (2) count the number of players in each team, (3) provide the bounding box coordinates of the ball and of the goalkeeper, (4) estimate the likelihood of a goal and say what team is likely to score it.

As shown in Figure 7, detecting the players and the ball broke all three models we analyzed! None of them is capable of identifying the two teams and their players.

Figure 7. None of the MLLMs in this article was capable of detecting the objects requested in the prompt

So, Multimodal Large Language Models (MLLMs) are good on average, but apparently they aren't yet ready to solve computer vision tasks in more demanding use cases. Even a YOLOv8 model does better on such specific (niche) tasks.
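For comparison, running an off-the-shelf detector on the same frame takes only a few lines with the Ultralytics package; a sketch in which the image path is a placeholder and the nano checkpoint is just one of the available sizes.

```python
from ultralytics import YOLO

# Load a pre-trained YOLOv8 checkpoint (weights are downloaded on first use)
model = YOLO("yolov8n.pt")

# Run detection on a soccer frame; "person" and "sports ball" are standard COCO classes
results = model("soccer_frame.jpg")

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    x_min, y_min, x_max, y_max = box.xyxy[0].tolist()
    print(f"{cls_name}: ({x_min:.0f}, {y_min:.0f}, {x_max:.0f}, {y_max:.0f})")
```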

Is fine-tuning MLLMs the way to go instead? 🤔

3. Top Multimodal Large Language Models

Now, we list some of the most important MLLMs redefining computer vision:

GPT-4o (2024, OpenAI)

  • Inputs: text, images, audio (beta), video (beta).
  • Outputs: text, images.
  • What is it: GPT-4o stands for "GPT-4 Omni", with "Omni" referring to its multimodal capabilities across text, vision, and audio modalities. It is a single unified model that can understand and generate any combination of text, images, audio, and video inputs/outputs.
  • Try it here:
  • 馃 Little known fact: GPT-4o employs a 鈥渕ulti-modal chain of thought鈥 approach, where it first reasons about how to break down a problem into a series of steps across different modalities (text, vision, audio), and then executes those steps to arrive at the final solution.

Claude 3.5 Sonnet (2024, Anthropic)

  • Inputs: text, images.
  • Output: text.
  • What is it: With a 200K token context window, Claude 3.5 Sonnet is a multimodal AI system that understands text and image inputs and generates text. It excels at in-depth analysis, research, hypothesis generation, and task automation across various domains like finance, life sciences, and software engineering.
  • Try it here:
  • 馃 Little known fact: Anthropic employs a technique called 鈥渞ecursive reward modeling鈥 which involves using an earlier version of Claude to provide feedback and rewards for the model鈥檚 outputs.

LLaVA (2023, University of Wisconsin-Madison)

  • Inputs: text, images.
  • Output: text.
  • What is it: LLaVA (Large Language and Vision Assistant) is an open-source multimodal AI model that processes text and visual inputs and generates text. It matches GPT-4's chat abilities and sets a new record on Science QA, showcasing advanced visual-linguistic understanding.
  • Try it here:
  • 馃 Little known fact: LLaVA was trained using a technique called 鈥渋nstruction tuning鈥, where GPT-4 was used to generate synthetic multimodal tasks involving text and images (novel in 2023). LLaVA learned from these diverse examples generated by GPT-4 without direct human supervision.

Gemini 1.5 (2024, Google)

  • Inputs: text, images, audio (beta), video (beta).
  • Output: text, images.
  • What is it: Gemini is a family of large language models developed by Google that can understand and operate across multiple modalities like text, images, audio (beta) and video (beta). It was first unveiled in December 2023 and is available in three optimized variants: Gemini Ultra (largest), Gemini Pro (for scaling), and Gemini Nano (for on-device tasks).
  • Try it here:
  • 馃 (Obvious) little known fact: Gemini鈥檚 name is a nod to the Gemini zodiac sign, which represents the 鈥淭wins鈥 in Greek mythology. This is fitting given Gemini鈥檚 dual nature as a highly capable language model that can also process and generate multimodal data like images, audio, and video.

Qwen-VL (2024, Alibaba Cloud)

  • Inputs: text, images.
  • Output: text, images.
  • What is it: Qwen-VL is an open-sourced multimodal AI model that combines language and vision capabilities. It's an extension of the Qwen language model, designed to overcome limitations in multimodal generalization. Recently upgraded versions (Qwen-VL-Plus and Qwen-VL-Max) feature improved image reasoning, better detail analysis in images and text, and support for high-resolution images with varied aspect ratios.
  • Try it here:
  • 馃 (Fun) little known fact: After launch, Qwen-VL quickly rose to the top of the OpenVLM leaderboard but was surpassed by other more powerful models, especially GPT-4o.

4. What's next?

Multimodal models are definitely transforming computer vision. How can you best leverage them when building robust AI pipelines?

Moreover, how do these models, some of them also known as foundation models, impact a traditional computer vision pipeline? 🤔 We believe that these models are paving the way for a new kind of pipeline.

Learn more about the cutting edge of multimodality and foundation models in our brand-new CVPR 2024 series:

References

[1] A Survey on Multimodal Large Language Models

[2] Language Models are Few-Shot Learners

[3] Introducing Meta Llama-3: The most capable openly available LLM to date

[4] Multimodal medical AI

[5] Attention is all you need

[6] Language Models are Unsupervised Multitask Learners

[7] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[8] An Introduction to Vision-Language Modeling

[9] GPT-4o

[10] LLaVA: Large Language and Vision Assistant

[11] FERRET: Refer and Ground Anything Anywhere at Any Granularity

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan.

