In this post we share with you the top highlights of CVPR 2024. Let's get started! 🚀
Table of Contents
- Embodied AI
- Generative AI
- Foundation Models
- Video Understanding
- What鈥檚 next?
1. Embodied AI

What is this about
Embodied AI is an approach to artificial intelligence that focuses on creating agents (e.g., robots, smart home systems) capable of learning and solving complex tasks through direct interaction with their environment.
As , keynote speaker, mentioned: "Embodied AI can mean different things to different people. It has gone through a series of transformations over the years", but a common feature is that Embodied AI is about systems able to perceive their surroundings (through vision and other senses), communicate using natural language, understand audio inputs, navigate and manipulate their environment to achieve goals, and engage in long-term planning and reasoning.
Key ideas
1. Current AI systems are vulnerable to adversarial attacks due to a lack of true embodiment.
During Bongard's keynote, he argued that simply putting deep learning systems into robots is not sufficient for true Embodied AI. He believes that embodiment is fundamentally about change, both internal and external. To create safe AI, we need technologies that undergo significant internal physical changes. "Morphological pretraining" [1] through internal change can help AI systems better handle new tasks and adversarial attacks.
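To give a flavour of what "morphological pretraining" could look like in practice, here is a heavily hedged toy sketch: a single policy is trained across many sampled body variations before it ever faces new tasks. Every function and parameter below is an illustrative assumption, not Bongard's actual method.

```python
# Heavily hedged toy sketch of "morphological pretraining": expose one policy
# to many body variations (internal change) before deployment.
# All names and parameters are illustrative assumptions.
import random

def sample_morphology() -> dict:
    """Illustrative morphology parameters, e.g. limb length and joint range."""
    return {"limb_length": random.uniform(0.5, 1.5),
            "joint_range": random.uniform(0.5, 1.0)}

def train_on_body(policy, morphology: dict, episodes: int = 10):
    """Stand-in for an inner training loop on one body in simulation."""
    raise NotImplementedError("plug in your simulator and learning algorithm")

def morphological_pretraining(policy, n_bodies: int = 100):
    """Repeatedly adapting the same policy to different bodies is meant to
    make it less brittle to novel tasks and adversarial perturbations."""
    for _ in range(n_bodies):
        train_on_body(policy, sample_morphology())
    return policy
```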
2. The path to truly "generalizable robots" is to scale simulation.
, senior director of computer vision at the Allen Institute for Artificial Intelligence (AI2) in Seattle, thinks that scaling simulation data enables agents to navigate and manipulate masterfully in the real world without any adaptation or fine-tuning. In his work RoboTHOR [2], he examined the critical issue of how effectively models trained in simulation generalize to real-world scenarios, a question that has largely remained unresolved.
3. LLMs are better suited for long-horizon manipulation of any object than approaches like Imitation Learning or more traditional methods such as classical task and motion planning.
, AI researcher and head of Embodied AI at , argues that (1) classical task and motion planning lacks knowledge of objects in the world and struggles with partial observability, and (2) modern behavioural cloning techniques, like Imitation Learning [3], fail to generalize well to unseen environments. In contrast, LLMs [4] can be used for long-horizon manipulation of any object in any environment (see the sketch after the list below):
- Train transformers [5] to predict how objects should move
- Use LLMs like GPT-4 for common sense reasoning and interpreting users
- Combine LLM outputs with planners to make sure constraints are met
- Train low-level motor skills using spatially-abstracted representations
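To make the division of labour in that recipe concrete, here is a minimal Python sketch of the loop. None of it comes from the talk: `llm_plan`, `satisfies_constraints`, and `execute_skill` are hypothetical placeholders for the LLM planner, the constraint-checking planner, and the learned low-level skills mentioned above.

```python
# Hypothetical sketch of an LLM-driven long-horizon manipulation loop.
# The three components below are placeholders, not code from the talk.
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str      # e.g. "place the mug on the shelf"
    target_object: str    # object the low-level skill should act on

def llm_plan(instruction: str, scene_objects: list[str]) -> list[Subgoal]:
    """Placeholder for an LLM (e.g. GPT-4) that decomposes a user instruction
    into ordered subgoals using common-sense reasoning."""
    raise NotImplementedError("call your LLM of choice here")

def satisfies_constraints(subgoal: Subgoal, scene_objects: list[str]) -> bool:
    """Placeholder for a planner/constraint check (reachability, collisions,
    does the target object exist at all, ...)."""
    return subgoal.target_object in scene_objects

def execute_skill(subgoal: Subgoal) -> bool:
    """Placeholder for a learned low-level motor skill operating on a
    spatially-abstracted representation of the scene."""
    raise NotImplementedError("dispatch to a trained policy here")

def run(instruction: str, scene_objects: list[str]) -> None:
    for subgoal in llm_plan(instruction, scene_objects):
        if satisfies_constraints(subgoal, scene_objects):
            execute_skill(subgoal)  # otherwise: re-plan or skip the step
```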
Leaders & builders in the space
- : Director of the at the University of Vermont.
- : Senior Director of CV at , and Professor of CS at the University of Washington.
- , AI researcher and head of Embodied AI at .
- , VP of AI at .
- , founder of .
2. Generative AI

What is this about
Unless you've been living under a rock for the past 24 months, you probably use GenAI on a daily basis by now. Generative AI [6] refers to artificial intelligence systems, for instance Google's Imagen [7], that can create new content, such as text, images, audio, or video, that resembles human-created work.
GenAI was a really hot topic 🔥 during CVPR 2024. The conference hosted the following GenAI-related workshops:
- SyntaGen: Generative Models for Synthetic Visual Datasets
- The Future of Generative Visual Art
- Responsible Generative AI Workshop
- Generative Models for Computer Vision
- Evaluation of Generative Foundation Models
Key ideas
1. The creation of multimodal datasets (containing paired image-text examples) can be demystified by following a rigorous dataset development process.
In his keynote at the Evaluation of Generative Foundation Models workshop, , a researcher at , argued that multimodal learning can be accelerated by adopting a data-centric approach. He described a benchmark called DATACOMP [8], which aids in engineering multimodal datasets. The key idea of this benchmark, composed of 38 classification and retrieval tasks, is to keep both the training code and the GPU budget constant while proposing different training sets.
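As a rough illustration of this "fixed code, fixed compute, variable data" protocol, the sketch below compares candidate training sets under one frozen training routine. `train_clip` and `evaluate` are hypothetical stand-ins, not the actual DATACOMP tooling.

```python
# Hedged sketch of a DATACOMP-style protocol [8]: the training recipe and the
# compute budget stay fixed, and only the training set changes.

def train_clip(dataset, compute_budget_hours: float):
    """Stand-in for the frozen training code used for every candidate set."""
    raise NotImplementedError

def evaluate(model, tasks) -> float:
    """Stand-in returning the average score over the benchmark's 38
    classification and retrieval tasks."""
    raise NotImplementedError

def compare_datasets(candidates: dict, tasks, budget_hours: float = 100.0):
    scores = {}
    for name, dataset in candidates.items():
        model = train_clip(dataset, compute_budget_hours=budget_hours)  # same code, same budget
        scores[name] = evaluate(model, tasks)                           # only the data differs
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```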
2. Training text-to-image models on richly detailed, generated image captions significantly enhances their prompt-following abilities.
from OpenAI claimed that GenAI models often struggle with interpreting detailed descriptions, frequently overlooking words or misunderstanding prompts. This problem stems from the noisy and inaccurate captions usually found in training datasets. By training a specialized image captioner to recaption the data, a more reliable and detailed dataset was created. Building on these insights, DALL-E 3 [9] was developed.
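A hedged sketch of that recaptioning idea follows. `generate_detailed_caption` is a hypothetical stand-in for the specialized captioner described in the DALL-E 3 report [9], and the mixing fraction is an illustrative parameter rather than the paper's exact recipe.

```python
# Sketch: replace noisy alt-text captions with detailed synthetic captions,
# then train the text-to-image model on the recaptioned dataset.
import random

def generate_detailed_caption(image) -> str:
    """Stand-in for a specialized captioner trained to write long,
    highly descriptive captions."""
    raise NotImplementedError

def recaption(dataset, synthetic_fraction: float = 0.9):
    """Mix synthetic captions with some original ones so the downstream model
    does not overfit to the captioner's style (fraction is illustrative)."""
    recaptioned = []
    for image, original_caption in dataset:
        if random.random() < synthetic_fraction:
            recaptioned.append((image, generate_detailed_caption(image)))
        else:
            recaptioned.append((image, original_caption))
    return recaptioned
```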
3. Learning vision without visual data is possible.
In a fantastic talk titled "Learning Vision with Zero Visual Data" by from MIT, he argued that non-visual data such as noise [10], language, and/or code can be used to train a vision model. In particular, language models such as GPT-4 can correctly classify human drawings but struggle to identify concept categories that they are otherwise capable of rendering accurately.
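To give a flavour of what "vision without visual data" can mean in practice, here is a minimal sketch that generates procedurally structured noise images an encoder could be pretrained on, loosely in the spirit of [10]. The generator below is an illustrative assumption, not the paper's noise models.

```python
# Minimal sketch: create structured noise "images" that an image encoder could
# be pretrained on with a standard self-supervised objective, loosely in the
# spirit of learning to see by looking at noise [10].
import numpy as np

def sample_noise_image(size: int = 64, octaves: int = 4) -> np.ndarray:
    """Sum random low-frequency components into a structured noise image
    (illustrative; the paper studies several families of synthetic images)."""
    image = np.zeros((size, size))
    for octave in range(1, octaves + 1):
        step = 2 ** octave
        coarse = np.random.randn(size // step + 1, size // step + 1)
        image += np.kron(coarse, np.ones((step, step)))[:size, :size] / octave
    return image

# A contrastive or masked-prediction objective on crops of such images would
# then train the encoder exactly as it would on real photographs.
```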
Leaders & builders in the space
- : Professor of CS at the University of Washington and at .
- : Professor of CV at MIT.
- : Staff research scientist at Google and professor of CS at the Weizmann Institute of Science.
- : Research scientist at Wayve.
3. Foundation Models

鈥
What is this about
Foundation models are large-scale artificial intelligence systems trained on vast and diverse datasets, serving as a base for a wide range of AI applications. These models are characterized by their size, breadth of training data, and ability to be adapted to various tasks with minimal additional training.
🔍 For a more detailed guide on the definitive Foundation Models reshaping the field of computer vision, read our ⭐️.
Key ideas
1. Foundation models can work as real-world simulators.
Google researcher argued that one of the use cases for foundation models is to serve as real-world simulators. In his keynote at the Foundation Models for Autonomous Systems workshop, he claimed that two requirements for foundation models to function as real-world simulators have already been met:
- 1) The Internet's data (in text and video form) provides a unified representation and task interface for a "world model"
- 2) Reinforcement learning is sufficiently advanced (for decision-making) to allow for planning in this "world model" [12]
So, what's missing? Two things: 1) hallucinations are still common in these models, and 2) better evaluation and feedback mechanisms are needed.
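To make "planning inside a learned world model" concrete, here is a hedged sketch of a random-shooting planner rolling out candidate action sequences in imagination. `world_model.predict` and `reward` are hypothetical interfaces, not UniSim's API [12].

```python
# Sketch: pick actions by imagining their outcomes inside a learned simulator.
# The world-model and reward interfaces below are hypothetical.
import numpy as np

def plan(world_model, reward, current_obs, action_dim: int,
         horizon: int = 10, n_candidates: int = 256):
    """Random-shooting planner: sample action sequences, roll each one out in
    the learned world model, and return the first action of the best sequence."""
    best_score, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        obs, score = current_obs, 0.0
        for action in actions:
            obs = world_model.predict(obs, action)  # imagined next observation
            score += reward(obs)                    # evaluated in imagination
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action
```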
2. The true benefit of foundation models in robotics lies in their ability to serve as general models that excel at decision-making.
In his talk titled "A General-Purpose Robotic Navigation Model", , an AI researcher and professor of computer science at Berkeley, argued that foundation models in domains like computer vision aren't pretrained to make decisions per se. Currently, pretraining is only loosely related to decision tasks. However, if foundation models were pretrained to directly make important and useful decisions, it could be valuable for both robotics and other fields, since downstream machine learning tasks ultimately involve decision-making.
3. We won't achieve a robotics-first foundation model until we address three key components: data scaling, steerability and promptability, and scalable evaluations.
, a research scientist at Google working on robotics, argued that three crucial ingredients are missing to build a true robotics-first foundation model. In his talk, he explained these three components:
- 1) Data scaling has worked incredibly well for LLMs and VLMs, but there's no equivalent for robot data yet. However, there is hope if data interoperability is increased by treating robot actions as just another data modality (see the toy sketch after this list)
- 2) There is no promptable generalist robot like in LLMs, partly due to large context bandwidths, and the lack of robot data makes this even harder to achieve
- 3) Generalist models that can do anything need to be evaluated on everything: LLMs are evaluated directly by humans, as they target a human data distribution. In contrast, robots target a physical data distribution, which might require real-world evaluations that we are not yet capable of conducting.
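For the first ingredient, the toy sketch below shows one way robot actions could be treated as "just another modality": discretize each continuous action into tokens that a sequence model can learn alongside text and image tokens. The binning scheme is illustrative, not any particular lab's format.

```python
# Toy sketch: turn continuous robot actions into discrete tokens so that robot
# trajectories can be mixed with text/image tokens in one training stream.
import numpy as np

def action_to_tokens(action: np.ndarray, n_bins: int = 256,
                     low: float = -1.0, high: float = 1.0) -> list[int]:
    """Map each action dimension (e.g. end-effector deltas, gripper state)
    to an integer bin, yielding one token per dimension."""
    clipped = np.clip(action, low, high)
    bins = np.floor((clipped - low) / (high - low) * (n_bins - 1)).astype(int)
    return bins.tolist()

def tokens_to_action(tokens: list[int], n_bins: int = 256,
                     low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Approximate inverse used at execution time: tokens back to actions."""
    return low + (np.asarray(tokens, dtype=float) + 0.5) / n_bins * (high - low)

# Example: a 7-DoF action becomes 7 tokens a sequence model can predict.
print(action_to_tokens(np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 1.0])))
```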
Leaders & builders in the space
- : Professor of computer science at Berkeley.
- : Co-founder of Wayve.
- : AI researcher at NVIDIA and professor of computer science at the University of Toronto.
- : Senior research scientist at Google.
4. Video Understanding

鈥
What is this about
Video Understanding refers to the field of artificial intelligence that focuses on developing systems capable of comprehending and analyzing the content, context, and events within video sequences. It goes beyond simple object recognition or scene classification to interpret complex temporal and spatial relationships, actions, and narratives depicted in video data.
Key ideas
1. Multimodal in-context learning is poised to transform the task of audio description (AD).
, an AI researcher at AMD, described how visual content in long-form videos can be transformed into audio descriptions with multimodal models, in particular GPT-4, using multimodal in-context learning (MM-ICL) with few-shot examples [14]. He claims that this strategy beats both fine-tuning-based and LLM/LMM-based approaches for generating audio descriptions for videos of extensive length.
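A hedged sketch of how such a few-shot, multimodal in-context prompt could be assembled is shown below. The prompt format is illustrative and is not MM-Narrator's actual implementation [14].

```python
# Sketch: build a few-shot in-context prompt that asks a multimodal model
# (e.g. GPT-4 with vision) for an audio description (AD) of the current clip.
# The format below is illustrative, not the MM-Narrator prompt.

def build_ad_prompt(few_shot_examples, previous_ads, current_clip_summary):
    """few_shot_examples: (clip_summary, reference_AD) pairs.
    previous_ads: ADs already produced for earlier clips, so the narration
    stays consistent across a long-form video."""
    parts = ["You write concise audio descriptions for blind audiences."]
    for clip_summary, reference_ad in few_shot_examples:
        parts.append(f"Scene: {clip_summary}\nAudio description: {reference_ad}")
    parts.append("Story so far: " + " ".join(previous_ads[-5:]))
    parts.append(f"Scene: {current_clip_summary}\nAudio description:")
    return "\n\n".join(parts)

# The resulting text (plus the clip's frames) is sent to the multimodal model;
# its reply becomes the audio description for this clip.
```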
2. LLMs are a cornerstone for solving long-range video captioning.
According to from FAIR, the reasoning abilities of LLMs make these models the perfect companion for hierarchical video captioning tasks [15]. In his keynote at a workshop focused on , Torresani explained why LLMs can be so powerful for these tasks (a toy sketch follows the list below):
- 1) Given short-term clip captions, LLMs can successfully generate descriptions and long-range video summaries
- 2) LLMs can be used to augment training data, effectively complementing manually annotated data to improve performance on caption creation
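A toy sketch of the first point might look like the following: short clip captions are grouped into segments, each segment is summarized by an LLM, and the segment summaries are condensed again into a video-level description. `call_llm` is a hypothetical stand-in, and the prompts are illustrative rather than the Video ReCap recipe [15].

```python
# Toy sketch of hierarchical video captioning with an LLM.
# `call_llm` and the prompts are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM."""
    raise NotImplementedError

def summarize_video(clip_captions: list[str], clips_per_segment: int = 20) -> str:
    segment_summaries = []
    for start in range(0, len(clip_captions), clips_per_segment):
        segment = clip_captions[start:start + clips_per_segment]
        segment_summaries.append(
            call_llm("Summarize these consecutive clip captions:\n" + "\n".join(segment))
        )
    return call_llm("Write a description of the whole video from these segment "
                    "summaries:\n" + "\n".join(segment_summaries))
```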
Leaders & builders in the space
- : Principal researcher at Microsoft Research.
- : Professor of computer vision at the University of Bristol and research scientist at Google.
- : Senior researcher at Google.
- : AI researcher at Wayve.
- : Senior director of GenAI at AMD.
- : AI researcher at FAIR.
5. What鈥檚 next?
It's been less than a week and we're already missing the energy and enthusiasm of the crowd at CVPR.

👉 Stay tuned for more CVPR 2024 posts!
References
[1] Josh Bongard talk, Day 2, EI'23 Conference
[2] RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
[3] A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges
[4] Language Models are Few-Shot Learners
[5] Attention Is All You Need
[6] Generative AI in Vision: A Survey on Models, Metrics and Applications
[7] Imagen
[8] DATACOMP: In search of the next generation of multimodal datasets
[9] Improving Image Generation with Better Captions
[10] Learning to See by Looking at Noise
[11] Sora at CVPR 2024
[12] UniSim: Learning Interactive Real-World Simulators
[13] PRISM-1 by Wayve
[14] MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
[15] Video ReCap: Recursive Captioning of Hour-Long Videos
Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan.
If you'd like to know more about 猫咪社区, explore .
