The field of computer vision has seen incredible progress, but some believe there are signs it is stalling. At the International Conference on Computer Vision (ICCV) 2023 workshop "Quo Vadis, Computer Vision?", researchers discussed what's next for computer vision.

In this post we bring you the main takeaways from some of the best minds in the computer vision landscape who gathered for this workshop during ICCV23 in Paris.
Table of Contents
- Quo Vadis, Computer Vision?
- The Anti Foundation Models
- Data over Algorithms
- Video can describe the world better than Text
- After Data-Centric, the User will be the core
- Bring back the fundamentals
- So, is Computer Vision dead?
Disclaimer: We went undercover into the workshop to bring you the most secret CAMRiP-quality insights! 🕵️
1. Quo Vadis, Computer Vision?

Computer vision has reached a critical juncture with the emergence of large generative models. This development is having a dual impact. On one hand, it is opening new research avenues and attracting academics and businesses eager to capitalize on these innovations. On the other, the swift pace of advancement is causing uncertainty among computer vision researchers about where to focus next.

Many feel conflicted, wondering whether they can match the pace of progress in generative models while working on more established computer vision problems. This ICCV 2023 workshop (see Figure 1) brought together experts such as Bill Freeman, Jitendra Malik, and Antonio Torralba to discuss this pivotal moment.

In the following sections we highlight the lively discussions that followed on how computer vision should adapt to and leverage generative models while still tackling core challenges in areas like video and embodied perception. There was consensus that thoughtfully combining the strengths of computer vision and generative models is key, rather than seeing them as competing approaches.
2. The Anti Foundation Models
MIT's professor Bill Freeman provided three reasons why he doesn't like foundation models:
Reason 1: They don't tell us how vision works
In short, Bill Freeman argues that foundation models are capable of solving vision tasks, but despite this achievement nobody can explain how vision works (i.e., they are still a black box).
Reason 2: They aren't fundamental (and therefore not stable)
As shown in Figure 2, Professor Freeman hints that foundation models are simply a trend.

Reason 3: They separate academia from industry
Finally, Professor Freeman argues that foundation models create a boundary between those in academia (i.e., creative teams without resources) and those in industry (i.e., unimaginative teams with well-organized resources).
3. Data over Algorithms
Berkeley's professor shared the two ingredients for achieving true AI:
- Focus on data over algorithms: GigaGAN [1] showed that large datasets enable older architectures such as GANs to scale.
- Bottom-up emergence: data per se is mostly noise; what is crucial is the right kind of (high-quality) data.

He also argues that LLMs are winning because they are trained on all the available data for just a single epoch! (see Figure 3).

4. Video can describe the world better than Text
An audacious take came from Berkeley's professor Jitendra Malik, who suggested that video is a more efficient (and perhaps more effective) way to describe the world.

He supports this view by arguing that any book (see Figure 4 for some examples) can be represented more compactly using video (i.e., frames) than text (i.e., tokens): the same information can be conveyed far more efficiently with video.

Professor Malik believes video will help put computer vision back on the map in the next few years.
5. After Data-Centric, the User will be the core

Princeton's professor provided fascinating insights on what comes after the data-centric approach to machine learning.

She elegantly explained (Figure 5) how the field has evolved from a pure focus on models (around the year 2000) to the current moat of "data is king", and argues that a time where the human (i.e., the user) is the center comes next.

For instance, she makes the case for gathering truly representative data from all over the world rather than focusing solely on web data (see Figure 6).
6. Bring back the fundamentals

Finally, MIT's professor Antonio Torralba gave a lighthearted talk where he candidly shared his views on why curiosity is more important than performance (see Figure 8), especially in today's LLM-driven world.

Professor Torralba argues that computer vision has been in this position before: (mostly) outsiders confidently claim the field has stalled, yet time and again someone proves them wrong with a clever idea, by focusing on the fundamentals rather than following the crowd.

7. So, is Computer Vision dead?
The ICCV23 workshop makes clear that, rather than being dead, computer vision is evolving. As leading experts argued, promising directions lie in the interplay between vision and language models.
However, other frontiers also hold potential, like exploring when large vision models are actually needed, or providing granular control over frozen generative architectures, as described in one of the papers awarded the Marr Prize [2] at ICCV23.
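The Marr Prize paper cited here ([2], ControlNet) keeps a pretrained generative model frozen and adds a trainable branch that feeds back in through zero-initialized layers, so training starts from exactly the frozen model's behavior. A minimal NumPy sketch of that idea follows; the names and shapes are illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_block(x, W):
    # Pretrained layer whose weights W stay frozen during fine-tuning.
    return np.tanh(x @ W)

def controlled_block(x, cond, W, W_ctrl, W_zero):
    # A trainable copy processes the input plus the conditioning signal...
    h = np.tanh((x + cond) @ W_ctrl)
    # ...and re-enters through a zero-initialized projection ("zero conv"),
    # so at initialization the output equals the frozen block's output.
    return frozen_block(x, W) + h @ W_zero

d = 4
x = rng.normal(size=(2, d))
cond = rng.normal(size=(2, d))
W = rng.normal(size=(d, d))
W_ctrl = W.copy()            # the trainable copy starts from the frozen weights
W_zero = np.zeros((d, d))    # zero-initialized, so conditioning has no effect yet

out = controlled_block(x, cond, W, W_ctrl, W_zero)
assert np.allclose(out, frozen_block(x, W))  # frozen behavior preserved at init
```

As `W_zero` is updated during training, the conditioning branch gradually gains influence without ever destabilizing the frozen backbone, which is the "granular control" the sentence above refers to.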
While progress may require integrating the strengths of vision and language, key computer vision challenges remain in areas like texture perception and peripheral vision, where the question of how to throw away information is still open. With an influx of new researchers and industry interest, the field is poised to take on some of these questions.
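One concrete reading of "throwing away information" is foveated processing: keep full resolution only where it matters and aggressively discard detail in the periphery. A toy NumPy sketch of that idea (purely illustrative, not any particular model):

```python
import numpy as np

def foveate(img, keep=8):
    """Toy foveation: keep a central patch at full resolution and
    collapse the periphery to its mean, discarding most information."""
    h, w = img.shape
    out = np.full_like(img, img.mean())  # periphery reduced to a single value
    y0, x0 = (h - keep) // 2, (w - keep) // 2
    out[y0:y0 + keep, x0:x0 + keep] = img[y0:y0 + keep, x0:x0 + keep]
    return out

img = np.arange(256, dtype=float).reshape(16, 16)
fov = foveate(img)
# The central 8x8 patch is untouched; everything else collapses to the mean.
```

The open research question is not how to discard information (that part is easy, as the sketch shows) but which information can be safely discarded without losing what the task needs.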
References

[1] Scaling up GANs for Text-to-Image Synthesis

[2] Adding Conditional Control to Text-to-Image Diffusion Models

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan
