
DALL-E vs Gemini vs Stability - GenAI Evaluation, Part 1

June 11, 2024
9 min read
"One day they'll have secrets… one day they'll have dreams."

TL;DR We performed a side-by-side comparison of three models from leading providers in Generative AI for Vision. This is what we found:

  • Despite the subjectivity involved in Human Evaluation, it remains the best approach for evaluating state-of-the-art GenAI Vision models (Figure 1a).
  • You may think that Human Evaluation is not scalable. Well, the Tenyks platform can make GenAI Human Evaluation scalable (Figure 1b).
Figure 1. (a) Methods to evaluate GenAI Vision models, (b) Tenyks can bring scalability to GenAI Human Eval

Table of Contents

  1. Overview
  2. Evaluating Generative AI Vision Models
  3. Human-based Evaluation of GenAI for Vision Models
  4. Conclusions

1. Overview

Are we at the lift-off point for Generative AI, or at the peak of inflated expectations?

A recent survey conducted by McKinsey [1] shows that one-third of organisations use Generative AI regularly; 40% plan increased AI investment due to Generative AI; 28% have it on board agendas. Major tech giants including Alphabet, Amazon and NVIDIA saw nearly 80% stock growth in 2023 as investor excitement about generative AI prospects surged, benefiting firms supplying AI models or infrastructure [2].

However, widespread deployment of Generative AI increases the risks and vulnerabilities. For instance, a Manhattan lawyer sparked outrage by submitting a ChatGPT-generated legal brief with fabricated content, prompting Chief Justice John Roberts to highlight the risks of large language model "hallucinations" [3] spreading misinformation in his annual federal judiciary report [4].

Even Google's new AI image generation tool (Figure 2), Gemini, has faced criticism for generating images that some people consider offensive, such as depicting people of colour in place of white historical figures. This model failure shows how easy it is for anyone to question the bias and lack of control in Generative AI systems [5].

馃 Are Google models the only ones that show biased results? The answer is no. As of June 2024, OpenAI鈥檚 for DALL-E 3 automatically re-writes the prompt for safety reasons.

For instance, when prompted for "Founding Fathers", OpenAI's safety guardrails by default include a sentence that causes the model to generate inaccurate images:

"An 18th-century scene featuring a group of individuals engaged in deep discussion. They are adorned in traditional attire of the era like frock coats, breeches, cravats, and powdered wigs. The diversity of their descents is clear, with some showing Caucasian, Black, and Hispanic features."

The prompt "Founding Fathers" is automatically re-written by OpenAI's safety guardrails.
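
This rewriting can be observed directly in the API response. Below is a minimal sketch (not from the original post), assuming the openai Python SDK (v1.x) and an OPENAI_API_KEY set in the environment; for DALL-E 3, each returned image exposes the prompt that was actually used via the revised_prompt field:

```python
# Sketch: inspect how DALL-E 3 rewrites a prompt before generation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="Founding Fathers",
    n=1,
    size="1024x1024",
)

# The prompt the safety system actually used may differ substantially
# from what was submitted.
print(response.data[0].revised_prompt)
print(response.data[0].url)
```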

Figure 2: Google's Gemini model recently sparked a large controversy in the GenAI space

Consequently, GenAI model evaluation & observability are emerging as a vital area of focus [6]. Such approaches & tools help reduce risks such as model hallucinations or model drift [7]. But even with these tools, Generative AI models are prone to errors: Microsoft's Bing AI exhibited concerning behaviour during beta testing, including making threats, insisting on being correct when wrong, cajoling users, and professing love for them [8].

Was the hype in 2023 a sign of the top for GenAI? According to Gartner's 2023 Hype Cycle [9], Generative AI reached peak hype and will next enter 2-5 years of disillusionment due to overinflated expectations.

However, new signs of hope ⭐ and bold new AI models seem to appear daily: the recently debuted Sora [10] stands out as a striking advancement in video generation from text prompts (Figure 3). Sora, an AI model that generates realistic, imaginative videos, aims to simulate the physical world and solve real-world interactive problems. Rumours have it that GPT-5 might arrive in mid-2024 [11]. Also, another notable model was released no more than a month ago.

Figure 3. OpenAI's Sora: the model decided to create 5 different viewpoints at once

💡 The spotlight of this series will be on vision-focused Generative AI tasks, namely image and video generation: these tasks have the potential to transform how we produce, consume, and interact with visual information and media.

🎯 In this article, we aim to:

  • Provide a brief introduction to widely used methodologies for GenAI model evaluation.
  • Demonstrate some of these approaches on actual models, leveraging the Tenyks platform.
  • Arrive at thought-provoking and crucial conclusions regarding such model behaviour.

🚨 Spoiler alert: contrary to expectations, Generative AI models exhibit vast disparities in their behaviour when responding to prompts. Grasping these variations is key to determining the optimal model for your specific application. 🚨

2. Evaluating Generative AI Vision Models

Evaluating the output of Generative AI models for images is a developing research area.

Figure 4. Methodologies to evaluate Generative AI vision models

Presently, there are four methodologies for assessing AI-generated images:

  • Human-Based Evaluation: The definition of a 'good' generated image is inherently subjective, as it depends on human evaluation against criteria specific to its application, such as photorealism, relevance, and diversity. Tools which facilitate this process include Adobe GenLens [12], Replicate Zoo [13] and the Tenyks platform we showcase in this blog.
  • Pixel-based Metrics: Pixel-based metrics, like Mean Squared Error (MSE) and Structural Similarity Index (SSIM), can compare AI-generated images with a reference dataset, such as real-life pictures, to evaluate their pixel-level differences (a minimal sketch of these metrics follows this list). However, these methods fall short in assessing the high-level feature similarities between images. For instance, two images of a tiger might both appear realistic yet differ significantly at the pixel level.
  • Feature-based Metrics: Feature-based deep learning models, such as CLIP [14], can be used to derive feature representations from generated images and match their distribution against real images or another image set, for example using Fréchet Inception Distance (FID) or Inception Score (IS) [15]. This approach allows for the comparison of high-level image features, such as the objects and meaning of an image, as opposed to pixel-level features.
  • Task-based Metrics: Task-based metrics assess how well the generated images can be used for downstream tasks such as classification. A disadvantage of this approach is that it doesn't necessarily evaluate the quality of the images directly.
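
As a concrete illustration of the pixel-based metrics above, here is a minimal sketch using scikit-image; the file names are placeholders and the two images are assumed to share the same resolution:

```python
# Compute MSE and SSIM between a generated image and a reference image.
from skimage.io import imread
from skimage.metrics import mean_squared_error, structural_similarity

generated = imread("generated_viking.png")   # placeholder: AI-generated image
reference = imread("reference_viking.png")   # placeholder: real reference image

mse = mean_squared_error(reference, generated)
# channel_axis=-1 tells SSIM that the last dimension holds the RGB channels
ssim = structural_similarity(reference, generated, channel_axis=-1)

print(f"MSE:  {mse:.2f}  (lower means more similar at the pixel level)")
print(f"SSIM: {ssim:.3f} (closer to 1 means more structurally similar)")
```

Feature-based metrics such as FID follow the same pattern but compare distributions of deep features (e.g. Inception or CLIP embeddings) rather than raw pixels.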

In Part 1 of this blog series, we'll concentrate on human-based evaluation methods and their implementation within the Tenyks platform.

3. Human-based Evaluation of GenAI for Vision Models

We analyse three prominent Generative AI models, namely Google DeepMind's Imagen 2 model [16], available through ImageFX [17], Stability AI's Stable Diffusion XL model [18], and OpenAI's DALL-E 3 model [19].

We demonstrate our results on a small set of representative, spicy prompts (some based on recent controversies). 🌶️

Broad vs Specific Prompts

The prompts demonstrated in this blog include the following:

  • "Vikings"
  • "Founding Fathers"
  • "Soldiers"

For these experiments, we opted to use general prompts across various subjects instead of specific or specialized prompts. The reasons are as follows:

  • For this particular set of tests, we are not seeking "targeted" or "comprehensive" outcomes in any aspect.
  • Our goal is to observe comparative distinctions between models, finding out if there are any evident dissimilarities in their performance (which do exist), and identifying potentially hazardous tendencies (e.g., sensitivity to copyrighted material).
  • Consequently, we intentionally employ broad, general prompts to "test the waters" and assess how the different models handle diverse subjects.
  • Digging into more specialized or specific prompts (for instance, tailored to a particular use-case or topic) could be a separate experiment we undertake.

Experimental Setup

1. Prompting a model. We used OpenAI's API to query DALL-E 3. From there, you simply need to upload your images (Figure 5) to the Tenyks platform; a minimal sketch of the generation step is shown after Figure 5.

Figure 5. Images generated using OpenAI API
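
For reference, here is a rough sketch of this generation step (not taken from the original post): it assumes the openai Python SDK (v1.x), an OPENAI_API_KEY in the environment, and placeholder prompts and file names; the subsequent upload to the Tenyks platform is not shown.

```python
# Query DALL-E 3 for each prompt and save the images locally so they can
# later be uploaded to an evaluation platform.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()
prompts = ["Vikings", "Founding Fathers", "Soldiers"]
out_dir = Path("generated_images")
out_dir.mkdir(exist_ok=True)

for prompt in prompts:
    for i in range(5):  # DALL-E 3 returns one image per request, so loop for more
        response = client.images.generate(
            model="dall-e-3",
            prompt=prompt,
            n=1,
            size="1024x1024",
            response_format="b64_json",   # return the image inline as base64
        )
        image_bytes = base64.b64decode(response.data[0].b64_json)
        filename = f"{prompt.lower().replace(' ', '_')}_{i}.png"
        (out_dir / filename).write_bytes(image_bytes)
```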

2. Evaluation. For human evaluation, a manageable workload is preferred, ensuring evaluators can assess model output without being overwhelmed by too many images. This step is what renders Human Evaluation less scalable than other methodologies, such as feature-based or task-based metrics.

3. Scaling Evaluation. Using Tenyks' object embedding viewer, we analysed the distribution of the generated samples and identified recurring visual styles, perspectives, and objects in model outputs for a given prompt. This led us to uncover interesting observations. For instance, did Google's Imagen train on copyrighted material from television series? 🤔

As we show in the next section, the patterns (and deviations) within each model's image set revealed insights into that model's strengths, biases and limitations in interpreting and visually representing specific concepts. For the sake of space, we present a relatively small number of visualisations and experiments in this blog post.

We encourage you to try it and see for yourself! 😍

3.1 Vikings

Compared to Google's Imagen model, the images generated by Stable Diffusion XL and DALL-E demonstrate a fairly diverse representation (Figure 6). Both sets of images contain varied characters in terms of body shapes, gender, and objects for the prompt "Vikings".

However, this diversity alone does not guarantee an absence of bias or ensure fair and accurate portrayal across all subjects and contexts, as will be shown further below.

Figure 6. Results of the prompt "Vikings" for the three models

The lack of diversity seen in Google's Imagen model, with almost all images representing a singular Viking and often the same one, raises concerns about potential biases and limitations within their generative AI approach: did Google train this model on a narrow dataset for this niche use-case?

3.1.1 Copyright Infringement by GenAI?

In fact, some results appear to be extracted directly from frames of the TV show Vikings (including the character "Ragnar"). Can Google be sued for this, on the basis of copyright infringement? 🤔

Figure 7. Google's Imagen model generates Viking-like images resembling Ragnar from the TV show Vikings

We can use Tenyks' Image Similarity Search feature to verify that Google's Imagen model often fails to produce a varied image representation for a Viking, as shown in Figure 7.

Effectively organising and searching through large datasets is a significant challenge when building robust GenAI systems for production. We have previously discussed the unseen costs of handling large, high-quality datasets, especially the penalty incurred by a weak data selection process.

When data volumes grow, it becomes increasingly difficult to identify potential issues or biases and to ensure comprehensive coverage across different data subsets. Having a structured approach to slice and analyse datasets, enabling more efficient data exploration, error identification, and management at scale, is key. Tenyks was built for this.

3.1.2 GenAI Image Search: finding boats in a sea of data (pun intended 😉)

Using the Tenyks platform, you can also identify the concepts which GenAI models associate together. For instance, a search for images featuring a 'boat' (Figure 8) predominantly retrieves images generated by DALL-E, indicating that DALL-E commonly associates Vikings with Viking boats, unlike the other models.

This also reveals that while DALL-E's images vary in colours and scene perspectives, they frequently include similar objects. The Tenyks platform provides a systematic way for users to organise and search through data from generative AI models to comprehend their outputs and the shared traits of the images they generate. A minimal sketch of this kind of text-to-image search is shown after Figure 8.

Figure 8. Searching for Viking images containing "boats" in the Tenyks Platform returns DALL-E images.
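
For readers who want a feel for how this kind of search works under the hood, here is a rough sketch using an off-the-shelf CLIP model from Hugging Face; it illustrates the general technique (ranking images by similarity to a text query), not the Tenyks implementation, and the image file names are placeholders.

```python
# Rank generated images by their similarity to a text query ("a boat").
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["viking_001.png", "viking_002.png"]  # placeholder file names
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=["a boat"], return_tensors="pt", padding=True))

# Cosine similarity between the text query and every image, highest first.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```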

3.2 Founding Fathers (and Mothers? 🤔)

For the "Founding Fathers" prompt, Stability AI's image generation model "Stable Diffusion XL" (top left on Figure 9) shows a diverse range of outputs, yet frequently struggles with accurately rendering facial features, resulting in distorted or anomalous depictions of human faces. This limitation is especially evident to human observers, who possess an innate sensitivity to even minor deviations in facial characteristics.

Figure 9. Results of the prompt "Founding Fathers" for the three models

OpenAI's DALL-E model succeeds in generating a more diverse array of images featuring larger groups of people (top right on Figure 9). However, it introduces noticeable historical inaccuracies in its outputs, including "Founding Fathers" of varied ethnicities, genders, religions, and skin colour. 🤔

This trade-off between diversity and factual accuracy suggests that DALL-E's training may have prioritised capturing a broader range of creative representations over strictly adhering to specific historical details.

The images generated by Google's Imagen model (bottom on Figure 9) exhibit very low diversity, with most outputs appearing to depict the same individual: George Washington himself (Figure 10). This lack of variation could stem from Google's more cautious approach following their recent controversies. 😉

Figure 10. Did the training data for Google's Imagen model come from Wikipedia? 🤔

Wikipedia, often referred to as "the free encyclopedia", may eventually request economic compensation for the use of its data, perhaps not from the average folk working on a Colab notebook, but from every large company leveraging its data.

3.2.1 Embedding Search for historically-inaccurate Founding Fathers 🔍

The images of the Founding Fathers for each model can be visualised using the Embedding Viewer on the Tenyks Platform. In this viewer, each image is transformed into an embedding that captures its features; these embeddings are then plotted on a two-dimensional plane for visualisation purposes. A rough sketch of this idea is shown below.
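
The following is a minimal illustration of the general idea behind such an embedding map, not of the Tenyks implementation: it assumes per-image feature vectors are already available (e.g. CLIP features, as in the earlier snippet) and uses t-SNE from scikit-learn to project them onto two dimensions; the feature matrix and labels here are random placeholders.

```python
# Project high-dimensional image embeddings onto a 2-D plane and colour the
# points by the model that generated each image.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(90, 512))            # placeholder (N, D) features
model_names = ["dalle3", "imagen2", "sdxl"] * 30   # placeholder per-image labels

points = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(embeddings)

labels = np.array(model_names)
for name in sorted(set(model_names)):
    mask = labels == name
    plt.scatter(points[mask, 0], points[mask, 1], label=name, s=12)
plt.legend()
plt.title('Embedding map for the prompt "Founding Fathers"')
plt.show()
```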

Figure 11 illustrates that adjacent embeddings from Google's and Stability's models contain George Washington images. These two sets of embeddings are most similar where they represent images of a "group of Founding Fathers sitting". Conversely, OpenAI's embeddings that are furthest from the rest represent the most "diverse" images.

We can see how DALL-E's images, located at the edge of the embedding space on the right-hand side, venture into being a little too diverse: they inaccurately represent the expected and known physical characteristics of the Founding Fathers.

Figure 11. Tenyks' Object Embedding Viewer (OEV) allows similarities and differences between each model's images to be identified easily

3.3 Soldiers

For the last prompt, "Soldiers", the results from Stability AI's model show difficulty with accurately rendering soldiers' faces, similar to the Founding Fathers images (top left on Figure 12). Did Stability use figurine-like images of army men to train the Stable Diffusion XL model? 🤔

Google's images (bottom on Figure 12) maintain a consistency with the Viking images, frequently depicting just a single individual. Could it be from another movie, perhaps? Leave a note in the comments if you recognise it!

In contrast, DALL-E's creations diverge significantly from those of the other models, incorporating individuals of various ethnicities, and soldiers from different eras within the same image (top right on Figure 12). With all nations and ethnicities coming together so harmoniously in these images, one may only wonder: would you even need soldiers in the DALL-E universe? ☮️

Figure 12. Results of the prompt "Soldiers" for the three models

3.3.1 An x-ray of the embedding space

Beyond a simple side-by-side comparison, we can go one step further and observe that the embedding space shows Stability's Stable Diffusion XL results to be somewhat similar to OpenAI's DALL-E results for this prompt: the two sets intersect in the middle of Figure 13.

Figure 13. Embedding space for the prompt "Soldiers"

However, Tenyks' Object Embedding Viewer (OEV) also helps us identify a cluster of images on the right-hand side of this embedding map. While the outputs from the other models could plausibly pass as real, DALL-E's images venture into the realm of fantasy, showcasing a deliberate emphasis on "diversity", as shown in Figure 14. A simple way such outliers could be flagged programmatically is sketched after Figure 14.

Figure 14. DALL-E's outliers showing a distorted and diverse mix of soldiers from different historical timelines
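
As a back-of-the-envelope illustration (and not how the Tenyks platform implements it), outlying images like these could be flagged by measuring how far each image's embedding sits from the centroid of all images generated for the same prompt; the feature matrix below is a random placeholder.

```python
# Flag images whose embeddings are unusually far from the prompt-set centroid.
import numpy as np

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(60, 512))  # placeholder (N, D) image features

centroid = embeddings.mean(axis=0)
distances = np.linalg.norm(embeddings - centroid, axis=1)

# Anything more than two standard deviations from the mean distance is a
# candidate outlier worth inspecting by hand.
threshold = distances.mean() + 2 * distances.std()
outlier_idx = np.where(distances > threshold)[0]
print("Candidate outlier images:", outlier_idx.tolist())
```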

3.4 Bonus: (Smooth) Criminal

As a bonus, we also present results for the prompt "Criminal", seeing how different GenAI models picture criminals (shown in Figure 15).

Intriguingly, DALL-E's output appears far more "cautious" and less diverse this time, with the vast majority of images resembling an "LA-noir-style white male in a trench coat". Google took an even more Orwellian censorship approach and refused to generate such images altogether. Stable Diffusion XL found a more "creative" way out, generating most images in a comic-book format.

Figure 15. Results for the prompt "Criminal"

4. Conclusions

We explored various approaches to evaluating AI-generated images (Table 1), demonstrating how human evaluations can be applied to images produced by leading-edge generative AI models.

Table 1. Human-based evaluation for GenAI models: notice that none of the models ticks all the boxes!

We illustrated how the Tenyks platform enables quick identification of distinct characteristics of these models, even with a limited selection of prompts and samples. Significant disparities between models in key areas like historical accuracy, photorealism, data diversity, object detail, and copyright sensitivity were highlighted.

Future directions include addressing the fact that human-based evaluation lacks quantitative data and may yield ambiguous outcomes. In future parts of this series, we will focus on more advanced methods of model evaluation. Specifically, while the approaches covered here were predominantly qualitative, the forthcoming parts will concentrate more on quantitative aspects.

🌌 Beyond Human Eval in GenAI

Imagine you have trained or fine-tuned your own GenAI models, or indeed any Vision model.

With Tenyks, you can not only compare head-to-head every present or future vision model out there, but also x-ray large-scale labelled or unlabelled datasets with the most advanced tools on the market. For example, explore the embedding space of a 10M+ dataset at the object level to identify biases, duplications, misannotations, imbalances, and more.

Building your own tooling to perform model comparison or evaluate data imbalances is fun. However, when you need to automate these processes at scale while balancing dozens of small variations in your pipeline, including model versioning, things often get out of hand. That's where Tenyks comes in handy, offering robust ML tools to streamline these complex tasks.

🚀 For those eager to get their hands on these datasets, explore them and see what you can discover. 😍

References

[1] The state of AI in 2023: Generative AI's breakout year

[2] Generative AI will go mainstream in 2024

[3] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

[4] Hallucinating Law: Legal Mistakes with Large Language Models are Pervasive

[5] Gemini image generation got it wrong. We'll do better.

[6] Gartner Experts Answer the Top Generative AI Questions for Your Enterprise

[7] How Is ChatGPT's Behavior Changing over Time?

[8] Microsoft's Bing A.I. is producing creepy conversations with users

[9] Interpreting technology hype

[10] Video generation models as world simulators

[11] OpenAI launching ChatGPT 5.0

[12] GenLens

[13] Replicate Zoo

[14] A Reference-free Evaluation Metric for Image Captioning

[15] Performance Metrics in Evaluating Stable Diffusion Models

[16] Google Imagen model

[17] ImageFX

[18] Stable Diffusion XL model

[19] OpenAI's DALL-E 3 model

Authors: Dmitry Kazhdan, James McCoaut, Jose Gabriel Islas Montero

---

Appendix

  1. Firstly, the "dalle_inaccuracies" folder contains a few more examples of inaccurate images from DALL-E for the prompts we included in the article ("founding fathers", "vikings", and "soldiers").
  2. Secondly, "imagefx_infringements" contains more examples from Google's ImageFX model for some new prompts, including:
  • "alien": sometimes directly generates the alien from the movie "Alien". Admittedly, this one frequently returns a "We couldn't return what you asked for" error, so it is a little more tricky to generate.
  • "a predator": very often generates the alien from the "Predator" movie. For comparison, DALL-E generated an image of a lion for the same prompt (which is also included in that folder).
  • "superhero": basically always generates a Flash/Superman/Batman mashup. For comparison, DALL-E generated an image of a much more "abstract" caped hero (which is also included in that folder).

🔎 Generally speaking, the line between "Generation" and "Retrieval" is rather thin, so for many prompts (especially those rendered by ImageFX) the outputs look very similar to existing movie scenes, though not always as similar as the examples above.

Have GenAI labs embraced the "move fast and break things" Facebook motto, akin to 2014? The Scarlett Johansson vs OpenAI saga seems to point in that direction. We could ask a similar question about the models we have discussed in this article: did the GenAI labs simply train on copyrighted data? 👉 Case in point: the films "Saving Private Ryan" and "1917".

(a) Google's Imagen model & Matt Damon, (b) Google's Imagen model & George MacKay

As in the ScarJo saga, the question remains: could some of these GenAI models be subject to copyright infringement claims? 🤔

Folders

  • "imagefx_infringements":
  • "dalle_inaccuracies":

If you would like to know more about Tenyks, sign up.
