Image-generating algorithms, notably Google Research’s Imagen and OpenAI’s DALL-E 2, have attracted a great deal of interest recently. Whether you want to see a painting of a fox sitting in a field at sunrise in the style of Claude Monet or a majestic oil painting of a raccoon queen wearing a red French royal gown, both systems combine natural-language-processing AI with models trained on massive image datasets to produce extremely impressive results.
However, with most people unable to access either of those systems, DALL-E 2’s predecessor, DALL-E mini, has seen a resurgence in popularity. Though it isn’t as stunning or realistic as the newer software, it is simple to use and produces some attractive (and, depending on the supplied text, strange or downright unsettling) results. As it turns out, the only restriction is your imagination. Here are a few of our personal favorites, as well as a couple that will make you wish people didn’t have imaginations.
The Imagen researchers describe their system as follows: “We introduce Imagen, a text-to-image diffusion model with unrivaled photorealism and a deep level of language understanding. Imagen builds on the strength of diffusion models in high-fidelity image generation and draws on the power of large transformer language models for interpreting text. Our key finding is that large language models (e.g. T5) pretrained on text-only corpora are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen improves sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.”
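To see what a diffusion model does at its core, here is a minimal, hypothetical 1-D sketch of the standard denoising-diffusion (DDPM-style) forward process. This is an illustration of the general technique, not Imagen’s actual architecture: in a real system, a neural network (conditioned on the text embedding, in Imagen’s case) predicts the noise instead of the oracle used below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule over T steps (standard DDPM-style choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product \bar{alpha}_t

# Toy "data": samples from a 1-D Gaussian standing in for images.
x0 = rng.normal(loc=2.0, scale=0.5, size=5000)

def q_sample(x0, t, eps):
    """Forward process in closed form: noisy sample x_t from clean data x_0."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.normal(size=x0.shape)
xT = q_sample(x0, T - 1, eps)  # after T steps, roughly standard normal

# An oracle that knows the true noise recovers x0 exactly by inverting
# the forward equation; a trained denoiser approximates this from (x_t, t).
x0_hat = (xT - np.sqrt(1.0 - alpha_bar[T - 1]) * eps) / np.sqrt(alpha_bar[T - 1])
```

Generation then runs this process in reverse: starting from pure noise, the learned denoiser is applied step by step until an image emerges, with the text conditioning steering each step.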
“Without ever training on the COCO dataset, Imagen achieves a new state-of-the-art FID score of 7.27, and human raters judge Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. We use DrawBench to compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that in side-by-side comparisons, human raters prefer Imagen over the other models in both sample quality and image-text alignment.”
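For readers unfamiliar with the metric, FID (Fréchet Inception Distance, where lower is better) compares two sets of images by fitting a Gaussian to the Inception-network features of each set and measuring the distance between the two Gaussians. A minimal sketch of that final distance computation, assuming the feature means and covariances have already been extracted:

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1), N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical feature statistics give a score of 0, and the score grows as the generated-image distribution drifts from the real one, which is why a low score like Imagen’s 7.27 on COCO is notable.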