
DALL·E 2 is a powerful machine learning model developed by OpenAI that can generate entirely new images from a short text prompt. These images can combine distinct and unrelated objects in semantically plausible ways. For example, given the prompt "a moon floating in sea swarmed with dolphins as digital art," DALL·E 2 could generate an image of a moon floating in the sea, surrounded by dolphins, rendered as digital art.
DALL·E 2 can also modify existing images, create variations of images that maintain their salient features, and interpolate between two input images.
To generate images, DALL·E 2 follows a three-step process. First, the text prompt is fed into a text encoder that maps it to a representation space. Next, a model called the prior maps the text encoding to a corresponding image encoding that captures the prompt's semantic information.
Finally, an image decoder generates an image based on this semantic information. The link between textual and visual semantics is learned by another OpenAI model called CLIP (Contrastive Language-Image Pre-training).
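The three-stage pipeline can be sketched in a few lines of Python. The functions below (`encode_text`, `prior`, `decode_image`) are hypothetical stand-ins, not real OpenAI APIs; they exist only to show how data flows from prompt to text embedding to image embedding to image:

```python
def encode_text(prompt: str) -> list[float]:
    # Placeholder: the real text encoder (CLIP's) maps the prompt into a
    # high-dimensional embedding space. Here we fake a tiny 4-d embedding.
    return [float(ord(c) % 7) for c in prompt[:4]]

def prior(text_embedding: list[float]) -> list[float]:
    # Placeholder for the prior, which maps a text embedding to a
    # corresponding image embedding with the same semantic content.
    return [x * 0.5 for x in text_embedding]

def decode_image(image_embedding: list[float]) -> str:
    # Placeholder for the decoder, which renders pixels conditioned on the
    # image embedding. Here we just return a descriptive label.
    return f"image<{len(image_embedding)}-d embedding>"

def generate(prompt: str) -> str:
    # The full DALL·E 2 flow: text -> text encoding -> image encoding -> image.
    return decode_image(prior(encode_text(prompt)))

print(generate("a moon floating in sea swarmed with dolphins"))
```

The point of the sketch is the composition: each stage consumes the previous stage's output, and the decoder never sees the prompt directly, only the image embedding produced by the prior.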
CLIP is trained on hundreds of millions of images and their associated captions, learning how strongly a given text snippet relates to an image. This allows CLIP to learn the link between textual and visual representations of the same abstract object, which is crucial for the operation of DALL·E 2.
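The idea behind CLIP's contrastive training can be illustrated with a toy example: embeddings of a matching image-caption pair should have higher cosine similarity than embeddings of a mismatched pair. The 3-d vectors below are invented purely for illustration (real CLIP embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: the score CLIP-style models use to compare a
    # text embedding against an image embedding.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two images and two captions. Matching pairs
# are deliberately placed close together in the toy embedding space.
image_embs = {
    "dolphin_photo": [0.9, 0.1, 0.0],
    "moon_photo": [0.0, 0.2, 0.9],
}
text_embs = {
    "a dolphin swimming": [0.8, 0.2, 0.1],
    "the moon at night": [0.1, 0.1, 0.95],
}

# For each caption, the most similar image should be its true match.
for caption, t in text_embs.items():
    best = max(image_embs, key=lambda name: cosine(image_embs[name], t))
    print(f"{caption!r} -> {best}")
```

Contrastive training pushes the model toward exactly this geometry: it maximizes similarity for the true image-caption pairs in a batch while minimizing it for all the mismatched combinations.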