I’ve been transfixed by the images generated by DALL-E 2 from OpenAI. The name, a nod to Salvador Dalí and WALL-E, the cute autonomous robot picking up after man destroys the world, rings all kinds of surrealist, dystopian bells. When I looked at the images I had two thoughts:
This is like a fever dream (surreal, sure) come real
The metaverse is going to be wild and incredible
Ilya Sutskever, cofounder and chief scientist at OpenAI:
"One way you can think about this neural network is transcendent beauty as a service… Every now and then it generates something that just makes me gasp."
Transcendent beauty as a service. That’s an incredible phrase to associate with an AI.
Remember all those claims that AI was coming for accountants and truckers, leaving creativity and art as the last bastion of human creation? Well, that lasted for a wink. Looks like transcendent beauty will now be available at the touch of a button (or a few words typed in). What art are you then creating, human?
How does DALL-E 2 work?
Here’s some mumbo-jumbo on how it works (skip to the next section if you’re not interested)
A key difference between DALL-E and DALL-E 2 is that while the first version was based on a GPT-3-like text model, DALL-E 2 was fully redesigned. At a very basic level, if GPT-3 was autocomplete for text, think of DALL-E 2 as autocomplete for pixels. It predicts the best pixels to generate based on the enormous library of images it has learnt from, and then produces those pixels:
Under the hood, it works in two stages. First, it uses OpenAI’s language-model CLIP, which can pair written descriptions with images, to translate the text prompt into an intermediate form that captures the key characteristics that an image should have to match that prompt (according to CLIP). Second, DALL-E 2 runs a type of neural network known as a diffusion model to generate an image that satisfies CLIP.
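To make the first stage of that pipeline concrete, here is a toy sketch of the matching idea behind CLIP: the text prompt and candidate images are mapped into the same embedding space, and cosine similarity scores how well they match. The vectors below are made-up four-dimensional stand-ins, not real CLIP embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the vectors' lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for real CLIP outputs
text_emb = [0.9, 0.1, 0.3, 0.2]          # "a photo of a flower shop"
matching_img = [0.8, 0.2, 0.35, 0.15]    # an image that fits the prompt
unrelated_img = [0.1, 0.9, 0.05, 0.7]    # an unrelated image

# The matching image scores higher against the prompt
print(cosine(text_emb, matching_img) > cosine(text_emb, unrelated_img))
```

The real model learns those embeddings from hundreds of millions of image–caption pairs; the point here is only that "does this image fit this text?" becomes a simple geometric comparison.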
The key phrase here is diffusion models. Watch this video for the nitty-gritty of why diffusion models are so good at image generation (as opposed to Generative Adversarial Network models)
At a high level, diffusion models are slower and more iterative at de-noising images, and they produce highly detailed output images. The model used by OpenAI (GLIDE) also takes the text prompt into account. So in short, the neural net starts from noise (just a random collection of pixels) and iteratively adds the right pixels based on the text tokens generated from the prompt (this is an oversimplification). An additional model (CLIP), focused on the accuracy of the text-to-image mapping, generates a score for how close the image is to the text tokens from the prompt. And after all of this math emerges transcendent beauty.
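The iterative denoising idea can be sketched in a deliberately minimal way. This is an illustration of the loop, not the actual GLIDE math: the "target image", step count, and update rule are all made up, and a real diffusion model predicts the noise to subtract (conditioned on the prompt) rather than knowing the target:

```python
import random

random.seed(0)

# Stand-in "real image": four pixel intensities (made up for illustration)
TARGET = [0.2, 0.9, 0.5, 0.7]
STEPS = 50

# Start from pure noise: a random collection of pixels
image = [random.random() for _ in TARGET]

for _ in range(STEPS):
    # Each iteration removes a little noise by nudging every pixel
    # a fraction of the way toward the target image
    image = [px + 0.1 * (t - px) for px, t in zip(image, TARGET)]

# After enough steps, the noise has resolved into the target
error = max(abs(px - t) for px, t in zip(image, TARGET))
print(f"max pixel error after {STEPS} steps: {error:.5f}")
```

The essential shape is the same: many small steps, each one turning something slightly less noisy into something slightly more image-like.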
It’s not perfect - nowhere close to perfect
DALL-E 2 isn’t perfect (far from it). Based on early-adopter feedback:
Images are cherry-picked. Several prompts produce junk, which is discarded. So in a sense there’s a little bit of the infinite monkey theorem going on (not the same infinitesimal probabilities, but more on that a little later)
Proportions and details may be just that little bit off. As an example, see this image below of a photo of a flower shop.
The amazing thing is that it was generated by DALL-E 2 from the following phrase: “A photo of a quaint flower shop storefront with a pastel green and clean white facade and open door and big window.”
Now, look closer at the door. It’s nonsensical: like some Escherian stairwell with no beginning or end, it isn’t hinged the way a door should be and has no way of functioning as one. These are the nuances that break when you look at specifics within the picture.
Struggles with generating people in images, as this article articulates. This could also be intentional though:
This could be deliberate, for safety reasons -- realistic images of people are much more open to abuse than other things. Porn, deep fakes, violence, and so on are much more worrisome with people. They also mentioned that they scrubbed out lots of bad stuff from the training data; possibly one way they did that was removing most images with people.
Images containing text often render it as nonsensical, incoherent strings of characters
But at a quick glance it is enough to stun us. The fact that it can produce extremely high-fidelity graphics, cartoons, or even photorealistic images of something that doesn’t exist in the real world feels like an incredible amount of progress for AI.
Infinite monkeys
We’ve all heard some version of the infinite monkey theorem:
…a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, such as the complete works of William Shakespeare.
We are talking about vanishingly small probabilities here. Imagine as many monkeys as there are atoms in the universe, typing extremely fast and for a very long time (a trillion times the age of the universe). Even then, the likelihood that they’d produce a complete book is extraordinarily small.
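A back-of-the-envelope calculation shows just how small. Assume a 26-key typewriter with uniform, independent keystrokes, and take Shakespeare’s complete works as roughly 3.7 million letters (both simplifying assumptions):

```python
import math

KEYS = 26                 # assume a 26-letter typewriter, uniform keystrokes
WORKS_LEN = 3_700_000     # rough letter count of Shakespeare's complete works

# The number of equally likely typescripts of that length is KEYS**WORKS_LEN.
# That number is too big to print, so compute its order of magnitude instead.
digits = WORKS_LEN * math.log10(KEYS)
print(f"odds of one random attempt succeeding: about 1 in 10^{digits:,.0f}")
```

For scale: there are about 10^80 atoms in the observable universe, so even that many monkeys making a billion attempts each shaves only ~89 off an exponent of about 5.2 million.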
Still, the monkeys here are an abstraction for the idea that randomness can produce patterns that are beautiful and meaningful to humans. “Almost surely” here means probability 1 (an event can have probability 1 without being logically certain).
The problem of infinite monkeys is not one of creation. If you type infinite gibberish at random, all the works ever produced in the world will be in there somewhere. But who is going to sort through it and find them? The monkeys themselves have no concept of which output is a great work and which is random letters strung together. The problem, then, is one of curation.
Forget entire works and instead think about the first lines of novels. I love them. Often in bookstores, I browse book after book just to read the first line and feel that jolt of excitement of a new story beginning. “First sentences are doors to worlds,” said Ursula K. Le Guin.
For instance, take the first line of The Hitchhiker's Guide to the Galaxy (one of the all-time greats):
“The story so far: in the beginning, the universe was created. This has made a lot of people very angry and has been widely regarded as a bad move.”
If that isn’t the perfect setup for a book of philosophical truths couched as rip-roaring humour, I don't know what is.
Now, could you ever imagine infinite monkeys producing this line? It seems improbable and yet, in mathematical terms, they almost surely will.
It’s a disturbing idea. And a sad one.
While the infinite monkey theorem is a mathematical mental model, it represents a very human curiosity. What makes us human? What breeds creativity?
More importantly, does intent matter?
If a random generator produced a work of fiction, say the Harry Potter series, and somehow, magically, it was released at the same time as J.K. Rowling published her books, are the two exactly the same? On one hand, one was an accident of randomness. On the other, a human, however flawed she may be, exercised an incredible creative process to produce the book.
So, do intent and process make a difference? In the future, will we ascribe higher value to human-produced work?
Exquisite handcrafted novels anyone?
These philosophical debates have raged for a long time, but now, with exponential compute power and neural networks, the infinite monkey theorem is not a thought experiment anymore. DALL-E 2, GPT-3, and DeepMind are all running universes of bit-monkeys that can resolve randomness into human-like (or superhuman) creative outputs at terrifying speeds.
DALL-E 2 Fears
Coming back to DALL-E 2, there are several obvious things that are a little scary:
It can produce terrifyingly realistic deep fakes that can be used to worsen the miasma of misinformation we live in or to harass and troll people.
It may put stock photo businesses, artists and designers out of work. I am not sure if this is going to happen any time soon - it might actually supercharge artists to produce incredible works of imagination in the meantime.
It can create copyright and ownership issues by producing images that look eerily similar to ones produced by a human artist or creator. Another human can feed painstakingly created base art into the model, generate derivatives, and go on to make a lot of money off that work with little real effort put in.
It has the potential to erase the value of human generated graphical content completely over time.
Proliferation of offensive content, or content that perpetuates biases, stereotypes, and so on.
The last point is a larger risk with AI systems in general. To be fair, the OpenAI team has been actively calibrating for and working on most of these risks and more. The risk assessment paper is a really interesting read and goes into a detailed breakdown of these topics. But upending human creative potential, and the psychological impact that would have on humanity, isn’t a risk anyone plans for.
One of the biggest strengths humans value is the freedom to make choices, but in the future this may only get more and more diluted:
“The most-feared reversal in human fortune of the AI age is loss of agency. The trade-off for the near-instant, low-friction convenience of digital life is the loss of context about and control over its processes. People’s blind dependence on digital tools is deepening as automated systems become more complex and ownership of those systems is by the elite.”
When we have all the time in the world, comforts at the click of a button and potentially anything available for cheap and near instantly, what are we left to do or even create?
On the bright side… the metaverse is going to be wild
If we look beyond the scares though, DALL-E 2 opens up a wonderful world of possibilities. Read this thread from an artist on how she used DALL-E to produce surprising and delightful imagery:
Image generation from text prompts is going to keep getting wildly better. Several models apart from DALL-E already offer text-to-image generation:
GauGAN AI from NVIDIA, which can generate a scene on the fly from words and then lets artists manipulate and edit individual objects
The Dream app, which lets you create AI-generated, painting-like work from text prompts. Below is an image I produced with the prompt “Metaverse indian myth float.” I think I am off to sell this as an NFT next.
The applications of this are endless. Imagine the Photoshop of the very near future, but one driven by natural language processing and as easy to use as speaking and imagining. Subtle changes could be made through prompts instead of tweaking a dozen dials in Photoshop or Lightroom: “Remove all people from this photo” or “Add my pet dog to this photo” could be prompts that instantly produce realistic-looking altered versions of photos you take. We are very close to this already, but soon it will be indistinguishable from professional work.
Imagine this taken up several notches in the next few years as we immerse deeper and deeper into the metaverse. The messiah of the metaverse, Mark Zuckerberg, is already thinking about and working towards it:
In a pre-recorded demo video, Zuckerberg walked viewers through the process of making a virtual space with Builder Bot, starting with commands like “let’s go to the beach,” which prompts the bot to create a cartoonish 3D landscape of sand and water around him. (Zuckerberg describes this as “all AI-generated.”) Later commands range from broad demands like creating an island to extremely specific requests like adding altocumulus clouds and — in a joke poking fun at himself — a model of a hydrofoil.
We are far from the territory of three-dimensional worlds being rendered on the fly from prompts, but we can see a path towards it. NVIDIA is already working on turning two-dimensional images into three-dimensional models. Google’s DeepMind is working on the ability to turn internet photos into full 3D models.
Perhaps in less than a decade we will have complete, AI-generated three-dimensional worlds rendered on the fly. Imagine being able to instantly visualise the interior of a house you are planning with simple word prompts. Or games where you build your world with simple cues, then go play in it and invite your friends. Perhaps customers could generate models of new products (think sneakers or cars!) and these could get voted into a company’s portfolio of products. Design could be crowdsourced like never before!
And eventually, all roads will lead us into the metaverse: glorious, weird, full of possibilities, and perhaps a tad scary.
“It was the dawn of a new era, one where most of the human race now spent all of their free time inside a videogame” - Ready Player One
I will leave you with this GIF.
Could be worse,
Tyag