We Should All Be Afraid of OpenAI’s New AI Video Generator
Yesterday, Sam Altman, CEO of OpenAI, announced Sora, a new AI-powered video generator. Like DALL-E and ChatGPT before it, Sora accepts prompts in natural language, interprets the request, and returns what was asked for. Only instead of generating text responses or images, Sora generates full, realistic video better than any AI program I’ve ever seen. I don’t mean this as a compliment.
Initial impression of Sora: Terror.
On the Sora announcement page, OpenAI has a series of videos showing what it can do, and they are amazing – in the worst way. Sora can create animated content, such as a “short furry monster kneeling next to a melting red candle” or a “cartoon kangaroo dancing at a disco.” While the end results don’t match the quality of, say, Pixar or DreamWorks, overall they look professional (and some definitely look better than others). I doubt many people would guess at first glance that no humans were involved in making them.
But while its animation potential is quite unsettling, it’s the realistic videos that are downright chilling. OpenAI showed off “drone footage” of a historic church on the Amalfi Coast, a parade of people celebrating Chinese Lunar New Year, and a tracking shot of a snow-covered street in Tokyo, and I promise you would think these videos were real at first glance. I mean, some of them still don’t look like AI creations to me, and I know they are.
Even the ones with telltale AI flaws, such as objects warping and shifting, could be mistaken for video compression artifacts. There’s a video of puppies playing in the snow, and while there are some glitches you’ll notice once you know it’s not real, the physics and image quality sell the illusion. Why aren’t any of these puppies real? They clearly love snow. God, are we already living in the Matrix?
How does Sora work?
While we don’t have all the details, OpenAI describes Sora’s core processes in its technical report. First, Sora is a diffusion model. Like AI image generators, Sora creates a video by essentially starting with a frame full of static noise and removing that noise, step by step, until it resembles the video you asked for.
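To make that idea concrete, here is a minimal, purely illustrative sketch of a diffusion-style sampling loop in Python. The `predict_noise` function is a hypothetical stand-in for the large trained network that would actually do the denoising; none of this reflects OpenAI’s real implementation.

```python
import numpy as np

# Purely illustrative: a toy "denoising" loop in the spirit of a diffusion model.
# In a real system, predict_noise() would be a large trained neural network
# conditioned on the text prompt; here it is just a placeholder.

def predict_noise(x, step, prompt):
    # Hypothetical stand-in for the trained denoiser.
    return 0.1 * x

def generate(prompt, shape=(16, 64, 64, 3), steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                   # start from pure static noise
    for step in reversed(range(steps)):
        x = x - predict_noise(x, step, prompt)       # strip away a little "noise" each step
    return x                                         # after many steps: the generated "video"

video = generate("a cartoon kangaroo dancing at a disco")
print(video.shape)  # (frames, height, width, channels)
```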
Sora is trained on units of data called patches. These are created by compressing images and videos into a “lower-dimensional latent space” and then breaking them down into “spacetime” patches, the units the model actually understands; each patch carries both spatial and temporal information for a chunk of the video. Sora generates the video in this latent space, and a decoder then maps the result back into “pixel” space, producing the final output.
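Here is an equally rough sketch of what splitting a compressed video into spacetime patches might look like. The latent shape, patch size, and function name are all assumptions made for the sake of the example; OpenAI hasn’t published these specifics.

```python
import numpy as np

# Illustrative only: splitting a compressed "latent" video into space-time patches.
# The latent shape, patch size, and names here are assumptions, not OpenAI's.

def to_spacetime_patches(latent, t=2, h=4, w=4):
    T, H, W, C = latent.shape
    return (latent
            .reshape(T // t, t, H // h, h, W // w, w, C)
            .transpose(0, 2, 4, 1, 3, 5, 6)    # group values by patch position
            .reshape(-1, t * h * w * C))       # one flat vector per space-time patch

latent_video = np.random.rand(8, 32, 32, 4)    # stand-in for the compressed video
patches = to_spacetime_patches(latent_video)
print(patches.shape)  # (256, 128): 256 patches, each a 128-number vector
```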
However, the company does not say where this video and image data came from. (Curious.) OpenAI says Sora builds on research from its DALL-E and GPT models, and uses the same re-captioning technique as DALL-E 3, which helps the model follow users’ descriptive prompts more faithfully.
What else can Sora do?
While generating videos from text prompts is the headline feature, OpenAI claims Sora can also generate videos from still images. Apple researchers are working on a similar process with their Keyframer program.
It can also extend an existing video forward or backward in time. OpenAI showed an example of this with a video of a tram in San Francisco: it added about 15 seconds of extra video to the beginning in three different ways, so all three start out looking different but end by syncing up with the same original clip. The same technique can be used to create “perfect loops.”
OpenAI also believes Sora shows promise for simulating worlds. (Great!) It can create videos with consistent 3D elements, so people and objects stay put and interact the way they should. Sora doesn’t lose track of people or objects when they leave the frame, and it can remember actions that leave a mark on its “world,” such as someone painting strokes onto a canvas. It can even generate Minecraft gameplay on the fly, simulating the player while generating the world around them.
Sora isn’t perfect
To OpenAI’s credit, it points out Sora’s current weaknesses and limitations. The model may have trouble accurately reproducing the physics of a “complex scene,” as well as certain cause-and-effect situations, the company says. OpenAI gives the example of a video of a person eating a cookie: when you see the cookie afterwards, there are no bite marks in it. Apparently rendering glass shattering is also a problem.
The company also says Sora may mix up the “spatial details” in your prompt (confusing left and right, for example) and may struggle to accurately depict events that unfold over time.
Some of these limitations can be seen in the videos OpenAI offers as examples of Sora’s “mistakes.” When prompted to create a man running, Sora produces a man running the wrong way on a treadmill; when the prompt asks for archaeologists discovering a plastic chair in the desert, the “archaeologists” pull a sheet out of the sand and the chair essentially materializes out of nowhere. (That one is especially fun to watch.)
The future is not now, but it is very soon
If you scroll through Sora’s announcement page, you may have a mini panic attack. But keep in mind: aside from the videos OpenAI flags as mistakes, these are the best videos Sora can create right now, hand-picked to demonstrate its capabilities.
Following the announcement, Sam Altman took to Twitter and asked users to reply with prompts so he could run them through Sora. He tweeted the results of about eight of them, and I doubt any would have made it onto the announcement page. The first attempt, “a half-duck, half-dragon flying through a beautiful sunset with a hamster dressed in adventure gear on its back,” was laughably bad and looked like something out of the first draft of a 2000s straight-to-DVD cartoon.
On the other hand, the result for “two golden retrievers podcasting on top of a mountain” was confusing: it looks like someone took stock footage of each element and quickly layered them on top of one another. It doesn’t look “real” so much as photoshopped, which again raises the question of what, exactly, Sora was trained on.
These quick demos did make me feel a little better, but nothing more. I don’t think Sora is at the point where it can create realistic, indistinguishable videos on a whim. There are probably thousands upon thousands of results that OpenAI went through before settling on the highlights we see in its announcement.
But that doesn’t mean Sora isn’t scary. It won’t take much more research or time to improve it. I mean, this is where AI video generation was just 10 months ago. I wonder what Sora would produce if given the same prompt.
OpenAI insists it is taking the proper precautions: it is currently working with red teamers on harm-reduction research, and it wants to watermark Sora-generated content, as other AI programs do, so you can always tell when something was created with OpenAI’s technology.
But I mean, come on: some of these videos are too good. Never mind the ones that might fool you at first glance but look fake on a second viewing; it’s hard to believe some of these videos aren’t real at all. If this stuff can fool those of us who look at AI content for a living, how is the average social media user supposed to know that a realistic video in their Facebook feed was made by robots?
Not to put too fine a point on it, but more than 50 countries will hold high-stakes elections this year, and in the United States, AI has already been used to try to deceive voters, and that was just with audio. You’ll really have to turn your bullshit detectors up to maximum this year, because I think we’ll see some of the most convincing multimedia scams and disinformation campaigns yet.
You guys better hope these watermarks actually work. It’s going to be a wild ride.