OpenAI collapses media reality with Sora, a photorealistic AI video generator



    Hello, cultural singularity—soon, every video you see online could be completely fake.

    On Thursday, OpenAI announced Sora, a text-to-video AI model that can generate 60-second-long photorealistic HD video from written descriptions. While it's only a research preview that we have not tested, it reportedly creates synthetic video (but not audio yet) at a fidelity and consistency greater than any text-to-video model available at the moment. It's also freaking people out.

     

    "It was nice knowing you all. Please tell your grandchildren about my videos and the lengths we went to to actually record them," wrote Wall Street Journal tech reporter Joanna Stern on X.

     

    "This could be the 'holy shit' moment of AI," wrote Tom Warren of The Verge.

     

    "Every single one of these videos is AI-generated, and if this doesn't concern you at least a little bit, nothing will," tweeted YouTube tech journalist Marques Brownlee.

     

    For future reference—since this type of panic will someday appear ridiculous—there's a generation of people who grew up believing that photorealistic video must be created by cameras. When video was faked (say, for Hollywood films), it took a lot of time, money, and effort to do so, and the results weren't perfect. That gave people a baseline level of comfort that what they were seeing remotely was likely to be true, or at least representative of some kind of underlying truth. Even when the kid jumped over the lava, there was at least a kid and a room.

     

    The prompt that generated the video above: "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors."

     

    Technology like Sora pulls the rug out from under that kind of media frame of reference. Very soon, every photorealistic video you see online could be 100 percent false in every way. Moreover, every historical video you see could also be false. How we confront that as a society and work around it while maintaining trust in remote communications is far beyond the scope of this article, but I tried my hand at offering some solutions back in 2020, when all of the tech we're seeing now seemed like a distant fantasy to most people.

     

    In that piece, I called the moment that truth and fiction in media become indistinguishable the "cultural singularity." It appears that OpenAI is on track to bring that prediction to pass a bit sooner than we expected.

     

    Prompt: Reflections in the window of a train traveling through the Tokyo suburbs.

     

    OpenAI has found that, like other AI models that use the transformer architecture, Sora scales with available compute. Given far more powerful computers behind the scenes, AI video fidelity could improve considerably over time. In other words, this is the "worst" that AI-generated video is ever going to look. There's no synchronized sound yet, but that might be solved in future models.

    How (we think) they pulled it off

    AI video synthesis has progressed by leaps and bounds over the past two years. We first covered text-to-video models in September 2022 with Meta's Make-A-Video. A month later, Google showed off Imagen Video. And just 11 months ago, an AI-generated version of Will Smith eating spaghetti went viral. In May of last year, what was previously considered to be the front-runner in the text-to-video space, Runway Gen-2, helped craft a fake beer commercial full of twisted monstrosities, generated in 2-second increments. In earlier video-generation models, people pop in and out of reality with ease, limbs flow together like pasta, and physics doesn't seem to matter.

     

    Sora (which means "sky" in Japanese) appears to be something altogether different. It's high-resolution (1920x1080), can generate video with temporal consistency (maintaining the same subject over time) that lasts up to 60 seconds, and appears to follow text prompts with a great deal of fidelity. So how did OpenAI pull it off?

    OpenAI doesn't usually share insider technical details with the press, so we're left to speculate based on theories from experts and information given to the public.

     

    OpenAI says that Sora is a diffusion model, much like DALL-E 3 and Stable Diffusion. It starts with noise and "gradually transforms it by removing the noise over many steps," the company explains. It "recognizes" objects and concepts listed in the written prompt and pulls them out of the noise, so to speak, until a coherent series of video frames emerges.

     

    Sora is capable of generating videos all at once from a text prompt, extending existing videos, or generating videos from still images. It achieves temporal consistency by giving the model "foresight" of many frames at once, as OpenAI calls it, solving the problem of making sure a generated subject remains the same even if it falls out of view temporarily.
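
    To make that concrete, here's a minimal Python sketch of the diffusion loop described above. This is not OpenAI's code, which is unpublished, and the toy_denoiser function is an invented placeholder for Sora's learned network; the point is simply that generation starts from pure noise spanning every frame at once and subtracts predicted noise step by step:

        import numpy as np

        def toy_denoiser(x, t, prompt_embedding):
            # Stand-in for a learned network that predicts the noise present
            # in x at diffusion step t, conditioned on the text prompt.
            return 0.1 * x  # placeholder prediction

        def generate_video(prompt_embedding, steps=50, frames=16, h=64, w=64, c=3):
            # Start from pure Gaussian noise covering ALL frames at once.
            # Denoising the whole clip jointly, rather than frame by frame,
            # is the "foresight" that keeps subjects consistent over time.
            x = np.random.randn(frames, h, w, c)
            for t in reversed(range(steps)):
                x = x - toy_denoiser(x, t, prompt_embedding)  # one simplified step
            return x  # a (frames, h, w, c) array of generated frames

        video = generate_video(prompt_embedding=np.zeros(512))
        print(video.shape)  # (16, 64, 64, 3)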

     

    OpenAI represents video as collections of smaller groups of data called "patches," which the company says are similar to tokens (fragments of a word) in GPT-4. "By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions, and aspect ratios," the company writes.
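
    As a rough illustration of what those "patches" could look like, the snippet below chops a video tensor into small spacetime blocks, the visual analogue of text tokens. The patch sizes here are arbitrary choices for the example, since OpenAI hasn't published Sora's actual dimensions:

        import numpy as np

        def to_spacetime_patches(video, pt=4, ph=16, pw=16):
            # Split a (frames, height, width, channels) video into flattened
            # spacetime patches, each covering pt frames of ph x pw pixels.
            f, h, w, c = video.shape
            assert f % pt == 0 and h % ph == 0 and w % pw == 0
            patches = (video
                       .reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
                       .transpose(0, 2, 4, 1, 3, 5, 6)
                       .reshape(-1, pt * ph * pw * c))
            return patches  # one row per patch, ready to feed a transformer

        video = np.random.randn(16, 256, 256, 3)
        tokens = to_spacetime_patches(video)
        print(tokens.shape)  # (1024, 3072): 1024 patch "tokens" of 3072 values

    Because any clip can be cut into patches like these regardless of its length, resolution, or aspect ratio, one model can train on a much wider mix of video, which is the unification OpenAI is describing.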

     

    An important tool in OpenAI's bag of tricks is that its use of AI models is compounding. Earlier models are helping to create more complex ones. Sora follows prompts well because, like DALL-E 3, it utilizes synthetic captions, generated by another AI model such as GPT-4V, that describe the scenes in its training data. And the company is not stopping here. "Sora serves as a foundation for models that can understand and simulate the real world," OpenAI writes, "a capability we believe will be an important milestone for achieving AGI."
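
    As a purely hypothetical sketch of that recaptioning step, the snippet below pairs raw clips with machine-written captions. The describe_clip function is an invented placeholder standing in for a captioning model like GPT-4V, not a real API:

        def describe_clip(clip):
            # Invented placeholder: imagine a vision-language model returning
            # a rich, detailed text description of the clip's contents.
            return "a detailed synthetic caption for this clip"

        def build_training_pairs(clips):
            # Pair every raw clip with a machine-written caption, giving the
            # video model richer text to learn from than scraped alt-text.
            return [(clip, describe_clip(clip)) for clip in clips]

        pairs = build_training_pairs(["clip_0.mp4", "clip_1.mp4"])
        print(pairs[0])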

    One question on many people's minds is what data OpenAI used to train Sora. OpenAI has not revealed its dataset, but based on what people are seeing in the results, it's possible OpenAI is using synthetic video data generated in a video game engine in addition to sources of real video (say, scraped from YouTube or licensed from stock video libraries). Dr. Jim Fan of Nvidia, who is a specialist in training AI with synthetic data, wrote on X, "I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!" Until confirmed by OpenAI, however, that's just speculation.

    Sora as world simulator

    Along with Sora, OpenAI released a corresponding technical document called "Video generation models as world simulators." That technical analysis merits a deeper dive than we have time or space for here, but how Sora models the world internally has computer scientists like Fan speculating about deeper things on X. "If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine," he wrote. "It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, 'intuitive' physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths."

    In the technical paper, OpenAI writes, "We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals, and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale."

     

    OpenAI has also found that Sora can simulate Minecraft gameplay to some extent, bringing us one step closer to the potential of what might be called "neural rendering" in video games. Instead of rendering billions of polygons hand-crafted by artists, video game consoles of the future may generate interactive video streams using diffusion techniques in real time.

    Sora is not perfect, however, and OpenAI notes Sora's deficiencies in its technical paper. "It does not accurately model the physics of many basic interactions, like glass shattering," the company writes. "Other interactions, like eating food, do not always yield correct changes in object state." OpenAI also lists "incoherencies that develop in long duration samples" and "spontaneous appearances of objects" as failures.

     

    Here's an example of what happens when Sora doesn't do what you might expect with a glass sitting on a table.

    There's also skepticism that tech like Sora will be a universal solution for video generation. Computer scientist Grady Booch wrote, "I'm beginning to think that, while there will certainly be some economically- and creatively-interesting use cases, I see strong parallels to the domain of no code/low code efforts. In both those visual and programming domains, it is easy to produce splashy demos; it is easy to automate relatively straightforward things. But nudging those systems to get the precise details you want? That's another story."

     

    With a release like this, there are many dimensions of impact to consider, and we'll discuss those in future articles. Already, some are concerned about the implications for the film industry, the source of the training data, and the misinformation or disinformation that could come from being able to synthesize complex, high-resolution video on demand.

     

    As a result, OpenAI says it is currently red-teaming Sora (putting it through adversarial testing) using "domain experts in areas like misinformation, hateful content, and bias" before it sees a public release. Even if OpenAI were to keep Sora locked in a vault forever, if history is any guide, open weights models will eventually catch up and similar technology will be available to all. Our main takeaway is this: If trusting video from anonymous sources on social media was a bad idea before, it's an even worse idea now.

     

    Source

