The future of video: OpenAI's text-to-video model Sora

It is here. OpenAI has announced a text-to-video model. After text-to-image (DALL-E) and text-to-text (ChatGPT), the company is now introducing text-to-video: Sora.

A text-to-video AI transforms text into video by interpreting the content, developing a scenario, and selecting or creating visual and audio assets. Those are stitched together and refined to produce the final result. The complexity lies in understanding the nuances of text, like context, tone, and implicit information, and translating them accurately into visuals. It requires advanced AI techniques, including natural language processing and computer vision, to identify and visualise different elements such as emotions, actions, and settings. That makes building an accurate, convincing text-to-video AI both challenging and time-intensive.

OpenAI is not the first to work on text-to-video AI. Companies like Google DeepMind and Adobe have been developing similar technologies, each with its own approach. Google DeepMind has focused on improving machine learning models that understand and generate complex video content. Adobe, known for its creative software, has been exploring AI-driven video editing tools that turn text into video. Startups like Synthesia and Lumen5 also produce video content from text for commercial use, which gives marketers and creators a wide range of options.

About nine months ago, a Reddit user posted a video to the Stable Diffusion subreddit, generated with the ModelScope text-to-video model from the prompt: "Will Smith eating spaghetti and meatballs". The result looked like this:

Now, nine months later, OpenAI is unveiling Sora. The footage speaks for itself:

Are we there yet? Not quite, but we are close. You are watching these clips knowing they are AI-generated, which means you are looking for tells. If you did not know, you probably would not notice. Right now these models are best suited to stock video or B-roll: the supporting footage used to add visual variety and reinforce a story between the main shots of a video.

There is no official release date yet, but we hope to see more of the model soon. OpenAI has only just published a blog post announcing Sora; it is still a research project and the official paper is not out, so general availability could still be some way off. On the other hand, these models are developing extremely fast and OpenAI faces growing competition, so the company has an interest in getting a new model out quickly. For now, we are keeping a close eye on the latest developments.
