Sora, VFX, and the GenAI Valley of Unknowns.


Introduction

The rapid advancement of generative AI (GenAI) stands to disrupt the content creation landscape, particularly in the realm of visual effects (VFX). As we stand at the precipice of this transformative era, it is crucial to examine what these developments mean for the VFX industry. We will explore the evolution of multimodal AI, its impact on video content creation, and the challenges and opportunities that lie ahead for VFX professionals facing the GenAI valley of unknowns.

Before we look forward, let's look back at the transformative journey of multimodal AI and how it led to Sora. If you wish to skip the history, jump ahead to our thoughts on the impact for VFX.

The Evolution of Multimodal AI

Transformer models were introduced by Vaswani et al. in 2017. These models were initially developed to address limitations in processing sequential data for language tasks, employing self-attention mechanisms to assess the relevance of words within a sentence.
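To make the self-attention idea concrete, here is a minimal, illustrative sketch of scaled dot-product attention in Python with NumPy. The function names, shapes, and random weights are assumptions for demonstration only, not the implementation of any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over a token sequence.

    x: (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise relevance between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each token's attention distribution
    return weights @ v                               # each output mixes all tokens, weighted by relevance

# toy example: 4 "word" tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

The key point for what follows is that nothing in this mechanism cares that the tokens are words; any sequence of vectors can be attended over in the same way.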

As transformers proved adept at managing sequences of data, they quickly became foundational to large language models (LLMs) such as OpenAI's Generative Pretrained Transformer (GPT) series (e.g. ChatGPT).

An unexpected emergent property of LLMs is that these models could generate coherent and contextually relevant text over extended narratives, demonstrating a deep comprehension of human language and knowledge.

These properties make LLMs surprisingly good at tasks like code generation, debugging, and written ideation.

Adapting Transformers for Multimodal AI and Video Content

The adaptability of transformers, however, is not limited to text. Researchers began to explore their potential in processing and generating other data types, notably images and videos. The model's architecture, capable of handling sequential and contextual information, proved remarkably effective in understanding the temporal and spatial dynamics of video content.

For instance, in 2019, inspired by the success of LLMs trained on vast amounts of text data, models like VideoBERT adapted the transformer architecture to understand video content by treating frames as sequential data points, similar to how words are treated in sentences. This allowed for a deeper understanding of video content, enabling tasks like automatic captioning, content categorisation, and even generating video sequences from textual descriptions.
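As a rough illustration of the frames-as-tokens idea (a heavy simplification of how VideoBERT actually quantises clip features into "visual words"), the sketch below assigns each frame's feature vector to the nearest centroid in a pre-learned codebook, yielding a discrete token sequence a transformer can consume. The names, shapes, and random data are assumptions for illustration.

```python
import numpy as np

def frames_to_tokens(frame_features, codebook):
    """Quantise per-frame feature vectors into discrete 'visual word' ids.

    frame_features: (num_frames, feat_dim) features extracted from video frames
    codebook:       (vocab_size, feat_dim) cluster centroids learned offline
    """
    # squared distance from every frame to every codebook entry
    dists = ((frame_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # token id = index of the nearest centroid

# toy example: 16 frames, 64-dim features, a 1000-word visual vocabulary
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 64))
codebook = rng.normal(size=(1000, 64))
tokens = frames_to_tokens(features, codebook)
print(tokens.shape)  # (16,) -- one token per frame, analogous to words in a sentence
```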

Following that, countless text-to-image models appeared (covered in another article), followed by more notable text-to-video models such as ModelScope from Alibaba's vision lab.

While the outputs were impressive for a first iteration of generative video, it was obvious they were AI-generated. One of the more infamous examples from this model is the video below of Will Smith eating spaghetti.

Then OpenAI entered the ring. Earlier this year, the company released beta test footage from Sora, its multimodal model that turns text into video.

Watch Air Head.

Large Language Models (LLMs) like ChatGPT process text by taking words or subwords as tokens, analysing these individual components to understand and generate text based on learned patterns. Extending this conceptual framework to video, OpenAI's Sora compresses videos by converting them into a 'spacetime' representation. This technique allows Sora to treat patches of video frames—both spatial across individual frames and temporal across multiple frames—as analogous to "tokens" in text processing.

Sora's method involves encoding standard videos into a representation common in machine learning called a latent space, then building a unified representation in which each patch can include information from multiple frames over time, not just a single moment. This allows the model to capture dynamic changes within the video, such as movement or varying lighting conditions, much like how LLMs track changes in context or meaning from one word to the next.

Just as LLMs analyse relationships and contexts between tokens (words or subwords), Sora analyses the relationships between these spacetime patches. This analysis helps in understanding motion, changes over time, and other video-specific dynamics.

By efficiently encoding these spacetime patches, Sora compresses the video data, prioritising patches that carry more novel or significant information across frames, similar to how more contextually important or less predictable words might be given more focus in text-based models.
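OpenAI has not published Sora's code, so the following is only a hedged sketch of what carving a latent video into spacetime patches might look like, assuming a latent tensor of shape (frames, height, width, channels); the patch sizes, names, and random data are illustrative assumptions.

```python
import numpy as np

def to_spacetime_patches(latent, t=2, p=4):
    """Split a latent video tensor into flattened spacetime patches ('tokens').

    latent: (T, H, W, C) latent representation of a video
    t, p:   temporal and spatial patch sizes
    Returns (num_patches, t * p * p * C) -- one row per spacetime token.
    """
    T, H, W, C = latent.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "dims must divide patch sizes"
    # carve the volume into non-overlapping t x p x p blocks
    patches = latent.reshape(T // t, t, H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # group the block indices together
    return patches.reshape(-1, t * p * p * C)          # flatten each block into one token

# toy latent video: 8 frames of a 16x16 latent grid with 4 channels
latent = np.random.default_rng(0).normal(size=(8, 16, 16, 4))
tokens = to_spacetime_patches(latent)
print(tokens.shape)  # (64, 128): 64 patches, each spanning 2 frames and a 4x4 region
```

Attention over these flattened patches would then play the same role that attention over word tokens plays in an LLM.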


Emergent Properties

A fascinating result of this is that, to create a realistic-looking video, Sora needs to simulate physics like a game engine, and as a result it exhibits an emergent property of world simulation. In effect, Sora is a physics engine that happens to output a video file.

Whilst there were initially many rumours that OpenAI trained on Unreal Engine 3D data, the 3D simulation capabilities are supposedly purely a phenomenon of the scale at which the model was trained.

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the people/objects within them.

If a system can understand and simulate the physical and causal relationships of the world through text, it can make decisions and predictions with a high degree of accuracy and realism.

The applications and implications of this reach far beyond the M&E industry and the scope of this article, but I may cover my thoughts on them another time.

The Crossroads for VFX

Focusing back on VFX, the industry stands at a crossroads, contending with tight margins that leave room only for the current project, not for iterating on the meta-process.

With studio budgets tightening, an insatiable demand for more high-quality content delivered faster, and fierce competition, content owners and VFX studios find themselves struggling to keep up with a fast-growing and evolving market.

The Valley of Unknowns: Fear vs. Opportunity

This is where GenAI and the valley of unknowns come into play. Artists and studios alike face uncertainty in the midst of great technological change. Fears of being supplanted by AI-powered tools are prevalent in the industry, and they overlook the potential of GenAI as a huge catalyst for growth and creative expression.

Meanwhile, GenAI presents a significant opportunity to enable artists to create much faster.

This is an opportunity to resolve the struggles with post-production costs and slow timelines. The future of professional content creation hinges on embracing these innovations as tools, and attaching rockets to the generation and iteration process.

Building Bridges with Iteration Control

Text is an inherently lossy medium of communication, and the likelihood of a text prompt providing the complete picture is vanishingly small. But it may provide an asset or landscape you could use with your existing footage. The key to bridging the valley then lies in iteration control.

What's needed is a way for creatives to separate generated media into layers, pick what they like from generative tools, and refine it further. Traditional methods like rotoscoping and compositing remain integral for adding the finishing touches, which necessitates a balanced integration of AI tools into existing workflows.
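As a trivial illustration of the "pick what you like and refine it further" idea, the sketch below composites a hypothetical generated RGBA element over existing plate footage with a standard alpha-over blend, the kind of layered control a GenAI toolchain would need to expose. The array shapes, names, and random data are assumptions for the example.

```python
import numpy as np

def alpha_over(generated_rgba, plate_rgb):
    """Standard 'over' composite: place a generated RGBA layer on top of a plate.

    generated_rgba: (H, W, 4) float array in [0, 1] with a straight alpha channel
    plate_rgb:      (H, W, 3) float array in [0, 1]
    """
    rgb, alpha = generated_rgba[..., :3], generated_rgba[..., 3:4]
    return rgb * alpha + plate_rgb * (1.0 - alpha)   # classic A-over-B blend per pixel

# toy frame: a 1080p plate and a generated element with a soft alpha matte
rng = np.random.default_rng(0)
plate = rng.random((1080, 1920, 3))
element = rng.random((1080, 1920, 4))
comp = alpha_over(element, plate)
print(comp.shape)  # (1080, 1920, 3)
```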

Fast generation cycles must be met with fast iteration cycles, so the development of tools that allow for rapid prototyping and refinement is crucial. By collaborating with developers to create intuitive, user-friendly interfaces and new workflows, VFX artists can reimagine slow processes and harness the power of GenAI to enhance their creativity and productivity.

The New Creative Battlefield

Creativity, at its core, is about iteration and control. GenAI tools, if built correctly, have the potential to accelerate the creative process by orders of magnitude.

The new creative battlefield will be the last mile: taking videos from 80% to 100%. Artists will be 5x more productive, quickly iterating to 80% and then taking the work from there to the final pixel.

However, it won't be without challenges. If VFX artists don't help build the bridges across the valley of unknowns in collaboration with developers, they will find their voices eclipsed by prosumer content creators and by vendors creating tools that they can't use. To thrive in this evolving landscape, VFX professionals must be ready to learn new skills that align with the latest advancements in AI and digital content production. This requires not only a willingness to adapt but also active engagement in shaping the tools and technologies that will define the future of the industry.

Moreover, VFX artists and studios will need to be prepared to build new workflows and leave the old ones behind. Embracing new technologies means rethinking traditional processes and being open to radical changes in how projects are executed. This shift will involve exploring new software platforms, adopting more agile methodologies in creation, and potentially redefining roles within creative teams to better leverage AI-driven capabilities. By doing so, they can ensure they remain competitive and continue to produce cutting-edge work in a market that increasingly values speed, efficiency, and innovation.

Closing Thoughts

Generative AI represents a transformative opportunity for the VFX industry, provided it is wholeheartedly embraced by the professional community.

The key to harnessing this potential lies in fostering the right community, guiding the right tooling as it is built, and spearheading the shift towards a more control-focused, iterative AI workflow on the new "last mile" creative battlefield.