Video Generative AI Needs to Process 3D Inputs to Achieve True Realism

Simeon Spencer
Nov 27, 2023
5 min read

In the realm of artificial intelligence, video-generative AI occupies a curious niche. It's easy to lump it together with other AI marvels like ChatGPT, which crafts text with such finesse that you might mistake it for a human's handiwork, or DALL-E and Stable Diffusion, known for creating images so vivid they seem plucked from reality. The leap, then, is to assume video generative AI can do the same for videos. But that's not quite right.

The hype around video-generative AI often spirals out of a certain misunderstanding. Media outlets have sensationalized its capabilities, particularly with Deepfakes - those eerily realistic video fabrications. Take, for instance, the Deepfake of Ukrainian President Zelensky, which was broadcast on Ukrainian TV, falsely showing him instructing his troops to surrender to Russia.

Such instances might mislead those not closely following current events, but they aren't as sophisticated as they seem. It's crucial to understand that these high-profile Deepfakes, like the one featuring Elon Musk peddling "TeslaCoin," aren't birthed fully formed from video generative AI.

Here's how it actually works: Deepfakes are crafted through a process where AI learns from a vast collection of a person's videos and images. The AI then grafts the learned facial expressions and voice onto a different video. It's a meticulous process, aligning expressions and mannerisms just so. And even then, the end product often requires further refinement through video editing software to achieve a semblance of realism.

So, while video generative AI does play a role in the creation of Deepfakes, it's not yet at a stage where it can conjure up these complex fabrications from thin air. It's a collaborator, not the sole artist.

Video Generative AI Content is Far from Realistic

The current landscape of video-generative AI is a mixed bag. Consider Runway Gen-2, one of the more prominent text-to-video AI tools accessible to the public. Unlike its contemporaries, which typically tweak existing footage, Runway strives to create entirely new content. But how well does it fare?

Here is a 4-second video generated by Runway Gen-2 from the following prompt:

“Close-up of a lady reading a book” with an additional 35mm camera focal length filter

https://video.wixstatic.com/video/60f44b_b65ce2d08c0348fd87bbf3ed8e55e8df/720p/mp4/file.mp4

The initial output seems promising, but as the video rolls, the cracks show. The woman's wrinkles inexplicably fade, her hands are unnaturally distorted, and, most bizarrely, she sprouts a third hand to hold the book. Despite the simple prompt, the result is clearly far from lifelike.

But what if we gave Runway Gen-2 a source image to generate a video from? Could that bring the output to the same level of realism as Deepfakes? We took a stock image from Unsplash and supplied it to Runway alongside the prompt.

Here’s the 4-second video generated by Runway Gen-2 using that image and the same prompt as before:

“Close-up of a lady reading a book” with an additional 35mm camera focal length filter

https://video.wixstatic.com/video/60f44b_62a8c15d1307483fab75a2d5e63245a4/720p/mp4/file.mp4

The result is an improvement: the video looks more convincing at first glance. However, on closer inspection, you notice the woman's face blurring over time, and the book in her hands behaving erratically. It's closer to realism but still misses the mark for true photorealism.

So far, Runway has only been given close-up, single-subject prompts, which are fairly simple for the AI to render. Intrigued by its limitations, I decided to challenge Runway further with a more complex scenario:

“People walking around Times Square in Manhattan, New York”

Times Square, with its iconic status, has graced countless films, adverts, and various media. It's a logical bet that Runway, a video-generative AI, would have been fed a rich diet of Times Square imagery during its training. Yet, when put to the test, the results are somewhat underwhelming.

https://video.wixstatic.com/video/60f44b_51a652300002419fa75b2ffe7e6c5d3f/720p/mp4/file.mp4

Video-generative AI, as it stands, hasn't quite caught up with its siblings in the realms of text, image, and audio generation. This isn't surprising, given the intricate nature of video as a medium. It's a tapestry of moving images, each frame a complex interplay of light, shadow, and motion, far more challenging to synthesize convincingly than a single image or a string of text.

However, the tide may be turning. Stability AI is venturing into this challenging space with its new offering, Stable Video Diffusion. This move signals a potential shift in the capabilities of video-generative AI, possibly bridging the gap between the current state and the more advanced realms of AI in text and image generation.

Stable Video Diffusion (SVD), the latest entrant in the video generative AI arena, comes with a promise of enhanced capabilities. It's set to debut with two distinct models that transform still images into videos. These models can produce sequences of either 14 or 25 frames, with the flexibility to adjust the frame rate anywhere from 3 to 30 frames per second. A feature for directly converting text prompts to videos is also in the pipeline, although it's slated for a future update.
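To put those numbers in perspective, here is a quick back-of-the-envelope calculation of how long such clips actually run. It is only a sketch: the frame counts and frame-rate bounds come from the figures quoted above, while the function name and the bounds check are assumptions made for illustration, not part of any SVD tooling.

```python
# Frame counts and frame-rate bounds as quoted in the paragraph above.
FRAME_COUNTS = (14, 25)
MIN_FPS, MAX_FPS = 3, 30


def clip_duration_seconds(frames: int, fps: int) -> float:
    """Playback length of a clip: frames divided by frames per second.

    Hypothetical helper for illustration only.
    """
    if not MIN_FPS <= fps <= MAX_FPS:
        raise ValueError(f"fps must be between {MIN_FPS} and {MAX_FPS}")
    return frames / fps


for frames in FRAME_COUNTS:
    shortest = clip_duration_seconds(frames, MAX_FPS)
    longest = clip_duration_seconds(frames, MIN_FPS)
    print(f"{frames} frames: {shortest:.2f}s at {MAX_FPS} fps, "
          f"{longest:.2f}s at {MIN_FPS} fps")
```

In other words, even at the slowest frame rate a 25-frame clip lasts only a little over eight seconds, and at 30 frames per second it is over in under a second.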

Feedback prior to its release indicates a preference for SVD's output over that of established players like Runway and Pika Labs, suggesting a leap forward in video AI quality. Despite the buzz, a specific release date for SVD remains unannounced.

Now, the question is: what could be the key to enabling video generative AI to create realistic videos?

Could Training on 3D Model Data Be the Key to Video Realism?

CGI animators have become increasingly capable of animating lifelike CGI people; Paul Walker in the final scenes of Furious 7 is one example. But as seen above, video generative AI does not come close in quality, despite being trained on large datasets of videos and images. A likely reason is that video generative AI models typically do not train directly on complex 3D model files (like those used in CGI software, such as .obj, .fbx, .3ds, or Alembic .abc files). Instead, they more commonly train on rendered images or videos, as the rendered output is closer to the final content that AI models aim to generate.
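To make the contrast concrete, here is a minimal sketch of the kind of information a 3D model file states outright. It reads a tiny hand-written snippet in the Wavefront .obj text format (a made-up square built from two triangles, not a real asset) and works out which way each surface faces. The function names are hypothetical, and the parser handles only plain vertex and face lines with positive indices. A rendered frame of the same square would contain only pixel colors, leaving a model to guess the shape and orientation.

```python
# A hypothetical, hand-written OBJ snippet: a unit square made of two triangles.
OBJ_TEXT = """
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 1.0 1.0 0.0
v 0.0 1.0 0.0
f 1 2 3
f 1 3 4
"""


def parse_obj(text):
    """Collect vertex positions ("v" lines) and faces ("f" lines)."""
    vertices, faces = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices.append(tuple(float(x) for x in parts[1:4]))
        elif parts[0] == "f":
            # OBJ indices are 1-based; entries like "7/2/5" keep only the
            # vertex index. Negative (relative) indices are not handled.
            faces.append(tuple(int(p.split("/")[0]) - 1 for p in parts[1:]))
    return vertices, faces


def face_normal(vertices, face):
    """Unnormalized surface direction: cross product of two triangle edges."""
    a, b, c = (vertices[i] for i in face[:3])
    u = tuple(b[i] - a[i] for i in range(3))
    w = tuple(c[i] - a[i] for i in range(3))
    return (
        u[1] * w[2] - u[2] * w[1],
        u[2] * w[0] - u[0] * w[2],
        u[0] * w[1] - u[1] * w[0],
    )


vertices, faces = parse_obj(OBJ_TEXT)
for face in faces:
    print(face, "faces direction", face_normal(vertices, face))
```

Both triangles report a direction of (0.0, 0.0, 1.0): the file says exactly where the surface is and which way it points, with no inference needed.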

The detailed information that 3D model files carry about geometry, textures, lighting, and animation is exactly the type of data missing from the videos generated by generative AI. Now, I’m no AI scientist or dev myself, and if any AI scientist or dev is reading this, you will probably want to strangle me for mentioning this. I get it, training AI on complex 3D models isn't just adding another ingredient to a soup; it's almost like reinventing the recipe. But imagine the possibilities: an AI trained on 3D models could gain a profound understanding of space, shape, and shadows. This could catapult video-generative AI into realms of realism and accuracy, potentially going toe-to-toe with the CGI of today, especially for scenes bursting with dynamic movements and interactions.

There are currently generative AI tools built purely for generating 3D models, such as Masterpiece X and 3DFY AI, but they lack the ability to create video content. However, the first video-generative AI to harness 3D model data, either in training or as input, might just be the one to craft the first movie-grade video.

3 Comments

sarah

Nov 27, 2023

The article was well written, effectively showcasing the present capabilities of video GenAI with actual examples. I also liked that the article outlined what could be required for the future of video GenAI to be successfully secured.

Walter Tong

It was interesting seeing the different videos generated based on the various prompts given. However, I think these generative AI videos still require some fine tuning as it is still obvious that they are not able to fully emulate real life videos yet. It will be interesting to see how this technology continues to develop.

guanlong he

Interesting article! This has definitely gotten me thinking about the data points used to create AI-generated images and the potential applications and possibilities of AI trained with 3D data points.
