Where is Midjourney for video? Text to video is coming (but not here yet)
4 text to video research papers, and 3 text to video tools you can use today.
Imagine entering a prompt like ‘woman holding a small octopus underwater’ and getting a video like this:
We can already create an image of the above prompt using text to image systems like Midjourney, Stable Diffusion, and DALL·E.
So why can’t we make a video from a simple text prompt yet?
Why text to video is hard
There are four problems a text to video system has to overcome.
Learned imagery - The system has to know what the world looks like
Learned imagery description - The system has to know how we describe what the world looks like
Learned movement - The system has to know how the world moves.
Learned movement description - The system has to know how we describe how the world moves.
Text to image systems only have to worry about problems 1 and 2.
If you give the prompt ‘ballerina dancing on a wooden stage, surrounded by candles’, the system needs to solve both:
2 - Understand from the text what you are describing.
1 - Know what a ballerina looks like, what candles look like, what a wooden stage looks like, and what they would all look like together.
The fact that text to image systems can do that is still amazing to me.
To gain that capability, billions of images with text labels were collected, and the models were trained to reproduce images from noise. (See this article for how Stable Diffusion works.)
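To make that concrete, here is a minimal sketch of what running one of those text to image models looks like with the Hugging Face diffusers library. The checkpoint name and settings below are just illustrative assumptions, not anything specific to this article:

```python
# A minimal text-to-image sketch using the Hugging Face diffusers library.
# The checkpoint id and settings are illustrative; swap in whatever model you use.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a commonly used Stable Diffusion checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "ballerina dancing on a wooden stage, surrounded by candles"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("ballerina.png")
```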
A text to video system would have to tackle all 4 problems. Not only does it need to know what a ballerina looks like, and how we describe it, but it also has to know how the world moves, and know how we describe that movement.
It is one thing to know what a ballerina looks like, another to know how a ballerina moves.
An additional problem is that a similarly sized collection of videos with text labels does not exist.
Labeling a single image is one thing, but how would you label every frame or every second of a video?
In addition, some researchers question whether we even need to create a labeled video dataset, since text to image models have already been trained.
There have been 4 major research papers released, and there are a few tools we can use (we cover those later with examples), but no text to video model exists yet that can produce high quality, artifact-free, coherent videos.
Let's look at the 4 papers released so far.
Text to video papers
Facebook’s Make-A-Video
Facebook released an impressive paper showing their text to video capabilities. They pair image generation models with unsupervised learning from videos to understand how objects move in the real world.
They did not provide any code or demos.

ByteDance (TikTok)’s MagicVideo
ByteDance released their MagicVideo paper. Their approach has three steps: keyframe generation, frame interpolation between the keyframes, and super-resolution to increase the resolution.
They don’t include code or a demo.
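Since there is no public code, the sketch below is only a hypothetical outline of that three-stage structure, with made-up function names standing in for each stage (none of this is ByteDance's actual implementation):

```python
# Hypothetical outline of the three-stage pipeline described in the paper.
# Every function here is a placeholder; MagicVideo has no public release.
def generate_keyframes(prompt, num_keyframes=8):
    """Stage 1: a text-conditioned diffusion model produces a few low-fps keyframes."""
    ...

def interpolate_frames(keyframes, target_fps=24):
    """Stage 2: a frame-interpolation model fills in the motion between keyframes."""
    ...

def super_resolve(frames, scale=4):
    """Stage 3: a super-resolution model upscales every frame to the final size."""
    ...

def text_to_video(prompt):
    keyframes = generate_keyframes(prompt)
    frames = interpolate_frames(keyframes)
    return super_resolve(frames)
```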
Tsinghua University’s CogVideo
https://github.com/THUDM/CogVideo
Tsinghua University released their CogVideo paper.
They included both a demo and the code, although the demo isn't currently functioning. Also, unfortunately for English speakers, it currently only works with Chinese prompts.
KAIST and Google Research’s ‘Projected Latent Video Diffusion Models’ (PVDM)
KAIST and Google Research released their PVDM paper. They include code but no demo. They focused on generating high quality videos using fewer resources than other systems.
While these papers are amazing, what about text to video tools that you can use today?
Text to video tools you can use today
There are three text to video tools that you can use right now.
ModelScope
Runway Gen2*
Deforum
ModelScope
Developed by Alibaba's DAMO Academy and hosted on Hugging Face, ModelScope is the first freely usable text to video tool.
https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis
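If you'd rather run it locally than through the Space, a minimal sketch with the diffusers library looks roughly like this. The checkpoint id and settings are my assumptions about the commonly shared setup, not an official recipe:

```python
# A minimal sketch of running the ModelScope text-to-video model locally via diffusers,
# assuming the "damo-vilab/text-to-video-ms-1.7b" checkpoint on the Hugging Face Hub.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")

prompt = "A ballerina dancing on a wooden stage, surrounded by candles"
# Note: on newer diffusers versions the output is nested, i.e. .frames[0]
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
print(video_path)
```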
I tried out ModelScope. It is… a cool demo, but the results are not usable.
A redditor made a video about Darth Vader in Walmart using ModelScope and the results are interesting but not great.
RunwayML Gen-1 and Gen-2*
March 2023 Update: Gen-1 is now generally available and no longer behind a waitlist.
Gen-1 does not have text to video capabilities; Gen-2 does.
Gen-1 can modify existing video footage using image or text prompt inputs, but it can't create a video from just a text prompt. Gen-2 has text to video capabilities but is not publicly available yet.



Gen-1 is often confused with text to video, but it can only do style transfer using a text or image prompt.
I took Gen-1 for a test drive to see what it could do.
I uploaded this video of a beautiful breakfast.
And applied the ‘Pen and Ink’ style to it.
I also gave this image as an effect driver
And rendered it again:
While the results are impressive, they still feel like an experiment rather than something I could use in a video today. I'm excited to see how the tool progresses to Gen-2, when you can enter a text prompt and create a video without a source video.
Deforum
I include Deforum on this list, but it's an interesting case. It doesn't try to solve the 4 problems; it just makes beautiful images that meld into one another. A ballerina will become a candle, which becomes a lamp post. It doesn't know how ballerinas move, and doesn't try. But you can create videos from text, and the results can be mesmerizing (and motion sickness inducing 😅).
Here is an example video I made using Deforum.
This video has 4 prompts.
tiny cute swamp bunny, highly detailed, intricate, ultra hd, sharp photo, crepuscular rays, in focus, by tomasz alen kopera
anthropomorphic clean cat, surrounded by fractals, epic angle and pose, symmetrical, 3d, depth of field, ruan jia and fenghua zhong
a beautiful coconut --neg photo, realistic
a beautiful durian, trending on Artstation
Deforum interpolates between them using Stable Diffusion.
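Deforum's real pipeline adds camera motion and feeds each frame back through img2img, so the sketch below is only a simplified illustration of the core idea: blend the text embeddings of two prompts with Stable Diffusion while keeping the starting noise fixed. The checkpoint id, frame count, and file names are illustrative assumptions:

```python
# A simplified prompt-interpolation sketch with Stable Diffusion (not Deforum's
# actual pipeline, which adds camera motion and img2img feedback between frames).
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

def embed(prompt):
    # Encode a prompt into the CLIP text embeddings the UNet is conditioned on.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(device))[0]

start = embed("a beautiful coconut")
end = embed("a beautiful durian, trending on Artstation")

# Re-use the same starting noise for every frame so the composition stays stable
# while the prompt conditioning drifts from one concept to the other.
generator = torch.Generator(device).manual_seed(42)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device=device, dtype=torch.float16,
)

num_frames = 24
for i in range(num_frames):
    t = i / (num_frames - 1)
    prompt_embeds = (1 - t) * start + t * end  # linear blend between the two prompts
    frame = pipe(prompt_embeds=prompt_embeds, latents=latents,
                 num_inference_steps=30).images[0]
    frame.save(f"frame_{i:03d}.png")
```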
I made a second example, using the same prompt as the ModelScope example above: ‘A ballerina dancing on a wooden stage, surrounded by candles’. You can see the results on YouTube here, and in the gif below.
The images are beautiful, but the model doesn't understand how a ballerina would move.
Text to video: not here yet, but coming.
We have 4 papers, 3 tools, but nothing that can create a high quality video out of thin air from just a text prompt.
However, AI is moving so fast that I predict there will be a text to video model usable for creating videos within 6 months.
Thanks for reading. See you next week!
-Josh