Where is Midjourney for video? Text to video is coming (but not here yet)
4 text to video research papers, and 3 text to video tools you can use today.
Imagine entering a prompt like ‘woman holding a small octopus underwater’ and getting a video like this:
We can already create an image of the above prompt using text to image systems like Midjourney, Stable Diffusion, and DALL·E.
So why can’t we make a video from a simple text prompt yet?
Why text to video is hard
There are four problems a text to video system has to overcome.
Learned imagery - The system has to know what the world looks like
Learned imagery description - The system has to know how we describe what the world looks like
Learned movement - The system has to know how the world moves.
Learned movement description - The system has to know how we describe how the world moves.
Text to image systems only have to worry about problems 1 and 2.
If you give the prompt ‘ballerina dancing on a wooden stage, surrounded by candles’, the system needs to solve both:
2 - Understand from the text what you are describing.
1 - Know what a ballerina looks like, what candles look like, what a wooden stage looks like, and what they would all look like together.
The fact that text to image systems can do that is still amazing to me.
To gain that capability, billions of images with text labels were collected, and the models were trained to reproduce images from noise. (See this article for how Stable Diffusion works.)
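To make that concrete, here is a minimal sketch of what running one of those text to image models looks like with the Hugging Face diffusers library. The checkpoint name and settings below are just illustrative assumptions, not anything specific to this article:

```python
# A minimal text-to-image sketch using the Hugging Face diffusers library.
# The checkpoint id and settings are illustrative; swap in whatever model you use.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a commonly used Stable Diffusion checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "ballerina dancing on a wooden stage, surrounded by candles"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("ballerina.png")
```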
A text to video system would have to tackle all 4 problems. Not only does it need to know what a ballerina looks like, and how we describe it, but it also has to know how the world moves, and know how we describe that movement.
It is one thing to know what a ballerina looks like, another to know how a ballerina moves.
An additional problem is that a similarly sized collection of videos with text labels does not exist.
Labeling a single image is one thing, but how would you label every frame or every second of a video?
In addition, some researchers question whether we even need to create a labeled video dataset, since text to image models have already been trained.
There have been 4 major research papers released, and there are a few tools we can use (we cover those later with examples), but no text to video model exists yet that can produce high quality, artifact-free, coherent videos.
Let's look at the 4 papers released so far.
Text to video papers
Facebook’s Make-A-Video
Facebook released an impressive paper showing their text to video capabilities. They pair image generation models with unsupervised learning from videos to understand how objects move in the real world.
They did not provide any code or demos.

ByteDance (TikTok)’s MagicVideo
ByteDance released their MagicVideo paper. Their approach has three steps: keyframe generation, frame interpolation between the keyframes, and super-resolution to increase the resolution.
They don’t include code or a demo.
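Since there is no public code, the sketch below is only a hypothetical outline of that three-stage structure, with made-up function names standing in for each stage (none of this is ByteDance's actual implementation):

```python
# Hypothetical outline of the three-stage pipeline described in the paper.
# Every function here is a placeholder; MagicVideo has no public release.
def generate_keyframes(prompt, num_keyframes=8):
    """Stage 1: a text-conditioned diffusion model produces a few low-fps keyframes."""
    ...

def interpolate_frames(keyframes, target_fps=24):
    """Stage 2: a frame-interpolation model fills in the motion between keyframes."""
    ...

def super_resolve(frames, scale=4):
    """Stage 3: a super-resolution model upscales every frame to the final size."""
    ...

def text_to_video(prompt):
    keyframes = generate_keyframes(prompt)
    frames = interpolate_frames(keyframes)
    return super_resolve(frames)
```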
Tsinghua University’s CogVideo
https://github.com/THUDM/CogVideo
Tsinghua University released their CogVideo paper.
They included both a demo and the code, although the demo isn't currently functioning. Also, unfortunately for English speakers, it currently only works with Chinese prompts.
KAIST and Google Research’s ‘Projected Latent Video Diffusion Models’ (PVDM)
KAIST and Google Research released their PVDM paper. They include code but no demo. They focused on generating high quality videos using fewer resources than other systems.
While these papers are amazing, what about text to video tools that you can use today?
Text to video tools you can use today
There are three text to video tools that you can use right now.
ModelScope
Runway Gen2*
Deforum
ModelScope
Developed by Alibaba's DAMO Academy and hosted on Hugging Face, ModelScope is the first freely usable text to video tool.
https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis
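If you'd rather run it locally than through the Space, a minimal sketch with the diffusers library looks roughly like this. The checkpoint id and settings are my assumptions about the commonly shared setup, not an official recipe:

```python
# A minimal sketch of running the ModelScope text-to-video model locally via diffusers,
# assuming the "damo-vilab/text-to-video-ms-1.7b" checkpoint on the Hugging Face Hub.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")

prompt = "A ballerina dancing on a wooden stage, surrounded by candles"
# Note: on newer diffusers versions the output is nested, i.e. .frames[0]
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
print(video_path)
```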
I tried out ModelScope. It is… a cool demo, but the results are not usable.
A redditor made a video about Darth Vader in Walmart using ModelScope and the results are interesting but not great.
RunwayML Gen-1 and Gen-2*
March 2023 Update: Gen-1 is now generally available and no longer behind a waitlist.
Gen-1 does not have text to video capabilities; Gen-2 does.
Gen-1 can modify existing video footage using image or text prompt inputs, but it can't create a video from just a text prompt. Gen-2 has text to video capabilities but is not publicly available yet.



Gen-1 is often confused with text to video, but it can only do style transfer using a text or image prompt.
I took Gen-1 for a test drive to see what it could do.
I uploaded this video of a beautiful breakfast.
And applied the ‘Pen and Ink’ style to it.
I also gave this image as an effect driver
And rendered it again:
While the results are impressive, they still feel like an experiment rather than something I could use in a video today. I'm excited to see how the tool progresses to Gen-2, when you can enter a text prompt and create a video without a source video.
Deforum
I include Deforum on this list, but it's an interesting case. It doesn't try to solve the 4 problems; it just makes beautiful images that meld into one another. A ballerina will become a candle, which becomes a lamp post. It doesn't know how ballerinas move, and doesn't try. But you can create videos from text, and the results can be mesmerizing (and motion sickness inducing 😅).
Here is an example video I made using Deforum.
This video has 4 prompts.
tiny cute swamp bunny, highly detailed, intricate, ultra hd, sharp photo, crepuscular rays, in focus, by tomasz alen kopera
anthropomorphic clean cat, surrounded by fractals, epic angle and pose, symmetrical, 3d, depth of field, ruan jia and fenghua zhong
a beautiful coconut --neg photo, realistic
a beautiful durian, trending on Artstation
Deforum interpolates between them using Stable Diffusion.
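Deforum's real pipeline adds camera motion and feeds each frame back through img2img, so the sketch below is only a simplified illustration of the core idea: blend the text embeddings of two prompts with Stable Diffusion while keeping the starting noise fixed. The checkpoint id, frame count, and file names are illustrative assumptions:

```python
# A simplified prompt-interpolation sketch with Stable Diffusion (not Deforum's
# actual pipeline, which adds camera motion and img2img feedback between frames).
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

def embed(prompt):
    # Encode a prompt into the CLIP text embeddings the UNet is conditioned on.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(device))[0]

start = embed("a beautiful coconut")
end = embed("a beautiful durian, trending on Artstation")

# Re-use the same starting noise for every frame so the composition stays stable
# while the prompt conditioning drifts from one concept to the other.
generator = torch.Generator(device).manual_seed(42)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device=device, dtype=torch.float16,
)

num_frames = 24
for i in range(num_frames):
    t = i / (num_frames - 1)
    prompt_embeds = (1 - t) * start + t * end  # linear blend between the two prompts
    frame = pipe(prompt_embeds=prompt_embeds, latents=latents,
                 num_inference_steps=30).images[0]
    frame.save(f"frame_{i:03d}.png")
```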
I made a second example, using the same prompt as the ModelScope example above: ‘A ballerina dancing on a wooden stage, surrounded by candles’. You can see the results on YouTube here, and in the gif below.
The images are beautiful, but the model doesn't understand how a ballerina would move.
Text to video: not here yet, but coming.
We have 4 papers, 3 tools, but nothing that can create a high quality video out of thin air from just a text prompt.
However, AI is moving so fast that I predict there will be a text to video model usable for creating videos within 6 months.
Thanks for reading. See you next week!
-Josh