Text to X, Image to X, Audio to X - What tasks is AI good at and what are the best tools?
Text to image = great. Image to audio = maybe?
It feels like AI can do ‘anything’. Just give it some training data, let it rip on some GPUs and presto, you have a model.
But there are some tasks that AI is great at, and some researchers are still trying to make work.
Using the primary data types of text, images, and audio, lets go through all the different combos and see if AI is up to the task, and what tools are the best.
Let’s start with Text to X.
Text to X - AI use cases
✅ Text To Image
Text to Image is having its breakout moment in 2022. We have not one but 3 different models come out which can take a text prompt and create output of sometimes stunning quality.
Dalle2 - paid API
Stable Diffusion - free and open source
Midjourney - paid API
✅ Text to audio (text to speech)
Text to audio aka text to speech has been around for a long time. It just hasn’t been very good. In fact the original Macintosh Demo had the Macintosh speaking on stage. Lately new techniques and models have greatly improved text to speech making it possible to create voices that sound natural, and can match known human speakers.
https://murf.ai/
https://play.ht/
https://www.lovo.ai/
https://podcast.ai/
🤏Text to audio (music)
Text to music feels like it should be close, but overall, is not here yet.
Last weeks episode was all about where we are with Dalle for music and why it might be harder than assumed.
🤏Text to video
Text to video is a natural extension of text to image. And researchers are getting very close.
Both Facebook (https://ai.facebook.com/blog/generative-ai-text-to-video/) and Google (https://imagen.research.google/video/paper.pdf) researchers have demonstrated videos created from text prompts. The quality still needs to be improved but it is only a matter of time before you can say ‘clown fish doing loops, Pixar style’ and get a cute looking orange fish swimming around in a video.
There are current text to video solutions but they don’t create the exact video you describe. They basically put B roll, transitions and pictures together to try and create a video from your text input.
https://www.steve.ai/
https://elai.io/
https://www.profjim.com/
✅ Text to Text
Text to Text AI has been around for a long time. What is more exciting recently is instead of specialized use cases, new text AI provide a general-purpose “text in, text out” interface, allowing it to be used for many more use cases. Some text to text use cases are text generation, summarization, rephrasing, and ideation.
Overall AIs can handle text quite well. Next lets investigate the state of AIs with images.
Image to X - AI use cases
Image to Text
Image to text is also called Machine Vision. This is a very established field with tons of great tools that can work very well.
Side note, the CV means computer vision, not a resume. Here are some of the top tools many of which are open source and free to use.
❓Image to Audio
Not sure what this would even look like?
Currently you could chain together tasks to get this result. You could use machine vision to turn an image into text, then use a voice synthesizer to turn the text into speech.
🚫Image to Video
Upload an image and then get a video from it. Either animating the subject matter or moving around the scene.
There are lots of researchers working on it, and they are making progress. I predict this area will have a lot of great tools in about a year or two.
The Runway video editing tool lets you try and animate object in photos.
Googles DeepMind can create a video of sorts from one photo.
Several family history tools allow you to animate old photos.
And Googles newest AI lets you ‘fly’ into a single photo.
❓Image to Music
This is another maybe strange use case. Upload an image, get a song. Currently an experimental area.
Tools like https://melobytes.com/en/app/ai_image2sound let you upload an image and get a song but since humans would have a hard time making a song from an image, AIs have trouble as well.
Unsure if this will every become a well used or well supported use case.
✅ Image to Image
Image to image or image processing is a huge use case area with lots of types of jobs to be done. The types of jobs that AI recently got good at, and can be done with image in image out include
Style transfer
Background removal
Image inpainting
Super Resolution
✅Style Transfer
All the text to image APIs can do style transfer to some degree. However Dreambooth really seems to shine at it as it can learn the subject then create them in a new style.
https://dreambooth.github.io/
Other style transfer APIs
https://www.fritz.ai/style-transfer/
https://deepai.org/machine-learning-model/fast-style-transfer
✅Background Removal
I wrote an article about AI background removers and how they work here -
✅Image Inpainting
All 3 of the text to image models do this very well. Select a portion of an image and ask the AI to fill it in.
While image is less support in AI use cases than text, it still has some very useful and impressive applications. Next let us look at audio.
Audio to X - AI use cases
✅ Audio to text
Transcription is a well established field with many players. However Whisper, an Open Source transcription model from OpenAI has made things exciting again.
Whisper - free and open source
❓Audio to Image and ❓Audio to Video
Another curious use case. It could be achieve by chaining tasks together. Use a transcription service to turn voice into text, and then use text to image to create an image of it.
Not sure if this area will get a ton of tools or use cases.
🚫Audio to Music
This could be an amazing use case. Upload audio of you singing or playing an instrument then ask the AI to transform it into a different style. Hear yourself like Frank Sinatra, Tim McGraw or Taylor Swift.
We are very far from this, but as text to music improves, this use case could become possible.
✅ Audio to Audio
Audio to audio or audio processing is a very well established field. There are a few new use cases where AI is rapidly improving which include:
Stem separation
Cleaning or enhancing
✅ Stem separation
Stem separation lets you upload a music track and then remove or isolate parts like the singing, guitar etc. Some tools that have make this much easier recently are:
https://www.lalal.ai/
https://www.stemroller.com/
✅ Cleaning or enhancing
Cleaning up bad audio is a huge industry already. Most audio tools have methods of doing some of this, but AI is learning what the audio could sound like and making it much easier to fix bad audio.
https://audo.ai/
https://www.nonoisy.com/
https://ai-coustics.com/
Chaining AI tasks together.
AI can do amazing things. But the best part about computers, is you can start putting together tools like legos.
Here are some unique use cases where AI tools can be chained together for great results.
Audio to text to image (subtitles)
This is also know as AI subtitles. Take audio, transcribe it to text, then output it as an image. There are tons of subtitling tools available.
Video to text to video (edit video like a text file)
Imagine opening a video file, and getting an automatic transcription. Then you can edit the transcript to remove parts of the video. That is exactly what Descript does.
It is exciting how fast AI is moving. I can imagine very soon where you will be able to write a text prompt, get an image, turn that into a 3D model, and then describe how you want that model animated. You could then describe the song you want created as the soundtrack. Basically Pixar in your pocket.
Thanks for reading. If you got this far, reply telling me your favorite AI tool.
Until next week, lets go create something awesome
-Josh