Speech to Text AI - the best apps, free tools, and APIs
Boost your productivity and save time and money
Hello! I’ve been trying to eat more protein lately, and plain Greek yogurt + a Crystal Light drink mix = high protein, low cal delicious raspberry yogurt. Enough about food, let’s dive into the best speech to text apps!
Last episode we covered the history of speech to text (the first machine was built in 1952!) and learned how speech to text actually works.
Today in part 2, we are going to cover:
The 3 best speech to text apps you can use today,
10 open-source speech to text models you can build into products for free
9 paid APIs you can use if you don’t want to mess with open-source software
Let’s get into it.
Speech to text apps you should try today
I only recommend apps that I, or someone I trust, have tested personally. So I will split this list into two sections: 1. apps I have used personally, and 2. apps that look awesome but I haven’t used yet.
My top 3 speech to text apps are:
Apple Dictation (hands free, free transcription in Apple Notes)
Otter.ai (automatic meeting note taking and audio recordings)
Descript (edit videos by deleting or typing text)
Apple Dictation
Price: Free
If you want to capture longer thoughts on the go, nothing beats using the Notes app with Dictation.
In fact, I captured this whole section just by speaking into my iPhone after pressing the microphone button.
It is quick, pretty accurate, and a great way to take notes hands-free.
It is WAY BETTER than just using Siri to capture a note because Apple Dictation transcribes your speech as you go along. Siri tries to do it all in one go, so if there is bad service or an error it loses all your thoughts and you have to start over.
Apple Dictation doesn’t record the audio, so you can’t reference it later. If you want something that records the audio and transcribes, I would recommend…
Otter.ai
Price: Free starter plan, then starts at 9 dollars a month
I take a lot of notes during meetings. Sometimes I have been accused of being so focused on the notes that I am not participating in the meeting.
Otter.ai solves that. You can use your computer or phone microphone, or you can connect Otter to your calendar and it can automatically join and record your meetings on Zoom, Microsoft Teams, and Google Meet.
It detects different speakers and is pretty accurate.
You can click on any word and it will start playing from that spot, highlighting the words as the audio plays.
Their free plan includes 300 minutes a month which is pretty good for occasional meetings.
The last of my top 3 tools is a video editing tool that feels like magic.
Descript
Price: Free starter plan, then starts at 12 dollars a month
If you edit videos, Descript is actually magical. You upload your video and it automatically transcribes it. Then if you want to cut out a section, you just delete the text, and it will remove that section of the video.
When your video editing software knows what you are saying in the video, it can do some pretty cool things.
They offer a free plan that works to get started and the first plan costs 12 dollars a month.
Descript can do a ton of other cool things like learn your voice so you can generate audio of ‘you’ speaking, just by typing out the text you want ‘your’ voice to say.
I would highly recommend trying out Apple Dictation, Descript, and Otter.ai today.
Now here is the 2nd part of the list. These are very cool speech to text apps that I haven’t been able to test yet, so I can’t recommend them, but they intrigue me.
Cool speech to text apps I haven’t tried:
Gling.ai - AI video editing. Cut silence, pauses and bad takes out of videos
Meeple.ai - Meeple.ai analyzes your sales calls
ToWords - Create YouTube video transcripts from a URL
Supernormal - automatic meeting notes
Vienna Scribe - on device private transcriptions
Echo.win - AI answers your business phone calls, and gives info
Poised - Communication coach that analyzes and improves your presentation and speaking skills
Forum by Waverly - Automatic live translation into 20+ languages
The coolest thing about speech to text in 2023 is that what was formerly hard and expensive has become cheap, even free.
Open-source speech to text models are giving anyone the ability to create speech to text apps.
If you don’t create apps, or don’t ever want to create apps, you can skip this section. 😄 No hard feelings.
Free open source speech to text models
The history of speech-to-text technology was for a long time dominated by proprietary software and libraries. Companies would build and sell their systems but wouldn’t release their research and secret sauce to the public. If you wanted a computer to understand speech, you had to pay.
Luckily this has changed. A bunch of open source, free-to-use speech to text models have been created.
My top recommendation for libraries to use would be Whisper. It is the newest, uses state of the art techniques, and has a lot of features built in.
If you need a small speech to text model that can run on less powerful hardware like a Raspberry Pi, Vosk is a good choice.
The 10 open source speech to text libraries are:
Whisper
Julius
Kaldi
Vosk
NeMo
Athena
SpeechBrain
DeepSpeech
Coqui
Flashlight
Whisper
Whisper performs multiple tasks such as language detection, voice activity detection, speech to text, and translation. Most of the other systems on this list focus on speech to text alone.
Whisper supports 99 different languages.
https://github.com/openai/whisper
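If you want to see what that looks like in practice, here is a minimal sketch using the `openai-whisper` Python package. The `load_model` and `transcribe` calls come from the project's README; the `"base"` model size is just one of the published sizes, and `format_segments` is my own helper for illustration, not part of Whisper. You also need ffmpeg installed for audio decoding.

```python
def format_segments(segments):
    """Turn Whisper-style segments into '[start - end] text' lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}")
    return lines


def transcribe(path, model_name="base", translate=False):
    """Transcribe an audio file. With translate=True, Whisper outputs English text."""
    # Imported lazily so the helper above works without the package installed.
    # Requires: pip install openai-whisper (plus ffmpeg on your PATH).
    import whisper

    model = whisper.load_model(model_name)
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)
    # The result dict carries the detected language, the full text,
    # and a list of timestamped segments.
    return result["language"], result["text"], result["segments"]
```

Calling `transcribe("meeting.mp3")` (a made-up file name) should hand back the detected language, the full transcript, and the timestamped segments, which you can pretty-print with `format_segments`.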
Julius
The research for Julius began in 1991 at the University of Kyoto.
Julius can do real-time speech to text and is compatible with Linux, Windows, macOS, and Android-based smartphones. Julius only supports English and Japanese.
https://github.com/julius-speech/julius
Kaldi
Kaldi’s creation began in 2009 and the code was released in 2011. Kaldi is expandable with third-party modules provided by the community. It is powerful and complicated to use.
If you are doing research, Kaldi could be a good choice. If you want to build practical applications with a plug and play library, a different option is Vosk.
Vosk
Vosk is built on top of Kaldi. Released in 2019, it is easier to use and its models are smaller. It is compatible with Raspberry Pi, iOS, and Android devices.
Vosk supports ten languages including English, German, French, Turkish, Chinese, Hindi, Spanish, Russian.
https://github.com/alphacep/vosk-api
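Here is a rough sketch of offline transcription with the `vosk` Python package, the kind of thing you could run on a Raspberry Pi. `Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, and `FinalResult` are from the Vosk API. I am assuming you have already downloaded and unzipped a model into `model_dir`, and that your audio is a 16-bit mono WAV file. `extract_text` is my own helper.

```python
import json
import wave


def extract_text(result_json):
    """Pull the recognized text out of one Vosk result (a JSON string)."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(wav_path, model_dir):
    """Stream a 16-bit mono PCM WAV file through Vosk and return the transcript."""
    # Imported lazily so extract_text works without the package installed.
    # Requires: pip install vosk
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        recognizer = KaldiRecognizer(Model(model_dir), wf.getframerate())
        pieces = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when Vosk has finished an utterance.
            if recognizer.AcceptWaveform(data):
                pieces.append(extract_text(recognizer.Result()))
        # Flush whatever audio is still buffered.
        pieces.append(extract_text(recognizer.FinalResult()))
    return " ".join(p for p in pieces if p)
```

Because the audio is fed in small chunks, the same loop works for a live microphone stream if you swap the WAV reader for an audio input library.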
NVIDIA NeMo
NVIDIA NeMo is a conversational artificial intelligence toolkit designed for researchers working on automatic speech recognition, text-to-speech synthesis, large language models, and natural language processing. The main aim of NeMo is to assist researchers from both industry and academia in reusing previous work (including code and pretrained models) and simplifying the process of developing new conversational AI models.
https://github.com/NVIDIA/NeMo
Athena
An end-to-end automatic speech recognition (ASR) engine.
It is written in Python and has large models available for both English and Chinese.
https://github.com/athena-team/athena
SpeechBrain
A PyTorch-based transcription toolkit currently in Beta and sponsored by large companies such as Nuance, NVIDIA and Samsung.
https://speechbrain.github.io/
DeepSpeech
DeepSpeech was created by Mozilla in 2017. Its error rate on LibriSpeech’s test-clean set is 6.5%, which is close to human level performance.
Mozilla is winding down its participation in DeepSpeech but the model is well tested and fast.
https://github.com/mozilla/DeepSpeech
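A quick note on that error rate. I am assuming it is word error rate (WER), the usual metric on LibriSpeech: the number of substituted, deleted, and inserted words needed to turn the model's output into the correct transcript, divided by the number of words in the correct transcript. Here is a small sketch of how you could compute it yourself. This is my own illustration, not DeepSpeech's evaluation code, and it only lowercases and splits on whitespace, where real benchmarks normalize text more carefully.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    # prev[j] is the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output (deletion)
                curr[j - 1] + 1,     # extra word in the output (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

So a 6.5% WER means roughly 6 or 7 word mistakes for every 100 words spoken.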
Coqui
Founded by former Mozilla DeepSpeech engineers. Coqui has a small English model size at 47 MB, which makes it mobile and embedded friendly.
https://github.com/coqui-ai/STT
Flashlight (formerly wav2letter)
Very fast. Written by Facebook researchers, Flashlight is the fastest open source speech recognition framework.
In some cases, Facebook claims wav2letter++ is more than 2x faster than other frameworks.
https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr
Of course, setting up open source software and creating an API so your app can use it is difficult. Many companies offer paid services where you can sign up, integrate with them, and use their infrastructure.
Paid Speech to text APIs
Paid services are nice because you hand them an audio file and they give you back the text.
If you are just getting started, I would suggest using Deepgram or AssemblyAI. Both have great documentation.
If you are an experienced programmer, Google Speech-to-Text and Amazon Transcribe are great options, but much more difficult to configure and use.
The 9 paid speech to text services are:
Deepgram
AssemblyAI
Google Speech-to-Text
Amazon Transcribe
Microsoft Azure Speech-to-Text
Picovoice
Sonix
LumenVox
Rev
Deepgram
Price: $0.0145 per minute
Transcribe an hour of audio in under 20 seconds. Great documentation.
Supports 30 languages and dialects and 40+ file types.
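To show what "hand them an audio file, get back the text" looks like, here is a sketch that uses only the Python standard library. The endpoint, the `Token` authorization header, and the shape of the JSON response reflect my understanding of Deepgram's pre-recorded audio API, so treat them as assumptions and check the current documentation before relying on them. The function names are my own.

```python
import json
import urllib.request

# Assumed endpoint for pre-recorded audio; verify against Deepgram's docs.
DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(audio_bytes, api_key, content_type="audio/wav"):
    """Build the POST request that uploads raw audio bytes."""
    return urllib.request.Request(
        DEEPGRAM_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
        method="POST",
    )


def extract_transcript(response):
    """Pull the first transcript out of the parsed JSON response (assumed shape)."""
    return response["results"]["channels"][0]["alternatives"][0]["transcript"]


def transcribe(path, api_key):
    """Upload an audio file and return the transcript text."""
    with open(path, "rb") as audio_file:
        request = build_request(audio_file.read(), api_key)
    with urllib.request.urlopen(request) as reply:
        return extract_transcript(json.load(reply))
```

Most of the paid services on this list follow the same pattern: an authenticated upload, then a JSON response you dig the text out of. In a real app you would use the provider's official SDK and add error handling.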
AssemblyAI
Price: $0.015 per minute
Claims setup times of 5 minutes or less. It only supports English currently, but they have tons of tutorials and very good documentation.
https://www.assemblyai.com/
Google Speech-to-Text
Price: $0.016 per minute
A popular audio transcription engine that supports over 125 languages.
Can be a bit complex to get started. Google offers $300 in free credits for the first six months.
https://cloud.google.com/speech-to-text
Amazon Transcribe
Price: $0.024 per minute
Like anything Amazon and AWS, powerful and super complex to set up.
Also offers a custom Speech-to-Text API for the healthcare industry. The first hour of transcription is free every month for the first year of use, then it costs $1.44 per hour.
https://aws.amazon.com/transcribe/
Microsoft Azure Speech-to-Text
Price: $0.0166 per minute
A very accurate Speech-to-Text engine with customizable models, multi-language support, and lots of other features. It offers five hours of free transcription per month. Even harder to use than Amazon and Google.
https://azure.microsoft.com/en-us/products/cognitive-services/speech-to-text/
Picovoice
Price: ? complicated pricing
Converts speech to text locally, on the platform where your app runs, without sending data to a 3rd party cloud. This platform could be a web browser, mobile application, single-board computer (Raspberry Pi) or server. Since it doesn’t use API calls, it can reduce the cost by 10 to 100x.
https://picovoice.ai/platform/cat/
Sonix
Price: $0.166 per minute - more expensive.
Speech-to-text in 38+ languages. In-browser editor allows you to search, play, edit, organize, and share your transcripts.
https://sonix.ai/
LumenVox
Price: classic enterprise ‘request’ demo, no pricing available
Enterprise high end speech to text. Their website doesn’t list pricing.
https://www.lumenvox.com/
Rev
Price: $1.50 / audio minute
More expensive because it uses AI and humans for 99% accuracy.
https://www.rev.com/
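Per-minute prices are hard to compare in your head, so here is a small sketch that turns the prices listed above into a cost for a given amount of audio. The numbers are just the ones from this post (Picovoice and LumenVox are left out because they don't have a simple per-minute price), it ignores free tiers, and the function names are my own.

```python
# Per-minute prices in US dollars, copied from the list above.
PRICE_PER_MINUTE = {
    "Deepgram": 0.0145,
    "AssemblyAI": 0.015,
    "Google Speech-to-Text": 0.016,
    "Microsoft Azure Speech-to-Text": 0.0166,
    "Amazon Transcribe": 0.024,
    "Sonix": 0.166,
    "Rev": 1.50,
}


def cost(service, minutes):
    """Cost in dollars, rounded to cents, of transcribing `minutes` of audio."""
    return round(PRICE_PER_MINUTE[service] * minutes, 2)


def cheapest_first(minutes):
    """List (service, cost) pairs for `minutes` of audio, cheapest first."""
    pairs = [(name, cost(name, minutes)) for name in PRICE_PER_MINUTE]
    return sorted(pairs, key=lambda pair: pair[1])
```

For example, `cost("Amazon Transcribe", 60)` gives 1.44, which matches the $1.44 per hour Amazon quotes, while an hour on Rev comes to 90 dollars.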
Other speech to text APIs I haven’t had time to research include:
Speech to Text is getting better and better
I remember distinctly my older sister training Dragon software.
She has a learning disability and can’t type well.
My parents bought her the software and the microphone, and she would sit at the family computer for hours training then using the software. It was expensive and slow but it was how she graduated high school.
Now there are many free and paid apps and libraries that give everyone the power of speech to text, faster, easier, and more powerful than ever.
Our kids are going to look back on this classic Star Trek clip and not understand why it was funny, because computers have actually listened and responded to them for their whole lifetime.
Fun update, Google released a paper about their speech to text engine. It beats Whisper and supports over 100 languages. https://arxiv.org/abs/2303.01037
Of course you can't use it yet, since Google has trouble releasing things, but exciting times are ahead!