Text to speech (TTS) synthesis has been with us for quite some time now. While it does the job of making written text understandable most of the time, it still clearly sounds like a machine, with its robotic quirks.
So with machine learning gaining considerable traction, TTS is getting more of the neural machine oil into those logical cogs, and it’s getting better and better.
Amazon Polly is one of the key contenders in this category. Initially released in 2016, as of today it is available in 29 languages with the standard engine and in 4 languages (11 voices) with the neural engine.
What’s the difference between the standard and the neural engine in Amazon Polly other than the cool name?
Well, the standard TTS processing is built on concatenative synthesis, which basically means stitching together samples of recorded speech. The length of those samples varies and depends on the implementation.
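To make the concatenative idea a bit more tangible, here is a toy sketch. It is not how Polly is implemented; it only joins made-up "recordings" end to end and blends a couple of samples at each seam. The unit names, the sample values and the `crossfade` parameter are all assumptions for illustration.

```python
def concatenate_units(units, inventory, crossfade=2):
    """Join recorded unit waveforms, blending `crossfade` samples at each seam."""
    out = []
    for name in units:
        samples = list(inventory[name])
        # Never blend more samples than either side actually has.
        n = min(crossfade, len(out), len(samples))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps towards the new unit
            idx = len(out) - n + i
            out[idx] = (1 - w) * out[idx] + w * samples[i]
        out.extend(samples[n:])
    return out


# Hypothetical inventory: each "recording" is just a short list of samples.
INVENTORY = {
    "HH": [0.0, 0.1, 0.2, 0.1],
    "AY": [0.3, 0.5, 0.4, 0.2, 0.1],
}

wave = concatenate_units(["HH", "AY"], INVENTORY)
```

Real systems pick from thousands of recorded units and choose the ones that join most smoothly, and the seams are where the robotic quirks tend to creep in.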
NTTS, aka the Neural Text To Speech approach, is a bit different.
First it creates a spectrogram from a sequence of phonemes; a spectrogram basically breaks the sound down into measurable energy levels on different frequency bands. Instead of just looking at the corresponding inputs in a sentence, the model analyses the whole sequence of elements and chooses the best spectrogram for the output. Then a neural vocoder converts the spectrogram into a speech waveform.
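To show what "energy levels on different frequency bands" means, here is a toy sketch that measures one short frame of audio with a naive discrete Fourier transform. A real NTTS system goes the other way (a neural network predicts the spectrogram and a vocoder turns it into sound), so this only illustrates what a single spectrogram column contains. The sample rate, frame size and test tone are arbitrary assumptions.

```python
import cmath
import math


def frame_energies(frame):
    """Return the energy in each frequency bin of one audio frame (naive DFT)."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        energies.append(abs(coeff) ** 2)
    return energies


SAMPLE_RATE = 8000  # Hz, assumed
FRAME_SIZE = 64     # samples per frame, assumed
TONE_HZ = 1000      # test tone; each bin is 8000 / 64 = 125 Hz wide

frame = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(FRAME_SIZE)]
energies = frame_energies(frame)
loudest_bin = max(range(len(energies)), key=energies.__getitem__)
loudest_hz = loudest_bin * SAMPLE_RATE / FRAME_SIZE
```

Stacking such columns for consecutive frames gives the spectrogram; here nearly all of the energy lands in the bin that matches the test tone.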
And the difference between Polly TTS and Polly NTTS is obvious:
AWS Polly, Salli Standard:
AWS Polly, Salli Neural:
Polly can also be used with Speech Synthesis Markup Language (SSML) to fine-tune the speech, including breathing sounds, whispering, phonetic pronunciation, breaks etc.
With SSML the whole speech can be synthesised in specific styles using presets:
AWS Polly, Joanna Neural Newscaster Style Using SSML:
<speak> <amazon:domain name="news"> Hello there, Joanna here with the latest news: People around the globe woke up realising there is no cloud, it's just someone else's computer and they went woah. </amazon:domain> </speak>
Also with presets and some fine-tuning:
AWS Polly, Matthew Neural Conversational Style Using SSML:
<speak> <amazon:domain name="conversational"> So get this right? I spoke with Joanna yesterday and she was saying there is no cloud, <prosody rate="slow">over and over again and again</prosody>. Not sure what protomolecule hit her stack. </amazon:domain> </speak>
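If you generate SSML like the snippets above from code rather than typing it by hand, a small helper can do the wrapping and escape characters that would otherwise break the markup. This is only a sketch: `build_ssml` is a made-up name, and it covers just the `<prosody rate>` and `<amazon:domain>` tags used in these examples.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, domain=None, rate=None):
    """Wrap plain text in <speak>, optionally with a prosody rate and a style domain."""
    body = escape(text)  # escape &, < and > so the text cannot break the markup
    if rate:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if domain:
        body = f"<amazon:domain name={quoteattr(domain)}>{body}</amazon:domain>"
    return f"<speak>{body}</speak>"


ssml = build_ssml("Cats & dogs", domain="news")
slow = build_ssml("over and over again", rate="slow")
```

Here `ssml` holds the text wrapped in the news style with the ampersand escaped, and `slow` holds a plain `<speak>` block with a slowed-down prosody tag.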
These styles widen the range of uses for AWS Polly NTTS: language schools, audio books, podcasting, blogs and automated radio stations, just to name a few. I specifically used it to make a voice-over for my friend’s pizza recipe on YouTube, as some of us (like me) have terrible voices on a microphone, and Salli, the voice I used in this instance within Polly, won’t scare viewers away like I would.
AWS Polly, Salli Neural, no styling:
Before you give it a test run, a few words about AWS Polly’s pricing structure.
Free tier:
-Standard engine: 5 million characters per month for 12 months, starting from the first request
-Neural engine: 1 million characters per month for 12 months, starting from the first request
Pay as you go (per character):
-Standard engine: $4 per 1 million characters after the free tier
-Neural engine: $16 per 1 million characters after the free tier
Considering that 1 million characters equals ~23 hours of speech, $4 / $16 is not too bad.
Narrating Mark Twain’s Huckleberry Finn would cost you $2.40 / $9.60 (standard / neural).
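The arithmetic behind those numbers can be sketched in a few lines. The prices and the ~23 hours per million characters come from the figures above; the book length of 600,000 characters is an assumption, back-solved from the $2.40 figure ($2.40 / $4 per million), and the function names are made up.

```python
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00}  # USD, from the list above
HOURS_PER_MILLION = 23  # rough figure quoted above


def polly_cost(characters, engine):
    """Estimate pay-as-you-go cost in USD, ignoring any free-tier allowance."""
    return characters / 1_000_000 * PRICE_PER_MILLION[engine]


def speech_hours(characters):
    """Estimate the length of the resulting audio in hours."""
    return characters / 1_000_000 * HOURS_PER_MILLION


BOOK_CHARACTERS = 600_000  # assumed length, back-solved from the $2.40 figure

standard_cost = polly_cost(BOOK_CHARACTERS, "standard")
neural_cost = polly_cost(BOOK_CHARACTERS, "neural")
book_hours = speech_hours(BOOK_CHARACTERS)
```

With these assumptions the book comes out at roughly 14 hours of audio for $2.40 on the standard engine or $9.60 on the neural one.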
The easiest way to give it a test run is to set up an AWS account if you don’t have one, log into your AWS Management Console and go to https://console.aws.amazon.com/polly/ or search for Polly within the console.
There is a form field where you can paste up to 3000 characters of text and make it playable / downloadable instantly. Don’t forget to select the neural option for enhanced quality.
Be aware that using this field also counts towards your regular usage and will be chargeable (once you are outside the free tier).
AWS has extensive documentation on how to use Polly, with nice examples.
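If you would rather call Polly from code than from the console, a minimal sketch with boto3 (the AWS SDK for Python) might look like the following. It assumes boto3 is installed and that your AWS credentials and region are already configured; `build_request` and `synthesize_to_file` are names made up for this example, and the actual call is left commented out because it is billable.

```python
def build_request(text, voice="Salli", engine="neural", ssml=False):
    """Assemble keyword arguments for Polly's SynthesizeSpeech call."""
    return {
        "Text": text,
        "TextType": "ssml" if ssml else "text",
        "VoiceId": voice,
        "Engine": engine,
        "OutputFormat": "mp3",
    }


def synthesize_to_file(text, path, **options):
    """Send the request and write the returned audio to `path`."""
    import boto3  # imported lazily so the request builder works without the SDK

    polly = boto3.client("polly")  # region and credentials come from your AWS config
    response = polly.synthesize_speech(**build_request(text, **options))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


request = build_request("Hello from Polly")
# synthesize_to_file("Hello from Polly", "hello.mp3")  # needs AWS credentials; billable
```

Passing `ssml=True` together with a `<speak>` document switches the request to SSML input, and `engine="standard"` selects the cheaper engine.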