IBM Watson's Text-to-Speech & Speech-to-Text Applications. Could Vloggers/Video-Makers take advantage of them?

in GEMS, 2 years ago

Although there is still a long way to go, AI has made the world faster and smarter over the last decade or so, and there is hardly a finer example of this than IBM Watson.

Source: IBM Watson

Yes, that's right: the same IBM Watson that won the famous American quiz show Jeopardy! back in 2011.

Well, answering questions smartly was, and is, only one teeny tiny part of what it can do.

What makes IBM Watson different is its use of an AI learning technique called Transfer Learning. In Watson, it consists of three layers:

  • Bottom Layer: made of out-of-the-box general knowledge.
  • Middle Layer: contains knowledge about a specific industry/category.
  • Top Layer: user/customer/client-specific learning; customization.

Unlike other AI-based systems, which need to be built from scratch with every detail and all the data, Watson can be fed with prior data (the Bottom and Middle Layers). That is what makes it faster and lets it complete in minutes or hours tasks that take other systems hours or days.

The Top Layer is an entirely separate layer, which makes it more secure.

Even this is only the tip of the iceberg.


But the goal here is not to discuss the whole IBM Watson system. This post focuses on its Text-to-Speech & Speech-to-Text applications, so we will leave the rest for a later time.



With Watson Text to Speech, you can generate human-like audio from written text.


It supports text-to-speech conversion for several languages: English, French, Arabic, German, Portuguese, Spanish, Mandarin, Dutch, Italian, Japanese & Korean.

The corresponding voice (an English speaker's male/female voice for English text, an Arabic speaker's male/female voice for Arabic text) can be selected from the Voice Selection drop-down menu right above the text box.

The quality of the output voice is excellent: it sounds human-like, almost perfect. I have not come across this quality in any other text-to-speech converter I have tried.
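For those who want to go beyond the demo page, here is a minimal sketch of how a text-to-speech request might be prepared from Python using only the standard library. It assumes the service exposes a `/v1/synthesize` endpoint that accepts basic authentication with the user name `apikey`, and that a voice named `en-US_AllisonV3Voice` is available; the service URL and API key are placeholders you would take from your own IBM Cloud account. Please check the official docs before relying on any of these details.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_synthesize_request(service_url, api_key, text,
                             voice="en-US_AllisonV3Voice",
                             audio_format="audio/wav"):
    """Build (but do not send) a POST request for the synthesize endpoint."""
    query = urllib.parse.urlencode({"voice": voice})
    url = f"{service_url.rstrip('/')}/v1/synthesize?{query}"
    # Assumption: HTTP basic auth with the literal user name "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "Accept": audio_format,  # the audio format we ask the service to return
    }
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def synthesize_to_file(request, out_path):
    """Send the prepared request and save the returned audio bytes."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Building the request separately from sending it makes it easy to inspect the URL and headers first, which is handy while you are still figuring out the right voice and format for your narration.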



Watson Speech to Text is a cloud-native solution that uses deep-learning AI algorithms to apply knowledge about grammar, language structure, and audio/voice signal composition to create customizable speech recognition for optimal text transcription.

IBM Watson's Speech-to-Text application is in beta and was last updated on 28th April, 2020. Its speech recognition for different languages is being improved with each update.

Further details can be obtained from IBM Cloud Docs / Speech to Text.

Not all of its features are available in the demo version, but the ones that are available are sufficient for video makers to meet their transcription needs.


  • Go to the Speech-to-Text Demo Version.
  • It works with two options: Record the Live Audio and/or Upload Audio File. Scroll down and select either option.


Only .mp3, .mpeg, .wav, .flac, or .opus formats are allowed for the Audio File Upload.
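If you export audio from many video projects, a quick check like the one below can save a failed upload. It is only a sketch: the list of extensions is copied from the demo's notice above, and the function name `is_uploadable` is my own.

```python
from pathlib import Path

# Extensions quoted from the demo's upload notice.
ALLOWED_EXTENSIONS = {".mp3", ".mpeg", ".wav", ".flac", ".opus"}


def is_uploadable(filename):
    """Return True if the file extension is one the demo says it accepts."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


print(is_uploadable("commercial.mp3"))  # True
print(is_uploadable("commercial.mp4"))  # False
```

Note that this only looks at the file name, not at the actual audio encoding inside the file.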


I chose the OnePlus 8 Pro commercial featuring Robert Downey Jr. for the transcription.

It gave the result in around 2 minutes.

I have sped up the recording; the whole process took 2 minutes 15 seconds, which is not bad for a video that is 1 minute 57 seconds long.
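To put those two numbers side by side, here is a small calculation of how long the transcription took relative to the length of the video. The helper names are mine; the durations are the ones quoted above.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds


def processing_ratio(processing_seconds, media_seconds):
    """Processing time divided by media length (1.0 means real time)."""
    return processing_seconds / media_seconds


processing = to_seconds(2, 15)  # 135 seconds for the whole process
video = to_seconds(1, 57)       # 117 seconds of video
ratio = processing_ratio(processing, video)
print(f"{ratio:.2f}")  # 1.15
```

So the demo needed roughly 15% more time than the video itself runs, which includes the upload.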

Quality of the resulting text can be improved by using different options.


Vloggers & Video Makers:

The objective of this whole exercise, I mean the transcription, is to add subtitles/captions to the video(s). This would...

  1. extend the reach to a wider audience, especially if the video is not in the audience's native language.
  2. help hearing-impaired people.
  3. optimize for Google search. Google search is still at the level where it only understands text.
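Most video editors and platforms accept subtitles as an .srt file, so here is a hedged sketch of turning a transcription into one. It assumes you have the transcript as a list of (word, start, end) triples with times in seconds, which is the kind of word-level timing that speech recognition services can commonly return; the demo's own output may look different. The function names and the seven-words-per-caption default are my own choices.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start, end) triples into numbered SRT captions."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = chunk[0][1]   # start time of the first word in the caption
        end = chunk[-1][2]    # end time of the last word in the caption
        text = " ".join(word for word, _, _ in chunk)
        index = len(cues) + 1
        cues.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    # A blank line separates consecutive captions in an SRT file.
    return "\n".join(cues)


# Hypothetical sample data, not real output from the demo.
sample = [("hello", 0.0, 0.4), ("world", 0.5, 1.25)]
print(words_to_srt(sample))
```

Splitting by a fixed word count is the simplest possible rule; splitting on pauses or sentence ends would usually read better on screen.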

Of course, YouTube could always be used, and nowadays there are many applications available for Text-to-Speech & Speech-to-Text conversion, even free ones, but in quality, features, technology and learning process, there is almost no comparison with IBM Watson. Furthermore, there could be many reasons for not uploading videos to the internet before finalizing them; IBM Watson only needs the audio file for the transcription.

Now you have to decide, if you are a video maker or plan to make videos, whether this would be useful for you or...

Please do share your experience with different applications and software in this category in the comments. Consider it a help for fellow video makers and newbies in the field.


NOTE: This post was written with vloggers and video makers in mind and is about the demo versions. It covers neither the tiniest part of what these IBM Watson applications/APIs offer nor the wide range of their use cases.

References & Resources:

- IBM Watson Products
- Text-to-Speech
- Text-to-Speech Demo
- Speech-to-Text
- Speech-to-Text Demo
- IBM Watson Youtube
- IBM Cloud Docs / Speech to Text
- Getting started with Speech to Text

