Speech to Text

Over the years we've heard that speech-to-text is coming "really soon," and yet every new offering has started with excitement and ended, yet again, in disappointment.

It seems the time has finally come when practical speech-to-text is a reality. There are now multiple providers of speech-to-text APIs* delivering highly accurate results. Microsoft claims its Cortana speech-to-text transcription service has hit a word error rate of 5.9%, which is reportedly on par with the human transcription services you're paying $1 or $2 a minute for right now.

IBM Watson and Google, among others, also provide high-quality speech-to-text APIs that have become dramatically more accurate and useful since late 2014, when I wrote about how difficult useful speech-to-text is.

The first app within the post-production space is SpeedScriber, which ties one of these online services (which one is, unsurprisingly, a trade secret) to a very good human-editing interface, with the goal of producing fully accurate transcripts. At the end of 2016, SpeedScriber is still in beta, but tests by this author have been very positive.

These services provide true content metadata, a textual representation of what was said, that forms the foundation of content extraction. They contrast with the Nexidia technology Avid licenses for Media Central, which is a phonetic search engine that uses waveform matching to search media content. Because no text version is ever created, that technology is limited to phonetic search.

*Application Programming Interface. In these examples, a developer packages an audio file, sends it to the API, and gets back a time-stamped text file of the speech, usually in XML or JSON format. It is then up to the developer to process the results in their app, according to the needs of that app. There are small fees for each use of the API beyond a very basic free level.
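To make the workflow concrete, here is a minimal sketch of the developer's side of that exchange. The JSON shape shown is hypothetical; each real service (IBM Watson, Google, Microsoft) uses its own schema, but all return time-stamped words that the app must reassemble into a usable transcript.

```python
import json

# Hypothetical API response: a list of recognized words with start/end
# times in seconds. Real services differ in field names and nesting.
sample_response = json.dumps({
    "results": [
        {"word": "speech", "start": 0.00, "end": 0.42},
        {"word": "to",     "start": 0.42, "end": 0.55},
        {"word": "text",   "start": 0.55, "end": 0.98},
    ]
})

def to_transcript(response_json):
    """Turn a time-stamped word list into plain text plus a timing index."""
    words = json.loads(response_json)["results"]
    text = " ".join(w["word"] for w in words)
    timings = [(w["word"], w["start"], w["end"]) for w in words]
    return text, timings

text, timings = to_transcript(sample_response)
print(text)  # "speech to text"
```

The timing index is what makes these results useful beyond plain transcription: an editing app like SpeedScriber can map any word in the text back to its exact position in the source media.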