"What is the accuracy of your recognizer?"

That is the question we are frequently asked by our potential customers. Often we answer "that depends", and we get the feeling that the other side thinks "it must be really bad if they do not give a straight answer". However, "that depends" really is the right answer. The accuracy of automated speech recognition (ASR) depends on the audio in many ways, and the effect is not small. Basically, accuracy can be all over the place depending on factors like:

- Does the speech follow proper grammar, or is the speaker making things up as they go? Scripted speech will generally give lower WER (word error rate) scores compared to unscripted speech.
- Rare and obscure words or word combinations, e.g. names of people or places, will make life difficult for the NLM (natural language model).
- Is there more than one speaker? Do the speakers constantly switch, or even talk over one another?
- Is there music in the background? This is very common in YouTube productions.
- Is there background noise? What type of noise?
- Are parts of the speech audio unusually slow or fast?
- Is there room reverb or echo in the recording?
- Are there variations in the recording volume (e.g. a recorder placed at one end of a very long table)?
- Is the recording quality bad, e.g. due to a codec or insane archival compression levels?
Testing / Benchmarking Speech-to-Text Accuracy

Because questions about accuracy or Word Error Rate are somewhat meaningless without specifying the type of speech audio, it is important to do your own testing when choosing a speech recognizer. As a test set, one would choose a set of audio files that accurately represents the spectrum of speech the recognizer will encounter in the expected use cases. For each speech audio file in the set, one would obtain a gold/reference transcript that is 100% accurate. After that, things can be automated: transcribe each file on each of the recognizers being evaluated, compute the WER of each generated transcript against the reference, and collate the results. The combined results will present a clear picture of how the recognizers perform on the specific speech audio that we care about. If you are going to repeat this process often, e.g. to evaluate new candidates on the recognizer market, it is good to standardize the test set, basically creating a repeatable benchmark that can be referenced in the future.
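To make the automated part of that procedure concrete, here is a minimal sketch in Python. The `transcribe` functions, the `run_benchmark` helper, and the shape of `test_set` are hypothetical placeholders, not any vendor's actual API; the WER computation itself is the standard word-level Levenshtein distance divided by the number of reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.
    Only minimal normalization (lowercasing) is done here; real benchmarks
    also normalize punctuation, numbers, etc. before scoring."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


def run_benchmark(test_set, recognizers):
    """test_set: list of (audio_path, reference_transcript) pairs.
    recognizers: dict mapping a recognizer name to a function that takes
    an audio path and returns a transcript (stand-in for each vendor's API).
    Returns the average WER per recognizer."""
    results = {}
    for name, transcribe in recognizers.items():
        scores = [wer(ref, transcribe(audio)) for audio, ref in test_set]
        results[name] = sum(scores) / len(scores)
    return results
```

Libraries such as jiwer implement WER (including the text normalization step) and can replace the hand-rolled function above; the point of the sketch is only that, once the reference transcripts exist, the whole comparison reduces to a loop.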
The benchmark results that we are presenting here are somewhat different from such use-case-driven tests or benchmarks. Because we are building a general recognizer for no single specified use case, we intentionally decided to use a very broad set of audio files. Rather than collecting the test files ourselves, we decided to use the data set described in "Which Automatic Transcription Service is the Most Accurate? - 2018", published in September 2018 by Jason Kincaid. That article presents a comparison of speech recognizers from various companies using a set of 48 YouTube videos (taking 5 minutes of audio from each video). By the time we decided to retest Jason's benchmark, 4 of the videos were no longer accessible, so the benchmark presented here uses data from only 44 videos. We compared the results presented by Jason to the results from the big 3 recognizers - Google, Amazon, and Microsoft - as of June 2020. Of course, we also included our Voicegain recognizer, because we wanted to see how we stacked up against them.