I wonder why voice-to-text programs don’t try to identify the topic of a conversation and load terms specific to it. Say the conversation is about horses: load a dictionary of horse terminology, breeds, typical names, racetrack names, whatever else, and run recognition again, giving terms from this dictionary more weight than terms from, say, IT or cooking. AI has long been able to tell from the text that a conversation is about horses; what it can’t do is adapt to that.
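A minimal sketch of what that weighting could look like: re-rank the recognizer’s candidate transcriptions, boosting hypotheses that contain words from the topic dictionary. The term list, scores, and the `BOOST` knob are all illustrative assumptions, not any real engine’s API (real systems expose this as “hotwords” or phrase boosting).

```python
# Sketch: re-rank ASR hypotheses with a topic-specific vocabulary.
# HORSE_TERMS and BOOST are illustrative assumptions, not a real API.

HORSE_TERMS = {"paddock", "furlong", "secretariat", "ascot"}  # toy topic dictionary
BOOST = 2.0  # extra score per in-domain word (a tuning knob in this sketch)

def rescore(hypotheses, topic_terms, boost=BOOST):
    """hypotheses: list of (text, base_score); returns best text after boosting."""
    def score(item):
        text, base = item
        hits = sum(1 for w in text.lower().split() if w in topic_terms)
        return base + boost * hits
    return max(hypotheses, key=score)[0]

# Two candidates with near-equal acoustic scores; the topic term breaks the tie:
hyps = [("they met at as cot", -4.1), ("they met at ascot", -4.3)]
print(rescore(hyps, HORSE_TERMS))  # -> they met at ascot
```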
Or consider recognition in Teams. Microsoft, you have gigabytes of messages from chats and groups. It’s quite obvious that roughly the same words will be heard in the audio. Why not compile a dictionary of those words and feed it into the speech recognizer to make transcriptions more accurate? And that’s before considering that the same person writes and speaks about roughly the same topics. Build the dictionary from a speaker’s own messages and apply it first to their utterances, and secondarily to everyone else on the call, and it would be just perfect.
One could also think about improving past recognitions. Say, over a week, accumulate knowledge (from chats) of how ‘Medik8’ sounds and is spelled, and then fix every incorrectly recognized ‘medicate’ to ‘Medik8’ in past meetings (updating the search index to reflect the changes). Detecting the misrecognition is nontrivial for a machine, but still possible: the verb ‘medicate’ clearly doesn’t fit grammatically where ‘Medik8’ does.
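The grammar check could start as something very crude: a verb like ‘medicate’ sitting right after a determiner or preposition (“the medicate serum”) can’t be a verb phrase, so the learned brand name is the better reading. The confusion map and word lists below are hypothetical stand-ins for what a real parser and chat-mined data would provide.

```python
# Sketch of retroactive transcript correction. CONFUSIONS is assumed to be
# learned from chats ("Medik8" keeps coming out as "medicate"); NOUN_SLOTS
# is a crude stand-in for a real grammatical check.

CONFUSIONS = {"medicate": "Medik8"}
NOUN_SLOTS = {"the", "a", "an", "with", "of", "from", "our", "their"}

def correct(transcript):
    words = transcript.split()
    out = []
    for i, w in enumerate(words):
        fix = CONFUSIONS.get(w.lower())
        if fix and i > 0 and words[i - 1].lower() in NOUN_SLOTS:
            out.append(fix)  # "the medicate" can't be a verb phrase
        else:
            out.append(w)
    return " ".join(out)

print(correct("we discussed the medicate serum launch"))
# -> we discussed the Medik8 serum launch
print(correct("please medicate the patient"))
# -> please medicate the patient   (genuine verb usage left alone)
```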
A proper startup should emerge that integrates with messengers and meeting apps, does all of this smartly, and charges some money for it. If all internal meetings were transcribed (properly! with speaker turns, names, and topic awareness) and there were a unified search that respects access rights (you can only search meetings you were invited to), it would be a supertool.
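The access-rights part of that search is conceptually simple: filter by invitee list before matching the query. A toy version, with a made-up data model:

```python
# Sketch of permission-aware transcript search: a user only hits
# meetings they were invited to. The data model is illustrative.

MEETINGS = [
    {"id": 1, "invitees": {"alice", "bob"}, "transcript": "q3 roadmap and hiring"},
    {"id": 2, "invitees": {"bob"},          "transcript": "q3 budget review"},
]

def search(user, query, meetings=MEETINGS):
    q = query.lower()
    return [m["id"] for m in meetings
            if user in m["invitees"] and q in m["transcript"].lower()]

print(search("alice", "q3"))  # -> [1]      (meeting 2 is invisible to alice)
print(search("bob", "q3"))    # -> [1, 2]
```

At scale the permission filter would live in the search index itself rather than as a post-filter, but the invariant is the same: no result a user couldn’t have attended.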
