I came up with an interface that, it seems, nobody has yet implemented even as a prototype, and it looks very cool. And technically, nothing about it is impossible.
The idea is that the system (whether a website, an online store, or an operating system – it doesn’t matter) constantly listens to the user through an ordinary microphone. But unlike a voice interface such as Siri or Alexa, here speech serves as a supplementary channel of user–system interaction, not the primary one. If the system does not understand the user, it does nothing. If it does understand, it rearranges the interface so that more relevant results appear right “under the mouse.”
For example, imagine you are browsing an online store, looking for boots. You open the first pair you come across: the style is silly, but the color is fine. You simply say so. The system then adjusts the navigation to bring what you liked closer. Naturally, all other things being equal – if you are looking for green, then the red you dislike shouldn’t be mixed in among the green. But nothing stops you from building dynamic, personalized lists of recommended products: if a user sees a product and says it’s cool, the system can use machine learning to show them products they might like (nothing new there).
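The re-ranking step above can be sketched in a few lines. This is a minimal illustration, not an implementation: the product attributes, the seed item, and the simple boost/penalty scoring are all hypothetical assumptions.

```python
# Hypothetical sketch: the user said "the style is silly, but the color
# is fine" about a seed product; re-rank the catalog accordingly.
# Attribute names and the +1/-1 scoring rule are illustrative assumptions.

def rerank(products, seed, liked=("color",), disliked=("style",)):
    """Boost products sharing liked attributes with the seed product,
    penalize those sharing disliked ones."""
    def score(p):
        s = 0
        for attr in liked:
            if p[attr] == seed[attr]:
                s += 1
        for attr in disliked:
            if p[attr] == seed[attr]:
                s -= 1
        return s
    return sorted(products, key=score, reverse=True)

catalog = [
    {"name": "A", "color": "green", "style": "chunky"},
    {"name": "B", "color": "red",   "style": "sleek"},
    {"name": "C", "color": "green", "style": "sleek"},
]
seed = {"color": "green", "style": "chunky"}  # the boots the user opened
ranked = rerank(catalog, seed)
# the green pair in a different style floats to the top,
# while the disliked red stays out of the way
```

The point is only that one spoken sentence becomes a pair of soft constraints on the existing navigation, rather than a command the user must phrase correctly.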
Theoretically, combined with eye tracking and mouse tracking, this could make a very powerful prototype. The advantage of the system is that it is unobtrusive: it simply observes and listens. And you won’t end up with a scenario like the voice-controlled elevator from that clip (Eleven!).
Of course, there’s a difficulty: sending audio and video back to the server is out of the question – no user would agree to that. But processing it locally on the user’s machine and sending only the system’s approved decisions – that can indeed be done.
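The privacy split described here can be sketched as a small pipeline: raw audio stays on the device, and only a compact, structured decision ever goes over the wire. Everything below is a stand-in – `recognize_locally` represents any on-device speech model, and the intent format is invented for illustration.

```python
# Sketch of the local-processing idea: audio never leaves the machine;
# only a small structured "decision" is sent upstream.
# recognize_locally() is a placeholder for an on-device recognizer.

import json

def recognize_locally(audio_chunk):
    # Stand-in for a local speech model; returns a transcript,
    # or None when nothing was understood (then the system does nothing).
    fake_transcripts = {b"chunk-1": "I like the color, not the style"}
    return fake_transcripts.get(audio_chunk)

def to_intent(text):
    """Reduce a transcript to a minimal intent payload (assumed format)."""
    if text is None:
        return None
    intent = {"liked": [], "disliked": []}
    if "color" in text and "like" in text:
        intent["liked"].append("color")
    if "not the style" in text:
        intent["disliked"].append("style")
    return intent

def payload_for_server(audio_chunk):
    intent = to_intent(recognize_locally(audio_chunk))
    # Only this JSON, never the audio itself, leaves the machine.
    return json.dumps(intent) if intent else None
```

Note that an unintelligible chunk produces no payload at all, which matches the rule that the system stays silent when it doesn’t understand.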
This was an example with an online store, but in general the concept could work anywhere. Allocating a strip of the screen where contextual links pop up, based on analyzing speech and the user’s interaction with the site, might be quite useful. For instance, you open a public services site and say, “where do I pay fines here…”, and at the bottom “Fines – here!” immediately pops up. Or it doesn’t, if you mumbled something unintelligible – but meanwhile you’re also searching the menu and not really relying on your voice. Over time, recognition quality will improve (if only because the system has guessed right about your fines before and will increase the weight of that guess next time).
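The “increase the weight of this guess” idea can be made concrete with a toy reinforcement rule: each time a suggested contextual link is actually used, its weight for that phrase grows, so next time the same mumbled query surfaces it with more confidence. The additive update and the threshold are assumptions made for the sketch.

```python
# Toy sketch of weighting guesses over time: a suggestion is shown only
# once its (phrase, link) weight passes a confidence threshold, and each
# confirmed use reinforces it. The update rule is an illustrative assumption.

from collections import defaultdict

weights = defaultdict(float)  # (phrase, link) -> confidence weight

def suggest(phrase, links, threshold=0.5):
    """Return the best link for a phrase, or None if nothing is confident."""
    best = max(links, key=lambda l: weights[(phrase, l)])
    return best if weights[(phrase, best)] >= threshold else None

def reinforce(phrase, link, step=0.3):
    """The user clicked the suggested link: strengthen the guess."""
    weights[(phrase, link)] += step

links = ["Fines", "Passports", "Taxes"]
assert suggest("where to pay fines", links) is None  # no history yet
reinforce("where to pay fines", "Fines")
reinforce("where to pay fines", "Fines")  # guessed right twice before
print(suggest("where to pay fines", links))  # prints "Fines"
```

The threshold is what keeps the system unobtrusive: below it, nothing pops up, exactly like the mumbled-query case in the text.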
