Voice is taking over the world – and for good reason. Amazon revealed that 100 million Alexa-enabled devices had been sold by this time last year; this week, the company reported that Alexa’s presence has doubled since then. Google Assistant has 500 million monthly users, according to stats published by Google.
Why is voice so popular?
Easy: convenience. It’s quicker and less cumbersome to speak commands than to type them into a device, and people can use their voice while doing other things – driving, or cooking in the kitchen. It’s no surprise that the global voice assistant market is estimated to reach $5.4bn by 2024 – and we feel that is a conservative estimate.
However, there are some limitations with voice technology. Primarily, the accuracy of results is severely limited by background noise. In noisy conditions, words are misheard, and search results are either skewed or totally wrong.
The problem is that almost every time people use a voice assistant, there is naturally some background noise. In-car voice assistants, for example, pick up road and traffic noise, the radio, and the hum of the car’s own engine.
In the home, there will be interfering noises from the TV, electrical appliances, voices of family members, etc.
Degradation of results in noisy conditions
The biggest innovators in voice technology – Amazon and Google – are known to be investing in R&D to solve this problem, but it’s impossible to know the scale or breadth of their research efforts.
Below, we’ve compiled our knowledge of the technologies that can provide solutions.
Top five technologies that can improve voice assistants
Improved AI models are trained with noise – they are robust to it because they hear it all the time. The algorithms are designed to operate in real-life environments, not in sterile, silent lab conditions.
In domain-based training systems, AI algorithms are taught to “listen” from the outset against the background sounds of their target environment: the chatter of a busy restaurant, or the noises inside a car, for example. With this approach, sterile silent training conditions are avoided in favour of more lifelike models.
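To make this concrete, noise-augmented training data can be produced by mixing a clean utterance with recorded background noise at a chosen signal-to-noise ratio (SNR). Below is a minimal sketch in Python with NumPy; the function name and SNR convention are our own, for illustration only:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a background-noise clip into clean speech at a target SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Loop the noise clip if it is shorter than the utterance, then trim.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so speech_power / scaled_noise_power hits the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Training pairs are then (noisy audio, clean transcript), generated across a range of SNRs and noise types – restaurant chatter, engine hum, and so on.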
The number of layers and complexity of the Deep Neural Network architecture can be increased. This requires vast computing power to train the models and it requires vast amounts of speech data. Often this data is harvested from videos already existing on the internet.
While AI may be the software solution to the problem, improving hardware can make a big impact. Microphone technology is evolving, often using an array of microphones that can isolate noise on one channel. For example, rather than a single microphone in the cockpit of a car, microphones will be located in three dimensions all around the driver’s head. The microphone array can pick up voice commands better, helping to isolate the driver’s voice from the sound of the road and the car’s engine.
Beamforming is a technique used to improve not just audio signals, but also sonar, radar, antennae, etc. With beamforming, an array of microphones can identify which aspects of an audio signal are the real signal, and which are simply noise. The technology is already present in some high-end mobile phones, which may include two microphones for this reason. Multiple microphones are also present in home voice assistants, including the Echo, Echo Dot and Echo Show.
Beamforming is known for being computationally intensive. As computation evolves, the innovation is becoming more mainstream. It should start to proliferate into more middle-market devices.
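The simplest flavour of the idea is delay-and-sum beamforming: delay each microphone’s signal so the talker’s wavefronts line up, then average, so the voice adds coherently while off-axis noise does not. A rough sketch, assuming a known talker direction and integer-sample delays (a real implementation would use fractional delays and adaptive steering):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def delay_and_sum(signals, mic_positions, direction, fs):
    """Steer a microphone array toward a talker via delay-and-sum.

    signals:       (n_mics, n_samples) time-aligned recordings
    mic_positions: (n_mics, 3) microphone coordinates in metres
    direction:     vector pointing from the array toward the talker
    fs:            sample rate in Hz
    """
    direction = np.array(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    # A plane wave from the talker reaches microphones nearer the source
    # earlier; compute each channel's relative arrival time.
    arrival = -(mic_positions @ direction) / SPEED_OF_SOUND
    arrival -= arrival.min()
    sample_delays = np.round(arrival * fs).astype(int)
    # Advance the later-arriving channels so the wavefronts align, then
    # average: the voice sums coherently, off-axis noise partly cancels.
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, sample_delays):
        out += np.roll(sig, -d)
    return out / len(signals)
```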
Filtering certain noises, such as hiss
Noise signals can be filtered out of a captured signal to remove non-voice sounds. Techniques such as Speech Enhancement and Blind Source Separation attempt to remove, or de-emphasise, the noise. Filtering works by analysing the signal and processing it to remove artifacts that seem out of place.
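A classic example of this family is spectral subtraction: estimate the average noise spectrum from a noise-only stretch of audio, then subtract it frame by frame from the noisy signal’s magnitude spectrum. The sketch below is deliberately simplified (fixed frame size, no overlap, windowing or smoothing, all of which a production enhancer would add):

```python
import numpy as np

def spectral_subtraction(noisy, noise_sample, frame=256):
    """Subtract an average noise magnitude spectrum from each frame."""
    # Average magnitude spectrum of a noise-only recording.
    usable = len(noise_sample) // frame * frame
    noise_mag = np.abs(
        np.fft.rfft(np.asarray(noise_sample[:usable]).reshape(-1, frame), axis=1)
    ).mean(axis=0)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        # Subtract the noise magnitude, flooring at zero, and keep the
        # noisy signal's phase (the ear is fairly insensitive to phase).
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
    return out
```

This is why a steady hiss is the easy case: its spectrum barely changes over time, so the average estimate stays accurate.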
There is a growing acceptance that audio signals are always going to be noisy. At Liopa we have developed a solution to bypass the problem of noise altogether. Our LipRead solution conducts automatic lip reading. The application simply requires a video of the speaker’s lips to determine what they are saying, with greater accuracy than voice alone.
Cameras are de rigueur on all mobile devices, and with a front-facing camera, it’s easy to capture video of a speaker’s lip movements. While this is the most likely use case, there are also other possibilities for capturing video, apart from mobile phones.
Voice assistants such as the Echo Show are already coming onto the market with screens and cameras. The household TV has taken on new prominence in the race to include voice assistants and cameras. Tech giants like Amazon are battling for TV partnerships at this month’s CES show. In the automotive industry, the cockpit of a car can easily be fitted with a camera trained on the driver, either by the manufacturer or post-purchase.
This is the next evolution – within 5-10 years it will very much be the norm to have cameras in most living environments. These cameras open up many new possibilities: applications across automotive, home and medical care can add a video stream to the pre-existing voice stream.
Analysis of lip movements is combined with the audio/voice signal: the lip-reading results are fused with the audio results to give a much more accurate estimate of what the person was saying.
By adding video, the accuracy of results from voice can be improved, even when the signal is compromised by background noise.
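One common way to combine the two streams is late fusion: the audio recogniser and the lip reader each score the same candidate phrases independently, and their log-probabilities are combined with a weight that can shift toward the video stream as the audio gets noisier. The toy sketch below illustrates the idea only – the weighting scheme is our own example, not a description of LipRead’s internals:

```python
import numpy as np

def fuse_scores(audio_logp, video_logp, audio_weight):
    """Late fusion of audio and lip-reading recogniser scores.

    audio_logp, video_logp: log-probabilities over the same list of
    candidate phrases. audio_weight in [0, 1] can be lowered as the
    estimated SNR of the audio drops.
    """
    combined = (audio_weight * np.asarray(audio_logp)
                + (1.0 - audio_weight) * np.asarray(video_logp))
    return int(np.argmax(combined))  # index of the winning hypothesis
```

For example, with candidates ["play", "pause"], a noise-corrupted audio recogniser might prefer the wrong word; lowering the audio weight lets the lip reader correct it.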