In-car speech recognition technology enables drivers to interact with a multitude of useful applications using their voice. Drivers can invoke navigation applications, make voice calls, send messages, play music, query fuel consumption and set driving modes simply by speaking commands.
Improved road safety has been the primary motivation for the deployment of speech recognition systems in modern cars. Today’s driver wants to be “connected” at all times, but manual interaction with infotainment systems and smartphones while on the move is dangerous, and in many jurisdictions illegal. Speech recognition technology enables this interaction through voice, without the driver having to take their hands off the wheel or their eyes off the road.
Although the safety risks from this type of driver distraction are well known, the statistics are still shocking. The US National Highway Traffic Safety Administration (NHTSA) estimates that texting while driving:
- is responsible for 1 in 4 of all car accidents
- causes cognitive impairment equivalent to driving after consuming 8 units of alcohol
- increases the risk of an accident 23 times.
The use of the in-car voice interface, along with increased penalties, should radically reduce the frequency and severity of accidents attributable to this type of driver distraction.
The roadmap for speech-based interaction in cars extends well beyond the current infotainment use cases. Most automakers plan to expose far more of the vehicle through voice: querying and diagnosing vehicle information such as battery life and range for electric cars, an integrated driver manual, HVAC control, and access to smart-home control.
Broadly speaking, there are two sources of in-vehicle speech recognition systems in use today:
- factory-embedded solutions sourced by automakers (e.g. BMW, VW) from speech recognition technology companies (e.g. Nuance) through Tier 1 auto-suppliers (e.g. Harman, Panasonic)
- smartphone-based applications (e.g. Apple CarPlay, Android Auto) that bring a pared-down version of iOS/Android to the car display, effectively replacing the car’s factory-installed infotainment system, and allowing the driver to access relevant smartphone apps (voice calls, messaging, navigation).
Whilst the technology has promised a lot, mass-market adoption has, to date, been hampered by usability issues. As a result, in-car voice control frequently tops driver surveys of the most annoying new automotive features. A recent study conducted by J.D. Power found that 67 percent of owners said their infotainment system couldn’t follow, or misinterpreted, their spoken commands.
In-car speech recognition uses Automatic Speech Recognition (ASR) technology to decipher speech by analysing the audio signal produced when the driver issues voice commands. Modern in-car ASR systems use machine-learning algorithms and context-specific language models to increase accuracy. The interior of a car, however, is a problematic environment for even the best ASR systems. Background noise can corrupt the audio signal being analysed and severely degrade word accuracy. The vehicle cockpit is a small, confined space in which any noise is impactful, and the effect is compounded by passengers talking, engine and road noise, the stereo and so on, making it increasingly difficult to discern spoken commands amongst the other noise sources. Furthermore, microphones are sited at a distance and angle from the driver and so pick up a considerable portion of the background noise. This is particularly true for smartphone-based systems (CarPlay, Android Auto), which rely on a single device microphone, as opposed to high-end embedded systems, which commonly employ an array of microphones positioned optimally inside the cockpit.
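The degradation described above is commonly quantified as signal-to-noise ratio (SNR): how loud the command is relative to the cabin noise mixed into the same microphone signal. As a minimal illustration (the signals below are synthetic stand-ins, not real recordings), this is how speech and noise are mixed at a chosen SNR when evaluating recognisers:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the desired level relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# A clean "command" versus the same command over simulated cabin noise:
rng = np.random.default_rng(0)
command = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # stand-in for speech
cabin_noise = rng.normal(size=16000)                          # stand-in for road noise
noisy = mix_at_snr(command, cabin_noise, snr_db=0.0)  # 0 dB: noise as loud as speech
```

At 0 dB the noise carries as much energy as the speech itself, which is the regime where audio-only recognisers begin to fail badly.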
This inability of ASR-based systems to perform well in the inherently noisy car cockpit results in the frequent misinterpretation of driver commands, and is a key factor in the negative press currently surrounding the usability of this technology in cars.
Liopa’s LipRead, a Visual Speech Recognition (VSR) system, deciphers speech by analysing lip movements. LipRead processes video only, ignoring the audio track, and is therefore immune to audio noise. Liopa is developing a number of techniques that combine third-party ASR technology with LipRead to create an Audio-Visual Speech Recognition (AVSR) system. The AVSR solution dynamically weights the output of the video and audio analysis, based on a range of factors, to produce the best achievable word accuracy across a range of audio and video noise levels. In short, compared with audio-only speech recognition, the AVSR system will deliver better word accuracy in noisy conditions.
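Liopa’s actual fusion techniques are proprietary; purely as an illustrative sketch of the dynamic-weighting idea (the function, the SNR-to-weight mapping and the scores below are all assumptions, not Liopa’s algorithm), a decision-level fusion might blend per-hypothesis confidences from each recogniser, shifting weight towards the visual stream as the audio gets noisier:

```python
def fuse_hypotheses(asr_scores: dict, vsr_scores: dict, audio_snr_db: float) -> str:
    """Decision-level fusion: pick the command with the best weighted score.

    `asr_scores` / `vsr_scores` map candidate commands to confidences in
    [0, 1]. The audio weight falls as the estimated SNR drops, so in a
    noisy cabin the lip-reading (VSR) stream dominates.
    """
    # Map SNR to an audio weight: ~1.0 in quiet (>= 20 dB), 0.0 at/below 0 dB.
    audio_weight = min(max(audio_snr_db / 20.0, 0.0), 1.0)
    candidates = set(asr_scores) | set(vsr_scores)

    def weighted(cmd: str) -> float:
        return (audio_weight * asr_scores.get(cmd, 0.0)
                + (1.0 - audio_weight) * vsr_scores.get(cmd, 0.0))

    return max(candidates, key=weighted)

# In a noisy cabin (low SNR), the visual stream's top hypothesis wins:
asr = {"call home": 0.40, "play music": 0.45}   # audio is unreliable
vsr = {"call home": 0.90, "play music": 0.10}   # lips clearly said "call home"
print(fuse_hypotheses(asr, vsr, audio_snr_db=3.0))  # → call home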
Improving in-car speech recognition performance is an ideal use case for an AVSR solution. Inward-facing cameras are increasingly being deployed in-vehicle to authenticate drivers and to monitor alertness and fatigue levels. Cradle-mounted smartphones running Apple CarPlay or Android Auto come with high-resolution cameras pointed towards the driver. The video from these cameras can be input to Liopa’s VSR technology which, when combined with the ASR technology in use within the vehicle, will create an AVSR system delivering improved word accuracy in noisy conditions. Additionally, the vocabulary to be supported is contextually constrained (e.g. Call <…>, Bluetooth On), which helps both the ASR and VSR systems determine the words or phrases being uttered.
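To see why a constrained vocabulary helps, note that the recogniser only has to choose among a short list of in-grammar commands rather than all possible sentences. As a toy sketch (the command list and cutoff are illustrative assumptions, not any vendor's grammar), a noisy transcript can be snapped to the nearest valid command:

```python
import difflib
from typing import Optional

# A small, context-specific command grammar of the kind described above.
COMMANDS = ["bluetooth on", "bluetooth off", "navigate home",
            "call office", "play music", "set driving mode sport"]

def snap_to_grammar(transcript: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest in-grammar command, or None if nothing is close."""
    matches = difflib.get_close_matches(transcript.lower(), COMMANDS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(snap_to_grammar("bluetoot on"))     # → bluetooth on
print(snap_to_grammar("tell me a joke"))  # → None (out of grammar)
```

A garbled-but-close transcript is recovered, while utterances far outside the grammar are rejected rather than mapped to a spurious command.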
Liopa is currently optimising the integration techniques used to combine LipRead with third-party ASR systems for use cases such as the in-car example described above. Watch this space!