Visual Voice Generator
Lip movements become audio, instantly
Currently an R&D project, Visual Voice Generator represents an exciting new application for Liopa’s technology.
It uses our back-end technology in a new way. The system is trained on video footage of people speaking so that it learns to recognise their lip movements, and it then recreates the audio of what the speaker has said. Like our other apps, Visual Voice Generator makes use of Deep Learning algorithms, but applying them to this task raises many challenges, which we are tackling through new and unique R&D.
How is Visual Voice Generator Different from our Other Apps?
Liopa’s work to date has centred on figuring out what someone is saying from their lip movements. This is what we call our VSR, or Visual Speech Recognition, system. Its main limitation is that VSR needs a known context to be highly accurate.
Visual Voice Generator takes a different approach: rather than trying to decipher the words the user is saying, it maps lip movements directly to audio. There will still be some constraints on the language it can handle, but far fewer than with VSR or our SRAVI app.
A Side-by-side Comparison
Watch an Example
This first video shows a person saying “I’m comfortable”.
In the second video, our VVG engine reads the speaker’s lip movements and generates a voice speaking those words, which is played over the original audio.
Features of Visual Voice Generator (VVG)
How Does This Work in Real Life?
With Visual Voice Generator, the listener hears audio created from the speaker’s lip movements. If that audio is less than perfect, the listener can use their own intuition and the context of the conversation to work out what the speaker is trying to say.
When VVG May Be Useful
We believe there are many applications of VVG, including:
- Correcting corrupted or missing audio tracks – Some videos of people speaking have a corrupted audio track, or no audio at all. VVG could help restore what the speakers said
- Improving the accuracy of Speech Recognition systems – When audio is distorted or missing, today’s Speech Recognition systems perform poorly. In these situations, a new, cleaner audio track generated by VVG could be fed into the Speech Recognition system instead, giving better results
- Supporting voice-impaired patients in SRAVI – Liopa also plans to use VVG in our SRAVI application. SRAVI is a speech-generating device for voice-impaired patients that uses VSR to generate text from the user’s lip movements; this text can then be ‘vocalised’ at the press of a button. Adding VVG to SRAVI would let us offer the option of creating audio directly from the patient’s lip movements, bypassing the text-generation stage. This may suit certain patient/carer dialogues better