LipRead VSR Platform
Liopa’s mission is to provide an accurate, easy-to-use and robust Visual Speech Recognition (VSR) platform. Known as LipRead, it is focused on the speech recognition market. Liopa is a spin-out from the Centre for Secure Information Technologies (CSIT) at Queen’s University Belfast (QUB), and is further developing and commercialising ten years of research carried out within the university into the use of lip movements (visemes) in speech recognition. The company is leveraging QUB’s renowned excellence in speech, speaker and dialogue modelling to position itself in the market as a leading independent provider of VSR technology.
Liopa has created a novel, robust and convenient VSR platform, known as LipRead. Specifically, the Liopa technology determines speech by analysing the movement of a user’s lips as they speak into a camera. These lip movements are known as visemes and are the visual equivalents of phonemes, the units of sound in spoken language.
LipRead is initially being offered as a standalone VSR system capable of recognising a limited, predefined vocabulary, for applications such as liveness checking during online authentication.
Liopa is also developing a version of LipRead that can be integrated with Audio Speech Recognition (ASR) systems. ASR accuracy universally degrades in noisy, real-world environments; LipRead can mitigate this problem and maintain high accuracy in challenging conditions. Where a camera can be trained on the head of the speaker, the combined Audio-Visual Speech Recognition (AVSR) system improves word accuracy as background noise levels increase.
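One common way to combine an audio and a visual recogniser is late fusion: each system scores the candidate words independently, and the scores are blended with a weight that depends on how noisy the audio is. The sketch below illustrates this idea only; the SNR-to-weight ramp and the function names are illustrative assumptions, not Liopa’s published method.

```python
import numpy as np

def fuse_scores(audio_logprobs, visual_logprobs, snr_db):
    """Late fusion of per-word log-probabilities from an ASR and a VSR system.

    The audio stream's weight shrinks as the estimated signal-to-noise
    ratio drops, so the visual stream dominates in noisy conditions.
    The linear ramp between -5 dB and 25 dB is an illustrative choice.
    """
    # Map SNR to an audio weight in [0, 1]: 0 at -5 dB or below, 1 at 25 dB or above.
    alpha = float(np.clip((snr_db + 5.0) / 30.0, 0.0, 1.0))
    return alpha * np.asarray(audio_logprobs) + (1.0 - alpha) * np.asarray(visual_logprobs)

# Two candidate words; audio favours word 0, video favours word 1.
quiet = fuse_scores([-1.0, -4.0], [-3.0, -2.0], snr_db=30.0)   # audio dominates
noisy = fuse_scores([-1.0, -4.0], [-3.0, -2.0], snr_db=-10.0)  # video dominates
```

In a quiet room the fused scores track the audio system; in heavy noise they track the visual system, which is exactly the behaviour described above.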
The Liopa technology requires no additional hardware and works on any device with a standard front-facing camera (e.g. smartphone, tablet, laptop, desktop, in-vehicle dashboard). LipRead supports standard RGB cameras, with IR/ToF sensor support currently in development.
As shown in the figure below, the LipRead platform has two main components:
- A Training pipeline, which ingests large amounts of training data and creates a universal VSR model. The training data consists of pre-recorded videos of speakers repeating known phrases from the use-case grammar.
- A Speech Recognition pipeline which analyses videos of a user speaking to determine what they have said.
The Liopa VSR technology is based on the principle of viseme analysis. A viseme is a generic lip movement corresponding to a particular sound: the visual equivalent of a phoneme, the unit of sound in spoken language. Using visemes, the hearing-impaired can perceive sounds visually by studying a subject’s lip movements. Liopa’s current VSR technology mimics this process by:
- capturing a video of a subject speaking
- tracking and extracting the movement of the subject’s lips
- performing feature extraction of the lip movement
- using Deep Neural Network (DNN) based techniques to analyse the lip movement
- comparing the results of the analysis (on a viseme-by-viseme basis) with a universal model to determine what has been spoken.
Deep Neural Network (DNN)
A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers, giving it the potential to model complex data with fewer units than a similarly performing shallow network. Liopa has developed a DNN-based VSR system which leverages a proprietary, patent-pending combination of leading-edge neural network techniques.
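The "multiple hidden layers" idea can be made concrete with a minimal forward pass: each layer is a linear map followed by a nonlinearity, stacked several deep, ending in a softmax over output classes. The layer sizes and the framing (lip features in, viseme classes out) are illustrative assumptions, not Liopa’s architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: 32 lip-movement features in, three hidden layers,
# 12 viseme classes out. Real architectures and weights would be learned.
layer_sizes = [32, 64, 64, 64, 12]

weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Forward pass: ReLU hidden layers, then a softmax output layer."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)         # hidden layer + ReLU
    logits = x @ weights[-1] + biases[-1]      # linear output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                         # probabilities over classes

probs = forward(rng.standard_normal(32))
```

The output is a probability distribution over the candidate viseme classes; training (not shown) would fit the weights to labelled lip-movement data.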
Using these techniques Liopa has created a very flexible VSR platform, the most notable aspects of which are:
- works in real-world situations where environmental conditions are non-ideal for speech recognition
- utilises existing phones and computing devices (i.e. no new hardware required)
- can be trained for single or multiple speakers, or be speaker-independent
- a unique image feature extraction and analysis methodology which enables good-quality speech recognition
- a new and unique Deep Neural Network based implementation of viseme analysis