Q How long have you been at Liopa?
Since mid-2018, so roughly four years.
Q Which side of the technology do you work on the most?
I fit somewhere between a research engineer and a software engineer. I spend half the time researching the current state-of-the-art and developing new models, techniques, and training. The other half of my time, I’m developing production-ready code for deploying on servers that run demos and current products like SRAVI – so the back-end elements. My work is fairly varied.
Q Do you work with AI and computer vision?
Yes, my main background is in ML and computer vision. I’ve developed the front-end of the pipeline that takes a video sequence of someone speaking and feeds it into the AI-based computer vision programme. That software engineering project was one of the first things I worked on. I’ve also helped with productionising the pipeline so it’s fast, efficient and able to be deployed onto servers for live usage.
I spend the remainder of my time on R&D work – I’ve always naturally fallen between the two, sort of bridging the gap between research and software engineering.
Q Can you describe the work you’ve done on Keyword Spotting, and describe how that’s different to regular lipreading?
I’m developing an appropriate model for keyword spotting. These models attempt to spot keywords in a video sequence, as opposed to transcribing the whole video.
For example, if an elderly patient is saying a series of words and phrases, the system is designed to pick out a keyword such as ‘toilet’, so you can gather their intent. It’s about trying to find the intent in a communication, irrespective of how it’s said, or the order it’s said in.
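The idea described above – picking out keywords and mapping them to an intent regardless of word order or surrounding phrasing – can be sketched roughly as follows. This is an illustrative sketch only: the keyword vocabulary, intent names, and `infer_intents` function are hypothetical, not Liopa's actual system.

```python
# Hypothetical sketch: mapping keywords a KWS model has spotted to a
# patient's intent. The vocabulary and intent labels are illustrative.
INTENTS = {
    "toilet": "needs_bathroom",
    "pain": "needs_pain_relief",
    "water": "needs_drink",
}

def infer_intents(spotted_keywords):
    """Map the set of keywords the model fired on to a set of intents.

    Word order and any surrounding words are ignored - only the
    presence of a known keyword matters.
    """
    return {INTENTS[kw] for kw in spotted_keywords if kw in INTENTS}

# e.g. the model spots 'toilet' somewhere in the utterance:
print(infer_intents({"please", "toilet", "now"}))  # {'needs_bathroom'}
```

However the patient phrases the request, as long as the keyword is spotted, the same intent is recovered.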
Q Where did you work before?
I was at an AI consultancy – with a similar kind of role – working on a wide range of computer vision and ML problems. I was approached by Darryl Stewart – he was my PhD supervisor when I was at QUB. My PhD was in visual speech recognition, so it was exciting to hear about this company that had been started up. Liopa was trying to commercialise this technology that we had spent three years of our lives developing.
Q But you’re no longer based in Belfast?
I’m in Stoke-on-Trent now. I started my PhD at QUB in 2008 and finished in 2011. It still feels like coming back home when I come back to Belfast.
Q What is the future for the Keyword Spotting (KWS) feature?
In the first instance, I can see it improving the robustness of command-and-control scenarios, where you have a limited set of intents or a fixed vocabulary. Or, perhaps, in an automotive setting where there’s a fixed set of commands you’d be speaking.
A KWS system allows for a lot more flexibility with the way users can communicate, and how the system determines their intent.
The alternative would be to use a visual speech recognition (VSR) system to transcribe everything that’s said, and infer their intent from that – that might be a trickier way to solve the problem.
In terms of where else it could go – it could be useful in security apps like in CCTV footage – but also tagging and retrieval of videos based on keywords. For example, you could run a search on videos where a particular subject has been spoken about.
Q How is KWS a technically different challenge from regular lipreading (VSR)?
It’s essentially a binary classification problem – to spot keywords in a video sequence, you train a model to output a yes or no: either the keyword occurred in the sequence or it didn’t.
VSR – as we currently implement it – is a generative modelling approach, where we generate what we think is being said and match that against a list. That’s how the two systems work differently.
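The contrast between the two approaches might be sketched like this. This is a simplified illustration under assumed model outputs, not Liopa's implementation: `kws_decision` and `vsr_match`, the score values, and the phrase list are all hypothetical.

```python
# Illustrative contrast between the two approaches described above.
# A KWS model gives a per-keyword probability for the whole sequence;
# a VSR system generates a transcript that is matched against a list.

def kws_decision(keyword_scores, threshold=0.5):
    """Binary classification: did each keyword occur in the sequence?"""
    return {kw: score >= threshold for kw, score in keyword_scores.items()}

def vsr_match(generated_transcript, phrase_list):
    """Generative route: match a generated transcript against a list."""
    return [p for p in phrase_list if p in generated_transcript]

# Hypothetical model outputs for one video sequence:
scores = {"toilet": 0.91, "water": 0.12}
print(kws_decision(scores))  # {'toilet': True, 'water': False}
print(vsr_match("i need the toilet", ["toilet", "nurse"]))  # ['toilet']
```

The KWS route answers a direct yes/no question per keyword, while the VSR route has to transcribe everything first and infer intent afterwards – the trickier path mentioned above.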
Q Did you work a lot on the AI training of the models?
It’s been a combined effort, but I had a lot of involvement in training the models that are currently being used for SRAVI, for example – both the back-end VSR and the front-end.
A big part of the job is training models, and there’s a lot of development work involved in training several of them. Once you have a model, there’s still the job of fine-tuning it and trying to optimise it.
Q What do you enjoy about this job?
Generally, I love working on the cutting edge of a technology like this. We’re breaking new ground and trying to solve problems that haven’t necessarily been solved yet. It’s particularly exciting at the moment because VSR has reached a level of maturity where you’re starting to see a lot of interest from the bigger companies; for instance, Google and Meta have taken a big interest. We’ve reached a point where our hard work has created software that is complex enough to run VSR, and yet it can be run on mobile devices. It can become as ubiquitous as audio-based speech recognition and other AI techs like face recognition – it’s cool to be at the forefront of something.