Gerasimos – an expert in the fields of speech recognition and computer vision – has recently joined the team at Liopa. He works remotely while serving as a Professor at the University of Thessaly in Greece. He joined us for a chat about his career so far.
A career in speech and image
Gerasimos worked at both AT&T Labs and IBM after graduating in 1994 with his PhD from Johns Hopkins University, where the Center for Language and Speech Processing (CLSP) had just been formed.
This sparked an interest in Gerasimos – until then a researcher in image processing – in looking at speech as well.
“Similar techniques were being used in the two fields,” he explains. Gerasimos says that he began “looking at techniques for modelling language” during his two-year Post Doctoral position at CLSP.
While in the US, at AT&T Labs, he joined a project combining speech with video. “The work shared the same goals as what Liopa is doing in lip-reading,” he explains.
There, he also worked on visual synthesis and facial animation, and had the opportunity to interact with Dr. Eric Petajan of Bell Labs, considered the father of the lip-reading field.
“However, camera and computer processing power were limited back then. Now, we benefit from huge advancements in these areas,” he says.
In 1999, he moved to the IBM T. J. Watson Research Center in Westchester County, NY, which was starting a project on lip-reading.
“CLSP hosts a computational workshop series every year. In 2000, one of the topics was lip-reading – there were many researchers looking at this area in our team – we were all working together to advance the field,” he says.
What were the real-life applications of this research into automated lip-reading?
“We were researching that,” Gerasimos says. “Dictation, and voice-activated car navigation systems – and you could see this as a very noisy environment. We looked at it [lip reading] as a modality that could help make speech more robust.”
“We faced challenges in the visual domain, as well as data – not having enough. There have been tremendous advances in computational power, better algorithms, and more data, since,” he says.
What is most interesting in this area of visual speech recognition?
Gerasimos says, “Clearly, voice assistants are the most interesting domain.”
He says that making the system more robust to background noise, and to events that happen while you use the assistant, is paramount.
“Also the in-car domain is very interesting.”
There is a social element to it as well: “This modality can help people with disabilities. It can help traditional speech-based technologies.”
What made you interested in joining Liopa?
He says, “Liopa is in a unique position to be the largest – and to my knowledge also the first – startup focusing on this domain of automated lip-reading. It’s very exciting to see a group of determined people working to bring this technology to market.”
What academic institutions are leading in automated lip-reading research?
- Many universities in the UK have been working on this topic – Oxford, East Anglia, Surrey, and Imperial College London all have published in the field in recent years
- In France, the University of Grenoble has published many papers on audio-visual (AV) speech processing technologies
- KTH in Sweden has been researching facial animation
He concludes: “In almost every country there are efforts – although not necessarily continually over the years.”
Within 10 years will we develop an unconstrained automated lip-reading application?
“Yes, I believe it is possible – we have seen substantial growth and interest – so we are getting much closer than we were 10 years ago.”