Synthesizing realistic human speech just got a lot easier

(Bloomberg) -- Voysis, a Dublin-based startup, said it’s shrunk the processing power required to run cutting-edge “Wavenet” voice generation technology so it can work on a mobile phone or other consumer device even without a connection to the internet. The company began selling the system to developers and manufacturers as of Thursday, and said the advances it’s made will make it easier and less expensive to create chatbots and digital assistants with realistic-sounding synthesized human voices.

The market for text-to-speech applications is forecast to grow to more than $3 billion by 2022, up from $1.3 billion in 2016, according to data compiled by Ireland-based Research and Markets. Sales of digital assistants, many of which will incorporate computer-generated voices, are expected to hit $4 billion by the same year, according to Colorado-based market intelligence firm Tractica.

DeepMind, the London-based artificial intelligence company owned by Alphabet Inc., first developed Wavenets, an AI-based method for creating more human-like computer speech, in 2016. The method has since found its way into the digital assistant and cloud-computing offerings of DeepMind’s sister company Google, which uses it to create 30 distinct voices in 14 languages.

Before Wavenets, voice synthesizers worked primarily by combining individual syllables, either voiced by a human actor or generated digitally, to create words. Wavenets, by contrast, are trained on sound waves instead of syllables. They learn to accurately predict how the shape of the wave will change, making a new prediction as frequently as 24,000 times per second. The result is much more realistic-sounding speech, which can include nuances like accents, lip smacks and verbal tics.
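To make the sample-by-sample idea concrete, here is a minimal Python sketch of that generation loop. It is an illustration only, not DeepMind's or Voysis's code: the function names are hypothetical, and the "model" is assumed to be a fixed two-tap formula that continues a pure tone, standing in for the trained neural network a real Wavenet would use. What it does show is the structure the paragraph describes, in which each new point on the wave is predicted from the points before it, 24,000 times for every second of audio.

```python
import math

SAMPLE_RATE = 24_000  # predictions per second, the figure cited in the article


def predict_next(history, freq_hz=440.0):
    """Stand-in for a trained network (assumption): a fixed two-tap linear
    predictor that continues a pure tone. A real Wavenet would instead use a
    deep neural network to predict the next sample."""
    w = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    return 2.0 * math.cos(w) * history[-1] - history[-2]


def generate(seconds, freq_hz=440.0):
    """Generate audio one sample at a time, feeding each prediction back in."""
    w = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    samples = [0.0, math.sin(w)]  # two seed samples to start the loop
    total = int(seconds * SAMPLE_RATE)
    while len(samples) < total:
        samples.append(predict_next(samples, freq_hz))
    return samples


audio = generate(0.5)
print(len(audio))  # 12000 samples for half a second of audio
```

Even this toy version has to run its prediction step 12,000 times to produce half a second of sound, which hints at why the full technique is so demanding of processing power.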

But despite improvements Google has made to reduce the amount of computing power the technique uses, it still requires a consistent connection to Google’s data centers, where the company uses specialized computer processors designed specifically for AI applications. Baidu, the Chinese Internet giant, has also experimented with cloud-based Wavenets, but has not yet put them into its products because of the amount of processing power the technique requires.

Peter Cahill, a former academic researcher in computer-generated speech who co-founded Voysis in 2012, said his company had managed to shrink its system to the point where, once the AI is trained, the software uses as little as 25 megabytes of memory – about the same size as four Apple Music or Spotify songs.

Cahill said the company intends to publish an academic research paper on its technology within the next six weeks. The paper will include benchmark tests of its performance against other voice synthesizers, among them cloud-based Wavenet systems.

Simon King, a professor who specializes in speech technology at the University of Edinburgh but is not affiliated with Voysis, said the kinds of efficiency improvements Voysis is announcing could spur more companies to adopt Wavenets for computer speech.

“It’s likely to become the de facto approach used in commercial applications very soon,” he said in a statement. “It provides more natural-sounding speech than all previous technologies.”

In addition to Cahill, Voysis’s team includes Ian Hodson, a veteran software engineer who headed Google’s text-to-speech efforts from 2010 to 2016. Google had acquired Phonetic Arts, a speech synthesis company he helped found. Hodson also sold a previous voice synthesis company, Rhetorical Systems, to Nuance Communications in 2004.

Voysis currently employs 35 people in Dublin, Edinburgh and Boston and has received $8 million in venture funding from Boston-based Polaris Venture Partners. The company sells a suite of voice and natural-language processing services and said it has several large U.S. consumer companies as existing customers, but declined to name any, citing confidentiality agreements.

Bloomberg News