How do machines mimic human speech?

The idea of non-humans learning to talk like us has intrigued our species for a very long time. It is one of the reasons parrots are such common house pets: their ability to learn to speak makes them exotic birds that fascinate and entertain. Similarly, talking machines have been a fixture of science fiction ever since the earliest computers.

The idea of a self-aware, artificially intelligent computer or robot has long been a point of interest for many, from experts in computing and robotics to dreamers and fans of fictional universes. You can find dozens upon dozens of movies, TV shows and novels that explore the concept. It is only natural, then, that we have worked on this field and made great strides in it.

Though robotic butlers and A.I. system managers are still far from reality, machines are learning to talk. Through the combined efforts of language scientists, acousticians and electronics experts, synthetic speech is allowing clocks to announce the time, machines to read to the blind and cars to warn their owners that it’s time to fill up.

In order to develop such chatty contraptions, linguists first had to learn what makes up a word. They have broken human language down into a small number of identifiable sounds, or phonemes. All the words in Standard English are said to be composed of just 40 to 50 basic phonemes strung together and adjusted for syntax.
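
To make the idea concrete, here is a minimal sketch of words being looked up as strings of phonemes. The tiny dictionary and its ARPAbet-style symbols are purely illustrative, not a real pronunciation lexicon.

```python
# Illustrative only: a toy lexicon mapping a few words to phoneme strings.
PHONEME_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "machine": ["M", "AH", "SH", "IY", "N"],
    "talk": ["T", "AO", "K"],
}

def to_phonemes(sentence: str) -> list[str]:
    """Map each known word to its phonemes; unknown words are simply skipped."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(PHONEME_DICT.get(word, []))
    return phonemes

print(to_phonemes("machine speech"))
# ['M', 'AH', 'SH', 'IY', 'N', 'S', 'P', 'IY', 'CH']
```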

A computer is taught to recognize and synthesize words in one of two ways. In the first, known as synthesis by analysis, it takes recorded samplings of the human voice and analyses their sound waves every one-hundredth of a second. It then extracts and stores certain key attributes, such as predominant frequencies and energy levels. Later, the machine can mimic these impulses electrically and, using filters, oscillators and noise generators, turn them into sounds. Since the computer monitors each tiny nuance, synthesis by analysis can produce extremely lifelike voices. Vocabulary, however, is limited to the words actually programmed into its memory.
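
The sketch below illustrates the analyse-then-rebuild idea under heavy simplification: each one-hundredth-of-a-second frame is reduced to a dominant frequency and an energy level, and the signal is rebuilt from just those stored attributes. The 16 kHz sample rate and the one-sinusoid-per-frame model are assumptions made for brevity, not how production systems work.

```python
import numpy as np

SAMPLE_RATE = 16_000          # assumed sample rate (samples per second)
FRAME = SAMPLE_RATE // 100    # one-hundredth of a second per frame

def analyze(signal: np.ndarray) -> list[tuple[float, float]]:
    """Store (dominant frequency in Hz, RMS energy) for each frame."""
    features = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[start:start + FRAME]
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(FRAME, d=1.0 / SAMPLE_RATE)
        dominant = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin
        energy = np.sqrt(np.mean(frame ** 2))
        features.append((dominant, energy))
    return features

def resynthesize(features: list[tuple[float, float]]) -> np.ndarray:
    """Rebuild an approximation: one sinusoid per frame with the stored attributes."""
    t = np.arange(FRAME) / SAMPLE_RATE
    frames = [np.sqrt(2) * energy * np.sin(2 * np.pi * freq * t)
              for freq, energy in features]
    return np.concatenate(frames) if frames else np.array([])

# Example: analyze and rebuild one second of a synthetic 440 Hz tone.
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
rebuilt = resynthesize(analyze(tone))
```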

The other method, synthesis by rule, allows enormous versatility because any word can be produced. The computer is programmed with the basic phonemes and the rules of pronunciation and stress, from which it assembles words. But what is gained in flexibility is lost in clarity, since it’s difficult to reduce all the permutations of pronunciation and inflection to a single set of rules. Regardless of the technique used, voice-synthesis systems are becoming ever more commonplace.
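
Here is a toy version of synthesis by rule: a handful of letter-to-sound rules, invented for this example, turn any spelling into a phoneme string with no stored recordings. It cheerfully mispronounces exceptions such as "laugh", which is exactly the clarity trade-off described above.

```python
# Invented rules for illustration; a real rule set is far larger and subtler.
DIGRAPH_RULES = {"ch": "CH", "sh": "SH", "th": "TH", "ee": "IY", "oo": "UW"}
LETTER_RULES = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
                "g": "G", "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M",
                "n": "N", "o": "AA", "p": "P", "r": "R", "s": "S", "t": "T",
                "u": "AH", "v": "V", "w": "W", "y": "Y", "z": "Z"}

def spell_to_phonemes(word: str) -> list[str]:
    """Scan the spelling left to right, preferring two-letter rules."""
    word = word.lower()
    phonemes, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPH_RULES:
            phonemes.append(DIGRAPH_RULES[pair])
            i += 2
        elif word[i] in LETTER_RULES:
            phonemes.append(LETTER_RULES[word[i]])
            i += 1
        else:
            i += 1  # letters with no rule are silently dropped
    return phonemes

print(spell_to_phonemes("sheep"))  # ['SH', 'IY', 'P']
print(spell_to_phonemes("laugh"))  # ['L', 'AE', 'AH', 'G', 'HH'] -- wrong, as expected
```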

We are moving ever closer to a world where typing on keyboards or tapping at screens will be a thing of the past. You can already find speech recognition software and talking digital assistants in modern smartphones and computers. However, we still have some ground to cover before this technology is perfected.

There are still several hurdles to overcome. For example, it can be quite difficult for a machine to separate voices from background noise; it has to learn to distinguish useful sound waves from useless ones. There is also the matter of people sounding different from one another, and of some words being pronounced exactly the same way while meaning drastically different things.
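
As a crude illustration of that first hurdle, the sketch below flags frames whose energy rises well above an estimated noise floor as probable speech. The frame length and threshold factor are arbitrary assumptions; real recognizers rely on far richer spectral and statistical cues.

```python
import numpy as np

def voice_activity(signal: np.ndarray, frame: int = 160, factor: float = 3.0) -> list[bool]:
    """Flag each frame as speech (True) or background noise (False)."""
    energies = [np.sqrt(np.mean(signal[i:i + frame] ** 2))
                for i in range(0, len(signal) - frame + 1, frame)]
    noise_floor = np.median(energies)   # assumes noise dominates most frames
    return [e > factor * noise_floor for e in energies]
```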

There’s also the problem of people talking too quickly and getting their words all jumbled up. Though our brains can make sense of such phonetic muddle thanks to our extraordinary ability to predict what comes next, computers have a long way to go before they can handle such complexity.
