Digitizing Human Vocal Communication

by Priscilla Oppenheimer

Computers have started to learn the ancient art of human language. Computers can talk, listen, and sing. They can understand simple sentences and follow basic vocal instructions. Despite these achievements, many challenges lie ahead before computers can comprehend natural spoken language. A super-intelligent, talking machine, such as HAL in the Kubrick and Clarke "2001: A Space Odyssey" movie, is still science fiction, not science. In the meantime, scientists continue to make progress on the fundamentals of computerized language, including digitizing speech, compressing digitized speech, speech synthesis, and speech recognition.

Digitizing Speech

Computers work with discrete digital values. For example, a digital watch is called digital because it can display only a finite number of distinct times. In contrast, watches with hands are analog because the hands move continuously around the clock face. As the minute hand travels around the circle, it touches the numbers 1 through 12 and also the infinite number of points in between.

The human voice produces an analog signal. When a speaker pushes air out of the lungs through the glottis, pulses of air escape through the mouth and sometimes the nose. These pulses produce small variations in air pressure that result in an analog signal. According to Fromkin and Rodman, "The sounds we produce can be described in terms of how fast the variations of the air pressure occur, which determines the fundamental frequency of the sounds and is perceived by the hearer as pitch. We can also describe the magnitude or intensity of the variations, which determines the loudness of the sounds." (Fromkin 364)

Human speech can be represented as an analog wave that varies over time and has a smooth, continuous curve. The height of the wave represents intensity (loudness), and the shape of the wave represents frequency (pitch). The continuous curve of the wave accommodates an infinity of possible values. A computer must convert these values into a set of discrete values, using a process called digitization. Once speech is digitized, a computer can store it on a hard drive and transmit it across digital networks, including corporate networks, the Internet, and telephone-company networks, which increasingly use digital components.

To digitize speech, an analog-to-digital converter samples the value of the analog signal repeatedly and encodes each result in a set of bits. Before sampling, the converter filters the signal so that most of it lies between 300 and 3400 Hz. While humans can hear frequencies as high as 20 kHz, most of the information conveyed in speech does not exceed 4 kHz. (A hertz, or Hz, is a unit of frequency equal to one cycle per second.)

Sampling uses a theorem developed by the American physicist Harry Nyquist in the 1920s. Nyquist's Theorem states that the sampling frequency must be at least twice as high as the highest input frequency for the result to closely resemble the original signal. Thus, the voice signal is sampled at 8000 Hz so that frequencies up to 4000 Hz can be recorded. Every 125 microseconds (1/8000th of a second), the value of the analog voice signal is recorded in an 8-bit byte, the basic unit of storage on modern-day computers. According to Nyquist's Theorem, sampling this often produces a faithful representation of the original signal, and the human ear will not hear distortion.
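
To make this concrete, the following Python sketch samples a hypothetical 440 Hz tone 8000 times per second and stores each sample in one 8-bit value. The constants and the simple linear encoding are illustrative assumptions; real telephone equipment encodes each sample logarithmically.

    import math

    # A hypothetical 440 Hz tone, sampled at the Nyquist rate for a 4 kHz band
    # and quantized into 8-bit values; all names and constants are illustrative.
    SAMPLE_RATE = 8000       # samples per second, one every 125 microseconds
    TONE_FREQ = 440.0        # Hz, well below the 4 kHz limit
    DURATION = 0.01          # seconds of signal to digitize

    def sample_and_quantize():
        """Sample the analog signal and store each value in one 8-bit byte."""
        samples = []
        for n in range(int(SAMPLE_RATE * DURATION)):
            t = n / SAMPLE_RATE                                      # time of this sample
            analog_value = math.sin(2 * math.pi * TONE_FREQ * t)     # between -1.0 and 1.0
            samples.append(round((analog_value + 1.0) / 2.0 * 255))  # 0 through 255
        return samples

    digitized = sample_and_quantize()
    print(len(digitized), "samples;", SAMPLE_RATE * 8, "bits per second")

The final line shows the arithmetic that matters for the next section: 8000 samples per second times 8 bits per sample is 64,000 bits per second.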

Compressing Digitized Speech

When an analog voice signal is sampled 8000 times per second, and each sample is stored in an 8-bit byte, the bandwidth required is 64,000 bits per second (64 Kbps). Recent research in the field of digitized voice focuses on algorithms that convert from analog to digital using a lower bit rate. Although 64 Kbps is not a large portion of the bandwidth on local-area networks (LANs), it is a large portion for remote-access networks and some wide-area networks (WANs). Consider a home user who might still be using a 56-Kbps modem. This user cannot use new Internet voice applications unless the voice signal is compressed.

Voice compression is handled by coder/decoders, or codecs, which are implemented in either hardware or software. A hardware implementation in a silicon chip is preferable because codecs must work quickly so that no delay is perceptible. In recent years, however, typical personal computers have become so fast that software codecs are now effective. Major players in the computer industry, including Intel, Microsoft, and Apple Computer, provide voice codecs. These companies, along with smaller companies and international standards organizations, have developed three types of speech compression algorithms: waveform codecs, source codecs, and hybrid codecs. Hybrid codecs use a mix of waveform and source methods.

Waveform codecs are the least complex of the codec types. The Pulse Code Modulation (PCM) codec is a common waveform codec that works in the classic way already described in this paper. The analog speech signal is filtered to remove high and low frequency components, and is sampled 8000 times per second. The sampled value is represented in 8 bits, resulting in a need for 64-Kbps bandwidth. The downside of the PCM codec is that it is not optimized for low-bandwidth networks.

The Adaptive Differential Pulse Code Modulation (AD-PCM) codec is a more advanced waveform codec. Instead of transmitting the actual sample values, after the first few samples, the AD-PCM codec transmits the difference between the actual input and an estimated value based on previous speech. Since most people cannot change their voice much in 125 microseconds, this method works well. By transmitting a difference from the prediction, instead of transmitting a full value, AD-PCM provides high-quality speech at sub-PCM bit rates. The codec can store the value of the difference in 2, 3, 4, or 5 bits. If the codec uses 2 bits, AD-PCM requires only 16-Kbps bandwidth, but the speech sounds distorted. Using 5 bits offers better fidelity, but requires 40 Kbps, which is too much bandwidth for some situations. Engineers developed the lower bit-rate source codecs to avoid this problem.
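
The sketch below illustrates the differential idea in miniature. It assumes a fixed step size and a simple previous-sample predictor; a real AD-PCM codec adapts both, which is what the word "adaptive" refers to.

    # Toy differential coder: predict each sample from the previous one and
    # transmit only the quantized difference. The step size and the simple
    # previous-sample predictor are fixed here for clarity.
    STEP = 8                  # assumed quantization step for the difference
    BITS = 4                  # bits per difference: 4 x 8000 = 32 Kbps

    def encode(samples):
        codes, prediction = [], 0
        limit = 2 ** (BITS - 1) - 1
        for s in samples:
            code = max(-limit - 1, min(limit, round((s - prediction) / STEP)))
            codes.append(code)
            prediction += code * STEP      # the decoder tracks the same estimate
        return codes

    def decode(codes):
        samples, prediction = [], 0
        for code in codes:
            prediction += code * STEP
            samples.append(prediction)
        return samples

    original = [0, 10, 22, 30, 29, 25, 18, 10]
    print(decode(encode(original)))        # close to the original, at half the PCM bit rate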

Source codecs are designed specifically for speech, whereas waveform codecs work well with any type of sound. Source codecs are based on a model for the human voice, called the source-filter model, developed by linguists, biologists, and physicists. In the source-filter model, the source is the lungs and vocal folds, which provide the raw acoustic energy. The filter is the vocal tract, including the throat, tongue, nasal cavity, teeth, and lips, which shape sounds into particular vowels and consonants. During speech, the vocal tract filters an excitation signal from the lungs and vocal folds. Source codecs emulate the behavior of the excitation signal and the vocal tract filter.

With source codecs, a single bit is used to specify whether a sound is voiced or unvoiced. (A voiced sound is produced when the vocal folds are held close together and vibrating.) For unvoiced sounds, the codec uses "white noise" (i.e., a random signal) for the excitation signal. For voiced sounds, the codec determines a fundamental frequency for the excitation signal. Source codecs also convey information on how the vocal tract has affected the raw sound. Acoustic engineers have determined that the vocal tract produces a combination of tones that match the natural resonant frequencies of the air within the tract. As people speak, they raise and lower the resonant frequencies, also known as formant frequencies, by moving their tongues and lips.
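
A minimal sketch of the two kinds of excitation, assuming a 120 Hz fundamental frequency and a 30-millisecond frame (both values are illustrative):

    import random

    SAMPLE_RATE = 8000

    def excitation(voiced, fundamental_hz=120, num_samples=240):
        """One 30 ms frame of excitation: pulses for voiced sounds, noise for unvoiced."""
        if not voiced:
            return [random.uniform(-1, 1) for _ in range(num_samples)]    # "white noise"
        period = round(SAMPLE_RATE / fundamental_hz)                      # samples between pulses
        return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

    voiced_frame = excitation(voiced=True)       # e.g., the vowel in "see"
    unvoiced_frame = excitation(voiced=False)    # e.g., the /s/ in "see"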

To understand why the formant frequencies change with different sounds, think of the vocal tract as a tube closed at one end by the vocal folds and open at the other end at the lips. As in any transmission system, the shape and length of the transmission medium affect which frequencies are reinforced. In the case of voice, the resonant formant frequencies are affected by tongue height and placement for vowels, and by the manner and place of articulation for consonants. Formant frequencies are the acoustic counterpart of the articulatory categories that linguists know as bilabials, labiodentals, fricatives, glides, liquids, high vowels, front vowels, and so on.
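
Under the simple tube assumption, the resonances can even be calculated. The sketch below uses the textbook formula for a uniform tube closed at one end, Fn = (2n - 1)c / 4L, with an assumed vocal-tract length of 17 centimeters.

    # Resonances of a uniform tube closed at the vocal folds and open at the
    # lips. A 17 cm tube is a common textbook approximation of an adult vocal
    # tract producing a neutral vowel; real vocal tracts are not uniform tubes.
    SPEED_OF_SOUND = 343.0    # meters per second (approximate)
    TRACT_LENGTH = 0.17       # meters (assumed)

    for n in range(1, 4):
        formant = (2 * n - 1) * SPEED_OF_SOUND / (4 * TRACT_LENGTH)
        print("F%d is roughly %.0f Hz" % (n, formant))
    # Prints values near 500, 1500, and 2500 Hz, the classic neutral-vowel formants.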

In a sidebar in an article on throat singers in Tuva (a region in Siberia), George Musser, staff writer for Scientific American, describes the formants that encode vowel frequencies. He writes, "The frequency of the first formant, F1, is inversely related to tongue height (F1 falls as the tongue rises, as during the change from /a/ in "hot" to /i/ in "heed"). The frequency of the second formant, F2, is related to tongue advancement (F2 rises as the tongue moves forward, as when /o/ in "hoe" moves toward /i/ in "heed"). Theoretically, the vocal tract has an infinite number of formants, but the arrangement of the first two or three accounts for most of the difference among vowel sounds." (Edgerton 80)

Source codecs produce voice that sounds artificial, which is why engineers developed hybrid codecs. Hybrid codecs use a combination of source modeling and waveform analysis that produces high-quality speech without requiring a lot of bandwidth. Hybrid codecs process a set of excitation signals through a filter to see which excitation produces the best match to the original waveform. Once the best match is found, the codec transmits the filter variables rather than the original waveform or excitation signal. This method results in high-fidelity sound that may require as little as 5.4 Kbps of bandwidth, depending on the implementation.
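
The search described above is often called analysis-by-synthesis. The toy loop below assumes an invented three-entry codebook and a fixed impulse-response filter; real hybrid codecs use linear-prediction filters derived from the speech itself and far larger codebooks.

    # Toy analysis-by-synthesis: run each candidate excitation through the
    # filter, keep the index whose synthesized output best matches the target,
    # and "transmit" only that index plus the filter parameters.
    FILTER_COEFFS = [1.0, 0.6, 0.3]            # assumed short impulse response

    def synthesize(excitation):
        out = [0.0] * len(excitation)
        for i in range(len(excitation)):
            for j, c in enumerate(FILTER_COEFFS):
                if i - j >= 0:
                    out[i] += c * excitation[i - j]
        return out

    def best_excitation(target, codebook):
        def error(candidate):
            return sum((a - b) ** 2 for a, b in zip(target, synthesize(candidate)))
        return min(range(len(codebook)), key=lambda i: error(codebook[i]))

    codebook = [
        [1, 0, 0, 0, 1, 0, 0, 0],          # periodic pulses (voiced-like)
        [1, -1, 1, -1, 1, -1, 1, -1],      # rapid alternation (noise-like)
        [0, 0, 1, 0, 0, 0, 1, 0],
    ]
    target = synthesize(codebook[0])       # stand-in for a frame of real speech
    print("best codebook index:", best_excitation(target, codebook))    # prints 0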

Speech Synthesis

Speech synthesis refers to a computer producing sound that resembles human speech. Speech synthesizers can read text files and output sentences in an audible and intelligible voice. Many systems allow the user to choose the type of voice, for example, male or female. Speech synthesis systems are particularly valuable for visually impaired individuals. Speech synthesizers can also give a voice to someone who cannot speak or who has had a laryngectomy. Other applications include telephone directory assistance and airline systems that provide spoken automated arrival and departure times.

One of the more interesting applications for speech synthesis is talking Web pages. The World Wide Web Consortium (W3C) Voice Browser Working Group is developing a speech synthesis markup language for Web page authors. (See Hunt in "Bibliography.") With the W3C markup language, authors can include information in Web documents that helps a synthesizer correctly say sentences. The markup language gives authors a standard way to control pronunciation, volume, pitch, and rate. For example, the author can ensure that the speech synthesizer pauses appropriately by indicating the beginning and ending of phrases, paragraphs, and sentences.

A text-to-speech (TTS) system supports the speech synthesis markup language and is responsible for rendering a document as spoken output. One of a TTS system's main jobs is to convert words into a string of phonemes. A phoneme is an abstract unit in a language that corresponds to a single, distinctive sound. Web-page authors can use the "phoneme element" in the markup language to provide a phonemic sequence for words. If the author does not specify a phoneme element, the TTS system automatically determines the pronunciation by looking up words in a pronunciation dictionary. The phoneme element uses the International Phonetic Alphabet (IPA). For example, the source for a Web page might include a statement with the following marked text: <phoneme ph="θru"> through </phoneme>.
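
A minimal sketch of that fallback behavior, assuming a tiny invented IPA dictionary; production systems combine large pronunciation lexicons with letter-to-sound rules for unknown words.

    # Prefer the author's phoneme element when one is supplied; otherwise look
    # the word up in a pronunciation dictionary. The dictionary is invented.
    PRONUNCIATIONS = {
        "through": "θru",
        "speech": "spitʃ",
        "the": "ðə",
    }

    def to_phonemes(word, markup_ph=None):
        if markup_ph:                            # value of a <phoneme ph="..."> element
            return markup_ph
        return PRONUNCIATIONS.get(word.lower(), "?")    # "?" flags an unknown word

    print(to_phonemes("through"))                # θru, from the dictionary
    print(to_phonemes("through", "θru"))         # θru, from the markup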

Document creators can also use the emphasis, break, and prosody elements of the W3C markup language to guide a TTS system in generating appropriate prosodic features. Prosody is the set of speech features that includes pitch, timing, and emphasis. Producing human-like prosody is important to correctly convey meaning and to make speech sound natural.
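
The fragment below, assembled as a Python string for consistency with the other sketches in this paper, shows how these elements might appear together. The element names come from the W3C markup language (the speak root plus the p, s, emphasis, break, prosody, and phoneme elements); the sentences and attribute values are invented.

    # A marked-up passage using the markup language's element names (p, s,
    # emphasis, break, prosody, phoneme); attribute values are illustrative.
    markup = """<speak>
      <p>
        <s>Flight <emphasis>two forty-seven</emphasis> departs at
           <prosody rate="slow">nine fifteen</prosody>.</s>
        <break/>
        <s>Please proceed <phoneme ph="θru">through</phoneme> gate twelve.</s>
      </p>
    </speak>"""
    print(markup)    # a TTS system that supports the markup would speak this aloud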

In addition to the work done by the W3C, many computer and telecommunications vendors support speech synthesis, including Apple Computer, IBM, Microsoft, Lucent Technologies, and AT&T. Apple Computer has included speech synthesis software in the Mac OS for many years. AT&T has a fun Web page that lets you type in text to be synthesized. (See AT&T in "Bibliography.") You will notice when trying the AT&T system that the voice sounds somewhat dull and "computerized," which is still a problem for many speech synthesizers.

Speech Recognition

Speech recognition allows a computer to recognize words and follow basic vocal instructions by distinguishing phonemes (distinct sounds) and morphemes, the smallest units of linguistic meaning in a language. (Comprehending complete and uncontrived sentences falls under a different field of computer science called natural language processing, which is discussed in the next section.) Speech recognition systems are useful when individuals are unable to use a keyboard because their hands are occupied or disabled. Instead of typing commands or using a mouse to select menu items, an individual can speak into a computer's microphone.

Speech recognition systems usually require a training session during which the computer learns a particular voice and accent. Some systems also require the speaker to talk slowly and distinctly, and to separate words with short pauses. When compared to the human brain, today's computer brains are not skilled at speech recognition. As Alan Phelps wrote in the July 2000 issue of Smart Computing magazine, "Although babies all over the world pick up language abilities at an astonishing rate, computers muddle along with their superduper electronic brains, incapable of understanding even rudimentary speech. Even your lowly mutt who sleeps on a pile of old blankets in the garage can make more sense of your voice." (Phelps 71)

The first step in speech recognition is converting a speaker's voice into something a computer can recognize. This is accomplished with sampling, as described in the "Digitizing Speech" section. In addition to sampling speech, the computer also listens for pauses so it can sample background noises and remove them from the data. Speech recognition programs compare the sampled sound to known characteristics of human speech and remove obvious noise, such as automobiles and ringing phones.

After sampling, the next step for voice recognition software is to find phonemes within the string of incoming values. A computer can usually recognize phonemes, but not always, depending on the speaker's tone, accent, and rate of speaking. If a phoneme is not obvious, the software makes educated guesses based on linguistic research into which phonemes typically follow others. These conjectures are aided by the fact that the software has already learned how the current user speaks.

Once the software has identified phonemes, it begins combining them into morphemes and words. Once again, statistical analysis helps complete the job. For example, perhaps you want to compliment the computer and say, "This software is great." Maybe you are unsure of the statement, so you don't fully articulate "great." The software thinks it heard "This software is" and then something with the /e/ phoneme. By combining the recognized sounds and word-probability scores, the software can make a good guess that you said, "This software is great," and not "This software is ate" or "This software is grape."
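
A toy version of that guess, with invented numbers: each candidate final word gets an acoustic score for how well it matches the sounds that were heard and a probability of following the words already recognized, and the software picks the word with the best combined score.

    # Combine acoustic evidence with word probabilities; all numbers are invented.
    candidates = {
        #  word     (acoustic match, probability of following "This software is")
        "great": (0.70, 0.20),
        "grape": (0.70, 0.001),
        "ate":   (0.60, 0.002),
    }

    def best_guess(candidates):
        """Pick the word with the highest combined score."""
        return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

    print("Recognized: This software is", best_guess(candidates))    # ... great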

Computers can be programmed with a knowledge base that models a human expert and is capable of distinguishing useful utterances from similar, but not useful, utterances. For example, expert systems for the medical industry let doctors ask questions that the computer answers. If the doctors ask open-ended questions, however, the computer also needs natural language processing capabilities, which is a harder problem.

As computers become faster, software can analyze numerous probabilities so quickly that the user does not notice a slow response time as the speech recognition software moves from phonemes to morphemes to words. Being able to find the probable next phoneme or word is not the same as language comprehension, however. Human language is open-ended. It does not consist of a finite database of fixed words and sentences. Instead, human speakers learn a grammar that lets them construct novel sentences and assign meaning to original utterances by others. Computer engineers face a major hurdle when attempting to emulate these sophisticated comprehension abilities.

There are many applications for speech recognition, for example, telephone systems that let you say numbers rather than enter them with your fingers. User-interface software from Apple Computer, Microsoft, and other companies can turn voice commands into the mouse clicks that programs expect. Many new wireless phones let drivers dial numbers by voice so they can keep their hands on the steering wheel. Technologies such as Microsoft's Auto PC also let you ask for directions, control a CD player, and hear your e-mail read aloud, all while driving your car. One of the more fun uses for voice recognition is surfing the Web with your voice. Dragon Systems' NaturallySpeaking and IBM's ViaVoice products offer the ability to surf the Web without touching the keyboard.

Security systems that recognize words and identify individual voiceprints are used to guard access to ATMs, buildings, computers, voice mail, and wireless phones. VoiceTrack Corporation's voice recognition products allow corrections officers to track the whereabouts of parolees. Using voice recognition for identification and authorization has some limitations, however. Most systems cannot account for a user with laryngitis, for example. Also, a good recording can fool some systems. For security applications, speech recognition should be combined with other biometrics, such as fingerprint or retina scanning.

Natural Language Processing

A computer that can recognize natural human sentences, and take action based on the meaning of the sentences, is the Holy Grail for researchers in the fields of computational linguistics and natural language processing. Most computational linguists believe that the best approach is to model human speech comprehension. This task includes more than simply modeling auditory speech perception, however. A computer must also be capable of advanced pattern matching and sophisticated decision-making that depends on context as well as logic.

The cognitive processing of the human linguistic brain is complex and not completely understood, which means that computer scientists are hard-pressed to emulate human abilities. Moreover, the processes that are understood are difficult for today's computers because of the huge number of possible sentences and decision branches that must be considered.

Early researchers assumed that a computer could do a good job with natural language processing by using a bottom-up method that involves first sampling the analog signal, then recognizing phonemes and morphemes, and finally assembling the phonemes and morphemes into words and sentences. Human brains, however, probably do at least some top-down processing, wherein they proceed from semantic and syntactic information to the sensory input.

Experiments show that subjects make fewer word-identification errors when the words occur in sentences than when they are presented alone. This discovery suggests that people use syntax in addition to acoustics to deduce meaning. Experiments also show that when subjects listen to recorded sentences in which a sound is removed and a cough is substituted, they hear the sentence without the missing sound. Although the acoustic signal is distorted, the brain's top-down processing makes an early decision about the meaning of the sentence, which preempts the acoustic problem. (Fromkin 367)

To become accurate language processors, computers must get better at the human ability to parse sentences. Parsing determines the syntactic structure of an expression. Parsing also involves semantic processing: looking up words in a mental lexicon and combining word meanings into workable phrases and sentences. The human parser develops a tree structure of determiners, noun phrases, verb phrases, and other parts of speech, configured in a mental hierarchy. Developing the tree requires an understanding of rules, for example, a rule that states that a noun phrase requires a verb phrase to complete a sentence.

Parsing operates as an iterative process that uses temporary memory that is similar to the random-access memory (RAM) in computers. The parser temporarily stores a phrase in memory while it works on other phrases. For example, in a sentence such as "The orange bird hopped," the parser must store "The orange" until it has enough knowledge to deduce that "orange" refers to the color of the bird, and not a type of fruit.
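
The miniature parser below illustrates the storage step with an invented four-word lexicon and grammar. When it reaches "orange," which could be an adjective or a noun, it holds the partial noun phrase and peeks at the next word before committing.

    # Tiny parser for a toy grammar: S -> NP VP; NP -> Det Adj* N; VP -> V.
    LEXICON = {
        "the": {"Det"},
        "orange": {"Adj", "N"},    # the color or the fruit
        "bird": {"N"},
        "hopped": {"V"},
    }

    def parse(sentence):
        words = sentence.lower().rstrip(".").split()
        i = 0

        # Noun phrase: determiner, optional adjectives, then the head noun.
        np = [("Det", words[i])]
        i += 1
        while "Adj" in LEXICON[words[i]]:
            following = words[i + 1] if i + 1 < len(words) else None
            if following and "N" in LEXICON.get(following, set()):
                np.append(("Adj", words[i]))   # "orange" modifies a later noun
                i += 1
            else:
                break                          # no noun follows, so this word is the head
        np.append(("N", words[i]))
        i += 1

        # Verb phrase: a single verb completes the sentence in this toy grammar.
        return ("S", [("NP", np), ("VP", [("V", words[i])])])

    print(parse("The orange bird hopped."))    # "orange" parsed as an adjective
    print(parse("The orange hopped."))         # here "orange" becomes the head noun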

Computers do well at some aspects of parsing, for example, looking up words in a lexicon and applying rules. Computers are also good at placing phrases into temporary storage and later retrieving them. In fact, computers are better than humans at storing and retrieving numerous items. Humans lose track when sentence comprehension requires the storage of more than about seven items. Compared to humans, however, computers perform poorly in the decision-making that is required to correctly parse a sentence.

According to Pinker, in his book The Language Instinct, computer parsers are too meticulous for their own good. They find ambiguities that are legitimate for English grammar rules, but would never occur to a human. Pinker describes one of the first computer parsers that found five different meanings for the simple sentence "Time flies like an arrow." In three of the versions, the computer decided that "time" was an imperative verb, meaning, "determine the duration of," and that "flies" was a noun. In one version, the computer decided that there is such a thing as a "time-fly," and that time-flies are fond of arrows. (Pinker 209)

Despite the problems that computers have emulating human language processing, computational linguistics is making progress. Recent research into neural networks shows promise, with language comprehension being modeled as a large set of simple, interconnected nodes that are trained to recognize speech. Other computer scientists are making progress by giving up on the idea of emulating the human brain and letting the computer do what it's good at - massive storage and retrieval, recursive iterations, and rapid computations.

If computer evolution continues at its current speed, computers may be so powerful in the next twenty years that they can comprehend language by brute force. Perhaps the best approach is not to emulate humans, whom Pinker calls "sensitive, scheming, second-guessing, social animals." (Pinker 230) Instead, computers may excel by using a crude approach that zooms through hundreds of potential sentence meanings, storing and discarding phrases as necessary.

In his book "The Age of Spiritual Machines: When Computers Exceed Human Intelligence," inventor and visionary Ray Kurzweil reminds us that human evolution is inefficient and stuck "with the very slow computing speed of the mammalian neuron." Computers, on the other hand, double their processing power every two years, following what is known as Moore's Law, formulated by Gordon Moore, one of the founders of Intel. (Kurzweil 42) Although some industry experts claim that increases in computer-processing speed cannot continue to follow Moore's Law, Kurzweil proposes that science always finds a way for the acceleration of computing power to continue.

Summary

Human language can be digitized and compressed so that computers can provide talking Web pages, innovative telephone applications, and expert systems that answer spoken questions. Although computers cannot yet communicate with humans in an uncontrived fashion, computer evolution moves at an exponential pace, which suggests that machines will eventually be capable of a wide range of advanced human skills, including language. The future may bring molecular computers that harness the DNA molecule as a computing device, or nanobots (miniature robots) that permit imaging of the neural patterns of the brain, or some other invention that we cannot predict. Regardless of the details, according to Kurzweil, computers will outpace humans in computational power within twenty years. Although we are unlikely to have a talking HAL-like computer within the next few years, language comprehension may not be difficult for the mighty computers of the future.

Bibliography


AT&T. "Text-to-Speech Synthesis." 2003. < http://www.bell-labs.com/project/tts/voices.html> (05 04).

Edgerton, Michael E. and Theodore C. Levin. "The Throat Singers of Tuva." Scientific American 281.3 (1999): 80-87. See the "Forming Formants" sidebar by George Musser.

Fromkin, Victoria and Robert Rodman. An Introduction to Language, 6th ed. Fort Worth, Texas: Harcourt Brace College Publishers, 1998.

Hunt, Andrew and Mark R. Walker. "Speech Synthesis Markup Language." 2003. <http://www.w3.org/TR/speech-synthesis> (05 04).

Keagy, Scott. Integrating Voice and Data Networks. Indianapolis, Indiana: Cisco Press, 2000.

Kurzweil, Ray. The Age of Spiritual Machines: When Computers Exceed Human Intelligence. New York, New York: Penguin, 2000.

Phelps, Alan. "Use Voice Recognition Software: Free Your Hands from the Keyboard." Smart Computing 6.7 (2000): 71-75.

Pinker, Steven. The Language Instinct. New York, New York: HarperPerennial, 1994.


Copyright © Priscilla Oppenheimer.
