Listening To Speech

Accurate perception of speech is a more complex achievement than might be imagined, partly because language is spoken at a rate of up to 12 phonemes (basic speech sounds) per second. Amazingly, we can understand speech artificially speeded up to 50-60 sounds per second (Werker & Tees, 1992). In normal speech, phonemes overlap, and there is co-articulation, in which producing one speech segment affects the production of the following segment. The "linearity problem" refers to the difficulties for speech perception produced by co-articulation.

Another problem related to the linearity problem is the "non-invariance problem". This arises because the sound pattern for any given speech component such as a phoneme is not invariant. Instead, it is affected by the sound or sounds preceding and following it. This is especially the case for consonants, because their sound patterns often depend on the following vowel.

Speech typically consists of a continuously changing pattern of sound with few periods of silence. This contrasts with our perception of speech as consisting of separate sounds. The continuous nature of the speech signal produces the "segmentation problem", which involves deciding how the continuous stream of sound should be divided up into words.

Spectrograms and running spectral displays

Much valuable information about the speech signal has been obtained from use of the spectrograph. With this instrument, sound enters through a microphone, and is then converted into an electrical signal. This signal is fed to a bank of filters selecting narrow-frequency bands. Finally, the spectrograph produces a visible record of the component frequencies of sound over time; this is known as a spectrogram (see Figure 11.1). The spectrogram provides information about formants, which are frequency bands emphasised by the vocal apparatus when saying a phoneme. Vowels have three formants; these are numbered first, second, and third, starting with the formant of lowest frequency. However, vowels can usually be identified on the basis of the first two formants. Most vowel sounds fall below 1200 Hertz (Hz), the unit in which sound frequency is measured. In contrast, many consonants have sounds falling in the region from 2400 Hz upwards.
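The filter-bank analysis described above can be imitated in a few lines of code. The following is a minimal sketch, not a description of how any real spectrograph works: it assumes a sampling rate of 8000 Hz, replaces the bank of filters with a short-time Fourier analysis, and uses two pure tones at 500 Hz and 1500 Hz as a crude stand-in for the first two formants of a steady vowel. All names and values are hypothetical and chosen only for illustration.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed for this sketch)
FRAME_SIZE = 256     # samples in each analysis frame
HOP_SIZE = 128       # samples between the starts of successive frames


def synthesise(frequencies, duration=0.1):
    """Sum equal-amplitude sine waves: a crude stand-in for a steady vowel."""
    n_samples = int(SAMPLE_RATE * duration)
    return [
        sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in frequencies)
        for t in range(n_samples)
    ]


def frame_spectrum(frame):
    """Magnitude at each frequency bin for one Hann-windowed frame."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    # Plain discrete Fourier transform; slow, but easy to follow.
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2)
    ]


def spectrogram(signal):
    """One spectrum per frame: rows are time, columns are frequency."""
    last_start = len(signal) - FRAME_SIZE
    return [
        frame_spectrum(signal[start:start + FRAME_SIZE])
        for start in range(0, last_start + 1, HOP_SIZE)
    ]


def peak_frequencies(spectrum, count=2):
    """Frequencies (Hz) of the strongest local maxima in one spectrum."""
    peaks = [
        k for k in range(1, len(spectrum) - 1)
        if spectrum[k] > spectrum[k - 1] and spectrum[k] >= spectrum[k + 1]
    ]
    strongest = sorted(peaks, key=lambda k: spectrum[k], reverse=True)[:count]
    return sorted(k * SAMPLE_RATE / FRAME_SIZE for k in strongest)


signal = synthesise([500, 1500])   # hypothetical "formant" frequencies
frames = spectrogram(signal)
print(peak_frequencies(frames[0]))
```

Each row of `frames` corresponds to one vertical slice of a spectrogram, and the two strongest peaks in a slice should lie close to the two frequencies that were put into the signal. Real speech is far less tidy, because formants are broad bands of energy rather than pure tones.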

Spectrograms may seem to provide an accurate picture of those aspects of the sound wave having the greatest influence on the human auditory system. However, this is not necessarily the case. For example, formants look important in a spectrogram, but this does not prove they are of value in human speech perception. Evidence that the spectrogram is of value has been provided by making use of a pattern playback or vocoder, which allows the spectrogram to be played back. Thus, the pattern of frequencies in the spectrogram was produced by speech, and pattern playback permits the spectrogram to be converted back into speech. Liberman, Delattre, and Cooper (1952) constructed "artificial" vowels on the spectrogram based only on the first two formants of each vowel. These vowels were easily identified when they were played through the vocoder, suggesting that formant information is used to recognise vowels.

Sussman, Hoemeke, and Ahmed (1993) asked various speakers to say the same short words starting with a consonant. They found clear differences from speaker to speaker in their spectrograms. How do listeners cope with such differences? Sussman et al. (1993) focused on two aspects of the information contained in the spectrogram record:

1. The sound frequency at the transition point where the second formant starts.

2. The steady frequency of the second formant.

There was relational invariance between these two measures: those speakers who had high frequencies for (1) also had high frequencies for (2), whereas other speakers had low frequencies for both measures. Listeners probably use information about this relational invariance to identify the word being spoken.
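The idea of relational invariance can be made concrete with a simple correlation. The sketch below uses invented numbers, not the data of Sussman et al. (1993): five hypothetical speakers each contribute an onset frequency and a steady frequency for the second formant, and the code measures how closely the two go together.

```python
import math

# Hypothetical values in Hz, invented purely for illustration.
# One entry per speaker, in the same order in both lists.
f2_onset = [1150.0, 1290.0, 1400.0, 1530.0, 1640.0]    # measure (1)
f2_steady = [1000.0, 1200.0, 1400.0, 1600.0, 1800.0]   # measure (2)


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


r = pearson_r(f2_steady, f2_onset)
slope, intercept = fit_line(f2_steady, f2_onset)
print(f"r = {r:.3f}, onset = {slope:.2f} * steady + {intercept:.0f} Hz")
```

A correlation close to 1 means that, although the absolute frequencies differ from speaker to speaker, the relation between the two measures stays the same. It is this stable relation, rather than either frequency on its own, that listeners could in principle exploit.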

An alternative way of turning speech sounds into visual form is by running spectral displays. These displays provide information about changes in sound frequencies that occur in successive brief periods of time. The advantage of running spectral displays over spectrograms is that they show more precisely how much energy is present at each frequency. Kewley-Port and Luce (1984) found their participants were able to identify the voicing and place of articulation of many phonemes with an accuracy of between 80% and 90% on the basis of running spectral displays.

Categorical speech perception

Speech perception differs from other kinds of auditory perception. For example, there is a definite left-hemisphere advantage for perception of speech but not other auditory stimuli. Speech perception is also categorical: speech stimuli intermediate between two phonemes are typically categorised as one phoneme or the other, thus producing a discrimination boundary that is more clear cut than is warranted by the physical stimuli (see Miller & Eimas, 1995, for a review). For example, the Japanese language does not distinguish between [l] and [r]. As these sounds belong to the same category for Japanese listeners, it is no surprise that they find it very hard to discriminate between them (see Massaro, 1994). This differs from the case with non-speech sounds, where discrimination between pairs of sounds is superior to the ability to label them as belonging to separate categories.
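A small simulation shows what a sharp discrimination boundary looks like. The sketch below rests on assumptions that are not taken from the studies cited above: the stimuli are eleven equal steps along an imaginary acoustic continuum between two phonemes, labelling follows a steep logistic curve with its boundary at step 5, and listeners can tell two stimuli apart only when they label them differently. Under that last, deliberately extreme, assumption, predicted discrimination is 0.5 + 0.5 × (difference in labelling probability)².

```python
import math

BOUNDARY = 5.0    # assumed position of the category boundary (step units)
STEEPNESS = 2.0   # assumed sharpness of the labelling curve


def p_first_phoneme(step):
    """Probability that a stimulus is labelled as the first phoneme."""
    return 1.0 / (1.0 + math.exp(STEEPNESS * (step - BOUNDARY)))


def predicted_discrimination(step_a, step_b):
    """Proportion correct if discrimination relied on labels alone.

    0.5 is chance; 1.0 is perfect discrimination.
    """
    difference = p_first_phoneme(step_a) - p_first_phoneme(step_b)
    return 0.5 + 0.5 * difference ** 2


# Every pair is the same physical distance apart (two steps).
pairs = [(step, step + 2) for step in range(0, 9)]
for a, b in pairs:
    print(f"steps {a} and {b}: {predicted_discrimination(a, b):.2f}")

best_pair = max(pairs, key=lambda pair: predicted_discrimination(*pair))
```

Although every pair is physically the same distance apart, predicted discrimination stays near chance for pairs lying within a category and peaks for the pair that straddles the boundary. This is the pattern summarised by the phrase "more clear cut than is warranted by the physical stimuli".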

There is clear evidence for categorical perception in our conscious experience of speech perception. However, that does not necessarily mean that the earlier stages of speech processing are also categorical. In fact, the evidence suggests that they are not (see Massaro, 1994).

The major differences between speech perception and auditory perception in general led Mattingly and Liberman (1990) to argue that speech perception involves a special module or cognitive processor functioning independently of other modules. This issue was addressed by Remez, Rubin, Pisoni, and Carrell (1981). According to them, use of the speech perception module (if it exists) should not be influenced by some quite separate factor such as the listener's belief about the nature of the signal. They played a series of tones to two groups of participants. One group was told that they would be listening to synthetic or artificial speech, and their task was to write down what was said. These participants had no difficulty in carrying out this task. The other group was simply told to describe what they heard. They reported hearing electronic sounds, tape-recorder problems, radio interference, and so on, but they did not perceive any speech. The dependence of speech processing on the manipulation of the listeners' expectations suggests that speech perception does not involve a special module.

Word recognition

A key issue in research on speech perception is to identify the processes involved in spoken word recognition. There are numerous studies on this topic (see Moss & Gaskell, 1999, for a review). We will first consider some of the major processes involved, and will then turn to a discussion of influential theories of spoken word recognition.

Bottom-up and top-down processes

Spoken word recognition is generally achieved by a mixture of bottom-up or data-driven processes triggered by the acoustic signal, and top-down or conceptually driven processes generated from the linguistic context. However, as we will see, there have been disagreements about precisely how information from bottom-up and top-down processes is combined to produce word recognition.

Spoken language consists of a series of sounds or phonemes incorporating various features. Among the features for phonemes are the following:

• Manner of production (oral vs. nasal vs. fricative, involving a partial blockage of the airstream).

• Place of articulation.

• Voicing: the larynx vibrates for a voiced but not for a voiceless phoneme.

The notion that bottom-up processes in word recognition make use of feature information was supported in a classic study by Miller and Nicely (1955). They gave their participants the task of recognising consonants presented auditorily against a background of noise. The most frequently confused consonants were those differing on the basis of only one feature.

Evidence that top-down processing based on context can be involved in speech perception was obtained by Warren and Warren (1970). They studied what is known as the phonemic restoration effect. Participants heard a sentence in which a small portion had been removed and replaced with a meaningless sound. The sentences that were used were as follows (the asterisk indicates a deleted portion of the sentence):

1. It was found that the *eel was on the axle.

2. It was found that the *eel was on the shoe.

3. It was found that the *eel was on the table.

4. It was found that the *eel was on the orange.

The perception of the crucial element in the sentence (i.e., *eel) was influenced by sentence context. Participants listening to the first sentence heard "wheel", those listening to the second sentence heard "heel", and those exposed to the third and fourth sentences heard "meal" and "peel", respectively. The auditory stimulus was always the same, so all that differed was the contextual information.

Samuel (1981) identified two possible explanations for the phonemic restoration effect. First, context may interact directly with bottom-up processes; this would be a sensitivity effect. Second, the context may simply provide an additional source of information; this would be a response bias effect. Participants listened to sentences, and meaningless noise was presented briefly during each sentence. On some trials, this noise was superimposed on one of the phonemes of a word; on other trials, that phoneme was deleted. The task was to decide whether or not the crucial phoneme had been presented. Finally, the word containing this phoneme was predictable or unpredictable from the sentence context.

Performance in Samuel's (1981) study was better when the word was predictable, indicating the importance of context. If context improves sensitivity, then the ability to discriminate between phoneme plus noise and noise alone should be improved by predictable context. If context affects response bias, then participants should simply be more likely to decide that the phoneme was presented when the word was presented in a predictable context. Context affected response bias but not sensitivity, suggesting that contextual information did not have a direct effect on bottom-up processing.
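The distinction between sensitivity and response bias comes from signal detection theory, and both can be calculated from two proportions. The sketch below uses the standard measures d' (sensitivity) and c (criterion, an index of response bias). The hit and false-alarm rates are invented for illustration and are not Samuel's (1981) data: a "hit" is saying the phoneme was present when it was, and a "false alarm" is saying it was present when only the noise was played.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1


def z(proportion):
    """Convert a proportion (strictly between 0 and 1) to a z score."""
    return STANDARD_NORMAL.inv_cdf(proportion)


def sensitivity(hit_rate, false_alarm_rate):
    """d': how well phoneme-plus-noise is told apart from noise alone."""
    return z(hit_rate) - z(false_alarm_rate)


def criterion(hit_rate, false_alarm_rate):
    """c: negative values mean a bias towards answering 'present'."""
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))


# Hypothetical proportions, invented purely for illustration.
conditions = {
    "predictable word": (0.84, 0.50),
    "unpredictable word": (0.69, 0.31),
}

for name, (hits, false_alarms) in conditions.items():
    d_prime = sensitivity(hits, false_alarms)
    c = criterion(hits, false_alarms)
    print(f"{name}: d' = {d_prime:.2f}, c = {c:.2f}")
```

With these invented figures, d' is almost the same in the two conditions, whereas c is clearly lower for predictable words. That is the kind of pattern described above: context makes listeners more willing to report the phoneme without making them any better at detecting it.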

Samuel (1990) reported further studies on the phonemic restoration effect. The effect was more likely to occur in long words than in short words, presumably because long words provide additional contextual information. There was more evidence for the phonemic restoration effect when the phoneme that was masked and the masking noise were similar in sound. Samuel (1990) concluded that contextual information influences the listener's expectations in a top-down fashion, but these expectations then need to be confirmed with reference to the sound that is actually presented.

Prosodic patterns

Speech contains prosodic cues in the form of stress, intonation, and so on. This information can be used by the listener to work out the syntactic or grammatical structure of each sentence. For example, in the ambiguous sentence, "The old men and women sat on the bench", the women may or may not be old. If the women are not old, then the spoken duration of the word "men" will be relatively long, and the stressed syllable in "women" will have a steep rise in pitch contour. Neither of these prosodic features will be present if the sentence means that the women are old.

Most studies on listeners' ability to use prosody to interpret ambiguous sentences have only assessed this after an entire sentence has been presented. These studies have shown that prosodic patterns are generally interpreted correctly, but do not indicate when prosodic information is used. Beach (1990) presented a sentence fragment, and participants had to decide which of two sentences it had come from. For example, the fragment, "Sherlock Holmes didn't suspect", could be from the sentence, "Sherlock Holmes didn't suspect the beautiful young countess from Hungary", or the sentence, "Sherlock Holmes didn't suspect the beautiful young countess could be a fraud". Participants were fairly accurate at predicting the overall structure of sentences on the basis of a small fragment, indicating that prosodic information can be used rapidly by listeners.

Doubts about the role of prosodic cues were raised by Allbritton, McKoon, and Ratcliff (1996; see Chapter 13). Trained and untrained speakers were given ambiguous sentences in a disambiguating context, and told to read them out loud. Even the trained speakers only made modest use of prosodic cues to clarify the intended meaning of the ambiguous sentences.

Lip-reading

Many people (especially those who are hard of hearing) are aware that they use lip-reading to understand speech. However, this seems to happen far more than is generally believed among those whose hearing is normal. McGurk and MacDonald (1976) provided a striking demonstration of the importance of lip-reading. They prepared a video-tape of someone repeating "ba" over and over again. The sound channel then changed so there was a voice saying "ga" repeatedly in synchronisation with the lip movements still indicating "ba". Participants reported that they heard "da", representing a blending of the visual and the auditory information.

The so-called McGurk effect is surprisingly robust. For example, Green, Kuhl, Meltzoff, and Stevens (1991) found the effect even when there was a female face and a male voice. They suggested that information about pitch becomes irrelevant early in speech processing, and this is why the McGurk effect is found even with a gender mismatch between vision and hearing.

Visual information from lip movements is used to make sense of speech sounds because the information conveyed by the speech sounds is often inadequate. Much is now known about the ways in which visual information provided by the speaker is used in speech perception (see Dodd & Campbell, 1986). Of course, there are circumstances (e.g., listening to the radio) in which no relevant visual information is available. We can usually follow what is said on the radio, because broadcasters are trained to articulate clearly.
