Figure 11.2

Detection times for word targets presented in sentences. Adapted from Marslen-Wilson and Tyler (1980).

processing of speech and the processing of other auditory stimuli. A related position has been adopted by those contemporary theorists (e.g., Mattingly & Liberman, 1990) who argue that there is a separate module for speech perception.

Cohort theory

One of the most influential theories of spoken word recognition was put forward by Marslen-Wilson and Tyler (1980). The original cohort theory included the following assumptions:

• Early in the auditory presentation of a word, those words known to the listener that conform to the sound sequence that has been heard so far become active; this collection of candidates for the presented word is the "word-initial cohort".

• Words belonging to this cohort are then eliminated because they cease to match further information from the presented word, or because they are inconsistent with the semantic or other context.

• Processing of the presented word continues only until contextual information and information from the word itself are sufficient to eliminate all but one of the words in the word-initial cohort; this is known as the "recognition point" of a word.
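The three assumptions above can be sketched as a simple narrowing procedure. This is a toy illustration only, not an implementation from Marslen-Wilson and Tyler (1980): letters stand in for phonemes, and the lexicon, the function name recognition_point, and the context_ok filter are all invented for the example.

```python
# Toy illustration of the original cohort theory. Letters stand in for
# phonemes; the lexicon and the context filter are invented assumptions.

LEXICON = ["trespass", "tress", "trestle", "trend", "trek"]


def recognition_point(spoken, lexicon, context_ok=lambda word: True):
    """Return (segments_heard, word) once a single candidate remains.

    Returns None if the cohort empties or never narrows to one word.
    """
    cohort = set(lexicon)
    for heard in range(1, len(spoken) + 1):
        prefix = spoken[:heard]
        # Drop candidates that no longer match the input or the context.
        cohort = {w for w in cohort if w.startswith(prefix) and context_ok(w)}
        if len(cohort) == 1:
            return heard, next(iter(cohort))
        if not cohort:
            return None
    return None


# Sensory information alone versus sensory information plus context.
no_context = recognition_point("trespass", LEXICON)
verbs = {"trespass", "trend", "trek"}  # hypothetical contextual constraint
with_context = recognition_point("trespass", LEXICON, lambda w: w in verbs)
print(no_context, with_context)
```

In this sketch the contextual constraint removes candidates sooner, so the recognition point is reached after fewer segments than when only the sound sequence is used.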

According to cohort theory, various knowledge sources (e.g., lexical, syntactic, semantic) interact and combine with each other in complex ways to produce an efficient analysis of spoken language. This approach can be contrasted with the notion (e.g., Forster, 1979) that processing proceeds in a serial fashion, with spoken language being analysed in a fairly fixed and invariant series of processing stages.

Marslen-Wilson and Tyler (1980) tested some of their theoretical notions in a word-monitoring task, in which participants had to identify prespecified target words presented within spoken sentences. There were normal sentences, syntactic sentences (grammatically correct but meaningless), and random sentences (unrelated words), and the target was a member of a given category, a word that rhymed with a given word, or a word that was identical to a given word. The measure of interest was the speed with which the target could be detected.

It is predicted by cohort theory that sensory information from the target word and contextual information from the rest of the sentence are both used at the same time. In contrast, it is predicted by serial theories that sensory information is extracted prior to the use of contextual information. The results conformed more closely to the predictions of cohort theory. Complete sensory analysis of the longer words was not needed when there was adequate contextual information (see Figure 11.2). It was only necessary to listen to the entire word when the sentence context contained no useful syntactic or semantic information (i.e., random condition).

Undue significance was given to the initial part of the word in the original cohort theory. It was assumed that a spoken word will generally not be recognised if its initial phoneme is unclear or ambiguous. There is evidence that the meanings of words not sharing an initial phoneme with the presented speech input are not immediately activated (e.g., Marslen-Wilson, Moss, & van Halen, 1996, discussed later). However, Connine, Blasko, and Titone (1993) referred to a study in which a spoken word ending in "ent" had an ambiguous initial phoneme between "d" and "t". There was evidence that the words "dent" and "tent" could both be activated at a short delay when the target word was presented.

Marslen-Wilson (1990) and Marslen-Wilson and Warren (1994) revised cohort theory. In the original version, words were either in or out of the word cohort. In the revised version, candidate words vary in their level of activation, and so membership of the word cohort is a matter of degree. Marslen-Wilson (1990) assumed that the word-initial cohort may contain words having similar initial phonemes, rather than being limited only to words having the initial phoneme of the presented word. These, and other, changes to cohort theory allow it to account for findings such as those of Connine et al. (1993).
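The idea that cohort membership is a matter of degree can be illustrated with a small sketch. It is not taken from the revised theory itself: the evidence values, the averaging rule, and the name graded_activation are assumptions chosen only to show how an ambiguous initial phoneme could leave both "dent" and "tent" partly active.

```python
# Toy illustration of graded cohort membership. Letters stand in for
# phonemes; the evidence values and averaging rule are arbitrary assumptions.

def graded_activation(evidence, word):
    """Average the evidence for each of the word's segments, position by position."""
    if len(word) != len(evidence):
        return 0.0
    total = sum(slot.get(segment, 0.0) for slot, segment in zip(evidence, word))
    return total / len(word)


# An input ending in "ent" whose first segment is ambiguous between d and t.
evidence = [{"d": 0.5, "t": 0.5}, {"e": 1.0}, {"n": 1.0}, {"t": 1.0}]

activations = {
    word: graded_activation(evidence, word)
    for word in ["dent", "tent", "bent", "dusk"]
}
print(activations)
```

Here "dent" and "tent" end up equally active, with other words receiving lower values, whereas an all-or-none cohort would have to admit or exclude each word outright.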

There is a second major difference between the original and revised versions of cohort theory. In the original version, context influenced word recognition very early in processing. In contrast, the effects of context on word recognition are much more limited in the revised version, occurring only at a fairly late stage of processing. Evidence supporting the revised theory has come from studies on cross-modal priming, in which the participants listen to speech and perform a lexical decision task (deciding whether visual letter strings form words). The key assumption is that only words that have been activated by the speech input will show priming in the form of faster responding on the lexical decision task. Zwitserlood (1989) considered the effects of context on cross-modal priming. Context did not influence the initial activation of words (i.e., contextually inappropriate words as well as appropriate ones were activated), but it did have an effect after the point at which a spoken word could be uniquely identified.


Cohort theory has proved to be an influential approach to spoken word recognition. The revised version of the theory is generally preferable to the original version for two main reasons:

1. Its assumption that membership of the word cohort is flexible is more in line with the evidence.

2. Contextual effects on spoken word recognition typically occur late rather than early in processing, as proposed within the revised theory.

The major disadvantage with the revised version is that the modifications made to the original theory have made it less precise. As Massaro (1994, p. 244) pointed out, "These modifications are necessary to bring the model in line with empirical results, but they ... make it more difficult to test against alternative models."

TRACE model

McClelland and Elman (1986) and McClelland (1991) produced a network model of speech perception based on connectionist principles (see Chapter 1). Their TRACE model of speech perception resembles the original version of cohort theory. For example, it is argued within both cohort theory and the TRACE model that several sources of information combine interactively to achieve word recognition. The TRACE model also resembles the interactive activation model of visual word recognition put forward by McClelland and Rumelhart (1981), which is discussed later.

The TRACE model is based on the following theoretical assumptions:

• There are individual processing units or nodes at three different levels: features (e.g., voicing; manner of production), phonemes, and words.

• Feature nodes are connected to phoneme nodes, and phoneme nodes are connected to word nodes.

• Connections between levels operate in both directions, and are only facilitatory.

• There are connections among units or nodes at the same level; these connections are inhibitory.

• Nodes influence each other in proportion to their activation levels and the strengths of their interconnections.

• As excitation and inhibition spread among nodes, a pattern of activation or trace develops.

• The word that is recognised is determined by the activation level of the possible candidate words.

The TRACE model assumes that bottom-up and top-down processing interact during speech perception. Bottom-up activation proceeds upwards from the feature level to the phoneme level and on to the word level, whereas top-down activation proceeds in the opposite direction from the word level to the phoneme level and on to the feature level. Evidence that top-down processes are involved in spoken word recognition was discussed earlier in the chapter (e.g., Marslen-Wilson & Tyler, 1980; Warren & Warren, 1970).

McClelland and Rumelhart (1986) applied the TRACE model to the phenomenon of categorical speech perception. According to the model, the discrimination boundary between phonemes becomes sharper because of mutual inhibition between phoneme units at the phoneme level. These inhibitory processes produce a "winner takes all" situation, in which one phoneme becomes increasingly activated while at the same time other phonemes are inhibited. McClelland and Rumelhart (1986) carried out a simulation based on the model that successfully produced categorical speech perception.
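The sharpening effect of mutual inhibition can be seen in a minimal two-unit sketch. This is not the published simulation: the update rule, the parameter values (rate, inhibition, steps), and the function names settle and share_of_first are assumptions made purely for illustration.

```python
# Toy illustration of within-level inhibition between two phoneme units.
# The update rule and all parameter values are illustrative assumptions.

def settle(input_a, input_b, inhibition, rate=0.1, steps=300):
    """Let two mutually inhibiting units settle; return their final activations."""
    a = b = 0.0
    for _ in range(steps):
        # Each unit is excited by its own input and inhibited by its rival.
        net_a = input_a - inhibition * b - a
        net_b = input_b - inhibition * a - b
        a = min(1.0, max(0.0, a + rate * net_a))
        b = min(1.0, max(0.0, b + rate * net_b))
    return a, b


def share_of_first(a, b):
    """Proportion of the total activation held by the first unit."""
    return a / (a + b)


# An ambiguous sound that slightly favours the first phoneme.
graded = settle(0.6, 0.4, inhibition=0.0)
sharpened = settle(0.6, 0.4, inhibition=0.8)
print(share_of_first(*graded), share_of_first(*sharpened))
```

Without inhibition the two units simply mirror the slightly unequal input. With inhibition the weaker unit is driven to zero, so a small difference in the input becomes an all-or-none difference in the output.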

Cutler et al. (1987) studied another phenomenon that lends itself to explanation by the TRACE model. They used a phoneme monitoring task, in which participants had to respond immediately to the presence of a target phoneme. They observed a word superiority effect, in that phonemes were detected faster when they were presented in words than in non-words. According to the TRACE model, this phenomenon occurs because of top-down activation from the word level to the phoneme level.
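Top-down facilitation of this kind can be sketched with one phoneme unit and one optional word unit. The sketch is not part of the TRACE model as published: the update rule, the threshold, the feedback weight, and the name steps_to_detect are assumptions chosen only to show why feedback from the word level would speed up phoneme detection.

```python
# Toy illustration of word-to-phoneme feedback in phoneme monitoring.
# The update rule, threshold, and weights are illustrative assumptions.

def steps_to_detect(in_word, bottom_up=0.6, feedback=0.5,
                    threshold=0.5, rate=0.1, max_steps=1000):
    """Count update steps until the phoneme unit crosses the detection threshold."""
    phoneme = word = 0.0
    for step in range(1, max_steps + 1):
        # A word unit exists, and feeds back, only when the carrier is a word.
        top_down = feedback * word if in_word else 0.0
        new_phoneme = phoneme + rate * (bottom_up + top_down - phoneme)
        new_word = word + rate * (phoneme - word) if in_word else 0.0
        phoneme, word = new_phoneme, new_word
        if phoneme >= threshold:
            return step
    return None


steps_in_word = steps_to_detect(in_word=True)
steps_in_nonword = steps_to_detect(in_word=False)
print(steps_in_word, steps_in_nonword)
```

With identical bottom-up input, the phoneme unit reaches threshold in fewer steps when a word unit feeds activation back to it, which is the pattern described as a word superiority effect.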

Marslen-Wilson et al. (1996) presented their participants with "words" such as p/blank, in which the initial phoneme was halfway between a /p/ and a /b/. They wanted to see whether this "word" would facilitate lexical decision for words related to plank (e.g., wood) or to blank (e.g., page). The TRACE model predicts that there would be a significant facilitation or priming effect because of spreading activation. In contrast, the original cohort theory assumed that only words matching the initial phoneme of the presented word are activated. Thus, the prediction is that there should be no priming effect. The findings supported the cohort theory and were inconsistent with the prediction of the TRACE model.


The TRACE model has various successes to its credit. It provides reasonable accounts of phenomena such as categorical speech perception and the word superiority effect in phoneme monitoring. A significant general strength of the TRACE model is its assumption that bottom-up and top-down processes both contribute to spoken word recognition, combined with explicit assumptions about the processes involved. However, the theory predicts that speech perception depends interactively on top-down and bottom-up processes, and this was not confirmed by Massaro (1989) on a phoneme-discrimination task. Bottom-up effects stemming from stimulus discriminability and top-down effects stemming from phonological context both influenced performance, but they did so in an independent rather than interactive way.

There are other problems with the TRACE model. First, it is assumed that words that are phonologically similar to a presented word will be activated immediately, even though they do not match the presented word in the initial phoneme. In fact, this is typically not the case (e.g., Marslen-Wilson et al., 1996).

Second, the theory exaggerates the importance of top-down effects. For example, Frauenfelder, Segui, and Dijkstra (1990) gave their participants the task of detecting a given phoneme. The key condition was one in which a non-word closely resembling an actual word was presented (e.g., "vocabutaire" instead of "vocabulaire"). According to the model, top-down effects from the word node corresponding to "vocabulaire" should have interfered with identifying the "t" in "vocabutaire", but they did not.

Third, the existence of top-down effects depends more on stimulus degradation than is predicted by the model. For example, McQueen (1991) presented ambiguous phonemes at the end of stimuli, and asked participants to categorise these phonemes. Each ambiguous phoneme could be perceived as completing a word or a non-word. According to the model, top-down effects from the word level should have produced a preference for perceiving the phonemes as completing words. This prediction was confirmed only when the stimulus was degraded.

Fourth, the model has problems in dealing with issues such as the timing of speech sounds and differences in speech rate from one speaker to another. The TRACE model assumes that there are time slots, with feature, phoneme, and word units or representations being replicated across time slots to allow them to be identified. However, as Ellis and Humphreys (1999, p. 349) pointed out, "The problem with this of course is that it requires massive numbers of units, and of connections between units. TRACE has local units that are set to a given time slot. There is no guarantee that the speech signal will match the time slots set in the model. As a consequence, the model may fail to generalise its recognition across different speech rates." The consequence of this is that TRACE cannot recognise speech (Protopapas, 1999).

Fifth, tests of the model have relied heavily on computer simulations involving a small number of one-syllable words. As a result, it is not entirely clear whether the model would perform satisfactorily if applied to the vastly larger vocabularies possessed by most people.

Sixth, we learn many aspects of speech perception during the course of development. In contrast, as Protopapas (1999, p. 420) pointed out, TRACE "does not learn anything. It is prewired to achieve all its remarkable results, thus effectively encoding the knowledge and intuition of its designers."

Section summary

Theories of spoken word recognition are becoming increasingly similar. Most theorists agree that activation of several candidate words occurs early in the process of word recognition. It is also generally assumed that the speed with which word recognition is usually achieved indicates that most of the processes involved proceed in parallel, or at the same time, rather than serially. There is also general agreement that the activation levels of candidate words are graded rather than being either very high or very low. Finally, nearly all theorists agree that bottom-up and top-down processes combine in some way to produce word recognition, although they disagree on how this happens. The revised version of cohort theory and the TRACE model both incorporate all these assumptions.

There are two issues in need of further research. First, there is still very little agreement on the size and number of the basic perceptual units in spoken word recognition, with theorists differing in the importance they attach to features, phonemes, syllables, and so on. Second, there is the issue of precisely how contextual and other forms of top-down information are used in spoken word recognition. As Harley (1995, p. 56) concluded, "It is difficult to draw any definite conclusions about the role of context in spoken word recognition, we need more detail on the time course of the different stages of word [...] is difficult to be sure that these experiments [on context] are tapping processes before the selection of a unique candidate rather than reflecting post-access effects."
