• Linguistic Processing

[ from Chapter 2 of Teaching and Researching Listening 4/e]






  • Phonological Processing: Psychoacoustic Effects in Perception
  • Morphological Processing: Recognizing Words
    • Recognizing Allophonic Variations of Words
    • Assimilation of Consonant Clusters
    • Vowel Centering and Reduction
  • Syntactic Processing: Parsing Speech
    • Deriving an Argument Structure
    • Sources of Knowledge for Syntactic Parsing
    • Creating Propositional Representations
  • Integrating Multimodal Cues into Linguistic Processing
  • Summary: Merging Bottom-Up Cues with Top-Down Knowledge

References

2.1   Introduction: Listening as Bottom-Up and Top-Down Processing

Linguistic processing during listening involves a complex set of coordinated cognitive activities that allow us to understand and interpret spoken language. This chapter deals with the most basic levels of linguistic processing: phonological processing, morphological processing, and syntactic processing.

These three decoding processes—of sounds, words, and syntax—are often referred to collectively as bottom-up processing. Bottom-up processing refers to a type of information processing where the understanding of a whole or higher-level concept is built from the analysis of individual elements or lower- level features. Bottom-up processing is a fundamental concept in cognitive psychology and sensory perception, and it is used to describe how we process sensory information, including in the domain of listening.

Phonological processing involves decoding the sounds of speech, recognizing the sequencing of phonemes, and identifying rhythm, stress, and intonation patterns. Morphological processing involves identifying the smallest units of language that carry meaning, such as prefixes, suffixes, and lemmas, or root words, and identifying boundaries of words. Syntactic processing involves parsing the sentence structure, identifying the relationships between words, and applying the grammatical rules that govern how words are combined.

For native speakers and fluent nonnative speakers of a language, these bottom-up processes are largely automatized. They take place without our conscious awareness. Only when some kind of anomaly or ambiguity is detected do we become aware of these processes.

Proper functioning of these processes is necessary for a complete understanding of the input, although it is possible to create meaning if only parts of the incoming signal are decoded. This type of compensation is possible because of the power of semantic or top-down processing, which will be covered in the following chapter.


2.2   The Interdependence of Production and Perception

To understand the processes of speech perception, it is helpful to detail how the production and reception of speech operate in a coordinated fashion. The primary goal of speech production is to send communicative signals efficiently. To maximize communicative effectiveness, speakers structure speaking in such a way that their listeners can most readily retrieve their communicative intent (Broersma, 2012). Spoken languages have evolved congruently with this efficiency principle: both speaker and listener need to coordinate their aims for maximum effect (Brazil, 1995; Kager, 1999; Schneider et al., 2019).

Naturally occurring speech, also known as unplanned discourse, reveals several structural features that enable this coordination (Biber, 2019; Clark & Brennan, 1991; Pickering & Garrod, 2021). The most notable feature is that speakers produce speech in short bursts, not sentences (even though sentence-level grammar rules govern the overall structure), and change speeds and rhythms to create nuance, which results in frequent reductions and assimilations of sound sequences. Speakers also tend to use a lot of fillers (e.g., um, you know, like), false starts, and incomplete grammatical units (I was wondering if… Do you want to go together?), along with high-frequency content words (e.g., come along vs. accompany), allowing them to plan and speak at a rapid rate. Natural speech also tends to feature more paratactic ordering (use of and, so, but, then), topic-comment ordering (My friend Alia, you really ought to meet her), and ellipsis (e.g., Are you coming to dinner? / I’ll be there in a minute) to allow for more rapid exchange of information.

While these features of conversational grammar allow for a cohesive flow of communication, they result in a radical simplification of phonology and grammar. This simplification occasionally places an extra burden on listeners in terms of decoding, particularly if they are not familiar with the speaker and the contextual frames within which the speaker is operating (Carter & McCarthy, 2017). (See Figure 2.1.)


2.3   Phonological Processing: Integrating the Acoustic Dimensions of Speech

Linguistic processing begins as soon as sound reaches the auditory cortex. The auditory cortex uses four physical characteristics of sound during the perception process: intensity, frequency, waveform (spectral content), and duration. These characteristics of sound are converted into perceived attributes that are needed in decoding: loudness, pitch, and timbre (see Figure 2.2).

  • Intensity, measured in decibels (dB). Whispered language at one meter is about 20 dB, while everyday speech at one meter is about 60 dB. However, intensity normally fluctuates by up to 30 dB within a single utterance of any speaker in a typical conversation. Intensity is particularly important for detecting prominences in an utterance, as speakers will intentionally increase intensity (loudness) for specific effects.
  • Frequency, measured in hertz (Hz). Humans can hear sounds between 20 Hz and 20,000 Hz, but human languages typically draw upon sounds in the 100–3,000 Hz range. Detecting movements in a sound’s fundamental frequency is an important element of speech perception.
  • Waveform (spectral content). Every configuration of the vocal tract produces its own set of characteristics, represented as sound pressure variations and mapped as sine waves.
  • Timbre, a combination of the frequency shape of a sound (its spectral envelope) and its harmonic structure (the relationships among its resonant frequencies), is a subjective quality that enables the listener to distinguish the sources of sound, including specific speakers (Howard & Angus, 2017; Titze et al., 2015).
  • Duration, measured in milliseconds (ms). Languages differ in the average length of phonemes and syllables; for instance, in American English, syllables average about 75 ms, while in French they average about 50 ms. The typical duration of sounds in a language can vary widely, and speakers can intentionally increase duration for specific effects.
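The decibel values cited in the intensity bullet above sit on a logarithmic scale. As a minimal sketch (using the standard 20 µPa reference pressure; the whisper and conversation figures are the approximations from the text), sound pressure level can be computed as:

```python
import math

def spl_db(pressure_pa: float, reference_pa: float = 20e-6) -> float:
    """Sound pressure level in dB, relative to the standard 20 µPa
    reference (the approximate threshold of human hearing)."""
    return 20 * math.log10(pressure_pa / reference_pa)

# Each tenfold increase in sound pressure adds 20 dB:
print(round(spl_db(20e-6 * 10)))    # → 20 (roughly a whisper at one meter)
print(round(spl_db(20e-6 * 1000)))  # → 60 (roughly conversation at one meter)
```

On this scale, the 30 dB fluctuation mentioned above corresponds to roughly a 32-fold change in sound pressure within a single utterance.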


Once these acoustic elements in the speech signal are integrated, the input is passed along to the higher regions of the brain (the superior temporal gyrus, the STG) for further processing (Wouters et al., 2024).




Figure 2.1 Speaker–Listener Coordination

The speaker (right side) goes through five rapid steps in producing speech. Zero is the point on the timescale when words are actually uttered; the timings, therefore, are negative values. The listener (left side) goes through four rapid steps in comprehending speech. It takes just over half a second to comprehend the meaning of spoken words. (Based on Carter, 2019; Heylen, 2009; Stephens et al., 2010)














Figure 2.2 The Acoustic Dimensions of Speech

Intensity, frequency, waveform, and duration are physical signals that provide the listener with cues for decoding. In combination, these signals are perceived as loudness, pitch, and timbre. (Based on Almeida et al., 2021; Jenkins, 1961)


  • Phonological Processing: Psychoacoustic Effects in Perception

Because of the rapid speed of the phonological signal, a listener needs to use complementary sources of information in decoding speech: detection of speech signals, knowledge of articulatory causes, and prediction of speaker intentions.

The first source is made up of the psychoacoustic effects generated by the speech signal itself: intensity, frequency, waveform (spectral content), and duration. These sources combine into a multidimensional signal, which can be represented in a spectrogram (see Figure 2.3). Because there is redundancy in the signal—each of the physical sources has some influence on the perception of loudness, pitch, and timbre—the listener can readily decode the sounds of the language and its phonemes in most contexts and with most speakers. Each phoneme exhibits a distinctive combination of acoustic qualities relative to other possible sounds, and each speaker has a unique timbre relative to all other speakers. Even though phonemes are always modified by context and rarely occur in pure form, an experienced listener in a language can identify the target phonemes in a range of variations (Cutler, 2012; Cutler & Broersma, 2005; Norris & Cutler, 2021).

The second source of information used for decoding speech is the listener’s subjective bodily experience of articulatory causes for the sounds that enter the auditory cortex. While auditing the sound, the listener can mirror the specific vocal configurations and vocal tract movements needed in articulation to assist in perception. This mirroring process occurs in the brain’s motor cortex, which is also involved in speech comprehension.




Figure 2.3 A Visual Representation of Speech Input

A spectrogram of a sentence: Every year it’s the same thing. This is a visual representation of the spectrum of frequencies in a speech signal as it changes over time. To create a spectrogram, the speech signal is first divided into short time segments (in this example, in hundredths of a second; the total time of the input here is just over two seconds), usually using a technique called windowing. Each time segment is then transformed into the frequency domain. This generates a spectrum that represents the signal’s amplitude at each frequency. The darker areas indicate higher energy or amplitude at those frequencies. (Credit: University of California, Phonetics Laboratory.)
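The windowing procedure described in the caption can be sketched in a few lines of code. This is a minimal illustration using NumPy (the window size, hop size, and 440 Hz test tone are arbitrary choices for the demo), not the laboratory software used to produce Figure 2.3:

```python
import numpy as np

def spectrogram(signal, sample_rate, window_ms=10, hop_ms=5):
    """Slice the signal into short overlapping segments, weight each with
    a Hann window, and take the magnitude spectrum of each segment.
    The columns of the result are the time steps."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    hann = np.hanning(win)
    frames = [signal[i:i + win] * hann
              for i in range(0, len(signal) - win, hop)]
    # rfft returns amplitudes for frequencies from 0 Hz up to sample_rate/2
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A pure 440 Hz tone should concentrate its energy near 440 Hz.
sr = 16000
t = np.arange(sr) / sr  # one second of samples
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
peak_bin = spec.mean(axis=1).argmax()
print(peak_bin * sr / int(sr * 10 / 1000))  # frequency of the peak bin, in Hz
```

A real speech signal simply replaces the test tone; each column of `spec` then corresponds to one of the vertical slices visible in Figure 2.3.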




For the consonants, these articulatory causes are the precise configurations and movements in the oral cavity (the lips, the teeth, the tongue, the palate, the glottis), the larynx (the hollow muscular organ forming an air passage to the lungs and holding the vocal cords, also known as the voice box), and the pharynx (the muscle-lined space that connects the nose and mouth to the larynx and esophagus).

For the vowels, the configurations of articulation are the positions of the tongue (the highest part of the tongue body) in the mouth relative to the front or back of the mouth (the horizontal dimension) and the top or bottom of the mouth (the vertical dimension) (see Figure 2.4). The listener’s motor-muscle memory of making these sounds further enhances their ability to perceive these same sounds through a kind of physical mirroring process known as proximal stimulation (Goldstein & Fowler, 2003; Kersten, 2023).

The third source of information that assists the listener in phonemic decoding is the listener’s prediction of the speaker’s intentions, that is, the listener anticipating what the speaker is trying to say. If the listener is familiar with the phonotactic system of the speaker’s language and dialect—its allowable phonemes, configurations, and sequences—recognition will be faster. Personal and situational knowledge also assist in prediction and perception. If the listener can easily anticipate the speaker’s targets, words, and syntactic structures, decoding the input also becomes easier (Linke & Ramscar, 2020).




Figure 2.4 The Articulatory Causes of the Consonants and Vowels in English

This is a visual mapping of the 24 consonants and 15 vowels that are most commonly used in most varieties of English. All consonants have articulation points in the mouth: various points along the upper palate, the teeth, or the lips. All of the vowels in English have approximate tongue positions (the highest position of the tongue mass) in the mouth, varying along two dimensions: high to low (vertical axis) and front to back (horizontal axis), and according to the relative openness of the mouth and jaw while voicing the vowel (close–open axis). (Based on Ashby, 2013; Cruttenden, 2014; Shibles, 1994)





2.4   Morphological Processing: Recognizing Words

Morphological processing, recognizing word boundaries in the stream of speech, is the pivotal aspect of linguistic processing. Simple math reveals the speed required: real-time language comprehension typically involves processing 120–180 words per minute, requiring rapid word recognition, often four or five words per second (Huettig et al., 2022).

Spoken word recognition is an approximating process rather than a linear matching process; it involves a graduated mapping of the incoming phonological signal onto entries in the listener’s mental lexicon. The listener achieves word recognition in a chunking fashion, first following the flow of the speech stream and isolating identifiable word and phrase boundaries inside each burst of speech. Word boundaries are identified through prosodic cues for the onsets of words, indicated by transitional probabilities between individual sounds (Aitchison, 2012; Hagiwara, 2015). Each time there is a perceptible change in pitch or loudness, the listener automatically recognizes this as a transitional cue to be used to recognize upcoming words (see Figure 2.5).
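The idea that transitional probabilities between sounds cue word boundaries can be illustrated with a toy computation. The syllable stream and the dip-detection rule below are invented for the demo; real listeners integrate these statistics with the prosodic cues just described:

```python
from collections import Counter

def boundary_dips(syllables):
    """Posit a word boundary wherever the transitional probability
    P(next syllable | current syllable) dips below its neighbours."""
    pairs = list(zip(syllables, syllables[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(syllables[:-1])
    tp = [pair_counts[p] / first_counts[p[0]] for p in pairs]
    return [i + 1 for i in range(1, len(tp) - 1)
            if tp[i] < tp[i - 1] and tp[i] < tp[i + 1]]

# Toy stream built from two "words", pre-ti and go-la: transitions inside
# a word recur reliably; transitions across word boundaries do not.
stream = ["pre", "ti", "go", "la", "pre", "ti", "pre", "ti", "go", "la"]
print(boundary_dips(stream))  # → [2, 6, 8]
```

On this stream the dips recover three of the four word boundaries; the boundary at index 4 is missed because the la→pre transition occurs only once, so its probability is not low. This is exactly why listeners also rely on pitch and loudness cues.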

Unlike written text, spoken text does not have distinct word boundaries to confirm word identities. Hence, the listener needs to formulate multiple candidates until the context eliminates all but the correct choice. The logogen model, a connectionist theory proposed by McClelland and Elman in the 1980s




Figure 2.5 The Timing of Speech Detection.

This spectrogram of a short spoken sentence—Two plus seven is less than ten—represents the speed at which a listener decodes speech. In less than two seconds (refer to the time axis), the speaker has uttered seven words consisting of 23 distinct phonemes. The listener uses acoustic cues—duration (shown on the x-axis), frequency (shown on the y-axis), and intensity (shown by the density of the shading)—to determine word boundaries; shifts in pitch and intensity in particular mark these boundaries. (Credit: University of California, Phonetics Laboratory)


(McClelland & Elman, 1986), suggests that the mental lexicon, a personal mental store of words and their associated information, contains units called logogens. According to logogen theory, each word in the mental lexicon is represented by a logogen, consisting of several informative layers. These layers include phonological information (how the word sounds), orthographic information (how the word is spelled), semantic information (the word’s meaning), and contextual information (how the word is often used).

When a listener encounters a series of words, logogen theory proposes that the logogens associated with similar-sounding words are activated in parallel. As the speech input unfolds over time (i.e., a period of milliseconds), logogens compete with each other for activation, and the most highly activated logogen is eventually considered the target word (Pisoni & McLennan, 2016). An illustration of this process for identifying the words in the phrase “recognize speech” is given in Figure 2.6.
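The parallel activation and competition described above can be sketched as a toy simulation. The lexicon, the pseudo-phoneme transcriptions, and the match-fraction activation rule are all invented for illustration; they are not the actual logogen equations:

```python
# Toy lexicon: words with made-up pseudo-phoneme transcriptions.
LEXICON = {
    "wreck":     ["r", "e", "k"],
    "reckon":    ["r", "e", "k", "@", "n"],
    "recognize": ["r", "e", "k", "@", "g", "n", "ai", "z"],
    "beach":     ["b", "ii", "ch"],
}

def activations(heard):
    """Activate every word whose phonology is consistent with the input
    so far; activation here is simply the fraction of the word matched."""
    scores = {}
    for word, phonemes in LEXICON.items():
        n = min(len(heard), len(phonemes))
        if phonemes[:n] == heard[:n]:
            scores[word] = n / len(phonemes)
    return scores

# As the input unfolds, candidates compete and mismatches drop out:
print(activations(["r", "e"]))  # wreck, reckon, recognize all partly active
print(activations(["r", "e", "k", "@", "g"]))  # reckon drops out; wreck stays
                                               # fully matched ("wreck a nice…")
```

Note that short words fully contained in the input (wreck) remain candidates even as longer words gain activation, which is why context is needed to settle the competition.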

Logogen theory also emphasizes the role of context in word recognition. Contextual information, particularly the surrounding words and sentence structure, influences the activation levels of logogens. The sentence structure and the



Figure 2.6 Word Recognition in Real Time

This illustration shows that for the two-word input “recognize speech,” as many as 10 individual words (wreck, reckon, nice, ice, eye, bee, beach, aren’t, not, I) may have been activated in parallel as candidates before the target words “recognize” and “speech” were identified.


developing meaning of the ongoing discourse help narrow down the range of possible words, aiding in selecting the correct word from similar-sounding alternatives. More recent research contends that this model of categorical perception of speech candidates is metaphorically valid, but that the neurological processing of word candidates may not be as isolable as the model suggests (McMurray, 2022).

We do know that the listener gathers evidence of the input from two directions: from the bottom up, or from the linguistic signal as it unfolds, and from the top down, using judgments based on higher-level ideas active in the listener’s mind. From a bottom-up perspective, the input consists of nine layers of identifiable components:


  • Feature: glides, obstruents, sonorants, sibilants, continuants, aspiration, nasality
  • Segment (phoneme): e.g., [k], [æ], and [t] in cat
  • Mora (μ): a half-syllable or unit of syllable weight, used in some languages, such as Japanese and Hawaiian
  • Syllable (σ): syllables themselves consist of parts: onset (optional), nucleus (required), coda (optional)
  • Foot (F): strong–weak syllable sequences such as ladder, button, eat some
  • Clitic group: a focal item plus grammaticalizing elements, e.g., an apple
  • Phonological word (P-word): a word or set of words uttered and interpreted as a single item, e.g., in the house
  • Lexical phrase: a formulaic element consisting of frequently used clitic groups and phonological words, e.g., try as one might
  • Pause unit (PU)/intonation unit (IU)/phonological phrase (P-phrase): a phonological unit consisting of a lexically stressed item plus supporting grammatical elements uttered in a single burst of speech
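The onset–nucleus–coda structure of the syllable noted in the list above can be illustrated with a toy parser (using ordinary spelling and a five-letter vowel set as a stand-in for real phonemes):

```python
VOWELS = set("aeiou")  # toy inventory; real analysis uses vowel phonemes

def parse_syllable(syllable):
    """Split a spelled syllable into onset, nucleus, and coda: the
    nucleus is the vowel run; onset and coda may be empty."""
    vowel_positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
    first, last = vowel_positions[0], vowel_positions[-1]
    return syllable[:first], syllable[first:last + 1], syllable[last + 1:]

print(parse_syllable("cat"))  # → ('c', 'a', 't')
print(parse_syllable("eat"))  # → ('', 'ea', 't')
```

The second example shows the optionality noted in the list: eat has no onset, while the nucleus is always required.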


Though all these components can be measured in the spoken input, they are not necessarily perceived consciously by the listener. The listener tends to give conscious attention to the longer units that are large enough to be psychologically valid. Psychologically valid units are those that the listener uses for processing meaning. These are the phonological word (a lexical item) and the phonological phrase (a meaningful group of lexical items) (Alderete & O’Séaghdha, 2022; Selkirk, 2011).

All views of word recognition emphasize the overriding role of context: the speed and efficiency of the word recognition process depend on various context effects. A context effect occurs when the perception of a particular stimulus is influenced by the linguistic environment in which it occurs (Frost, 1998; Jusczyk & Luce, 2002; Luce & McLennan, 2005). Context effects allow the listener to be primed to quickly recognize lexical items. There are three main types of context effects: the lexical effect, a tendency to identify known lexical items in a stream of speech rather than a random series of sounds; the schematic effect, a tendency to hear plausible lexical items, items that are likely to occur in a particular setting or context; and the syntactic effect, a tendency to anticipate plausible syntactic continuations for utterances. Context effects prime the auditory cortex, spreading activation in neural networks during listening. As soon as one word is recognized, activation spreads to neighboring (i.e., closely related) lexical items or concepts in the listener’s mental lexicon. This anticipation leads to faster and more reliable recognition of upcoming words (Hendrickson et al., 2020).
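The spreading-activation process just described can be sketched with a toy association network. The words, link weights, decay factor, and update rule below are all invented for illustration:

```python
# Toy association network: directed links with invented weights.
NETWORK = {
    "doctor":   {"nurse": 0.8, "hospital": 0.6},
    "nurse":    {"hospital": 0.5, "patient": 0.6},
    "hospital": {"patient": 0.4},
}

def spread(recognized, steps=2, decay=0.5):
    """Give the recognized word full activation, then let a decaying
    fraction of each node's activation flow along its links each step."""
    act = {recognized: 1.0}
    for _ in range(steps):
        new = dict(act)
        for node, level in act.items():
            for neighbour, weight in NETWORK.get(node, {}).items():
                new[neighbour] = new.get(neighbour, 0.0) + level * weight * decay
        act = new
    return act

primed = spread("doctor")
print(sorted(primed, key=primed.get, reverse=True))
# → ['doctor', 'nurse', 'hospital', 'patient']
```

The point of the sketch is the priming gradient: once doctor is recognized, closely linked items end up with higher resting activation than distant ones, so they are recognized faster when they occur.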

  • Recognizing Allophonic Variations of Words

An essential aspect of word recognition is equating allophonic variations, the alternate pronunciations of a citation form of a word or phrase that occur due to context. Allophonic variations (e.g., gonna versus going to) occur in every language because of efficiency principles in production. Speakers in spontaneous discourse often use only the minimum energy, decreasing the volume and articulatory precision of the phonemes they produce. As a result, nearly all phrases in any natural spoken discourse sample are less clearly articulated than pure citation forms would be (Kleinschmidt & Jaeger, 2015).

These sandhi variations are brought about through three related co-articulation processes within and between word boundaries: assimilation, vowel reduction, and elision. (Sandhi is a Sanskrit term coined by the Sanskrit grammarian Panini some 2,500 years ago, as this phonological phenomenon occurs in Sanskrit as well as other Indo-Aryan and Indo-European languages, including English.)

Consonant assimilation occurs when the fully articulated sound of a consonant changes due to phonological context (sounds that occur before and after), as shown in these examples:

  • /t/ changes to /p/ before /m/, /b/, or /p/ (labialization): best man, mixed blessing, cigarette paper, mixed marriage, circuit board, pocket money
  • /d/ changes to /b/ before /m/, /b/, or /p/ (labialization): bad pain, good cook, blood bank, good morning
  • /n/ changes to /m/ before /m/, /b/, or /p/ (labialization): Common Market, cotton belt, button pusher
  • /t/ changes to /k/ before /k/ or /g/ (velarization): cut glass, short cut
  • /d/ changes to /g/ before /k/ or /g/ (velarization): sad girl, hard court
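The assimilation patterns above amount to rewrite rules at word boundaries, which can be expressed directly in code. This is a rough sketch that uses ordinary spelling as a stand-in for phonemic transcription:

```python
# The five place-assimilation rules from the text: a word-final alveolar
# consonant takes on the place of the following word-initial consonant.
RULES = {
    ("t", "mbp"): "p",  # labialization: best man -> "besp man"
    ("d", "mbp"): "b",
    ("n", "mbp"): "m",
    ("t", "kg"): "k",   # velarization: cut glass -> "cuk glass"
    ("d", "kg"): "g",
}

def assimilate(words):
    """Apply the rules at each word boundary, treating spelling
    as a rough stand-in for phonemes."""
    out = list(words)
    for i in range(len(out) - 1):
        final, initial = out[i][-1], out[i + 1][0]
        for (target, context), replacement in RULES.items():
            if final == target and initial in context:
                out[i] = out[i][:-1] + replacement
    return " ".join(out)

print(assimilate(["good", "morning"]))  # → "goob morning"
print(assimilate(["cut", "glass"]))     # → "cuk glass"
```

The sketch makes the efficiency principle concrete: the speaker pre-positions the articulators for the upcoming consonant, and the listener must map the assimilated form back onto the citation form.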

About The Author

Basics of Linguistic Processing, Lateral Communications
Michael Rost, principal author of Pearson English Interactive, has been active in the areas of language teaching, learning technology and language acquisition research for over 25 years. His interest in bilingualism and language education began in the Peace Corps in West Africa and was fuelled during his 10 years as an educator in Japan and extensive touring as a lecturer in East Asia and Latin America. Formerly on the faculty of the TESOL programs at Temple University and the University of California, Berkeley, Michael now works as an independent researcher, author, and speaker based in San Francisco.
