108 Cards in this Set

  • Front
  • Back
Perception of Speech
How we assign meaning to the acoustic sound wave stimulating our auditory system.
What is the segmentation problem?
There is no one-to-one correspondence between individual phonemes and their acoustic underpinnings.

There is variability (caused by certain factors) in phoneme production.
What types of events do we perceive?
Linguistic events
What types of events do we hear?
Acoustic events
What are the challenges to speech perception?
Lack of segmentation

Lack of phonemic invariance

Lack of Linearity
What is phonemic invariance?
The idea that phonemes always stay the same no matter who the speaker is or in what context they are delivered.
Why does speech have a lack of phonemic invariance?
Rate of speech, patterns of stress and intonation, dialect, fatigue and many other factors contribute to the variability in phoneme production.
What is the linearity principle?
A specific sound in a word corresponds to a specific phoneme. The sounds that make up the word are distinct from each other and occur in a particular sequence.
Why does the speech signal have a lack of linearity?
Because of coarticulation, information is present about the acoustic properties of a specific phoneme, as well as the phonemes that precede and follow it.

The acoustic boundaries between phonemes are blurred.
What are the cues for place of articulation?
Formant spacing

F2 transition

Frequency of noise
What are the cues for voiced/voiceless distinction?
voice bar

timing
How does a listener perceive front vowels?
The frequency of F1 and an average of F2 and F3 (since these two formants are close in frequency)
How does a listener perceive back vowels?
The average of F1 and F2 as well as the frequency of F3
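The two vowel cards above can be read as a small calculation. The sketch below is only an illustration of that reading: the function name is made up, and the formant values in the usage lines are assumed example figures, not data from the cards.

```python
def perceptual_formant_cues(f1, f2, f3, front):
    """Return the two frequency cues (in Hz) described on the vowel cards."""
    if front:
        # Front vowels: F1, plus the average of the closely spaced F2 and F3.
        return (float(f1), (f2 + f3) / 2)
    # Back vowels: the average of the closely spaced F1 and F2, plus F3.
    return ((f1 + f2) / 2, float(f3))


# Illustrative formant values in Hz (assumed for the example only).
print(perceptual_formant_cues(270, 2290, 3010, front=True))   # (270.0, 2650.0)
print(perceptual_formant_cues(300, 870, 2240, front=False))   # (585.0, 2240.0)
```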
What is "target undershoot"?
The F1/F2 vowel space becomes reduced so the formant patterns of different vowels become similar to each other.

Amount of undershoot depends on which phonemes precede and follow the particular vowel, as well as speaker and speaking style.
What yields the most important acoustic information for identifying coarticulated vowels?
The changing formant patterns
How are diphthongs perceived?
On the basis of their formant transitions.

Specifically, it is how fast the formants change that is the most salient cue.
Categorical perception of consonants
If a series of consonant sounds was heard by a listener, with the sounds differing in one acoustic aspect by small equal steps, the listener would perceive some of the sounds as the same phoneme until a boundary was reached.
Listeners perceive consonants using ____________ that are fused into the perception of a single phoneme.
multiple acoustic cues
Perceptual information is obtained from ___________ as well as from the target sound itself.
sounds adjacent to the target sound
How are liquids perceived?
On the basis of their formant transitions

Much more rapid than those in diphthongs

F3 frequency is also important for differentiating between the liquids
How are glides perceived?
Transitions are shorter in duration than those of diphthongs.
How are nasals perceived?
On the basis of their internal formant structure, as well as on the basis of the formant transitions of the vowels occurring before and after the nasal sound

F1, F2 and F3 are weak in intensity because of the antiresonances resulting from the structure of the nasal cavities.

Also have an extra formant, the nasal formant
How are stops perceived?
On the basis of numerous acoustic cues that are intertwined with the acoustic cues for the vowels and consonants surrounding the phoneme.

Also, inserting an interval of silence between two phonemes can cause a stop sound to be heard. (e.g. "slit" becomes "split")
How are fricatives perceived?
-Major cue?
-Place of articulation
The duration of the frication noise provides the major perceptual cue

Place of articulation of fricatives is perceived on the basis of the spectrum and intensity of the noise.
How are affricates perceived?
They share acoustic features with stops and fricatives.

The rise time can be particularly important in recognizing affricates versus fricatives.

Duration of the frication noise is also longer for fricatives than for affricates.
What is a theory?
A way of integrating current knowledge about a particular phenomenon

It needs to be testable (verifiable or falsifiable)

Creates the complexity that is simplified by models.
They help to explain observed data and info and can be used to make predictions about events related to the phenomenon in question.

Because it is based on incoming info and new research, it is always subject to change.
What is a model?
A simplification of a system or its parts that can be manipulated in a controlled manner to test theories.

Helps us to break down complex systems into parts.
What are the three major issues that most models of speech production try to address?
The serial-order issue

Degrees of freedom

Context-sensitivity problem
Serial-Order Issue
Speech is a continually varying waveform composed of linguistic elements

The linguistic elements are produced in a serial order and this order is important for meaning.

Which elements are serialized?
-Features of a sound (voicing, nasality, etc.), phonemes, syllables, parts of syllables, etc.

At what level are we programming the movements?
Degrees of Freedom
There are a huge number of muscles involved in speech and many structures in the vocal tract can move in different ways, at different speeds and in different combinations (range, temporal, speed).

Each individual muscular movement is one degree of freedom.

There are more than 70 different muscular degrees of freedom (tongue, velum, jaw, larynx and respiratory system)

We need to regulate all muscular contractions very rapidly.
Context-Sensitivity Problem
Sounds vary with context and are influenced by speaking rate, stress, clarity of articulation and other factors.

Coarticulation results in huge variability in the production of a target sound.
What are the main models of speech production?
Target Models

Feedback and Feedforward Models

Dynamic Systems Models

Connectionist Models
Target Models
Speech production is a process of attempts to attain a sequence of targets.

Targets have been specified as being spatial or in acoustic-auditory terms

Timing of articulatory movements is programmed by a mechanism that uses an internal model of relations among commands to the articulators, their movements and the acoustic consequences of those movements.

This internal model is acquired and maintained by the use of auditory and somatosensory feedback.

Once speech has been learned, auditory feedback is not used to control the articulators moment-to-moment although somatosensory feedback still is used this way (but at lower levels)

Production involves movement of primary and secondary articulators

Hierarchical organization reduces degrees of freedom and accounts for flexibility in the system.
Spatial Target Models
There is an internalized map of the vocal tract in the brain that allows the speaker to move his/her articulators to specific regions within the vocal tract (regardless of the starting point)
-Shows that movements must be variable.
Acoustic-auditory Target Models
The goal is the acoustic output while the articulatory movements used to achieve this output may vary (based on context, speaker's speech rate and different patterns of stress)
Feedback Model
Transfer of part of the output of a system back to the input to regulate and correct any errors in the output.

As we speak, we hear (auditory channel) what we are saying and obtain info about the movements of our articulators through proprioceptive, kinesthetic and tactile channels.

If there is a discrepancy between the actual and intended movements, an error signal is generated and sent back to the periphery (muscles) to correct the problem.

Motor command to the structure that causes the system to respond is called the "reference signal" or "actuating signal"
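The error-correction loop described on this card can be sketched as a toy calculation. This is a simplified illustration, not a model from the source: the single numeric "position", the `gain` parameter and the number of steps are all assumptions.

```python
def feedback_correct(intended, actual, gain=0.5, steps=5):
    """Repeatedly compare the output with the intended value and correct it."""
    history = [actual]
    for _ in range(steps):
        # Error signal: discrepancy between intended and actual movement.
        error = intended - actual
        # Part of the error is fed back to correct the output (gain is assumed).
        actual = actual + gain * error
        history.append(actual)
    return history


print(feedback_correct(intended=10.0, actual=6.0))
# [6.0, 8.0, 9.0, 9.5, 9.75, 9.875]
```

Each pass through the loop takes time, which is one way to picture why a purely feedback-driven system would struggle with very fast movements.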
Proprioception
Knowledge about where our body is in space
What problems does the feedback model encounter with explaining speech production?
Feedback channels tend to be relatively slow but the movements required for connected speech are very fast.

Disruption of feedback channels should have a serious effect on speech production. However, most studies show that speech is minimally affected, if at all, by these disruptions.
Feedforward Model
Signals make adjustments at the periphery so that the system is primed to move in an efficient, coordinated manner.

Signals bias the system so status is suited to the movement to be performed (for planning and predicting)

This is a much faster process than feedback but it is likely that both feedback and feedforward are involved in the regulation and control of speech production.
Dynamic Systems Model
Groups of muscles link up together (combined or coordinated) to perform a particular task.

"Linkages" or "synergies" are not fixed - A muscle may be grouped into a synergy or coordinative structure with other muscles for a specific task and grouped with different muscles for different tasks.

No single muscle is controlled alone. They are controlled as an integrated unit, which addresses the degrees of freedom problem
Connectionist Models
AKA spreading activation models, parallel-distributed processing models

Networks of dense connections accomplish speech movement and connectivity changes with experience.

Non-hierarchical - Cooperative and parallel processing of signals with many input units and many connections

Efficient because of temporal overlap

Connections are weighted via inhibitory and excitatory signals
What properties do synergies have in the Dynamic Systems Model?
They are task specific (making sounds vs. moving food in mouth), context specific (different speech sounds), adaptive (based on experience)

They possess essential qualitative parameters (e.g. lip closure for bilabials)

They possess non-essential parameters, quantitative variations to accommodate rate, stress, phonetic context
What is the airplane analogy (Dynamic Systems)?
Different parts control movement of the plane (elevators, rudder, wing parts). The pilot does not control each part individually. Instead, he/she moves all at once, so 5 degrees of freedom control one movement.
What is a non-hierarchical model of speech production?
Connectionist models
Speaker Normalization
lack of invariance due to differences across speakers
What are the types of speech perception theories?
Active

Passive

Bottom-up

Top-down

Autonomous

Interactive
What perceptual cues do we rely on in addition to the acoustic information?
Body language / facial expression

Content - topic cue

Familiarity with the speaker

Movement of the articulators (speechreading)

Intonation (but this is acoustic)

Knowledge of the language (syntax/rules of grammar, vocabulary)
What are the perceptual cues for vowels?
Transitions to and away from the vowels

Relationships to and of the formants

Phonetic context
What are the perceptual cues of diphthongs?
Relatively gradual (when compared to semi-vowels) transition from one vowel position to another.
What are the perceptual cues that differentiate vowels from consonants?
Redundancy of acoustic information

Consonants: Shorter duration, lower intensity, voiced or voiceless, more constriction
Categorical Perception
-Definition
The ability to discriminate only as well as one can identify.

Sounds differing in some small way are perceived as the same phoneme until a boundary is reached.
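A minimal sketch of this definition follows. It assumes the varied dimension is voice onset time and places the category boundary at 25 msec; both choices, and the function names, are assumptions made for illustration only.

```python
def identify(vot_ms, boundary_ms=25.0):
    """Label a stimulus by which side of the (assumed) boundary it falls on."""
    return "voiced" if vot_ms < boundary_ms else "voiceless"


def predicted_discriminable(a_ms, b_ms, boundary_ms=25.0):
    """Strict categorical perception: two stimuli can be told apart
    only if they receive different labels."""
    return identify(a_ms, boundary_ms) != identify(b_ms, boundary_ms)


continuum = list(range(0, 61, 10))           # equal 10 msec steps
print([identify(v) for v in continuum])
print(predicted_discriminable(0, 10))        # False: same category
print(predicted_discriminable(20, 30))       # True: pair straddles the boundary
```

Both pairs differ by the same 10 msec, yet only the pair that crosses the boundary is predicted to be heard as different.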
There is evidence of categorical perception for __________.
Place

Manner

Voicing
What are the questions that can be asked about categorical perception?
Do people perceive speech differently than they perceive nonspeech?

Do we categorically perceive any other sounds?

Does learning of a language alter speech perception?

Is categorical perception innate or learned?

What are the auditory vs. linguistic contributions?
Categorical perception for place has been studied between which phonemes?
b, d, g

r, l
Categorical perception for manner
Transition duration - b/g to w/j to ua/ia
Categorical perception for voicing
VOT from -150 to +150 msec
Perception is more categorical for _________ than for _________. Why could this be?
consonants

vowels

It is easier since consonants are more rapidly changing than vowels.
What influences the difficulty of shifting perceptual boundaries?
Age, time exposed to the language, whether it is the first or second language being taught.
What are trading relations?
Multiple cues are present in consonants so there is a trade-off between these cues

e.g. first formant height (high=voiceless) and VOT (longer=voiceless)
How does coarticulation affect perception?
We integrate info that is spread out over several phonetic segments to make a decision about phoneme identity.
How does rate affect perception?
Influences consonant identification
If you keep the VOT the same and speed up speech, what will happen to perception of VOT?
VOT will seem relatively long.
What is the main acoustic cue for distinction of /r/ and /l/?
Formant transitions

F3 is lower for /r/ than for /l/

This is essential for distinction of the two phonemes
What is the main acoustic cue for glides?
Shorter transition duration (40-60 msec to 100-150 msec)

If transition is shorter than 40 msec, it is perceived as a stop.

If longer transition, it is perceived as a diphthong
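The duration ranges on this card can be written as a simple lookup. The sketch reads "40-60 msec to 100-150 msec" as glide versus diphthong durations, which is an interpretation of the card; durations between 60 and 100 msec are returned as ambiguous because the card gives no figure for them.

```python
def classify_by_transition(duration_ms):
    """Map a formant transition duration (msec) to the percept named on the card."""
    if duration_ms < 40:
        return "stop"
    if duration_ms <= 60:
        return "glide"
    if duration_ms >= 100:
        return "diphthong"
    # 60-100 msec: the card gives no value for this region.
    return "ambiguous"


for d in (30, 50, 80, 120):
    print(d, classify_by_transition(d))
```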
What are the acoustic cues for manner of stops?
Preceding silence or sound attenuation, transient noise burst, rapid formant transition
What are the acoustic cues for place of stops?
Noise burst frequency, formant transitions to vowels, VOT (shortest for bilabials, longest for velars)
What are the acoustic cues for voicing of stops?
Duration of silent interval

initial position of stops = VOT, F1 cutback, F0 of vowel (higher after voiceless stops)

final position = vowel duration (longer before voiced stops)
What are the acoustic cues that differentiate between fricatives?
High/Low frequency noise

/s/, /z/ = higher frequency
/sh/, /zh/ = lower frequency
What are the acoustic cues for the manner of fricatives?
friction noise, longer in duration than for stops (130 ms or greater)
What are the acoustic cues for place of fricatives?
Sibilants more intense than non-sibilants and high-frequency spectral peaks
non-sibilants flat spectra
prominent peaks higher for s, z than for sh, zh
F2 transitions
What are the acoustic cues for voicing for fricatives?
Friction noise (longer duration and more intense for voiceless fricatives)
What are the three roles that context can have in speech perception?
Acoustic Context

Linguistic Context

Topic/Situational Context
Acoustic Context
Information regarding several phonemes converges on a specific sound
Linguistic Context
Language specific sound sequences (e.g. ng and nd at the end of syllable, never at beginning), other language based expectations regarding syntax and semantics
Topic/Situational Context
Familiarity with subject matter
Speech Perception Process
Interaction between incoming and stored information (e.g. how articulators produce speech, language rules/words, subject matter)
What are the main acoustic cues for nasals?
-overall
-manner cue
-place cue
Internal formant structure, vowel formant transitions

manner cue-weak formants, nasal formant

place cue-F2 transition frequency and duration
What is speaker normalization?
Lack of invariance due to differences across speakers

Sounds can be remarkably different even down to the physical make-up of the sound.
What does it mean to say that the basic unit of perception varies with context?
We have different strategies of perception that are flexible depending on the context.

e.g. quiet vs. loud environment
What does it mean to say that a speech perception theory is "active"?
There is a link between speech perception and speech production and a knowledge of how sound is produced.

We refer the input to our knowledge of speech production.
What does it mean to say that a speech perception theory is "passive"?
It emphasizes sensory aspects of speech perception

e.g. We have a particular filter for speech analysis and tune in to the presence and absence of distinctive features.
What is a bottom-up theory?
All information necessary for recognition of sounds is contained in the acoustic signal.

We are not relying on higher level analysis at all
What is a top-down theory?
Higher level linguistic and cognitive operations involved in identification and analysis of sounds

We make use of our knowledge of the language to identify input.
What is an autonomous theory?
Signal is processed in a serial manner (phonetic-lexical-syntactic-semantic).

We have different levels of analysis that are sequential (no overlap)

Doesn't allow for a top-down process since each level is working on its own.
What is an interactive theory?
We get input from many sources (info/knowledge) available at all stages of processing for speech perception

Could be high level and low level working together
Motor Theory
An active theory that stresses the link between perception and production

We have experience in producing speech and are aware of the connection between movement for speech and acoustic consequences

We have a special neural center for decoding speech

Listener perceives some abstract articulatory plan (gestures) that yields perfect production and compares it to the input.

We retrieve the intended gesture from the variable acoustic signal
What is the basic unit of speech production according to the motor theory?
Gestures: phonetic, invariant (but accommodates variation), in special phonetic module in the brain
What are the limitations of the motor theory?
Infants and people with severe sensorimotor problems or structural anomalies can perceive and discriminate even though they can't produce.
How does motor theory address the lack of phonemic invariance and the problem of coarticulation?
The gestures accommodate variation.

We retrieve an intended gesture from the variable acoustic signal.
What is the Acoustic Invariance Theory?
We have a set of acoustic features for each distinct phoneme.

A passive, bottom-up theory

Some core of acoustic features are always present and create a template against which the listener compares the incoming sound.

Or...some set of features with minimal acoustic contrasts (e.g. +/- nasality); listener abstracts the essential features.
What is the unit of analysis with the Acoustic Invariance Theory?
The "core" of acoustic features that are always present.
What is the Direct Realism Theory?
An active theory in which we directly perceive the object, rather than reconstructing it from sensory input

Our experiences with objects affect our perception of them. The more you experience an object, the easier it is to perceive it.

We directly perceive the gestures of the vocal tract that produced the acoustic speech signal, rather than the acoustic signal itself.
What is the TRACE Model Theory?
Connectionist

Both top-down and bottom-up

Integrative and parallel processing of multiple sources of information

Network (links between units) with processing within and across levels of the system.

Inhibitory element within a level - activate a connection if feature is present, suppress competing features
What is the Logogen Theory?
Interactive theory that focuses on word recognition, rather than acoustic-phonetic aspects of the signal.

If information (meaning, phonetic structure, syntactic function) is detected+confirmed by neural activity then logogen is activated and word is recognized

With experience, threshold for identification of the word lowers (it gets easier to identify).
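The threshold idea on this card can be sketched as a small class. The numeric threshold, the evidence units and the 0.5 decrement after each recognition are all assumptions chosen for illustration; the card only says that the threshold lowers with experience.

```python
class Logogen:
    """Toy word detector: fires when accumulated evidence reaches a threshold."""

    def __init__(self, word, threshold=3.0, min_threshold=1.0):
        self.word = word
        self.threshold = threshold
        self.min_threshold = min_threshold
        self.activation = 0.0

    def receive(self, evidence):
        """Add evidence; return True if the word is recognized."""
        self.activation += evidence
        if self.activation >= self.threshold:
            self.activation = 0.0
            # Experience lowers the threshold (step size is an assumption).
            self.threshold = max(self.min_threshold, self.threshold - 0.5)
            return True
        return False


cat = Logogen("cat")
print([cat.receive(1.0) for _ in range(3)])  # [False, False, True]
print(cat.threshold)                         # 2.5
```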
What is the level of analysis with the Logogen Theory?
the word
Logogen
Neural processing device associated with a word in individual vocabulary.
What is the Cohort Theory?
There are two stages in word recognition: autonomous and interactive stages
What is the autonomous stage in word recognition (Cohort Theory)?
Bottom-up stage of the theory that relies on acoustic-phonetic information at the beginning of the word

List=cohort

Activates all words in memory bank with those beginning features/information
What is the interactive stage in word recognition (Cohort Theory)?
A top-down operation that is used to narrow the field

Linguistic, cognitive, and context knowledge are used to narrow the field of possibilities for the word.

e.g. predictive text
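The two stages can be sketched with a toy lexicon. Spelling stands in for acoustic-phonetic information here, and the word list, the part-of-speech tags and the function names are all hypothetical; a real cohort would be built from sound, not letters.

```python
# Hypothetical mini-lexicon: word -> syntactic category.
LEXICON = {
    "cancel": "verb",
    "candle": "noun",
    "candy": "noun",
    "cat": "noun",
    "dog": "noun",
}


def initial_cohort(onset, lexicon=LEXICON):
    """Autonomous (bottom-up) stage: activate every word sharing the onset."""
    return sorted(w for w in lexicon if w.startswith(onset))


def narrow(cohort, expected_category, lexicon=LEXICON):
    """Interactive (top-down) stage: use context to narrow the field."""
    return [w for w in cohort if lexicon[w] == expected_category]


cohort = initial_cohort("can")
print(cohort)                  # ['cancel', 'candle', 'candy']
print(narrow(cohort, "verb"))  # ['cancel']
```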
What is the "fuzzy" aspect (step 1) of the Fuzzy Logical Model of Perception?
Presence of feature in a particular sound interval; continuous values (0 to 1)
0 = absent
1 = present
0.5 = ambiguous

The closer to 1, the more impact on classification decisions
What is the Fuzzy Logical Model of Perception?
States that there are 3 operations in phoneme identification
1. "Fuzzy" aspect
2. Prototype matching
3. Pattern classification

Rejects the notion of specialized speech perception processes (e.g. categorical perception). We simply integrate these features along segments that are being evaluated.
What is the Prototype Matching operation (step 2) in the Fuzzy Logical Model of Perception?
By comparing the features (determined by step 1) to stored phoneme prototypes.
What is the Pattern Classification operation in the Fuzzy Logical Model of Perception?
Best match between stored prototype and possibilities for input - doesn't need to be a perfect match.

There are logical rules for doing this.
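The three operations can be sketched together. The feature names and prototypes below are hypothetical, and the specific rules used (multiplying feature values to score a prototype, then dividing by the total to pick the best relative match) are one common way of formalizing this model, assumed here rather than taken from the cards.

```python
from math import prod

# Hypothetical prototypes: which features each phoneme expects to be present.
PROTOTYPES = {
    "ba": {"labial_transition": True, "voicing": True},
    "da": {"labial_transition": False, "voicing": True},
    "pa": {"labial_transition": True, "voicing": False},
}


def match(features, prototype):
    """Step 2: score a prototype by multiplying fuzzy feature values (0 to 1)."""
    return prod(
        features[name] if expected else 1.0 - features[name]
        for name, expected in prototype.items()
    )


def classify(features, prototypes=PROTOTYPES):
    """Step 3: choose the best relative match; it need not be a perfect one."""
    support = {p: match(features, proto) for p, proto in prototypes.items()}
    total = sum(support.values())
    if total == 0:
        return None, support
    probs = {p: s / total for p, s in support.items()}
    return max(probs, key=probs.get), probs


# Step 1: fuzzy feature values for one sound interval (assumed numbers).
best, probs = classify({"labial_transition": 0.8, "voicing": 0.9})
print(best)  # ba
```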
Which issues does the Fuzzy Logical Model of Perception address?
Segmentation and lack of invariance issues
Native Language Magnet Theory
Phonetic categories of a language are organized in terms of prototypes (perfect exemplars of sounds)

This process begins in infancy. With experience, we organize the sounds we hear.

Prototypes function as perceptual magnets (assimilate other members of the same phonetic category)

Perceptual prototypes serve as speech production targets.
How does the Native Language Magnet Theory explain how babies learn to discriminate?
Babies develop prototypes to discriminate sounds of the language to which they are exposed (can't discriminate between sounds of other languages)
What is an example of how we can flexibly alter our strategies?
In a quiet environment, we can use bottom-up (only using auditory signal, no higher level processing)

In a less ideal listening environment, we may need to use top-down processes (use our knowledge of the language)
Perkell's target model
Both articulatory and acoustic targets