
15 Cards in this Set


What has to be considered when transforming the speech signal to the symbol level?

1. Speech: Language, dialect, speaking style, …
2. Speaker:
– Dependent vs. independent vs. adaptive
– Known vs. unknown
– Cooperative vs. uncooperative
3. Target units: Type, number, complexity
4. Environment: Background noise, transmission channels, etc.


What makes speech recognition difficult?

1. Variances and invariances in the speech signal have to be differentiated
2. Contextual knowledge helps to understand: "to recognize speech" <-> "to wreck a nice beach"
3. Speech is a continuous signal, not a sequence of elementary sounds (even across word boundaries)
4. Articulation depends on the surrounding sounds (co-articulation)


Explain the schematic setup of a speech recognizer (ASR).

1. Feature extraction from signal
2. Acoustic model (features matched with sound/phonetic patterns)
3. Lexicon: which sound patterns can make sequences
4. Language model: what are the probabilities of different sequences calculated by lexicon, based on probability analysis of some language material
5. Algorithms that choose the most probable sound pattern
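The interplay of steps 2-5 can be sketched as picking the word sequence that maximizes the combined acoustic and language model score. The scores below are invented purely for illustration:

```python
import math

# Toy decoder over two competing hypotheses (scores are made up).
acoustic_score = {            # log P(features | word sequence), from the acoustic model
    "to recognize speech": math.log(0.30),
    "to wreck a nice beach": math.log(0.35),
}
language_score = {            # log P(word sequence), from the language model
    "to recognize speech": math.log(0.010),
    "to wreck a nice beach": math.log(0.001),
}

# Step 5: choose the most probable hypothesis by total score.
best = max(acoustic_score, key=lambda s: acoustic_score[s] + language_score[s])
print(best)   # the language model tips the decision toward the plausible sentence
```

Even though the implausible sentence scores slightly better acoustically here, the language model prior flips the decision.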

In which stage of a speech recognizer are HMMs and neural networks usually used?

In the acoustic model and lexicon. The language model usually utilizes HMMs.

Name three ideas of feature extraction.

Feature extraction: extract information which allows differentiating between sounds.


- First idea: (Fourier) Spectrum
- Second idea: Separation of excitation and vocal tract modulation
- Third idea: Further improvements by considering hearing characteristics


Explain how feature extraction of the Mel-scaled cepstrum (MFCC) works.

1. FFT of a signal window
2. Weighting by triangular filters, producing filter-bank coefficients
3. Take the logarithm of the coefficients
4. Apply the inverse Fourier transform (in practice a DCT) to the log coefficients
5. Add time behaviour by calculating the first and second derivatives (delta and delta-delta) of these coefficients
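A minimal sketch of these steps for a single frame. Frame length, filter count, and sample rate are arbitrary choices here; real implementations add pre-emphasis and use library FFT/DCT routines:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    """Steps 1-4 for one (already windowed) frame."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))                  # 1. FFT
    # 2. triangular filters spaced evenly on the mel scale
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):                           # rising edge
            energies[i] += spectrum[k] * (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):                           # falling edge
            energies[i] += spectrum[k] * (hi - k) / max(hi - mid, 1)
    log_e = np.log(energies + 1e-10)                       # 3. log
    # 4. inverse transform of the log spectrum (a DCT-II in practice)
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * q * (2 * n + 1)
                                           / (2 * n_filters)))
                     for q in range(n_ceps)])

def deltas(ceps_seq):
    """5. time behaviour: first derivative across frames
    (apply twice for the double derivative)."""
    return np.gradient(ceps_seq, axis=0)
```

Step 5 necessarily operates on a sequence of frames, which is why `deltas` takes a (frames x coefficients) array rather than a single vector.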

Explain the steps of perceptual linear predictive (PLP) coding.

1. Window the signal with a Hamming window
2. Spectral analysis
3. Transform the spectrum to a scale similar to the Bark scale
4. Apply a weighting filter as in the calculation of mel-scaled cepstral coefficients
5. Convolve with an artificial frequency-band masking curve
6. Sample the spectrum in steps of 1 Bark -> smoothing of the spectrum
7. Amplify high frequencies to balance the frequency dependence of loudness
8. Transfer the intensity representation into an (approximate) loudness representation
9. Re-transform the loudness representation into the time domain
10. Calculate the LPC coefficients

Explain what RASTA tries to solve and how it works.

RASTA tries to be robust against interferences, both additive and multiplicative; it works better against multiplicative ones. It band-pass filters the time trajectories of the logarithmic spectral parameters, so that slowly varying (multiplicative) channel distortions are suppressed.
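The core observation can be demonstrated in a few lines: a multiplicative channel becomes an additive constant in the log domain, where filtering removes it. Mean removal stands in for RASTA's actual band-pass filter here, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(0.5, 2.0, size=100)     # spectral energies over time (one band)
channel_gain = 3.0                          # constant multiplicative interference
noisy = clean * channel_gain

# In the log domain the multiplicative gain becomes an additive constant ...
log_clean = np.log(clean)
log_noisy = np.log(noisy)

# ... which removing the slowly varying component (here: the mean, a crude
# stand-in for RASTA's band-pass filter) eliminates from the trajectory.
filtered_clean = log_clean - log_clean.mean()
filtered_noisy = log_noisy - log_noisy.mean()

print(np.allclose(filtered_clean, filtered_noisy))   # True: channel effect removed
```

Additive noise does not turn into a constant offset in the log domain, which is why RASTA helps less there.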

What is the idea of a Markov chain in speech recognition? When is it called hidden?


One state corresponds to one symbol. Transitions to the next states (and back to itself) are determined by transition probabilities.

In speech recognition the chain usually runs only one way, i.e. there is no going back in the chain.

The result is the emitted symbols. One state can emit several symbols, and these symbols have emission probabilities (no one-to-one match).

This means the output sequence doesn't tell which states were visited; that's why it's called a hidden model.

Hidden = two stochastic levels: 1) transitions, 2) emitted symbols.
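A minimal sketch of such a left-to-right HMM with both stochastic levels, stochastic transitions and stochastic emissions (all probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Left-to-right HMM with 3 states: each state loops or moves forward only.
A = np.array([[0.6, 0.4, 0.0],    # level 1: transition probabilities
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],         # level 2: emission probabilities
              [0.2, 0.8],         # (each state can emit 'a' or 'b')
              [0.5, 0.5]])
symbols = ['a', 'b']

state, out, path = 0, [], []
for _ in range(8):
    path.append(state)
    out.append(symbols[rng.choice(2, p=B[state])])   # emit a symbol
    state = rng.choice(3, p=A[state])                # move (or stay)

print(''.join(out))   # an observer sees only this symbol sequence ...
print(path)           # ... while the visited states stay hidden
```

Because every state can emit both symbols, the printed sequence alone cannot be mapped back to a unique state path, which is exactly the "hidden" property.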

What is a phoneme HMM? How does it differ from a normal HMM?

A phoneme HMM doesn't output discrete symbols, but ... It takes co-articulation into account.

What can be achieved with the Viterbi algorithm?

Finding the most likely state path through an HMM quickly (by dynamic programming).
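A compact log-domain implementation of that dynamic program, exercised on a toy two-state HMM (probabilities invented):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for an observation sequence (log-domain DP)."""
    T, N = len(obs), len(pi)
    logA, logB = np.log(A + 1e-300), np.log(B + 1e-300)
    delta = np.log(pi + 1e-300) + logB[:, obs[0]]    # best score ending in each state
    back = np.zeros((T, N), dtype=int)               # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + logA               # scores[i, j]: come from i into j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]                     # backtrack from best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.0, 1.0]])               # left-to-right, as in ASR
B = np.array([[0.9, 0.1], [0.1, 0.9]])
print(viterbi([0, 0, 1, 1], pi, A, B))               # [0, 0, 1, 1]
```

The max/argmax recursion makes the search linear in the sequence length instead of exponential over all possible state paths.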

What does a perceptron do in speech recognition? What is its basic structure?

It's a two-class classifier. It calculates a weighted sum of the feature vector and outputs either class 1 or class 2. Putting perceptrons in series makes it possible to have more than two classes. Good for single-phoneme classification.
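A sketch of the classic perceptron learning rule on a toy two-class problem; the logical-AND inputs below stand in for real "phoneme vs. not-phoneme" feature vectors:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Weighted sum of the feature vector -> class 0 or 1."""
    w = np.zeros(X.shape[1] + 1)                  # weights plus bias
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append constant bias input
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if w @ xi > 0 else 0         # threshold the weighted sum
            w += lr * (yi - pred) * xi            # update only on mistakes
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])                        # linearly separable toy labels
w = train_perceptron(X, y)
print(predict(w, X))                              # [0 0 0 1]
```

A single perceptron only draws one linear boundary; combining several of them is what yields multi-class (e.g. full phoneme set) decisions.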

What is an n-gram language model?


The probabilities of n words succeeding each other are estimated from texts.

These probabilities are used in making decisions in speech recognition.

If certain word sequences never occur in the training texts, single-word probabilities are used instead; that is called an n-gram back-off language model.
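A toy bigram model with back-off to unigrams. The fixed back-off weight used here is a crude "stupid backoff"-style simplification; proper back-off models (e.g. Katz) use discounted probability mass instead:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)                        # single-word counts
bigrams = Counter(zip(corpus, corpus[1:]))        # adjacent word-pair counts
total = len(corpus)

def bigram_backoff(prev, word, alpha=0.4):
    """P(word | prev): bigram estimate if seen, else back off to a
    down-weighted unigram estimate (alpha is an arbitrary choice here)."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

print(bigram_backoff("the", "cat"))   # seen bigram: 2/3
print(bigram_backoff("mat", "ate"))   # unseen: backs off to the unigram
```

Back-off matters in practice because any real test utterance contains word pairs never seen in the training texts, and a zero probability would veto an otherwise good hypothesis.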

How can you measure speech recognizer error?


1. Two common measures: Word Error Rate (WER) or Word Accuracy (WA). Sentence error rates etc. are also possible.
2. Labelled data is compared with the recognizer output. The two word sequences need to be aligned in the time domain, e.g. with dynamic time warping.
3. Correctly recognized, substituted, deleted and inserted words are counted. WA = 1 - WER = 1 - (substituted + inserted + deleted) / total_words
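Counting substitutions, insertions and deletions amounts to a minimum-edit-distance alignment of the two word sequences, which a direct dynamic program computes:

```python
def wer(reference, hypothesis):
    """Word error rate via minimum edit distance over words
    (substitutions, insertions, deletions all cost 1)."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                               # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                               # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# 2 substitutions + 2 insertions over 3 reference words -> 4/3
print(wer("to recognize speech", "to wreck a nice beach"))
```

Note that WER can exceed 1 (as here) because insertions are counted against the reference length.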


What can be done to improve speech recognizer results?

1. Training material recorded in real conditions
2. Choosing a good feature set, preprocessing
3. Parallel recognizers (multi-stream speech recognition) for different frequency ranges. It is assumed that interferences only affect some frequency ranges.