LSA 352
Speech Recognition and Synthesis
LSA Summer Institute 2007

COURSE INFORMATION
Instructor
Dan Jurafsky, jurafsky@stanford.edu
Office: Margaret Jacks Hall (Bldg 460), Room 113
Time
Mondays and Thursdays, 3:45-5:30 PM
Location
Bldg 160 Room 319
Textbook

Description
Introduction to automatic speech recognition and speech synthesis. In speech recognition we will learn key algorithms in the noisy channel paradigm, focusing on the standard 3-state Hidden Markov Model (HMM), including the Viterbi decoding algorithm and the Baum-Welch training algorithm. We will also learn about representations of the acoustic signal such as mel-frequency cepstral coefficients (MFCCs), and the use of Gaussian Mixture Models (GMMs) and context-dependent triphones for acoustic modeling. Finally, we will cover N-gram language modeling and perplexity. In speech synthesis we will focus on concatenative synthesis, covering text normalization, grapheme-to-phoneme conversion, prosodic modeling, and waveform synthesis. We will also give a brief overview of other speech processing tasks, such as speaker and language identification and the use of forced alignment for automatic phonetic labeling. The course will involve lectures and programming assignments.
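
Since Viterbi decoding is the centerpiece of the ASR half of the course, a minimal sketch of the algorithm on a toy discrete-observation HMM may help fix the idea in advance. The three states and all probabilities below are illustrative placeholders, not values from the course materials:

    import math

    # Hypothetical 3-state HMM over a 3-symbol observation alphabet.
    states = ["s1", "s2", "s3"]
    start_p = {"s1": 0.6, "s2": 0.3, "s3": 0.1}
    trans_p = {
        "s1": {"s1": 0.7, "s2": 0.2, "s3": 0.1},
        "s2": {"s1": 0.3, "s2": 0.5, "s3": 0.2},
        "s3": {"s1": 0.2, "s2": 0.3, "s3": 0.5},
    }
    emit_p = {
        "s1": {"a": 0.6, "b": 0.3, "c": 0.1},
        "s2": {"a": 0.1, "b": 0.7, "c": 0.2},
        "s3": {"a": 0.2, "b": 0.2, "c": 0.6},
    }

    def viterbi(obs):
        """Return (log probability, state path) of the best path for obs."""
        # v[t][s]: log probability of the best path ending in state s at time t.
        v = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
        backptr = [{}]
        for t in range(1, len(obs)):
            v.append({})
            backptr.append({})
            for s in states:
                # Best predecessor for state s at time t.
                prev = max(states, key=lambda p: v[t - 1][p] + math.log(trans_p[p][s]))
                v[t][s] = (v[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs[t]]))
                backptr[t][s] = prev
        # Trace the back pointers from the best final state.
        best = max(states, key=lambda s: v[-1][s])
        path = [best]
        for t in range(len(obs) - 1, 0, -1):
            path.insert(0, backptr[t][path[0]])
        return v[-1][best], path

    print(viterbi(["a", "b", "b", "c"]))

The same dynamic-programming recurrence, run in log space to avoid underflow, is what scales up to the phone-level HMMs used in real recognizers.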

Prerequisites: Strictly Required: programming ability, a class in phonetics, and some probability theory (the probability theory can be acquired from the presession Math Refresher course). Recommended: any basic introduction to computational linguistics or to artificial intelligence.

Required Presession Courses: Mathematics Refresher for Computational Linguistics or equivalent
Required Work

  • Homeworks: three homework assignments. You may work together, and for Homework 3 you may use any programming language you want.

  • Readings: To be read before the class period in which they are discussed. THERE IS A LOT OF READING IN THIS COURSE! We are covering what are really two entire fields (speech recognition and speech synthesis) in 7 lectures, and not everything in the readings can be covered in lecture, so you need to do all the reading.

  • Determination of final grade:
    • 75%: 3 homeworks (25% each)
    • 25%: class participation


SCHEDULE

Thu July 6
  Slides: Lec 1 (ppt), Lec 1 (6-up pdf)
  Topic: Overview of Course, Intro to Probability Theory, and ASR Background: N-gram Language Modeling

Mon July 9 (HW 1 due)
  Slides: Lec 2 (ppt), Lec 2 (6-up pdf)
  Topic: TTS: Background (part-of-speech tagging, machine learning, classification, NLP) and Text Normalization

Thu July 12
  Slides: Lec 3 (ppt), Lec 3 (6-up pdf)
  Topic: TTS: Grapheme-to-Phoneme Conversion, Prosody (Intonation, Boundaries, and Duration), and the Festival software

Mon July 16
  Slides: Lec 4 (ppt), Lec 4 (6-up pdf)
  Topic: TTS: Waveform Synthesis (Diphone and Unit Selection Synthesis)

Thu July 19 (HW 2 due)
  Slides: Lec 5 (ppt), Lec 5 (6-up pdf)
  Topic: ASR: Noisy Channel Model, Bayes, HMMs, Forward, Viterbi

Mon July 23
  Slides: Lec 6 (ppt), Lec 6 (6-up pdf)
  Topic: ASR: Feature Extraction and Acoustic Modeling, Evaluation

Thu July 26 (HW 3 due)
  Slides: Lec 7 (ppt), Lec 7 (6-up pdf)
  Topic: ASR: Learning (Baum-Welch) and Disfluencies
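
To give a concrete flavor of the N-gram language modeling and perplexity material from Lecture 1, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing evaluated on a held-out sentence. The toy corpus and the <s>/</s> boundary markers are illustrative assumptions, not course data:

    import math
    from collections import Counter

    corpus = [
        "<s> i want chinese food </s>".split(),
        "<s> i want english food </s>".split(),
        "<s> i like food </s>".split(),
    ]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter((sent[i], sent[i + 1])
                      for sent in corpus for i in range(len(sent) - 1))
    V = len(unigrams)  # vocabulary size, used for add-one smoothing

    def bigram_prob(prev, word):
        # Add-one smoothing: (c(prev, word) + 1) / (c(prev) + V).
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    def perplexity(sentence):
        # PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})).
        log_prob = sum(math.log(bigram_prob(sentence[i], sentence[i + 1]))
                       for i in range(len(sentence) - 1))
        return math.exp(-log_prob / (len(sentence) - 1))

    print(perplexity("<s> i want food </s>".split()))

Lower perplexity means the model is less surprised by the held-out sentence.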