LSA 352 Homework 3: Summer 2007
Homework 3: Viterbi in ASR |
Due: Thursday July 26 before class. |
WARNING: Please read this entire page before you start!!!!!
Today's homework, which Bryan Pellom very kindly created for me a few years ago (Bryan also created Sonic), is to decode a mystery sequence of digits that are originally from a speech file.
You are going to do this by implementing the Viterbi decoding algorithm, and applying it to a file we will give you that contains phone likelihoods from GMMs, together with a lexicon.
The output of your program should be the correct word sequence and the path probability of the most-likely path.
What you should turn in: the correct word sequence, the path probability of the most-likely path, and your code. Please submit your code as an attachment, and not as plain text inside the body of the email.
You may write your code in any language you want.
You will need the following three input files:
The lexicon file just has the string of phones for the pronunciation of each word, separated by spaces, with a pound-sign (#) at the end of each line. Notice that there is a special phone for silence, called SIL:
silence SIL # oh OW # zero Z IH R OW # one W AH N # two T UW # three TH R IY # four F AO R # five F AY V # six S IH K S # seven S EH V AX N # eight EY TD # nine N AY N #
The phone file gives the set of phones in order. The phone set is a slightly different version of the ARPAbet than the one we used for Festival, but almost all the phones that are used in this homework are identical to what you've seen.
The likelihoods file has the output from the Gaussian models in the following format:
FRAME PHONE STATE LOG P(O|Q,state) 0 AA 0 -48.804 0 AA 1 -50.494 0 AA 2 -50.240 0 AE 0 -47.295 0 AE 1 -49.929 0 AE 2 -46.059 0 AH 0 -51.344 0 AH 1 -50.410 0 AH 2 -47.449 0 AO 0 -54.502 ... 1 ZH 1 -46.419 1 ZH 2 -48.539 2 AA 0 -52.124 ... 557 ZH 2 -59.244
The first number is the frame number; there are 558 frames in this utterance. Since frames are every 10 milliseconds, this means that the utterances was 5.58 seconds long. The second symbol is the phonetic unit. The third number is the state (this is a three-state HMM). The last number is the log-probability (base 10, in case you're interested) of observing the feature vector given the base phoneme and HMM state. Recall that the two reasons we use log probabilities are to avoid underflow and because adding is faster than multiplying.
Your Viterbi implementation will need to read in the observations file, the phone file, and the lexicon, create a Viterbi_prob matrix, and then fill it out from left to right. You will need to keep an array of back-pointers so when the matrix is complete, you can start from the end and find the most likely sequence of words.
Your output should look something like this:
The best path probability is -500000.00 The best path contains the following words: seven three four etc etc (or whatever)
Hint: In order to get the correct answer, you will need to add a word transition penalty of -50 to each path every time it transitions from word to word.
Hint 2: The HMM structure you should assume for each word is technically a "left-to-right non-skip self-loop 3-state HMM". That means that each phone has exactly three substates, that you can't skip any of these states, and that they all have a self-loop. In addition, I assume all the state transition probabilities are 1.0. Technically this means that they aren't true probabilities (if they were true probabilities, they would have to be 0.5, since the two arcs leaving each state must sum to 1). But in practice, since all that matters is that they are all equal, it makes things simpler to just assign everything a probability of 1, since the log of 1 is 0; that means you can just ignore the probabilities altogether, except for the word transition penalty. Anyhow here's an example for the word "two":
Hint 3: You'll need to keep track, for each node of your lattice, which state (subphone) of which model (phone) of which word it corresponds to. This is automatically available from the row number in the viterbi lattice. There are exactly three transition cases you need to consider in your Viterbi algorithm as you move from one frame (column) to the next
Hint 4: Some frequently asked questions for this homework:
Hint 5:
If you would like to check that your code is correct, we have provided
an additional output file, available here which
you may run your code on. The correct answer for this file is:
silence
three
one
silence
one
five
eight
two
six
five
six
one
three
six
seven
five
silence
silence
silence
silence
log probability: -19202.51304840347
NOTE: Please note that in order to match this output, you must force your Viterbi implementation to end in an end state (instead of ending mid-word).