STANFORD LSA 352 Homework 3: Summer 2007
Homework 3: Viterbi in ASR
Due: Thursday July 26 before class.

WARNING: Please read this entire page before you start!!!!!

Today's homework, which Bryan Pellom very kindly created for me a few years ago (Bryan also created Sonic), is to decode a mystery sequence of digits that are originally from a speech file.

You are going to do this by implementing the Viterbi decoding algorithm, and applying it to a file we will give you that contains phone likelihoods from GMMs, together with a lexicon.

The output of your program should be the correct word sequence and the path probability of the most-likely path.

What you should turn in: the correct word sequence, the path probability of the most-likely path, and your code. Please submit your code as an attachment, and not as plain text inside the body of the email.

You may write your code in any language you want.

You will need the following three input files:

  1. The lexicon file
  2. The phone set for this exercise
  3. The phone likelihoods

The lexicon file just has the string of phones for the pronunciation of each word, separated by spaces, with a pound-sign (#) at the end of each line. Notice that there is a special phone for silence, called SIL:

silence         SIL #
oh              OW #
zero            Z IH R OW #
one             W AH N #
two             T UW #
three           TH R IY #
four            F AO R #
five            F AY V #
six             S IH K S #
seven           S EH V AX N #
eight           EY TD #
nine            N AY N #

The phone file gives the set of phones in order. The phone set is a slightly different version of the ARPAbet than the one we used for Festival, but almost all the phones that are used in this homework are identical to what you've seen.

The likelihoods file has the output from the Gaussian models in the following format:

    0   AA      0    -48.804
    0   AA      1    -50.494
    0   AA      2    -50.240
    0   AE      0    -47.295
    0   AE      1    -49.929
    0   AE      2    -46.059
    0   AH      0    -51.344
    0   AH      1    -50.410
    0   AH      2    -47.449
    0   AO      0    -54.502
    1   ZH      1    -46.419
    1   ZH      2    -48.539
    2   AA      0    -52.124
  557   ZH      2    -59.244

The first number is the frame number; there are 558 frames in this utterance. Since frames are every 10 milliseconds, this means that the utterances was 5.58 seconds long. The second symbol is the phonetic unit. The third number is the state (this is a three-state HMM). The last number is the log-probability (base 10, in case you're interested) of observing the feature vector given the base phoneme and HMM state. Recall that the two reasons we use log probabilities are to avoid underflow and because adding is faster than multiplying.

Your Viterbi implementation will need to read in the observations file, the phone file, and the lexicon, create a Viterbi_prob matrix, and then fill it out from left to right. You will need to keep an array of back-pointers so when the matrix is complete, you can start from the end and find the most likely sequence of words.

Your output should look something like this:

The best path probability is -500000.00
The best path contains the following words:  seven three four etc etc  (or whatever)

Hint: In order to get the correct answer, you will need to add a word transition penalty of -50 to each path every time it transitions from word to word.

Hint 2: The HMM structure you should assume for each word is technically a "left-to-right non-skip self-loop 3-state HMM". That means that each phone has exactly three substates, that you can't skip any of these states, and that they all have a self-loop. In addition, I assume all the state transition probabilities are 1.0. Technically this means that they aren't true probabilities (if they were true probabilities, they would have to be 0.5, since the two arcs leaving each state must sum to 1). But in practice, since all that matters is that they are all equal, it makes things simpler to just assign everything a probability of 1, since the log of 1 is 0; that means you can just ignore the probabilities altogether, except for the word transition penalty. Anyhow here's an example for the word "two":

Hint 3: You'll need to keep track, for each node of your lattice, which state (subphone) of which model (phone) of which word it corresponds to. This is automatically available from the row number in the viterbi lattice. There are exactly three transition cases you need to consider in your Viterbi algorithm as you move from one frame (column) to the next

  1. Staying in the same state
  2. (only if you are not in the last state of the current word) Transitioning to the next state in the same word
  3. (only if you are in the last state of the current word) Transitioning to the first state of every possible word.

Hint 4: Some frequently asked questions for this homework:

  • Q: should we create separate HMMs for each word, or one big HMM?
  • A: Mathematically it's one big HMM, but as mentioned above you need to represent the separate words in your state column in your lattice, since you need to know which state of which word you are in.

  • Q: Is the "f" of "four" the same state as the "f" of "five"?
  • A: No. You keep a distinct state for each phone of each word. Those two 'f's WILL share the same observation likelihood, i.e. the same "b" matrix. BUT they will have different viterbi scores because they correspond to different rows in the Viterbi lattice, and will be in different paths.

    Hint 5: If you would like to check that your code is correct, we have provided an additional output file, available here which you may run your code on. The correct answer for this file is:


    log probability: -19202.51304840347

    NOTE: Please note that in order to match this output, you must force your Viterbi implementation to end in an end state (instead of ending mid-word).