Articulatory feature recognition using dynamic Bayesian networks

Joe Frankel
Centre for Speech Technology Research, University of Edinburgh

Abstract

An ongoing project at the University of Edinburgh aims to build a speech recognition system in which a set of multi-level discrete articulatory features (AFs), rather than phones, mediates between words and acoustic observations.

The motivation for this approach is to use a state representation tailored to characterizing the variation present in natural speech, variation which arises from the asynchronous, overlapping nature of speech production.
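To make the representation concrete, the sketch below shows one hypothetical way a phone segmentation might be expanded into parallel AF label streams. The feature names and values are illustrative assumptions chosen for this example, not the inventory used in the work described here.

    # Hypothetical multi-level discrete AF representation: the feature
    # names and values below are assumptions for illustration only.
    PHONE_TO_AF = {
        "s":  {"manner": "fricative", "place": "alveolar", "voicing": "voiceless"},
        "iy": {"manner": "vowel",     "place": "front",    "voicing": "voiced"},
        "n":  {"manner": "nasal",     "place": "alveolar", "voicing": "voiced"},
    }

    def af_label_streams(phones):
        """Expand a phone segmentation into parallel AF label streams."""
        streams = {feat: [] for feat in ("manner", "place", "voicing")}
        for phone in phones:
            for feat, value in PHONE_TO_AF[phone].items():
                streams[feat].append(value)
        return streams

    # Labels derived this way change synchronously at every phone boundary;
    # in natural speech the streams may desynchronize (e.g. voicing can
    # switch before place does), which the model described below aims to capture.
    print(af_label_streams(["s", "iy", "n"]))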

In this talk I describe work to date, which has largely focused on developing articulatory feature recognition using dynamic Bayesian networks (DBNs). A DBN approach allows us to build a model which incorporates dependencies between feature streams. The model is initialized by training on AF labels derived from a time-aligned phone segmentation; an embedded training scheme is then applied, allowing a set of asynchronous feature changes to be learned in a data-driven manner.
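One common way to realise inter-feature dependencies in a DBN is to condition one stream's transition on another stream's current value. The sketch below, a minimal illustration rather than the actual system, decodes two such coupled streams with a Viterbi search over the product state space, assuming the joint transition factorises as P(f1_t, f2_t | f1_{t-1}, f2_{t-1}) = P(f1_t | f1_{t-1}) P(f2_t | f2_{t-1}, f1_t); the stream sizes, probability tables, and observation scores are toy values.

    import numpy as np

    rng = np.random.default_rng(0)
    N1, N2, T = 3, 2, 5  # states per feature stream, number of frames

    def norm(a, axis):
        return a / a.sum(axis=axis, keepdims=True)

    A1 = norm(rng.random((N1, N1)), axis=1)      # P(f1_t | f1_{t-1})
    A2 = norm(rng.random((N2, N1, N2)), axis=2)  # P(f2_t | f2_{t-1}, f1_t)
    obs = rng.random((T, N1, N2))                # toy P(o_t | f1_t, f2_t)

    # trans[i, j, k, l] = log P(f1_t=k | f1_{t-1}=i)
    #                   + log P(f2_t=l | f2_{t-1}=j, f1_t=k)
    trans = np.log(A1)[:, None, :, None] + np.log(A2)[None, :, :, :]

    # Viterbi search over the product state space (f1, f2).
    delta = np.log(obs[0]) - np.log(N1 * N2)     # uniform initial distribution
    back = np.zeros((T, N1, N2, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, :, None, None] + trans      # indexed (i, j, k, l)
        flat = scores.reshape(N1 * N2, N1, N2)
        best = flat.argmax(axis=0)                    # best joint predecessor
        back[t] = np.stack(np.unravel_index(best, (N1, N2)), axis=-1)
        delta = flat.max(axis=0) + np.log(obs[t])

    # Backtrace the jointly most likely AF state sequence.
    path = [np.unravel_index(delta.argmax(), (N1, N2))]
    for t in range(T - 1, 0, -1):
        path.append(tuple(back[t][path[-1]]))
    path.reverse()
    print(path)  # one (f1, f2) state pair per frame

Because the search runs over the product state space, each stream can change value at a different frame, which is what makes asynchronous feature changes representable in the first place.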

I will present the results of AF recognition experiments on the OGI Numbers corpus, which demonstrate that modelling inter-feature dependencies improves performance and that the embedded training scheme reduces the dependence on phone-derived articulatory feature labels. Finally, I will discuss future directions and recent developments.