Interspeech practice session

If you have a question about this talk, please contact Rogier van Dalen.

12:00 – 12:40 Oral presentations

  • Shakti P. Rath (with Daniel Povey, Karel Vesely, and Jan Cernocky), Improved Feature Processing for Deep Neural Networks
  • Yongqiang (Eric) Wang (with Mark Gales), An Explicit Independence Constraint for Factorised Adaptation in Speech Recognition

12:40 – 13:30 Posters and sandwiches:

  • Pierre Lanchantin, Improving Lightly Supervised Training for Broadcast Transcription
  • Jingzhou (Justin) Yang (with Rogier van Dalen and Mark Gales), Infinite Support Vector Machines in Speech Recognition


Shakti P. Rath, Daniel Povey, Karel Vesely, Jan Cernocky

Improved Feature Processing for Deep Neural Networks

In this paper, we investigate alternative ways of processing MFCC-based features to use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on top of these features using feature-space MLLR is helpful. The fact that the number of parameters of a DNN is not strongly sensitive to the input feature dimension (unlike GMM-based systems) motivated us to investigate ways to increase the dimension of the features. In this paper, we investigate several approaches to derive higher-dimensional features and verify their performance with DNNs. Our best result is obtained from splicing our baseline 40-dimensional speaker-adapted features again across 9 frames, followed by reducing the dimension to 200 or 300 using another LDA. Our final result is about 3% absolute better than our best GMM system, which is a discriminatively trained model.
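
The dimension bookkeeping in this pipeline can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the MFCCs are synthetic, the matrices `lda_mllt` and `second_lda` are random placeholders standing in for estimated LDA and MLLT transforms, the helper `splice` is hypothetical, and the feature-space MLLR step is omitted.

```python
import numpy as np

def splice(frames, context=4):
    """Stack each frame with `context` neighbours on each side.

    Edge frames are repeated as padding, so the number of frames is unchanged
    and the dimension grows by a factor of 2 * context + 1.
    """
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + num_frames] for i in range(2 * context + 1)])

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))         # synthetic: 100 frames of 13-dim MFCCs

spliced = splice(mfcc)                        # 9 frames x 13 dims = 117 dims
lda_mllt = rng.standard_normal((117, 40))     # placeholder for the LDA + MLLT transform
baseline = spliced @ lda_mllt                 # 40-dim baseline features

respliced = splice(baseline)                  # 9 frames x 40 dims = 360 dims
second_lda = rng.standard_normal((360, 200))  # placeholder for the second LDA
final = respliced @ second_lda                # 200-dim DNN input features
```

In a real system the two projection matrices would be estimated from labelled training data, and speaker adaptation would be applied to the 40-dimensional features before the second splicing.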

Yongqiang Wang and Mark Gales

An Explicit Independence Constraint for Factorised Adaptation in Speech Recognition

Speech signals are usually affected by multiple acoustic factors, such as speaker characteristics and environment differences. Usually, the combined effect of these factors is modelled by a single transform. Acoustic factorisation splits the transform into several factor transforms, each modelling only one factor. This allows, for example, estimating a speaker transform in one noise condition and applying the same speaker transform in a different noise condition. To achieve this factorisation, it is crucial to keep factor transforms independent of each other. Previous work on acoustic factorisation relies on using different forms of factor transforms and/or the attributes of the data to enforce this independence. In this work, the independence is formulated mathematically, and an explicit constraint is derived to enforce it. Using factorised cluster adaptive training (fCAT) as an application, experimental results demonstrate that the proposed explicit independence constraint helps factorisation when imbalanced adaptation data is used.
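
The idea of reusing a speaker transform across noise conditions can be illustrated with a toy additive model. This is a sketch under assumptions and not the fCAT formulation from the paper: the cluster matrices, the weight vectors, and the function `adapted_mean` are all hypothetical, and the independence constraint itself is not modelled.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3
canonical_mean = rng.standard_normal(dim)
speaker_clusters = rng.standard_normal((dim, 2))  # hypothetical speaker basis
noise_clusters = rng.standard_normal((dim, 2))    # hypothetical noise basis

def adapted_mean(speaker_weights, noise_weights):
    """Toy factorised adaptation: each factor contributes its own offset."""
    return (canonical_mean
            + speaker_clusters @ speaker_weights
            + noise_clusters @ noise_weights)

speaker_weights = np.array([0.7, 0.3])  # assumed to be estimated in noise condition A
noise_a = np.array([1.0, 0.0])
noise_b = np.array([0.2, 0.8])

mean_a = adapted_mean(speaker_weights, noise_a)
mean_b = adapted_mean(speaker_weights, noise_b)  # same speaker factor, new noise factor
```

In this toy model the difference between `mean_a` and `mean_b` depends only on the noise weights, which is the behaviour that factorisation aims for. With real data the factors are estimated jointly, so nothing guarantees this separation unless independence is enforced.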

Y. Long, M.J.F. Gales, P. Lanchantin, X. Liu, M.S. Seigel, P.C. Woodland

Improving Lightly Supervised Training for Broadcast Transcription

This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses decoding hypotheses that are derived automatically with a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be poor. To address this issue, word-level and segment-level combination approaches are applied to the lightly supervised transcripts and the original programme scripts, which yields improved transcriptions. Experimental results show that systems trained using these improved transcriptions consistently outperform those trained using only the original lightly supervised decoding hypotheses. This is shown to be the case for both the maximum likelihood and minimum phone error trained systems.

Jingzhou Yang, Rogier van Dalen, Mark Gales

Infinite Support Vector Machines in Speech Recognition

Generative feature spaces provide an elegant way to apply discriminative models in speech recognition, and system performance has been improved by adapting this framework. However, the classes in the feature space may not be linearly separable. Applying a linear classifier then limits performance. Instead of a single classifier, this paper applies a mixture of experts. This model trains different classifiers as experts focusing on different regions of the feature space. However, the number of experts is not known in advance. This problem can be bypassed by employing a Bayesian non-parametric model. In this paper, a specific mixture of experts based on the Dirichlet process, namely the infinite support vector machine, is studied. Experiments conducted on the noise-corrupted continuous digit task AURORA 2 show the advantages of this Bayesian non-parametric approach.
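
How a Dirichlet process prior avoids fixing the number of experts in advance can be sketched with the Chinese restaurant process, a standard way of sampling partitions from such a prior. This is an illustration only: it draws expert assignments from the prior and does not train any support vector machines. The function `sample_crp` and the concentration value `alpha=1.0` are assumptions made for the example.

```python
import numpy as np

def sample_crp(num_points, alpha, rng):
    """Sequentially assign points to experts under a Chinese restaurant process.

    Each point joins an existing expert with probability proportional to that
    expert's current size, or opens a new expert with probability proportional
    to the concentration parameter `alpha`.
    """
    counts = []        # number of points assigned to each expert so far
    assignments = []
    for _ in range(num_points):
        weights = np.array(counts + [alpha], dtype=float)
        expert = int(rng.choice(len(weights), p=weights / weights.sum()))
        if expert == len(counts):
            counts.append(1)       # open a new expert
        else:
            counts[expert] += 1    # join an existing expert
        assignments.append(expert)
    return assignments

rng = np.random.default_rng(0)
assignments = sample_crp(200, alpha=1.0, rng=rng)
num_experts = max(assignments) + 1  # determined by the sample, not set beforehand
```

Larger values of `alpha` tend to produce more experts. In a full model, the assignments would also depend on how well each expert classifies the data in its region.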

This talk is part of the CUED Speech Group Friday Lunchtime Series.

© 2006-2014, University of Cambridge.