
Selection of Talks from Interspeech 2020


If you have a question about this talk, please contact Kate Knill.

Seminar on Zoom

The first seminar of Lent Term will consist of three presentations of papers from Interspeech 2020:

  • Spoken Language ‘Grammatical Error Correction’, Yiting ‘Edie’ Lu
  • Universal Adversarial Attacks on Spoken Language Assessment Systems, Vyas Raina
  • Attention Forcing for Speech Synthesis, Qingyun Dou

‘Spoken Language Grammatical Error Correction’, Yiting ‘Edie’ Lu, Mark J.F. Gales, Yu Wang

Spoken language ‘grammatical error correction’ (GEC) is an important mechanism to help learners of a foreign language, here English, improve their spoken grammar. GEC is challenging for non-native spoken language due to interruptions from disfluent speech events such as repetitions and false starts, and due to issues in strictly defining what is acceptable in spoken language. Furthermore, there is little labelled data to train models. One way to mitigate the impact of speech events is to use a disfluency detection (DD) model. Removing the detected disfluencies converts the speech transcript to be closer to written language, which has significantly more labelled training data. This paper considers two types of approaches to leveraging DD models to boost spoken GEC performance. One is sequential: a separately trained DD model acts as a pre-processing module providing a more structured input to the GEC model. The second approach is to train DD and GEC models in an end-to-end fashion, simultaneously optimising both modules. Embeddings enable end-to-end models to have a richer information flow. Experimental results show that DD effectively regulates GEC input; end-to-end training works well when fine-tuned on limited labelled in-domain data; and improving DD by incorporating acoustic information helps improve spoken GEC.
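The sequential approach described above can be sketched in a few lines: a DD model labels tokens, detected disfluencies are removed, and the cleaned transcript is passed to the GEC model. This is a minimal illustration only; both model functions here are naive stand-ins, not the authors' trained systems.

```python
def detect_disfluencies(tokens):
    """Stand-in DD model: returns a per-token disfluency flag.
    Here it naively flags immediate word repetitions ('I I went');
    a real DD model would be a trained sequence labeller."""
    return [i > 0 and tok == tokens[i - 1] for i, tok in enumerate(tokens)]

def gec_correct(tokens):
    """Stand-in GEC model: a real system would be a trained
    sequence-to-sequence model; here it returns its input unchanged."""
    return tokens

def sequential_dd_gec(transcript):
    """Sequential pipeline: DD pre-processing, then GEC."""
    tokens = transcript.split()
    flags = detect_disfluencies(tokens)
    fluent = [t for t, disfluent in zip(tokens, flags) if not disfluent]
    return " ".join(gec_correct(fluent))

print(sequential_dd_gec("I I went to to the park"))  # -> "I went to the park"
```

The end-to-end variant in the paper instead trains both modules jointly, passing embeddings rather than a hard token deletion between them.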

Universal Adversarial Attacks on Spoken Language Assessment Systems, Vyas Raina, Mark J.F. Gales, Kate M. Knill

There is an increasing demand for automated spoken language assessment (SLA) systems, partly driven by the performance improvements that have come from deep learning based approaches. One aspect of deep learning systems is that they do not require expert-derived features, operating directly on the original signal such as an automatic speech recognition (ASR) transcript. This, however, increases their potential susceptibility to adversarial attacks as a form of candidate malpractice. In this paper the sensitivity of SLA systems to a universal black-box attack on the ASR text output is explored. The aim is to obtain a single, universal phrase to maximally increase any candidate’s score. Four approaches to detect such adversarial attacks are also described. All the systems, and associated detection approaches, are evaluated on a free (spontaneous) speaking section from a Business English test. It is shown that on deep learning based SLA systems the average candidate score can be increased by almost one grade level using a single six-word phrase appended to the end of the response hypothesis. Although these large gains can be obtained, they can be easily detected based on shifts from the scores of a “traditional” Gaussian Process based grader.
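The attack and its detection can be illustrated schematically: a fixed universal phrase is appended to every response hypothesis to inflate the deep grader's score, and a large gap between the deep grader and a "traditional" Gaussian-Process-style grader flags the attack. Everything below is a hypothetical sketch; the graders and the phrase are illustrative stand-ins, not the paper's systems or its actual adversarial phrase.

```python
# Hypothetical universal suffix (the real phrase is not given here).
UNIVERSAL_PHRASE = "hypothetical six word adversarial suffix here"

def deep_grader(text):
    """Stand-in deep SLA grader, assumed vulnerable to the suffix."""
    score = min(6.0, 0.02 * len(text.split()))
    if UNIVERSAL_PHRASE in text:
        score = min(6.0, score + 1.0)  # roughly one grade level gained
    return score

def gp_grader(text):
    """Stand-in 'traditional' Gaussian Process based grader,
    assumed robust to the universal phrase."""
    return min(6.0, 0.02 * len(text.split()))

def attack(response):
    """Universal black-box attack: append the fixed phrase."""
    return response + " " + UNIVERSAL_PHRASE

def is_adversarial(response, threshold=0.5):
    """Detection by the shift between the two graders' scores."""
    return abs(deep_grader(response) - gp_grader(response)) > threshold

clean = "the candidate talks about a business plan " * 10
print(deep_grader(clean), deep_grader(attack(clean)))
print(is_adversarial(clean), is_adversarial(attack(clean)))
```

The design point is that the attacker needs no access to model internals (black-box), while the defender exploits the fact that a structurally different grader does not share the deep model's vulnerability.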

‘Attention Forcing for Speech Synthesis’, Qingyun Dou, Joshua Efiong, Mark J.F. Gales

Auto-regressive sequence-to-sequence models with attention mechanisms have achieved state-of-the-art performance in various tasks including speech synthesis. Training these models can be difficult. The standard approach guides a model with the reference output history during training. However during synthesis the generated output history must be used. This mismatch can impact performance. Several approaches have been proposed to handle this, normally by selectively using the generated output history. To make training stable, these approaches often require a heuristic schedule or an auxiliary classifier. This paper introduces attention forcing, which guides the model with the generated output history and reference attention. This approach reduces the training-evaluation mismatch without the need for a schedule or a classifier. Additionally, for standard training approaches, the frame rate is often reduced to prevent models from copying the output history. As attention forcing does not feed the reference output history to the model, it allows using a higher frame rate, which improves the speech quality. Finally, attention forcing allows the model to generate output sequences aligned with the references, which is important for some down-stream tasks such as training neural vocoders. Experiments show that attention forcing allows doubling the frame rate, and yields significant gain in speech quality.
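The contrast between the two training regimes can be sketched as follows: teacher forcing feeds the reference output history to the decoder, while attention forcing feeds the model's own generated history (in the paper, attention is additionally guided by the reference alignment, which is omitted in this toy). The "model" below is a stand-in, not the authors' synthesis system.

```python
def toy_model_step(history):
    """Stand-in decoder step: predicts the last history frame plus an
    offset. A real model would be an attention-based neural decoder."""
    return history[-1] + 0.1

def decode(reference, use_generated_history):
    """Run one pass over a reference sequence, choosing which output
    history the decoder sees at each step."""
    outputs, history = [], [reference[0]]
    for t in range(1, len(reference)):
        y = toy_model_step(history)
        outputs.append(y)
        if use_generated_history:
            history.append(y)             # attention forcing style
        else:
            history.append(reference[t])  # teacher forcing style
    return outputs

ref = [0.0, 1.0, 2.0, 3.0]
teacher = decode(ref, use_generated_history=False)
attn = decode(ref, use_generated_history=True)
print(teacher)  # each step conditioned on reference frames
print(attn)     # each step conditioned on the model's own outputs
```

The training-evaluation mismatch discussed above is visible here: under teacher forcing the history tracks the reference, a condition that never holds at synthesis time, whereas feeding the generated history matches the inference-time setting.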

This talk is part of the CUED Speech Group Seminars series.



© 2006-2021, University of Cambridge.