The short version
Are deep nets good brain models? Psychometric testing of automatic speech recognition systems shows they're not like humans, yet. Newer ones are getting closer, but still ignore important cues. We're excited about the huge performance leaps in vision and hearing DNNs, and how they're being used as models of the visual and auditory system (e.g. Kell et al. 2018, https://doi.org/10.1016/j.neuron.2018.03.044; Baby et al. 2021). But do these models work like people?
TL;DR: Automatic speech recognisers are less robust than people to distorted speech, and seem to use different cues (particularly temporal fine structure and periodicity). The most recent model we tested - Facebook's Wav2Vec - was the closest to humans, and also the best overall. Wav2Vec was also the only model trained end-to-end, without a hand-designed speech recognition front-end (MFCC features). This raises questions about the validity of these front-ends and is a hopeful sign for future developments!
So although current models are not good enough to be used as proxies for the auditory system, we expect this to change as these models improve. We provide our benchmarks as an open source library HumanlikeHearing to make it easy to test future systems.
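To give a flavour of what the benchmarks involve, here's a rough sketch of the basic pattern: run an ASR system on the same sentences at each level of a distortion and trace out a psychometric curve. This is not the actual HumanlikeHearing API, just an illustration of the loop it automates; `recognise` and `distort` stand in for whichever system and manipulation you plug in.

```python
# Sketch of a psychometric test loop: word error rate as a function of
# distortion level. `recognise` and `distort` are placeholders for an ASR
# system and a distortion; this is not the HumanlikeHearing API itself,
# just the general pattern such benchmarks automate.
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(ref), 1)

def psychometric_curve(sentences, recognise, distort, levels):
    """Mean word error rate at each distortion level (e.g. SNR in dB)."""
    curve = []
    for level in levels:
        errors = [word_error_rate(text, recognise(distort(audio, level)))
                  for audio, text in sentences]
        curve.append(np.mean(errors))
    return curve
```

The human psychophysics literature reports intelligibility in exactly this form, which is what lets us put people and machines on the same axes.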
Now, let's get on to some of the gory details. There's a lot more in the paper if you're a sucker for punishment.
We tested three recent automatic speech recognition (ASR) systems on a range of psychometric tests designed for humans, to compare overall performance, patterns of errors, and work out which auditory cues they were using (think texture vs shape for vision).
The ASR systems we tested were:
- Zurow's Kaldi nnet3, a DNN-HMM hybrid
- Mozilla DeepSpeech, based on LSTMs
- Facebook Wav2Vec, a CNN-Transformer model
The first two use a standard speech recognition front-end (MFCCs), while Wav2Vec is trained end-to-end directly on the waveform (this will matter later).
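To make the front-end distinction concrete, here is roughly what an MFCC front-end reduces the waveform to, using librosa as a stand-in (the actual systems use their own implementations and settings; the filename and parameter values are just placeholders):

```python
# Hand-designed front-end: mel-frequency cepstral coefficients (MFCCs).
# The exact settings differ between Kaldi and DeepSpeech; these are just
# typical values to show what the waveform gets reduced to.
import librosa

waveform, sample_rate = librosa.load("sentence.wav", sr=16000)
mfccs = librosa.feature.mfcc(
    y=waveform,
    sr=sample_rate,
    n_mfcc=13,        # keep only 13 coefficients per frame
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms hop between frames
)
# Shape (13, n_frames): a heavily compressed summary of the spectral
# envelope, with temporal fine structure largely discarded.
print(mfccs.shape)
```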
First up: how well do they work with a reduced frequency range? Answer: not well. Humans reach ceiling performance with a band of just 12 semitones around 1.5 kHz, while even the best ASR needed around 40 semitones. Note that the CNN-Transformer did best here: you'll see that again.
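For illustration (not the exact stimuli from the paper), bandpass filtering speech to a band of a given width in semitones, geometrically centred on 1.5 kHz, might look like this:

```python
# Bandpass speech to a band of `width_semitones` centred (geometrically)
# on `centre_hz`. Illustrative only; the actual stimuli follow the original
# psychophysics studies, not necessarily this exact filter.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_semitones(x, fs, centre_hz=1500.0, width_semitones=12.0, order=6):
    half_octaves = (width_semitones / 2.0) / 12.0   # 12 semitones = 1 octave
    low = centre_hz / 2.0 ** half_octaves
    high = centre_hz * 2.0 ** half_octaves
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```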
Next up: peak and centre clipping. Peak clipping is what happens when your microphone saturates; centre clipping can be introduced by some noise suppression systems. The ASRs all perform badly with peak clipping, but the CNN-Transformer and DNN-HMM match human results really well for centre clipping.
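Both clipping distortions are a couple of lines of numpy. This uses one common definition of centre clipping; in the actual tests the thresholds are set relative to each sentence's amplitude distribution, here they are just taken as given:

```python
import numpy as np

def peak_clip(x, threshold):
    """Saturating distortion: everything above the threshold is flattened."""
    return np.clip(x, -threshold, threshold)

def centre_clip(x, threshold):
    """Samples below the threshold are zeroed (as crude noise suppression
    might do); larger samples are shifted towards zero by the threshold."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)
```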
We looked at how ASR systems use spectral and temporal modulations, following Elliott and Theunissen's method for filtering out particular modulations. The ASRs are overall less robust (we had to use a higher SNR to get comparable results), but they seem to use these modulations in a similar way to humans.
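Very roughly, the manipulation filters the 2D Fourier transform of the log spectrogram: zero out a range of temporal or spectral modulation frequencies, then resynthesise. Here is a heavily simplified sketch of a temporal-modulation notch; the published procedure is more careful about resynthesis, so treat this as illustration only:

```python
# Notch out a band of temporal modulation frequencies from a spectrogram
# and resynthesise with the original phase. Heavily simplified compared to
# the published modulation-filtering procedure; for illustration only.
import numpy as np
from scipy.signal import stft, istft

def filter_temporal_modulations(x, fs, lo_hz=3.0, hi_hz=7.0,
                                nperseg=512, noverlap=384):
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    log_mag = np.log(np.abs(Z) + 1e-10)
    phase = np.angle(Z)

    # 2D FFT of the log spectrogram; axis 1 is time, so frequencies along
    # that axis are temporal modulation rates in Hz.
    M = np.fft.fft2(log_mag)
    frame_rate = fs / (nperseg - noverlap)
    wt = np.abs(np.fft.fftfreq(log_mag.shape[1], d=1.0 / frame_rate))

    # Zero the chosen band of temporal modulations (at every spectral
    # modulation), then invert.
    M[:, (wt >= lo_hz) & (wt <= hi_hz)] = 0.0
    filtered_log_mag = np.real(np.fft.ifft2(M))

    Z_filtered = np.exp(filtered_log_mag) * np.exp(1j * phase)
    _, y = istft(Z_filtered, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```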
Sounds have slow (envelope) and fast (temporal fine structure, TFS) components. TFS has been suggested to be important for hearing in noisy environments. We expected the end-to-end CNN-Transformer might use it better than the MFCC-based systems, which mostly discard TFS, but no.
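The envelope/TFS split is usually done per frequency band with the Hilbert transform; a minimal single-band sketch:

```python
# Split a (band-limited) signal into its slow envelope and its temporal
# fine structure (TFS) using the analytic signal. In the actual tests this
# is done separately within each of several frequency bands.
import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(band):
    analytic = hilbert(band)
    envelope = np.abs(analytic)        # slow amplitude fluctuations
    tfs = np.cos(np.angle(analytic))   # rapid oscillations, unit amplitude
    return envelope, tfs

# envelope * tfs approximately reconstructs the original band; swapping or
# degrading one component tests how much a listener relies on it.
```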
Similarly, none of the ASR systems seem to use periodicity information in the same way as humans. They aren't as robust to distortion and don't show the same patterns of errors for different distortions.
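A standard way of taking periodicity (and TFS) away while keeping the band envelopes is noise vocoding. A rough sketch of the idea, not necessarily the exact stimuli used in the tests:

```python
# Noise vocoding: keep the slow envelope in each frequency band but replace
# the carrier with noise, removing periodicity cues. Illustrative sketch.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_bands=8, f_lo=100.0, f_hi=7000.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(x))
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envelope = np.abs(hilbert(band))
        carrier = sosfiltfilt(sos, noise)
        carrier /= np.sqrt(np.mean(carrier ** 2)) + 1e-10  # unit-RMS carrier
        out += envelope * carrier
    return out
```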
One of the big challenges for both humans and ASR systems is handling competing talkers. Although they perform less well, requiring higher SNRs, the DNN-HMM and CNN-Transformer show a similar pattern to humans, "glimpsing" signals in dips in the noise.
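The masking tests come down to mixing speech and a masker at a controlled SNR; the dips come from using a fluctuating masker (another talker, or amplitude-modulated noise) rather than a steady one. A small sketch of that mixing step:

```python
# Mix speech with a masker at a target SNR (in dB). A fluctuating masker
# (another talker, or amplitude-modulated noise) leaves dips that listeners
# can "glimpse" the speech through.
import numpy as np

def mix_at_snr(speech, masker, snr_db):
    masker = masker[:len(speech)]
    speech_rms = np.sqrt(np.mean(speech ** 2))
    masker_rms = np.sqrt(np.mean(masker ** 2)) + 1e-10
    gain = speech_rms / (masker_rms * 10.0 ** (snr_db / 20.0))
    return speech + gain * masker

def modulated_noise(n_samples, fs, rate_hz=8.0, depth=1.0, seed=0):
    """Steady noise multiplied by a slow sinusoidal envelope."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / fs
    envelope = 1.0 + depth * np.sin(2.0 * np.pi * rate_hz * t)
    return envelope * rng.standard_normal(n_samples)
```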
Digging deeper into this, humans can benefit from both periodicity and fluctuations in the masking noise. Of the ASR systems, only the CNN-Transformer benefits from both, and it shows a somewhat similar trend to humans.
Finally, you might ask whether these comparisons are fair, given that these models are all trained on clean speech. We fine-tuned the CNN-Transformer on bandpass-filtered speech: it improved performance on that test, but made noise robustness worse.
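For concreteness, here is roughly what a fine-tuning step on a distortion looks like, sketched with the HuggingFace wav2vec 2.0 implementation rather than the exact pipeline from the paper; `bandpass_semitones` is the illustrative filter from earlier, and all parameter values are placeholders.

```python
# One CTC fine-tuning step on bandpass-filtered speech, using HuggingFace's
# wav2vec 2.0 implementation (a sketch, not the exact setup from the paper).
# `bandpass_semitones` is the illustrative filter defined earlier.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(waveform, transcript, fs=16000):
    # Apply the distortion we want the model to adapt to.
    distorted = bandpass_semitones(waveform, fs, width_semitones=12.0)

    inputs = processor(distorted, sampling_rate=fs, return_tensors="pt")
    labels = processor(text=transcript.upper(), return_tensors="pt").input_ids

    loss = model(inputs.input_values, labels=labels).loss  # CTC loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```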
We could probably improve results by fine-tuning across all our tests, but that isn't really the point: most of the distortions we tested are ones that human listeners haven't previously encountered either.
In summary, current ASR systems are quite different from humans, but end-to-end training lets the CNN-Transformer get a lot closer. If humans are a guide, future models may benefit from making more use of TFS and periodicity information.