Are deep nets good brain models? Psychometric testing of automatic speech recognition systems shows they're not like humans, yet. Newer ones are getting closer, but they still ignore important cues.
— Dan Goodman (@neuralreckoning) May 21, 2021
Preprint with Lotte Weerts, Stuart Rosen and @ClopathLab. https://t.co/JAdHY9bkzD
🧵👇
We're excited about the huge performance leaps in vision and hearing DNNs, and how they're being used as models of the visual and auditory system, e.g. @JoshHMcDermott (https://t.co/ZqTxQY9q47) and @HearingTechLab (https://t.co/ueeSHQfE1w). But do these models work like people?
— Dan Goodman (@neuralreckoning) May 21, 2021
TL;DR: Automatic speech recognisers are less robust than people at distorted speech, and seem to use different cues (particularly temporal fine structure and periodicity). The most recent model we tested - @facebookai Wav2Vec - was the closest to humans, and also the best overall
— Dan Goodman (@neuralreckoning) May 21, 2021
As well as being the best, most robust model, Wav2Vec was also the only end-to-end trained model not using a hand-designed speech recognition front-end (MFCC features). This raises questions about the validity of these front-ends and is a hopeful sign for future developments!
— Dan Goodman (@neuralreckoning) May 21, 2021
So although current models are not good enough to be used as proxies for the auditory system, we expect this to change as these models improve. We provide our benchmarks as an open source library HumanlikeHearing to make it easy to test future systems: https://t.co/isnCbhQeZG
— Dan Goodman (@neuralreckoning) May 21, 2021
Now, let's get on to some of the gory details. There's a lot more in the paper if you're a sucker for punishment.
— Dan Goodman (@neuralreckoning) May 21, 2021
We tested three recent automatic speech recognition (ASR) systems on a range of psychometric tests designed for humans, to compare overall performance and patterns of errors, and to work out which auditory cues they were using (think texture vs shape for vision).
— Dan Goodman (@neuralreckoning) May 21, 2021
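To give a flavour of the kind of analysis behind these comparisons, here is a minimal sketch (not the paper's code, and not the HumanlikeHearing API): fit a logistic psychometric function to word-accuracy scores measured at several SNRs and read off the speech reception threshold (SRT), the SNR at which accuracy crosses 50%. The SNRs and scores below are made-up illustration data.

```python
# Minimal sketch (illustrative data): fit a psychometric function to
# word-recognition accuracy vs SNR and estimate the speech reception threshold.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical scores: proportion of words correct at each SNR (dB)
snrs = np.array([-12.0, -9.0, -6.0, -3.0, 0.0, 3.0, 6.0])
accuracy = np.array([0.05, 0.12, 0.31, 0.58, 0.81, 0.93, 0.97])

def logistic(snr, srt, slope):
    """Logistic psychometric function: accuracy is 0.5 when snr == srt."""
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

(srt, slope), _ = curve_fit(logistic, snrs, accuracy, p0=[-3.0, 1.0])
print(f"Estimated SRT: {srt:.1f} dB SNR (slope {slope:.2f} per dB)")
```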
The ASR systems we tested were:
— Dan Goodman (@neuralreckoning) May 21, 2021
• Zurow's Kaldi nnet3, a DNN-HMM hybrid
• @Mozilla DeepSpeech, based on LSTMs
• @facebookai Wav2Vec, a CNN-Transformer model
The first two use a standard speech recognition front-end (MFCCs), while Wav2Vec is trained end-to-end (important; see below).
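To make the front-end distinction concrete, here is a minimal sketch of the two kinds of input representation. It uses librosa for illustration (an assumption; it is not necessarily what these systems use internally) and a hypothetical file path.

```python
# Sketch of the two input representations (librosa used for illustration only).
import librosa

wave, sr = librosa.load("speech.wav", sr=16000)  # hypothetical 16 kHz speech clip

# Hand-designed front-end: 13 MFCCs per ~10 ms frame, the kind of features
# consumed by the DNN-HMM and LSTM systems above.
mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13, hop_length=160)
print("MFCC features:", mfcc.shape)   # (13, n_frames)

# End-to-end model: the raw waveform itself is the input.
print("Raw waveform:", wave.shape)    # (n_samples,)
```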
First up: how well do they work with a reduced frequency range? Answer: not well. Humans reach ceiling performance with just 12 semitones around 1.5 kHz, while even the best ASR needs around 40. Note that the CNN-Transformer did the best here: you'll see that again. pic.twitter.com/HHNWPKdduF
— Dan Goodman (@neuralreckoning) May 21, 2021
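For a flavour of this manipulation (a sketch, not the exact stimuli from the paper): a band W semitones wide geometrically centred on 1.5 kHz spans 1500 × 2^(±W/24) Hz, which can be applied with an ordinary Butterworth bandpass filter.

```python
# Sketch: band-pass speech to a band of a given width in semitones around 1.5 kHz.
from scipy.signal import butter, sosfiltfilt

def bandpass_semitones(wave, sr, width_semitones, centre_hz=1500.0, order=6):
    """Keep only a band `width_semitones` wide, geometrically centred on centre_hz."""
    low = centre_hz * 2.0 ** (-width_semitones / 24.0)   # 12 semitones = 1 octave
    high = centre_hz * 2.0 ** (+width_semitones / 24.0)
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wave)

# e.g. the 12-semitone band where human listeners already reach ceiling:
# narrow = bandpass_semitones(wave, sr, 12)
```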
Next up: peak and centre clipping. Peak clipping is what happens when your microphone saturates; centre clipping happens with some noise suppression systems. The ASRs all perform badly with peak clipping, but the CNN-Transformer and DNN-HMM match human performance really well for centre clipping. pic.twitter.com/NgBP4M3Wmg
— Dan Goodman (@neuralreckoning) May 21, 2021
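For concreteness, here is one common way to implement the two distortions (a sketch; the thresholds and exact definitions may differ from those in the paper).

```python
# Sketch of the two clipping distortions applied to a waveform scaled to [-1, 1].
import numpy as np

def peak_clip(wave, threshold=0.1):
    """Peak clipping: everything above the threshold saturates (mic overload)."""
    return np.clip(wave, -threshold, threshold)

def centre_clip(wave, threshold=0.1):
    """Centre clipping: everything below the threshold is zeroed
    (as some noise-suppression schemes do to low-level signal)."""
    return np.where(np.abs(wave) > threshold, wave, 0.0)
```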
We looked at how ASR systems use spectral and temporal modulations, following the @TheunissenLab method for removing certain modulations. They're overall less robust (we had to use a higher SNR to get comparable results), but they seem to be using these modulations in a similar way. pic.twitter.com/5LoYX7cfkY
— Dan Goodman (@neuralreckoning) May 21, 2021
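Roughly, the idea is to take the 2-D Fourier transform of the (log-)spectrogram, remove a region of spectral or temporal modulations, and resynthesise. Below is a much-simplified sketch of that pipeline (Griffin-Lim resynthesis, an assumed 16 Hz temporal-modulation cutoff, hypothetical file path); the paper follows the Elliott & Theunissen procedure, which differs in detail.

```python
# Simplified sketch of modulation filtering: low-pass the temporal modulations
# of a log-spectrogram, then resynthesise with Griffin-Lim (phase is estimated).
import numpy as np
import librosa

wave, sr = librosa.load("speech.wav", sr=16000)           # hypothetical speech clip
n_fft, hop = 512, 128
S = np.abs(librosa.stft(wave, n_fft=n_fft, hop_length=hop))
log_S = np.log(S + 1e-6)

M = np.fft.fft2(log_S)                                     # 2-D modulation spectrum
frame_rate = sr / hop                                      # spectrogram frames per second
wt = np.fft.fftfreq(log_S.shape[1], d=1.0 / frame_rate)    # temporal modulation freqs (Hz)
keep = (np.abs(wt) <= 16.0)[None, :]                       # assumed 16 Hz cutoff
log_S_filtered = np.real(np.fft.ifft2(M * keep))

S_filtered = np.maximum(np.exp(log_S_filtered) - 1e-6, 0.0)
wave_filtered = librosa.griffinlim(S_filtered, n_fft=n_fft, hop_length=hop)
```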
Sounds have slow (envelope) and fast (temporal fine structure, TFS) components. TFS has been suggested to be important for hearing in noisy environments. We expected the end-to-end CNN-Transformer might use this better than the systems using MFCCs, which mostly discard TFS, but no. pic.twitter.com/8sgmP1oW4A
— Dan Goodman (@neuralreckoning) May 21, 2021
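Envelope and TFS are usually separated with the Hilbert transform, typically within each of several frequency bands; a broadband sketch looks like this (the paper's manipulations are band-by-band and more careful).

```python
# Sketch: split a (band-limited) signal into envelope and temporal fine structure.
import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(band):
    """Hilbert decomposition: band ~= envelope * temporal fine structure."""
    analytic = hilbert(band)
    envelope = np.abs(analytic)          # slow amplitude fluctuations
    tfs = np.cos(np.angle(analytic))     # fast carrier (fine structure)
    return envelope, tfs

# 'Envelope-only' speech replaces the TFS with a fixed carrier or noise;
# 'TFS-only' speech flattens the envelope. MFCC front-ends keep little of the TFS.
```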
Similarly, none of the ASR systems seem to use periodicity information in the same way as humans. They aren't as robust to distortion and don't show the same patterns of errors for different distortions. pic.twitter.com/KZWkwoEhwG
— Dan Goodman (@neuralreckoning) May 21, 2021
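One standard way to remove periodicity (and TFS) cues is noise vocoding: split the speech into bands, keep each band's envelope, and use it to modulate band-limited noise. A rough sketch, with an assumed small set of band edges; it is not necessarily the exact distortion used in the paper.

```python
# Sketch of a noise vocoder: keeps band envelopes, replaces the carrier with noise,
# so periodicity and fine-structure cues are removed.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(wave, sr, edges_hz=(100, 300, 700, 1500, 3000, 6000)):
    rng = np.random.default_rng(0)
    out = np.zeros_like(wave)
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, wave)
        env = np.abs(hilbert(band))                                  # band envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(wave)))   # band-limited noise
        out += env * carrier
    return out / (np.max(np.abs(out)) + 1e-9)                        # rough normalisation
```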
One of the big challenges for both humans and ASR systems is handling competing talkers. Although they perform less well, requiring higher SNRs, the DNN-HMM and CNN-Transformer show a similar pattern to humans, "glimpsing" signals in dips in the noise. pic.twitter.com/dgnwafnegP
— Dan Goodman (@neuralreckoning) May 21, 2021
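Competing-talker and masking-noise conditions boil down to mixing the target speech with a masker at a chosen SNR; a minimal sketch:

```python
# Sketch: mix target speech with a masker (noise or another talker) at a given SNR.
import numpy as np

def mix_at_snr(target, masker, snr_db):
    masker = masker[: len(target)]                 # assumes the masker is long enough
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2) + 1e-12
    gain = np.sqrt(p_target / (p_masker * 10.0 ** (snr_db / 10.0)))
    return target + gain * masker

# e.g. mixed = mix_at_snr(speech, babble, snr_db=0.0)
```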
Digging deeper into this, humans can benefit from both periodicity and fluctuations in the masking noise. Of the ASR systems, only the CNN-Transformer shows a benefit from both, with a somewhat similar trend to humans. pic.twitter.com/wRwnqYB0O3
— Dan Goodman (@neuralreckoning) May 21, 2021
Finally, you might ask whether these comparisons are fair, since these models are all trained on clean speech. We fine-tuned the CNN-Transformer on bandpass-filtered speech; it improved performance on that test but made noise robustness worse. pic.twitter.com/GBKGVGYmji
— Dan Goodman (@neuralreckoning) May 21, 2021
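For reference, evaluating a (fine-tuned) Wav2Vec 2.0 checkpoint on distorted speech is only a few lines with the Hugging Face port; the checkpoint below is the public base model, not necessarily the one used in the paper, the file path is hypothetical, and the band-pass step reuses the earlier sketch.

```python
# Sketch: transcribe band-pass filtered speech with a Wav2Vec 2.0 checkpoint
# (public Hugging Face model shown; not necessarily the paper's exact setup).
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

wave, sr = librosa.load("speech.wav", sr=16000)               # hypothetical clip
filtered = bandpass_semitones(wave, sr, width_semitones=12)   # from the earlier sketch

inputs = processor(filtered, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])
```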
We could probably improve results by fine-tuning across all our tests, but that isn't really the point. Most of the distortions we tested are ones that human listeners haven't previously encountered either.
— Dan Goodman (@neuralreckoning) May 21, 2021
In summary, the ASR systems are quite different from humans, but end-to-end training, as in the CNN-Transformer, gets a lot closer. If humans are a guide, future models may benefit from making more use of TFS and periodicity information.
— Dan Goodman (@neuralreckoning) May 21, 2021
And if you've made it all the way here, congratulations! We'd love to get feedback on the paper, and if you have a go at using our code, let us know (and file bug reports if you find any!).
— Dan Goodman (@neuralreckoning) May 21, 2021
Thank you for reading.
The Psychometrics of Automatic Speech Recognition

Related software: HumanlikeHearing, a Python package for psychophysical tests of automatic speech recognition systems.

Related videos: The Psychometrics of Automatic Speech Recognition (talk, 2022), on applying psychometric testing to automatic speech recognition systems.