The short version
Are deep nets good brain models? Psychometric testing of automatic speech recognition systems shows they're not like humans, yet. Newer ones are getting closer, but still ignore important cues. We're excited about the huge performance leaps in vision and hearing DNNs, and how they're being used as models of the visual and auditory system (e.g. Kell et al. 2018, https://doi.org/10.1016/j.neuron.2018.03.044; Baby et al. 2021). But do these models work like people?
TL;DR: Automatic speech recognisers are less robust than people to distorted speech, and seem to use different cues (particularly temporal fine structure and periodicity). The most recent model we tested - Facebook's Wav2Vec - was the closest to humans, and also the best overall. Wav2Vec was also the only model trained end-to-end, without a hand-designed speech recognition front-end (MFCC features). This raises questions about the validity of these front-ends and is a hopeful sign for future developments!
So although current models are not good enough to be used as proxies for the auditory system, we expect this to change as these models improve. We provide our benchmarks as an open source library HumanlikeHearing to make it easy to test future systems.
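To give a flavour of what the benchmarks involve, here's a rough sketch of the basic pattern: run an ASR system on the same sentences at each level of a distortion and trace out a psychometric curve. This is not the actual HumanlikeHearing API, just an illustration of the loop it automates; `recognise` and `distort` stand in for whichever system and manipulation you plug in.

```python
# Sketch of a psychometric test loop: word error rate as a function of
# distortion level. `recognise` and `distort` are placeholders for an ASR
# system and a distortion; this is not the HumanlikeHearing API itself,
# just the general pattern such benchmarks automate.
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(ref), 1)

def psychometric_curve(sentences, recognise, distort, levels):
    """Mean word error rate at each distortion level (e.g. SNR in dB)."""
    curve = []
    for level in levels:
        errors = [word_error_rate(text, recognise(distort(audio, level)))
                  for audio, text in sentences]
        curve.append(np.mean(errors))
    return curve
```

The human psychophysics literature reports intelligibility in exactly this form, which is what lets us put people and machines on the same axes.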
Now, let's get on to some of the gory details. There's a lot more in the paper if you're a sucker for punishment.
We tested three recent automatic speech recognition (ASR) systems on a range of psychometric tests designed for humans, to compare overall performance, patterns of errors, and work out which auditory cues they were using (think texture vs shape for vision).
The ASR systems we tested were:
- Zurow's Kaldi nnet3, a DNN-HMM hybrid
- Mozilla DeepSpeech, based on LSTMs
- Facebook Wav2Vec, a CNN-Transformer model
The first two use a standard speech recognition front-end (MFCCs), while Wav2Vec is trained end-to-end directly on the waveform (this will matter later).
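To make the front-end distinction concrete, here is roughly what an MFCC front-end reduces the waveform to, using librosa as a stand-in (the actual systems use their own implementations and settings; the filename and parameter values are just placeholders):

```python
# Hand-designed front-end: mel-frequency cepstral coefficients (MFCCs).
# The exact settings differ between Kaldi and DeepSpeech; these are just
# typical values to show what the waveform gets reduced to.
import librosa

waveform, sample_rate = librosa.load("sentence.wav", sr=16000)
mfccs = librosa.feature.mfcc(
    y=waveform,
    sr=sample_rate,
    n_mfcc=13,        # keep only 13 coefficients per frame
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms hop between frames
)
# Shape (13, n_frames): a heavily compressed summary of the spectral
# envelope, with temporal fine structure largely discarded.
print(mfccs.shape)
```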
First up: how well do they work with a reduced frequency range? Answer: not well. Humans reach ceiling performance with a band of just 12 semitones around 1.5 kHz, while even the best ASR needed around 40 semitones. Note that the CNN-Transformer did best here: you'll see that again.
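For illustration (not the exact stimuli from the paper), bandpass filtering speech to a band of a given width in semitones, geometrically centred on 1.5 kHz, might look like this:

```python
# Bandpass speech to a band of `width_semitones` centred (geometrically)
# on `centre_hz`. Illustrative only; the actual stimuli follow the original
# psychophysics studies, not necessarily this exact filter.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_semitones(x, fs, centre_hz=1500.0, width_semitones=12.0, order=6):
    half_octaves = (width_semitones / 2.0) / 12.0   # 12 semitones = 1 octave
    low = centre_hz / 2.0 ** half_octaves
    high = centre_hz * 2.0 ** half_octaves
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```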
Next up: peak and centre clipping. Peak clipping is what happens when your microphone saturates; centre clipping can be introduced by some noise suppression systems. The ASRs all perform badly with peak clipping, but the CNN-Transformer and DNN-HMM match human results really well for centre clipping.
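Both clipping distortions are a couple of lines of numpy. This uses one common definition of centre clipping; in the actual tests the thresholds are set relative to each sentence's amplitude distribution, here they are just taken as given:

```python
import numpy as np

def peak_clip(x, threshold):
    """Saturating distortion: everything above the threshold is flattened."""
    return np.clip(x, -threshold, threshold)

def centre_clip(x, threshold):
    """Samples below the threshold are zeroed (as crude noise suppression
    might do); larger samples are shifted towards zero by the threshold."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)
```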
We looked at how ASR systems use spectral and temporal modulations, following Elliott and Theunissen's method for filtering out particular modulations. The ASRs are overall less robust (we had to use a higher SNR to get comparable results), but they seem to use these modulations in a similar way to humans.
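Very roughly, the manipulation filters the 2D Fourier transform of the log spectrogram: zero out a range of temporal or spectral modulation frequencies, then resynthesise. Here is a heavily simplified sketch of a temporal-modulation notch; the published procedure is more careful about resynthesis, so treat this as illustration only:

```python
# Notch out a band of temporal modulation frequencies from a spectrogram
# and resynthesise with the original phase. Heavily simplified compared to
# the published modulation-filtering procedure; for illustration only.
import numpy as np
from scipy.signal import stft, istft

def filter_temporal_modulations(x, fs, lo_hz=3.0, hi_hz=7.0,
                                nperseg=512, noverlap=384):
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    log_mag = np.log(np.abs(Z) + 1e-10)
    phase = np.angle(Z)

    # 2D FFT of the log spectrogram; axis 1 is time, so frequencies along
    # that axis are temporal modulation rates in Hz.
    M = np.fft.fft2(log_mag)
    frame_rate = fs / (nperseg - noverlap)
    wt = np.abs(np.fft.fftfreq(log_mag.shape[1], d=1.0 / frame_rate))

    # Zero the chosen band of temporal modulations (at every spectral
    # modulation), then invert.
    M[:, (wt >= lo_hz) & (wt <= hi_hz)] = 0.0
    filtered_log_mag = np.real(np.fft.ifft2(M))

    Z_filtered = np.exp(filtered_log_mag) * np.exp(1j * phase)
    _, y = istft(Z_filtered, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```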
Sounds have slow (envelope) and fast (temporal fine structure, TFS) components. TFS has been suggested to be important for hearing in noisy environments. We expected the end-to-end CNN-Transformer might use it better than the MFCC-based systems, which mostly discard TFS, but no.
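The envelope/TFS split is usually done per frequency band with the Hilbert transform; a minimal single-band sketch:

```python
# Split a (band-limited) signal into its slow envelope and its temporal
# fine structure (TFS) using the analytic signal. In the actual tests this
# is done separately within each of several frequency bands.
import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(band):
    analytic = hilbert(band)
    envelope = np.abs(analytic)        # slow amplitude fluctuations
    tfs = np.cos(np.angle(analytic))   # rapid oscillations, unit amplitude
    return envelope, tfs

# envelope * tfs approximately reconstructs the original band; swapping or
# degrading one component tests how much a listener relies on it.
```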
Similarly, none of the ASR systems seem to use periodicity information in the same way as humans. They aren't as robust to distortion and don't show the same patterns of errors for different distortions.
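A standard way of taking periodicity (and TFS) away while keeping the band envelopes is noise vocoding. A rough sketch of the idea, not necessarily the exact stimuli used in the tests:

```python
# Noise vocoding: keep the slow envelope in each frequency band but replace
# the carrier with noise, removing periodicity cues. Illustrative sketch.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_bands=8, f_lo=100.0, f_hi=7000.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(x))
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envelope = np.abs(hilbert(band))
        carrier = sosfiltfilt(sos, noise)
        carrier /= np.sqrt(np.mean(carrier ** 2)) + 1e-10  # unit-RMS carrier
        out += envelope * carrier
    return out
```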
One of the big challenges for both humans and ASR systems is handling competing talkers. Although they perform less well, requiring higher SNRs, the DNN-HMM and CNN-Transformer show a similar pattern to humans, "glimpsing" signals in dips in the noise.
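The masking tests come down to mixing speech and a masker at a controlled SNR; the dips come from using a fluctuating masker (another talker, or amplitude-modulated noise) rather than a steady one. A small sketch of that mixing step:

```python
# Mix speech with a masker at a target SNR (in dB). A fluctuating masker
# (another talker, or amplitude-modulated noise) leaves dips that listeners
# can "glimpse" the speech through.
import numpy as np

def mix_at_snr(speech, masker, snr_db):
    masker = masker[:len(speech)]
    speech_rms = np.sqrt(np.mean(speech ** 2))
    masker_rms = np.sqrt(np.mean(masker ** 2)) + 1e-10
    gain = speech_rms / (masker_rms * 10.0 ** (snr_db / 20.0))
    return speech + gain * masker

def modulated_noise(n_samples, fs, rate_hz=8.0, depth=1.0, seed=0):
    """Steady noise multiplied by a slow sinusoidal envelope."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / fs
    envelope = 1.0 + depth * np.sin(2.0 * np.pi * rate_hz * t)
    return envelope * rng.standard_normal(n_samples)
```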
Digging deeper into this, humans can benefit from both periodicity and fluctuations in the masking noise. Of the ASR systems, only the CNN-Transformer benefits from both, and it shows a somewhat similar trend to humans.
Finally, you might ask whether these comparisons are fair, given that these models are all trained on clean speech. We fine-tuned the CNN-Transformer on bandpass-filtered speech: it improved performance on that test, but made noise robustness worse.
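For concreteness, here is roughly what a fine-tuning step on a distortion looks like, sketched with the HuggingFace wav2vec 2.0 implementation rather than the exact pipeline from the paper; `bandpass_semitones` is the illustrative filter from earlier, and all parameter values are placeholders.

```python
# One CTC fine-tuning step on bandpass-filtered speech, using HuggingFace's
# wav2vec 2.0 implementation (a sketch, not the exact setup from the paper).
# `bandpass_semitones` is the illustrative filter defined earlier.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(waveform, transcript, fs=16000):
    # Apply the distortion we want the model to adapt to.
    distorted = bandpass_semitones(waveform, fs, width_semitones=12.0)

    inputs = processor(distorted, sampling_rate=fs, return_tensors="pt")
    labels = processor(text=transcript.upper(), return_tensors="pt").input_ids

    loss = model(inputs.input_values, labels=labels).loss  # CTC loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```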
We could probably improve results by fine-tuning across all our tests, but that isn't really the point: most of the distortions we tested are ones that human listeners haven't previously encountered either.
In summary, current ASR systems are quite different from humans, but end-to-end training lets the CNN-Transformer get a lot closer. If humans are a guide, future models may benefit from making more use of TFS and periodicity information.