Why do Mel-filterbank energies outperform MFCCs for speech commands recognition using CNN?

Question

Last month, a user called @jojek gave me the following advice in a comment:

I can bet that given enough data, CNN on Mel energies will outperform MFCCs. You should try it. It makes more sense to do convolution on Mel spectrogram rather than on decorrelated coefficients.

I tried a CNN on Mel-filterbank energies, and it did outperform MFCCs, but I still don't know why.

Yet many tutorials, like this one by TensorFlow, encourage the use of MFCCs for such applications:

Because the human ear is more sensitive to some frequencies than others, it's been traditional in speech recognition to do further processing to this representation to turn it into a set of Mel-Frequency Cepstral Coefficients, or MFCCs for short.

I would also like to know whether Mel-filterbank energies outperform MFCCs only with CNNs, or whether this also holds for LSTMs, DNNs, etc. I would appreciate a reference.

Update 1:

My comment on @Nikolay's answer contains relevant details, so I am adding it here:

Correct me if I'm wrong: since applying the DCT to the Mel-filterbank energies is, in this case, equivalent to an IDFT, it seems to me that keeping cepstral coefficients 2-13 (inclusive) and discarding the rest is equivalent to low-time liftering, which isolates the vocal tract components and drops the source components (which contain, e.g., the F0 spike).

So why should I use all 40 MFCCs, when all I care about for a speech command recognition model is the vocal tract components?

Update 2:

Another point of view (link) is:

Notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.
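To make the liftering and "fast changes" arguments concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the frame of log filterbank energies is synthetic (built from DCT basis vectors so the result is easy to check), the helper `dct_matrix` is a hypothetical name, and the orthonormal DCT-II used here may differ in scaling from what a given toolkit applies.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n); row k is the k-th basis vector."""
    k = np.arange(n)[:, None]          # coefficient index
    m = np.arange(n)[None, :]          # filter index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0] /= np.sqrt(2.0)               # rescale the constant row so rows are orthonormal
    return d

n_filters = 26                          # number of filters, as in the quoted tutorial
D = dct_matrix(n_filters)

# Synthetic frame (an assumption, not real speech): a slowly varying
# "envelope" plus a fast ripple across the filterbank channels.
slow = 3.0 * D[1] + 1.5 * D[3]
fast = 0.8 * D[20]
log_fbank = slow + fast

# Cepstral coefficients of the frame.
ceps = D @ log_fbank

# Keep coefficients 2-13 in 1-based counting (indices 1..12), zero the rest.
# Dropping index 0 also discards the overall level of the frame.
ceps_low = np.zeros_like(ceps)
ceps_low[1:13] = ceps[1:13]

# Map the kept coefficients back to the filterbank domain.
smoothed = D.T @ ceps_low

print("max |smoothed - slow| :", np.max(np.abs(smoothed - slow)))
print("max |original - slow| :", np.max(np.abs(log_fbank - slow)))
```

In this toy case the truncation removes the fast ripple and returns the slow envelope, which is the smoothing effect both quotes describe. Whether discarding that detail helps or hurts a neural network is a separate, empirical question.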

References:

https://tspace.library.utoronto.ca/bitstream/1807/44123/1/Mohamed_Abdel-rahman_201406_PhD_thesis.pdf

Nikolay Shmyrev · Accepted Answer · 2020-02-28 14:50:24Z

MFCCs are computed from Mel energies by a simple matrix multiplication followed by a reduction in dimension. The matrix multiplication by itself does not change anything, since a neural network applies many further operations afterwards.

What matters is the reduction in dimension: instead of 40 Mel energies you keep 13 cepstral coefficients and drop the rest. That reduces accuracy with a CNN, a DNN, or any other model.

However, if you do not drop any coefficients and keep all 40 MFCCs, you can get the same accuracy as with Mel energies, or even better.

So it does not matter whether you use Mel energies or MFCCs; what matters is how many coefficients you keep in your features.
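A small numerical sketch of this point, assuming the Mel energies are already log-compressed and that the transform is an orthonormal DCT-II (toolkits differ in scaling). The frame is random stand-in data and `dct_matrix` is a hypothetical helper, not any library's API.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0] /= np.sqrt(2.0)               # make the constant row unit-norm
    return d

rng = np.random.default_rng(0)
n_mels = 40
log_mel = rng.normal(size=n_mels)      # stand-in for one frame of log Mel energies

D = dct_matrix(n_mels)
mfcc_full = D @ log_mel                # all 40 coefficients: just a change of basis
mfcc_13 = mfcc_full[:13]               # the usual truncation to 13 coefficients

# The full transform is orthogonal, so it can be undone exactly.
is_orthogonal = np.allclose(D @ D.T, np.eye(n_mels))
recon_full = D.T @ mfcc_full
err_full = np.linalg.norm(recon_full - log_mel)

# With only 13 coefficients, the best linear reconstruction loses the
# part of the frame carried by the 27 discarded coefficients.
recon_13 = D[:13].T @ mfcc_13
err_13 = np.linalg.norm(recon_13 - log_mel)

print("orthogonal:", is_orthogonal)
print("reconstruction error, 40 coefficients:", err_full)
print("reconstruction error, 13 coefficients:", err_13)
```

With all 40 coefficients the error is at floating-point level, so a first linear layer could in principle absorb the transform. With 13 coefficients the discarded information cannot be recovered by any later layer.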

Correct me if I'm wrong: since applying the DCT to the Mel-filterbank energies is, in this case, equivalent to an IDFT, it seems to me that keeping cepstral coefficients 2-13 (inclusive) and discarding the rest is equivalent to low-time liftering, which isolates the vocal tract components and drops the source components (which contain, e.g., the F0 spike). So why should I use all 40 MFCCs, when all I care about for a speech command recognition model is the vocal tract components? Also, I'm using 26 Mel filters as a common choice, but I'm not sure whether I should use 40 instead. Thank you.
– Abdulkader
Commented Feb 28, 2020 at 15:38

26 or 40, it depends on your choice. For neural networks the more coefficients you have the better.
– Nikolay Shmyrev
Commented Feb 28, 2020 at 16:19
OK, thank you. Kindly edit your answer to add more details about the first part of my comment.
– Abdulkader
Commented Mar 6, 2020 at 18:34

张子豪 · Accepted Answer · 2024-05-24 07:12:04Z

I am also confused by the same problem. For now I have several hypotheses, and I am still trying to understand it. (1) MFCCs are useful for dealing with harmonics at different fundamental frequencies. For example, log(mn) - log(mq) = log(n) - log(q) (with m the fundamental frequency and n, q the harmonic numbers), which means that no matter how the fundamental frequency changes, the distance between two harmonics does not change. This is quite useful for generalization, since the MFCCs will not change even if different people pronounce vowels with different fundamental frequencies.

(2) Currently, MFCCs are easily affected by noise and other sound components, so I am trying to extract the harmonic sound before computing the MFCCs.
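A quick numerical check of the identity in (1), as a sketch only. The fundamental frequencies are arbitrary example values, and the Hz-to-Mel formula below is one commonly used variant (an assumption, since the answer does not say which Mel scale is meant). Note that the identity is exact on a pure logarithmic axis, while the Mel scale is only approximately logarithmic (it is close to linear at low frequencies).

```python
import numpy as np

def hz_to_mel(f_hz):
    """One commonly used Hz-to-Mel formula (assumed here for illustration)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

f0_values = np.array([100.0, 150.0, 220.0])   # example fundamentals in Hz
n, q = 3, 2                                    # compare the 3rd and 2nd harmonics

# Distance between the two harmonics on a pure log-frequency axis:
# log(n * f0) - log(q * f0) = log(n) - log(q), independent of f0.
log_gaps = np.log(n * f0_values) - np.log(q * f0_values)

# The same distance measured on the Mel axis, which is not purely logarithmic.
mel_gaps = hz_to_mel(n * f0_values) - hz_to_mel(q * f0_values)

print("log-axis gaps:", log_gaps)
print("Mel-axis gaps:", mel_gaps)
```

The log-axis gaps are identical for every fundamental, while the Mel-axis gaps vary with it, so the invariance argued in (1) holds only approximately for Mel-based features, mainly at higher frequencies.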

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
– Nipul Rathod
Commented May 27 at 4:37
