Why is MFCC widely used in speech recognition?

MFCC compresses the spectrum on a mel scale that mimics human hearing and uses the logarithm and DCT to separate the spectral envelope (vocal tract) from the fine structure (vocal cords). The low-order MFCCs encode the vocal tract shape needed to discriminate phonemes in only 12 to 13 dimensions, making them a long-standing standard input feature for HMM and DNN acoustic models.

Can MFCCs also be used for speaker identification?

Yes. The shape of a speaker's vocal tract and the characteristics of their vocal cords are strongly reflected in the MFCCs, so MFCCs have long been a basic feature for speaker identification. Modern speaker embeddings such as i-vectors and x-vectors are also typically built on MFCCs (or similar filterbank features). For text-independent speaker identification, post-processing such as cepstral mean and variance normalization (CMVN) and sufficient utterance length matter a lot.

What are delta and delta-delta MFCCs for?

MFCCs themselves describe the static spectral shape of a single short frame (about 25 ms), but speech also carries strong cues in its time evolution. Delta MFCCs are first-order regression coefficients (a discrete time derivative) computed from a few neighbouring frames, and delta-delta MFCCs are the derivative of those. The classical configuration of 13 MFCC plus 13 delta plus 13 delta-delta gives a 39-dimensional feature long used in HMM-based speech recognition.
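The regression described above can be sketched as follows. This is a minimal illustration, assuming the common formula $d_t = \sum_{n=1}^{W} n\,(c_{t+n} - c_{t-n}) \,/\, (2\sum_{n=1}^{W} n^2)$ with a window of W = 2 frames and edge frames repeated; the function name `delta` and the toy input are hypothetical, and real toolkits may differ in window width and edge handling.

```python
def delta(frames, width=2):
    """First-order regression coefficients over +/- `width` neighbouring frames.

    `frames` is a list of per-frame feature vectors (lists of floats).
    Edges are handled by repeating the first and last frame (an assumption).
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


# Toy input: 6 frames of a 2-dimensional "MFCC" that rises linearly in time.
static = [[float(t), 2.0 * t] for t in range(6)]
d1 = delta(static)   # delta
d2 = delta(d1)       # delta-delta
# Concatenate static + delta + delta-delta per frame (2 + 2 + 2 = 6 dims here).
features = [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static coefficients instead of 2, the same concatenation yields the 39-dimensional vector mentioned above.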

Are MFCCs still needed in the era of neural speech recognition?

Recent end-to-end models often feed in raw waveforms or log-mel spectrograms (the same pipeline without the DCT). Since the network can learn the dimensionality reduction that DCT performs, there is little reason to throw information away. However, MFCCs remain very competitive in settings where compute and feature size matter—lightweight embedded devices, keyword spotting, speaker identification—and are still actively used in production.

MFCC Feature Simulator — Free Online Calculator

Parameters: dominant frequency 1 (f₁), dominant frequency 2 (f₂), number of mel filters (M), number of MFCC coefficients (L). While paused, move the sliders to update the result instantly.

Sampling frequency F_s = 16000 Hz, frame length N = 512 and pre-emphasis coefficient α = 0.97 are assumed. Noise is generated by a deterministic LCG.

Results: number of mel filters M, number of MFCC coefficients L, dominant mel band, first MFCC c₁.

MFCC Extraction Pipeline

Top to bottom: input signal x[n] (blue) / power spectrum |X[k]|² (green) / mel filterbank energies log E_m (orange) / MFCC coefficients c_n (red)

Theory & Key Formulas

MFCC compresses a speech signal by combining "a mel scale that mimics human hearing" with "cepstral analysis". The pipeline is pre-emphasis → window → DFT → mel filter bank → log → DCT.

Pre-emphasis to boost high frequencies (α ≈ 0.97):

$$y[n] = x[n] - \alpha\,x[n-1]$$
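A minimal sketch of this filter, assuming the sample before the start of the signal is zero (the function name `pre_emphasis` is illustrative). The two toy inputs show the high-pass behaviour: a constant (zero-frequency) signal is almost removed, while a signal alternating at the Nyquist rate is nearly doubled.

```python
def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], treating x[-1] as 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


constant = pre_emphasis([1.0] * 5)                       # settles at 1 - alpha
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0, 1.0])  # settles at +/-(1 + alpha)
```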

Take the DFT of N Hamming-windowed samples and form the power spectrum $|X[k]|^2$. Mel-scale conversion:

$$m(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$
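The conversion and its algebraic inverse, $f = 700\,(10^{m/2595} - 1)$, can be sketched as below. The helper `mel_spaced_edges` is a hypothetical name; it places band edges uniformly on the mel axis and maps them back to Hz, which is one common way to position the filters.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_edges(num_filters, f_low, f_high):
    """Return num_filters + 2 band-edge frequencies in Hz, uniform on the mel axis."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(num_filters + 2)]


edges = mel_spaced_edges(10, 0.0, 8000.0)
# Spacing in Hz between consecutive edges: it grows with frequency.
widths = [b - a for a, b in zip(edges, edges[1:])]
```

Equal steps in mel become progressively wider steps in Hz, which is why the filters look dense at low frequencies and sparse at high frequencies.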

Place M triangular mel filters $H_m$ uniformly on the mel axis and form the log filterbank energies:

$$\log E_m = \log\!\left(\sum_{k} H_m[k]\,|X[k]|^2\right)$$
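One way to build the triangular filters and evaluate the log energies is sketched below. It assumes the filters span 0 Hz to F_s/2, have unit peak height, and that the energy is floored before the logarithm to avoid log(0); these are illustrative choices, not necessarily what the simulator does, and the function names are hypothetical.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters H[m][k] over DFT bins k = 0 .. n_fft // 2."""
    num_bins = n_fft // 2 + 1
    m_max = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies, uniform in mel, mapped back to Hz.
    edges = [mel_to_hz(m_max * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bank = []
    for m in range(num_filters):
        left, centre, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for k in range(num_bins):
            f = k * sample_rate / n_fft
            if left < f <= centre:
                row.append((f - left) / (centre - left))    # rising slope
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_filterbank_energies(power_spectrum, bank, floor=1e-10):
    """log E_m = log(sum_k H_m[k] |X[k]|^2), floored to avoid log(0)."""
    return [math.log(max(sum(h * p for h, p in zip(row, power_spectrum)), floor))
            for row in bank]


bank = mel_filterbank(num_filters=26, n_fft=512, sample_rate=16000)
flat_spectrum = [1.0] * 257          # toy input: equal power in every bin
log_e = log_filterbank_energies(flat_spectrum, bank)
```

Because each filter's rising slope shares its edges with the previous filter's falling slope, adjacent triangles overlap by half on the mel axis.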

Finally extract cepstral coefficients with the DCT:

$$c_n = \sum_{m=0}^{M-1} \log E_m \cdot \cos\!\left(\frac{\pi n (m+\tfrac{1}{2})}{M}\right)$$
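The sum above translates directly into code. The sketch below uses the same unnormalized form as the formula and, as an assumption, includes n = 0 in the output; the function name `dct_mfcc` is illustrative. The two toy inputs show how smooth shapes of the log spectrum map to single low-order coefficients.

```python
import math


def dct_mfcc(log_energies, num_coeffs):
    """c_n = sum_m logE[m] * cos(pi * n * (m + 1/2) / M) for n = 0 .. num_coeffs - 1."""
    M = len(log_energies)
    return [sum(log_energies[m] * math.cos(math.pi * n * (m + 0.5) / M) for m in range(M))
            for n in range(num_coeffs)]


M = 26
flat = [1.0] * M                                                # constant log spectrum
tilt = [math.cos(math.pi * (m + 0.5) / M) for m in range(M)]   # one slow half-cosine

c_flat = dct_mfcc(flat, 13)   # all energy lands in c_0
c_tilt = dct_mfcc(tilt, 13)   # all energy lands in c_1
```

Keeping only the first L coefficients therefore keeps the slowly varying part of the log spectrum and discards the rapid ripple.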

The role of the DCT is to separate "envelope (vocal tract)" from "fine structure (vocal cords)" of the log spectrum. Low-order c_n encode the vocal tract shape; high-order c_n encode glottal vibration.
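Putting the stages together, a single-frame version of the pipeline might look like the sketch below. It is written in plain Python so it has no dependencies, uses a direct O(N²) DFT where real code would use an FFT, and assumes filters spanning 0 Hz to F_s/2 and a small floor before the logarithm. The function name `mfcc_frame`, the 1000 Hz test tone and M = 26 are illustrative choices, not the simulator's implementation.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(x, sample_rate, num_filters, num_coeffs, alpha=0.97):
    n_fft = len(x)
    # 1. Pre-emphasis (the sample before the frame is taken as 0).
    y = [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, n_fft)]
    # 2. Hamming window.
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_fft - 1)) for n in range(n_fft)]
    frame = [a * b for a, b in zip(y, w)]
    # 3. Power spectrum via a direct DFT (an FFT would be used in practice).
    power = []
    for k in range(n_fft // 2 + 1):
        X_k = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
        power.append(abs(X_k) ** 2)
    # 4. Triangular mel filterbank and log energies.
    m_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(m_max * i / (num_filters + 1)) for i in range(num_filters + 2)]
    log_e = []
    for m in range(num_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        energy = 0.0
        for k, p in enumerate(power):
            f = k * sample_rate / n_fft
            if lo < f <= mid:
                energy += p * (f - lo) / (mid - lo)
            elif mid < f < hi:
                energy += p * (hi - f) / (hi - mid)
        log_e.append(math.log(max(energy, 1e-10)))
    # 5. DCT-II of the log energies (unnormalized, n = 0 included).
    coeffs = [sum(log_e[m] * math.cos(math.pi * n * (m + 0.5) / num_filters)
                  for m in range(num_filters)) for n in range(num_coeffs)]
    return log_e, coeffs


fs, frame_len = 16000, 512
tone = [math.sin(2.0 * math.pi * 1000.0 * t / fs) for t in range(frame_len)]
log_e, c = mfcc_frame(tone, fs, num_filters=26, num_coeffs=13)
# Index of the mel band with the largest log energy.
dominant_band = max(range(len(log_e)), key=lambda m: log_e[m])
```

For this tone the dominant band is one of the two filters whose triangles cover 1000 Hz, which corresponds to the "dominant mel band" readout in the results.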

What is the MFCC Feature Simulator

🙋

I keep hearing about speech recognition, but the computer is not actually comparing the raw waveforms of words, right?

🎓

Good catch. Even the same vowel "a" looks completely different in raw waveform depending on the speaker, pitch and recording. So we transform it into a "feature vector" — a small set of numbers in which phonemes are easier to discriminate. MFCC, short for Mel-Frequency Cepstral Coefficients, is the classic and very powerful feature, and the simulator above shows the four stages — waveform, spectrum, mel-log, DCT — laid out as four stacked plots.

🙋

What is "mel"? Why not just use the ordinary frequency axis?

🎓

The human ear resolves pitch finely at low frequencies and only coarsely at high frequencies. The step from 1000 Hz to 2000 Hz sounds large, but the step from 6000 Hz to 7000 Hz sounds much smaller. The mel scale is a non-linear axis that models this auditory behaviour, defined by $m(f) = 2595\,\log_{10}(1 + f/700)$. Look at the third panel: the filters are packed densely at low frequencies and spread out at high frequencies.
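As a rough check of those two gaps, plugging the four frequencies into the formula gives (values rounded):

$$m(1000) \approx 1000,\quad m(2000) \approx 1521,\quad m(6000) \approx 2546,\quad m(7000) \approx 2702$$

$$m(2000) - m(1000) \approx 521\ \text{mel}, \qquad m(7000) - m(6000) \approx 157\ \text{mel}$$

So the same 1000 Hz step is a bit more than three times larger on the mel axis at the low end than at the high end.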

🙋

What is the final DCT for? After taking the logarithm we transform again?

🎓

The DCT plays two roles: compression and separation. The log mel spectrum mixes the vocal tract envelope with the fine structure of vocal-cord vibration. After the DCT, the slow-varying components — the vocal tract shape — sit in the low-order coefficients, while the fast-varying components — vocal cords — sit in the high-order ones. Phoneme discrimination needs the vocal tract, so taking only the first 12 to 13 coefficients is enough. Increase L from 4 to 20 in the simulator and you can see that the high-order coefficients are small and carry little information.

🙋

I see. When I drop the number of mel filters M to 10 the bars in the filterbank panel get fewer. Is going too low a problem?

🎓

Yes. M controls spectral resolution. At 10 it is too coarse and fine phoneme distinctions get washed out. Go much higher than 40 and the resolution is high, but neighbouring filters become strongly correlated and the extra filters add little after the DCT. In practice 20 to 40 is the sweet spot, and 26 or 40 are particularly common. Sweeping M while watching the third and fourth panels gives a feel for the trade-off.

Frequently Asked Questions

Why is pre-emphasis applied?

Speech signals carry most of their energy at low frequencies because of the glottal source, and the spectrum rolls off by about 6 dB per octave toward high frequencies. Pre-emphasis $y[n] = x[n] - \alpha x[n-1]$ (α ≈ 0.97) is a first-order high-pass filter that compensates for this slope, so that the high-frequency phoneme cues, especially consonants, are not buried in the downstream spectral representation.
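To see the high-pass behaviour, the frequency response of the pre-emphasis filter can be worked out directly (a short derivation, with ω the normalized angular frequency in radians per sample):

$$H(e^{j\omega}) = 1 - \alpha\,e^{-j\omega}, \qquad |H(e^{j\omega})|^2 = 1 + \alpha^2 - 2\alpha\cos\omega$$

With α = 0.97 the gain is $|H| = 1 - \alpha = 0.03$ (about −30 dB) at ω = 0 and $|H| = 1 + \alpha = 1.97$ (about +6 dB) at ω = π, the Nyquist frequency.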

Why is a Hamming window applied to each frame?

The DFT implicitly treats the frame as one period of a periodic signal, so cutting out a finite-length frame creates discontinuities at the boundaries that smear energy across the spectrum (spectral leakage). A Hamming window is a taper that smoothly attenuates both ends of the frame (to about 8% of the centre value, not exactly zero), which greatly lowers the side lobes at the cost of a somewhat wider main lobe. The choice of window (Hamming, Hann, Blackman, etc.) trades main-lobe width against side-lobe level.
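The effect on leakage can be checked numerically. The sketch below is an illustration under assumed settings (N = 128, a tone at 10.5 cycles per frame so that it falls between DFT bins, and a hypothetical helper `far_leakage_db`); it compares the strongest spectral component far from the tone with and without the window.

```python
import cmath
import math


def hamming(n_samples):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def power_spectrum(frame):
    """|X[k]|^2 for k = 0 .. N/2 via a direct DFT."""
    n_fft = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft))) ** 2
            for k in range(n_fft // 2 + 1)]


def far_leakage_db(spectrum, peak_bin=10, guard=8):
    """Strongest component at least `guard` bins above the peak, in dB relative to the maximum."""
    far = max(spectrum[peak_bin + guard:])
    return 10.0 * math.log10(far / max(spectrum))


N = 128
# 10.5 cycles per frame: the tone sits between two bins, the worst case for leakage.
tone = [math.sin(2.0 * math.pi * 10.5 * n / N) for n in range(N)]
window = hamming(N)   # end points are 0.08, not 0

rect_db = far_leakage_db(power_spectrum(tone))
hamm_db = far_leakage_db(power_spectrum([s * w for s, w in zip(tone, window)]))
```

Comparing `rect_db` with `hamm_db` shows the far-off leakage dropping substantially once the window is applied.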

Why are the mel filters triangular?

Triangles are widely adopted because they are computationally simple and easy to design so that adjacent filters overlap by half on the mel axis. Perceptually, they roughly approximate the band integration that the ear's "critical bands" perform, so each filter corresponds to one auditory channel. Replacing the shape with Gaussian or rectangular filters usually changes the result very little; the triangle is a practical compromise.

What is CMVN and why is it used?

A different microphone, room reverberation, or channel response filters the signal, which shows up as an additive offset in the log spectrum and therefore shifts the MFCC coefficients. CMVN (cepstral mean and variance normalization) subtracts the mean and divides by the standard deviation of each MFCC dimension over an utterance or a few-second window, producing features that are far less sensitive to the channel. In production it is almost mandatory because it greatly improves robustness to mismatch between speakers and recording conditions.
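A per-utterance version of this normalization can be sketched as follows. The function name `cmvn`, the small `eps` guard against division by zero and the toy numbers are assumptions for illustration; the example adds a constant offset to every frame and shows that the normalized features do not change.

```python
import math


def cmvn(frames, eps=1e-8):
    """Per-dimension mean and variance normalization over one utterance."""
    num_frames = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
            for d in range(dims)]
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)] for f in frames]


clean = [[1.0, -2.0], [3.0, 0.0], [5.0, 4.0], [2.0, 1.0]]
offset = [7.5, -3.0]   # hypothetical constant channel offset per coefficient
shifted = [[v + o for v, o in zip(f, offset)] for f in clean]

norm_clean = cmvn(clean)
norm_shifted = cmvn(shifted)   # matches norm_clean up to rounding
```

For streaming use, the mean and standard deviation would be estimated over a sliding window rather than the whole utterance.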

Real-World Applications

Automatic speech recognition (ASR): The standard front-end in the HMM-GMM era extracted 13-dimensional MFCC plus delta and delta-delta (39 dimensions in total) every 10 ms and fed it to the acoustic model. MFCC remains the default feature in widely used speech recognition toolkits such as Kaldi.

Speaker recognition and verification: A speaker's vocal tract shape and vocal-cord characteristics show up strongly in the MFCC envelope. From the classical GMM-UBM through i-vectors to modern x-vectors, many speaker-embedding methods use MFCC as their input feature. Wake-word detection on phones, such as "OK Google" or "Hey Siri", relies internally on similar features.

Music information processing: MFCCs are a basic feature in instrument timbre identification, genre classification and song similarity search. Some music-fingerprinting algorithms also combine MFCCs as auxiliary features. Python libraries such as Librosa provide MFCC extraction out of the box.

Anomaly detection and predictive maintenance: In CAE-adjacent fields where machine sound is monitored, MFCC is a useful way to compress a time waveform into a low-dimensional feature. Bearing faults in rotating machinery, corona discharges in power equipment, and engine knocking — tasks where humans "hear the difference" — are routinely automated with MFCC features.

Common Misconceptions and Cautions

The most common misconception is to think that "MFCC is the frequency spectrum itself". MFCC is a cepstrum: the DCT applied to the log of the mel-filtered spectrum. Its horizontal axis is not frequency but a time-like dimension called "quefrency". Low-order c_n describe the global shape of the log spectrum (vocal tract); high-order c_n describe rapid variations (vocal-cord pitch, noise). Compare the third panel (log E_m) with the fourth panel (c_n) in the simulator: although both are derived from log energies, the meaning of their horizontal axis is completely different.

The next most common error is to assume that "the longer the frame N, the more accurate". The simulator fixes N = 512, which is 32 ms at F_s = 16000 Hz: a power of two that suits the FFT and sits close to the typical range in speech processing. If N is too long, several phonemes or voiced/unvoiced transitions land in a single frame and the features become blurry. If it is too short, frequency resolution suffers, most noticeably at low frequencies where the mel filters are narrow. In practice, a 20–30 ms frame with a 10 ms shift is the standard compromise that balances time and frequency resolution.
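The arithmetic behind these numbers is short enough to check in code. The sketch below uses the simulator's F_s and N together with the 10 ms shift mentioned above; the helper `frame_signal` is a hypothetical name, and dropping the incomplete last frame is an assumption (padding is an equally common choice).

```python
def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; trailing samples that do not fill a frame are dropped."""
    count = 1 + (len(x) - frame_len) // hop if len(x) >= frame_len else 0
    return [x[i * hop: i * hop + frame_len] for i in range(count)]


fs = 16000
frame_len = 512                        # simulator value
hop = fs * 10 // 1000                  # 10 ms shift -> 160 samples
frame_ms = 1000.0 * frame_len / fs     # frame duration: 32.0 ms
bin_hz = fs / frame_len                # DFT bin spacing: 31.25 Hz

one_second = [0.0] * fs                # placeholder signal, 1 s long
frames = frame_signal(one_second, frame_len, hop)
```

Halving the frame length halves `frame_ms` but doubles `bin_hz`, which is the time versus frequency resolution trade-off in concrete terms.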

Finally, do not assume that MFCC is a universal speech feature. MFCC is a classic feature alongside linear prediction and the Fourier transform, but in recent end-to-end neural speech recognition, log-mel spectrograms (the same pipeline without the DCT) and even raw waveforms are increasingly fed in directly. The dimensionality reduction by the DCT buys lower compute and a smaller parameter count at the cost of losing high-order information. Whether MFCC, log-mel or raw waveform is best depends on the task, the compute budget and the model architecture.
