
MFCC Feature Simulator — Mel-Frequency Cepstral Coefficients

Visualize, stage by stage, how a speech signal is turned into MFCC features through pre-emphasis, a Hamming window, FFT, a mel filter bank, logarithm and DCT — the standard front-end of speech recognition.

Parameters
Dominant frequency 1, f₁ (Hz)
Dominant frequency 2, f₂ (Hz)
Number of mel filters, M
Number of MFCC coefficients, L

The sampling frequency F_s = 16000 Hz, the frame length N = 512 and the pre-emphasis coefficient α = 0.97 are fixed. Noise is generated by a deterministic linear congruential generator (LCG), so the same parameters always produce the same plots.
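A minimal sketch of how such a test frame could be generated is shown here. The signal model (two sinusoids at f₁ and f₂ plus scaled noise), the amplitudes, the `noise_level` parameter and the LCG constants are all assumptions for illustration; the simulator's own values are not stated.

```python
import math

FS = 16000   # sampling frequency in Hz (from the text)
N = 512      # frame length in samples (from the text)

def lcg_noise(count, seed=1):
    """Uniform noise in [-1, 1) from a linear congruential generator.

    The multiplier, increment and modulus are the Numerical Recipes
    constants, picked here as an assumption.
    """
    state = seed
    out = []
    for _ in range(count):
        state = (1664525 * state + 1013904223) % 2**32
        out.append(state / 2**31 - 1.0)
    return out

def make_frame(f1, f2, noise_level=0.05, seed=1):
    """Two sinusoids plus LCG noise (hypothetical signal model)."""
    noise = lcg_noise(N, seed)
    return [
        math.sin(2 * math.pi * f1 * n / FS)
        + 0.5 * math.sin(2 * math.pi * f2 * n / FS)
        + noise_level * noise[n]
        for n in range(N)
    ]

frame = make_frame(500.0, 1500.0)
print(len(frame))  # 512
```

Because the generator is seeded, calling `make_frame` twice with the same arguments returns identical samples.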

Results
MFCC c₀ (log-energy term)
MFCC c₁
Peak-energy filter index
Centre frequency of last mel filter
MFCC Extraction Pipeline

Top to bottom: input signal x[n] (blue) / power spectrum |X[k]|² (green) / mel filterbank energies log E_m (orange) / MFCC coefficients c_n (red)

Theory & Key Formulas

MFCCs compress each speech frame into a short feature vector by combining "a mel scale that mimics human hearing" with "cepstral analysis". The pipeline is pre-emphasis → window → DFT → mel filter bank → log → DCT.

Pre-emphasis to boost high frequencies (α ≈ 0.97):

$$y[n] = x[n] - \alpha\,x[n-1]$$
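The pre-emphasis difference equation can be sketched in a few lines. Treating the sample before the frame as zero is an assumption of this sketch, and the `gain` helper is a hypothetical name that evaluates the filter's magnitude response $|1 - \alpha e^{-j\omega}|$.

```python
import cmath
import math

def pre_emphasis(x, alpha=0.97):
    """Return y[n] = x[n] - alpha * x[n-1], taking x[-1] = 0 (assumption)."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def gain(freq_hz, fs=16000, alpha=0.97):
    """Magnitude response |1 - alpha * exp(-j*omega)| at freq_hz."""
    omega = 2 * math.pi * freq_hz / fs
    return abs(1 - alpha * cmath.exp(-1j * omega))

# A constant (DC) input is almost cancelled after the first sample.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
# Gain is about 0.03 at 0 Hz and about 1.97 at the Nyquist frequency.
print(gain(0.0), gain(8000.0))
```

The two gain values show the high-pass behaviour: low frequencies are strongly attenuated while the top of the band is nearly doubled in amplitude.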

Take the DFT of N Hamming-windowed samples and form the power spectrum $|X[k]|^2$. Mel-scale conversion:

$$m(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$
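As a quick numerical check of the mel formula, the sketch below implements it together with its algebraic inverse. The function names are illustrative only.

```python
import math

def hz_to_mel(f):
    """m(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (1000, 2000, 6000, 7000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

With this formula 1000 Hz maps to almost exactly 1000 mel, and the step from 1000 Hz to 2000 Hz spans roughly 520 mel while the step from 6000 Hz to 7000 Hz spans only roughly 160 mel.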

Place M triangular mel filters $H_m$ uniformly on the mel axis and form the log filterbank energies:

$$\log E_m = \log\!\left(\sum_{k} H_m[k]\,|X[k]|^2\right)$$
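One possible construction of the triangular filters and the log energies is sketched below. Placing the filter edges between 0 Hz and F_s/2, evaluating the triangles at the exact bin frequencies, and flooring the energy before the logarithm are assumptions of this sketch rather than documented behaviour of the simulator.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft=512, fs=16000):
    """Return (filters, centres_hz); filters[m][k] is H_m[k], k = 0..n_fft/2."""
    top = hz_to_mel(fs / 2)
    # num_filters + 2 edge points, equally spaced in mel, mapped back to Hz
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * fs / n_fft for k in range(n_fft // 2 + 1)]
    filters = []
    for m in range(1, num_filters + 1):
        lo, centre, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo <= f <= centre:
                row.append((f - lo) / (centre - lo))   # rising slope
            elif centre < f <= hi:
                row.append((hi - f) / (hi - centre))   # falling slope
            else:
                row.append(0.0)
        filters.append(row)
    return filters, edges[1:-1]

def log_mel_energies(power, filters, floor=1e-10):
    """log E_m = log(sum_k H_m[k] * |X[k]|^2), floored to avoid log(0)."""
    return [math.log(max(sum(h * p for h, p in zip(row, power)), floor))
            for row in filters]

filters, centres = mel_filterbank(26)
flat_power = [1.0] * 257   # a flat spectrum, purely for illustration
log_e = log_mel_energies(flat_power, filters)
print(round(centres[-1]), "Hz is the centre of the last filter")
```

Even for a flat spectrum the log energies rise with the filter index, because the triangles get wider in Hz and each one sums over more DFT bins.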

Finally extract cepstral coefficients with the DCT:

$$c_n = \sum_{m=0}^{M-1} \log E_m \cdot \cos\!\left(\frac{\pi n (m+\tfrac{1}{2})}{M}\right)$$
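The DCT sum above translates directly into code. This sketch uses the unnormalized form exactly as written in the formula and assumes the coefficient index runs over n = 0, …, L−1; the function name is illustrative.

```python
import math

def mfcc_from_log_energies(log_e, num_coeffs):
    """c_n = sum_m log_e[m] * cos(pi * n * (m + 0.5) / M), n = 0..num_coeffs-1."""
    M = len(log_e)
    return [
        sum(log_e[m] * math.cos(math.pi * n * (m + 0.5) / M) for m in range(M))
        for n in range(num_coeffs)
    ]

flat = [2.0] * 26                       # a perfectly flat log spectrum
coeffs = mfcc_from_log_energies(flat, 4)
print(coeffs)
```

For a flat log spectrum only c₀ is non-zero (it equals the sum of the log energies) and the higher coefficients vanish up to rounding error, which illustrates that c₁ and above respond to the shape of the log spectrum rather than to its overall level.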

The role of the DCT is to separate "envelope (vocal tract)" from "fine structure (vocal cords)" of the log spectrum. Low-order c_n encode the vocal tract shape; high-order c_n encode glottal vibration.

What is the MFCC Feature Simulator

🙋
I keep hearing about speech recognition, but the computer is not actually comparing the raw waveforms of words, right?
🎓
Good catch. Even the same vowel "a" looks completely different in raw waveform depending on the speaker, pitch and recording. So we transform it into a "feature vector" — a small set of numbers in which phonemes are easier to discriminate. MFCC, short for Mel-Frequency Cepstral Coefficients, is the classic and very powerful feature, and the simulator above shows the four stages — waveform, spectrum, mel-log, DCT — laid out as four stacked plots.
🙋
What is "mel"? Why not just use the ordinary frequency axis?
🎓
The ear resolves pitch differences finely at low frequencies and only coarsely at high frequencies. The step from 1000 Hz to 2000 Hz sounds large, but the step from 6000 Hz to 7000 Hz sounds much smaller. The mel scale is a non-linear axis that models this auditory behaviour, defined by $m(f) = 2595\,\log_{10}(1 + f/700)$. Look at the third panel: the filters are packed densely at low frequencies and spread out at high frequencies.
🙋
What is the final DCT for? After taking the logarithm we transform again?
🎓
The DCT plays two roles: compression and separation. The log mel spectrum mixes the vocal tract envelope with the fine structure of vocal-cord vibration. After the DCT, the slowly varying components (the vocal tract shape) sit in the low-order coefficients, while the rapidly varying components (the vocal cords) sit in the high-order ones. Phoneme discrimination depends mainly on the vocal tract shape, so the first 12 to 13 coefficients are usually enough. Increase L from 4 to 20 in the simulator and you can see that the high-order coefficients are small and carry little information.
🙋
I see. When I drop the number of mel filters M to 10 the bars in the filterbank panel get fewer. Is going too low a problem?
🎓
Yes. M controls spectral resolution. At 10 it is too coarse and fine phoneme distinctions get washed out. Push it up towards 40 and the resolution is high, but neighbouring filter outputs become strongly correlated, so each extra filter adds less new information for the downstream DCT. In practice 20 to 40 is the usual range, and 26 or 40 are particularly common. Sweeping M while watching the third and fourth panels gives a feel for the trade-off.

Frequently Asked Questions

Speech signals carry most of their energy at low frequencies because of the glottal source, with a roll-off of about 6 dB per octave at high frequencies. Pre-emphasis $y[n] = x[n] - \alpha x[n-1]$ (α ≈ 0.97) is a high-pass filter that compensates for this slope, so that the high-frequency phoneme cues — especially consonants — are not buried in the downstream spectral representation.
The DFT implicitly treats the frame as one period of a periodic signal, so cutting out a finite-length frame creates discontinuities at the boundaries that smear energy across neighbouring bins (spectral leakage). A Hamming window is a taper that smoothly lowers both ends of the frame to a small value (about 0.08 rather than exactly zero), which greatly reduces side-lobe leakage at the cost of a somewhat wider main lobe. The choice of window (Hamming, Hann, Blackman, etc.) trades main-lobe width against side-lobe level.
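The effect of the window on leakage can be checked numerically. The sketch below uses the textbook Hamming coefficients 0.54 and 0.46, a short 64-sample frame, a direct O(N²) DFT and a tone that falls halfway between two bins; the frame size, the tone and the `far_leakage` measure are assumptions chosen only to keep the demonstration small.

```python
import cmath
import math

def hamming(n_points):
    """w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def power_spectrum(frame):
    """|X[k]|^2 for k = 0..N/2 via a direct DFT (slow, but fine for a demo)."""
    n_points = len(frame)
    out = []
    for k in range(n_points // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n in range(n_points))
        out.append(abs(acc) ** 2)
    return out

def far_leakage(power, first_far_bin):
    """Fraction of the total power that lands in bins >= first_far_bin."""
    return sum(power[first_far_bin:]) / sum(power)

N_DEMO = 64
# 10.5 cycles per frame: the worst case for leakage, halfway between bins
tone = [math.sin(2 * math.pi * 10.5 * n / N_DEMO) for n in range(N_DEMO)]
window = hamming(N_DEMO)

rect_power = power_spectrum(tone)
hamm_power = power_spectrum([t * w for t, w in zip(tone, window)])

print(window[0], max(window))   # ends sit near 0.08, centre near 1
print(far_leakage(rect_power, 20), far_leakage(hamm_power, 20))
```

The second printed pair compares how much power ends up far from the tone with and without the window; the windowed frame leaks far less into distant bins.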
Triangles are widely adopted because they are computationally simple and easy to design so that adjacent filters overlap by half on the mel axis. Physically they approximate the band integration that the ear's "critical bands" perform, so each filter corresponds to one auditory channel. Replacing the shape with Gaussian or rectangular filters changes the result very little; the triangle is a practical compromise.
A different microphone, room reverberation, or channel response acts as a filter on the signal, which adds a roughly constant offset to the log spectrum and therefore shifts every MFCC coefficient. Cepstral mean and variance normalization (CMVN) subtracts the mean and divides by the standard deviation of each MFCC dimension over an utterance or a window of a few seconds, producing features that are far less sensitive to the channel. In production it is almost mandatory because it greatly improves robustness to mismatch between speakers and recording conditions.
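The per-dimension mean and variance normalization described above can be sketched as follows. Normalizing over the whole list of frames, using the population standard deviation, and adding the small `eps` guard are assumptions of this sketch.

```python
import math

def cmvn(frames, eps=1e-10):
    """Normalize each MFCC dimension to zero mean and unit variance.

    frames: list of equal-length MFCC vectors, one per frame.
    """
    num_frames = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dims)
    ]
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for f in frames]

clean = [[1.0, -2.0], [3.0, 0.0], [5.0, 2.0]]
# The same frames with a constant per-dimension offset, as a fixed channel would add
shifted = [[c0 + 7.0, c1 - 4.0] for c0, c1 in clean]

print(cmvn(clean))
print(cmvn(shifted))
```

Both calls print the same normalized values, since a constant offset in each dimension is removed by the mean subtraction.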

Real-World Applications

Automatic speech recognition (ASR): The standard front-end in the HMM-GMM era extracted 13-dimensional MFCC plus delta and delta-delta (39 dimensions in total) every 10 ms and fed it to the acoustic model. MFCC remains the default feature in widely used speech recognition toolkits such as Kaldi.
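A minimal sketch of how 13 static coefficients can be extended to 39 dimensions is given below. It uses the common regression formula for deltas; the window of ±2 frames, the edge handling (repeating the first and last frame) and the function names are assumptions, and individual toolkits differ in these details.

```python
def deltas(frames, width=2):
    """d_t = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum n^2).

    Frames beyond the edges are replaced by the first or last frame.
    """
    num_frames = len(frames)
    dims = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dims):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_with_deltas(mfcc_frames):
    """Concatenate static, delta and delta-delta features per frame."""
    d1 = deltas(mfcc_frames)
    d2 = deltas(d1)
    return [c + a + b for c, a, b in zip(mfcc_frames, d1, d2)]

# Ten frames of 13 coefficients that all rise by 1 per frame
ramp = [[float(t)] * 13 for t in range(10)]
features = stack_with_deltas(ramp)
print(len(features), len(features[0]))  # 10 39
```

For the ramp input the delta of every interior frame is 1, matching its slope of one unit per frame.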

Speaker recognition and verification: The vocal tract shape and vocal-cord characteristics of a speaker show up strongly in the MFCC envelope. From the classical GMM-UBM through i-vectors to modern x-vectors, many speaker-embedding methods use MFCC as their input feature. Wake-word detectors on phones, such as those for "OK Google" or "Hey Siri", typically rely on similar spectral features.

Music information processing: MFCCs are a basic feature in instrument timbre identification, genre classification and song similarity search. Some music-fingerprinting algorithms also combine MFCCs as auxiliary features. Python libraries such as Librosa provide MFCC extraction out of the box.

Anomaly detection and predictive maintenance: In CAE-adjacent fields where machine sound is monitored, MFCC is a useful way to compress a time waveform into a low-dimensional feature. Bearing faults in rotating machinery, corona discharges in power equipment, and engine knocking — tasks where humans "hear the difference" — are routinely automated with MFCC features.

Common Misconceptions and Cautions

The most common misconception is to think that "MFCC is the frequency spectrum itself". MFCC is a cepstrum: the DCT applied to the log of the mel filterbank energies. Its horizontal axis is not frequency but a time-like dimension called "quefrency". Low-order c_n describe the global shape of the log spectrum (vocal tract); high-order c_n describe rapid variations (vocal-cord pitch, noise). Compare the third panel (log E_m) with the fourth panel (c_n) in the simulator: although both are built from log energies, their horizontal axes mean completely different things, namely filter index (mel frequency) in the third and coefficient index (quefrency) in the fourth.

The next most common error is to assume that a longer frame N is always more accurate. The simulator fixes N = 512, which is 32 ms at F_s = 16000 Hz and close to the values typically used in speech processing. If N is too long, several phonemes or voiced/unvoiced transitions land in a single frame and the features become blurry. If it is too short, the DFT bin spacing F_s/N grows and frequency resolution suffers, which hurts most at low frequencies where the mel filters are narrow. In practice, a 20–30 ms frame with a 10 ms shift is the standard compromise that balances time and frequency resolution.
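The frame length and shift can be turned into concrete numbers with a short sketch. The 160-sample hop corresponds to the 10 ms shift at 16000 Hz; dropping trailing samples that do not fill a whole frame is an assumption of this sketch.

```python
def frame_signal(x, frame_len=512, hop=160):
    """Split x into overlapping frames; an incomplete last frame is dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

FS = 16000
one_second = [0.0] * FS            # placeholder signal, one second long
frames = frame_signal(one_second)

print(len(frames))                 # 97
print(1000 * 512 / FS, "ms per frame;", FS / 512, "Hz per DFT bin")
```

One second of audio therefore yields 97 frames of 32 ms each, and each DFT bin is 31.25 Hz wide. Halving N would halve the frame duration but double the bin spacing.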

Finally, do not assume that MFCC is a universal speech feature. MFCC is a classic feature alongside linear prediction and the Fourier transform, but in recent end-to-end neural speech recognition, log-mel spectrograms (the same pipeline without the DCT) and even raw waveforms are increasingly fed in directly. The dimensionality reduction by the DCT trades computational savings and a smaller parameter count against the loss of high-order information. Whether MFCC, log-mel or raw waveform is best depends on the task, the compute budget and the model architecture.