Assumed parameters: sampling frequency F_s = 16000 Hz, frame length N = 512, and pre-emphasis coefficient α = 0.97. Noise is generated by a deterministic linear congruential generator (LCG).
Top to bottom: input signal x[n] (blue) / power spectrum |X[k]|² (green) / mel filterbank energies log E_m (orange) / MFCC coefficients c_n (red)
Mel-frequency cepstral coefficients (MFCCs) give a compact representation of a speech frame by combining a mel scale that mimics human hearing with cepstral analysis. The pipeline is pre-emphasis → window → DFT → mel filter bank → log → DCT.
Pre-emphasis to boost high frequencies (α ≈ 0.97):
$$y[n] = x[n] - \alpha\,x[n-1]$$
Take the DFT of N Hamming-windowed samples and form the power spectrum $|X[k]|^2$. Mel-scale conversion:
$$m(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$
Place M triangular mel filters $H_m$ uniformly on the mel axis and form the log filterbank energies:
$$\log E_m = \log\!\left(\sum_{k} H_m[k]\,|X[k]|^2\right)$$
Finally extract cepstral coefficients with the DCT:
$$c_n = \sum_{m=0}^{M-1} \log E_m \cdot \cos\!\left(\frac{\pi n (m+\tfrac{1}{2})}{M}\right)$$
The DCT separates the smooth envelope of the log spectrum (vocal tract) from its fine structure (vocal cords). Low-order c_n vary slowly across the mel axis and encode the vocal tract shape; high-order c_n vary rapidly and encode glottal vibration.
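The whole pipeline for a single frame can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the plots: F_s, N and α come from the text, while the filter count `M = 26`, the number of kept coefficients `N_CEPS = 13`, the function names, the zero initial condition for pre-emphasis, the small floor inside the logarithm, and the two-sine test signal (used in place of the LCG noise) are all choices made here for illustration.

```python
import numpy as np

FS = 16000      # sampling frequency in Hz (from the text)
N = 512         # frame length (from the text)
ALPHA = 0.97    # pre-emphasis coefficient (from the text)
M = 26          # number of mel filters: assumed, not given in the text
N_CEPS = 13     # number of coefficients kept: assumed, not given in the text


def hz_to_mel(f):
    """m(f) = 2595 log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are uniformly spaced on the mel axis."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 points: filter m spans points m, m + 1, m + 2.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bin_hz = np.arange(n_bins) * fs / n_fft
    H = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        # The triangle is the lower of the two slopes, clipped at zero.
        H[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H


def mfcc_frame(x, n_ceps=N_CEPS):
    """Return the first n_ceps coefficients c_n for one frame x."""
    # Pre-emphasis y[n] = x[n] - alpha * x[n-1], assuming x[-1] = 0.
    y = np.concatenate(([x[0]], x[1:] - ALPHA * x[:-1]))
    # Hamming window, DFT of the real frame, power spectrum |X[k]|^2.
    power = np.abs(np.fft.rfft(y * np.hamming(len(y)))) ** 2
    # Log filterbank energies; the 1e-12 floor is an added guard against log(0).
    H = mel_filterbank(M, len(y), FS)
    log_E = np.log(H @ power + 1e-12)
    # Unnormalised DCT-II: c_n = sum_m log_E[m] * cos(pi * n * (m + 1/2) / M).
    n = np.arange(n_ceps)[:, None]
    m = np.arange(M)[None, :]
    return np.cos(np.pi * n * (m + 0.5) / M) @ log_E


# Hypothetical test frame: a 440 Hz tone plus a weaker 880 Hz harmonic.
t = np.arange(N) / FS
x = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 880.0 * t)
c = mfcc_frame(x)
print(c.shape)  # (13,)
```

Because the cosine term equals 1 for n = 0, `c[0]` is simply the sum of the log filterbank energies, which is why it tracks overall frame loudness rather than spectral shape.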