Mel Frequency Cepstral Coefficients approximation without FFT

Mel Frequency Cepstral Coefficients (MFCC) are commonly used as features for speech recognition that are based on our understanding of the human ear's response to pitch. They are normally calculated using the the FFT of an incoming signal. In this Notebook I try to find a way to create a Mel filter bank which does not require the FFT.

A quick introduction to the MFCC algorithm

The steps to get the MFCC are as shown.

  1. Take the Fourier transform of (a windowed excerpt of) a signal.
  2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows (usually 26 such windows).
  3. Take the logs of the powers at each of the mel frequencies.
  4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
  5. The MFCCs are the amplitudes of the resulting spectrum.

For a more indepth discussion of MFCCs take a look at this page.

Constructing Triangular Filter Banks

FFTs are expensive on microcontrollers, lets try to eliminate this step. A good starting point is to look at the triangular filters. Lets try to find a way to create similar digital filters. Triangles are essentially made of two different lines. So how do we create a filter that gives us a linear response to frequency? The answer is simple: Differentiation. Imagine a signal $$ f(x) $$, we can write this as a fourier series like so: $$ f(x) = \Sigma_{n=0}^\infty \phi_nsin2\pi nx $$ Therefore $$ \frac{d}{dx}f(x) = \Sigma_{n=0}^\infty 2\pi n\phi_ncos2\pi nx $$ Thus differentiation of a signal gives us a response where the amplitude of the output is proportional to its frequency (Note: differentiation does affect the phase but phase information is unimportant in this case). Adjusting the slope of this response is a simple matter of multiplication of the original signal. How about adjusting the x intercept of the response? Well lets look at this problem from the fourier domain. In the fourier domain we'd like to construct a filter with the following response: $$ y = m\omega + b $$ Lets take the inverse fourier transform of this expression: $$ \mathcal{F}^{-1}\{y\} (t)= \mathcal{F}^{-1}\{m\omega\}(t) + \mathcal{F}^{-1}\{b\}(t)$$ We already discussed how we can find a way of making the frequency response proportional the frequency of the signal component. Thus the only thing we need to worry about is the b term. The good news is $$\mathcal{F}^{-1}\{b\}(t) = b\delta (t)$$ i.e the Dirac delta function. Unfortunately for us converting the dirac delta to a discrete function is not quite possible. However, we could just use white noise to model the frequency response of the dirac delta function. Now the triangles are made of two lines that start at some frequency and end at some other frequency

In [351]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def fill_below_intersection(x, S, Z):
    """
    fill the region below the intersection of S and Z
    """
    #find the intersection point
    ind = np.nonzero( np.absolute(S-Z)==min(np.absolute(S-Z)))[0]
    # compute a new curve which we will fill below
    Y = np.zeros(S.shape, dtype=np.float)
    Y[:ind[0]] = S[:ind[0]]  # Y is S up to the intersection
    Y[ind[0]:] = Z[ind[0]:]  # and Z beyond it
    plt.fill(x, Y, facecolor='red', alpha=0.5)

x = np.linspace(0, 3, 500)
plt.plot(x, x)
plt.plot(x, -x+3)
plt.xlabel("Frequency")
plt.ylabel("Power")
plt.title('A Simple Triangular Filter')
fill_below_intersection(x,x,-x+3)
plt.show()

As you can see from the diagram above we are interested in the red colored region. In order to do that we can break this filter into two different parts the first is the area from the left all the way to the intersection as belonging to the blue line and the subsequent part belongin to the red line. What we need is a nice band pass filter between the frequencies 0 and 1.5 and another filter between 1.5 to 3.0.

So to create a simple triangular filter we can describe the processing the following manner.

  1. Create two band pass filters A and B. One from the lower cut off frequency to the peak frequency another from the peak frequency to the upper cut off frequency.
  2. Differentiate the resulting signal from both filters.
  3. Add white noise of appropriate amplitude to translate the frequency response lines.

Now you have a way to create a triangular filter.

Trying our new filter out

Ok... Lets give this filter a try! I'm going to create a signal made of several frequencies:

In [352]:
plt.clf()
x = np.linspace(0, 3, 500)
#signal = np.sin(x*27*2*np.pi+3)
for i in range(1,60):
    signal += np.sin(x*i*np.pi+np.random.rand())
plt.plot(x, signal)
plt.xlabel("Time")
plt.ylabel("Power")
plt.show()
In [353]:
plt.clf()
fourier_unfiltered = np.abs(np.fft.fft(signal)**2)
freqs = np.fft.fftfreq(fourier_unfiltered.size, 3/500)
idx = np.argsort(freqs)[250:]
plt.plot(freqs[idx], fourier_unfiltered[idx])
Out[353]:
[<matplotlib.lines.Line2D at 0x116ef9ac8>]

OK, now lets give the triangle window a trial. Lets make our triangle filter peak around 2π rad/s and start at 0 rad/s, ending at 4π rad/s.

Step 1: Create the bandpass filters

Here I use butterworth filters to do the job. But you could also use other filters such as chebyshev. These filters are of course not ideal so what we get is sort of an approximation of MFCCs.

In [354]:
from scipy.signal import butter, lfilter, freqz

def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

lowercut = butter_lowpass_filter(signal,2*np.pi,500/3)
uppercut = butter_lowpass_filter(signal,4*np.pi,500/3)-lowercut
plt.clf()
plt.plot(x,lowercut,label = "0-2πrad/s")
plt.plot(x,uppercut,label = "2π-4πrad/s")
plt.plot(x,signal,label = "original")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.show()

Step 2: Differentiation

In [355]:
dx =np.gradient(x)
lowergrad = np.gradient(lowercut,dx,edge_order=2)
uppergrad = np.gradient(uppercut,dx,edge_order=2)
plt.clf()
plt.plot(x,lowergrad,label = "0-2πrad/s")
plt.plot(x,uppergrad,label = "2π-4πrad/s")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.title("Gradient of lower and upper frequencies")
plt.show()

Notice the higher frequencies are Much more amplified just as we discussed before.

Step 3: White Noise

This is a tricky step we need to keep track of the amount of power added by the white noise.

In [364]:
def fftnoise(f):
    f = np.array(f, dtype='complex')
    Np = (len(f) - 1) // 2
    phases = np.random.rand(Np) * 2 * np.pi
    phases = np.cos(phases) + 1j * np.sin(phases)
    f[1:Np+1] *= phases
    f[-1:-1-Np:-1] = np.conj(f[1:Np+1])
    return np.fft.ifft(f).real
def band_limited_noise(min_freq, max_freq, samples=1024, samplerate=1):
    freqs = np.abs(np.fft.fftfreq(samples, 1/samplerate))
    f = np.zeros(samples)
    idx = np.where(np.logical_and(freqs>=min_freq, freqs<=max_freq))[0]
    f[idx] = 1
    return fftnoise(f)
example_whitenoise = fftnoise(np.ones(500,dtype=np.complex))
plt.clf()
plt.plot(x,example_whitenoise)
plt.xlabel("time (s)")
plt.title("White Noise Sample")
plt.show()

Let's check the power of the sample of white noise:

In [365]:
power_wn = np.sum(np.abs(example_whitenoise))/500
power_wn
Out[365]:
0.0359128026590095

And the spectrum

In [366]:
plt.clf()
fourier_whitenoise = np.abs(np.fft.fft(example_whitenoise)**2)
freqs = np.fft.fftfreq(fourier_whitenoise.size, 3/500)
idx = np.argsort(freqs)[250:]
plt.plot(freqs[idx], fourier_whitenoise[idx])
Out[366]:
[<matplotlib.lines.Line2D at 0x1042e5fd0>]

Lets now shape the filters and add the white noise. Recall the first plot. We want a filter with the following specifications:

  • lower cutoff: 0 rad/s
  • upper cutoff: 2π rad/s
  • peak frequency: 4π rad/s
In [367]:
x_c = np.linspace(0, 4*np.pi, 500)
plt.clf()
plt.plot(x_c, x_c)
plt.plot(x_c, -x_c+4*np.pi)
plt.xlabel("Frequency (rad/s)")
plt.ylabel("Power")
plt.title('A Simple Triangular Filter')
fill_below_intersection(x_c,x_c,-x_c+4*np.pi)
plt.show()

So the green line has a y intercept of 4π. Thus we need to scale the power up to 4π.

In [368]:
scale_factor = 8*np.pi/power_wn
scale_factor
Out[368]:
699.82678509813422
In [378]:
scaled_whitenoise=example_whitenoise*scale_factor
filtered_result = scaled_whitenoise*5-uppergrad+lowergrad
plt.clf()
plt.plot(x,filtered_result,label="triangle filter")
plt.plot(x,signal,label="unfiltered")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.show()

Looks quite noisy but lets look at the fourier response:

In [379]:
plt.clf()
fourier_final = np.abs(np.fft.fft(filtered_result)**2)
freqs = np.fft.fftfreq(fourier_final.size, 3/500)
idx = np.argsort(freqs)[250:]
ax1 = plt.subplot(211)
ax1.plot(freqs[idx], fourier_final[idx], label = "filtered")
ax1.set_xlabel("Frequency (rad/s)")
ax1.legend()

ax2 = plt.subplot(212)
ax2.plot(freqs[idx], fourier_unfiltered[idx], label = "original")
ax2.set_xlabel("Frequency (rad/s)")
ax2.legend()
plt.show()

It Works! as you can see the frequency response of the peak at 27 rad/s is completely subdued. In fact the peak that is most prominent at 2πrad/s, followed by a smaller peak at 5 rad/s, demonstrating the triangular nature of the filter. Theres a bit of leakage but thats you're butterworth filter at work. Nothing to be done about it :(.

Conclusion

Here we developed a way to estimate a triangular filter bank which could be used in calculating the MFCCs of a signal. However there are a few glaring mistakes. Firstly, most implmentations of a FFT use a periodogram, not a a normal fourier spectra to calculate the coefficients. Secondly the butterworth filter here is in itself a bit expensive to implement and leads to a fair deal of leakage.