Mel Frequency Cepstral Coefficients approximation without FFT¶

Mel Frequency Cepstral Coefficients (MFCC) are commonly used as features for speech recognition that are based on our understanding of the human ear's response to pitch. They are normally calculated using the the FFT of an incoming signal. In this Notebook I try to find a way to create a Mel filter bank which does not require the FFT.

A quick introduction to the MFCC algorithm¶

The steps to get the MFCC are as shown.

1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows (usually 26 such windows).
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

For a more indepth discussion of MFCCs take a look at this page.

Constructing Triangular Filter Banks¶

FFTs are expensive on microcontrollers, lets try to eliminate this step. A good starting point is to look at the triangular filters. Lets try to find a way to create similar digital filters. Triangles are essentially made of two different lines. So how do we create a filter that gives us a linear response to frequency? The answer is simple: Differentiation. Imagine a signal $$f(x)$$, we can write this as a fourier series like so: $$f(x) = \Sigma_{n=0}^\infty \phi_nsin2\pi nx$$ Therefore $$\frac{d}{dx}f(x) = \Sigma_{n=0}^\infty 2\pi n\phi_ncos2\pi nx$$ Thus differentiation of a signal gives us a response where the amplitude of the output is proportional to its frequency (Note: differentiation does affect the phase but phase information is unimportant in this case). Adjusting the slope of this response is a simple matter of multiplication of the original signal. How about adjusting the x intercept of the response? Well lets look at this problem from the fourier domain. In the fourier domain we'd like to construct a filter with the following response: $$y = m\omega + b$$ Lets take the inverse fourier transform of this expression: $$\mathcal{F}^{-1}\{y\} (t)= \mathcal{F}^{-1}\{m\omega\}(t) + \mathcal{F}^{-1}\{b\}(t)$$ We already discussed how we can find a way of making the frequency response proportional the frequency of the signal component. Thus the only thing we need to worry about is the b term. The good news is $$\mathcal{F}^{-1}\{b\}(t) = b\delta (t)$$ i.e the Dirac delta function. Unfortunately for us converting the dirac delta to a discrete function is not quite possible. However, we could just use white noise to model the frequency response of the dirac delta function. Now the triangles are made of two lines that start at some frequency and end at some other frequency

In [351]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def fill_below_intersection(x, S, Z):
"""
fill the region below the intersection of S and Z
"""
#find the intersection point
ind = np.nonzero( np.absolute(S-Z)==min(np.absolute(S-Z)))[0]
# compute a new curve which we will fill below
Y = np.zeros(S.shape, dtype=np.float)
Y[:ind[0]] = S[:ind[0]]  # Y is S up to the intersection
Y[ind[0]:] = Z[ind[0]:]  # and Z beyond it
plt.fill(x, Y, facecolor='red', alpha=0.5)

x = np.linspace(0, 3, 500)
plt.plot(x, x)
plt.plot(x, -x+3)
plt.xlabel("Frequency")
plt.ylabel("Power")
plt.title('A Simple Triangular Filter')
fill_below_intersection(x,x,-x+3)
plt.show()


As you can see from the diagram above we are interested in the red colored region. In order to do that we can break this filter into two different parts the first is the area from the left all the way to the intersection as belonging to the blue line and the subsequent part belongin to the red line. What we need is a nice band pass filter between the frequencies 0 and 1.5 and another filter between 1.5 to 3.0.

So to create a simple triangular filter we can describe the processing the following manner.

1. Create two band pass filters A and B. One from the lower cut off frequency to the peak frequency another from the peak frequency to the upper cut off frequency.
2. Differentiate the resulting signal from both filters.
3. Add white noise of appropriate amplitude to translate the frequency response lines.

Now you have a way to create a triangular filter.

Trying our new filter out¶

Ok... Lets give this filter a try! I'm going to create a signal made of several frequencies:

In [352]:
plt.clf()
x = np.linspace(0, 3, 500)
#signal = np.sin(x*27*2*np.pi+3)
for i in range(1,60):
signal += np.sin(x*i*np.pi+np.random.rand())
plt.plot(x, signal)
plt.xlabel("Time")
plt.ylabel("Power")
plt.show()

In [353]:
plt.clf()
fourier_unfiltered = np.abs(np.fft.fft(signal)**2)
freqs = np.fft.fftfreq(fourier_unfiltered.size, 3/500)
idx = np.argsort(freqs)[250:]
plt.plot(freqs[idx], fourier_unfiltered[idx])

Out[353]:
[<matplotlib.lines.Line2D at 0x116ef9ac8>]

OK, now lets give the triangle window a trial. Lets make our triangle filter peak around 2π rad/s and start at 0 rad/s, ending at 4π rad/s.

Step 1: Create the bandpass filters¶

Here I use butterworth filters to do the job. But you could also use other filters such as chebyshev. These filters are of course not ideal so what we get is sort of an approximation of MFCCs.

In [354]:
from scipy.signal import butter, lfilter, freqz

def butter_lowpass(cutoff, fs, order=5):
nyq = 0.5 * fs
normal_cutoff = cutoff / nyq
b, a = butter(order, normal_cutoff, btype='low', analog=False)
return b, a

def butter_lowpass_filter(data, cutoff, fs, order=5):
b, a = butter_lowpass(cutoff, fs, order=order)
y = lfilter(b, a, data)
return y

lowercut = butter_lowpass_filter(signal,2*np.pi,500/3)
uppercut = butter_lowpass_filter(signal,4*np.pi,500/3)-lowercut
plt.clf()
plt.plot(x,signal,label = "original")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.show()


Step 2: Differentiation¶

In [355]:
dx =np.gradient(x)
plt.clf()
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.title("Gradient of lower and upper frequencies")
plt.show()


Notice the higher frequencies are Much more amplified just as we discussed before.

Step 3: White Noise¶

This is a tricky step we need to keep track of the amount of power added by the white noise.

In [364]:
def fftnoise(f):
f = np.array(f, dtype='complex')
Np = (len(f) - 1) // 2
phases = np.random.rand(Np) * 2 * np.pi
phases = np.cos(phases) + 1j * np.sin(phases)
f[1:Np+1] *= phases
f[-1:-1-Np:-1] = np.conj(f[1:Np+1])
return np.fft.ifft(f).real
def band_limited_noise(min_freq, max_freq, samples=1024, samplerate=1):
freqs = np.abs(np.fft.fftfreq(samples, 1/samplerate))
f = np.zeros(samples)
idx = np.where(np.logical_and(freqs>=min_freq, freqs<=max_freq))[0]
f[idx] = 1
return fftnoise(f)
example_whitenoise = fftnoise(np.ones(500,dtype=np.complex))
plt.clf()
plt.plot(x,example_whitenoise)
plt.xlabel("time (s)")
plt.title("White Noise Sample")
plt.show()


Let's check the power of the sample of white noise:

In [365]:
power_wn = np.sum(np.abs(example_whitenoise))/500
power_wn

Out[365]:
0.0359128026590095

And the spectrum

In [366]:
plt.clf()
fourier_whitenoise = np.abs(np.fft.fft(example_whitenoise)**2)
freqs = np.fft.fftfreq(fourier_whitenoise.size, 3/500)
idx = np.argsort(freqs)[250:]
plt.plot(freqs[idx], fourier_whitenoise[idx])

Out[366]:
[<matplotlib.lines.Line2D at 0x1042e5fd0>]

Lets now shape the filters and add the white noise. Recall the first plot. We want a filter with the following specifications:

In [367]:
x_c = np.linspace(0, 4*np.pi, 500)
plt.clf()
plt.plot(x_c, x_c)
plt.plot(x_c, -x_c+4*np.pi)
plt.ylabel("Power")
plt.title('A Simple Triangular Filter')
fill_below_intersection(x_c,x_c,-x_c+4*np.pi)
plt.show()


So the green line has a y intercept of 4π. Thus we need to scale the power up to 4π.

In [368]:
scale_factor = 8*np.pi/power_wn
scale_factor

Out[368]:
699.82678509813422
In [378]:
scaled_whitenoise=example_whitenoise*scale_factor
plt.clf()
plt.plot(x,filtered_result,label="triangle filter")
plt.plot(x,signal,label="unfiltered")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.show()


Looks quite noisy but lets look at the fourier response:

In [379]:
plt.clf()
fourier_final = np.abs(np.fft.fft(filtered_result)**2)
freqs = np.fft.fftfreq(fourier_final.size, 3/500)
idx = np.argsort(freqs)[250:]
ax1 = plt.subplot(211)
ax1.plot(freqs[idx], fourier_final[idx], label = "filtered")
ax1.legend()

ax2 = plt.subplot(212)
ax2.plot(freqs[idx], fourier_unfiltered[idx], label = "original")