Mel Frequency Cepstral Coefficients (MFCC) are commonly used as features for speech recognition that are based on our understanding of the human ear's response to pitch. They are normally calculated using the the FFT of an incoming signal. In this Notebook I try to find a way to create a Mel filter bank which does not require the FFT.
The steps to get the MFCC are as shown.
For a more indepth discussion of MFCCs take a look at this page.
FFTs are expensive on microcontrollers, lets try to eliminate this step. A good starting point is to look at the triangular filters. Lets try to find a way to create similar digital filters. Triangles are essentially made of two different lines. So how do we create a filter that gives us a linear response to frequency? The answer is simple: Differentiation. Imagine a signal $$ f(x) $$, we can write this as a fourier series like so: $$ f(x) = \Sigma_{n=0}^\infty \phi_nsin2\pi nx $$ Therefore $$ \frac{d}{dx}f(x) = \Sigma_{n=0}^\infty 2\pi n\phi_ncos2\pi nx $$ Thus differentiation of a signal gives us a response where the amplitude of the output is proportional to its frequency (Note: differentiation does affect the phase but phase information is unimportant in this case). Adjusting the slope of this response is a simple matter of multiplication of the original signal. How about adjusting the x intercept of the response? Well lets look at this problem from the fourier domain. In the fourier domain we'd like to construct a filter with the following response: $$ y = m\omega + b $$ Lets take the inverse fourier transform of this expression: $$ \mathcal{F}^{-1}\{y\} (t)= \mathcal{F}^{-1}\{m\omega\}(t) + \mathcal{F}^{-1}\{b\}(t)$$ We already discussed how we can find a way of making the frequency response proportional the frequency of the signal component. Thus the only thing we need to worry about is the b term. The good news is $$\mathcal{F}^{-1}\{b\}(t) = b\delta (t)$$ i.e the Dirac delta function. Unfortunately for us converting the dirac delta to a discrete function is not quite possible. However, we could just use white noise to model the frequency response of the dirac delta function. Now the triangles are made of two lines that start at some frequency and end at some other frequency
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def fill_below_intersection(x, S, Z):
"""
fill the region below the intersection of S and Z
"""
#find the intersection point
ind = np.nonzero( np.absolute(S-Z)==min(np.absolute(S-Z)))[0]
# compute a new curve which we will fill below
Y = np.zeros(S.shape, dtype=np.float)
Y[:ind[0]] = S[:ind[0]] # Y is S up to the intersection
Y[ind[0]:] = Z[ind[0]:] # and Z beyond it
plt.fill(x, Y, facecolor='red', alpha=0.5)
x = np.linspace(0, 3, 500)
plt.plot(x, x)
plt.plot(x, -x+3)
plt.xlabel("Frequency")
plt.ylabel("Power")
plt.title('A Simple Triangular Filter')
fill_below_intersection(x,x,-x+3)
plt.show()
As you can see from the diagram above we are interested in the red colored region. In order to do that we can break this filter into two different parts the first is the area from the left all the way to the intersection as belonging to the blue line and the subsequent part belongin to the red line. What we need is a nice band pass filter between the frequencies 0 and 1.5 and another filter between 1.5 to 3.0.
So to create a simple triangular filter we can describe the processing the following manner.
Now you have a way to create a triangular filter.
Ok... Lets give this filter a try! I'm going to create a signal made of several frequencies:
plt.clf()
x = np.linspace(0, 3, 500)
#signal = np.sin(x*27*2*np.pi+3)
for i in range(1,60):
signal += np.sin(x*i*np.pi+np.random.rand())
plt.plot(x, signal)
plt.xlabel("Time")
plt.ylabel("Power")
plt.show()
plt.clf()
fourier_unfiltered = np.abs(np.fft.fft(signal)**2)
freqs = np.fft.fftfreq(fourier_unfiltered.size, 3/500)
idx = np.argsort(freqs)[250:]
plt.plot(freqs[idx], fourier_unfiltered[idx])
OK, now lets give the triangle window a trial. Lets make our triangle filter peak around 2π rad/s and start at 0 rad/s, ending at 4π rad/s.
Here I use butterworth filters to do the job. But you could also use other filters such as chebyshev. These filters are of course not ideal so what we get is sort of an approximation of MFCCs.
from scipy.signal import butter, lfilter, freqz
def butter_lowpass(cutoff, fs, order=5):
nyq = 0.5 * fs
normal_cutoff = cutoff / nyq
b, a = butter(order, normal_cutoff, btype='low', analog=False)
return b, a
def butter_lowpass_filter(data, cutoff, fs, order=5):
b, a = butter_lowpass(cutoff, fs, order=order)
y = lfilter(b, a, data)
return y
lowercut = butter_lowpass_filter(signal,2*np.pi,500/3)
uppercut = butter_lowpass_filter(signal,4*np.pi,500/3)-lowercut
plt.clf()
plt.plot(x,lowercut,label = "0-2πrad/s")
plt.plot(x,uppercut,label = "2π-4πrad/s")
plt.plot(x,signal,label = "original")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.show()
dx =np.gradient(x)
lowergrad = np.gradient(lowercut,dx,edge_order=2)
uppergrad = np.gradient(uppercut,dx,edge_order=2)
plt.clf()
plt.plot(x,lowergrad,label = "0-2πrad/s")
plt.plot(x,uppergrad,label = "2π-4πrad/s")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.title("Gradient of lower and upper frequencies")
plt.show()
Notice the higher frequencies are Much more amplified just as we discussed before.
This is a tricky step we need to keep track of the amount of power added by the white noise.
def fftnoise(f):
f = np.array(f, dtype='complex')
Np = (len(f) - 1) // 2
phases = np.random.rand(Np) * 2 * np.pi
phases = np.cos(phases) + 1j * np.sin(phases)
f[1:Np+1] *= phases
f[-1:-1-Np:-1] = np.conj(f[1:Np+1])
return np.fft.ifft(f).real
def band_limited_noise(min_freq, max_freq, samples=1024, samplerate=1):
freqs = np.abs(np.fft.fftfreq(samples, 1/samplerate))
f = np.zeros(samples)
idx = np.where(np.logical_and(freqs>=min_freq, freqs<=max_freq))[0]
f[idx] = 1
return fftnoise(f)
example_whitenoise = fftnoise(np.ones(500,dtype=np.complex))
plt.clf()
plt.plot(x,example_whitenoise)
plt.xlabel("time (s)")
plt.title("White Noise Sample")
plt.show()
Let's check the power of the sample of white noise:
power_wn = np.sum(np.abs(example_whitenoise))/500
power_wn
And the spectrum
plt.clf()
fourier_whitenoise = np.abs(np.fft.fft(example_whitenoise)**2)
freqs = np.fft.fftfreq(fourier_whitenoise.size, 3/500)
idx = np.argsort(freqs)[250:]
plt.plot(freqs[idx], fourier_whitenoise[idx])
Lets now shape the filters and add the white noise. Recall the first plot. We want a filter with the following specifications:
x_c = np.linspace(0, 4*np.pi, 500)
plt.clf()
plt.plot(x_c, x_c)
plt.plot(x_c, -x_c+4*np.pi)
plt.xlabel("Frequency (rad/s)")
plt.ylabel("Power")
plt.title('A Simple Triangular Filter')
fill_below_intersection(x_c,x_c,-x_c+4*np.pi)
plt.show()
So the green line has a y intercept of 4π. Thus we need to scale the power up to 4π.
scale_factor = 8*np.pi/power_wn
scale_factor
scaled_whitenoise=example_whitenoise*scale_factor
filtered_result = scaled_whitenoise*5-uppergrad+lowergrad
plt.clf()
plt.plot(x,filtered_result,label="triangle filter")
plt.plot(x,signal,label="unfiltered")
plt.legend()
plt.xlabel("time (s)")
plt.ylabel("power")
plt.show()
Looks quite noisy but lets look at the fourier response:
plt.clf()
fourier_final = np.abs(np.fft.fft(filtered_result)**2)
freqs = np.fft.fftfreq(fourier_final.size, 3/500)
idx = np.argsort(freqs)[250:]
ax1 = plt.subplot(211)
ax1.plot(freqs[idx], fourier_final[idx], label = "filtered")
ax1.set_xlabel("Frequency (rad/s)")
ax1.legend()
ax2 = plt.subplot(212)
ax2.plot(freqs[idx], fourier_unfiltered[idx], label = "original")
ax2.set_xlabel("Frequency (rad/s)")
ax2.legend()
plt.show()
It Works! as you can see the frequency response of the peak at 27 rad/s is completely subdued. In fact the peak that is most prominent at 2πrad/s, followed by a smaller peak at 5 rad/s, demonstrating the triangular nature of the filter. Theres a bit of leakage but thats you're butterworth filter at work. Nothing to be done about it :(.
Here we developed a way to estimate a triangular filter bank which could be used in calculating the MFCCs of a signal. However there are a few glaring mistakes. Firstly, most implmentations of a FFT use a periodogram, not a a normal fourier spectra to calculate the coefficients. Secondly the butterworth filter here is in itself a bit expensive to implement and leads to a fair deal of leakage.