13
$\begingroup$

I am using the Average Magnitude Difference Function to estimate the fundamental frequency of a quasi-periodic audio signal. The AMDF is defined as

$$ D_n = \frac{1}{N-n}\sum_{k=n}^{N-1}|S_k - S_{k-n}| $$

where $N$ is the length of the signal. This function exhibits a minimum when the signal is shifted by an amount equal to its period.
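The `amdf` function used below is not shown in the question. A minimal sketch that follows the definition above literally might look like this; the sample rate, the test tone, `maxLag` and the search range are assumptions chosen only for the demo:

```matlab
% Sketch of an AMDF following the definition above (0-based lag n,
% stored at index n+1 because Matlab arrays are 1-based).
fs = 8000;                       % sample rate in Hz (assumed for this demo)
t  = (0:1023)' / fs;
s  = sin(2*pi*100*t);            % 100 Hz test tone, period = 80 samples

N      = numel(s);
maxLag = 200;                    % largest lag to evaluate (assumed)
D      = zeros(maxLag + 1, 1);
for n = 0:maxLag
    % mean of |S_k - S_{k-n}| over the N-n overlapping samples
    D(n + 1) = mean(abs(s(n+1:N) - s(1:N-n)));
end

[~, idx] = min(D(41:121));       % search lags 40..120, away from n = 0
lagAtMin = idx + 39;             % convert the index back to a 0-based lag
```

For this pure tone the deepest minimum in the searched range sits at a lag of one period (80 samples).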

This is the code I am using to extract the pitch (in Matlab):

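 a = amdf(f);                  % AMDF of the frame f
 a = a / max(a);               % normalise so the largest value is 1
 % minima of the AMDF are peaks of -a; l holds their indices
 [p, l] = findpeaks(-a, 'MinPeakProminence', 0.6);
 pitch = round(sample_freq / l(1));   % first minimum taken as the period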

However, I am dealing with an audio signal where the fundamental frequency is very low:

[Figure: spectrum of the audio signal]

As a consequence, a pitch doubling problem arises: the detected minimum corresponds to half the period of the signal (i.e. the second harmonic):

[Figure: AMDF of the signal above]

I tried to extract the largest peak and not just the first, but sometimes the problem remains. How can I improve my code and/or the AMDF function in order to deal with a low fundamental?

$\endgroup$
  • $\begingroup$ Psycho-acoustics and human perception influence perceived pitch and octave uncertainty. It may require experimentation to determine under what conditions the largest AMDF peak makes an audible difference. $\endgroup$
    – hotpaw2
    Commented Apr 7, 2016 at 17:36
  • $\begingroup$ How low are your frequencies? Is there an example I could listen to? $\endgroup$
    – ederwander
    Commented Apr 10, 2016 at 13:56

2 Answers

15
$\begingroup$

This is what we call, in the pitch-detection biz, the "octave problem".

First of all, I would change the AMDF to ASDF. And I would not reduce the window size as the lag increases. (Also, I am changing notation to what I consider to be more conventional. "$x[n]$" is a discrete-time signal.)

The Average Squared Difference Function (ASDF) of $x[n]$ in the neighborhood of sample $x[n_0]$ is:

$$ Q_x[k, n_0] \triangleq \frac{1}{N} \sum\limits_{n=0}^{N-1} \left(x[n+n_0-\left\lfloor \tfrac{N+k}{2}\right\rfloor] \ - \ x[n+n_0-\left\lfloor \tfrac{N+k}{2}\right\rfloor + k] \right)^2 $$

$\left\lfloor \cdot \right\rfloor$ is the floor() function and, if $k$ is even then $ \left\lfloor \frac{k}{2}\right\rfloor = \left\lfloor \frac{k+1}{2}\right\rfloor = \frac{k}{2} $.

Now, expand the square and consider what the summations look like as $N \to \infty$ (not that $N$ is going to infinity, but to give you an idea if $N$ is large). The ASDF is directly related to the autocorrelation. It is essentially the autocorrelation turned upside down. These steps I will leave to you. Take a look at this answer.
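For reference, a sketch of those steps, assuming $N$ is large and $x[n]$ is stationary enough over the window that the two shifted segments have about the same mean power. Writing $m = n + n_0 - \left\lfloor \tfrac{N+k}{2}\right\rfloor$:

$$\begin{align} Q_x[k, n_0] & = \frac{1}{N}\sum\limits_{n=0}^{N-1} x[m]^2 \ + \ \frac{1}{N}\sum\limits_{n=0}^{N-1} x[m+k]^2 \ - \ \frac{2}{N}\sum\limits_{n=0}^{N-1} x[m]\,x[m+k] \\ & \approx 2 R_x[0, n_0] - 2 R_x[k, n_0] \\ \end{align}$$

Here the first two sums each approximate the mean power near $n_0$ (call it $R_x[0,n_0]$) and the cross term is the usual windowed autocorrelation estimate at lag $k$ (call it $R_x[k,n_0]$). Rearranging gives $R_x[k,n_0] \approx R_x[0,n_0] - \tfrac12 Q_x[k,n_0]$, which is the ASDF flipped upside down and offset.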

So now consider this finite-length "autocorrelation" (in the neighborhood of sample $x[n_0]$) defined from the ASDF:

$$ R_x[k,n_0] = R_x[0,n_0] - \tfrac12 Q_x[k, n_0] $$

where

$$ R_x[0, n_0] \triangleq \frac{1}{N} \sum\limits_{n=0}^{N-1} \Big(x[n+n_0-\left\lfloor \tfrac{N}{2}\right\rfloor]\Big)^2 $$

This value $R_x[0,n_0]$ is a measure of the mean power of the signal $x[n]$ in the neighborhood of $n \approx n_0$. Since $Q_x[0,n_0]=0$ and $Q_x[k,n_0] \ge 0$ for all lags $k$, that means that $ R_x[k,n_0] \le R_x[0,n_0] $ for all lags $k$.

Another useful way to look at this autocorrelation taking place in the neighborhood centered at sample $x[n_0]$ is to normalize $R_x[k, n_0]$ with $R_x[0, n_0]$:

$$ r_x[k,n_0] \triangleq \frac{R_x[k,n_0]}{R_x[0,n_0]} = 1 - \frac{Q_x[k,n_0]}{2 R_x[0,n_0]} $$

This normalized autocorrelation has $r_x[0,n_0]=1$ and $r_x[k,n_0] \le 1$ for all other $k$.
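As an illustration, the three quantities can be computed directly from the definitions. This is only a sketch: the test tone, `fs`, `N`, `n0` and `maxLag` are assumed values, and the `+ 1` terms only translate the 0-based indices of the formulas into Matlab's 1-based indexing.

```matlab
fs = 8000;                          % assumed sample rate (Hz)
x  = sin(2*pi*100*(0:4095)'/fs);    % 100 Hz test tone, period P = 80 samples
N  = 512;                           % window length (assumed)
n0 = 2048;                          % 0-based centre sample of the analysis
maxLag = 200;                       % largest lag k to evaluate (assumed)

n  = (0:N-1)';
R0 = mean(x(n + n0 - floor(N/2) + 1).^2);      % R_x[0, n0]
Q  = zeros(maxLag + 1, 1);                     % Q(k+1) holds Q_x[k, n0]
for k = 0:maxLag
    first = n + n0 - floor((N + k)/2) + 1;     % indices of x[n + n0 - floor((N+k)/2)]
    Q(k + 1) = mean((x(first) - x(first + k)).^2);
end
R = R0 - Q/2;                       % finite-length "autocorrelation" R_x[k, n0]
r = R / R0;                         % normalised version r_x[k, n0]
```

Plotting `r` against `0:maxLag` should show the AMDF-like curve upside down, with `r(1)` equal to 1 and a second peak close to 1 at a lag of 80 samples.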

Suppose for a minute that $x[n]$ is periodic with period $P$ (and $P$ happens to be an integer), then

$$ x[n+P] = x[n] \quad \forall n $$

and $Q_x[mP, n_0] = 0$ and $R_x[mP, n_0] = R_x[0, n_0] \ge R_x[k, n_0]$ for any integer number of periods ($m$ is an integer). So you get a peak at $k=0$ and at $k$ equal to any other multiple of $P$ if $x[n]$ is periodic. If $x[n]$ is not perfectly periodic, what we might expect is the biggest peak at $k=0$, another peak (but slightly smaller) at $k=P$ (the period we are looking for) and progressively smaller peaks for larger multiples of $P$.

We can then expect that the value of the normalized autocorrelation, $r_x[k,n_0]$ evaluated at a lag of $k=P$ or other multiples of $P$ should be pretty close to 1. That value $r_x[P,n_0]$ can be thought of as a measure of the degree of periodicity (sometimes called the pitch confidence) of the estimated period $P$ for the quasiperiodic $x[n]$ in the neighborhood of $n \approx n_0$. If $r_x[P,n_0]=1$, we can say that $x[n]$ is perfectly periodic with period $P$. If the best $r_x[k,n_0]$ you can get (with $k$ that's not close to $k=0$) is very small, then $x[n]$ shows no periodicity and your pitch confidence is low.

So the octave problem comes about because of a couple of reasons. First of all, $P$ is not necessarily an integer. That is an interpolation problem, not a big deal.

The second reason, and the more difficult problem, is that of subharmonics. Consider that you're listening to a nice periodic tone at exactly A-440 Hz and it sounds like an A that is 9 semitones above middle C. Now suppose someone adds to that tone a very tiny-amplitude (say, 60 dB down) A-220. What will it sound like, and mathematically, what is the "true" period?


Choosing the "right" peak for the period.

Let's say you run your note through a DC-blocking filter, so that the mean of $x[n]$ is zero. It turns out that causes the mean of the autocorrelation $R_x[k, n_0]$ for every $n_0$ to also be zero (or close to it if $N$ is large). That means $R_x[k, n_0]$ must sum (over $k$) to be about zero which means there is as much area above zero as below.

Okay, so $R_x[0, n_0]$ represents the power of $x[n]$ in the vicinity around $n=n_0$ and must be non-negative. $R_x[k, n_0]$ never exceeds $R_x[0, n_0]$ but can get as large as it when $x[n]$ is periodic. $R_x[P, n_0] = R_x[0, n_0]$ if $x[n+P]=x[n]$. So if $x[n]$ is periodic with period $P$, you have a bunch of peaks spaced $P$ apart, and you have an idea of how high those peaks should be. And if the DC component of $R_x[k, n_0]$ is zero, that means that in-between the peaks it must have negative values.

If $x[n]$ is "quasi-periodic", one cycle of $x[n]$ will look a lot like an adjacent cycle, but not so much like a cycle of $x[n]$ farther down the signal in time. That means the first peak $R_x[P, n_0]$ will be higher than the second at $R_x[2P, n_0]$ or the third at $R_x[3P, n_0]$. One could use the rule to always pick the highest peak and expect the highest peak to always be the first one. But, because of inaudible subharmonics, sometimes that is not the case: sometimes the second or possibly the third peak is oh-so-slightly higher. Also, because the period $P$ is likely not an integer number of samples but $k$ in $R_x[k, n_0]$ is always an integer, the true peak will likely be in-between integer values of $k$. Even if you were to interpolate where the smooth peak is and how high it really is between integer $k$ (which I recommend, and quadratic interpolation is good enough), your interpolation algorithm could make a peak slightly higher or slightly lower than it really is. So choosing the absolutely highest peak can result in spuriously picking the second over the first peak (or vice versa) when you really wanted the other.
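The quadratic interpolation mentioned above can be sketched as a three-point parabola fit around the integer peak. The synthetic curve `Rk` below is only a stand-in for a real $R_x[k, n_0]$, with its true peak placed at a lag of 80.3 as an assumption for the demo:

```matlab
k  = (0:200)';                      % integer lags
Rk = 1 - 1e-3*(k - 80.3).^2;        % synthetic peak: height 1 at lag 80.3 (assumed)

[~, i0] = max(Rk);                  % index of the highest integer-lag sample
ym = Rk(i0 - 1);                    % neighbour to the left
y0 = Rk(i0);                        % integer peak
yp = Rk(i0 + 1);                    % neighbour to the right

% vertex of the parabola through (-1, ym), (0, y0), (+1, yp)
delta = 0.5*(ym - yp) / (ym - 2*y0 + yp);   % fractional offset, within +/- 0.5
kPeak = k(i0) + delta;                      % interpolated lag of the peak
RPeak = y0 - 0.25*(ym - yp)*delta;          % interpolated peak height
```

Because the stand-in curve is itself a parabola, the fit recovers the lag and height exactly (up to rounding); on a real autocorrelation it is only an approximation, which is the source of the small height errors described above.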

So somehow you have to handicap the peaks at increasing $k$ so that the first peak has a slight advantage over the second, and the second over the fourth (the next octave down), etc. How do you do that?

You do that by multiplying $R_x[k, n_0]$ with a decreasing function of $k$ so that the peak at $k=2P$ is reduced by some factor, relative to an identical peak at $k=P$. It turns out that the power function (not the exponential) does that. So compute

$$ k^{-\alpha} \ R_x[k, n_0] $$

So, if $x[n]$ were perfectly periodic with period $P$, and ignoring interpolation issues for non-integer $P$, then

$$ R_x[2P, n_0] = R_x[P, n_0] $$

but

$$\begin{align} (2P)^{-\alpha} R_x[2P, n_0] & = (2P)^{-\alpha} R_x[P, n_0] \\ & < P^{-\alpha} R_x[P, n_0] \\ \end{align}$$

The factor by which the peak for a pitch of one octave lower is reduced is the ratio

$$ \frac{(2P)^{-\alpha} R_x[2P, n_0]}{P^{-\alpha} R_x[P, n_0]} = \frac{(2P)^{-\alpha}}{P^{-\alpha}} = 2^{-\alpha} $$

So if you want to give your first peak a 1% boost over the second peak, which means you will not choose the pitch to be the sub-harmonic pitch, unless the sub-harmonic pitch autocorrelation is at least 1% more than the first peak, you would solve for $\alpha$ from

$$ 2^{-\alpha} = 0.99 $$
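Worked out for this 1% example:

$$ \alpha = -\log_2(0.99) = -\frac{\ln 0.99}{\ln 2} \approx 0.0145 $$

With that $\alpha$, a peak two octaves down (at $k=4P$) is scaled by $4^{-\alpha} = 0.99^2 \approx 0.980$ relative to the peak at $k=P$.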

That is the consistent way to weight or de-emphasize or handicap the peak corresponding to the subharmonic pitch one octave below.

It still leaves you with a thresholding issue: you have to choose $\alpha$ well. But this is a consistent way to emphasize the first peak over the second (which is an octave lower), yet not by so much that the second peak can never win. If the note really is an octave lower, but the energy in the even harmonics is strong compared to the odd harmonics, the second peak can still be chosen.
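A small sketch of the weighted peak picking, using a synthetic curve in place of a measured $R_x[k, n_0]$. The period of 80 samples, the lag range and the subharmonic strength `a` are assumptions for the demo; `a` is chosen so that the second peak is about 0.5% higher than the first:

```matlab
alpha = -log2(0.99);                % 1% handicap per octave, from 2^(-alpha) = 0.99
k     = (40:200)';                  % candidate lags, keeping away from k = 0 (assumed range)
a     = 0.0025;                     % relative strength of the subharmonic term (assumed)

% synthetic stand-in for R_x[k, n0]: peaks at k = 80 (height 1 - a) and k = 160 (height 1 + a)
R = cos(2*pi*k/80) + a*cos(2*pi*k/160);

[~, iRaw] = max(R);                 % plain "highest peak" rule
[~, iWt]  = max(k.^(-alpha) .* R);  % peaks handicapped by k^(-alpha)
lagRaw      = k(iRaw);              % lag chosen without weighting
lagWeighted = k(iWt);               % lag chosen with weighting
```

Here the plain rule lands on the second peak (lag 160), while the weighted rule keeps the first (lag 80). Raising `a` to 0.02 makes the second peak about 4% higher, which is more than the 1% handicap, so the weighted rule then selects lag 160 as well.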

$\endgroup$
  • 1
    $\begingroup$ To answer your last question: if you add a 220 Hz component, then the pitch will be 220 Hz, where 440 Hz is the first harmonic after the fundamental (mathematically speaking). My case is similar, but there are also higher harmonics, so the missing fundamental is not a problem from a perceptual point of view. I don't understand how replacing AMDF with ASDF could solve the octave problem. $\endgroup$
    – firion
    Commented Apr 8, 2016 at 7:47
  • $\begingroup$ but the other half of the question is "what will it sound like?" answer that and then let's see what you want your pitch detector to do. $\endgroup$ Commented Apr 8, 2016 at 16:18
  • $\begingroup$ try calculating and plotting $R_x[k,n_0]$ for the same piece of tone that you have done for the AMDF. should look something like the AMDF upside-down. $\endgroup$ Commented Apr 8, 2016 at 16:20
  • $\begingroup$ If you don't have other higher harmonics but just the 440 Hz one, and the 220 Hz tone is sufficiently low, you will hear a 440 Hz pitch. Above some level (I don't know which one), you will hear also the 220 Hz tone and so a 220 Hz pitch. $\endgroup$
    – firion
    Commented Apr 8, 2016 at 20:51
  • $\begingroup$ there is a reason why i said -60 dB. now what do you want your pitch detector to say, that it's a 220 Hz or a 440 Hz note or something else? $\endgroup$ Commented Apr 8, 2016 at 21:01
0
$\begingroup$

Heuristically, the fundamental frequency of voiced speech will lie in the interval [70, 400] Hz. So, the first step would be to apply a bandpass filter to approximately isolate that band.

Secondly, you could apply a weighting function to the power spectrum. Near the fundamental, the weight should be near 1, while closer to the end of the band, the weight should be near 0. This weighting is normalized of course. I would recommend something super-linear: quadratic, quartic etc -- to really kill the octaves off.

$\endgroup$
  • $\begingroup$ How can I apply the weight? I don't know where the fundamental is. Also, my signal is an instrument's note, so the range is larger $\endgroup$
    – firion
    Commented Apr 7, 2016 at 20:20
