
I'm trying to work out how (if possible) to extract the frequency components of an arbitrary audio sample (typically music) in an FFT-like manner, but in my research on the FFT algorithm, I'm learning that it suffers from some severe restrictions for this purpose.

The FFT presents three problems:

  1. Because the FFT's bin resolution is the reciprocal of the window duration, achieving even reasonable accuracy (say 1 Hz) requires an unreasonably long window (say 1 second). This means you can't detect transients or newly introduced frequencies quickly. It also means that the problem can't be solved with a faster CPU and a higher sample rate -- the restriction is intrinsically tied to time.

  2. Humans perceive frequency logarithmically, but FFT bins are spaced linearly. For example, a difference of 20 Hz at the low end of our hearing is huge, whereas a difference of 20 Hz at the high end is imperceptible. So to get the accuracy we require at low frequencies, we have to calculate far more than we require at high frequencies.

  3. Some of these problems can be resolved by interpolating between FFT bins. This may work for much musical audio, because frequencies will often be spaced quite far apart, so no more than one frequency will leak into a pair of bins. But this will not always be the case, particularly for inharmonic sounds like percussive instruments. So interpolation is really just guesswork. (The sketch after this list illustrates problems 1 and 3.)
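
To make problems 1 and 3 concrete, here is a minimal numpy sketch (the tones, durations, and the peak-picking threshold are arbitrary choices of mine, not from the discussion): two tones 10 Hz apart merge into a single spectral peak with a 0.1 s window, but separate with a 1 s window.

```python
import numpy as np

fs = 44100
f1, f2 = 440.0, 450.0                # two tones only 10 Hz apart

def peak_count(duration):
    """Count distinct spectral peaks between 400 and 500 Hz."""
    t = np.arange(int(fs * duration)) / fs
    x = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    band = mag[int(400 * duration):int(500 * duration)]   # bin k sits at k/duration Hz
    is_peak = ((band[1:-1] >= band[:-2]) & (band[1:-1] > band[2:])
               & (band[1:-1] > 0.1 * band.max()))
    return int(np.sum(is_peak))

print(peak_count(0.1))   # 1 -- bins are 10 Hz wide, the tones merge
print(peak_count(1.0))   # 2 -- bins are 1 Hz wide, both tones resolved
```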

From what I understand of the DFT/FFT algorithm, the outputs (bin amplitudes) are effectively the correlation of the sine/cosine at each bin's frequency. It strikes me that if the algorithm could be redesigned so that the bin frequencies are spaced non-linearly (i.e. we correlate a different set of sines/cosines), then we could achieve psychoacoustically-equal resolution at all frequencies. Is this possible, or is it a pipe-dream based on my incomplete understanding of the maths involved?

I guess I could also solve the problem with brute force, by correlating sines/cosines at every single frequency I'm interested in. I'm not too clued up on the maths here. Is this possible? What sort of efficiency could I expect? Would it solve my problem?
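
For what it's worth, that brute-force correlation is easy to write down. Below is a minimal sketch of the idea (the 0.1 s test signal and the log-spaced probe grid are my own arbitrary choices); note that each probed frequency costs O(N), whereas the FFT computes all of its N linearly spaced bins in O(N log N):

```python
import numpy as np

def correlate_bank(x, fs, freqs):
    """Correlate x against a windowed complex exponential at each
    requested frequency -- i.e. a DFT evaluated at arbitrary bins."""
    t = np.arange(len(x)) / fs
    w = np.hanning(len(x))                     # taper to reduce leakage
    probes = np.exp(-2j * np.pi * np.outer(freqs, t))
    return np.abs(probes @ (w * x))

fs = 44100
t = np.arange(4410) / fs                       # a 0.1 s snippet
x = np.sin(2 * np.pi * 440.0 * t)
freqs = 440 * 2 ** (np.arange(-24, 25) / 48)   # log-spaced grid, 48 steps/octave
print(freqs[np.argmax(correlate_bank(x, fs, freqs))])   # 440.0
```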

Is there a different way to achieve more accurate, real-time frequency decomposition of a signal? CPU efficiency is a concern, but not the major concern -- I'm partly interested in whether it can theoretically be done at all. However, something that's feasible in real time on a modern desktop machine would be ideal.

  • Which problem are you trying to solve? f0 detection, multiple-f0 detection (for transcription), chord recognition, timbre modeling...? There are ad-hoc solutions for some of these problems. Do you care about invertibility (to be used in an analysis->transformation->resynthesis framework)? Commented Oct 7, 2012 at 10:21
  • The problem I'm trying to solve is admittedly rather open-ended. I have a general interest in digital music, covering most of your list. But my vagueness is partly due to my lack of knowledge of what can be done and what are the specific industry-standard or best ways of solving each problem you mention (till I asked this question, I'd always assumed FFT was it). But the item on your list of most interest to me is timbre modelling. I'd also like to find ways of extracting complex timbres sounding simultaneously in a recording. Resynthesis is exciting. AI algorithms are of interest. – bryhoyt, Commented Oct 14, 2012 at 7:58
  • A more specific problem which I've attempted to solve in the past and would like to try again sometime: I would like to write a program to "improvise" in real-time with a group of players or singers recorded with a microphone. I got as far as having my computer "whistle" a sine with me, noticeably delayed and out-of-tune. It would be vital for such an improvisation to be accurately on-tune and on-beat. Sure, there are other ways of achieving this (players play digital instruments, or giving the computer some "inside information" like a pre-set chord progression etc) but this isn't my goal. – bryhoyt, Commented Oct 14, 2012 at 8:10
  • $\begingroup$ "algorithm could be redesigned so that the bin frequencies are spaced non-linearly, then we could achieve pyschoacoustically-equal resolution at all frequencies." Sounds like a continuous Morlet wavelet transform $\endgroup$
    – endolith
    Commented Sep 12, 2014 at 17:38

4 Answers


As I commented on a previous post, the time-frequency analysis method known as the "short-term Fourier transform" (STFT) $X$ is equivalent to a filter bank analysing your signal $x$. For a given analysis window $w_n$ of size $N$, the filter at frequency $k/N$ is: $$ h_n = w_{-n} e^{j2\pi\frac{nk}{N}} $$

For usual analysis windows (Hann, Hamming, or even rectangular), this corresponds to a low-pass filter with cut-off frequency around $1/N$, which is "shifted" to frequency bin $k$ (thanks to the complex exponential modulation), therefore yielding a band-pass filter.
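
Here is a small numerical check of this equivalence (a sketch; the window, sizes, and bin $k$ are arbitrary): tracking one STFT bin over all one-sample frame shifts gives the same sequence as convolving the signal with the filter $h_n$ above.

```python
import numpy as np

N, k = 64, 5
w = np.hanning(N)
n = np.arange(N)
x = np.random.randn(1000)

# STFT bin k for every frame shift m (hop of one sample)
stft_k = np.array([np.sum(x[m:m + N] * w * np.exp(-2j * np.pi * n * k / N))
                   for m in range(len(x) - N)])

# the filter h_n = w_{-n} exp(j 2 pi n k / N) is supported on
# n in [-(N-1), 0]; shift it by N-1 so it starts at index 0
h = w[::-1] * np.exp(2j * np.pi * (n - (N - 1)) * k / N)
conv_k = np.convolve(x, h)[N - 1:len(x) - 1]

print(np.allclose(stft_k, conv_k))   # True
```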

At this point, in order to directly answer your concern about reflecting human perception, some people derived the "constant-Q transform" (CQT) [Brown91]. It relies on the same principle as the FT, in its filter bank interpretation. However, the center frequencies $f_k$ are not linearly spaced as for a "normal" FT, but rather logarithmically spaced. The scale is then closely related to a Western musical scale: if one chooses $f_{k+1} = 2^{1/12} f_k$, then we obtain 12 frequencies per octave (rings a bell? :-) ), and the bandwidth is set to, say, $\frac{2^{1/12} - 1}{2} f_k$. You can also choose other centers, as best suits your needs.
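
In code, the geometry of such a constant-Q bank takes only a couple of lines (a sketch; the 55 Hz starting point and four-octave span are arbitrary choices of mine):

```python
import numpy as np

f0, n_bins = 55.0, 48                       # A1, four octaves, 12 bins/octave
f = f0 * 2 ** (np.arange(n_bins) / 12)      # log-spaced center frequencies
bw = (2 ** (1 / 12) - 1) / 2 * f            # the bandwidth suggested above
print(f[:13])                 # 55 .. 110 Hz: one bin per semitone
print(bw / f)                 # constant ratio (constant Q), ~0.0297 everywhere
```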

You can find implementations of the CQT here and there; a recent one by Prof. Klapuri, which comes with a rather decent inverse, can be found here. The Audio group at Telecom ParisTech also has an implementation by Prof. Prado, but I have not tried it yet.

[Brown91] J. Brown, "Calculation of a Constant Q Spectral Transform", Journal of the Acoustical Society of America, 1991, 89, 425-434

EDIT 20121014: some answers and comments to your (bryhoyt's) questions.

  1. Just general ideas on your own comments to the main question: You seem to be interested in many applications which are, to me, not quite trivial problems to address. "Timbre modelling" sounds to me more related to speech recognition or the like, for which pitch or frequency resolution or precision is not much of an issue (consider how MFCCs are usually computed).

    Consider also how many top researchers (F. Pachet and the repmus team at IRCAM, France, to name a few) are working on the topic of automatic improvisation and accompaniment: the task is not impossible, but it requires expertise in many areas. To summarize, a typical system needs to imitate the human auditory system (at least), implement sound/music/pitch/rhythm perception, know about music theory, and make decisions based on the estimations of all the previous steps. The Fourier transform, or any signal representation, is just one (tiny) step towards the end goal -- and potentially, in my opinion, the best understood so far.

    That said, there is still the possibility that everyone is looking far beyond what actually happens, and that you may crack it with a simple, thus elegant, solution! Don't forget to publish once it's done! :-)

  2. a sample of 0.1s at 44kHz is enough to contain a vast range of frequencies

    This would lead, in the case of the FT, to a resolution of the order of $F_s / N = 44100/4410 = 10$ Hz at every frequency bin of the FT. That's almost 2 semitones at 100 Hz! Could be better...

  3. The FFT can't detect this for low and high frequencies, but you say other algorithms can: what's the tradeoff?

    Short answer: read my thesis on melody estimation!

    To elaborate a bit more: many pitch estimation algorithms go beyond the limitations of the FT, thanks to assumptions about the sounds being processed. We expect notes from natural sounds (human voice, oboe, sax, piano...) to be more complex than single sinusoids. Most pitched sounds are more or less harmonic, which means that they can be modelled as sums of sinusoids whose frequencies are multiples of the fundamental frequency.

    It is therefore useful to take these harmonics into account when estimating the pitch; methods using detection functions such as the spectral sum, the spectral product, or the auto-correlation function exist. Someone started a related topic recently.

  4. What are the tradeoffs? More specifically, what level of frequency accuracy can I expect for a reasonably short window? (I understand the window size in CQT is variable -- how much so?) Even more specifically, how close will I be able to get to my approx. goal of 0.5% frequency difference with a window of 0.005s?

    As previously said, with a window of 0.005 s, you can expect something like 200 Hz of "frequency leak". That's really a problem only when you have 2 sinusoids with frequencies closer than 200 Hz, such that the FT won't be able to show that they are 2 different sinusoids. Well, we are far from your 0.5% (by the way, a semitone is about a 6% difference in frequency!), and 0.005 s is really a bit short for your purpose. However, if you want to provide an estimate every 0.005 s, you can still process longer, overlapping frames, as is usually done in speech/music processing. Is that what you actually want?

    As for the size of the windows, you can refer to [Schoerkhuber2010], with frame lengths equal to: $$ N_k = \frac{F_s}{f_k (2^{1/B} - 1)} $$ where $B$ is the number of frequency bins per octave that is desired for the CQT. That means very long windows: $B=48$ and $f_k=100$ Hz require windows about 0.7 s long (see the sketch after this list). Needless to say, we then lose a bit of temporal resolution... But as mentioned earlier, this is a problem only if we forget the structure of the sound. Additionally, psychoacoustics considers that below 500 Hz, humans do not really distinguish sinusoids that well: even humans are challenged there. Of course, we can hope our computers can do better than us, but here, we face a tough issue!

    At last, note that other ways of computing a time-frequency representation of a sound exist; consider for instance gammatone filter banks. The advantage of the CQT I mentioned previously is that there is software for both the transform and its inverse. Personally, I still stick to the STFT, though, for its simplicity and because, so far, I have never needed better resolution in low frequencies, even for source separation.

    [Schoerkhuber2010] Schoerkhuber, C. and Klapuri, A., "Constant-Q transform toolbox for music processing," 7th Sound and Music Computing Conference, Barcelona, Spain, 2010.
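
To put numbers on that frame-length formula, a quick sketch (using the $F_s = 44100$ Hz and $B = 48$ from the discussion; the two-octave grid is my choice):

```python
import numpy as np

Fs, B = 44100, 48                        # sample rate, bins per octave
f = 100 * 2 ** (np.arange(2 * B) / B)    # bins covering two octaves above 100 Hz
N = Fs / (f * (2 ** (1 / B) - 1))        # N_k, frame length in samples
print(N[0] / Fs)     # ~0.69 s at 100 Hz -- the long window mentioned above
print(N[-1] / Fs)    # ~0.17 s two octaves up: shorter, but still sizeable
```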

  • A little remark: the CQT might help to solve points 1 and 2 of your concerns, but not point 3. As for point 3, there's always a trade-off between time and frequency resolution, and if you want good frequency resolution in low-frequency components, you very likely need to accept losing time resolution. Now, for pitch estimation, there might be some other solutions; you could read mine in my PhD thesis if you are interested :D Commented Oct 10, 2012 at 21:29
  • I don't quite understand. I know that you don't get anything for free -- I can't expect an algorithm to accurately detect frequencies that haven't been sampled at a good resolution for at least a couple periods of the lowest frequency. But a sample of 0.1s at 44kHz is enough to contain a vast range of frequencies, which a human can distinguish accurately (in relative terms -- "here's a 5th", "there's a flat diminished 4th", etc), proving the info is in there somewhere. The FFT can't detect this for low and high frequencies, but you say other algorithms can: what's the tradeoff? – bryhoyt, Commented Oct 14, 2012 at 7:28
  • Of all the excellent answers above, the CQT looks like the most exact fit to the question I was asking. What are the tradeoffs? More specifically, what level of frequency accuracy can I expect for a reasonably short window? (I understand the window size in CQT is variable -- how much so?) Even more specifically, how close will I be able to get to my approx. goal of 0.5% frequency difference with a window of 0.005s? (That's my rough guesstimate of when a human could start hearing something's out of tune or off-beat) – bryhoyt, Commented Oct 14, 2012 at 8:05

First, with the classic short-term Fourier transform approach, there are alternatives to interpolation -- in particular, techniques making use of phase information to recover the instantaneous frequency (see this question), which can give you the position of a spectral peak very accurately without an increase in FFT size. The drawback, as you correctly said, is that you are not increasing the ability of the system to discriminate adjacent peaks -- but this is already a great improvement compared to using the central frequency of the FFT bin index.
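
A minimal sketch of that phase trick, assuming nothing beyond numpy (the window, FFT size, and hop are arbitrary choices): each bin's instantaneous frequency is recovered from the phase advance between two overlapping frames.

```python
import numpy as np

def instantaneous_frequency(x, fs, n_fft=1024, hop=256):
    """Per-bin instantaneous frequency (Hz) from the phase difference
    of two overlapping Hann-windowed FFT frames."""
    w = np.hanning(n_fft)
    X0 = np.fft.rfft(w * x[:n_fft])
    X1 = np.fft.rfft(w * x[hop:hop + n_fft])
    k = np.arange(n_fft // 2 + 1)
    expected = 2 * np.pi * hop * k / n_fft            # phase advance of bin center
    dphi = np.angle(X1) - np.angle(X0) - expected     # deviation from it...
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi    # ...wrapped to [-pi, pi)
    return (expected + dphi) * fs / (2 * np.pi * hop)

fs = 44100
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 440.0 * t)
mag = np.abs(np.fft.rfft(np.hanning(1024) * x[:1024]))
print(instantaneous_frequency(x, fs)[np.argmax(mag)])  # ~440 Hz, though bins are ~43 Hz apart
```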

Regarding your brute-force approach... The FFT has this resolution limitation (with bins at $\frac{F_s}{N}$) because it is "probing" the signal with complex exponentials that have a null integral over the analysis window -- and these must have a period which is an integer fraction of the analysis window length. If you try to naively use the same approach to "probe" frequencies at other ratios, it won't work, because you'll be "probing" your signal with functions that do not contain a whole number of cycles. If you wanted to make it work, you would have to increase your analysis window to the least common multiple of the periods of all the signals in your brute-force search, and this is precisely what you want to avoid!

There is another brute-force approach that works: "probe" your signal with windowed complex exponentials (Gabor wavelets). These are characterized by a center frequency, a center time, and a bandwidth (which measures how the wavelet is spread over time or over frequency). You'll have to evaluate many, many, many correlations between your signal and these wavelets, at as many time offsets, frequencies, and bandwidths as you want. The result will be the same as a very flexible "tiled" STFT in which an optimal window size is selected for each time range and each frequency band. The downsides, besides the computational cost, are that there is no efficient algorithm and no causal algorithm (you will need to know as many samples in advance as the length of the longest wavelet in your dictionary). If you want to experiment with these techniques, have a look at MPTK.
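
A naive sketch of such a probe (just the correlation step, not MPTK's efficient matching-pursuit machinery; the grid of parameters would be your own choice):

```python
import numpy as np

def gabor_atom(n, center, freq_hz, sigma, fs):
    """Unit-norm Gabor atom: a Gaussian-windowed complex exponential."""
    t = np.arange(n)
    g = (np.exp(-0.5 * ((t - center) / sigma) ** 2)
         * np.exp(2j * np.pi * freq_hz * t / fs))
    return g / np.linalg.norm(g)

def best_atom(x, fs, centers, freqs, sigmas):
    """Brute force: correlate x with every atom on a (time, frequency,
    bandwidth) grid and return the parameters of the best match."""
    grid = [(c, f, s) for c in centers for f in freqs for s in sigmas]
    scores = [abs(np.vdot(gabor_atom(len(x), c, f, s, fs), x))
              for c, f, s in grid]
    return grid[int(np.argmax(scores))]
```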

A last class of approaches, working quite well for music signals, is subspace tracking / high-resolution methods. They work the problem from a model estimation angle -- "assuming this signal is a sum of $k$ exponentially damped sinusoids, what are the amplitudes/frequencies giving the least-squares error?" -- and as such the number of samples required to get an accurate estimate depends only on the noise level. There are some hurdles to making them work, though:

  • The number of components $k$ must ideally be known in advance. Order estimation methods like ESTER have been proposed to optimally pick the model order.
  • They perform well only in the presence of white noise -- this requires the signal to be whitened before analysis; performing the analysis in individual channels of a filter bank helps too.

These are computationally expensive, but they can work online, with short windows if model orders and/or noise are low.
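
As an illustration of the subspace idea, here is a toy ESPRIT estimator (a sketch, not a reference implementation of the cited methods; it omits whitening and the ESTER order selection, so $k$ must be supplied):

```python
import numpy as np

def esprit(x, k, m=None):
    """Toy ESPRIT: frequencies (cycles/sample) of k complex exponentials."""
    m = m or len(x) // 2
    # Hankel data matrix: column j holds x[j : j+m]
    H = np.lib.stride_tricks.sliding_window_view(x, m).T
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    Us = U[:, :k]                                   # signal subspace
    # shift invariance: Us[1:] ~ Us[:-1] @ Phi
    Phi = np.linalg.lstsq(Us[:-1], Us[1:], rcond=None)[0]
    poles = np.linalg.eigvals(Phi)                  # |pole| gives the damping
    return np.angle(poles) / (2 * np.pi)

fs = 8000
t = np.arange(400) / fs                  # 0.05 s: plain FT resolution is 20 Hz
x = (np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 450 * t)
     + 0.001 * np.random.randn(len(t)))
f = esprit(x, k=4) * fs                  # k=4: two real sinusoids = 2 conjugate pairs
print(np.sort(f[f > 0]))                 # ~[440, 450], well below FT resolution
```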


Frequency or pitch? There are already tons of research papers and books on human pitch perception. But, IIRC, humans tend to be bad at accurately "extracting" frequencies unless they happen to be a pitch fundamental. And multiple frequency peaks within a "critical band" tend to be perceived as noise. So any method with "near human accuracy" may also have to reproduce some human perceptual estimation failures as well.

An FFT is just a bank of filters, and one that is not optimal for many purposes unless orthogonality and invertibility are requirements. Other filter banks are possible if you don't require those two properties (and human perception clearly does not), such as a mel-frequency filter bank. Once a frequency peak is identified by a mel-frequency filter bank, further analysis by FFT interpolation or phase vocoder techniques might be useful to refine the frequency estimate of any isolated spectral peak.
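
For the refinement step, the classic quadratic ("parabolic") interpolation of log magnitudes around a peak bin takes only a few lines (a generic sketch, not tied to any particular filter bank; the 439 Hz test tone is arbitrary):

```python
import numpy as np

def refine_peak_hz(x, fs):
    """Refine the strongest FFT peak by parabolic interpolation of the
    log magnitudes of the peak bin and its two neighbours."""
    mag = np.abs(np.fft.rfft(np.hanning(len(x)) * x))
    i = int(np.argmax(mag))
    a, b, c = np.log(mag[i - 1:i + 2])
    offset = 0.5 * (a - c) / (a - 2 * b + c)   # fractional-bin correction
    return (i + offset) * fs / len(x)

fs = 44100
t = np.arange(2048) / fs
print(refine_peak_hz(np.sin(2 * np.pi * 439.0 * t), fs))  # ~439 Hz, bins ~21.5 Hz apart
```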

Note that no further information is actually gathered by any of these filtering techniques over the same span of time-domain data, compared with an FFT. What is actually happening may be a loss of information that better matches the "inaccuracy" or anomalies of the human hearing system.

And pitch estimation from a set of frequencies is a completely different problem, again a topic with many research papers and chapters in books on audiology and such.

The last part of your question, about performance, may be a red herring. One can do dozens of FFTs and dozens of different filter banks in real time on a cell-phone processor these days. Given the very efficient FFT libraries available from CPU vendors, an FFT with thousands of "excess" bins may be more efficient than a significantly smaller but more naively coded filter bank.

  • Very informative answer, thanks. I'm aware of the difference between pitch and frequency, but your answer really helped to highlight how much human accuracy depends on the sound meeting certain requirements. It rings true to my knowledge of harmony that humans are quite bad at extracting frequencies that aren't a pitch fundamental. I can accurately distinguish in-tune intervals from each other, and from out-of-tune intervals (consonant intervals more easily than dissonant). But I'd have trouble distinguishing two out-of-tune intervals (other than "flat", "very flat", "sharp", etc). – bryhoyt, Commented Oct 14, 2012 at 7:43

There are many alternatives, but it depends on what you are doing. Physically, I would argue our ears are more like a parallel filter bank than an FFT, which gives them good time resolution, and a process called "focusing" gives them good frequency resolution. So, in some cases, you could theoretically use a filter bank, but this requires a lot of processing and leaves you with a lot of data to process.

It is possible to view wavelets as a set of particularly efficient and related filters. The problem with wavelets for musical and audio analysis is that they typically only give you one-octave resolution (and although you can do various things about this, I haven't really seen wavelets be particularly useful in audio).

Another approach is to use overlapping FFT windows. You can increase the frequency resolution of the FFT by looking not just at the magnitude information but also at the phase information. This allows you to use much shorter windows than you might otherwise, which results in better performance and better time resolution. Overlapping windows are hard to resynthesize correctly, and making too many assumptions about phase can be hazardous as well. Be that as it may, these sorts of tricks are probably the staple of solving complex time-frequency analysis problems.

There are a number of other tools for specific applications as well.

  • Computing the short-term Fourier transform (STFT) is equivalent to analysing the signal with a filter bank. Let $x_n$ be a signal. The STFT $X$, at frequency $k$ and frame $m$, with an analysis window $w_n$, is usually computed as: $$X_{km} = \sum_n x_{n+m} w_n e^{-j2\pi\frac{nk}{N}}$$ i.e. we successively compute $N$-long Fourier transforms (FT) on $x_n$ shifted by $m$ samples. This can also be expressed as: $$X_{km} = \sum_p x_{p} w_{p-m} e^{-j2\pi\frac{(p-m)k}{N}} = \sum_p x_p h_{m-p}$$ where $h_n = w_{-n} e^{j2\pi\frac{nk}{N}}$, hence the filter bank interpretation! Commented Oct 10, 2012 at 20:45
  • An STFT may be a filter bank, but not all filter banks are STFTs. Commented Oct 10, 2012 at 21:31
