First, some background: The STFT is the best general-purpose tool I know of for analysing a (musical or other) signal into its component frequencies. For many purposes the STFT works fine, but there are two related problems that make it non-ideal for musical purposes -- 1) the frequency bins are evenly spaced, not logarithmically spaced the way human hearing works, and 2) by the mathematical nature of the DFT/FFT, you need to capture a long stretch of audio to get decent frequency resolution at low frequencies.
For example, to be able to distinguish adjacent semitones accurately in the bass range of a piano, you'd need roughly 1 Hz resolution, meaning you'd need to capture a full second of audio. For music, that's a completely unacceptable time resolution.
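To put numbers on that (the note frequencies below are standard equal temperament; the rest is just arithmetic):

```python
import math

# Bottom of the piano: A0 and the equal-tempered semitone above it.
a0 = 27.5                       # Hz
bb0 = a0 * 2 ** (1 / 12)        # ~29.14 Hz
gap = bb0 - a0                  # frequency difference we must resolve

# An FFT's bin spacing is 1/T Hz for a window of T seconds, so the
# window has to span at least 1/gap seconds to separate the two notes.
min_window = 1 / gap

print(f"gap = {gap:.2f} Hz -> window >= {min_window:.2f} s")
```

So even at the very bottom of the keyboard the required window is on the order of 0.6 s, which is the same order of magnitude as the 1 Hz / 1 s figure above.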
There are various ways to enhance the output of the FFT, including a) zero-padding to interpolate the spectrum across more bins, b) overlapping the analysed segments, or c) magic with the phase to detect frequency change. These are useful, but still have problems -- extra computational overhead, and while (a) can increase the apparent frequency resolution at a given temporal resolution, it doesn't help when two frequencies happen to fall into the same bin.
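To make (a)'s limitation concrete, here's a quick sketch (pure Python; the specific tone frequencies and window length are arbitrary choices of mine). Zero-padding only samples the underlying spectrum more densely, so evaluating the DTFT on a fine grid shows exactly what an arbitrarily long zero-padded FFT would show -- and two tones inside one bin still merge into a single lobe:

```python
import cmath, math

fs = 1000                  # sample rate (Hz)
T = 0.25                   # window length (s) -> FFT bin spacing 1/T = 4 Hz
N = int(fs * T)
f1, f2 = 50.0, 51.0        # two tones only 1 Hz apart: same 4 Hz bin

x = [math.cos(2 * math.pi * f1 * n / fs) +
     math.cos(2 * math.pi * f2 * n / fs) for n in range(N)]

def dtft_mag(f):
    """Magnitude of the DTFT at frequency f -- what a zero-padded FFT
    of any length would give at that frequency."""
    return abs(sum(x[n] * cmath.exp(-2j * math.pi * f * n / fs)
                   for n in range(N)))

# Sample the spectrum finely around the two tones.
freqs = [40 + 0.1 * k for k in range(201)]
mags = [dtft_mag(f) for f in freqs]

# Count local maxima above half the peak: despite the dense sampling,
# the two tones merge into a single lobe.
peak = max(mags)
lobes = sum(1 for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]
            and mags[i] > 0.5 * peak)
print("distinct lobes above half-max:", lobes)
```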
So... I've been researching other possibilities, out of sheer interest. The CWT and DWT looked very promising at first, but the CWT is very slow to compute, and the DWT only has octave frequency resolution. (Is that right?) I've also considered taking multiple FFTs at different time/frequency resolutions to cover the space better, or solutions like the zoom-FFT. These all help solve different specific problems, but none of them really makes the fundamental problem go away: we don't have a fast (N log N) way of analysing the frequencies in a signal at good temporal and frequency resolution.
A lot of people point to the Heisenberg uncertainty principle at this point, implying that achieving this is mathematically impossible. I'm pretty sure that's simply wrong: the uncertainty principle is valid, sure, but we're not even close to running into it yet. The problem here is that the DFT/FFT can only analyse frequencies at integer multiples of the window's fundamental (the reciprocal of the window length). The in-between frequencies are certainly there to be analysed, at good temporal resolution -- if we're willing to use a (computationally slow) sine sweep or continuous wavelet transform. The uncertainty principle doesn't cause problems until long after the FFT ceases to be useful.
So, my solution. I'd like you to tell me if and where it's wrongheaded: let's correlate the input signal with probe phasors at selected frequencies. At first this sounds like an N^2 algorithm (N operations for a correlation at each frequency, times the number of frequencies). However, the key point is that to match an FFT's frequency resolution at the low end, you only need to analyse about log N logarithmically spaced frequencies -- because resolution at high frequencies is logarithmically less important in musical signals.
So, we have an algorithm that works like this:
- Decide on your frequency resolution. Let's say octaves for now, though I'd want sub-semitone resolution in most applications.
- Pre-load some buffers with probe phasors at frequencies spaced at that interval -- say 16, 32, 64, 128, 256, 512, 1024 Hz, [...]. Because higher frequencies can be detected in a shorter amount of time, you need fewer samples at higher frequencies (= better temporal resolution, as well as fewer operations). For the sake of argument, let's use a few periods at each frequency.
- Pre-window these buffers, if it makes sense.
- Correlate each of these with a long-enough segment of the input signal at time t (i.e. take the sum of the component-wise product of the segment and the probe). Probably normalize this value.
- The magnitude of this correlation is the strength of the signal at that frequency, at time t. So the table of results vs. frequencies should be your frequency spectrum at the desired logarithmic resolution (octaves, in this case).
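A minimal sketch of those steps in Python (everything here -- the function name, the choice of 4 cycles per probe, the Hann window, and the normalisation -- is my own assumption, just to make the idea concrete):

```python
import cmath, math

def probe_spectrum(signal, t, fs, freqs, cycles=4):
    """Correlate the signal, starting at sample t, with a windowed
    complex probe phasor at each (log-spaced) frequency."""
    spectrum = {}
    for f in freqs:
        n = int(cycles * fs / f)      # a few periods: shorter buffers at high f
        seg = signal[t:t + n]
        corr = 0j
        for k in range(n):
            hann = 0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1))
            corr += seg[k] * hann * cmath.exp(-2j * math.pi * f * k / fs)
        spectrum[f] = abs(corr) / (n / 2)   # rough normalisation
    return spectrum

# Toy check: a pure 64 Hz tone should light up the 64 Hz probe.
fs = 8000
sig = [math.sin(2 * math.pi * 64 * k / fs) for k in range(fs)]
spec = probe_spectrum(sig, 0, fs, [16, 32, 64, 128, 256, 512, 1024])
print(max(spec, key=spec.get))   # -> 64
```

In practice the probe buffers would be precomputed once (steps two and three above) rather than rebuilt per call as in this sketch.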
From what I can work out, this would give you superb temporal resolution, proportional to frequency (a few cycles at each frequency), at whatever frequency resolution you like, using approximately N operations times log N frequencies, so long as your frequencies are logarithmically spaced -- which is exactly what you want for music, and exactly the problem with the STFT. Thus it solves the problems with the STFT (for musical signals), at equivalent computational complexity.
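Tallying my own operation count (the sample rate and octave-spaced probe list are arbitrary illustrative choices):

```python
fs = 44100
cycles = 4
freqs = [16 * 2 ** k for k in range(7)]           # 16 .. 1024 Hz, octaves

ops = {f: int(cycles * fs / f) for f in freqs}    # multiply-adds per probe
total = sum(ops.values())

# Because the buffer lengths halve with each octave, the per-probe
# counts form a geometric series: the lowest-frequency probe dominates,
# and the total stays under twice its cost alone.
print(ops[16], total)   # -> 11025 21876
```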
What am I missing? Would this work at all? Are there caveats or pitfalls? Is my complexity calculation mathematically sound?
It feels a bit like inventing a perpetual-motion machine -- it looks like it should work, but my gut feeling is that I'm glossing over some important detail.