
First, some background: The STFT is the best general-purpose tool I know of for analysing a (musical or other) signal into its component frequencies. For many purposes the STFT works fine, but there are two related problems that make it non-ideal for musical purposes -- 1) the frequency bins are evenly spaced, rather than logarithmically spaced the way the human hearing system perceives pitch, and 2) by the mathematical nature of the DFT/FFT, you need to capture a long stretch of audio to get decent frequency resolution at low frequencies.

For example, to distinguish semitones accurately in the bass range of a piano, you'd need approximately 1 Hz resolution, which means capturing about 1 second of audio. For music, that is a completely unacceptable time resolution.
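
Roughly, the arithmetic behind that figure (taking the lowest piano semitone step, A0 at 27.5 Hz up to A♯0 at about 29.1 Hz, as the worst case):

$$\Delta f = \frac{1}{T}, \qquad 29.1\,\text{Hz} - 27.5\,\text{Hz} \approx 1.6\,\text{Hz} \quad\Rightarrow\quad \Delta f \lesssim 1\,\text{Hz} \ \Rightarrow\ T \gtrsim 1\,\text{s}.$$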

There are various ways to enhance the output of the FFT, including a) zero-padding the analysis frame to get a denser grid of bins, b) overlapping the analysed segments, or c) using the phase information across frames to refine frequency estimates (as a phase vocoder does). These are useful, but still have problems -- more computational overhead, and while (a) can increase the apparent frequency resolution at a given temporal resolution, it doesn't help when two frequencies happen to fall into the same bin.
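
For concreteness, here's a rough NumPy sketch of option (a); the frame length, padding factor and test frequency are arbitrary choices for illustration:

```python
import numpy as np

# Zero-pad a 1024-sample frame to 8192 points before the FFT. The bin spacing
# shrinks from fs/1024 to fs/8192, but the true resolution is still set by the
# 1024-sample window -- padding only interpolates the spectrum.
fs = 44100
frame = np.sin(2 * np.pi * 440.3 * np.arange(1024) / fs) * np.hanning(1024)
spectrum = np.fft.rfft(frame, n=8192)               # n > len(frame) zero-pads
peak_hz = np.argmax(np.abs(spectrum)) * fs / 8192
print(peak_hz)   # lands much nearer 440.3 Hz than the unpadded ~43 Hz grid allows
```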

So... I've been researching other possibilities, out of sheer interest. The CWT and DWT looked very promising at first, but the CWT is very slow to compute, and the DWT only has octave frequency resolution. (Is that right?) I've also considered taking multiple FFTs at different time/frequency resolutions to cover the space better, or solutions like the zoom-FFT. These all help solve different specific problems, but none of them make the fundamental problem go away: we don't have a fast ($O(N \log N)$) way of analysing the frequencies in a signal at good temporal and frequency resolution.

A lot of people point to the Heisenberg uncertainty principle at this point, implying that achieving this is mathematically impossible. I'm pretty sure that's simply wrong: yes, the uncertainty principle is valid, but we're not even close to running into it yet. The real limitation is that the DFT/FFT can only analyse frequencies at integer multiples of the window's fundamental (the reciprocal of the window length). The other frequencies are certainly there to be analysed, at good temporal resolution -- if we're willing to use a (computationally slow) sine sweep or continuous wavelet transform. The uncertainty principle doesn't cause problems until long after the FFT ceases to be useful.

So, my solution. I'd like you to tell me if and where it's wrongheaded: let's correlate the input signal with probe phasors at selected frequencies. At first this sounds like an $O(N^2)$ algorithm ($N$ operations for a correlation at each frequency, times the number of frequencies). The key point, though, is that to match the FFT's frequency resolution where it matters (at low frequencies), you only need on the order of $\log N$ logarithmically spaced probe frequencies -- because in musical signals, finer-than-logarithmic resolution at high frequencies is much less important.

So, we have an algorithm that works like this (a rough code sketch follows the list):

  1. Decide on your frequency resolution. Let's say octaves for now, though I'd want sub-semitone resolution in most applications.
  2. Pre-load some buffers with probe phasors at frequencies spaced at that interval -- let's say 16, 32, 64, 128, 256, 512, 1024 Hz, [...]. Because higher frequencies can be detected in a shorter amount of time, you need fewer samples (= better temporal resolution, as well as fewer operations) at higher frequencies. For the sake of it, let's use a few periods at each frequency.
  3. Pre-window these buffers, if it makes sense.
  4. Correlate each of these with a long-enough segment of the input signal at time t (i.e. the sum of the component-wise multiplication of the segment and the probe). Probably normalize this value.
  5. The result of this correlation is the magnitude of the signal at that frequency, at time t. So, the table of results-vs-frequencies should be your frequency spectrum at the desired logarithmic resolution (octaves, in this case).
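
To make the steps concrete, here's a rough NumPy sketch of what I mean (the function names, the choice of four cycles per probe, the Hann window and the normalisation are just illustrative assumptions, not fixed parts of the idea):

```python
import numpy as np

def make_probe(freq, fs, cycles=4):
    """A windowed complex probe phasor at `freq` Hz, `cycles` periods long (steps 2-3)."""
    n = int(round(cycles * fs / freq))        # higher frequencies -> shorter buffers
    t = np.arange(n) / fs
    window = np.hanning(n)
    probe = window * np.exp(-2j * np.pi * freq * t)
    return probe / window.sum()               # rough normalisation across probe lengths

def probe_spectrum(signal, start, freqs, fs):
    """Correlate the segment at `start` with one probe per frequency (steps 4-5)."""
    mags = []
    for f in freqs:
        p = make_probe(f, fs)
        seg = signal[start:start + len(p)]
        if len(seg) < len(p):                 # ran off the end of the signal
            mags.append(0.0)
        else:
            mags.append(abs(np.dot(seg, p)))  # sum of component-wise products
    return np.array(mags)

# Octave-spaced probes as in step 2; a 128 Hz test tone should light up one bin.
fs = 44100
freqs = [16, 32, 64, 128, 256, 512, 1024]
tone = np.sin(2 * np.pi * 128 * np.arange(fs) / fs)
print(np.round(probe_spectrum(tone, 0, freqs, fs), 3))
```

Sliding `start` forward and repeating gives the time-vs-frequency picture described in step 5.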

From what I can work out, this would give you superb temporal resolution (proportional to a few cycles of each analysed frequency) at whatever frequency resolution you like, using approximately $N$ operations times $\log N$ frequencies -- so long as your frequencies are logarithmically spaced, which is exactly what you want for music and exactly what the STFT gets wrong. In other words, it would solve the STFT's problems (for musical signals) at equivalent computational complexity.
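
To make the complexity claim explicit (my notation: $b$ bins per octave, probe frequencies between $f_{\min} \approx f_s/N$ and $f_{\max} \approx f_s/2$, and at most $N$ samples per correlation):

$$K \;=\; b \log_2\!\frac{f_{\max}}{f_{\min}} \;\approx\; b \log_2\!\frac{N}{2} \;=\; O(\log N), \qquad \text{total work} \;\lesssim\; K \cdot N \;=\; O(N \log N).$$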

What am I missing? Would this work at all? Are there caveats or pitfalls? Is my complexity calculation mathematically sound?

It feels a bit like inventing a perpetual-motion machine -- it looks like it should work, but my gut feeling is that I'm glossing over some important detail.

  • 1. how do you plan to do the correlation? 2. how often do you want "points" in the envelope functions that come out of this? 3. how is this different from the heterodyne oscillator? -- Commented Jan 7, 2014 at 5:24
  • 1. Is there more than one way? My thought would be sum(s[n] * p[n] for n in range(len(p))) where s[] is a segment of the input and p[] is a windowed probe phasor. If I understand correctly, this is the same as in the definition of the DFT/FT. 2. What do you mean, exactly? There will be a single number (the correlation) in the output for each probe phasor. There are log N probe phasors, spaced at logarithmic frequencies. 3. Heterodyne oscillation is just amplitude modulation, if I understand Wikipedia correctly. I guess I don't see any similarities -- do you?
    – bryhoyt
    Commented Jan 7, 2014 at 6:32
  • Oh, I just realised I didn't specify that the correlation would be with a segment of the signal at time t, because it's supposed to be a replacement for the Short-Time Fourier Transform, to produce a time-vs-frequency-magnitude graph. Edited to clarify.
    – bryhoyt
    Commented Jan 7, 2014 at 6:44
  • did you look at the constant-Q transform?
    – endolith
    Commented May 1, 2015 at 3:48
  • If you want an invertible, fast ($O(N \log N)$), logarithmic-frequency analysis method, consider combining the band-splitting and downsampling approach of the fast wavelet transform with an STFT analysis of the individual bands. That means you can split up each half band again and trade frequency versus time resolution to your liking. The frequency band spacing will be piecewise linear, but approximate a logarithmic grid rather well. The only downside is the accumulation of band aliasing due to the critical downsampling, which will affect your SNR.
    – Jazzmaniac
    Commented May 1, 2015 at 9:40

3 Answers


True, the complexity would be $O(N \log N)$.

But this $\log N$ captures a different reality from the $\log N$ in the FFT algorithm. In your case it comes from the fact that you're only interested in a smaller set of logarithmically spaced frequencies. In the FFT case, it comes from the "divide and conquer", recursive structure of the Cooley–Tukey algorithm.

Let's say you're interested in quarter-tone resolution, from 40 Hz to 1.28 kHz. That's 5 octaves, or 120 quarter-tones. For a 1024-long buffer, your algorithm is doing exactly 1024 × 120 complex MACs (plus some book-keeping). A naive radix-2 FFT would do 1024 × 10 complex MACs (plus some book-keeping). Your algorithm is 12 times slower.
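
Writing the counts out:

$$1024 \times 120 = 122\,880 \quad\text{vs.}\quad 1024 \times \log_2 1024 = 1024 \times 10 = 10\,240, \qquad \frac{122\,880}{10\,240} = 12.$$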

Also, your transform doesn't sound invertible, which makes it useful only for analysis applications (not resynthesis, effects...).

  • Ok, that makes sense, thanks. You can save a lot of the MACs by only correlating with a single cycle of each given frequency, though -- the number of samples required at each increasing frequency drops off exponentially. Is that valid? Also, a 1024-long naive radix-2 FFT would only start to get quarter-tone resolution at about 1.4 kHz -- so perhaps this scenario isn't the best comparison? Of course, it does depend on what frequency band you're most interested in.
    – bryhoyt
    Commented Jan 7, 2014 at 20:29
  • Correct, it's not invertible. Or at least I haven't thought about that side of it. Good point, that's a pretty important feature for many applications. I'm considering it only for analysis (and perhaps resynthesis in a roundabout way, stemming from the frequency analysis, but not as a direct inversion).
    – bryhoyt
    Commented Jan 7, 2014 at 20:31
  • If you correlate with a single cycle you don't get the benefit of improved temporal resolution in the highest frequencies -- since in this case the temporal resolution in the highest frequencies is limited by the "step size" dictated by the longest analysis period in the low frequencies. -- Commented Jan 7, 2014 at 21:43

What you describe might be similar to applying a Constant-Q transform using Morlet or Gabor wavelets. These transforms may have a higher computational cost than using windowed FFTs, but may, when scaled suitably, represent something closer to human perceptual resolutions for both frequency and time.
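
For instance, here is a minimal sketch of that kind of analysis using librosa's constant-Q transform (the parameter values are arbitrary example choices, not recommendations):

```python
import numpy as np
import librosa

# A two-second 196 Hz (G3) test tone stands in for real audio here.
sr = 22050
t = np.arange(2 * sr) / sr
y = np.sin(2 * np.pi * 196.0 * t)

C = np.abs(librosa.cqt(y, sr=sr,
                       fmin=librosa.note_to_hz('C2'),  # lowest analysed note
                       n_bins=60,                      # 5 octaves...
                       bins_per_octave=12))            # ...at semitone spacing
print(C.shape)   # (60 log-spaced frequency bins, number of time frames)
```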

  • I'm not sure -- yes, the purpose is certainly the same as some uses of a Constant-Q transform. But I've never seen a Constant-Q transform that achieves good time/frequency resolution in anything near O(N log N) time, and I'm proposing a specific algorithm for doing so, and wondering if it's as good as it seems to me. If we could really achieve the level of temporal & frequency resolution I'm outlining with O(N log N) operations, then why doesn't more software do it? If we can't, then what's wrong with my theory?
    – bryhoyt
    Commented Jan 7, 2014 at 6:48

Please define what you mean by "probe phasor"? I haven't encountered that term in audio.

I think that your theory is good, because I wrote a similar program in 2004 which gives sample-accurate results, without bins and windows. The uncertainty principle does prevent bit-accurate detection of sine-wave peaks and troughs. If you had two sines mixed inside an audio file and you wanted to analyse them in a spectrogram from a submarine radar, which is pretty precise, you wouldn't be able to see very precise frequency and time signatures: you couldn't have a 440 Hz sine wave that makes a single detected signature line exactly at 440 Hz. But you could average the values of the spectrogram, which form a bell curve around the 440 Hz mark, and say with high precision that the signal was at 440 Hz.

The probe phasor would necessarily be based on delaying the audio and comparing values across the time domain, and that means the detected time signatures are delayed by an amount that depends on the frequency, so low frequencies have lower time-domain resolution... and as you increase the precision of the probe phasor, I found that the resulting graph values are written with more time delay, or less time precision -- I don't know which.

I found that, because you can compute a 1 GB high-precision analysis of a one-second sound, you end up with so much data that you can visually and mathematically average and process the result to get much more precision than the initial uncertainty limit implies. With further processing you can get frequency-accurate data, and very accurate time data too, but it does seem to come from statistically averaging the results -- as with all statistics, with a lot of data you have high certainty about the underlying values.

You can also run the audio forwards and backwards, and determine from both graphs where in time the detected signals are actually occurring: they lie in the middle of the forwards- and backwards-analysed spectrograms.

  • i think what the OP means by "correlate the input signal with probe phasors at select frequencies" is what we used to call the "heterodyne oscillator" where you multiply the input $x[n]$ by $e^{j \omega n}$ and LPF the result to get a complex envelope (has magnitude and phase) of the component of the input at frequency $\omega$ (normalized angular frequency so that the Nyquist frequency is $\omega=\pi$). -- Commented May 1, 2015 at 20:30
  • if it were single notes to analyze, i would first do pitch detection and, knowing the period of each cycle (to a precision of a fractional sample), interpolate each cycle to be, say, 256 samples long, then FFT and the 1st harmonic will be in bins 1 and 255, the 2nd harmonic will be in bins 2 and 254, etc. -- Commented May 1, 2015 at 20:33
  • @robertbristow-johnson for an analysis of a multi-octave range of musical notes, could you do this twelve times on the lowest octave, once for each musical note, given that notes in higher octaves are automatically harmonics of the note in the lowest octave? This would be 12 × 256-FFTs, however each would operate on the same duration of signal, allowing better realtime updates. I don't have a good feeling for how this would compare performance-wise.
    – davidA
    Commented Jan 9, 2017 at 22:25
  • well, @meowsqueak, your resolution would be $\tfrac{1}{12}$ octave. problem is that note harmonics sometimes slide outa one semitone and into another. lotsa analysis to do with the 12 big bins, and i dunno how you would analyze the note without ambiguity in comparison with the STFT done well. -- Commented Jan 9, 2017 at 22:39
