As I commented on a previous post, the time-frequency analysis method known as the "short-time Fourier transform" (STFT) $X$ is equivalent to a filter bank analysing your signal $x$. For a given analysis window $w_n$ of size $N$, the filter at frequency $k/N$ is:
$$ h_n = w_{-n} \, e^{j 2\pi \frac{nk}{N}} $$
For usual analysis windows (Hann, Hamming, or even rectangular), this corresponds to a low-pass filter with a cut-off frequency around $1/N$, which is "shifted" to frequency bin $k$ (thanks to the complex exponential modulation), therefore resulting in a band-pass filter.
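To make this equivalence concrete, here is a minimal numpy sketch (a hop size of one sample, with window length and bin index that are my own illustrative choices): the output of the filter $h_n$ at a given time equals the STFT coefficient of the corresponding frame.

```python
import numpy as np

N, k = 256, 16                       # window length and frequency bin (illustrative values)
x = np.random.randn(2048)            # any test signal
w = np.hanning(N)                    # analysis window w_0 ... w_{N-1}
n = np.arange(N)

# STFT coefficient at bin k for the frame starting at sample m
def stft_bin(m):
    return np.sum(x[m:m + N] * w * np.exp(-2j * np.pi * k * n / N))

# Filter-bank view: h_n = w_{-n} exp(j 2 pi n k / N) is supported on n = -(N-1), ..., 0.
# Delaying it by N-1 samples makes it causal; the convolution output at time m + N - 1
# is then the STFT coefficient of the frame starting at sample m.
h_causal = (w * np.exp(-2j * np.pi * k * n / N))[::-1]
y = np.convolve(x, h_causal)

m = 500
print(np.allclose(y[m + N - 1], stft_bin(m)))    # True: both views agree
```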
At this point, to directly answer your concern about reflecting human perception, some people derived the ["constant-Q transform" (CQT)][Brown91]. It relies on the same principle as the FT, in its filter bank interpretation. However, the center frequencies $f_k$ are not linearly spaced as for a "normal" FT, but rather log2-spaced. The scale is then closely related to a Western musical scale: if one chooses $f_{k+1} = 2^{1/12} f_k$, then we obtain 12 frequencies per octave (rings a bell? :-) ), and the bandwidth is set to, say, $\frac{2^{1/12} - 1}{2} f_k$. You can also choose other centers, as best suits your needs.
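As a small illustration of that geometric spacing (the starting frequency, the number of bins, and the bandwidth rule are just the example values above, nothing prescribed):

```python
import numpy as np

f0, bins_per_octave, n_bins = 55.0, 12, 49              # e.g. from A1 (55 Hz) up four octaves
f_k = f0 * 2.0 ** (np.arange(n_bins) / bins_per_octave)  # log2-spaced center frequencies
bw_k = (2.0 ** (1.0 / bins_per_octave) - 1) / 2 * f_k    # the bandwidth suggested above

print(f_k[:13])    # one octave: 55, 58.27, ..., 110 Hz -- the 12 semitones plus the octave
print(bw_k[0])     # about 1.6 Hz around 55 Hz; the bandwidth grows with frequency
```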
You can find implementations of the CQT here and there; a recent one by Prof. Klapuri, which comes with a rather decent inverse, can be found here. The Audio group at Telecom ParisTech also has an implementation by Prof. Prado, but I have not tried it yet.
[Brown91] J. Brown, "Calculation of a Constant Q Spectral Transform", Journal of the Acoustical Society of America, 1991, 89, 425-434
EDIT 20121014: some answers and comments to your (bryhoyt's) questions.
Just general ideas on your own comments to the main question:
You seem to be interested in many applications which are, to me, not quite trivial problems to address. "Timbre modelling" sounds to me more related to speech recognition or the like, for which pitch or frequency resolution or precision is not much of an issue (consider how MFCCs are usually computed).
Consider also how many top researchers (F. Pachet and the RepMus team at IRCAM, France, to name a few) are working on the topic of automatic improvisation and accompaniment: the task is not impossible, but it requires expertise in many areas. To summarize, a typical system needs to imitate the human auditory system (at least), implement sound/music/pitch/rhythm perception, know about music theory, and make decisions based on the estimates from all the previous steps. The Fourier transform, or any signal representation, is just one (tiny) step towards the end goal - and, in my opinion, potentially the best understood one so far.
That said, there is still the possibility that everyone is looking far beyond what actually happens, and that you may crack it with a simple, and thus elegant, solution! Don't forget to publish it once it's done! :-)
> a sample of 0.1s at 44kHz is enough to contain a vast range of frequencies
This would lead, in the case of the FT, to a resolution on the order of $F_s / N = 44100/4410 = 10$ Hz, at all the frequency bins of the FT. That's almost 2 semitones at 100 Hz! It could be better...
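In code, this is just the arithmetic above:

```python
fs, dur = 44100, 0.1
N = int(fs * dur)                            # 4410 samples
resolution = fs / N                          # 10 Hz spacing between FT bins
semitone_at_100 = 100 * (2 ** (1 / 12) - 1)  # ~5.9 Hz: one semitone above 100 Hz
print(resolution, semitone_at_100)           # the 10 Hz grid is coarser than a semitone there
```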
> The FFT can't detect this for low and high frequencies, but you say other algorithms can: what's the tradeoff?
Short answer: read my thesis on melody estimation!
To elaborate a bit more: many pitch estimation algorithms go beyond the limitations of the FT, thanks to assumptions on the sounds to be processed. We expect notes from natural sounds (human voice, oboe, sax, piano...) to be more complex than single sinusoids. Most pitched sounds are more or less harmonic, which means that they can be modelled as sums of sinusoids whose frequencies are multiples of the fundamental frequency.
It is therefore useful to take these harmonics into account when estimating the pitch: methods doing so use detection functions such as the spectral sum, the spectral product, or the auto-correlation function. Someone started a related topic recently.
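For illustration, here is a rough sketch of the spectral-product idea (the function name, the 1 Hz candidate grid, and the parameter values are all my own choices, just for the example):

```python
import numpy as np

def pitch_spectral_product(x, fs, n_harmonics=5, fmin=50.0, fmax=1000.0):
    """Rough f0 estimate: for each candidate f0, sum the log-magnitude spectrum
    at its first harmonics (taking the log turns the spectral product into a sum)."""
    n_fft = 8192
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), n_fft))
    candidates = np.arange(fmin, fmax, 1.0)          # 1 Hz grid of candidate f0s
    scores = np.empty(len(candidates))
    for i, f0 in enumerate(candidates):
        idx = np.round(np.arange(1, n_harmonics + 1) * f0 * n_fft / fs).astype(int)
        idx = idx[idx < len(spec)]
        scores[i] = np.sum(np.log(spec[idx] + 1e-12))
    return candidates[np.argmax(scores)]

# quick check on a synthetic harmonic tone at 220 Hz
fs = 44100
t = np.arange(int(0.1 * fs)) / fs
tone = sum(np.sin(2 * np.pi * 220 * h * t) / h for h in range(1, 6))
print(pitch_spectral_product(tone, fs))              # close to 220
```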
> What are the tradeoffs? More specifically, what level of frequency accuracy can I expect for a reasonably short window? (I understand the window size in CQT is variable -- how much so?) Even more specifically, how close will I be able to get to my approx. goal of 0.5% frequency difference with a window of 0.005s?
As previously said, with a window of 0.005s you can expect something like 200 Hz of "frequency leakage". That's really a problem only when you have two sinusoids whose frequencies are closer than 200 Hz, such that the FT won't be able to show that they are two different sinusoids. Well, we are far from your 0.5% (by the way, a semitone is about 6% of the frequency!), and 0.005s is really a bit short for your purpose. However, if you want to provide an estimate every 0.005s, you can still process longer, overlapping frames, as is usually done in speech/music processing. Is that what you actually want?
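For instance, with a longer window hopped every 5 ms (the 50 ms window length here is just an example value, not a recommendation):

```python
import numpy as np
from scipy.signal import stft

fs = 44100
x = np.random.randn(fs)                  # one second of some signal
win = int(0.050 * fs)                    # 50 ms analysis window: ~20 Hz bin spacing
hop = int(0.005 * fs)                    # but a new frame every 5 ms
f, t, X = stft(x, fs=fs, nperseg=win, noverlap=win - hop)
print(f[1] - f[0], t[1] - t[0])          # ~20 Hz frequency grid, 5 ms time step
```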
As for the size of the windows, you can refer to [Schoerkhuber2010], with frame lengths equal to:
$$
N_k = \frac{F_s}{f_k (2^{1/B} - 1)}
$$
where $B$ is the number of frequency bins per octave desired for the CQT. That means very long windows: $B=48$ and $f_k=100$ Hz require windows about 0.7s long. Needless to say, we then lose a bit of temporal resolution... But as mentioned earlier, this is a problem only if we forget the structure of the sound. Additionally, psychoacoustics considers that below 500 Hz, humans do not really distinguish sinusoids so well: even humans are challenged there. Of course, we can hope our computers can do better than us, but here we face a tough issue!
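Plugging in the numbers (just the example values from above):

```python
fs, B, f_k = 44100, 48, 100.0              # sample rate, bins per octave, center frequency
N_k = fs / (f_k * (2 ** (1.0 / B) - 1))    # frame length from [Schoerkhuber2010]
print(round(N_k), N_k / fs)                # ~30300 samples, i.e. roughly 0.7 s
```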
Finally, note that other ways of computing a time-frequency representation of a sound exist; consider for instance gammatone filter banks. The advantage of the CQT I mentioned previously is that there is software for both the transform and its inverse. Personally, I still stick to the STFT, though, for its simplicity and because, so far, I have never needed better resolution in the low frequencies, even for source separation.
[Schoerkhuber2010] C. Schoerkhuber and A. Klapuri, "Constant-Q Transform Toolbox for Music Processing", 7th Sound and Music Computing Conference, Barcelona, Spain, 2010.