In various software forums, including Stack Overflow, I've seen many posts by software developers who seem to be trying to determine musical pitch (for instance, when coding up yet another guitar tuner) by using an FFT. They expect to find some clear dominant frequency in the FFT result to indicate the musical pitch. But musical pitch is a psycho-perceptual phenomenon. These naive attempts at using an FFT often fail, especially when used for the sounds produced by big string instruments and bass or alto voices. And so, these software developers ask what they are doing wrong.
It is surprising that, in these modern times, people do not realize just how much of what they experience in life is mostly an illusion. There is plenty of recent research on this topic. One of my favorite books on this general subject is "MindReal" by Robert Ornstein. Daniel Kahneman won the 2002 Nobel Prize for ground-breaking research in a related area.
The illusion that our decisions are logical allows us to be susceptible to advertising and con men. The illusion that we see what is really out there allows magicians and pick-pockets to perform tricks on us. The illusion that what we hear is actually the sound that a musical instrument transmits to our ear is what seems to be behind the misguided attempts to determine the pitch of a musical instrument, or a human voice, by using just a bare FFT.
One reason that frequency is not pitch is that many interesting sounds contain a lot of harmonics or overtones. This is what makes these sounds interesting; otherwise they would sound pretty boring, like a pure sine-wave generator. The higher overtones or harmonics, after being amplified or filtered by the resonance of the body of a musical instrument or the head of a singer, can often end up stronger than the original fundamental frequency. Then the ear/brain combination of the listener, finding mostly these higher harmonics in a sound, guesses and gives us the illusion that we are hearing just a lower pitch. In fact, this lower pitch frequency can be completely missing from the audio frequency spectrum of a sound, or nearly so, and still be clearly heard as the pitch.
So pitch is different from frequency, and musical pitch detection and estimation is different from plain frequency estimation. Pitch estimators therefore look for periodicity, not spectral frequency.
But how can you have a periodic sound that does not contain the pitch frequency in its spectrum?
It's easy to experiment and hear this yourself. Using a sound editor, one can create a test sound waveform which shows this effect. Create a high-frequency tone, say a 1568 Hz pure sine wave (a G6). Chop this tone into a short segment, say slightly shorter than a 100th of a second. Repeat this high-frequency tone segment 100 times per second. Play it. What do you hear? It turns out you don't hear the high-frequency tone. An FFT will show most of the waveform magnitude in a frequency bin near 1568 Hz, since that's what makes up the vast majority of the waveform. Even though the sound you created consists only of high-frequency sine-wave bursts, you'll actually hear the lower frequency of the repeat rate, or the periodicity. A human will hear 100 Hz.
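If you'd rather script the experiment than use a sound editor, here is a minimal sketch of the same test signal, using Python with numpy and the standard wave module. This is my own illustration, not code from the post: the 80% duty cycle, the two-second length, and the burst_train.wav file name are arbitrary choices.

```python
import wave

import numpy as np

SAMPLE_RATE = 44100          # samples per second
TONE_HZ     = 1568.0         # G6 pure sine tone
REPEAT_HZ   = 100            # bursts per second
DURATION_S  = 2.0            # total length of the test sound

period_len = SAMPLE_RATE // REPEAT_HZ     # samples in one 1/100 s frame
burst_len  = int(period_len * 0.8)        # burst slightly shorter than the frame

# One frame: a short 1568 Hz burst followed by silence to pad out 1/100 s.
t     = np.arange(burst_len) / SAMPLE_RATE
frame = np.zeros(period_len)
frame[:burst_len] = np.sin(2.0 * np.pi * TONE_HZ * t)

# Repeat the frame 100 times per second for the full duration.
signal = np.tile(frame, int(DURATION_S * REPEAT_HZ))

# Most of the FFT magnitude sits in bins near 1568 Hz (the strongest bin is
# the harmonic of the 100 Hz repeat rate closest to the tone), yet the
# perceived pitch is 100 Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs    = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
print("strongest FFT bin: %.1f Hz" % freqs[np.argmax(spectrum)])

# Write a 16-bit mono WAV file so you can listen to it yourself.
with wave.open("burst_train.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(SAMPLE_RATE)
    samples = (signal * 0.5 * 32767).astype(np.int16)
    w.writeframes(samples.tobytes())
```

Playing burst_train.wav, you should hear a 100 Hz buzz, while the printed strongest FFT bin sits up near the 1568 Hz tone, nowhere near the 100 Hz you hear.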
So to determine what pitch a human will hear, one needs a periodicity or pitch detector or estimator, not a frequency estimator.
There are many pitch detection or pitch estimation methods from which to choose, with varying strengths, possible weaknesses, and differing computational complexity. Some of these methods include autocorrelation, or other lag estimators such as AMDF or ASDF. A lag estimator looks for which later segment of the waveform is closest to being a copy of the current segment. Lag estimators are often weighted, since periodic waveforms offer multiple repetition periods from which to choose. Other pitch estimation methods include, in no particular order, cepstrum or cepstral methods, harmonic product spectrum analysis, linear predictive coding analysis, and composite methods such as YAAPT or RAPT, which may even involve some statistical decision analysis.
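To make the lag-estimator idea concrete, here is a minimal sketch of a normalized autocorrelation search, again in Python with numpy. It is my own illustration rather than code from the post or from any particular library; the 50-1000 Hz search range and the very mild small-lag weighting constant are arbitrary assumptions.

```python
import numpy as np

def estimate_pitch_autocorr(x, sample_rate, fmin=50.0, fmax=1000.0):
    """Rough pitch estimate in Hz for one frame of samples x, or 0.0 if none found."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                         # remove DC so silence doesn't "correlate"
    energy = np.dot(x, x) + 1e-12
    min_lag = max(1, int(sample_rate / fmax))        # shortest candidate period
    max_lag = min(int(sample_rate / fmin), len(x) // 2)

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag):
        # How well does the frame match a copy of itself delayed by `lag` samples?
        score = np.dot(x[:-lag], x[lag:]) / energy
        # Mild weighting toward shorter lags, so a lag of two or three periods
        # (an octave-down error) doesn't win on a near-tie.
        score *= 1.0 - 0.0001 * lag
        if score > best_score:
            best_lag, best_score = lag, score

    return sample_rate / best_lag if best_lag else 0.0
```

Run on the two-second burst-train signal generated above (with sample_rate = 44100), this returns 100 Hz, the repetition rate a listener actually hears, rather than anything near the 1568 Hz spectral peak.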
It's not as simple as feeding samples to an FFT and expecting a useful result.
Great post. For the last few weeks I have been investigating this area. Your post gave me some guidance.
Thank you, Ron. This was a very interesting musing. As dpakrk has apparently been doing, I am also investigating pitch and the relationship to frequency and periodicity in my efforts to try and understand human speech intonation, most specifically, in the interplay of pitch and intensity. I found your excellent iPad/iPhone app "Hot Paw Musical Piano Roll Spectrograph" and I must say it's the best (and most fun) bit of software I've found to date to analyze live sound as well as recorded samples. So I thank you profusely for creating that app, and I will follow your musings closely from now on! :)
aloha from Kaliko in Hawaii.
Very useful Ron, thank you
so true, thanks
Great post. Thank you!
just saw this for the first time, hotpaw. now i know your "real" name.
just FYI, i use ASDF pretty much exclusively. but i turn it upside down so it looks like a normalized autocorrelation. there are also lots of tricks to save computation. i don't compute *every* autocorrelation lag. and while computing a single lag, i stride through the data with a step that is larger than a single sample to shorten the summation.
the real secret sauce is in how one chooses which autocorrelation peak to use and avoids octave errors.
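Here is a hedged sketch of the upside-down-ASDF idea described in the comment above (my own reconstruction, not the commenter's actual code): the average squared difference at each candidate lag is flipped so it behaves like a normalized autocorrelation that reaches 1.0 on an exact repeat, and the summation is strided to shorten it. The stride of 4, the lag range, and the naive global-maximum peak choice are illustration assumptions; it also still evaluates every lag in the range, so the comment's other trick of skipping lags is not shown.

```python
import numpy as np

def inverted_asdf(x, lag, stride=4):
    """ASDF at one lag, inverted into a 0..1 similarity score.

    Summing only every `stride`-th sample pair shortens the summation,
    one of the computation-saving tricks mentioned in the comment.
    """
    a = x[:-lag:stride]
    b = x[lag::stride]
    diff_energy  = np.dot(a - b, a - b)                # sum of squared differences (ASDF)
    total_energy = np.dot(a, a) + np.dot(b, b) + 1e-12
    # 1 - ASDF/energy  ==  2*sum(a*b) / (sum(a*a) + sum(b*b)),
    # i.e. a normalized autocorrelation that reaches 1.0 for an exact repeat.
    return 1.0 - diff_energy / total_energy

def estimate_period_hz(x, sample_rate, fmin=50.0, fmax=1000.0, stride=4):
    """Pick the lag with the highest inverted-ASDF score and return it in Hz."""
    x = np.asarray(x, dtype=float)
    lags = range(max(1, int(sample_rate / fmax)), int(sample_rate / fmin))
    scores = [inverted_asdf(x, lag, stride) for lag in lags]
    # Naively take the global maximum; choosing *which* peak to trust,
    # to avoid octave errors, is the hard part the comment alludes to.
    best = int(np.argmax(scores))
    return sample_rate / lags[best]
```

How large a stride you can get away with depends on the signal, and as the comment says, the real work is in deciding which of the resulting peaks to trust.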