(compMus2) Phase Vocoding

Phase Vocoding allows for independent control of time duration and pitch. 

Time Expansion/Compression with Phase Vocoding

The conversion of an audio signal from the time domain to the frequency domain results in a series of frames containing bins of frequency and amplitude information. If you conceive of the FFT as producing a snapshot, a frozen picture of frequency/amplitude information for a short segment of time, then it is easy to understand time expansion/compression as similar to changing the frame rate of video playback. Individual pictures (the analysis frames) are not changed, only their rate of playback. 

Consider a simple math example. With an FFT size of 512 samples, and assuming a 44.1 kHz sampling rate, each analysis segment lasts 512/44100 of a second, or about 11.6 ms (call it 11 ms to keep the numbers round). In the frequency domain, this 11 ms analysis segment represents one frame of frequency/amplitude bins. If during resynthesis (the inverse FFT) each frame is resynthesized at a rate of 11 ms per frame, then the output signal is the same duration as the input signal. If the rate of frame resynthesis changes to 22 ms per frame (twice the original analysis duration), then the output signal will be twice as long as the original. If the rate changes to 44 ms per frame, then the output signal expands to four times the original length. This method of time expansion/compression is completely analogous to slow motion (or fast motion) video. You are not adding more frames to the video playback when you slow down/expand time (like you would with granular synthesis); you are simply changing the playback rate of the frames you have already recorded/analyzed.
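The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the frame-rate reasoning, not a phase vocoder implementation: the function names are made up for this example, and the 44.1 kHz sampling rate is an assumption (the 11 ms figure implies a rate in that neighborhood).

```python
def frame_duration_ms(fft_size, sample_rate=44100):
    """Length of one analysis segment in milliseconds.

    sample_rate defaults to 44.1 kHz, which is an assumption
    made for this example.
    """
    return 1000.0 * fft_size / sample_rate


def stretched_duration_s(input_duration_s, analysis_ms, resynthesis_ms):
    """Output duration when each frame is resynthesized at a
    different rate than it was analyzed.

    The number of frames stays the same; only the time allotted
    to each frame changes.
    """
    return input_duration_s * resynthesis_ms / analysis_ms


analysis_ms = frame_duration_ms(512)
print(round(analysis_ms, 2))  # 11.61

# A 2-second input, using the rounded 11 ms figure from the text:
print(stretched_duration_s(2.0, 11, 11))  # 2.0 (same length)
print(stretched_duration_s(2.0, 11, 22))  # 4.0 (twice as long)
print(stretched_duration_s(2.0, 11, 44))  # 8.0 (four times as long)
```

The stretch factor is simply the ratio of resynthesis time per frame to analysis time per frame, so a ratio below 1 compresses the sound instead of expanding it.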

Pitch Shifting with the Phase Vocoder

Phase vocoding shifts pitch through simple multiplication. You multiply the frequency information in all bins of every frame by the same transposition factor. Multiplying by 2 transposes the resynthesized output up one octave; multiplying by 0.5 transposes it down one octave. For precise semitone transposition, the factor is 2 raised to the power x/12, written 2^(x/12), where x is the number of semitones of transposition (positive to go up, negative to go down).
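As a rough sketch of that calculation in Python: the function names below are hypothetical, and the list of frequencies stands in for the per-bin frequency values of one analysis frame. A real phase vocoder would apply the same factor to every frame before resynthesis.

```python
def semitone_ratio(semitones):
    """Transposition factor for a shift of the given number of
    semitones in equal temperament: 2^(x/12)."""
    return 2.0 ** (semitones / 12.0)


def transpose_bins(bin_freqs_hz, semitones):
    """Multiply every bin frequency in one frame by the same factor.

    bin_freqs_hz is a plain list of frequencies in Hz, used here
    as a stand-in for one frame of analysis data.
    """
    ratio = semitone_ratio(semitones)
    return [f * ratio for f in bin_freqs_hz]


print(semitone_ratio(12))    # 2.0 (up one octave)
print(semitone_ratio(-12))   # 0.5 (down one octave)
print(round(semitone_ratio(7), 4))  # 1.4983 (up a perfect fifth)

print(transpose_bins([220.0, 440.0], 12))  # [440.0, 880.0]
```

Because the factor is a ratio, every bin moves by the same musical interval even though the change in Hz differs from bin to bin.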

