The Vocoder and the Phase Vocoder are Not the Same Thing
The vocoder (short for voice encoder), familiar from musical uses such as Imogen Heap's "Hide and Seek," dates back to the 1930s. It was developed to transmit voice communications over long distances within small information bandwidths. The solution was to encode the voice as a set of control signals using filter-bank analysis. The control signals required far less data than the sound waves themselves, and the voice could be resynthesized at the receiving end by applying those control signals to a musical tone stored in the receiver. As telephone communications became prominent, it was determined that it was better to separate the transients of the voice from the vowel sounds. This technique was called channel vocoding.
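To make the filter-bank idea concrete, here is a minimal channel-vocoder sketch in Python. The function name, band count, and filter choices are my own illustrations, not a historical design: each band's energy in the voice becomes a control signal that shapes the same band of a stored carrier tone.

```python
# Minimal channel-vocoder sketch (illustrative names and parameters, my own).
import numpy as np
from scipy import signal

def channel_vocode(modulator, carrier, sr, n_bands=16):
    """Filter-bank analysis of `modulator`, resynthesis onto `carrier`."""
    edges = np.geomspace(80.0, 0.45 * sr, n_bands + 1)  # log-spaced band edges
    out = np.zeros_like(carrier)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Band-pass both signals in the same frequency band.
        sos = signal.butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        mod_band = signal.sosfilt(sos, modulator)
        car_band = signal.sosfilt(sos, carrier)
        # Envelope follower: rectify, then smooth with a low-pass filter.
        env_sos = signal.butter(2, 50.0, btype="lowpass", fs=sr, output="sos")
        envelope = signal.sosfilt(env_sos, np.abs(mod_band))
        # The envelope is the "control signal": it gates the carrier band.
        out += car_band * envelope
    return out / (np.max(np.abs(out)) + 1e-12)

sr = 44100
t = np.arange(sr * 2) / sr
carrier = signal.sawtooth(2 * np.pi * 110 * t)    # stand-in for the stored tone
modulator = np.random.randn(len(t)) * np.exp(-t)  # stand-in for a voice signal
voiced = channel_vocode(modulator, carrier, sr)
```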
In the 1960s, researchers at Bell Labs determined that if you used the Fourier Transform on short-time segments of voice, you did not have to separate the transients from the rest of the signal for proper encoding, and that this method of encoding could compress or expand time independently of pitch. I have simplified the discussion of the FT somewhat to say that frequency and amplitude information is obtained from the analysis. Frequency is actually obtained by measuring the phase angle (theta) of rotation, specifically, how much the phase advances from one analysis frame to the next. Therefore, when you manipulate frequency and time information with the FT, you are manipulating phase information. Researchers at Bell Labs called this technique the Phase Vocoder.
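As a rough illustration of how frequency falls out of phase measurement, the sketch below (my own naming and windowing choices) estimates each bin's frequency from how much its phase angle advances between two successive analysis frames:

```python
# Sketch: per-bin frequency from the change in phase angle between two
# successive short-time FFT frames (function name and window are my own).
import numpy as np

def bin_frequencies(frame_a, frame_b, fft_size, hop, sr):
    """Instantaneous frequency per bin from the phase advance across one hop."""
    spec_a = np.fft.rfft(frame_a * np.hanning(fft_size))
    spec_b = np.fft.rfft(frame_b * np.hanning(fft_size))
    bins = np.arange(fft_size // 2 + 1)
    bin_centers = bins * sr / fft_size
    # Phase each bin is expected to advance over `hop` samples at its center.
    expected = 2 * np.pi * bins * hop / fft_size
    # Measured phase advance, with the deviation wrapped into [-pi, pi).
    delta = np.angle(spec_b) - np.angle(spec_a) - expected
    delta = np.mod(delta + np.pi, 2 * np.pi) - np.pi
    # True frequency = bin center plus the deviation implied by the phase drift.
    return bin_centers + delta * sr / (2 * np.pi * hop)
```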
The classic vocoder employs a cross-synthesis technique similar to convolution. The phase vocoder allows for independent control of time duration and pitch, using control information determined by an FT analysis.
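A frame-by-frame cross-synthesis sketch, under my own simplifying assumptions, shows why the result resembles convolution: one sound's magnitude spectrum multiplies (shapes) the spectrum of the other, and multiplication in the frequency domain corresponds to convolution in the time domain.

```python
# Sketch of frame-by-frame cross-synthesis (my own minimal formulation).
import numpy as np

def cross_synthesize(carrier, modulator, fft_size=1024):
    hop = fft_size // 2
    window = np.hanning(fft_size)
    n = min(len(carrier), len(modulator)) - fft_size
    out = np.zeros(n + fft_size)
    for start in range(0, n, hop):
        car = np.fft.rfft(carrier[start:start + fft_size] * window)
        mod = np.fft.rfft(modulator[start:start + fft_size] * window)
        # Multiply the carrier spectrum by the modulator's magnitude spectrum;
        # this frequency-domain multiplication is the kinship to convolution.
        hybrid = car * np.abs(mod)
        out[start:start + fft_size] += np.fft.irfft(hybrid) * window
    return out / (np.max(np.abs(out)) + 1e-12)
```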
Time Expansion/Compression with Phase Vocoding
The conversion of an audio signal from the time domain to the frequency domain results in a series of frames containing bins of frequency and amplitude information. If you conceive of the FFT as producing a snapshot, a frozen picture of frequency/amplitude information for a short segment of time, then it is easy to understand time expansion/compression as similar to changing the frame rate of video playback. Individual pictures (the analysis frames) are not changed, only their rate of playback.
Consider a simple math example. With an FFT size of 512 samples, each analysis segment lasts for approximately 11 ms (at a 44.1 kHz sampling rate, 512/44,100 ≈ 11.6 ms). In the frequency domain, this 11 ms analysis segment represents one frame of frequency/amplitude bins. If during resynthesis (the inverse FFT) each frame is resynthesized at a rate of 11 ms per frame, then the output signal is the same duration as the input signal. If the rate of frame resynthesis changes to 22 ms per frame (twice the original analysis duration), then the output signal will be twice as long as the original. If the rate changes to 44 ms per frame, then the output signal expands to four times the original length. This method of time expansion/compression is completely analogous to slow-motion (or fast-motion) video. You are not adding more frames to the video playback when you slow down/expand time (as you would with granular synthesis); you are simply changing the playback rate of the frames you have already recorded/analyzed.
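Here is a bare-bones time-stretch sketch in that spirit (my own minimal formulation, not SoundHack's actual code): frames are analyzed at one hop size and resynthesized at another, with phase re-accumulated so the bins stay coherent at the new frame rate.

```python
# Bare-bones phase-vocoder time stretch (illustrative sketch, my own naming).
import numpy as np

def time_stretch(x, stretch, fft_size=512):
    ana_hop = fft_size // 4
    syn_hop = int(round(ana_hop * stretch))   # stretch=2.0 doubles duration
    window = np.hanning(fft_size)
    starts = range(0, len(x) - fft_size, ana_hop)
    out = np.zeros(len(starts) * syn_hop + fft_size)
    bins = np.arange(fft_size // 2 + 1)
    expected = 2 * np.pi * bins * ana_hop / fft_size
    prev_phase = np.zeros(len(bins))
    phase_acc = np.zeros(len(bins))
    for i, start in enumerate(starts):
        spec = np.fft.rfft(x[start:start + fft_size] * window)
        # Per-bin phase advance since the last analysis frame, unwrapped.
        delta = np.angle(spec) - prev_phase - expected
        delta = np.mod(delta + np.pi, 2 * np.pi) - np.pi
        prev_phase = np.angle(spec)
        # Re-accumulate phase at the *synthesis* hop so frames stay coherent.
        phase_acc += (expected + delta) * syn_hop / ana_hop
        frame = np.fft.irfft(np.abs(spec) * np.exp(1j * phase_acc))
        out[i * syn_hop : i * syn_hop + fft_size] += frame * window
    return out
```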
Pitch Shifting with the Phase Vocoder
Phase vocoding shifts pitch through simple multiplication. You multiply the frequency information in all bins of every frame by the same transposition factor. Multiplying by 2 transposes the output resynthesis up one octave; multiplying by 0.5 transposes the output down one octave. If you are looking for precise semitone transposition, you will need to calculate 2^(x/12), where x equals the number of semitones of transposition (use a negative x to transpose down).
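For example (a trivial calculation, shown in Python only for convenience):

```python
# Transposition factor for x semitones is 2**(x/12).
def transposition_factor(semitones: float) -> float:
    return 2.0 ** (semitones / 12.0)

print(transposition_factor(12))   # 2.0    -> one octave up
print(transposition_factor(-12))  # 0.5    -> one octave down
print(transposition_factor(7))    # ~1.498 -> a perfect fifth up
```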
The Phase Vocoder in Practice: SoundHack
Phase Vocoding can be implemented in different ways, and SoundHack should be understood as one implementation of a Phase Vocoder. The most noticeable thing about the SoundHack Phase Vocoder is that when you select the FFT size, the number of bands refers to the number of bands between 0 and the Nyquist Frequency, not 0 and the Sampling Rate.
Remembering that there is a trade-off between frequency resolution and time resolution, I tend to start with what I like to call my Goldilocks Settings. My default parameters are:
- FFT Size: 1024
- Overlaps: 2
- Window: Hamming
You can apply only one process, time scaling or pitch scaling, per pass, but you can obviously run one pass and then the other to stretch a sound and then transpose it. I would leave the gating at its default settings, as the FFT can produce analysis frequencies that don't actually exist in the signal.
You can change the FFT size and number of overlaps to capture certain aspects of sounds. If a sound is very complex, with a lot of inharmonic and low-frequency content, you may need to increase the number of analysis bands. Doing so, however, will reduce the time resolution of the output signal, and you will lose some of the transient quality of the original. Reducing the number of frequency bands can improve the time resolution, but if you hear artificial metallic sounds, you know that you have insufficient frequency resolution to reproduce the signal. Use your ears, along with a little trial and error, to find the best balance.
You can use more overlaps to offset the loss of time resolution that comes with larger FFT sizes, but know that you will often end up with audible echo effects in the output.
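The arithmetic behind these trade-offs is easy to tabulate. The sketch below assumes a 44.1 kHz sampling rate and treats "overlaps" simply as the factor by which the hop between frames is shorter than the window; both are my assumptions for illustration, not SoundHack's internals.

```python
# Trade-off arithmetic: bigger FFTs give narrower bands (better frequency
# resolution) but longer frames (worse time resolution); more overlaps
# shrink the hop between frames. Assumes a 44.1 kHz sampling rate.
sr = 44100
for fft_size in (256, 1024, 4096):
    for overlaps in (1, 2, 4):
        bands = fft_size // 2             # bands from 0 Hz to Nyquist
        bin_width_hz = sr / fft_size      # frequency resolution
        frame_ms = 1000 * fft_size / sr   # time resolution (frame length)
        hop_ms = frame_ms / overlaps      # spacing between frames
        print(f"{fft_size:5d} pts, {overlaps} overlaps: "
              f"{bands:4d} bands x {bin_width_hz:6.1f} Hz, "
              f"frame {frame_ms:5.1f} ms, hop {hop_ms:5.1f} ms")
```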