Approaches we read about
Throughout the project, we studied the existing literature on matching imperfectly sampled time-series data against a database of reference recordings. The papers we found diverged greatly in the quality of query data their algorithms were intended for. While some algorithms expected the input to be very similar to the time series in the database, as in radio content identification, others tolerated much more variation in the input. Commercial applications of the former type of system include Shazam and the music-recognition services offered by wireless carriers. Systems that accepted more variable input can be seen in the wild on sites like Midomi, whose database of pre-recorded music is augmented by user-contributed solo vocal recordings.
Here are some notes on a few of the papers we evaluated.
- J. Haitsma, T. Kalker, and J. Oostveen, “Robust Audio Hashing for Content Identification”: The authors describe a system that generates a sequence of hashes for an audio sample based on the energy of different frequency components over the course of the sample. The system is intended to produce similar hash values for perceptually similar audio files (for example, an uncompressed version vs. a compressed version of the same recording), but it would perform poorly on noisy, inaccurate recordings and would not work at all for query by singing or humming.
- V. Athitsos, P. Papapetrou, M. Potamias, G. Kollios, and D. Gunopulos, “Approximate Embedding-Based Subsequence Matching of Time Series”: The authors develop a generalized subsequence-matching system that can be used for music recognition. Dynamic time warping (DTW) is a computationally expensive way to measure the similarity of two time series whose timing is distorted relative to each other, as in phoneme recognition for speech-to-text systems; a minimal DTW sketch follows this list. The authors created a vector-embedding-based algorithm to improve the computational efficiency of this kind of subsequence matching.
- Z. Cao, S. Guan, and Z. Wang, “A Real-time Algorithm for Music Recognition Based on Wavelet Transform”: The algorithm described in this paper focuses on recognizing pitches in an input audio signal in order to produce a MIDI file or similar transcription. Though the authors concentrate on transcription, an input processed this way could then be matched against a database of MIDI files, or of some other format containing manually transcribed melodies of popular songs.
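As a concrete illustration of the dynamic time warping mentioned above, here is a minimal sketch of the classic DTW distance between two one-dimensional sequences. The function name and the use of plain Python lists are our own illustrative choices, not code from the paper; the paper's contribution is precisely an embedding-based approximation that avoids running this quadratic computation against every candidate subsequence.

```python
# Minimal sketch of classic dynamic time warping (DTW) between two
# one-dimensional sequences. This is the expensive O(len(a) * len(b))
# computation that embedding-based methods try to approximate.
def dtw_distance(a, b):
    inf = float("inf")
    # cost[i][j] = DTW distance between the prefixes a[:i] and b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[len(a)][len(b)]

# A time-stretched copy of a melody still scores as identical:
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 2, 3, 4]))  # 0.0
```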
How Shazam does it
The most useful documentation we found regarding the algorithms and architecture of commercial audio matching services came from the paper “An Industrial-Strength Audio Search Algorithm”. The paper is by Avery Li-Chun Wang, one of the people who developed Shazam’s fast, scalable algorithm for music recognition. We didn’t base our fingerprinting algorithm on this, mainly because we discovered this paper fairly late. However, it gives a good start-to-finish overview of one fast and relatively computationally cheap fingerprinting algorithm.
Shazam’s algorithm, like most algorithms we looked at, starts with a spectrographic view of the data: it turns a time series of amplitudes (the raw audio data) into a time series of intensities of frequency components. It then finds peaks in the spectrogram, the time-frequency points with the highest amplitudes. Certain peaks are selected as “anchor points”, and each anchor is paired with a fixed number of peaks of similar frequency that appear shortly after it. These pairs, which take very little storage space, are then indexed in a database along with metadata identifying the associated music track and the time offset within it.
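As a rough sketch of that peak-pairing step (not Shazam's actual code: the FFT size, peak threshold, target-zone bounds, and fan-out below are arbitrary values we picked for illustration), the following Python uses SciPy to pick strong local maxima from a spectrogram and pair each anchor peak with a few peaks that follow it:

```python
import numpy as np
from scipy import signal

def fingerprint(samples, rate, fan_out=5):
    """Toy peak-pair fingerprinter in the spirit of Wang's paper.

    Returns (hash, anchor_time) pairs, where the hash combines the two
    peak frequencies and the time delta between them. All constants
    (FFT size, threshold, target zone, fan-out) are illustrative guesses.
    """
    freqs, times, spec = signal.spectrogram(samples, fs=rate, nperseg=1024)
    mag = np.abs(spec)

    # Keep only the strongest bins: local maxima above a global threshold.
    threshold = np.percentile(mag, 99)
    peaks = [(t, f) for f in range(mag.shape[0]) for t in range(mag.shape[1])
             if mag[f, t] >= threshold
             and mag[f, t] == mag[max(f - 1, 0):f + 2, max(t - 1, 0):t + 2].max()]
    peaks.sort()  # order by time

    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        # Pair the anchor with up to `fan_out` peaks that occur soon after it.
        for (t2, f2) in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= 200:        # crude "target zone" in spectrogram frames
                hashes.append(((f1, f2, dt), t1))
    return hashes
```

In Wang's paper the two peak frequencies and the time delta between them are packed into a compact integer hash; the tuple here stands in for that.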
To perform a search, the input audio is fingerprinted in the same way as the tracks in the database, yielding another set of frequency-time pairs along with their time offsets from the beginning of the input audio stream. For every track in the database, the time offsets of frequency-time pairs matching pairs in the input are compared. If the input matches a track, the matching pairs will have a consistent difference between their offsets in the input and their offsets in the database track. That is, if a certain pair occurs at 0:05 in the input and at 1:30 in a database track, a pair from 0:06 in the input should appear at 1:31 in that track. Database tracks with a statistically significant number of matching pairs sharing the same offset difference are then flagged as matches for the input audio stream.
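To make the offset test concrete, here is a sketch of the voting step, assuming the index is just an in-memory dict from each hash to (track_id, track_time) entries. The names build_index, best_match, and min_votes are ours, and a production system would use a real database index rather than a Python dict.

```python
from collections import Counter, defaultdict

def build_index(tracks):
    """tracks: {track_id: [(hash, time_offset), ...]} from fingerprint()."""
    index = defaultdict(list)
    for track_id, pairs in tracks.items():
        for h, t in pairs:
            index[h].append((track_id, t))
    return index

def best_match(index, query_pairs, min_votes=5):
    """Vote for (track, offset-difference) bins; a true match piles up
    many votes at one consistent offset, as in the 0:05 -> 1:30 example."""
    votes = Counter()
    for h, query_time in query_pairs:
        for track_id, track_time in index.get(h, []):
            votes[(track_id, track_time - query_time)] += 1
    if not votes:
        return None
    (track_id, offset), count = votes.most_common(1)[0]
    return (track_id, offset) if count >= min_votes else None
```

Each matching hash votes for a (track, offset difference) bin; a genuine match accumulates many votes in one bin, while chance hash collisions scatter across bins and fall below the threshold.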