Approaches we read about

Throughout the project, we studied the existing literature on matching imperfectly sampled time-series data to similar reference data. The papers we found diverged greatly in the quality of the query data their algorithms were intended for. While some of the algorithms expected the input to be very similar to the time series in the database, as might be used for radio content identification, others tolerated more variation in the input. Commercial applications of the former type of system include Shazam and the music-recognition services offered by wireless carriers. Systems that accept more variable input can be seen in the wild on sites like Midomi, whose database of pre-recorded music is augmented by user-contributed solo vocal recordings.

Here are some notes on a few of the papers we evaluated.

How Shazam does it

The most useful documentation we found regarding the algorithms and architecture of commercial audio matching services came from the paper “An Industrial-Strength Audio Search Algorithm” by Avery Li-Chun Wang, one of the people who developed Shazam’s fast, scalable algorithm for music recognition. We didn’t base our fingerprinting algorithm on this paper, mainly because we discovered it fairly late in the project. However, it gives a good start-to-finish overview of one fast and relatively computationally cheap fingerprinting algorithm.

Shazam’s algorithm, like most algorithms we looked at, starts with a spectrographic view of the data: it turns a time series of amplitudes (the raw audio data) into a time series of intensities of frequency components. It then finds peaks in the spectrogram — the time-frequency pairs with the highest intensities. Certain peaks are selected as “anchor points”, and each anchor is paired with a set number of peaks of similar frequency that appear soon after it. These anchor-peak pairs, which take very little storage space, are then indexed in a database along with metadata about the associated music track and the anchor’s time offset within it.
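
To make that pipeline concrete, here is a minimal Python sketch of the fingerprinting stage. It is not Shazam’s implementation: the window size, the maximum-filter peak-picking heuristic, the fan-out of five pairs per anchor, and the hash format are all illustrative assumptions.

import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def fingerprint(samples, sample_rate, fan_out=5):
    """Return a list of (hash, anchor_time) pairs for one audio clip."""
    # 1. Spectrogram: a time series of amplitudes becomes a time series
    #    of frequency-component intensities.
    freqs, times, spec = signal.spectrogram(samples, fs=sample_rate,
                                            nperseg=1024, noverlap=512)

    # 2. Peak picking: keep time-frequency bins that are local maxima and
    #    louder than the clip-wide mean (a crude "constellation map").
    local_max = maximum_filter(spec, size=20) == spec
    peaks = np.argwhere(local_max & (spec > spec.mean()))
    peaks = peaks[np.argsort(peaks[:, 1])]  # sort by time bin

    # 3. Anchor pairing: link each anchor to the next few peaks after it
    #    and hash (anchor_freq, paired_freq, time_delta).
    hashes = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + fan_out]:
            pair_hash = hash((freqs[f1], freqs[f2], times[t2] - times[t1]))
            hashes.append((pair_hash, times[t1]))
    return hashes

In a real system each hash would be stored in an inverted index mapping it to every (track, anchor time) at which it occurs, which is what makes the lookup in the search stage cheap.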

To perform a search, the input audio is fingerprinted in the same way as the tracks in the database, yielding another set of frequency-time pairs and their time offsets from the beginning of the input audio stream. For every track in the database, the time offsets of frequency-time pairs matching the pairs in the input are compared. If the input matches a track in the database, frequency-time pairs in the input and in the database track will have a consistent offset from the beginning of the song. That is, if a certain time-frequency pair occurs at 0:05 in the input and 1:30 in a database track, a pair from 0:06 in the input should appear at 1:31 in that track. Database tracks in which a statistically significant number of matching pairs share the same relative offset are then flagged as matches for the input audio stream.
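
The matching stage can be sketched the same way. The index layout below (a map from hash to a list of (track_id, track_time) entries) is our assumption about how the fingerprints might be stored, and a fixed vote threshold stands in for the statistical-significance test the paper describes.

from collections import defaultdict

def best_match(query_hashes, index, min_votes=20):
    """query_hashes: list of (hash, query_time) from fingerprint();
    index: dict mapping hash -> list of (track_id, track_time)."""
    # Vote histogram keyed by (track, rounded offset between the track's
    # copy of a pair and the query's copy).
    votes = defaultdict(int)
    for h, q_time in query_hashes:
        for track_id, track_time in index.get(h, ()):
            votes[(track_id, round(track_time - q_time, 1))] += 1

    if not votes:
        return None
    (track_id, offset), count = max(votes.items(), key=lambda kv: kv[1])
    # A genuine match concentrates many votes at one consistent offset;
    # accept only if that peak is tall enough.
    return track_id if count >= min_votes else None

A track that merely shares a few hashes with the query scatters its votes across many offsets, so it never accumulates the single tall peak that a true match produces.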