The Subversion repository for our code can be checked out anonymously using the Subversion URL http://svn.assembla.com/svn/olin_ca08/.

Our audio matching strategy is a fairly straightforward one. Like the systems we reviewed in the literature, our system compares the “fingerprint” of an input audio file to the pre-computed fingerprints of a set of possible matches. The most similar fingerprint in the database (determined with a MATLAB correlation coefficient function) is then returned as the best match if the similarity is above a certain threshold.

The first step in our algorithm is to create a fingerprint of the music. There is a lot of data in a raw audio sample, and computationally if we could simplify the amount of data we’re working with, the search step will be simpler. Furthermore, the goal of this project is not only to determine that one audio file is identical to another audio file, but to attempt to determine if one audio file is related to another. If we can determine the qualities that make up the music, and record them, regardless of the actual recording, version, or compression quality, we should be able to determine if they are related, or the same song.

The fingerprinting method that we use looks at the peak intensities per timestep in a spectrographic view of the audio data. From this, we calculate and store the intervals between the peaks. This step is important, as the actual frequency of a note in a song may be very different depending on key, but the intervals in chords will be the same, regardless of key. The method chosen sets every lowest peak found as 0, and then marks the difference between the 2nd and the 1st as the next peak, the 3rd and 1st difference as the 2nd peak, etc. For example, if there are peaks at {0.1, 0.2, 0.5} radians, our algorithm would store {0.1, 0.4} as the fingerprint. This would yield the same fingerprint as {0.2, 0.3, 0.6} would. Each song fingerprint becomes a list of these sets of numbers at every timestep.

The method relies on the same algorithm of peak finding used in the beats-per-minute section. The code is configurable to choose any number of peak per time step, but in most of our testing we used three. The specific PSD function (the function used to generate the discrete fourier transform of our data) arguments used make a great difference on the results; one could not compare fingerprints made between two different PSD functions. This was because changing the PSD function dramatically changes where and what peaks are highest, and thus discovered by our algorithm. In the end, we chose a balance between more data to work with, and the benefits of the less noisy sample.

With a fingerprint, our algorithm can then compare one song to another. We calculate the coefficient of correlation between our sample fingerprint and our database of fingerprints using the built in MATLAB function corrcoef. The highest correlated song becomes our algorithms pick.

We used a database of approximately 5 pop songs and 10 piano jazz songs to do our initial comparisons. The pop songs included such hits as Katy Perry’s “Hot and Cold” from live YouTube recordings and purchased MP3s. The jazz music was a collection of Scott Joplin songs performed by the pianist Butch Thompson and others. We had at most three versions of any one song. The jazz music was purposefully chosen to be similar in key and quality, in an attempt to test the limits of our system.

We found that the system generally worked. In our small sample size, it correctly determined that a test file of the same song but not of the same recording for all but one of our songs. Further examination determined that the one song that did not work well, the maple leave rag, had the five second sample that we used of it completely non overlapping with the five second sample in our database.

With a slightly working system, the bulk of the time was then spent optimizing. We went back to our fingerprinting method and explored a variety of different PSD and numbers of peaks. We ended up using the standard PSD and 3 peaks. We also explored which range of frequencies to look for peaks over. This variable became the most significant effect of our performance. So significant was this variable to optimize, that we wrote an optimization script which determined that the ideal range for our sample files was between 0.15 and 0.25 radians, which corresponds to 6600 Hz to 11000 Hz. It is interesting that these ranges do not lie in the traditional melodic tones of human hearing. We found that these ranges in our PSD were useless to our fingerprinting algorithm due to heavy noise. It was in the harmonic frequencies that we found the main tones stood out and were easily identified.

We also spent much time looking into various correlations. It was most interesting to note how closely the histograms of our data matched up — a histogram sample of the normalized frequencies most often fingerprinted would match another of the same song very well, in fact, much closer then the fingerprints themselves would. We did not have time to explore this further, but were wary that going down this route would take away the valuable time/relationships. We also explored finding simple coefficients of regressions of one dataset plotted against another, and other built in MATLAB statistical correlation functions.

There is much more that needs to be explored in this algorithm. In the fingerprinting, perhaps instead of just looking at the height of peaks, we could try to characterize the width and shape — such that we could tell apart a narrow frequency from a broad range. Also, it occurred during this writeup, that our interval normalization function isn’t correct over a broad range, which may have been why the range became such an important variable in our method. Whereas we took {1, 3, 5} as the same as {2, 4, 6} by our difference formula (both become {2, 4}), we should have been looking at ratio of frequencies, as the difference between 1 and 3 in frequency is very different from the difference of 1000 and 1003. so {1, 3, 5} would become {3, 5} which would be the same as {2, 6, 10}. Had we used ratios, perhaps we would have much cleaner results.

This project did a good job exploring the space and coming up with a working implementation in a small sample set. By combining beat detection into this fingerprinting algorithm, fixing and improving on the algorithm, one can continue to build and explore a working music recognition system.