Shazam identifies songs by converting audio into a spectrogram, extracting peak frequency points, and creating unique hashes from pairs of these peaks based on their frequencies and timing. It then matches these hashes against a vast database, using consistent time offsets in the matches to accurately recognize the song even in noisy environments.
Shazam is a widely used app that identifies a song from just a few seconds of audio, even in a noisy environment, by comparing the recording against a database of 100 million songs. The difficulty is that raw waveforms change substantially with playback volume, so fingerprinting the waveform directly is unreliable. Instead, Shazam works with the underlying frequencies, or notes, which stay the same regardless of how loud or quiet the music is.
To achieve this, Shazam converts the audio into a spectrogram, a visual representation with time on the horizontal axis and frequency on the vertical axis. The brightness at each point indicates how loud that frequency is at that moment. Changing the volume scales the overall brightness but leaves the pattern of frequencies in place, so the representation looks similar whether the music is played softly or loudly.
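The spectrogram step can be illustrated with a small sketch. This is not Shazam's actual code: the function names, frame size, hop length, and sample rate below are assumptions chosen for illustration, and a naive discrete Fourier transform stands in for the fast Fourier transform a real system would use. The example checks that a loud and a quiet copy of the same tone have their loudest frequency in the same bin.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of per-frequency magnitudes for each time frame.

    frame_size and hop are illustrative values, not Shazam's parameters.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window reduces leakage between neighbouring frequency bins.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        for k in range(frame_size // 2):  # keep the non-negative frequencies
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


def loudest_bins(spec):
    """Index of the strongest frequency bin in each frame."""
    return [max(range(len(mags)), key=mags.__getitem__) for mags in spec]


# A 128 Hz tone sampled at 1024 Hz; with 64-sample frames each bin spans 16 Hz.
sample_rate = 1024
loud = [math.sin(2 * math.pi * 128 * t / sample_rate) for t in range(512)]
quiet = [0.1 * s for s in loud]

# Volume changes the magnitudes but not where the energy sits.
print(loudest_bins(spectrogram(loud)) == loudest_bins(spectrogram(quiet)))
```

The magnitudes of the quiet signal are ten times smaller, yet the strongest bin in every frame is unchanged, which is the property the fingerprint relies on.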
Since a full spectrogram contains millions of data points, searching through all of them in real time would be impractical. Shazam addresses this by isolating only the peak points—frequencies that are louder than their immediate surroundings. These peaks form a sparse constellation of dots that capture the structured elements of the music, distinguishing it from random background noise, which lacks consistent frequency patterns.
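Peak picking can be sketched as a local-maximum search over the spectrogram grid. The function name, the neighborhood size, and the magnitude threshold here are hypothetical; the source does not say how Shazam defines "immediate surroundings", so this is one plausible reading rather than the actual method.

```python
def find_peaks(spec, neighborhood=1, min_magnitude=0.0):
    """Return (time, frequency) cells that are strictly louder than every
    neighbour within `neighborhood` steps in both directions.

    spec is a list of frames, each a list of magnitudes per frequency bin.
    """
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    peaks = []
    for t in range(n_frames):
        for f in range(n_bins):
            value = spec[t][f]
            if value <= min_magnitude:
                continue  # too quiet to be worth keeping
            is_peak = True
            for dt in range(-neighborhood, neighborhood + 1):
                for df in range(-neighborhood, neighborhood + 1):
                    if dt == 0 and df == 0:
                        continue
                    tt, ff = t + dt, f + df
                    inside = 0 <= tt < n_frames and 0 <= ff < n_bins
                    if inside and spec[tt][ff] >= value:
                        is_peak = False
            if is_peak:
                peaks.append((t, f))
    return peaks


# Toy spectrogram: rows are time frames, columns are frequency bins.
toy = [
    [0, 1, 0, 0],
    [1, 5, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 1, 0],
]
print(find_peaks(toy))  # [(1, 1), (2, 3)]
```

Sixteen cells reduce to two points, which illustrates how the constellation becomes sparse enough to search quickly.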
However, a single peak is not unique enough to identify a song, as many songs share common frequencies. Shazam increases specificity by pairing peaks that occur close together in time, creating hashes that encode the two frequencies and the time gap between them. Because each hash depends on the gap between peaks rather than on their absolute position in the track, the same hashes are produced no matter which part of the song is recorded, whether the intro or a chorus.
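A minimal sketch of the pairing step follows. The name make_hashes and the fan_out and max_dt limits are assumptions for illustration, and the hash is kept as a plain (frequency, frequency, gap) tuple rather than packed into an integer. Each hash is stored alongside the time of its first peak, which the matching stage needs later.

```python
def make_hashes(peaks, fan_out=3, max_dt=10):
    """Pair each peak with up to `fan_out` later peaks within `max_dt` frames.

    peaks is a list of (time, frequency) tuples.
    Returns a list of ((f1, f2, dt), t1) entries.
    """
    ordered = sorted(peaks)  # sort by time, then frequency
    hashes = []
    for i, (t1, f1) in enumerate(ordered):
        paired = 0
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue  # same frame: no time gap to encode
            if dt > max_dt:
                break  # later peaks are even further away
            hashes.append(((f1, f2, dt), t1))
            paired += 1
            if paired >= fan_out:
                break
    return hashes


peaks = [(0, 10), (2, 20), (5, 15)]
shifted = [(t + 100, f) for t, f in peaks]  # same notes, 100 frames later

keys = [key for key, _ in make_hashes(peaks)]
shifted_keys = [key for key, _ in make_hashes(shifted)]
print(keys)                  # [(10, 20, 2), (10, 15, 5), (20, 15, 3)]
print(keys == shifted_keys)  # True
```

Shifting every peak by 100 frames changes the stored anchor times but not the hash keys, so a snippet taken from anywhere in the track produces keys that are already in the database.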
When a user records a snippet, Shazam matches the generated hashes against its pre-fingerprinted database. It then plots the matching times from the database against the recording times. If the song is correct, the matches align along a diagonal line with a slope of one, indicating a consistent time offset. Incorrect matches scatter randomly, allowing Shazam to confidently identify the correct song quickly and accurately.
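The diagonal-line test is equivalent to checking that many matches share the same difference between database time and recording time. The sketch below counts votes per (song, offset) pair; the function names, the index layout, and the toy data are all hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict


def build_index(songs):
    """Map each hash key to the (song_id, time) places where it occurs."""
    index = defaultdict(list)
    for song_id, hashes in songs.items():
        for key, t_db in hashes:
            index[key].append((song_id, t_db))
    return index


def best_match(index, query_hashes):
    """Return (song_id, offset, votes) for the most consistent time offset,
    or None if no hash in the recording is found in the index."""
    votes = Counter()
    for key, t_query in query_hashes:
        for song_id, t_db in index.get(key, ()):
            # Points on a slope-one line share the same t_db - t_query.
            votes[(song_id, t_db - t_query)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count


# Made-up fingerprints: ((f1, f2, dt), time) entries per song.
songs = {
    "song_a": [((10, 20, 2), 50), ((20, 15, 3), 52),
               ((15, 30, 1), 55), ((10, 20, 2), 90)],
    "song_b": [((10, 20, 2), 7), ((40, 41, 4), 12), ((15, 30, 1), 30)],
}
recording = [((10, 20, 2), 0), ((20, 15, 3), 2), ((15, 30, 1), 5)]

print(best_match(build_index(songs), recording))  # ('song_a', 50, 3)
```

Here song_b shares two hash keys with the recording, but its offsets disagree (7 and 25), so each gets a single vote. Only song_a collects three votes at one offset, which corresponds to the aligned diagonal described above.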