An Asymmetric Matching Method for a Robust Binary Audio Fingerprinting


844 IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 7, JULY 2014


Jin S. Seo, Member, IEEE

Abstract—Audio fingerprinting has been successfully applied to problems associated with the protection of intellectual property, the management of large databases, and the indexing of content. For a reliable fingerprinting system, improving fingerprint matching accuracy is crucial. In this letter, we try to improve binary audio fingerprint matching performance by utilizing auxiliary information which is discarded while extracting fingerprint bits. Since it is not feasible to store all the auxiliary information in the fingerprint DB, we only utilize the auxiliary information obtained while extracting fingerprints from the input unknown audio. We call the proposed method asymmetric fingerprint matching, since the auxiliary information is available only on one side. Experimental results show that the proposed asymmetric matching method is effective in improving fingerprint matching performance over the conventional Hamming distance.

Index Terms—Asymmetric matching, content identification, fingerprinting, robust hashing.

I. INTRODUCTION

WITH the huge volume of digital music available for various services, there is a strong need to efficiently index music information automatically. The aim of fingerprinting (also known as content identification) is to provide a fast and reliable means for the protection, management, and indexing of multimedia content [1][2]. Fingerprints are discriminative and robust summaries of multimedia content. Similar to a human fingerprint, which has been used for identifying an individual, an audio fingerprint is used for recognizing an audio clip. Just as biometric recognition systems should be robust against a certain amount of biometric deformation, an audio fingerprinting system must make allowances for modifications of an audio signal that preserve its perceptual quality, while still distinguishing one audio clip from another. The practical importance of fingerprinting is growing, and it has been used in a number of applications including filtering for file-sharing services, automated monitoring for broadcasting stations, music recognition through mobile networks, commercial verification, duplicate file detection, and automated indexing of large-scale media archives. These applications have boosted the interest in multimedia fingerprinting, which has also led to a number of audio fingerprinting methods [3].

A fingerprinting system is made up of two phases: fingerprint database (DB) generation and fingerprint identification. In the first phase, the fingerprint DB of a large number of known music clips is generated and associated with the available metadata, such as copyright holder, singer, composer, and so on. As shown in Fig. 1, the fingerprint identification phase works over the constructed fingerprint DB and generally consists of three steps: fingerprint extraction, DB search, and fingerprint matching [1][2]. Query fingerprints are extracted from an unknown audio clip that is to be identified. Then the candidates for the query fingerprints are obtained by a nearest-neighbor DB search. On these candidates, fingerprint matching is performed for verification. It is important to design all three steps carefully and seamlessly for successful fingerprint identification.

Fig. 1. Overview of the audio fingerprint identification phase with the proposed asymmetric matching using auxiliary information.

Among the three steps of fingerprint identification, this letter focuses on fingerprint matching and proposes an asymmetric matching method using auxiliary information obtained while extracting fingerprints from the input unknown audio. Auxiliary information in this letter refers to the remaining discriminant information of an audio signal which is not included in the extracted fingerprint bits. Since it is infeasible to store all the auxiliary information of a large number of music clips in the fingerprint DB due to storage restrictions, we propose a fingerprint matching method which utilizes the auxiliary information available from the input unknown audio in the fingerprint identification phase. We call the proposed method asymmetric matching, since the auxiliary information is available only on one side. The baseline audio fingerprinting method considered in this letter is the Philips robust hash (PRH), which has been widely accepted as one of the standard approaches in audio fingerprinting [4][5]. The PRH takes the sign (phase) of the subband-energy difference as a fingerprint bit and discards the magnitude of the subband-energy difference. The proposed asymmetric matching method incorporates the discarded magnitude information in fingerprint matching, with the assumption that a fingerprint bit with a larger magnitude is more likely to be robust. We formulate the fingerprint matching as a likelihood ratio test and derive a decision function for fingerprint identification in the form of a weighted Hamming distance. The proposed asymmetric matching method showed better identification performance than the conventional Hamming distance.

This letter is organized as follows. Section II describes the proposed asymmetric fingerprint matching method for the PRH. Section III evaluates the identification performance of the proposed fingerprint matching method. Finally, Section IV concludes the letter.

Manuscript received January 17, 2014; accepted March 01, 2014. Date of publication March 06, 2014; date of current version April 28, 2014. This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2012012876). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alex Dimakis.

The author is with the Dept. of Electrical Engineering, Gangneung-Wonju National University, Gangneung city, Gangwon-Do 210-702, Korea (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LSP.2014.2310237

1070-9908 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. PROPOSED FINGERPRINT MATCHING METHOD

Constructing a relevant fingerprinting function is not a trivial task, since human auditory perception is intricate to model with a condensed representation. It is almost impossible, or at least not viable for now, to include all the discriminant auditory information in a series of fingerprint bits. Even if we carefully design the fingerprint extraction function, there will be some remaining discriminant information which is not included in the fingerprint bits and is thus discarded. This letter exploits the discarded information (we call it auxiliary information) in fingerprint matching.

A. Philips Robust Hash for Audio Signal

The baseline audio fingerprinting method considered in this letter is the PRH [1], which has been studied in depth and is regarded as one of the standard approaches in audio fingerprinting. Let $E(n,m)$ be the subband energy of an audio signal at the $m$-th band of the $n$-th frame. The subbands lie in the range of 300 Hz to 2000 Hz and consist of 33 logarithmically spaced bands. Then $E(n,m)$ is filtered by a simple 2D difference filter (along both the frequency and the temporal axes) as follows:

$$d(n,m) = E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) \tag{1}$$

Then the fingerprint bit $F(n,m)$ of PRH is obtained by taking the sign of $d(n,m)$, given by

$$F(n,m) = \begin{cases} 1, & d(n,m) > 0 \\ 0, & d(n,m) \le 0 \end{cases} \tag{2}$$

From the 33 subbands of each frame, we obtain 32 bits ($M = 32$). As the 32-bit fingerprint from a frame does not contain enough information to identify the whole audio, a fingerprint block, which is composed of $N$ consecutive frames (typically $N = 256$, consequently $N \times M = 8192$ fingerprint bits), is used for fingerprint matching. Hereafter, we refer to the fingerprint block as the $N$ by $M$ binary fingerprint matrix. Two fingerprint blocks are declared similar if their Hamming distance, usually expressed as the bit error rate (BER), is below a certain threshold. In [1] it is shown that a BER of less than 35% leads to a very reliable identification. More details of the PRH and the parameter settings can be found in [1][4].
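As an illustration of the extraction step described above, the following minimal Python sketch computes the sign bits together with the magnitudes that the PRH normally discards, and the BER between two blocks. It assumes the subband energies have already been computed and are passed in as a list of frames; the function names `prh_bits` and `bit_error_rate` are hypothetical and not taken from [1].

```python
def prh_bits(energy):
    """Sign bits and magnitudes of the 2D subband-energy difference.

    energy: list of frames, each a list of subband energies
            (33 values per frame in the PRH setting).
    Returns (bits, mags): one row per frame starting from the second
    frame, each row holding len(frame) - 1 entries (32 for 33 bands).
    """
    bits, mags = [], []
    for n in range(1, len(energy)):
        row_bits, row_mags = [], []
        for m in range(len(energy[n]) - 1):
            # Difference along frequency, then along time.
            d = (energy[n][m] - energy[n][m + 1]) - (
                energy[n - 1][m] - energy[n - 1][m + 1]
            )
            row_bits.append(1 if d > 0 else 0)
            row_mags.append(abs(d))  # normally discarded by the PRH
        bits.append(row_bits)
        mags.append(row_mags)
    return bits, mags


def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equally sized blocks."""
    total = sum(len(row) for row in block_a)
    errors = sum(
        a != b
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )
    return errors / total
```

With 257 frames of 33 energies each, `prh_bits` would return a 256 by 32 block, matching the block size used in this letter.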

B. Asymmetric Fingerprint Matching for PRH Using the Magnitude Information

In this section, we formulate the fingerprint matching for PRH as a likelihood ratio test and derive a decision function for fingerprint identification in the form of a weighted Hamming distance. Note that the idea behind the proposed matching method is general and thus could be applied to other fingerprints besides PRH.

We first normalize the magnitude information of each fingerprint bit in a fingerprint block. The normalized magnitudes are then sorted. Hypothesis testing for fingerprint matching is formulated by assuming that a fingerprint bit with a larger normalized magnitude is more likely to be robust.

1) Normalization of the Subband-Energy Difference: The fingerprint bit extraction in (2) is clearly an information-losing process, since only the sign (phase) is kept as a fingerprint bit while the magnitude $|d(n,m)|$ is discarded. We reuse the discarded magnitude, which we call the auxiliary information, for fingerprint matching. To statistically model the magnitude information $|d(n,m)|$, we make an assumption on the distribution of the subband energy $E(n,m)$: the four subband energies involved in calculating $d(n,m)$ are independent and identically distributed according to an exponential distribution with mean $\mu$. The difference between two exponentially distributed independent random variables follows the Laplace distribution with mean zero and variance $2\mu^2$. Thus $E(n,m) - E(n,m+1)$ and $E(n-1,m) - E(n-1,m+1)$ in (1) follow the Laplace distribution. Then $d(n,m)$ can be regarded as a realization of the difference between two random variables with the Laplace distribution, whose distribution is studied in [6]. The cumulative distribution function (CDF) $C(\cdot)$ of $d(n,m)$ can be derived as

$$C\bigl(d(n,m)\bigr) = \frac{1}{2} + \frac{\operatorname{sgn}\bigl(d(n,m)\bigr)}{2}\left[1 - \left(1 + \frac{\Lambda(n,m)}{2}\right)e^{-\Lambda(n,m)}\right] \tag{3}$$

where $\Lambda(n,m) = |d(n,m)|/\mu$, which we call the normalized magnitude associated with the fingerprint bit $F(n,m)$. From the CDF of $d(n,m)$, the confidence level of the observed $F(n,m)$ is directly related to $\Lambda(n,m)$. The greater the value of $\Lambda(n,m)$, the more difficult it is to flip the fingerprint bit $F(n,m)$, either intentionally or unintentionally, by applying audio processing. Thus the expected resilience of the fingerprint bit is claimed to be proportional to the value of the normalized magnitude (i.e., the auxiliary information in this letter), which is utilized in deriving a novel fingerprint matching method.

2) Fingerprint Matching Based on the Likelihood Ratio Test:

Fingerprint matching compares the fingerprint extracted from an input music clip with the stored fingerprints in the fingerprint database, as shown in Fig. 1. If we find a fingerprint in the database which is sufficiently close to the input fingerprint, we claim that the input music clip is the one in the database. Thus how to measure the proximity between two fingerprints is crucial to fingerprinting performance. In the PRH, the Hamming distance, which measures the number of differing symbols, is used for fingerprint matching as follows:

$$D_H(\mathbf{F}_1, \mathbf{F}_2) = \sum_{n=1}^{N}\sum_{m=1}^{M} \bigl|F_1(n,m) - F_2(n,m)\bigr| \tag{4}$$

where $\mathbf{F}_1$ and $\mathbf{F}_2$ are the fingerprint blocks (i.e., 256 by 32 binary matrices, thus 8192 bits each) extracted from different audio clips.

Using the Hamming distance for the PRH disregards the magnitude information. Although it is infeasible to store all the magnitude information of the songs in the database, we can easily keep and use the magnitude information of the query music clip in Fig. 1. Let us assume that the fingerprint block $\mathbf{F}_Q$ is extracted from the query music clip, and the fingerprint block $\mathbf{F}_D$ is stored in the fingerprint DB. Thus only the normalized magnitude $\boldsymbol{\Lambda}$ (defined in Section II-B1) associated with $\mathbf{F}_Q$ is available for fingerprint matching. This letter utilizes the readily available magnitude information of the query music clip in fingerprint matching, which we call asymmetric fingerprint matching. To make the notation easier to follow, we concatenate the $N$ by $M$ matrices $\mathbf{F}_Q$, $\mathbf{F}_D$, and $\boldsymbol{\Lambda}$ into $NM$-dimensional vectors $\mathbf{f}_Q$, $\mathbf{f}_D$, and $\boldsymbol{\lambda}$, respectively. We first sort the normalized magnitude $\boldsymbol{\lambda}$ in increasing order into $\tilde{\boldsymbol{\lambda}}$ and relocate the binary fingerprint vectors $\mathbf{f}_Q$ and $\mathbf{f}_D$ into $\tilde{\mathbf{f}}_Q$ and $\tilde{\mathbf{f}}_D$ with the same ordering as $\tilde{\boldsymbol{\lambda}}$. Then we evaluate the Hamming distance vector between the relocated fingerprint vectors $\tilde{\mathbf{f}}_Q$ and $\tilde{\mathbf{f}}_D$, denoted by $h(k)$ for $k = 1, 2, \ldots, NM$. With the assumption that a fingerprint bit with a larger magnitude is more likely to be robust, we formulate the fingerprint matching as the following hypothesis test:

• $H_0$: Two fingerprint blocks $\mathbf{F}_Q$ and $\mathbf{F}_D$ are from different music excerpts if the Hamming distance $h(k)$ between them is distributed as the Bernoulli distribution with $\Pr\{h(k) = 0\} = 0.5$.

• $H_1$: Two fingerprint blocks $\mathbf{F}_Q$ and $\mathbf{F}_D$ are from the same music excerpt if the Hamming distance $h(k)$ between them is distributed as the Bernoulli distribution with $\Pr\{h(k) = 0\} = g(k)$,

where $g(k)$ is an increasing function with respect to $k$.
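The sorting and relocation step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the blocks are given as lists of rows, the magnitudes come from the query side only, and a single positive normalizing constant is used per block, so the ordering is the same whether or not the magnitudes have been normalized. The function name `sorted_hamming_vector` is hypothetical.

```python
def sorted_hamming_vector(query_bits, db_bits, query_mags):
    """Bitwise differences between query and DB blocks, reordered so
    that positions with the smallest query magnitude come first."""
    # Concatenate the matrices row by row into flat vectors.
    fq = [b for row in query_bits for b in row]
    fd = [b for row in db_bits for b in row]
    lam = [x for row in query_mags for x in row]
    # Ordering of the magnitudes, increasing.
    order = sorted(range(len(lam)), key=lambda k: lam[k])
    # Relocate both fingerprint vectors with that ordering and compare.
    return [fq[k] ^ fd[k] for k in order]
```

Summing the returned vector gives the ordinary Hamming distance, so the reordering only matters once position-dependent weights are applied.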

In this letter, we consider the following three different types of increasing functions as $g(k)$:

(5)

(6)

(7)

where all three functions span from 0.5 to 0.9 for $k = 1, 2, \ldots, NM$. Using the computed Hamming distance vector $h(k)$, the log-likelihoods $L_0$ and $L_1$ of the hypotheses $H_0$ and $H_1$ are respectively given by

$$L_0 = \sum_{k=1}^{NM} \log 0.5 = NM \log 0.5 \tag{8}$$

$$L_1 = \sum_{k=1}^{NM} \Bigl[\bigl(1 - h(k)\bigr)\log g(k) + h(k)\log\bigl(1 - g(k)\bigr)\Bigr] \tag{9}$$

We note that the computation of the log-likelihoods is quite efficient; $L_0$ is constant, and $L_1$ can be calculated with only additions since the observed Hamming distance $h(k)$ is a binary value. We take one hypothesis over the other by comparing $L_0$ and $L_1$.

The final decision on whether the two fingerprints are from the same audio or not can be made using the log-likelihood ratio ($L_1 - L_0$). Since $L_0$ is constant, we compare the value of $L_1$ with a predefined threshold $T$. If $L_1$ is greater than $T$, the null hypothesis $H_0$ is rejected, and we decide that the fingerprint blocks $\mathbf{F}_Q$ and $\mathbf{F}_D$ are from the same audio. Otherwise, the null hypothesis $H_0$ is accepted, and we decide that the fingerprint blocks $\mathbf{F}_Q$ and $\mathbf{F}_D$ are from different audio. For the selection of $T$, there is a tradeoff between the false alarm rate and the false rejection rate. The false alarm rate $P_{FA}$ is the probability of declaring different audio clips as the same. The false rejection rate $P_{FR}$ is the probability of declaring an audio clip and its processed versions as dissimilar. In practice, $P_{FR}$ is difficult to analyze since there are many audio processing steps whose exact characteristics are not known. Thus, it is common to select the threshold $T$ that minimizes $P_{FR}$ subject to a fixed $P_{FA}$.

In order to analyze the choice of the threshold $T$, the Hamming distance $h(k)$ is assumed to be a binary random variable distributed as the Bernoulli distribution with mean 0.5 and variance 0.25 when two fingerprint blocks $\mathbf{F}_Q$ and $\mathbf{F}_D$ are from different audio clips. Then $L_1$ in (9) is a weighted summation of binary random variables with the Bernoulli distribution. By the weighted central limit theorem [7], $L_1$ has a normal distribution if $NM$ is sufficiently large and the contributions in the weighted summation are sufficiently independent. Under this assumption, when two fingerprint blocks are from different audio clips, the mean $m_L$ and the variance $\sigma_L^2$ of $L_1$ are given respectively as

$$m_L = \frac{1}{2}\sum_{k=1}^{NM} \Bigl[\log g(k) + \log\bigl(1 - g(k)\bigr)\Bigr] \tag{10}$$

$$\sigma_L^2 = \frac{1}{4}\sum_{k=1}^{NM} \left[\log\frac{1 - g(k)}{g(k)}\right]^2 \tag{11}$$
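The log-likelihood computation and its moments under the null hypothesis can be sketched as follows. This is an illustrative sketch, not the letter's implementation: it assumes the reading in which, under $H_1$, the $k$-th sorted bit pair matches with probability $g(k)$, and under $H_0$ with probability 0.5. The straight-line `linear_g` is a hypothetical stand-in for the increasing functions of (5)-(7), whose exact forms are not reproduced here; all function names are assumptions.

```python
import math


def linear_g(num_bits, low=0.5, high=0.9):
    """Hypothetical increasing g(k): a straight line from low to high."""
    if num_bits == 1:
        return [high]
    step = (high - low) / (num_bits - 1)
    return [low + step * k for k in range(num_bits)]


def log_likelihoods(h, g):
    """Return (L0, L1) for a sorted Hamming distance vector h."""
    l0 = len(h) * math.log(0.5)  # constant for a given block size
    # A match (h = 0) contributes log g(k); a mismatch log(1 - g(k)).
    l1 = sum(
        math.log(1.0 - gk) if hk else math.log(gk) for hk, gk in zip(h, g)
    )
    return l0, l1


def null_moments(g):
    """Mean and variance of L1 when every h(k) is Bernoulli(0.5)
    and the h(k) are treated as independent."""
    mean = 0.5 * sum(math.log(gk) + math.log(1.0 - gk) for gk in g)
    var = 0.25 * sum(math.log((1.0 - gk) / gk) ** 2 for gk in g)
    return mean, var
```

In a real matcher the two logarithm tables would be precomputed once per block size, so that evaluating $L_1$ for each candidate reduces to table lookups and additions, as noted in the text.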

For the three considered types of the function $g(k)$, the statistical characteristics of $L_1$ were measured from randomly selected pairs of audio blocks and are shown in Table I. The measured mean was close to the value obtained from (10). The measured standard deviation was approximately 2.5 times higher than the one obtained from (11). A similar observation is also noted in [1] for the Hamming distance of PRH, since the fingerprint bits are not uncorrelated, especially along the time axis; this effect could be accounted for by adding the covariance of the fingerprint bits in (11). The values of the skewness and the kurtosis were close to zero and three respectively, which empirically confirms the normal approximation. Through the normal approximation by the weighted central limit theorem, the false alarm rate $P_{FA}$ for the threshold $T$ is given as follows:

$$P_{FA} = \Pr\{L_1 > T \mid H_0\} = \frac{1}{2}\operatorname{erfc}\!\left(\frac{T - m_L}{\sqrt{2}\,\sigma_L}\right) \tag{12}$$

where $m_L$ and $\sigma_L$ denote the mean and the standard deviation of $L_1$ under $H_0$.

TABLE I. MEASURED STATISTICAL CHARACTERISTICS OF $L_1$ FOR DIFFERENT TYPES OF INCREASING FUNCTIONS. THE IDEAL VALUES OF EACH STATISTICAL MOMENT FROM (10) AND (11) ARE GIVEN IN PARENTHESES.

For a certain value of $P_{FA}$, the threshold $T$ for $L_1$ can be determined.
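Under the normal approximation discussed above, choosing the threshold for a target false alarm rate amounts to inverting a Gaussian tail probability. The sketch below uses Python's standard `statistics.NormalDist`; the function names and the example numbers in the usage note are hypothetical. Since the text reports a measured standard deviation about 2.5 times larger than the ideal one, passing the measured value as `std_l1` would presumably be the safer choice.

```python
from statistics import NormalDist


def threshold_for_false_alarm(p_fa, mean_l1, std_l1):
    """Threshold T such that P(L1 > T) = p_fa when L1 is normal
    with the given mean and standard deviation under H0."""
    return mean_l1 + std_l1 * NormalDist().inv_cdf(1.0 - p_fa)


def false_alarm_rate(threshold, mean_l1, std_l1):
    """Tail probability P(L1 > threshold) under the same normal model."""
    return 1.0 - NormalDist(mean_l1, std_l1).cdf(threshold)
```

For example, `threshold_for_false_alarm(1e-3, -10.0, 2.0)` places the threshold roughly 3.09 standard deviations above the null mean.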

III. EXPERIMENTAL RESULTS

The performance of the proposed asymmetric fingerprint matching method was evaluated using the fingerprint DB generated from a thousand songs belonging to various genres, such as classical, jazz, pop, rock, and hip-hop. We extract a 32-bit fingerprint for every frame of the songs from the energy differences of 33 bands which lie in the range from 300 Hz to 2000 Hz, as in [1]. The fingerprint matching was performed using a fingerprint block composed of 256 consecutive 32-bit fingerprint vectors (in total, 8192 bits). As mentioned in Section II-B2, the proposed asymmetric fingerprint matching for PRH is formulated as a likelihood-ratio test in this letter, in which there are two types of errors: the false alarm rate and the false rejection rate. For a fair comparison with the Hamming distance, which was originally employed for PRH, the detection error tradeoff (DET) curve, which plots the false rejection rate versus the false alarm rate, was used. The DET curve is obtained by measuring both error rates while varying the threshold used in the fingerprint matching.

The false alarm rate was calculated by randomly sampling pairs of audio blocks from the original fingerprint DB. To calculate the false rejection rate, we tested the proposed method against various distortions (using Cool Edit Pro 2.1 software) including 10-dB white noise addition (WN), 3:1 expander below 10 dB (EX), telephone bandpass filter (BF), 48-kbps mp3 compression (MP), super loud (SL), linear speed change by 1% (SC), a filter emulating old-time radio (OR), pitch increase by 1% (PI), 30-band classic equalization (CE), and 30-band pop equalization (PE). Each original audio clip was subjected sequentially to a set of the selected distortions. We considered four different sets of distortions, as shown in Fig. 2. By comparing the distance between the fingerprints from the original and the corresponding processed audio clips with the threshold, the false rejection rate was obtained.

Fig. 2. DET curves for four sets of distortions, panels (a) to (d).

The resulting DET curves of PRH are shown in Fig. 2 for the Hamming distance and the proposed asymmetric matching. The proposed distance performed better than the conventional Hamming distance in all considered distortions. In particular, the proposed method was more effective than the Hamming distance in the low false-alarm-rate region of the DET curve, where most fingerprinting systems operate in practice. The proposed asymmetric matching method could be useful for applications, such as music recognition through a mobile phone, which need robustness against severe distortions.
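The DET curve construction described in this section, measuring both error rates while sweeping the threshold, can be sketched as below. This is a minimal sketch under assumptions: the scores follow a "higher means same audio" convention (as with a log-likelihood), `impostor_scores` come from pairs of different clips, and `genuine_scores` from original/distorted pairs. For the Hamming distance, which follows the opposite convention, the negated distance could be passed in. The function name `det_points` is hypothetical.

```python
def det_points(impostor_scores, genuine_scores):
    """Return a list of (threshold, false_alarm_rate, false_rejection_rate).

    A pair is declared 'same audio' when its score exceeds the threshold.
    """
    thresholds = sorted(set(impostor_scores) | set(genuine_scores))
    points = []
    for t in thresholds:
        # Different clips wrongly accepted as the same.
        p_fa = sum(s > t for s in impostor_scores) / len(impostor_scores)
        # Distorted versions wrongly rejected.
        p_fr = sum(s <= t for s in genuine_scores) / len(genuine_scores)
        points.append((t, p_fa, p_fr))
    return points
```

Plotting the third component against the second, typically on logarithmic or normal-deviate axes, gives the DET curve for one matching method; repeating this for each method allows the kind of comparison shown in Fig. 2.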

IV. CONCLUSION

For a reliable fingerprinting system, improving fingerprint matching accuracy is crucial. This letter proposes an asymmetric fingerprint matching method which utilizes auxiliary information obtained while extracting fingerprints from the input unknown audio. The auxiliary information used in this letter is the magnitude of the subband-energy difference, which is discarded while extracting the PRH. The problem of reliable fingerprint matching is approached by hypothesis testing, with the assumption that a fingerprint bit with a larger normalized magnitude is more likely to be robust. Experiments on a thousand songs against various distortions were performed to compare the performance of the asymmetric matching with the conventional Hamming distance. In all considered distortions, the proposed method improved matching performance. Further work includes the development of more effective hypothesis testing and the extension of the proposed asymmetric matching method to other types of fingerprinting methods.

REFERENCES

[1] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in Proc. Int. Conf. Music Information Retrieval, 2002.

[2] J. Seo, M. Jin, S. Lee, D. Jang, S. Lee, and C. Yoo, “Audio fingerprinting based on normalized spectral subband moments,” IEEE Signal Process. Lett., vol. 13, no. 4, pp. 209–212, 2006.

[3] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of algorithms for audio fingerprinting,” in Proc. IEEE Workshop Multimedia Signal Processing, 2002, pp. 169–173.

[4] F. Balado, N. Hurley, E. McCarthy, and G. Silvestre, “Performance analysis of robust audio hashing,” IEEE Trans. Inf. Forens. Secur., vol. 2, no. 2, pp. 254–266, Jun. 2007.

[5] P. Doets and R. Lagendijk, “Distortion estimation in compressed music using only audio fingerprints,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2, pp. 302–317, Feb. 2008.

[6] S. Nadarajah and S. Kotz, “On the linear combination of Laplace random variables,” Probabil. Eng. Inf. Sci., vol. 19, no. 4, pp. 463–470, 2005.

[7] M. Weber, “A weighted central limit theorem,” Statist. Probabil. Lett., vol. 76, no. 14, pp. 1482–1487, 2006.