
    Exposing MP3 Audio Forgeries Using Frame Offsets

    RUI YANG, ZHENHUA QU, and JIWU HUANG, Sun Yat-sen University

Audio recordings should be authenticated before they are used as evidence. Although audio watermarking and signature are widely applied for authentication, these two techniques require access to the original audio before it is published. Passive authentication is therefore necessary for digital audio, especially for the most popular audio format: MP3. In this article, we propose a passive approach to detect forgeries of MP3 audio. During MP3 encoding the audio samples are divided into frames, and thus each frame has its own frame offset after encoding. Forgeries break the framing grids, so the frame offset is a good indication for locating forgeries, and it can be retrieved by identifying the quantization characteristic. In this way, the doctored positions can be located automatically. Experimental results demonstrate that the proposed approach is effective in detecting common forgeries such as deletion, insertion, substitution, and splicing. Even when the bit rate is as low as 32 kbps, the detection rate is above 99%.

Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General; K.6.5 [Management of Computing and Information Systems]: Security and Protection

    General Terms: Security, Algorithms, Verification

Additional Key Words and Phrases: MP3 audio forgery, forgery detection, audio authentication

ACM Reference Format:

Yang, R., Qu, Z., and Huang, J. 2012. Exposing MP3 audio forgeries using frame offsets. ACM Trans. Multimedia Comput. Commun. Appl. 8, S2, Article 35 (September 2012), 20 pages. DOI = 10.1145/2344436.2344441 http://doi.acm.org/10.1145/2344436.2344441

    1. INTRODUCTION

With the development of digital voice recorders and cell phones, speech and conversation can nowadays be easily recorded as evidence. However, hearing cannot be believing, since these audio recordings can be tampered with very easily by pervasive audio editing software. An audio recording may contain important words or sentences synthesized from other audio, so authentication technologies need to be developed for digital audio. The existing audio authentication technologies can be divided into two groups: active authentication (including digital watermarking and digital signature) and passive authentication. Active authentication requires access to the original audio before it is distributed, for example to embed a watermark or generate a signature.

A portion of this article was presented at the 10th ACM Multimedia and Security Workshop. The work was supported in part by the 973 Program (2011CB302204) in China and NSFC (U1135001, 61202497). J. Huang is also a visiting researcher at the State Key Laboratory of Information Security, Beijing 100190, China.

Authors' addresses: R. Yang, Z. Qu, and J. Huang (corresponding author), Sun Yat-sen University, Guangzhou 510006, China; email: [email protected].


Passive audio authentication, in contrast, means checking the integrity of an audio recording by analyzing its inherent properties. In most authentication cases, the audio does not actually contain any digital watermark or signature. Thus it is necessary to passively examine the integrity of the digital audio.

Until now, there have been few works on passive authentication for digital audio. Based on the assumption that a natural signal has weak higher-order statistical correlations in the frequency domain and that forgery in speech would introduce unnatural correlations, Farid [1999] used bispectral analysis to detect digital forgery of speech signals. It was shown that the zero-phase component of the bispectrum decreases considerably for forged speech. However, the method is only suitable for uncompressed audio. Grigoras [2005] pointed out that digital equipment captures not only the intended speech but also the 50/60 Hz Electric Network Frequency (ENF) when recording. The ENF criterion can be used to check the integrity of digital audio recordings and to verify the exact time when a recording was created, by comparing the ENF of the recording with a reference frequency database from the electric company or a laboratory. The method depends heavily on the accuracy of the extracted ENF, and the ENF is a rather weak signal compared to the audio recording. Dittmann et al. [Kraetzer et al. 2007] proposed a method to determine the authenticity of the speaker's environment; they argued that extracting background features from an audio stream can provide an informative basis for determining the location of its origin and the microphone used, but a large number of audio recordings are required for training.

The MP3 format is used in most audio applications and is now the most popular format in digital voice recorders. The top 20 best-selling digital voice recorders on amazon.com all support the MP3 format, and some of them support only MP3. For most cell phones, the default recording format is MP3. Digital voice recorders and cell phones are the recording devices people use most often in daily life. It would be fairly easy to remove complete sections of a recording or to splice together two sentences from different recordings, and small changes in the audio stream can change the meaning of a whole sentence. Exposing forgeries in MP3 files can authenticate daily recordings presented as evidence in criminal and civil court cases, such as undercover surveillance recordings made by the police, recordings presented by feuding parties in a divorce, recorded telephone conversations in domestic violence cases, and recordings from corporations seeking to prove employee wrongdoing or industrial espionage. At the same time, forgery detection solutions are needed by manufacturers of audio recording equipment.

There are as yet no reported passive authentication methods focusing on MP3 audio. A related work is the classification of MP3 encoders proposed by Boehm and Westfeld [2004]. That work outlines a method to discriminate 20 different MP3 encoders with 10 features. Experimental results show that these features classify MP3 encoders accurately and can improve the performance of MP3 steganalysis. The application of the method to passive authentication is not discussed in the paper. Theoretically it could handle audio tampered by splicing material from different recorders, but tampering within a single recording is beyond its scope. As MP3 audio becomes popular, it is necessary to develop passive approaches to check the integrity of MP3 audio.

Passive authentication of JPEG images and MPEG video has attracted many researchers. Several approaches have been proposed, such as the quantization-table-based method [Lukas and Fridrich 2003], the periodical-artifacts-based method [Popescu and Farid 2004], the Benford's-law-based method [Fu et al. 2007], and the shifted-double-JPEG-detection-based method [Qu et al. 2008]. A direct question arises: can these methods be applied to passive authentication of MP3? Unfortunately, direct extension of the existing JPEG methods to MP3 audio does not work, because there are many differences between MP3 compression and JPEG compression. For example, an MP3 encoder divides the time-domain samples into frames with 50% overlap, while JPEG compression uses non-overlapping blocks, which makes block-artifact detection impossible for MP3 compression.


    Fig. 1. Block diagram of MP3: (a) encoder; (b) decoder.

In addition, the spectral calculation and quantization in MP3 compression are performed in floating-point representation, so the quantization-table-based method for JPEG, which works well with integer values, is useless for MP3 compression.

In this article we propose a forgery detection method for digital audio in MP3 format. Note that forgeries on MP3 files are always performed in the following way: first decoding, then tampering, and finally re-encoding. Based on the discovery that forgeries break the original frame segmentation, we utilize frame offsets to locate forgeries automatically. The original frame offsets are retrieved via a quantization characteristic. Extensive experiments show that the proposed method can detect the most common forgeries, such as deletion, insertion, substitution, and splicing. At the same time, the proposed method is robust to common postprocessing operations such as filtering and adding noise.

The article is organized as follows. In Section 2, we give a brief analysis of MP3 coding and show that only an identical frame offset reproduces the quantization characteristic of the spectrum. We then develop a method to detect frame offsets in Section 3. Based on this detection method, we propose in Section 4 that changes of frame offsets can locate forgeries effectively. The experimental results are given in Section 5. Finally, we conclude the article with a discussion and future work in Section 6.

    2. ANALYSIS OF MP3 COMPRESSION CHARACTERISTICS

In this section, we first give a brief overview of MP3 coding and then explain two important concepts of this article: the frame offset and the quantization characteristic. In Section 2.1 we only explain those principles that are relevant to our detection method, especially the spectral decomposition and quantization; the detailed architecture and specification of MP3 coding can be found in ISO [1992]. In Section 2.2, the definition of the frame offset is demonstrated via an example. In Section 2.3, the quantization characteristic is analyzed.

    2.1 MP3 Coding

Figure 1(a) shows the block diagram of a typical MP3 encoder [Painter and Spanias 2000]. The input PCM signal is first separated into 32 sub-bands by the analysis filterbank, and the Modified Discrete Cosine Transform (MDCT) window further divides each of these 32 sub-bands into 18 sub-bands (long windows) or 6 sub-bands (short windows). A total of 576 or 192 spectral lines are then generated, respectively.


Fig. 2. Framing grids and frame offsets. The top panel shows three continuous framing grids for the first encoding, and the bottom panel shows the corresponding framing grids for the second encoding. The frame offsets of the three framing grids are identical.

The psychoacoustic model analyzes the audio content and estimates the masking thresholds. The output of this model consists of the just-noticeable noise level for each sub-band and the information about the window type for the MDCT.

According to the masking thresholds estimated by the psychoacoustic model, the spectral values are quantized via a power-law quantizer. The quantization step uses an iterative algorithm to control both the bit rate and the distortion level, so that the perceived distortion is as small as possible under the limitations of the desired bit rate. Finally, the quantized spectral values are encoded using Huffman code tables to form a bitstream.
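To make the power-law quantization concrete, here is a minimal sketch of a 3/4-power quantizer/dequantizer pair in the spirit of the MP3 quantizer; the single fixed step size stands in for the encoder's iterative step-size search against the psychoacoustic thresholds, and the function names are this sketch's own.

```python
import numpy as np

def powerlaw_quantize(spectrum, step):
    # Compress magnitudes with a 3/4-power law before uniform rounding, so
    # small coefficients are driven to zero more readily than large ones.
    return np.sign(spectrum) * np.round((np.abs(spectrum) / step) ** 0.75)

def powerlaw_dequantize(q, step):
    # Decoder-side inverse mapping: undo the 3/4-power compression.
    return np.sign(q) * (np.abs(q) ** (4.0 / 3.0)) * step
```

Coefficients whose compressed magnitude rounds to zero are exactly the ones that later show up as the troughs analyzed in Section 2.3.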

The block diagram of the MP3 decoder is shown in Figure 1(b). First, Huffman decoding is performed on the MP3 bitstream, and the decoder restores the quantized MDCT coefficient values and the related side information, such as the window type assigned to each frame. After inverse quantization, the coefficients are inverse-MDCT transformed to the sub-band domain. Finally, the PCM waveform is reconstructed by the synthesis filterbank.

    2.2 Frame Offset

In this article, the frame offset [Yang et al. 2008] is defined as the number of samples by which the framing grid shifts between the first and the second encoding. Recall that forgeries on MP3 files are always performed by first decoding, then tampering, and finally re-encoding. The frame offset therefore becomes nonzero when forgeries are conducted on MP3 files, and it is always zero when there is no forgery. Figure 2 illustrates how the frame offset arises. When the first encoding is performed, the framing grids of the original signal are as shown in the top of Figure 2; each framing grid contains 1152 samples with 50% overlap.



Fig. 3. Unquantized and quantized spectral coefficients: (a) and (b) are in real-value form, while (c) and (d) are in a logarithmic representation. The major difference between the unquantized and the quantized spectrum is the number of zero coefficients, which appear as troughs.

After decoding, some extra zero samples are added at the beginning of the signal by the decoder. During the second encoding, new framing grids are generated. Obviously, if forgeries occur, the frame offsets of some frames may change.
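As a small illustration (the names and the sample counts below are hypothetical), the snippet lays out the 1152-sample, 50%-overlapped framing grid an encoder would use and shows why removing samples from the decoded audio misaligns every later grid line:

```python
import numpy as np

FRAME = 1152        # samples per MP3 frame
HOP = FRAME // 2    # 50% overlap: the grid advances 576 samples per frame

def framing_grid(num_samples):
    # Start positions of the frames an encoder lays over the signal.
    return np.arange(0, num_samples - FRAME + 1, HOP)

print("first grid lines:", framing_grid(5 * HOP))   # -> [0 576 1152 1728]

# Deleting d samples ahead of some point shifts all later content left by d,
# so the re-encoding grid is misaligned with the old one there unless d is a
# multiple of 576; the residual shift d % 576 is what Section 3 recovers
# (up to the direction convention).
d = 1000
print("grid misalignment introduced:", d % HOP)     # -> 424
```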

    2.3 Quantization Characteristics

Many spectral coefficients are usually quantized to zero during encoding. This is because some spectral components are completely masked by other components, and because the inherent probability distribution of the spectral coefficients concentrates many of them around zero. The increase in zero spectral coefficients is a quantization characteristic of MP3 coding. This characteristic was first described by Herre and Schug [2000] and Herre et al. [2002], who used it to optimize cascaded audio coding. In the following, we analyze this characteristic.

The difference between an unquantized spectral coefficient and its quantized counterpart is not easily visible in real-value form, as illustrated in Figures 3(a) and (b). They can, however, be discriminated by looking at the spectral coefficients in a logarithmic representation. As shown in Figures 3(c) and (d), many zero values appear as troughs in the quantized spectrum, while this phenomenon cannot be found in the unquantized spectrum.

These troughs in the spectral representation are visible only if the framing grids are the same as those of the first encoding; that is, the troughs appear only when the frame offset is identical to that of the first encoding.



Fig. 4. Spectral coefficients with frame offsets of −1, 0, and +1 samples. The quantization characteristics appear only if the correct frame offset (0) is applied.

This fact is illustrated by Figure 4, which shows the MDCT coefficients of a decoded signal with a one-sample left shift (offset = −1), no shift (offset = 0), and a one-sample right shift (offset = +1) relative to the encoder framing grid. As can be seen, the troughs disappear even when the frame offset is only a single sample.

    3. METHOD OF RETRIEVING FRAME OFFSETS

The key to detecting frame offsets is identifying the quantization characteristic. In this section, we develop a method for retrieving frame offsets based on the observations of the previous section.

    3.1 Number of Active Coefficients

From Figure 4, it is noted that a significant difference between the spectral coefficients without an offset (Figure 4(b)) and with an offset (Figures 4(a) and (c)) is the number of active (nonzero) spectral coefficients. For convenience, we denote the number of active coefficients as NAC in this article. In Figure 4, the NACs for offsets −1 and +1 (shifted offsets) are 306 and 300, respectively, while the NAC for offset 0 (the matching offset) is only 197. For a robust and automatic identification of the characteristic spectrum, the NAC as a function of frame offset can be used as a feature. Such a criterion yields reliable results, as shown in Figure 5. We observe that the beginning of each frame is clearly detectable by an obvious decrease in the NAC, and that a period of 576 can be observed. Why 576? Note that 576 = 1152 × 50%, where 1152 is the length of a frame and 50% is the overlap specified by the MP3 standard; a frame shifted by an offset of 576 coincides exactly with the next frame.
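The NAC criterion is easy to prototype. The sketch below uses a plain sine-windowed MDCT applied directly to 1152-sample frames as a stand-in for the MP3 hybrid filterbank (the paper's spectral decomposition also includes the 32-band polyphase filterbank), and the activity threshold eps is a tunable stand-in for the log-domain test given in Section 4:

```python
import numpy as np

N = 576                                # MDCT coefficients per frame; a frame spans 2N = 1152 samples
_n, _k = np.arange(2 * N), np.arange(N)
WINDOW = np.sin(np.pi / (2 * N) * (_n + 0.5))                                   # sine window
BASIS = np.cos(np.pi / N * (_n[None, :] + (N + 1) / 2) * (_k[:, None] + 0.5))   # MDCT basis, cf. Eq. (1)

def spectrum(signal, start):
    """576 MDCT coefficients of the 1152-sample frame starting at `start`."""
    return BASIS @ (WINDOW * signal[start:start + 2 * N])

def nac(signal, start, eps=1e-5):
    """Number of active (non-negligible) coefficients of one frame."""
    return int(np.sum(np.abs(spectrum(signal, start)) > eps))

def nac_curve(decoded, frame_start):
    """NAC of one frame evaluated at 576 shifted window positions; shifting the
    window is, up to a sign convention, equivalent to the zero-padding of the
    whole signal used in Section 4. The matching offset appears as a pronounced
    minimum (cf. Figure 5)."""
    return np.array([nac(decoded, frame_start + j) for j in range(N)])
```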



Fig. 5. NACs for different frame offsets. The NAC reaches its minima when the frame offset is a multiple of 576.

    3.2 Theoretical Analysis

Now let us examine why the quantization characteristic appears only if the matching offset is applied. It arises from an inherent property of the MDCT. The MDCT performed in MP3 coding is as follows [Wang and Vilermo 2003]:

X^{(p)}[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{2N-1} x^{(p)}[n]\, h[n] \cos\left[\frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad 0 \le k \le N-1 \qquad (1)

    By applying an inverse-MDCT transform to the frame, we get 2N time-aliased samples.

\hat{x}^{(p)}[n] = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} X^{(p)}[k] \cos\left[\frac{\pi}{N}\left(n + \frac{N+1}{2}\right)\left(k + \frac{1}{2}\right)\right], \quad 0 \le n \le 2N-1 \qquad (2)

In order to cancel the aliasing and recover the original samples, we have to use the overlap-add (OLA) procedure. An inverse MDCT is applied to the previous and the next frame as well; each of the resulting aliased segments is multiplied by its corresponding window function, and the overlapping time segments are added together. We thus recover the original samples:

x^{(p)}[n] = \begin{cases} \hat{x}^{(p-1)}[n+N]\, h[N-n-1] + \hat{x}^{(p)}[n]\, h[n], & 0 \le n \le N-1 \\ \hat{x}^{(p)}[n]\, h[2N-n-1] + \hat{x}^{(p+1)}[n-N]\, h[n-N], & N \le n \le 2N-1 \end{cases} \qquad (3)

    Denote that

\tilde{x}^{(p)}[n] = x^{(p)}[n]\, h[n], \quad 0 \le n \le 2N-1. \qquad (4)

    If a signal exhibits local symmetry such that

\tilde{x}^{(p)}[n] = \tilde{x}^{(p)}[N-n-1], \quad 0 \le n \le N-1, \qquad \tilde{x}^{(p)}[n] = -\tilde{x}^{(p)}[3N-n-1], \quad N \le n \le 2N-1 \qquad (5)

then its MDCT coefficients are all zero; that is, X^{(p)}[k] = 0 for k = 0, ..., N−1. In Wang et al. [2000], it was proven that, conversely, \tilde{x}^{(p)}[n] fulfills Eq. (5) if X^{(p)}[k] = 0 for all k. This inherent property of the MDCT explains why the NAC decreases significantly only when the identical frame offset is applied. After MP3 encoding, many spectral coefficients are masked or quantized to zero.


Table I. Mean Value and Standard Deviation of NACs at Different Bit Rates

bit rate    shifted NACs (Mean / Std)    matching NAC (Mean / Std)
32 kbps     175.61 / 13.45                67.80 / 12.34
64 kbps     313.46 / 19.99               178.38 / 11.06
96 kbps     331.72 / 18.30               249.15 / 25.07
128 kbps    345.45 / 19.14               310.23 / 25.60

When decoding, these zero spectral coefficients are restored to the time domain, and \tilde{x}^{(p)}[n] fulfills Eq. (5). When the MDCT is performed on the decoded data with a frame offset identical to that of the first encoding, many of the coefficients X^{(p)}[k] are therefore equal to zero. With a different frame offset, the local symmetry of Eq. (5) is broken, and the corresponding X^{(p)}[k] are in general no longer zero.
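The zero-preservation argument can be checked numerically by simulating the encode/decode chain with the same plain MDCT convention as the Section 3.1 sketch plus a crude dead-zone quantizer, then comparing the NAC of an interior decoded frame at the matching alignment against a one-sample shift. This is only a sketch under those stand-in assumptions, not the MP3 codec itself:

```python
import numpy as np

N = 576
n, k = np.arange(2 * N), np.arange(N)
WIN = np.sin(np.pi / (2 * N) * (n + 0.5))                      # Princen-Bradley sine window
BASIS = np.cos(np.pi / N * (n[None, :] + (N + 1) / 2) * (k[:, None] + 0.5))

def mdct(frame):                                               # 2N samples -> N coefficients
    return BASIS @ (WIN * frame)

def imdct(X):                                                  # N coefficients -> 2N windowed, aliased samples
    return (2.0 / N) * WIN * (BASIS.T @ X)

rng = np.random.default_rng(0)
x = rng.standard_normal(6 * N)                                 # enough material for 5 overlapped frames
starts = np.arange(0, len(x) - 2 * N + 1, N)

decoded = np.zeros_like(x)
nonzero_kept = {}
for s in starts:                                               # "encode": dead-zone quantizer zeroes ~70% of coefficients
    X = mdct(x[s:s + 2 * N])
    Xq = np.where(np.abs(X) < np.quantile(np.abs(X), 0.7), 0.0, X)
    nonzero_kept[s] = int(np.count_nonzero(Xq))
    decoded[s:s + 2 * N] += imdct(Xq)                          # "decode": overlap-add

def nac(sig, start, rel_tol=1e-6):
    X = mdct(sig[start:start + 2 * N])
    return int(np.sum(np.abs(X) > rel_tol * np.abs(X).max()))

mid = starts[2]                                                # interior frame, fully covered by overlap-add
print("NAC at matching alignment:", nac(decoded, mid), "(nonzero quantized coeffs:", nonzero_kept[mid], ")")
print("NAC at a one-sample shift:", nac(decoded, mid + 1))     # typically close to N: the Eq. (5) symmetry is broken
```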

    3.3 Experiments on Retrieving Frame Offsets

To illustrate the preceding analysis, we randomly select 30 different audio frames and encode them with LAME v3.97 [LAME 2012] at bit rates of 32 kbps, 64 kbps, 96 kbps, and 128 kbps, respectively. For each bit rate, we apply offsets from −575 to 575 to these frames and calculate the NACs corresponding to all offsets, obtaining 1151 NACs per frame in total. The 1150 NACs corresponding to wrong offsets are called shifted NACs, and the NAC corresponding to the correct offset is called the matching NAC. The shifted NACs and the matching NAC are plotted separately: as shown in Figure 6, for each bit rate there are 30 boxes representing the distribution of the shifted NACs.

As shown in Figure 6(a), the minimum of the shifted NACs is larger than 150 for every frame, while the matching NAC is below 80. For all frames, the matching NAC is clearly distinguishable from the shifted NACs. The cases of 64 kbps, 96 kbps, and 128 kbps are illustrated in Figures 6(b), (c), and (d), respectively.

Although frames may be encoded at different bit rates, the matching NAC is always smaller than the shifted NACs. This means that we can take the minimum NAC as the matching NAC. From Figure 6, we also notice that the distance between the shifted NACs and the matching NAC becomes smaller as the bit rate increases. This is because less signal distortion and information loss occur at higher bit rates, so the MDCT coefficients contain fewer zeros.

As the preceding investigation is based on only 30 frames, the conclusion may not be general enough. We therefore gather statistics over 12800 frames, including 6400 frames of speech and 6400 frames of music, computing the 1150 shifted NACs and the matching NAC for each frame. Table I displays the mean values and standard deviations of the NACs over these 12800 frames. The mean values of the shifted NACs and of the matching NAC are clearly separated, and the standard deviations are all small compared to the mean values. However, as noted before, the difference between the shifted NACs and the matching NAC becomes small at a high bit rate such as 128 kbps.

    4. LOCATING FORGERIES VIA CHECKING FRAME OFFSETS

As the audio samples are divided into frames for encoding, the frame offset can serve as useful evidence of tampering. When a forgery occurs, all frames after the forged point are affected and the detected offsets of the corresponding frames change. Figure 7 is an example of cropping. The original sentence "I am not guilty" is recorded at a sampling rate of 44.1 kHz and saved in MP3 format by a digital recorder, as shown in Figure 7(a). We manipulate this recording with CoolEdit v2.1 and remove the key word "not"; the meaning of the sentence becomes the opposite, "I am guilty", as shown in Figure 7(b). The detected offsets of all frames in the original audio and in the doctored one are shown in Figure 7(c) and Figure 7(d), respectively.



Fig. 6. The distribution of NACs corresponding to frame offsets from −575 to 575 for 30 different audio frames, encoded using LAME v3.97, mono. Each box shows the distribution of the 1150 NACs with wrong offsets, while the isolated point is the NAC with the correct offset. Panels (a), (b), (c), and (d) show the cases of 32 kbps, 64 kbps, 96 kbps, and 128 kbps, respectively.



Fig. 7. Example of locating one cropping. The sentence "I am not guilty" is cropped to "I am guilty", shown as (a) and (b). (c) is the detection result of the original audio: the detected offsets of all frames are 0, which means there are no forgeries. (d) is the detection result of the doctored audio: the detected offsets change at frame 119, which means there is a forgery. Note that the horizontal axis represents samples in (a) and (b), but frames in (c) and (d); 160000 samples correspond to 277 frames exactly.

We observe that all frames in the original audio have the same offset, 0. For the doctored audio, the detected offsets take two different values: 0 for frames 1 to 118, and 384 for the remainder. We conclude that there is a forgery at frame 119.

From this example we obtain the general procedure for locating forgeries: (i) detect the offsets of all frames; (ii) check the differences between consecutive frame offsets.

Now, how can the offsets of all frames be retrieved effectively? Given an audio signal of L samples, we denote it by the vector x and mark its j-sample-shifted version (obtained by appending j zero samples at the beginning of x) as x^{(j)} (0 \le j < 576):

x^{(0)} = x, \qquad x^{(j+1)} = \left[0,\, x^{(j)}\right], \quad j = 0, \ldots, 574.

For each offset j, we split x^{(j)} into frames of 1152 samples with 50% overlap, so we obtain N = \lfloor L/576 \rfloor - 1 frames in total:

\left[x^{(j)}_0, \ldots, x^{(j)}_{N-1}\right] = F x^{(j)},

where F represents the frame segmentation together with the application of the window function, and x^{(j)}_k is the k-th frame of x^{(j)}. We apply the filterbank and the MDCT to each frame and obtain its spectrum (576 MDCT coefficients):

s^{(j)}_k = T x^{(j)}_k,

where T represents both the filtering by the filterbank and the MDCT, and s^{(j)}_k is the spectrum of the k-th frame of x^{(j)}. We then map s^{(j)}_k into the logarithmic representation M^{(j)}_k:

M^{(j)}_k = \log_{10}\!\left(\max\!\left(10^{10}\, s^{(j)}_k \odot s^{(j)}_k,\ 1\right)\right),

where the squaring, maximum, and logarithm are taken elementwise; this projects all values into the range [0, 10]. We then count the number of active values in M^{(j)}_k:

c^{(j)}_k = C M^{(j)}_k,

where C represents the counting operation. For frame k, the detected offset is

\mathrm{offset}_k = \begin{cases} \arg\min_j c^{(j)}_k, & \text{if } \operatorname{mean}\!\left(c^{(j)}_k\right) - \min_j c^{(j)}_k \ge \lambda, \\ -100, & \text{if } \operatorname{mean}\!\left(c^{(j)}_k\right) - \min_j c^{(j)}_k < \lambda, \end{cases}

where \operatorname{mean}(c^{(j)}_k) = \frac{1}{576}\sum_{j=0}^{575} c^{(j)}_k and \lambda is a threshold that discriminates whether the frame offset is detectable. In some cases the frame offset does not exist or is not covered; then all c^{(j)}_k are close to one another, although there is always some minimum. We therefore need a threshold to flag these cases, and we accept the frame offset as detectable only when \operatorname{mean}(c^{(j)}_k) - \min_j c^{(j)}_k is large enough; otherwise the frame offset is undetectable. Note that every frame is expected to have offset 0 when there is no forgery, since then no sample shift occurs, whereas for a forgery the detection results of some frames come up with a nonzero offset.

To locate the forgeries, we simply differentiate the offsets: if \mathrm{offset}_k \ne \mathrm{offset}_{k-1}, a forgery occurs at frame k.
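Given the NAC values c^{(j)}_k of every frame (for instance from the nac_curve sketch in Section 3.1), the decision rule and the localization step translate almost literally into code; the numeric value used for the threshold λ below is a hypothetical placeholder:

```python
import numpy as np

UNDETECTABLE = -100

def detect_offsets(c, lam=30.0):
    """c[k, j] = NAC of frame k at candidate offset j (j = 0..575).
    Returns one detected offset per frame, or UNDETECTABLE when the
    mean-minus-min margin is below the threshold lam."""
    margin = c.mean(axis=1) - c.min(axis=1)
    offsets = c.argmin(axis=1).astype(int)
    offsets[margin < lam] = UNDETECTABLE
    return offsets

def locate_forgeries(offsets):
    """Frame indices k where offset_k differs from offset_{k-1}."""
    return [k for k in range(1, len(offsets)) if offsets[k] != offsets[k - 1]]

# Toy usage: three segments with offsets 0, 384, 0 -> the joins are flagged.
toy = np.full((9, 576), 300.0)
for frame, off in enumerate([0, 0, 0, 384, 384, 384, 0, 0, 0]):
    toy[frame, off] = 60.0                     # pronounced minimum at the "true" offset
print(locate_forgeries(detect_offsets(toy)))   # -> [3, 6]
```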

    5. EXPERIMENTAL RESULTS

    5.1 Illustration of Locating Forgeries

In Section 4, we showed that the proposed method can locate one deletion correctly. However, the frame offset method is effective not only for a single deletion but also for multiple deletions. Here we demonstrate an example in which a sentence consists only of numbers, as often appears in witness statements. As shown in Figure 8, three numbers are cropped away from the original sentence. The detected offsets of all frames in the doctored audio are shown in Figure 8(c). We observe that the frame offsets change at the 70th, 180th, and 470th frames, which means that forgeries occur at these locations.

As Figure 8 shows, if manipulations of the MP3 audio destroy the frame segmentation of the previous encoding, the frame offset method is able to locate those forgeries. After an insertion, the doctored audio is separated into three segments, and obviously the three segments have different frame offsets. Figure 9 shows an example of insertion detection; the method locates these forgeries very precisely. As two spliced parts usually come from different sources, they usually have different frame offsets, so our method is also effective for detecting splicing. The case of substitution is illustrated in Figure 10.


[Figure 8 panels: (a) waveform of the original audio, "one two three four five six seven eight nine"; (b) waveform of the doctored audio, "one three five six seven nine"; (c) detection result of the doctored audio (detected offset per frame).]

Fig. 8. Example of locating multiple deletions. Three numbers are cropped away from a series of numbers, shown as (a) and (b). (c) is the detection result of the doctored audio. Frame offsets change at the 70th, 180th, and 470th frames, which means there are forgeries at these frames.

5.2 Extensive Experiments

Our experiments also include extensive tests on different types of audio clips. The tested audio includes 64 speech clips (each 30 s long) and 64 music clips (each 30 s long). These original audio clips are in WAV format, 22.05 kHz, 16 bit, mono. We use LAME 3.97 to encode the audio clips into MP3 at bit rates of 32 kbps, 64 kbps, and 96 kbps, respectively; each clip then consists of 1142 frames. For each clip, we randomly select 100 frames, and on each selected frame we perform a 200-sample deletion and a 200-sample insertion, respectively. So for each bit rate, we test our approach on 12800 doctored frames with deletion and another 12800 frames with insertion, and apply our method to these audio clips. The false positive error measures the undoctored frames incorrectly identified as doctored, while the false negative error represents the doctored frames that are not detected. We denote the false positive error rate and the false negative error rate by f_p and f_n, respectively. The accurate detection rate AR is calculated as follows:

AR = \left(1 - \frac{f_p + f_n}{2}\right) \times 100\% \qquad (6)
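As a quick arithmetic check of Eq. (6) against the first row of Table II (purely illustrative, not part of the paper's code):

```python
def accuracy_rate(fp, fn):
    # AR of Eq. (6); fp and fn are given as fractions, the result is in percent.
    return (1.0 - (fp + fn) / 2.0) * 100.0

print(accuracy_rate(0.0050, 0.0003))   # speech, deletion, 32 kbps: ~99.73 (cf. Table II)
```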

The test results for speech and music are shown in Table II and Table III, respectively. As we can see, whether locating deletion or insertion in these audio frames, all accuracy rates are above 99%. We also notice that the detection results at low bit rates are slightly better than those at high bit rates.


[Figure 9 panels: (a) original waveform 1, "I don't think so"; (b) original waveform 2, "I agree with it"; (c) forgery waveform, "I don't agree with it" (insertion); (d) detection result of the doctored audio.]

Fig. 9. Example of locating insertion. The key word "don't" is inserted into a sentence, shown as (a), (b), and (c). (d) is the detection result of the doctored audio. Frame offsets change at the 48th and 100th frames, which means there are forgeries at these frames.

This is because MP3 files at lower bit rates carry stronger compression traces, so the frame offset can be detected more accurately. The f_p of speech is higher than that of music, while the opposite holds for f_n. This may be because the music clips contain fewer silent samples, since frame offset detection on silent portions introduces errors more easily.

It is noted that the detection rate cannot reach 100%: in some special cases our method fails to locate forgeries. When a frame contains many zero samples, for example one half of it, the correct offset cannot be detected via the NAC, as shown in Figure 11. The actual offset of the frame is 200, but the detected offset is 575. As different offsets are applied, the number of zero samples varies rapidly, which leads to an unstable NAC.

    5.3 Sensitivity and Robustness

In this subsection, we discuss the sensitivity and robustness of the proposed method against a variety of attack schemes.

5.3.1 Splicing at the Boundary. If the adversary is smart enough to splice or crop exactly a multiple of 576 samples so as to hit the exact boundary of a frame, will the detection method still work? After generating the desired audio, the adversary only needs to adjust a few (1 to 575) samples to match the frame boundary.


[Figure 10 panels: (a) original waveform 1, "I like it"; (b) original waveform 2, "I hate doing that"; (c) forgery waveform, "I like doing that" (substitution); (d) detection result of the doctored audio.]

Fig. 10. Example of locating substitution. The key word "hate" is replaced by "like", shown as (a), (b), and (c). (d) is the detection result of the doctored audio. Frame offsets change at the 48th and 90th frames, which means there are forgeries at these frames.

Table II. Detection Results for Speech

Forgery Type   bit rate   f_p      f_n      AR
deletion       32 kbps    0.50%    0.03%    99.73%
deletion       64 kbps    0.90%    0.14%    99.48%
deletion       96 kbps    1.12%    0.34%    99.27%
insertion      32 kbps    0.51%    0.03%    99.73%
insertion      64 kbps    0.85%    0.20%    99.47%
insertion      96 kbps    1.01%    0.37%    99.31%

Table III. Detection Results for Music

Forgery Type   bit rate   f_p      f_n      AR
deletion       32 kbps    0.20%    0.27%    99.76%
deletion       64 kbps    0.27%    0.47%    99.63%
deletion       96 kbps    0.32%    0.61%    99.53%
insertion      32 kbps    0.16%    0.20%    99.82%
insertion      64 kbps    0.23%    0.42%    99.67%
insertion      96 kbps    0.28%    0.45%    99.63%

Because 1 to 575 samples last less than 575/44100 ≈ 0.013 s at a 44.1 kHz sampling rate, this adjustment would not affect the meaning of the desired audio. Thanks to the 50% overlapped framing used in MP3 encoding, we can still find the trace of this forgery. We give a demonstration in Figure 12. Suppose that a forgery occurs at the boundary of frame k.



Fig. 11. An example of a failure case. Shown in (a) is the waveform of a frame with an undetectable frame offset. Shown in (b) are the NACs for the different frame offsets.

There, exactly 576 samples are cropped. The spectrum of the new frame k+1 does not show the quantization characteristic at any offset, but frame k and frame k+2 still have many troughs at the original offset.

5.3.2 Additive Noise. Additive noise may be added to tampered speech to cover forgeries, which presents a challenge for forgery detection. To investigate the robustness of the proposed scheme under additive noise, a short speech clip consisting of 45 frames is tested. White Gaussian noise of 30 dB is added to the audio samples of the 20th frame, as shown in Figure 13(a). Since both the 19th and the 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half doctored at the same time. We then investigate the effect of the additive noise on the NAC. Offsets from 0 to 575 are applied to all frames, and the corresponding NACs are recorded and plotted vertically, as shown in Figure 13(b). All the plots reach a significantly small value except those of the 18th, 19th, 20th, 21st, and 22nd frames. This means that the frame offsets of all frames except these five can be detected via the NAC. Since there is no such remarkable decrease among the NACs of the 18th to 22nd frames, the frame offsets of these five frames are undetectable and are marked with the special value −100, as defined in Section 4. The detection result for the tampered speech is shown in Figure 13(c). It shows that the proposed method can resist locally added noise, which means that forgeries covered by noise can still be located.

However, if the noise is added globally after the forgeries, all frame offsets become undetectable and are marked as −100.



Fig. 12. The case of splicing at the boundary. Shown in (a) is the waveform of audio from which 576 samples are cropped starting at the 1153rd sample. Shown in (b) are the spectra of the three frames of the doctored audio. All the frames have the quantization characteristic except the middle frame.


Fig. 13. The effect of additive noise on the NAC. Shown in (a) is the waveform of the audio with partially added noise. Shown in (b) are the NAC results of all frames. Shown in (c) is the detection result of the frame offsets.

In this case, the proposed method is not able to locate the forgeries, but it still indicates that the audio is abnormal and must have been postprocessed; the audio is then suspect and can be rejected as evidence.

5.3.3 Filtering. Another common way to cover forgeries is to filter the tampered signal. Here we test with a median filter, a mean filter, and a low-pass filter. The same speech clip as in the preceding section is used for testing. Since the effect of the different filters on the NAC is similar, only the result for the median filter is illustrated here due to space limitations.



Fig. 14. The effect of median filtering on the NAC. Shown in (a) is the waveform of the partially filtered audio. Shown in (b) are the NAC results of all frames. Shown in (c) is the detection result of the frame offsets.

First, the 20th frame of the audio signal is filtered by a median filter of length 7, as shown in Figure 14(a). Since both the 19th and the 21st frames overlap the 20th frame by 50%, the 19th and 21st frames are half filtered at the same time. The NACs of all frames are then investigated, and the proposed detection method is applied to the whole speech clip.

As shown in Figure 14(b), similar to the case of added noise, the NAC plots of the 18th, 19th, 20th, 21st, and 22nd frames show no significant decrease, while the plots of the other frames reach an obviously small value. The detection result in Figure 14(c) shows that the frame offsets of the 18th to 22nd frames are undetectable, while the other frames have an obvious offset of 0. This means that the proposed method can indicate the filtered portion of an audio signal if the signal is only partially filtered. However, as in the case of added noise, if the audio signal is globally filtered, the proposed method cannot locate forgeries automatically, but it still indicates that the filtered signal has been manipulated.

    6. DISCUSSIONS AND CONCLUSIONS

    6.1 Extension to Other Formats

Although we only investigate audio in MP3 format, the idea of locating forgeries via the frame offset is also suitable for audio in other compressed formats, such as AAC, WMA, and OGG Vorbis. Since audio in these formats is generated frame by frame, the frame offset of each frame can likewise be obtained.

To confirm this, we use an audio signal encoded with AAC for testing. Notice that the length of each frame in AAC is 1024 and that the frequency spectrum also consists of MDCT coefficients. The tool we use to encode and decode the audio signals is FAAC [FAAC 2012]. The test clip consists of 40 frames of audio with a sampling rate of 44.1 kHz, and the FAAC encoding parameters are 96 kbps, mono. First, we investigate whether the AAC audio exhibits the quantization characteristic. Offsets −1, 0, and +1 are applied to the 9th frame, respectively; for each offset, 1024 MDCT coefficients are obtained. We then plot these coefficients in a logarithmic representation, as shown in Figures 15(a), (b), and (c).



Fig. 15. Quantization characteristic of AAC. Subfigures (a), (b), and (c) correspond to the spectra of the 9th frame with offsets −1, 0, and +1, respectively. As in the MP3 case, the quantization characteristic shows up only with the matching offset (0). Subfigure (d) shows the NAC results for the 9th frame with offsets 1 to 2500.

It is obvious that only Figure 15(b) shows the quantization characteristic. Furthermore, we apply offsets 1 to 2500 to the frame and obtain the corresponding NAC results, as shown in Figure 15(d). A period of 1024 can be observed; within the length of a frame there is only one matching offset, and its NAC is clearly distinguishable from the other 1023 NACs.

We now check for AAC audio forgeries. The audio of 40 frames has 40960 samples in total. We delete the samples from index 10000 to 15000 and then apply the proposed method to the doctored AAC audio. Each frame yields 1024 NACs, and the matching offset is recognized as the one that minimizes the NAC. The detection result is shown in Figure 16.

This shows that the proposed method can detect forgeries in AAC audio. The method can also be extended to other frame-based encoders, since the spectrum computed at the matching offset approximates the first-encoding spectrum much better than the spectra at shifted offsets. What must be kept in mind is that the procedure for extracting the spectrum varies from encoder to encoder, since different encoders use different frame lengths and windows.
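Carrying the method over to another frame-based codec mostly means swapping the framing parameters and the spectral front end. A minimal parameterization is sketched below; the AAC numbers are the standard long-block values, while the dictionary layout and names are just this sketch's convention:

```python
# Framing parameters for the NAC-based offset search, per codec.
CODEC_PARAMS = {
    "mp3": {"coeffs": 576, "window": 1152},    # 576-sample hop, as used throughout this article
    "aac": {"coeffs": 1024, "window": 2048},   # AAC long blocks: 1024 coefficients, 2048-sample window
}

def candidate_offsets(codec):
    # The offset search repeats with the hop size (576 for MP3, 1024 for AAC),
    # so scanning j = 0 .. hop-1 covers every distinct alignment.
    hop = CODEC_PARAMS[codec]["coeffs"]        # at 50% overlap the hop equals the coefficient count
    return range(hop)
```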

    6.2 Conclusions

In this article, we propose a method to expose MPEG audio forgeries using frame offsets. The main contributions of this work are as follows. First, to the best of our knowledge, this is the first work on detecting forgeries in MP3 audio, and it extends the research topics of forgery detection. Second, this work illustrates that MDCT coefficients reflect forgery traces very well for MPEG audio. Via theoretical analysis and extensive experiments, we show that the NAC is a reliable feature for retrieving frame offsets.



Fig. 16. Forgery detection result for AAC audio.

Based on the fact that the most common forgeries change the frame offsets of the audio, the proposed method can locate these forgeries effectively. Extensive experimental results show that the proposed method performs very well on both speech and music: all accuracy rates are above 99%, which demonstrates its effectiveness. Another advantage of the proposed method is its computational simplicity, since we only need to examine the MDCT coefficients of the audio.

However, if the audio is transcoded between different compressed formats, the frame offset is difficult to obtain and the proposed method fails. It is also noted that at a high bit rate such as 128 kbps the NAC method is not very suitable for retrieving frame offsets, since few coefficients are quantized to zero at high bit rates. In future work, we will therefore focus on obtaining the frame offset under transcoding and at high bit rates.

    ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their constructive comments. Their suggestions will be very helpful for our future work.

    REFERENCES

BOEHM, R. AND WESTFELD, A. 2004. Statistical characterisation of MP3 encoders for steganalysis. In Proceedings of the 6th ACM Multimedia and Security Workshop. ACM.

FAAC. 2012. Freeware advanced audio coder. http://www.audiocoding.com/faac.html.

FARID, H. 1999. Detecting digital forgeries using bispectral analysis. MIT AI Memo AIM-1657, MIT.

FU, D., SHI, Y., AND SU, W. 2007. A generalized Benford's law for JPEG coefficients and its applications in image forensics. In Proceedings of the SPIE Conference on Security, Steganography, and Watermarking of Multimedia Contents.

GRIGORAS, C. 2005. Digital audio recording analysis: The electric network frequency (ENF) criterion. Int. J. Speech Lang. Law 12, 1, 63–76.

HERRE, J. AND SCHUG, M. 2000. Analysis of decompressed audio: The inverse decoder. In Proceedings of the 109th AES Convention.

HERRE, J., SCHUG, M., AND GEIGER, R. 2002. Analysing decompressed audio with the inverse decoder: Towards an operative algorithm. In Proceedings of the 112th AES Convention.


ISO. 1992. ISO/IEC International Standard IS 11172-3. Information technology: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s. http://www.iso.org/iso/catalogue_detail.htm?csnumber=22412.

KRAETZER, C., OERMANN, A., DITTMANN, J., AND LANG, A. 2007. Digital audio forensics: A first practical evaluation on microphone and environment classification. In Proceedings of the 9th ACM Multimedia and Security Workshop.

LAME 3.97. 2012. MP3 encoder. http://lame.sourceforge.net.

LUKAS, J. AND FRIDRICH, J. 2003. Estimation of primary quantization matrix in double compressed JPEG images. In Proceedings of the Digital Forensic Research Workshop.

PAINTER, T. AND SPANIAS, A. 2000. Perceptual coding of digital audio. Proc. IEEE 88, 4, 451–513.

POPESCU, A. AND FARID, H. 2004. Statistical tools for digital forensics. In Proceedings of the 6th International Workshop on Information Hiding.

QU, Z., LUO, W., AND HUANG, J. 2008. A convolutive mixing model for shifted double JPEG compression with application to passive image authentication. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.

WANG, Y. AND VILERMO, M. 2003. Modified discrete cosine transform: Its implications for audio coding and error concealment. AES J. 51, 1, 51–62.

WANG, Y., YAROSLAVSKY, L., VILERMO, M., AND VAANANEN, M. 2000. Some peculiar properties of the MDCT. In Proceedings of the 16th IFIP World Computer Congress.

YANG, R., QU, Z., AND HUANG, J. 2008. Detecting digital audio forgeries by checking frame offsets. In Proceedings of the 10th ACM Multimedia and Security Workshop. ACM.

    Received November 2010; revised July 2011; accepted August 2011
