Hear “No Evil”, See “Kenansville”*: Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems

Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Logan Blue, Kevin Warren, Anurag Swarnim Yadav, Tom Shrimpton and Patrick Traynor

University of Florida
{hadi10102, rahmanm, w.garcia, bluel, kwarren9413, anuragswar.yadav, teshrim, traynor}@ufl.edu

Abstract—Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension. Processing pipelines for modern systems are comprised of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box and transferable, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy. We also find that certain English language phonemes (in particular, vowels) are significantly more susceptible to our attack. We show that the attacks are effective when mounted over cellular networks, where signals are subject to degradation due to transcoding, jitter, and packet loss.

    I. INTRODUCTION

The telephony network is still the most widely used mode of audio communication on the planet, with billions of phone calls occurring every day within the USA alone [3]. Such a degree of activity makes the telephony network a prime target for mass surveillance by governments. However, hiring individual human listeners to monitor these conversations cannot scale. To overcome this bottleneck, governments have used Machine Learning (ML) based Automatic Speech Recognition (ASR) systems and Automatic Voice Identification (AVI) systems to conduct mass surveillance of their populations. Governments accomplish this by employing ASR systems to flag anti-state phone conversations and AVI systems to identify the participants [6]. The ASR systems convert the phone call audio into text. Next, the government can use keyword searches on the audio transcripts to flag potentially dissenting audio conversations [8]. Similarly, AVI systems identify the participants of the phone call using voice signatures.

*The title of our paper plays on “Hear No Evil, See No Evil” and we use the attacks described in our paper to generate the above title. Thus, when a model is fed “No Evil”, it mistranscribes it as “Kenansville”, a town located in Central Florida - text completely unrelated to the audio input.

Currently, there does not exist any countermeasure for a dissident attempting to circumvent this mass surveillance infrastructure. Several targeted attacks against ASR and AVI systems exist in the current literature. However, none of these consider the limitations of the dissident (near-real-time operation, no access to or knowledge of the state's ASR and AVI systems, success over the telephony network, limited queries, transferability, high audio quality). Targeted attacks either require white-box knowledge [32], [83], [30], [71], [46], generate noisy audio [18], [28], are query intensive [19], [78], or are not resistant to the churn of the telephone network. For a comprehensive overview of the state of current attacks with respect to our own, we refer the reader to Table IV in the Appendix.

In this work, we propose the first near-real-time, black-box, model-agnostic method to help evade the ASR and AVI systems employed as part of the mass telephony surveillance infrastructure.¹ Using our method, a dissident can force any ASR system to mistranscribe their phone call audio and an AVI system to misidentify their voice. As a result, governments will lose trust in their surveillance models and invest greater resources to account for our attack. Additionally, by forcing mistranscriptions, our attack will prevent governments from successfully flagging the conversation. Our attack is untargeted, i.e., it cannot generate selected words or specific speakers. However, in the absence of any technique that can attain these goals in the severely constrained setting of the dissident, our method's ability to achieve a limited set of goals (i.e., evasion) is still valuable. It can serve the dissident as a first line of defense.

¹Recently, we have seen a number of attack papers against ASR and AVI systems. To better understand why our work is novel and to clearly differentiate it from existing literature, we encourage the readers to review Table IV in the Appendix.

Our attack is specifically designed to address the needs of the dissident attempting to evade the audio surveillance infrastructure. The following are the key contributions of our work:

• Our attack can circumvent any state of the art ASR and AVI system in a near-real-time, black-box, and transferable manner: A dissident attempting to evade


mass surveillance will not have access to the target ASR or AVI systems. A key contribution of this work is the ability to generate audio samples that will induce errors in a variety of ASR and AVI systems in a black-box setting, where the adversary has no knowledge of the target model. Current black-box attacks against audio models [19] use genetic algorithms, which still require hundreds of thousands of queries and days of execution to generate a single attack sample. In contrast, our attack can generate a sample in fewer than 15 queries to the target model. Additionally, we show that if the dissident cannot query the target model, which is most likely the case, our adversarial audio samples will still be transferable, i.e., they will evade unknown models.

• Attack does not significantly impact human-perceived quality or comprehension and works in real audio environments: The dissident must be confident that the attack perturbations will maintain the quality of the phone call audio, survive the churn of the telephony network, and still be able to evade the ASR and AVI systems. Therefore, we design our attack to introduce imperceptible changes, such that there is no significant degradation of the audio quality. We substantiate this claim by conducting an Amazon Mechanical Turk user study. Similarly, we test our attacks over the cellular network, which introduces significant audio quality degradation due to transcoding, jitter, and packet loss. We show that even after undergoing such serious degradation and loss, our attack audio sample is still effective in tricking the target ASR and AVI systems. To our knowledge, our work is the first to generate attack audio that is robust to the cellular network. Therefore, our attack ensures that the dissenter's adversarial audio will not have any significant impact on quality and will evade the surveillance models after being intercepted within telephony networks.

• Robust to existing adversarial detection and defense mechanisms: Finally, we evaluate our attack against existing techniques for detecting or defending against adversarial samples. For the former, we test the attack against the temporal-based detection method, which has shown excellent results against traditional adversarial attacks [82]. We show that this method has limited effectiveness against our attack: it is no better than randomly guessing whether an attack is in progress. Regarding defenses, we test our attack against adversarial training, which has shown promise in the adversarial image space [53]. We observe that this method slightly improves model robustness, but at the cost of a significant decrease in model accuracy.

The remainder of this paper is organized as follows: Section II provides background information on topics ranging from signal processing to phonemes; Section III details our methodology, including our assumptions and hypothesis; Section IV presents our experimental setup and parameterization; Section V shows our results; Section VI offers further discussion; Section VII discusses related work; and Section VIII provides concluding remarks.

[Figure 1 diagram: Preprocessing (low-pass filter, noise filter) → Feature Extraction → Decoding (acoustic models) → Text, e.g., "Hello, how're you?"]

Fig. 1: Modern ASR systems take several steps to convert speech into text. (a) Preprocessing removes high frequencies and noise from the audio waveform, (b) feature extraction extracts the most important features of the audio sample, and (c) decoding converts the features into text.

    II. BACKGROUND

    A. Automatic Speech Recognition (ASR) Systems:

An ASR system converts a sample of speech into text using the steps seen in Figure 1.

(a) Preprocessing: Preprocessing in ASR systems attempts to remove noise and interference, yielding a “clean” signal. Generally, this consists of noise filters and low-pass filters. Noise filters remove unwanted frequency components from the signal that are not directly related to the speech. The exact process by which the noise is identified and removed varies among different ASR systems. Additionally, since the bulk of frequencies in human speech fall between 300 Hz and 3000 Hz [36], discarding higher frequencies with a low-pass filter helps remove unnecessary information from the audio.
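As a concrete illustration, this preprocessing stage could be sketched in Python with SciPy as shown below. The fifth-order Butterworth filter and the 3 kHz cutoff are our own illustrative choices, not the parameters of any specific ASR system:

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(audio, sr, cutoff=3000.0):
    """Sketch of step (a) in Figure 1: discard frequencies above ~3 kHz,
    where little speech energy remains. Real systems add further,
    system-specific noise filtering on top of this."""
    b, a = butter(N=5, Wn=cutoff, btype='low', fs=sr)
    return lfilter(b, a, audio)
```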

(b) Feature Extraction: Next, the signal is converted into overlapping segments, or frames, each of which is then passed through a feature extraction algorithm. This algorithm retains only the salient information about the frame. A variety of signal processing techniques are used for feature extraction, including Discrete Fourier Transforms (DFT), Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding, and the Perceptual Linear Prediction method [64]. The most common of these is the MFCC method, which is comprised of several steps. First, a magnitude DFT of an audio sample is taken to extract the frequency information. Next, the Mel filter is applied to the magnitude DFT, as it is designed to mimic the human ear. This is followed by taking the logarithm of the powers, as the human ear perceives loudness on a logarithmic scale. Lastly, this output is given to the Discrete Cosine Transform (DCT) function, which returns the MFCC coefficients.
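To make these steps concrete, the following Python sketch computes MFCC-style coefficients for a single frame. The filterbank size, coefficient count, and Hamming window are common textbook choices rather than the parameters of any particular ASR system:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced on the Mel scale: mel = 2595*log10(1 + f/700)."""
    high_mel = 2595 * np.log10(1 + (sr / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, high_mel, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    # (1) magnitude DFT of the windowed frame
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    # (2) Mel filterbank mimics the frequency resolution of the human ear
    energies = mel_filterbank(n_filters, len(frame), sr) @ mag
    # (3) logarithm of the powers: loudness is perceived logarithmically
    # (4) DCT of the log energies yields the MFCC coefficients
    return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_coeffs]
```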

Some modern ASR systems use data-driven learning techniques to establish which features to extract. Specifically, a machine learning layer (or a completely new model) is trained to learn which features to extract from an audio sample in order to properly transcribe the speech [81].

(c) Decoding: During the last phase, the extracted features are passed to a decoding function, often implemented in the form of a machine learning model. This model has been trained to correctly decode a sequence of extracted features into a sequence of characters, phonemes, or words, to form the output transcription. ASR systems employ a variety of statistical models for the decoding step, including Convolutional Neural Networks (CNNs) [66], [17], [55], Recurrent Neural Networks (RNNs) [40], [67], [68], [69], Hidden Markov Models (HMMs) [39], [69], and Gaussian Mixture Models (GMMs) [50], [69]. Each model type provides a unique set of properties, and therefore the type of model selected directly affects the quality of the ASR system. Depending on the model, the extracted features may be re-encoded into a different, learnable feature space before proceeding to the decoding stage. A recent innovation is the paradigm known as end-to-end learning, which combines the entire feature extraction and decoding phases into one model and greatly simplifies the training process. The most sophisticated methods leverage many machine learning models during the decoding process. For example, one may employ a dedicated language model, in addition to the decoding function, to improve the ASR's performance on high-level grammar and rhetorical concepts [20]. Our attacks are agnostic to how the target ASR system is implemented, making our attack completely black-box. To our knowledge, we are the first paper to introduce black-box attacks in a limited query environment.

    B. Automatic Voice Identification (AVI) Systems:

AVI systems are trained to recognize the speaker of a voice sample. The modern AVI pipeline is mostly similar to the one used in ASR systems, shown in Figure 1. While both systems use the preprocessing and feature extraction steps, the difference lies in the decoding step. Even though the underlying statistical model (i.e., CNNs, RNNs, HMMs or GMMs) at the decoding stage remains the same for both systems, what each model outputs is different. In the case of ASR systems, the decoding step converts the extracted features into a sequence of characters, phonemes, or words, to form the output transcription. In contrast, the decoding step for AVI models outputs a single label which corresponds to an individual. AVI systems are commonly used in security-critical domains as an authentication mechanism to verify the identity of a user. In our paper, we present attacks to prevent AVI models from correctly identifying speakers. To our knowledge, we are the first to do so in a limited query, black-box setting.

    C. Data-Transforms

In this paper, we use standard signal processing transformations to change the representation of audio samples. The transforms can be classified into two categories: data-independent and data-dependent.

1) Data-Independent Transforms: These represent input signals in terms of fixed basis functions (e.g., complex exponential functions, cosine functions, wavelets). Different basis functions extract different information about the signal being transformed. For our attack, we focus on the DFT, which exposes frequency information. We do so because the DFT is well understood and commonly used in speech processing, both as a stand-alone tool and as part of the MFCC method, as discussed in Section II-A.

Fig. 2: (a) Original audio “about” [1]; (b) the corresponding DFT and (c) SSA decompositions. In both, low-magnitude components (frequencies or eigenvectors, respectively) contribute little to the original audio.

The DFT, shown in Figure 2(b), represents a discrete-time series $x_0, x_1, \ldots, x_{N-1}$ via its frequency spectrum — a sequence of complex values $f_0, f_1, \ldots, f_{N-1}$ that are computed as
$$f_k = \sum_{n=0}^{N-1} x_n \exp\!\left(\frac{-j 2\pi k n}{N}\right), \quad \text{where } j = \sqrt{-1},$$
for $k = 0, 1, \ldots, N-1$. One can view $f_k$ as the projection of the time series onto the $k$-th basis function, a (discrete-time) complex sinusoid with frequency $k/N$ (i.e., a sinusoid that completes $k$ cycles over a sequence of $N$ evenly spaced samples). Intuitively, the complex-valued $f_k$ describes “how much” of the time series $x_0, x_1, \ldots, x_{N-1}$ is due to a sinusoidal waveform with frequency $k/N$. It compactly encodes both the magnitude of the $k$-th sinusoid's contribution, as well as phase information, which determines how the $k$-th sinusoid needs to be shifted (in time). The DFT is invertible, meaning that a time-domain signal is uniquely determined by a given sequence of coefficients. Filtering operations (e.g., low-pass/high-pass filters) allow one to accentuate or downplay the contribution of specific frequency components; in the extreme, setting a non-zero $f_k$ to zero ensures that the resulting time-domain signal will not contain that frequency.
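For example, the following NumPy snippet removes a single frequency component from a synthetic two-tone signal and inverts the transform; the tone frequencies and sample rate are arbitrary illustrative values:

```python
import numpy as np

sr = 8000                                    # illustrative sample rate (Hz)
t = np.arange(sr) / sr                       # one second of samples
x = np.sin(2*np.pi*440*t) + 0.2*np.sin(2*np.pi*1200*t)

f = np.fft.rfft(x)                           # frequency spectrum f_k
k = np.argmin(np.abs(np.fft.rfftfreq(len(x), 1/sr) - 1200))
f[k] = 0                                     # zero out the 1200 Hz component
x_filtered = np.fft.irfft(f, n=len(x))       # invertibility: back to time domain
```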

2) Data-Dependent Transforms: Unlike the DFT, data-dependent transforms do not use predefined basis functions. Instead, the input signal itself determines the effective basis functions: a set of linearly independent vectors which can be used to reconstruct the original input. Abstractly, an input sequence $x$ with $|x| = n$ can be thought of as a vector in the space $\mathbb{R}^n$, and the data-driven transform finds the bases for the input $x$. Singular Spectrum Analysis (SSA) is a spectral estimation method that decomposes an arbitrary time series into components called eigenvectors, shown in Figure 2(c). These eigenvectors represent the various trends and noise that make up the original series. Intuitively, eigenvectors corresponding to eigenvalues with smaller magnitudes convey relatively less information about the signal, while those with larger eigenvalues capture important structural information, such as long-term “shape” trends and dominant signal variation from these long-term trends. Like the DFT, the SSA is linear and invertible. Inverting an SSA decomposition after discarding eigenvectors with small eigenvalues is a means to remove noise from the original series.
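A minimal SSA sketch in Python, assuming the basic trajectory-matrix/SVD formulation with diagonal averaging (the window length and the number of retained components are illustrative, not values prescribed by our attack):

```python
import numpy as np

def ssa_reconstruct(x, window=50, keep=5):
    """Decompose the series x via SSA and rebuild it from the `keep`
    components with the largest singular values."""
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: lagged copies of the series
    X = np.column_stack([x[i:i + window] for i in range(k)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the dominant components; the rest carry little information
    X_hat = (U[:, :keep] * s[:keep]) @ Vt[:keep]
    # Diagonal averaging (Hankelization) maps the matrix back to a series
    out, counts = np.zeros(n), np.zeros(n)
    for j in range(k):
        out[j:j + window] += X_hat[:, j]
        counts[j:j + window] += 1
    return out / counts
```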

    D. Cosine Similarity

Cosine Similarity is a metric used to measure the similarity of two vectors. This metric is often used to measure how similar two samples of text are to one another (e.g., as part of the TF-IDF measure [45]). In order to calculate this, the sample texts are converted into a dictionary of vectors. Each index of the vector corresponds to a unique word, and the index value is the number of times the word occurs in the text. The cosine similarity is calculated using the equation
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|},$$
where $x$ and $y$ are the sentence samples. Cosine values close to one mean that the two vectors, or in this case sentences, have high similarity.
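For example, a minimal Python implementation of this word-count cosine similarity:

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words vectors: one index per unique word, value = its count
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# e.g., comparing an original and an attack transcription
print(cosine_similarity("the meeting is now adjourned",
                        "the meeting is now charmed"))  # 0.8, i.e., < 1
```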

    E. Phonemes

Human speech is made up of various component sounds known as phonemes. The set of possible phonemes is fixed due to the anatomy that is used to create them. The number of phonemes that make up a given language varies. English, for example, is made up of 44 phonemes. Phonemes can be divided into several categories depending on how the sound is created. These categories include vowels, fricatives, stops, affricates, nasals, and glides. In this paper, we mostly deal with fricatives and vowels; however, for completeness, we will briefly discuss the other categories here.

Vowels are created by positioning the tongue and jaw such that two resonance chambers are created within the vocal tract. These resonance chambers create certain frequencies, known as formants, with much greater amplitudes than others. The relationship between the formants determines which vowel is heard. Examples of vowels include iy, ey, and oy in the words beet, bait, and boy, respectively.

Fricatives are generated by creating a constriction in the airway that causes turbulent flow to occur. Acoustically, fricatives create a wide range of higher frequencies, generally above 1 kHz, that are all similar in intensity. Common fricatives include the s and th sounds found in words like sea and thin.

Stops are created by briefly halting air flow in the vocal tract before releasing it. Common stops in English include b, g, and t found in the words bee, gap, and tea. Stops generally create a short section of silence in the waveform before a rapid increase in amplitude over a wide range of frequencies.

Affricates are created by concatenating a stop with a fricative. This results in a spectral signature that is similar to a fricative. English only contains two affricates, jh and ch, which can be heard in the words joke and chase, respectively.

Nasal phonemes are created by forcing air through the nasal cavity. Generally, nasals have less amplitude than other phonemes and consist predominantly of lower frequencies. English nasals include n and m, as in the words noon and mom.

Glides are unlike other phonemes, since they are not grouped by their means of production but instead by their role in speech. Glides are acoustically similar to vowels but are used like consonants, acting as transitions between different phonemes. Examples of glides include the l and y sounds in lay and yes.

    III. METHODOLOGY

    A. Hypothesis and Threat Model

Hypothesis: Our central hypothesis is that ASR and AVI systems rely on components of speech that are non-essential for human comprehension. Removal of these components can dramatically reduce the accuracy of ASR system transcriptions and AVI system identifications without significant loss of audio clarity. Our methods and experiments are designed to test this hypothesis.

Fig. 3: The steps involved in generating an attack audio sample. First, the target audio sample is passed through a signal decomposition function (a), which breaks the input signal into components. Next, subject to some constraints, a subset of the components is discarded during thresholding (b). A perturbed audio sample is reconstructed (c) using the remaining components from (a) and (b). The audio sample is then passed to the ASR/AVI system (d) for transcription. The difference between the transcription of the perturbed audio and the original audio is measured (e). The thresholding constraints are updated accordingly (b) and the entire process is repeated.

Threat Model and Assumptions: For the purposes of this paper, we define the attacker or adversary as a person who is aiming to trick an ASR or AVI system via audio perturbations. In contrast, we define the defender as the owner of the target system.

We assume the attacker has no knowledge of the type, weights, architecture, or layers of the target ASR or AVI system. Thus, we treat each system as a black-box to which we make requests and receive responses. We also assume the attacker can only make a limited number of queries to the target model, as a large number of queries will alert the defender to the attacker's activities. Furthermore, the attacker has fewer computational resources than the defender.

We assume the defender has the ability to train an ASR or AVI system. Additionally, they may use any type of machine learning model, feature extraction, or preprocessing to create their ASR or AVI system. Finally, the defender is able to monitor incoming queries to their system and prevent attackers from performing large numbers of queries.

    B. Attack Steps

Readers might incorrectly assume that certain trivial attacks could achieve the goals of the dissident, i.e., evade the model while maintaining high audio quality. One such trivial attack is adding white noise to the speech samples, expecting a mistranscription by the model. However, such an attack will fail. We discuss in detail how we test this trivial attack and the corresponding results in Appendix A. Similarly, we introduce a simple impulse perturbation technique that exposes the sensitivity of ASR systems, discussed in Appendix B1. Realizing the limitations of this approach, we leave its details to Appendices B1 to B4. We continue our study to develop a more robust attack algorithm in the following sections.

The attack should meet certain constraints: First, it should introduce artificial noise to the audio sample that exploits the ASR/AVI system and forces an incorrect model output. Second, the distortion should have little to no impact on the understandability of the perturbed audio file for the human listener.

The attack steps are outlined in Figure 3. During decomposition, shown in Figure 3(a), we pass the audio sample to the selected algorithm (DFT or SSA). The algorithm decomposes the audio into individual components and their corresponding intensities. Next, we threshold these components, as shown in Figure 3(b). During thresholding, we remove the components whose intensity falls below a certain threshold. The intuition for doing so is that the lower intensity components are less perceptible to the human ear and will likely not affect a user's ability to understand the audio. We discuss how the algorithm calculates the correct threshold in the next paragraph. We then construct a new audio sample from the remaining components, using the appropriate inverse transform, shown in Figure 3(c). Next, the audio is passed to the model for inference, Figure 3(d). If the system being attacked is an ASR, the model outputs a transcription. On the other hand, if the target system is an AVI, the model outputs a speaker label. The model output is compared with that of the original during the measurement step, Figure 3(e).

The goal of the algorithm is to calculate the optimum threshold, which discards the fewest components whilst still forcing the model to misinterpret the audio sample. Discarding components degrades the quality of the reconstructed audio sample. If the discard threshold is too high, neither the human listener nor the model will be able to correctly interpret the reconstructed audio. On the other hand, by setting it too low, both the human listener and the model will correctly understand the reconstructed audio.

To compensate for these competing tensions, the attack algorithm executes the following steps. If the model output matches the original label during the measurement step, the algorithm will increase the threshold value. It will then pass the audio sample and the updated threshold back to the thresholding step for an additional round of optimization. However, if the model outputs an incorrect interpretation of the audio sample, the algorithm reduces the degradation by reducing the discard threshold before returning the audio to the thresholding step. This loop repeats until the algorithm has calculated the optimum discard threshold.

C. Performance

In order to find the optimal threshold, we incrementally remove more components until the model fails to properly transcribe the audio file. This process takes $O(n)$ queries to the model, where $n$ is the number of decomposition components. We can reduce the time complexity from linear to logarithmic, such that an attack audio sample is produced in $O(\log n)$ queries. To achieve this, we model the distortion search as a binary search (BS) problem, where values represent the number of coefficients to use during reconstruction.

If the reconstruction is misclassified, we move to the left BS search-space and attempt to improve the audio quality by removing fewer coefficients. If the audio is correctly transcribed, we move to the right. This search continues until an upper bound on the search depth is reached. This result was sufficient for the scope of this paper, and we leave a more rigorous analysis of distortion search complexity for future work.
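A sketch of this binary search for the DFT variant of the attack is shown below. Here `query_model` is a hypothetical stand-in for a black-box transcription query (e.g., a thin wrapper around a cloud ASR API); it is not part of any real SDK:

```python
import numpy as np

def dft_attack(audio, original_text, query_model, max_queries=15):
    """Binary search over how many low-magnitude DFT coefficients to
    discard before reconstruction (a sketch of Section III-B/C)."""
    f = np.fft.rfft(audio)
    order = np.argsort(np.abs(f))       # coefficient indices, quietest first
    lo, hi = 0, len(f)                  # search space: number of zeroed coefficients
    best = None
    for _ in range(max_queries):
        mid = (lo + hi) // 2
        f_pert = f.copy()
        f_pert[order[:mid]] = 0         # discard the `mid` smallest coefficients
        candidate = np.fft.irfft(f_pert, n=len(audio))
        if query_model(candidate) != original_text:
            best, hi = candidate, mid   # mistranscribed: try less distortion
        else:
            lo = mid + 1                # still transcribed: distort more
    return best                         # None if the model was never fooled
```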

    D. Transferability

One measure of an attack's strength is the ability to generate adversarial examples that are transferable to other models (i.e., a single audio sample that is mistranscribed/misidentified by multiple models). An attacker will not know the precise model he is trying to fool. In such a scenario, the attacker will generate examples for a known model and hope that the samples will work against a completely different model.

Attacks have been shown to generalize between models in the image domain [59]. In contrast, attack audio transferability has only seen limited success. Additionally, audio generated with previous approaches ([28], [30]) is sensitive to naturally occurring external noise, which fails to exploit the target model in a real-world setting. This is in line with previous results of physical attacks in the image domain [52]. Instead, we focus on the evasion style of attack, where the attack is considered successful if the ASR system transcribes the attack audio incorrectly or the AVI system misidentifies the speaker of the attack audio. We propose a completely black-box approach that does not consider model specifics as a means of bypassing these limitations.

    E. Detection and Defense

We evaluate our attack against the adversarial audio detection mechanism based on temporal dependencies [82]. This is the only method designed specifically to detect adversarial audio samples, and it has demonstrated excellent results: it is lightweight, simple, and highly effective at detecting traditional adversarial attacks. The mechanism takes as input an audio sample, which can be either adversarial or benign. The audio sample is partitioned into two, and only the first half is retained. Next, the entire original audio sample and the first partition are passed to the model and the transcriptions are recorded. If the transcriptions are similar, the audio sample is considered benign; however, if the transcriptions differ, the audio sample is considered adversarial. This is because adversarial attack algorithms distort the temporal dependence within the original audio sequences. The temporal dependency-based detection is designed to capture this information and use it for attack audio detection.
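A minimal sketch of this detection logic, assuming hypothetical `transcribe` and `wer` helpers (an ASR query and a word-error-rate function; neither is a specific library API):

```python
def td_detect(audio, transcribe, wer, threshold=0.5):
    """Temporal-dependency check in the spirit of [82]: compare the
    transcript of the first half against the matching prefix of the
    full transcript; a large mismatch flags the sample as adversarial."""
    full_text = transcribe(audio)
    half_text = transcribe(audio[: len(audio) // 2])
    k = len(half_text.split())
    prefix = " ".join(full_text.split()[:k])
    return wer(prefix, half_text) > threshold  # True => likely adversarial
```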

Additionally, we evaluate our attack against an adversarial training based defense. However, we have placed the steps, the methodology, and the results in Appendix C.

    IV. SETUP

Our experimental setup involved two datasets, four attack scenarios, two test environments, seven target models, and a user study. We discuss the relevant details of our experiments here.

A. TIMIT Data Set

The TIMIT corpus is a standard dataset for speech processing systems. It consists of 10 phonetically diverse English sentences spoken by each of 630 male and female speakers [35]. Additionally, there is metadata for each speaker that includes the speaker's dialect. In our tests, we randomly sampled six speakers, three male and three female, from each of four regions (New England, Northern, North Midland, and South Midland). We then perturbed all 10 sentences spoken by our speakers using our technique, and also extracted phonemes for our phoneme-level experiments. In total, we attacked 240 recordings with 7600 phonemes.

    B. Word Audio Data Set

Testing the word-level robustness of an ML model poses challenges in terms of experimental design. Although there exist well-researched datasets of spoken phrases for speech transcription [58], [11], partitioning the phrases into individual words based on a noise threshold is not ideal. In this case, the only way to control the distribution of candidate phrases would be to pass them to a strong transcription model, while discarding audio samples which are mistranscribed. Doing so may bias the candidate attack samples towards clean, easy to understand samples. Instead, we build a word-level candidate dataset using a public repository of the 1,000 most common words in the English language [1]. We then download audio for each of the 1,000 words using the Shotooka spoken-word service [9].

    C. Attack Scenarios

In order to test our technique in a variety of different ways, we performed four attacks: word level, phoneme level, ASR poisoning, and AVI poisoning. We also tested the transferability of the attack audio samples across models.

1) Word Level Perturbations: Using the 1,000 most common words, we performed our attack as described in Section III-B. We optimized our threshold using the technique outlined in Section III-D, stopping either after the threshold value had converged or after a maximum of 15 queries.

2) Phoneme Level Perturbations: Next, we ran perturbations on individual phonemes rather than entire words. The goal of this attack was to cause mistranscription of an entire word by perturbing only a single phoneme. We tested this attack on audio files from the TIMIT corpus and replaced the regular phoneme with its perturbed version in the audio file. The audio sample was then passed to the ASR system for transcription. We repeated this process for every phoneme using the binary search technique outlined in Section III-D.

3) ASR Poisoning: ASR systems often use the previously transcribed words to infer the transcription of the next [31], [63], [21]. Perturbing a single word's phonemes not only affects the model's ability to transcribe that word, but also the words that come after it. We wished to measure the effect of perturbing a single phoneme on the transcription of the remaining words in a sentence. To do this, we generated adversarial audio samples by perturbing a single phoneme while keeping the remaining audio intact for the sentences of the TIMIT dataset. We repeated this for every phoneme in the dataset and passed the attack audio samples to the ASR system. The cosine similarity metric was used to measure the transcription similarity between the attack audio and the original audio.

When perturbing phonemes, we do not use the attack optimization described in Section III-D. Since the average length of a phoneme in our dataset was only 31 ms, a single perturbed phoneme in a sentence does not significantly impact audio comprehension. Therefore, we simply discard half of all decomposition coefficients in a single 31 ms window during the thresholding step. This maintains the quality of the adversarial audio, while still forcing the model to mistranscribe.

4) AVI Poisoning: We also evaluate our attack's performance against AVI systems. To do so, we first trained an Azure Speaker Identification model [2] to recognize 20 speakers. We selected 10 male and 10 female speakers from the TIMIT dataset to serve as our subjects. For each speaker, seven sentences were used for training, while the remaining three sentences were used for attack evaluation. We perturbed only a single phoneme while the rest of the sentence was left unaltered. We passed both the benign and adversarial audio samples to the model. The attack was considered a success if the AVI model output different labels for the two samples. This attack setup is similar to the one for ASR poisoning, except that here we target an AVI system.

    D. Models

We choose a set of seven models that are representative of the state-of-the-art in ASR and AVI technology, given their high accuracy [15], [14], [5], [33]. These include a mixture of both proprietary and open-source models to expose the utility of our attack. However, all are treated as black-box.

Google (Normal): To demonstrate our attack in a truly black-box scenario, we target the speech transcription APIs provided by Google. The ‘Normal’ model is provided by Google for ‘clean’ use cases, such as in home assistants, where the speech is not expected to traverse a cellular network [4].

Google (Phone): To demonstrate our attack against a model trained for noisy audio, we test the attack against the ‘Phone’ model. Google provides this model for cellular use cases and trained it on call audio that is representative of cellular network compression [12]. We also assume that the Google ‘Phone’ model will be robust against the noise, jitter, loss, and compression introduced to audio samples that have traveled through the telephony network.

Facebook Wit: To ensure better coverage across the space of proprietary speech transcription services, we also target Facebook Wit, which provides access to a ‘clean’ speech transcription model [16]. As before, no information is known about this model due to its proprietary nature.

Deep Speech-1: The goal of Deep Speech-1 was to eliminate hand-crafted feature pipelines by leveraging a paradigm known as end-to-end learning [39], [41]. This results in robust speech transcription despite noisy environments, if sufficient training data is provided. For our experiments, we use an open-source implementation and checkpoint provided by Mozilla with MFCC features [7].

Deep Speech-2: Deep Speech-2 introduced architecture optimizations for very large training sets. It is trained to map raw audio spectrograms to their correct transcriptions, and demonstrates the current state-of-the-art in noisy, end-to-end audio transcription [20]. We use an open-source implementation² trained on LibriSpeech provided by GitHub user SeanNaren [58], [56]. The primary difference between our two tested versions is feature preprocessing: the tested version of Deep Speech-1 uses MFCC features, while the tested version of Deep Speech-2 uses raw audio spectrograms.

CMU Sphinx: The CMU Sphinx project is an open-source speech transcription repository representing over twenty years of research in this task [50]. Sphinx does not heavily rely on deep learning techniques, and instead implements a combination of statistical methods to model speech transcription and high-level language concepts. We use the PocketSphinx implementation and checkpoints provided by the CMU Sphinx repository [50].

Microsoft Azure: To demonstrate our attack against AVI systems in a black-box environment, we attack the Speaker Identification API provided by Microsoft Azure [2]. This system is proprietary and hence completely black-box; there is no publicly available information about the internals of the system.

Although the landscape of audio models is ever-changing, we believe our selection represents an accurate approximation of both traditional and state-of-the-art methods. Intuitively, future models will derive from the existing state-of-the-art in terms of data and implementation. We also compare against open-source and proprietary models to show that our approach generalizes to any model regardless of a lack of a priori knowledge.

    E. Transferability

We measure the transferability of our proposed attack by finding the probability that the attack samples for one model will successfully exploit another. This is done by creating a set of successful word-level attack audio samples $X^*_m$ for a model $m$, then averaging their calculated Mean Squared Error (MSE) distortion, $\overline{\mathrm{MSE}}_m$. Intuitively, this average MSE will be higher for stronger models, and lower for weaker models. This acts as a ‘hardness score’ for a given model and is used to compare the attack audio sets of two models. Now consider a baseline model $f$, a comparison model $g$, the successful attack transfer event $S_{f \to g}$, and the number of audio samples in each model's attack audio set $n = |X^*_f| = |X^*_g|$. We can calculate the attack transfer probability from $f$ to $g$ as the probability of sampling attack audio from $X^*_f$ whose distortion meets or surpasses the score $\overline{\mathrm{MSE}}_g$. We denote this probability $P(S_{f \to g})$ and the set of potentially transferable audio samples as $V_{f \to g}$. We calculate the probability using the equation
$$P(S_{f \to g}) = \frac{|V_{f \to g}|}{n},$$
where we build the set of transferable attack samples such that $V_{f \to g} = \{x^*_{f,i} \in X^*_f : \mathrm{MSE}(x^*_{f,i}) \ge \overline{\mathrm{MSE}}_g\}$.
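In code, this probability can be computed directly from the per-sample distortions; a minimal sketch, where `mse_f` and `mse_g` hold the MSE values of the successful attack sets for models f and g:

```python
import numpy as np

def transfer_probability(mse_f, mse_g):
    """P(S_{f->g}): fraction of model f's successful attack samples whose
    distortion meets model g's average MSE ('hardness') score."""
    hardness_g = np.mean(mse_g)
    v_fg = [m for m in mse_f if m >= hardness_g]  # the set V_{f->g}
    return len(v_fg) / len(mse_f)
```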

Thus, we approximate the probability of sampling a piece of audio which meets the ‘hardness’ of model $g$ from the set which was successful over model $f$. This value is calculated across each combination of models in our experiments for the SSA and DFT transforms, with $n$ set to 1,000 as a result of using our Word Audio data set.

²At the time of running our experiments, the implementation did not include a language model to aid in the beam-search decoding.

    F. Detection

Our experimental method was designed to be as close as possible to that of the original authors [82]. We assume that the attacker is not aware of the existence of the defense mechanism or the size of the partition. Therefore, we perturbed the entire audio sample using our attack, to maximally distort the temporal dependencies. Next, we partitioned the audio sample into two halves. We passed both the entire audio sample and the first half to the Google Speech API for transcription. We conducted this experiment with 266 adversarial audio samples generated using the DFT perturbation method at a factor of 0.07, and a set of 534 benign audio samples. The audio samples for both the benign and adversarial sets were taken from the TIMIT dataset. Like the authors, we use Word Error Rate (WER) as a measure of transcription similarity, and we calculate and report the Area Under Curve (AUC) value. AUC values lie between 0.5 and 1: a perfect detector will return an AUC of 1, while a detector that randomly guesses returns an AUC of 0.5.
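For instance, the AUC over per-sample WER scores could be computed with scikit-learn, assuming arrays `labels` (1 for adversarial, 0 for benign) and `wer_scores` (the WER between the full-audio and half-audio transcripts for each sample):

```python
from sklearn.metrics import roc_auc_score

# Higher WER between the two transcripts should indicate an attack,
# so the WER itself serves as the detector's score.
auc = roc_auc_score(labels, wer_scores)
print(f"Detector AUC: {auc:.2f}")  # 0.5 = random guessing, 1.0 = perfect
```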

    G. Over-Cellular

The Over-Cellular environment simulates a more realistic scenario where an adversary's attack samples have to travel through a noisy and lossy channel before reaching the target system. Additionally, this environment accurately models one of the most common mediums used for transporting human voice – the telephony network. In our case, we did this by sending the audio through AT&T's LTE network and the Internet via Twilio [13] to an iPhone 6. The attacker's audio is likely to be distorted while passing through the cellular network due to jitter, packet loss, and the various codecs used within the network. The intuition for testing in this environment is to ensure that our attacks are robust against these kinds of distortions.

    H. MTurk Study Design

In order to measure comprehension of perturbed phone call audio, we conducted an IRB-approved online study. We ran our study using Amazon's Mechanical Turk (MTurk) crowdsourcing platform. All participants were between the ages of 18 and 55 years old, located in the United States, and had an approval rating for task completion of more than 95%. During our study, each participant was asked to listen to audio samples over the phone. Parts of the audio samples had been perturbed, while others had been left unaltered.³ The audio samples were delivered via an automated phone call to each of our participants. The participants were asked to transcribe the audio of a pre-recorded conversation that was approximately one minute long. After the phone call was done, participants answered several demographic questions, which concluded the study. We paid participants $2.00 for completing the task. Participants were not primed about the perturbation of the pre-recorded audio, which prevented introducing any bias during transcription. In order to make our study statistically sound, we ran a sample size calculation under the assumption of the following parameter values: an effect size of 0.05, a type-I error rate of 0.01, and a statistical power of 0.8. Under these values, our required sample size was calculated to be 51, and we ended up recruiting 89 participants on MTurk. Among these 89, 23 participants started but did not complete the study and 5 participants had taken the study twice. After discarding duplicate and incomplete data, our final sample size consists of 61 participants.

³We invite the readers to listen to the perturbed audio samples for themselves: https://sites.google.com/view/transcript-evasion

[Figure 4 plot: Successful Transcriptions (%) vs. Mean Squared Error Between Original and Attack Audio, for the DFT Attack (top) and the SSA Attack (bottom); models: Sphinx, DeepSpeech-1, Wit, Google (Normal), Google (Phone).]

Fig. 4: Successful transcriptions against our word-level attack, plotted against increasing distortion, calculated using Mean Squared Error (MSE). The SSA-based word-level attack sees a faster, sharper decrease in successful transcriptions than the DFT-based word-level attack, noted by its ability to reach 50% attack success (solid black line) across all models within a smaller span of distortion. This means 50% of the words in the dataset were mistranscribed by the target ASR. In every case, the test set accuracy falls considerably before reaching the GSM baseline distortion (dashed red line).

    V. RESULTS

As outlined previously in Section IV-C, we evaluated our attack in various configurations in order to highlight certain properties of the attack. To begin, we evaluate our attack against the speech-to-text capabilities of multiple ASR systems in several different setups.

    A. Attacks Against ASR systems

1) Word Level Perturbations: We study the effect of our word-level attack against each model. We measure attack success against distortion and compare the DFT and SSA attacks, shown in Figure 4. In this subsection, we discuss five target models: Google (Normal), Google (Phone), Wit, DeepSpeech-1, and Sphinx. Distortion is calculated using the MSE between every normal audio sample and its adversarial counterpart.

We use the GSM audio codec's average MSE as a baseline for audio comprehension, as GSM is used by 2G cellular networks (the most common globally). We denote this baseline with the red, vertical dashed line in Figure 4. Thus, we consider any audio with a higher MSE than the baseline to be totally incomprehensible to humans. It is important to note that this assumption is extremely conservative, since normal, comprehensible phone call audio often has a larger MSE than our baseline.

Figure 4 shows that as distortion is iteratively increased using the word-level attack, test set accuracy begins to diminish across all models and all transforms. Models whose accuracy decreases more slowly, such as Google (Phone), exhibit higher robustness to our attack. In contrast, weaker models, such as Deep Speech-1, exhibit a sharper decline. For all transforms, the Google (Phone) model outperforms the Google (Normal) model. This indicates that training the Google (Phone) model on noisy audio data provides a rudimentary form of adversarial training. However, all attacks eventually reach a success rate of at least 85% while retaining audio quality that is comprehensible to humans.

Despite implementing more traditional machine learning techniques, Sphinx exhibits more robustness than Deep Speech-1 across both attacks. This indicates that Deep Speech-1 may be overfitting to certain words or phrases, and that its existing architecture is not appropriate for publicly available training data. Due to the black-box nature of Wit and the Google models, it is difficult to compare them directly to their white-box counterparts. Overall, Sphinx is able to match Wit's performance, which is also more robust than the Google (Normal) model in the DFT attack.

Surprisingly, for the SSA attack, Sphinx is able to outperform all models as distortion approaches the human perceptibility baseline. This may be a byproduct of the handcrafted features and models built into Sphinx. Overall, the SSA-based attack manages to induce less distortion, allowing all models to fail at 50% (represented by the horizontal black line) or less test set accuracy before a calculated MSE of 0.0100. Manual listening tests showed that there was no perceivable drop in audio quality at this MSE.

    2) Phoneme Perturbations: Our evaluation of the phonemelevel attacks exposed several trends, as shown in Figures 5and 12. For brevity, only Google (Normal) and Wit are shownacross each data transform, with complete charts available inSection D of the Appendix.

Figure 5 shows the relationship between phonemes and attack success. Lower bars correspond to a greater percentage of attack examples that were incorrectly transcribed by the model. According to the figure, vowels are more sensitive to perturbations than other phonemes. This pattern was consistently present across all the models we evaluated. There are a few possible explanations for this behavior. Vowels are the most common phonemes in the English language and their formant frequencies are highly structured. It is possible these two aspects in tandem force the ASR system to over-fit to the specific formant pattern during learning. This would explain why vowel perturbations cause the model to mistranscribe.


[Figure 5: two bar charts of attack success (%) per phoneme, one panel for Google (Normal) and one for Wit, with phonemes grouped into stops, affricates, fricatives, nasals, semivowels and glides, and vowels.]

Fig. 5: A comparison of the attack success of our DFT-based phoneme-level attack against two ASR models. There is a clear relationship between which phoneme is attacked and the attack's success. Across all models that we evaluated, vowels are more vulnerable to this attack than other phonemes.

Similarly, Figure 12 shows the distortion thresholds needed for each phoneme to cause mistranscription of a phrase. The longer the bar, the greater the required threshold. For the DFT-based attack experiment, shown in Figure 12(a), we observe that the vowels require a lower threshold compared to the other phonemes. This means that less distortion is required for a vowel to trick the ASR system. In contrast, the SSA-based attack experiments, as shown in Figure 12(b), reveal that all of the phonemes are equally vulnerable. In general, the SSA attacks required a higher threshold than our previous DFT attack. However, the MSE of the perturbed audio file compared to the original audio is still small. The average MSE during these tests was 0.0067, which is an order of magnitude less than the MSE of audio sent over LTE (0.0181).

Our SSA attacks did not appear to expose any systemic vulnerability in our models as the DFT-based attacks did. There exist two likely causes for this: the DFT's use in ASR feature extraction, and SSA's data dependence. ASR systems often use DFTs as part of their feature extraction mechanisms, and thus the models are likely learning some characteristics that are dependent on the DFT. When our attack alters the DFT, we are directly altering acoustic characteristics that affect which features the model extracts and learns.

Additionally, when SSA is used in our attack, we are removing eigenvectors rather than specific frequencies as we do with the DFT. These eigenvectors are made up of multiple frequencies that are not unique to any one eigenvector. Thus, the removal of an eigenvector does not guarantee the complete removal of a given frequency from the original audio sample. We believe the combination of these two factors results in our SSA-based attack being equally effective against all phonemes.
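To make the distinction concrete, the following is a minimal sketch of the two perturbation styles, assuming a mono float signal; the `threshold` parameter stands in for our attack factor, and the SSA variant is a textbook trajectory-matrix/SVD implementation rather than our exact code:

```python
import numpy as np

def dft_attack(audio: np.ndarray, threshold: float) -> np.ndarray:
    """Discard low-magnitude DFT components, then reconstruct the signal."""
    spectrum = np.fft.rfft(audio)
    mags = np.abs(spectrum)
    # Zero every frequency bin whose magnitude falls below a fraction of
    # the strongest bin; 'threshold' plays the role of the attack factor.
    spectrum[mags < threshold * mags.max()] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

def ssa_attack(audio: np.ndarray, window: int, threshold: float) -> np.ndarray:
    """Discard low-energy SSA components (eigenvectors), then reconstruct."""
    n = len(audio)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column i is the lagged window audio[i:i+window].
    X = np.column_stack([audio[i:i + window] for i in range(k)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s >= threshold * s.max()              # drop weak components
    X_hat = (U[:, keep] * s[keep]) @ Vt[keep]
    # Diagonal averaging maps the rank-reduced matrix back to a 1-D signal.
    out, counts = np.zeros(n), np.zeros(n)
    for col in range(k):
        out[col:col + window] += X_hat[:, col]
        counts[col:col + window] += 1
    return out / counts

# Toy usage on a synthetic two-tone signal.
t = np.linspace(0, 1, 8000, endpoint=False)
sig = np.sin(2 * np.pi * 440 * t) + 0.05 * np.sin(2 * np.pi * 3000 * t)
perturbed = dft_attack(sig, threshold=0.10)          # removes the faint tone
smoothed = ssa_attack(sig[:2000], window=50, threshold=0.10)
```

Note how, in the SSA variant, each discarded component is a rank-one matrix spanning many frequency bins, which is why it cannot surgically remove a single frequency the way the DFT variant can.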

Both Figures 5 and 12 provide information an attacker can use to maximize their attack success probability. Perturbing vowels at a 0.5 threshold while using the DFT-based attack provides the highest probability of success, because vowels are vulnerable across all models. Even though the discard threshold applied to vowels might vary from one model to another, choosing a threshold value of 0.5 can guarantee both stealth and a high likelihood of a successful mistranscription.

3) ASR Poisoning: As described in Section IV-C2, perturbing a single phoneme causes a mistranscription not only of the given word but of the following words as well. Examples of this phenomenon can be seen in the accompanying table of example transcriptions. We further characterize it numerically across each model for the DFT-based attack in Figure 6, where higher values of cosine similarity translate to fewer mistranscribed words.

Fig. 6: Cosine similarity between the transcriptions of the original and the perturbed audio file. At a value of 0.5 (horizontal line), half of the sentence is incorrect. Attack audio samples were generated by perturbing a single phoneme.


We observe a relationship between the model type and the cosine similarity score. Of all the models tested, Wit is the most vulnerable, with a low average cosine similarity of 0.36. On the other hand, the Google (Normal) model seems to be least vulnerable, with the highest cosine similarity of 0.78. To better characterize the phenomenon, we use the cosine similarity value to estimate the number of words that the attack affects, assuming sentences comprised of 10 words each. Perturbing a single phoneme can force Wit to mistranscribe the next seven words. In contrast, only two of the next 10 words will be mistranscribed by the Google (Normal) model. The robustness of the Google (Normal) model might be due to its internal recurrent layers weighting previously transcribed content less heavily. It is also interesting to note that, despite their common internal structure, Deep Speech-1 and Deep Speech-2 differ significantly in their vulnerability to this effect, with cosine similarity scores of 0.4 and 0.7, respectively. This difference could potentially be attributed to their different feature extraction mechanisms: while Deep Speech-1 uses MFCCs, its counterpart uses a CNN for feature extraction, and the two may capture divergent information about the signal.
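For illustration, a minimal sketch of this metric follows, under the assumption (ours, not stated in the text above) that transcriptions are compared as bag-of-words frequency vectors:

```python
from collections import Counter
import math

def transcription_similarity(original: str, perturbed: str) -> float:
    """Cosine similarity between word-frequency vectors of two transcriptions."""
    a, b = Counter(original.lower().split()), Counter(perturbed.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(transcription_similarity("the emperor had a mean temper",
                               "syempre hanuman temple"))  # -> 0.0
```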

Observing the models individually, there is no observable relationship between the cosine similarity and the phoneme type.


Model             Original Transcription                                 Attack Transcription
Google (Normal)   the emperor had a mean Temper                          syempre Hanuman Temple
Google (Normal)   then the chOreographer must arbitrate                  Democrat ographer must arbitrate
Wit               she had your dark suit in Greasy wash water all year   nope
Wit               masquerade parties tax one'S imagination               stop

By only perturbing a single phoneme (marked here by the capitalized letter), our attack forces ASR systems to completely mistranscribe the resulting audio.

P (Sf→g)                               To (g)
From (f)           Google (Phone)   Google (Normal)   Wit     Sphinx   Deep-Speech 1
Google (Phone)     100%             78%               83%     42%      87%
Google (Normal)    13%              100%              65%     22%      70%
Wit                6%               10%               100%    14%      52%
Sphinx             21%              74%               81%     100%     80%
Deep-Speech 1      3%               7%                31%     12%      100%

TABLE I: The probability of transferability P (Sf→g) calculated for each combination of the tested models. Only 'harder' models tend to transfer well to weaker models. The elements in bold show the highest transferability successes. Model names in the columns have been arranged in descending order of their strength, from harder to weaker.

All phonemes seem to be approximately equally vulnerable to the attack. This means that models do not use knowledge of previously occurring phonemes when transcribing the current word. In fact, the results above show that the models use the current phonemes and previous words in combination to transcribe the current word, which intuitively is the intention behind modern ASR system design.

B. Attacks Against AVI systems

Next, we observe our attacks' effectiveness at inducing errors in an AVI system. Figure 7 shows the attack success rate per phoneme for both the SSA and DFT attacks. Similar to our attack results against ASR models, attacks against the AVI system exhibited a higher success rate when attacking vowels. This means an adversary wishing to maximize attack success should focus on perturbing vowels. Figure 11 shows the relative amount of perturbation necessary to force an AVI misclassification. The SSA attack requires a relatively high perturbation for every phoneme. In contrast, the DFT attack requires a smaller degree of perturbation for all phonemes except some vowels. This implies that the DFT attack could be conducted more stealthily by intelligently selecting certain phonemes. The smaller the perturbation required, the stealthier the attack. However, because the SSA attack requires a larger degree of perturbation for most phonemes, the level of stealth it can achieve is relatively lower.

    C. Transferability

Table I shows the results of the transferability experiments using the SSA-based attack. Overall, the attack has the highest transfer probability when a 'harder' model is used to generate attack audio samples. The Google (Phone) model had the highest average threshold across samples which, as discussed in Section IV-E, translated to the highest transfer probability. In contrast, a weaker model will have a lower threshold and thus be less likely to transfer. This can be seen when treating Sphinx as the baseline model in Table I. The table shows that, in the worst case, attack audio generated for the Google (Phone) model will also be effective against any other model at least 42% of the time. This ensures a high probability of success even in the extreme case when the adversary does not know which model the victim will be employing. By generating attack samples on a stronger model, an adversary can expect an attack to transfer well as long as the victim model is weaker. Finding a weaker model is trivial: as long as the adversary has sufficient queries, they can compare transcription rates for a candidate audio sample between the two models.

    D. Detection

For the detection experiments, we created a set of 266 adversarial samples, perturbed using the DFT technique, while the benign set consisted of 534 unperturbed audio samples. Of the adversarial samples provided to the Google Speech API, 20% did not produce any transcription. This was true for both the entire audio samples and their corresponding partitions, meaning the WER for these samples was 0. Though this outcome is ideal for the dissident, it introduces a bias in our results: there are two benign cases in which the WER will also be zero, namely benign audio and very noisy audio. We therefore discarded these audio samples from our adversarial set. In the real world, an attacker can merely reduce the attack factor to prevent the model from producing no transcription. Next, we calculated the AUC scores for the samples. In our case, the AUC value was 0.527, which is far lower than the AUC value of 0.936 reported by the authors for the attacks they tested. This means that even though the temporal-based detection can do an excellent job of detecting other attacks, it is highly inaccurate at detecting our attack samples.
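To illustrate what an AUC near 0.5 means here, the following toy sketch computes an AUC over synthetic stand-in detection scores (e.g., a WER-style disagreement score per sample); the numbers are fabricated for illustration and are not our measured data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-sample detection scores: nearly overlapping
# benign and adversarial distributions, mirroring a barely-separable case.
benign = rng.normal(0.20, 0.10, 534).clip(0, 1)        # 534 benign samples
adversarial = rng.normal(0.21, 0.10, 266).clip(0, 1)   # 266 adversarial samples

labels = np.concatenate([np.zeros(len(benign)), np.ones(len(adversarial))])
scores = np.concatenate([benign, adversarial])
print(f"AUC = {roc_auc_score(labels, scores):.3f}")    # near 0.5: chance level
```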

    E. Per-Layer Effects

We now probe more deeply into why the attacks succeed. In particular, we explore the attacks' effects at each layer of a typical deep learning-based speech transcription model. We begin by opening our local deployment of Deep Speech-2, which represents a state-of-the-art architecture for this task. The model was provided by an open-source community and was not trained on the same breadth of data as the original proposal [20]. However, examining such a deployment gives insight into the attack's effects on a publicly available, open-source implementation. We begin by recording the activations at each layer of the Deep Speech-2 architecture for several pairs of original and word-level perturbed audio samples.

We treat the activation of the original audio as a baseline and subtract this from the activation of the perturbed audio to find the net effect. We quantify their distance using the L2-norm, then divide this value by the L2-norm of the original activation. In total, the tested implementation is comprised of thirteen logical layers⁴. Thus, for every normal audio x, adversarial audio x*, and layer h_l, l ∈ {1, ..., 13}, writing the L2-norm as ‖·‖₂ for brevity, the layer's net change is calculated as ∆h_l = ‖h_l(x) − h_l(x*)‖₂ / ‖h_l(x)‖₂.

⁴We treat the combination of convolutional, pooling, and activation layers as one logical convolution layer. For recurrent layers, we treat the BatchNorm-RNN [20] as one logical recurrent layer.


[Figure 7: two bar charts of attack success (%) per phoneme against Azure, one panel for the DFT attack and one for the SSA attack, with phonemes grouped into stops, affricates, fricatives, nasals, semivowels and glides, and vowels.]

Fig. 7: Success rate of our attack against an Automatic Voice Identification (AVI) system. When perturbing a single phoneme in the entire audio sample, an adversary has a greater chance of succeeding with an SSA attack than with a DFT attack. Additionally, similar to the observation in Figure 5, vowels are more vulnerable than other phonemes.

[Figure 8: bar chart titled "Relative Change of L2 Norm per Layer"; x-axis: logical layers Conv.0–Conv.5, RNN.0–RNN.4, FC, Softmax; y-axis: relative change (L2 norm) from 0.0 to 1.0.]

Fig. 8: Calculated net change ∆h_l for each layer in the tested Deep Speech-2 model (convolutional (Conv), recurrent (RNN), fully connected (FC), and Softmax). Lower-level convolutional layers tend to exhibit a higher adversarial effect than upper-level recurrent and fully connected layers.


This process was repeated for every benign-perturbed audio pair in our corpus, which is the same corpus built for the word-level perturbation experiments in Section IV-C1. We average across every pair, then repeat this process for every logical layer h_l in the Deep Speech-2 architecture to produce the results shown in Figure 8. We observe a higher rate of change for the low-level convolutional layers than for the upper layers of the network. These convolutional layers learn to mimic and surpass the MFCC preprocessing filter. The net change immediately diminishes as the sample passes through the recurrent layers, reflecting the intuition that changes at individual frames, phonemes, or words may be mitigated by temporal information. However, we also observe a non-zero net change in the fully connected and softmax layers, indicating that the shift was enough to force mistranscription. This points to the CNNs as the main source of the attack vulnerability.
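A minimal sketch of this measurement using PyTorch forward hooks follows; it assumes a model whose submodules emit tensor outputs (tuple-valued outputs, e.g., from raw RNN modules, are skipped here) and is not our exact instrumentation:

```python
import torch
import torch.nn as nn

def relative_layer_change(model: nn.Module, x: torch.Tensor,
                          x_adv: torch.Tensor) -> dict:
    """Per-module relative change ||h_l(x) - h_l(x*)||_2 / ||h_l(x)||_2."""
    activations = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Tuple outputs (e.g., raw RNN layers) are skipped in this sketch.
            if torch.is_tensor(output):
                activations[name] = output.detach().flatten()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n]
    try:
        model.eval()
        with torch.no_grad():
            model(x)                       # record baseline activations
            baseline = dict(activations)
            activations.clear()
            model(x_adv)                   # record perturbed activations
        return {n: ((baseline[n] - activations[n]).norm() /
                    baseline[n].norm()).item()
                for n in baseline if n in activations}
    finally:
        for h in handles:
            h.remove()

# Toy usage with a stand-in model (not Deep Speech-2).
net = nn.Sequential(nn.Conv1d(1, 4, 5), nn.ReLU(),
                    nn.Flatten(), nn.LazyLinear(10))
x = torch.randn(1, 1, 100)
print(relative_layer_change(net, x, x + 0.01 * torch.randn_like(x)))
```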

    F. Over-Cellular

Next, we test our attacks for use over a cellular network. We run previously successful attack audio samples over the cellular network before passing them again to the target models. The rate of success for this experiment is shown in Figure 9, plotted for the DFT and SSA-based attacks against each model. If our attack is to be used over a cellular connection, near real-time performance is important.

Fig. 9: The attack audio was sent over the cellular network and recorded on a mobile end-point. The figure shows the percentage of the attack audio that was still mistranscribed even after being passed over the cellular network.

The largest source of potential delay introduced by our attack is calculating the DFT of the original audio sample. While we do not conduct our own timing evaluation, Danielsson et al. showed that calculating a DFT on a commodity Android platform took approximately 0.5 ms for a block size of 4096 [33].

The DFT-based attack was more successful across the mobile network than the SSA-based attack and was only consistently filtered by the Deep Speech and Wit models. Overall, the models react differently depending on the transformation. Wit performs best under DFT transforms and worst under SSA transforms, while the opposite is true for Deep Speech-1 and the Google (Normal) model. Sphinx is equally vulnerable to both transformation methods.

An intuition for these results can be formed by considering the task each transformation performs. When transforming with the DFT, the attack audio sample forms around pieces of weighted cross-correlations from the original audio sample. In contrast, the SSA output is built as a sum of interpretable principal components. Although SSA may be able to produce fine-grained perturbations to the signal, the granularity of these perturbations is lost during the mobile network's compression. This effect is expected to be amplified at higher amounts of compression, although such a scenario also limits the perceptibility of benign audio.

    G. Amazon Turk Listening Tests

In order to evaluate the transcriptions produced by MTurk workers, we initially manually classified them as either correct or incorrect. Table II shows a side-by-side comparison of the original and attack transcriptions of the perturbed portion of the audio sample.


Original Transcription            Attack Transcription
How are you? How's work going?    How are you posmothdro?
I am really sorry to hear that.   I am relief for you to hear that.

TABLE II: Examples of the attacked audio which was played to MTurk workers and the corresponding transcriptions.

           Accuracy (Perturbed)        Accuracy (Benign)
           Male         Female         Male         Female
           91.8%        100%           98.36%       98.36%
           (56/61)      (61/61)        (60/61)      (60/61)

TABLE III: Transcription accuracy results of MTurk workers for benign and perturbed audio between male and female speakers.

Transcriptions which had either a missing word, a missing sentence, or an additional word not present in the audio were marked as incorrect. At the end of this classification task, the inter-rater agreement value (Cohen's kappa) was found to be 0.901, which is generally considered 'very good' agreement among individual raters. Our manual evaluation found only 11% of the transcriptions to be incorrect. More specifically, we found that incorrect transcriptions mostly had sentences missing from the beginning of the played audio sample, but did not contain any misinterpreted words. Our subjective evaluation did not consider wrong transcription of the perturbed vs. non-perturbed portions of the audio; rather, we only evaluated human comprehension of the audio sample.

In addition to the subjective evaluation, we ran a phoneme-level edit distance test on the transcriptions to compare transcription accuracy between perturbed and non-perturbed audio samples. We used the following formula for the phoneme edit distance φ: φ = δ/L, where δ is the Levenshtein edit distance between the phonemes of two transcriptions (the original transcription and the MTurk worker's transcription) for a word, and L is the phoneme length of the non-perturbed, normal audio for the word [28] (a sketch of this computation follows below). We defined accuracy as 1 when φ = 0, indicating an exact match between the two transcriptions; for any value φ > 0, we marked the transcription as inaccurate and assigned a value of 0. In Table III, we present transcription accuracy results for perturbed and non-perturbed audio across our final sample of 61 participants. We also ran a paired sample t-test, using the individual accuracy scores for perturbed and benign audio transcriptions, with the null hypothesis that participants' accuracy levels were similar for both cases. Our results showed participants had better accuracy transcribing non-perturbed audio samples (mean = 0.98, SD = 0.13) than perturbed audio (mean = 0.90, SD = 0.30). At a significance level of p < 0.01, our repeated-measures t-test found this difference not to be significant, t(60) = −2.315, p = 0.024. Recall that our chosen significance level (p < 0.01) was not arbitrary; rather, it was chosen during our sample size calculation for this study. Together, this suggests that our word-level audio perturbations create no obstacle for human comprehension in telecommunication tasks, thus supporting our null hypothesis.
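The following is a minimal sketch of the φ = δ/L computation; the phoneme sequences shown are hypothetical stand-ins for a real phonetic lookup:

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (pa != pb)))   # substitution
        prev = curr
    return prev[-1]

def phoneme_accuracy(reference, transcribed):
    """Accuracy is 1 only on an exact match (phi = delta / L = 0), else 0."""
    phi = levenshtein(reference, transcribed) / len(reference)
    return 1 if phi == 0 else 0

# Hypothetical ARPAbet phonemes for the word "temper".
print(phoneme_accuracy(["t", "eh", "m", "p", "er"],
                       ["t", "eh", "m", "p", "er"]))  # -> 1
```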

    VI. DISCUSSION

    A. Phoneme vs. Word Level Perturbation

Our attack aims to force a mistranscription while remaining indistinguishable from the original audio to a human listener. Our results indicate that at both the phoneme level and the word level, the attack is able to fool black-box models while keeping audio quality intact. However, the choice between word-level and phoneme-level attacks depends on factors such as the attack success guarantee, audible distortion, and speed. The adversary can achieve guaranteed attack success for any word in the dictionary if word-level perturbations are used. However, this is not always true for a phoneme-level perturbation, particularly for phonemes which are phonetically silent: an ASR system may still properly transcribe the entire word even if the chosen phoneme is maximally perturbed. Phoneme-level perturbations may introduce less human-audible distortion to the entire word, as the human brain is well suited to interpolate speech and can compensate for a single perturbed phoneme. In terms of speed, creating word-level perturbations is significantly slower than creating phoneme-level perturbations. This is because a phoneme-level attack requires perturbing only a fraction of the audio samples needed when attacking an entire word.

    B. Steps to Maximize Attack Success

An adversary wishing to launch an over-cellular evasion attack on an ASR system would be best off using the DFT-based phoneme-level attack on vowels, as it guarantees a high level of attack success. Our transferability results show that an attacker can generate samples for a known 'hard' model such as Google (Phone) and have reasonable confidence that the attack audio will transfer to an unknown ASR model. From our ASR poisoning results, we observe that an adversary does not have to perturb every word to achieve 100% mistranscription of the utterance. Instead, the attacker can perturb a vowel of every other word in the worst case, and of every fifth word in the best case; the ASR poisoning effect will ensure that the non-perturbed words are also mistranscribed. Finally, the attack audio samples have a high probability of surviving the compression of a cellular network, which enables the success of our attack over lossy and noisy mediums.

Contrary to an ASR system attack, an adversary looking to execute an evasion attack on an AVI system would prefer to use the SSA-based phoneme-level attack. Similar to ASR poisoning, we observe that an adversary does not have to perturb the entire sentence to cause a misidentification, but rather just a single phoneme of a word in the sentence. Based on our results, the attacker would need to perturb on average one phoneme every 8 words (roughly 33 phonemes) to ensure a high likelihood of attack success. The attack audio samples are generated in an identical manner for both the ASR and AVI system attacks; thus, the AVI attack audio should also be robust against lossy and noisy mediums (e.g., a cellular network).

    C. Why the Attack Works

Our attacks exploit the fundamental difference in how the human brain and ASR/AVI systems process speech. Specifically, our attack discards low-intensity components of an audio sample, which the human brain is primed to ignore.


The remaining components are enough for a human listener to correctly interpret the perturbed audio sample. On the other hand, the ASR or AVI systems have unintentionally learned to depend on these low-intensity components for inference. This explains why removing such insignificant parts of speech confuses the model and causes a mistranscription or misidentification. Additionally, this may also explain some portion of ASR and AVI error on regular testing data sets. Future work may use these revelations to build more robust models and to explain and reduce ASR and AVI system error.

    D. Audio CAPTCHAs

In addition to helping dissidents overcome mass surveillance, our attack has other applications as well, specifically in the domain of audio CAPTCHAs. These are often used by web services to validate the presence of a human. CAPTCHAs rely on humans being able to transcribe audio better than machines, an assumption that modern ASR systems call into question [26], [77], [70], [73], [24]. Our attack could potentially be used to intelligently distort audio CAPTCHAs as a countermeasure to modern ASR systems.

    VII. RELATED WORK

Machine Learning (ML) models, and in particular deep learning models, have shown great performance advancements in previously complex tasks, such as image classification [47], [75], [42] and speech recognition [61], [20], [39]. However, previous work has shown that ML models are inherently vulnerable to a class of attacks known as Adversarial Machine Learning (AML) [43].

Early AML techniques focused on visually imperceptible changes to an image that cause the model to incorrectly classify it. Such attacks target either specific pixels [49], [76], [38], [22], [74], [54] or entire patches of pixels [25], [72], [60], [29]. In some cases, the attack generates entirely new images that the model classifies as an adversary's chosen target [57], [51].

However, the success of these attacks is a result of two restrictive assumptions. First, the attacks assume that the underlying target model is a form of neural network. Second, they assume the model can be influenced by changes at the pixel level [60], [57]. These assumptions prevent image attacks from being used against ASR models. ASR systems have found success across a variety of ML architectures, from Hidden Markov Models (HMMs) to Deep Neural Networks (DNNs). Further, since audio data is normally preprocessed for feature extraction before entering the statistical model, the models initially operate at a higher level than the 'pixel level' of their image counterparts.

To overcome these limitations, previous works have proposed several new attacks that exploit behaviors of particular models. These attacks can be categorized into three broad techniques that generate audio that is: a) inaudible to the human ear but detected by the speech recognition model [84]; b) noisy, such that it sounds like noise to the human but is correctly deciphered by the automatic speech recognition system [80], [28], [18]; or c) pristine, such that the audio sounds normal to the human but is deciphered as a different, chosen phrase [83], [27], [37], [44], [19], [46], [32], [71], [48]. Although they may seem the most useful, attacks in the third category are limited in their success, as they often require white-box access to the model.

Attacks against image recognition models are well studied, giving attackers the ability to execute targeted attacks even in black-box settings. This has not yet been possible against speech models [23], even for untargeted attacks in a query-efficient manner. That is, both targeted and untargeted attacks require knowledge of the model internals (such as architecture and parameterization) and a large number of queries to the model. In contrast, we propose a query-efficient black-box attack that generates an attack audio sample that will be reliably mistranscribed by the model, regardless of architecture or parameterization. Our attack can generate an attack audio sample in logarithmic time, while leaving the audio quality mostly unaffected, as the sketch below illustrates.
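The following sketch shows why the query cost can be logarithmic, under the assumption (ours, for illustration) that mistranscription is monotone in the attack factor; `perturb` would be, e.g., a transform like the DFT-based one sketched in Section V, and `transcribe` a query to the target API:

```python
def minimal_attack_threshold(audio, original_text, perturb, transcribe,
                             lo=0.0, hi=1.0, tol=1e-3):
    """
    Binary search for the smallest attack factor whose perturbation the
    model mistranscribes. Assuming success is monotone in the factor,
    only O(log(1/tol)) model queries are needed.
    """
    # Precondition (assumed): the maximum factor 'hi' forces mistranscription.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if transcribe(perturb(audio, mid)) != original_text:
            hi = mid   # success: try to lower the distortion
        else:
            lo = mid   # failure: more distortion is needed
    return hi

# Toy usage: a fake model that "mistranscribes" once enough energy is removed.
audio = [1.0] * 100
perturb = lambda a, f: [v * (1 - f) for v in a]
transcribe = lambda a: "hello" if sum(a) > 40 else "kenansville"
print(minimal_attack_threshold(audio, "hello", perturb, transcribe))  # ~0.6
```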

    VIII. CONCLUSION

Automatic speech recognition systems are playing an increasingly important role in security decisions. As such, the robustness of these systems (and the foundations upon which they are built) must be rigorously evaluated. We perform such an evaluation in this paper, with particular focus on speech transcription. By exhibiting black-box attacks against multiple models, we demonstrate that such systems rely on audio features which are not critical to human comprehension and are therefore vulnerable to mistranscription attacks when such features are removed. We then show that such attacks can be efficiently conducted as perturbations to certain phonemes (e.g., vowels) that cause significantly greater misclassification of the words that follow them. Finally, we not only demonstrate that our attacks work across models, but also show that the generated audio has no impact on understandability for users. This detail is critical, as attacks that simply obscure audio and make it useless to everyone are not particularly useful to the adversaries we consider. While adversarial training may help as a partial mitigation, we believe that more substantial defenses are ultimately required to defend against these attacks.

REFERENCES

[1] "1,000 Most Common US English Words," Last accessed in 2019, available at https://www.ef.edu/english-resources/english-vocabulary/top-1000-words/.

[2] "Azure Speaker Identification API," Last accessed in 2019, available at https://azure.microsoft.com/en-us/services/cognitive-servic/speaker-recognition/.

[3] "Background on CTIA's Semi-Annual Wireless Industry Survey," Last accessed in 2019, available at http://files.ctia.org/pdf/CTIA_Survey_YE_2012_Graphics-FINAL.pdf.

[4] "Google Cloud Speech-to-Text API," Last accessed in 2019, available at https://cloud.google.com/speech-to-text/.

[5] "Google's Speech Recognition Technology Now Has a 4.9% Word Error Rate," Last accessed in 2019, available at https://bit.ly/2rGRtUQ.

[6] "Inside China's Massive Surveillance Operation," Last accessed in 2019, available at https://www.wired.com/story/inside-chinas-massive-surveillance-operation/.

[7] "Mozilla Project DeepSpeech," Last accessed in 2019, available at https://azure.microsoft.com/en-us/services/cognitive-servic/speaker-recognition/.

[8] "NSA Speech Recognition Snowden Searchable Text," Last accessed in 2019, available at https://theintercept.com/2015/05/05/nsa-speech-recognition-snowden-searchable-text/.



[9] "Project SHTOOKA - A Multilingual Database of Audio Recordings of Words and Sentences," Last accessed in 2019, available at http://shtooka.net/.

[10] "Simple Audio Recognition," Last accessed in 2019, available at https://www.tensorflow.org/tutorials/sequences/audio_recognition.

[11] "The CMU Audio Database (also known as AN4 database)," Last accessed in 2019, available at http://www.speech.cs.cmu.edu/databases/an4/.

[12] "Transcribing Phone Audio with Enhanced Models," Last accessed in 2019, available at https://cloud.google.com/speech-to-text/docs/phone-model.

[13] "Twilio - Communication APIs for SMS, Voice, Video and Authentication," Last accessed in 2019, available at https://www.twilio.com/.

[14] "WER are we - an attempt at tracking states of the art(s) and recent results on speech recognition," https://github.com/syhw/wer_are_we, Last accessed in 2019.

[15] "Who's Smartest: Alexa, Siri, and or Google Now?" Last accessed in 2019, available at https://bit.ly/2ScTpX7.

[16] "Wit.ai Natural Language for Developers," Last accessed in 2019, available at https://wit.ai/.

[17] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," pp. 4277-4280, 05 2012.

[18] H. Abdullah, W. Garcia, C. Peeters, P. Traynor, K. Butler, and J. Wilson, "Practical hidden voice attacks against speech and speaker recognition systems," Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS), 2019.

[19] M. Alzantot, B. Balaji, and M. Srivastava, "Did you hear that? Adversarial examples against automatic speech recognition," arXiv preprint arXiv:1801.00554, 2018.

[20] D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. F. Balcan and K. Q. Weinberger, Eds., vol. 48. New York, New York, USA: PMLR, 20-22 Jun 2016, pp. 173-182. [Online]. Available: http://proceedings.mlr.press/v48/amodei16.html

[21] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945-4949.

[22] S. Baluja and I. Fischer, "Adversarial transformation networks: Learning to generate adversarial examples," arXiv preprint arXiv:1703.09387, 2017.

[23] M. K. Bispham, I. Agrafiotis, and M. Goldsmith, "A Taxonomy of Attacks via the Speech Interface," 2018.

[24] K. Bock, D. Patel, G. Hughey, and D. Levin, "unCaptcha: a low-resource defeat of reCaptcha's audio challenge," in Proceedings of the 11th USENIX Conference on Offensive Technologies. USENIX Association, 2017, pp. 7-7.

[25] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer, "Adversarial patch," arXiv preprint arXiv:1712.09665, 2017.

[26] E. Bursztein, R. Beauxis, H. Paskov, D. Perito, C. Fabry, and J. Mitchell, "The failure of noise-based non-continuous audio CAPTCHAs," in Security and Privacy (SP), 2011 IEEE Symposium on. IEEE, 2011, pp. 19-31.

[27] W. Cai, A. Doshi, and R. Valle, "Attacking speaker recognition with deep generative models," arXiv preprint arXiv:1801.02384, 2018.

[28] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in USENIX Security Symposium, 2016, pp. 513-530.

[29] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Security and Privacy (SP), 2017 IEEE Symposium on. IEEE, 2017, pp. 39-57.

[30] N. Carlini and D. Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text," ArXiv e-prints, p. arXiv:1801.01944, Jan. 2018.

[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577-585.

[32] M. Cisse, Y. Adi, N. Neverova, and J. Keshet, "Houdini: Fooling deep structured prediction models," arXiv preprint arXiv:1707.05373, 2017.

[33] A. Danielsson, "Comparing Android Runtime with native: Fast Fourier Transform on Android," 2017, MS thesis. [Online]. Available: https://bit.ly/2MQpUV1

[34] E. Dohmatob, "Limitations of adversarial robustness: strong no free lunch theorem," arXiv preprint arXiv:1810.04065, 2018.

[35] J. S. Garofolo et al., "Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, MD, vol. 107, p. 16, 1988.

[36] S. A. Gelfand, Hearing: An Introduction to Psychological and Physiological Acoustics, 5th ed. Informa Healthcare, 2009.

[37] Y. Gong and C. Poellabauer, "Crafting adversarial examples for speech paralinguistics applications," arXiv preprint arXiv:1711.03280, 2017.

[38] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.

[39] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764-1772.

[40] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 38, 03 2013.

[41] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.

[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.

[43] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar, "Adversarial machine learning," in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, ser. AISec '11. New York, NY, USA: ACM, 2011, pp. 43-58. [Online]. Available: http://doi.acm.org/10.1145/2046684.2046692

[44] C. Kereliuk, B. L. Sturm, and J. Larsen, "Deep learning and music adversaries," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059-2071, 2015.

[45] H. Köpcke, A. Thor, and E. Rahm, "Evaluation of entity resolution approaches on real-world match problems," Proceedings of the VLDB Endowment, vol. 3, no.