Detecting Deception in Interrogation Settings
C.E. Lamb and D.B. Skillicorn
December 2012
External Technical Report ISSN-0836-0227-2012-600
School of Computing, Queen's University
Kingston, Ontario, Canada K7L 3N6
Document prepared December 18, 2012. Copyright © 2012 C.E. Lamb and D.B. Skillicorn
School of Computing, Queen's University
{carolyn,skill}@cs.queensu.ca
Abstract
Bag-of-words deception detection systems outperform humans, but are still not always accurate enough to be useful. In interrogation settings, present models do not take into account the potential influence of the words in a question on the words in the answer. Because of the requirements of many languages, including English, as well as the theory of verbal mimicry, such influence ought to exist. We show that it does: words in a question can “prompt” other words in the answer. However, the effect is receiver-state-dependent: deceptive and truthful subjects respond to prompting in different ways. The accuracy of a bag-of-words deception detector can be improved by training a predictor on both question words and answer words, allowing it to detect the relationships between these words. This approach should generalize to other bag-of-words models of psychological states in dialogues.
Detecting Deception in Interrogation Settings
C.E. Lamb and D.B. Skillicorn, School of Computing, Queen's University
Technical Report 2012-600
1 Introduction
From court cases to airport security, many high-stakes real-world situations rely on the human ability to distinguish truth from deception. Sadly, humans do not do this well. Automating the detection of deception would thus bring many benefits. If an automated method outperformed humans, it could be used to augment human judgement in high-stakes situations; and it could also be used in lower-stakes situations where expert human judgement is infeasible. For instance, a filter alerting users to potential deception could be used with online job postings.
One obstacle in the way of automation is that deception works differently in different situations. One important factor mediating cues to deception is the difference between monologues and dialogues. Many deception models are based on monologues such as speeches or written statements. In many interesting real-world cases, the deceptive person is in a dialogue, and each participant responds and adjusts to the linguistic style of the other. This makes deception detection in a dialogue complicated. If one participant shows linguistic signs of deception, are they being deceptive, or are they responding to some aspect of the other participant's prompting?
We have undertaken an exploratory study of these issues by analyzing archival transcripts of real-world interrogations and applying a deception model developed by Newman et al. [43]. We found that word classes relevant to the model were vulnerable to a prompting effect: using certain words in the question increased the frequency of other words in the answer. Therefore the model cannot be applied to dialogue data in its present form, because the prompting effect is a potential confound.
Our goal, therefore, was to expand the model's applicability by removing the effect of the question from each answer. However, using this correction on real-world data revealed that our assumptions had been overly simplistic. Deceptive and non-deceptive respondents in the data responded to prompting in different ways, and removing the prompting effect also removed some of the signals of difference between these subgroups. Instead of removing the effect of the question, the relationship between question and answer words needs to be taken into account. Tests with a predictor confirmed that a model based on question and answer words, rather than just answer words, is indeed more accurate.
We conclude that, in deception, it is not the case that the respondent has an underlying mental state whose effects are distorted in a uniform way by prompting. Instead, prompting occurs, but the respondent's mental state determines its effect. Fortunately, this second-order behavior is itself detectable. We suspect that this effect will generalize to other word-based psychological models in dialogues, helping reveal other aspects of the relationship between
questioner and respondent.
2 Background
2.1 Stop Words
In text mining it is common practice to remove certain “stop words”: words like “the”, “I”, and “if”, which are thought to be too common to be worth analyzing [31]. It is thought that these stop words have only a grammatical function, and thus are unnecessary in a model that does not take word order into account. However, stop words carry interesting meanings of their own. Aside from their grammatical function, they have contextual, social meanings. (The word “he”, for example, refers to a man, but we need context in order to figure out which man.) They also serve a stylistic function. For this reason they are also called “style words” or “function words”.
Function words are particularly useful for the discovery of a speaker's mental state for two reasons. First, they are more plentiful than content words. Although normal English speakers have only 500 function words in their vocabulary (out of 100,000 total English words), 55% of all the words they speak or write on a given day will be function words [55]. Second, these words are produced in a different part of the brain than content words [39] and can “leak” information about a person's mental state even without their awareness [17].
Social psychologists in the past twenty years have been analyzing people's speaking and writing styles by counting both function words and common content words. One important tool for this analysis is Pennebaker's [45] Linguistic Inquiry and Word Count program (LIWC). With this program, researchers have discovered correlations between function word use and everything from depression [49] to sexual orientation [25], quality of a relationship [50], and, our focus here, deception [43]. Function words have also been shown to distinguish among different kinds of Islamist language [35].
2.2 Deception Detection
Unaided humans are poor at detecting deception. Although individual proficiency varies, even groups of trained humans such as police officers rarely perform above chance [22]. Many attempts have been made to construct empirical models of deception, relying on objective cues, but this is difficult because different studies produce conflicting results. In fact, no single cue is reliably indicative of deception across studies [8]. One reason for this is that different studies address the issue of deception in different situations, and the relevant cues in these situations may also be different. For example, DePaulo et al.'s meta-analysis found that effect sizes were different depending on the deceptive person's motivation, the amount of interactivity in the setting, and whether a social transgression was involved [21]. Meanwhile, computer-mediated communication appears to function along different lines from face-to-face communication, perhaps because each person has time to rehearse what they will say [61].
2.3 The Pennebaker Deception Model
The model of deception we use comes from Newman et al. [43]. The authors asked college students to speak or write, either deceptively or truthfully, about a variety of topics (their opinion on abortion, their feelings about their friends, and a mock theft). Then they analyzed all the truthful and deceptive statements with LIWC. Four linguistic signs of deception emerged across categories:
• First person singular pronouns decrease in the deceptive condition. Deceptive people have less personal experience with their subject matter than truthful people, and are less emotionally willing to commit to what they are saying, so they focus on and refer to themselves less [43].
• Exclusive words decrease in the deceptive condition. Such words introduce increased complexity into sentences. Deception causes a higher cognitive load than truthfulness because of the increased difficulty of monitoring a deceptive performance [21], which produces an increase in visible signs of effort [62], even when the deceptive person has had time to prepare [56].
• Negative emotion words increase in the deceptive condition, consistent with DePaulo et al.'s [21] meta-analysis. Deceptive people are thought to do this as a sign of unconscious discomfort [43]. Alternatively, increased emotion can be a sign of trying too hard to persuade the listener, as in Zhou et al. [61].
• Motion verbs (“go”, “run”) increase in the deceptive condition. This is the result of increased cognitive load and perhaps a need to keep the story moving and discourage second thoughts on the part of the reader/hearer.
Lowered self-reference is consistent with other models (e.g. Zhou et al. [59]) and persists regardless of motivation [27], but DePaulo et al. [21] and Zuckerman et al. [62], in their meta-analyses, did not find a reliable, statistically significant decrease in self-references across studies.
DePaulo et al. [21] point out that increases in emotion should be treated with caution. Most lies in the real world are “white lies”, told to smooth over social interactions, and people telling them experience very little distress [20]. Moreover, some truthful statements, such as confessions of wrongdoing, might lead to high levels of negative emotions. For similar reasons, this part of the model should not be applied to psychopaths, who experience no discomfort when breaking moral rules [47].
The effect of cognitive load should also be treated with caution: some white lies are easy to tell, and some potentially volatile truths may be as mentally effortful as a lie [21].
Since the original work, the Pennebaker model of deception has been extensively validated across a large number of populations. The words used in these criteria are summarized in Table 1.
The LIWC model performed significantly better than human raters in distinguishing truth from deception [43]. Gupta and Skillicorn [26] suggest that the LIWC model detects not only outright falsehood, but also “spin” or “persona deception”, in which a person does not make factually false statements, but consciously projects an image of themselves that they know to be inaccurate. Skillicorn and Leuprecht have used this reasoning to apply LIWC to the speeches of politicians [51].

Table 1: Words used in the LIWC deception model

First-person pronouns: I, me, my, mine, myself, I'd, I'll, I'm, I've
Exclusive words: but, except, without, although, besides, however, nor, or, rather, unless, whereas
Negative emotion words: hate, anger, enemy, despise, dislike, abandon, afraid, agony, anguish, bastard, bitch, boring, crazy, dumb, disappointed, disappointing, f-word, suspicious, stressed, sorry, jerk, tragedy, weak, worthless, ignorant, inadequate, inferior, jerked, lie, lied, lies, lonely, loss, terrible, hated, hates, greed, fear, devil, lame, vain, wicked
Motion verbs: walk, move, go, carry, run, lead, going, taking, action, arrive, arrives, arrived, bringing, driven, carrying, fled, flew, follow, followed, look, take, moved, goes, drive
2.4 Fine-Tuning the LIWC Model
Keila and Skillicorn [34] applied LIWC to a dataset consisting of internal emails at Enron in the period just before its fraudulent accounting scandal. They combined LIWC with singular value decomposition (SVD). SVD is a useful tool for eliciting correlations from multidimensional data. An SVD is a factorization of a real matrix A into three other matrices:

A = U S V^T

such that U and V^T are unitary matrices and S is a diagonal matrix [53]. The matrices U and V represent the data as vectors in a many-dimensional space. These axes appear in the matrices in decreasing order of size: the first rows of V^T represent the axes of maximal variation in the data. The nonzero values of the matrix S, meanwhile, correspond to the amount of variation along each axis. One can therefore truncate a set of SVD matrices at an arbitrary number of dimensions and be confident that the truncated version is the most faithful possible representation of A in the chosen number of dimensions. The S matrix values provide an indication of how much variation is being lost [52].
Reducing the number of dimensions in this manner not only compresses the data but elucidates correlations between different parts of it, because documents or variables that vary together will be represented as closely aligned vectors in the SVD space [19].
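The truncation described above is a few lines in any numerical library. The following sketch (NumPy, with random stand-in data rather than any dataset from this report) keeps the top k singular values and reports how much variation they retain:

```python
import numpy as np

# Illustrative sketch only: truncate an SVD to k dimensions and measure
# how much variation is retained. The matrix A here is random stand-in
# data, e.g. 20 documents by 6 word-category rates.
rng = np.random.default_rng(0)
A = rng.random((20, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                      # keep the two axes of maximal variation
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The singular values indicate how much variation each axis carries,
# so the retained fraction is the ratio of their squared sums.
retained = (s[:k] ** 2).sum() / (s ** 2).sum()

# A_k is the best rank-k approximation of A in the least-squares sense.
err = np.linalg.norm(A - A_k)
```

The truncated product A_k is what "closely aligned vectors in the SVD space" refers to: documents are compared by their coordinates in the first k rows/columns of U and V.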
Keila and Skillicorn [34] found that LIWC performed well when combined with SVD, because words within a given category (except exclusive words) were correlated with each other and formed sensible structures in semantic space. The top ranked emails by a geometrical measure in reduced-dimension semantic space were all deceptive. However, in such a large dataset, it is difficult to tell how many false negatives there are.
Skillicorn and Little [52] repeated the SVD-based analysis on transcripts from the Gomery commission, in which former Canadian government officials were examined regarding alleged corruption. They found that in this data, deception was associated with increases – not decreases – in first-person singular pronouns and exclusive words. Altering the model to look for increases in all categories, they produced results in rough agreement with media estimates of who was being deceptive and who was not. (Although ground truth for the Gomery data was not available, the most deceptive people according to the model were people saying they did not remember basic facts about their own employment, such as who they were working for. Meanwhile, the least deceptive people were witnesses called in to explain purely technical matters.) Hence the standard Pennebaker model does not appear to be effective in dialogue settings.
If deception decreases first-person pronoun use in some situations and increases it in others, this explains DePaulo et al.'s [21] and Zuckerman et al.'s [62] inability to find a significant effect for self-references across many studies. However, it raises the more vexing question of what causes these words to behave in this way. Skillicorn and Little [52] suggest that first-person singular pronouns and exclusive words increased with deception in the context of the Gomery commission because it was not an emotionally charged situation. As we shall see, there is a better explanation.
2.5 Forcing Word Use in Questions and Answers
There are two processes of interaction between the language of questions and the language of answers that alter the word use in both. The first is the technical requirements of the language itself; the second is the largely unconscious mimicry that participants in a conversation fall into.
In most languages, the function word patterns in questions force some corresponding structure into a responsive answer. For example, a question containing the phrase “Did you . . . ” requires either a first-person singular or first-person plural pronoun (“Yes, we did . . . ”; “No, I didn't”) or a passive verb. A respondent therefore does not have the same freedom of word use in a dialogue that they have in an unforced setting.
Two people in conversation, even if they are strangers, will imitate each other in everything from facial expression and body language [9] to volume [40], pitch [33], and speech rates [58]. People speaking to each other also mimic each other's words and phrases [36]. This mimicry is generally not conscious [9]. Subjects mimic phrases they have heard even when consciously trying not to [3] and when their conscious working memory is filled with a distractor task [36].
LIWC can be used to measure verbal mimicry at the level of categories of words, where it is called Linguistic Style Matching (LSM). LSM is measured by comparing LIWC counts on all function word categories between both partners in an interaction. This comparison can be done through product-moment correlation [44] or a weighted difference score [32].
The LSM metric is internally consistent: if a dyad matches in their use of one function word category, they will probably match to the same degree on all others. It also generalizes well across different contexts, from online chat conversations [44] to letters between well-known colleagues [32] and face-to-face discussions [44].
LSM appears to a greater or lesser degree in different circumstances. Some linguistic styles are more easily matched than others, and different people will match a given style with more or less ease [32]. However, while greater LSM is associated with greater social cohesion, the matching occurs even between strangers who dislike each other [44]. We would thus expect a degree of matching to occur in any context – even a courtroom interrogation.
3 Method and Results
The prompting effect we expect to see is a generalization of style matching because, unlike pure mimicry, we expect that some categories of question words prompt different categories of response words. This is particularly obvious in the case of pronouns. Our first hypothesis is therefore:
H1. The function words in an answer to a question are not independent of the question, but are prompted (positively or negatively) by the function words in the question. At least some of the words affected in this way will be relevant words from the Pennebaker deception model.
Our second hypothesis follows:

H2. If word frequencies from the LIWC model are affected by the words in the question, then this potentially distorts the results of the Pennebaker model. Removing the effects of the question from the answer before applying the LIWC model should result in greater accuracy.
Our plan of attack has three parts. First, we gather archival question-and-answer data and visualize any relevant changes in frequency that could be caused by H1. Second, we develop a method to correct for these changes. Third, although not every answer in our data can be labeled as entirely truthful or entirely deceptive, we use some straightforward methods to validate the corrections and to investigate whether they do, in fact, improve the separation between truthful and deceptive responses.
3.1 Datasets
We created three datasets by taking archival transcriptions of real-life, high-stakes question-and-answer interactions.
The Republican dataset comprises transcripts of each of the Republican primary debates leading up to the American presidential election of 2012. These involved a total of ten candidates and were televised and transcribed online by various news organizations [1, 4–7, 10–16, 18, 24, 30, 42, 46, 48, 57].
We used the Republican dataset to investigate overall patterns of interaction between question and answer words. We did not rate the Republican presidential candidates as
deceptive or non-deceptive, since there is no reliable estimate of ground truth about the candidates' honesty. Fact checking websites may show that specific statements are true or false, but they do not help with the issue of persona deception, and there is no reliable way to judge one candidate overall as more deceptive than another. However, we did judge the debates in general as a forum in which all candidates would be motivated towards persona deception as defined by Gupta and Skillicorn [26]. Presenting themselves and the facts in the most favorable possible light is essential to getting elected, and much of the time this favorable light does not correspond exactly with reality. The full Republican dataset contained 2118 question-answer pairs and 301,539 total words.
The Nuremberg dataset comprised selected examinations and cross-examinations of witnesses from the Nuremberg trials of 1945–1946. Most of these examinations were taken from the Trial of German Major War Criminals, transcribed at the Holocaust memorial website nizkor.org [29]. Two were instead taken from the Nuremberg Medical Trial, which is partially transcribed online at the website of the Harvard Law Library [41]. Unlike the Republican dataset, the Nuremberg dataset contained obvious subgroups with markedly different motivations towards deception. The first group, Defendants, contained two Nazi war criminals testifying in their own defense who were eventually found guilty on all counts and executed. These men were highly motivated towards deception: they were guilty, and their lives depended on convincing the tribunal that they were not. The second group, Untrustworthy witnesses, contained ten lower-ranking Nazis who were not themselves on trial. While this group was not at immediate risk of conviction, it seems reasonable to suppose that their accounts would be moderately deceptive. Most of them would be motivated to absolve themselves either by minimizing Nazi war crimes as a whole or by minimizing their own involvement. The third group, Trustworthy witnesses, contained nineteen survivors of Nazi war crimes who testified about those crimes. Fourteen of these were Holocaust survivors, while the other five were civilians who reported on more general conditions in Nazi-occupied countries. We did not consider any of these witnesses deceptive.
Some witnesses spoke at great length about their experiences, so we chose to truncate answers in the Nuremberg dataset at 500 words, matching the maximum size that occurred naturally in the Republican dataset. This affected less than 1 percent of the answers. The full Nuremberg dataset contained 4159 question-answer pairs (1355 from Defendants, 1826 from Untrustworthy witnesses, and 978 from Trustworthy witnesses). It contained a total of 311,099 words.
We also make brief use of a third dataset, Simpson. This contains depositions from the civil trial of O.J. Simpson for the wrongful death of his wife, Nicole Brown, and another man. Extensive transcripts of this material were available online [54]. Simpson contains Simpson's deposition in his own defense, which we considered deceptive, as well as the depositions of family and friends of the deceased, which we considered largely truthful. (We chose the civil trial rather than the criminal trial because it was the first time Simpson testified directly in his own defense; it also had the advantage of being a trial at which he was found liable.) Since Simpson's own deposition was three times as long as the rest of the dataset, we used only every third question-answer pair from him. The Simpson dataset contained 20,810
question and answer pairs, totalling 412,385 words.
3.2 Model Words
When processing this data, we are most interested in what occurs at the level of a single question and answer pair. Thus, we looked at “windows” in the data, which we defined as a single question followed by its answer. We recorded the length (in words) of each question and answer, along with the count of the number of words in these categories:
FPS First person singular pronouns. The Pennebaker model predicts that deceptive people should use a lower rate of first-person pronouns than truthful people. However, in the Gomery commission data, Little and Skillicorn [37] found that deceptive people used a higher rate.
but Lower rates of this and other exclusive words are expected in deception. The Pennebaker model puts “but” and “or” in the same exclusive words category, but Little and Skillicorn's results [37] suggest that the two words often have quite different properties in semantic space. We therefore counted “but” and “or” separately.
or Another exclusive word.
excl The other exclusive words from the model, for example, “unless” and “whereas”.
neg Negative emotion words. Higher rates of these are expected in deception.
action Action verbs. Higher rates of these are expected in deception.
FPP First person plural pronouns. These can be substituted for singular in some circumstances (the “royal we”), and a questioner using the “royal we” might encourage a respondent to do likewise. Alternatively, a respondent who is forced grammatically to use a first person pronoun might choose a plural one instead of a singular as a distancing technique. Similarly, if the questioner is referring to a group to which both they and the respondent belong, this might prompt the respondent to continue talking about this group. Focusing on a group leaves less time to focus on oneself, so we expected higher FPP rates in the question to prompt lower FPS rates in the answer.
SPP Second person pronouns. A question containing the word “you” is probably about the respondent, and the respondent is expected to respond by giving information about themselves. Thus, we expected higher SPP rates to prompt higher FPS rates in the answer.
“Wh” Who, what, when, where, why, and how. Questions containing these words prompt the respondent for a specific fact. Questions about specific facts should make it harder for the respondent to be evasive. We expected higher “wh” word rates to prompt higher rates of exclusive words due to an increase in cognitive complexity.
TPS Third person singular pronouns. These words are references to a person other than the questioner or respondent. Being asked about a third person should induce the respondent to talk about that person, and thus give them less opportunity to talk about themselves. When talking about another person – not oneself, and not the questioner – it might also be easier to make disparaging or negative remarks. We expected higher TPS rates to prompt lower FPS rates and higher negative emotion rates.
TPP Third person plural pronouns. We expected higher TPP rates to prompt lower FPS rates and higher negative emotion rates, for the same reason as above.
these An early analysis with a part-of-speech tagging program showed that rates of “these”, “those”, and “to” in the question were weakly correlated with rates of FPS in the answer. This early analysis did not yield other interesting results.
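A minimal sketch of turning one question or answer into per-category counts (our own illustration; the word lists here are abbreviated, hypothetical stand-ins for the full model categories, not the exact lists used in this report):

```python
# Sketch of per-window counting. Word lists are abbreviated stand-ins.
CATEGORIES = {
    "FPS": {"i", "me", "my", "mine", "myself"},
    "but": {"but"},
    "or": {"or"},
    "excl": {"except", "without", "although", "unless", "whereas"},
    "SPP": {"you", "your", "yours", "yourself"},
}

def window_counts(text: str) -> dict:
    """Count category words in one question or one answer."""
    words = text.lower().split()
    counts = {name: sum(w.strip(".,?!") in wordset for w in words)
              for name, wordset in CATEGORIES.items()}
    counts["length"] = len(words)  # needed later to form rate statistics
    return counts

q = window_counts("Did you see him, or were you without your keys?")
a = window_counts("I did, but I had left my keys at home.")
```

Dividing each count by the recorded length gives the rate statistics used in the next section.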
3.3 Gaussian Distributions
A 500-word answer with five action words is statistically and linguistically different from a 15-word answer with five action words. Therefore, using raw word counts in our statistical analysis would be inappropriate. For all our analyses, we divided the number of words of each type in each single question or answer by the total number of words it contains, giving a rate statistic for each type of word.
We expected the shortest questions and answers to be difficult to analyze. In a five-word answer with one “but”, the rate statistic for “but” would be 0.2 – unusually large. It isn't clear that the answer has all the properties associated with use of the word “but” to a greater degree than, say, a 16-word answer with one “but”, which would have a rate of only 0.0625. Based on this reasoning and some early inspection of the data, we began the analysis with a minimum window size of 50 words in both question and answer.
We separated answers into two categories for each pair of question word and response word. The “unprompted” category contained all answers where the matching question did not contain the relevant question word. We assume that rates in such answers reflect their natural rates without prompting. The “prompted” category contained all answers in which the relevant question word appeared at least once.
For each pair of question word and answer word in the prompted category, we counted the rates in all windows and fitted a two-dimensional Gaussian distribution to the data, using the FitFunc toolbox for Matlab. For the unprompted category we computed a one-dimensional Gaussian fitting the rates in the answers.
These Gaussian distributions provide an indicator of the relationship between the question word and the answer word pair. A distribution with a positive slope and a mean higher than that of the unprompted data indicates that the question word prompts higher rates of the answer word. A distribution with a negative slope and a mean lower than that of the unprompted data indicates that the question word prompts lower rates of the answer
word. A flat or near-spherical distribution with a mean close to that of the unprompted datasuggests that there is no relationship between the two word categories.
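As a sketch of this fitting step (using NumPy rather than the Matlab FitFunc toolbox the analysis actually used, and with synthetic stand-in data), a bivariate Gaussian fit reduces to estimating a mean vector and covariance matrix; the sign of the off-diagonal covariance term gives the slope of the fitted distribution:

```python
import numpy as np

# Sketch only: fit a bivariate Gaussian to (answer-word rate,
# question-word rate) pairs by maximum likelihood, i.e. the sample
# mean and covariance. Synthetic data stands in for real windows.
rng = np.random.default_rng(1)
q_rates = rng.uniform(0.0, 0.1, 200)                  # question-word rates
a_rates = 0.5 * q_rates + rng.normal(0, 0.005, 200)   # positively prompted

data = np.column_stack([a_rates, q_rates])  # x = answer, y = question
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# A positive off-diagonal term means the fitted Gaussian slopes upward:
# higher question-word rates go with higher answer-word rates.
slope_is_positive = cov[0, 1] > 0
```

A near-zero off-diagonal term, with a mean close to the unprompted mean, corresponds to the "no relationship" case described above.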
[Figure 1 appears here as nine panels of Gaussian distribution plots: (a) first-person plural pronouns prompting first-person singular pronouns; (b) second-person pronouns prompting first-person singular pronouns; (c) “wh” words prompting “or” (the other exclusive word categories looked similar); (d) third-person singular pronouns prompting first-person singular pronouns; (e) third-person plural pronouns prompting first-person singular pronouns; (f) “these”, “those”, and “to” prompting first-person singular pronouns; (g) first-person singular pronouns prompting “but”; (h) “but” prompting “or”; (i) third-person plural pronouns prompting action words.]

Figure 1: Gaussian distributions from the Republican dataset – minimum window size of 50 words. x-axis = rates for answer words, y-axis = rates for question words; the red line parallel to the y-axis shows the one-dimensional unprompted distribution out to 1 standard deviation.
Our initial estimates of the question word categories that would have significant prompting effects had mixed success. We did not find evidence of many of the relationships we had expected (Figure 1). However, we did find that higher rates of second-person pronouns in the question, as expected, were associated with higher rates of first-person singular pronouns in the answer. So were “these”, “those”, and “to”. With third-person plural pronouns, we found a weak effect in the opposite direction from what we expected: higher third-person plural pronoun rates prompted very slightly higher first-person singular pronoun rates. Although most pairs of question and answer word categories produced no effect, a few pairs for which we had no hypothesis also showed a prompting effect.
3.4 Data Correction
To remove the effect of prompting words, we carried out a series of affine transformations on the points in a bivariate Gaussian model in Cartesian space – producing corrected rates for each word category.
[Figure 2 appears here as three panels: (a) the original data; (b) steps 1 and 2: translated to a mean of (0, 0) and rotated; (c) steps 3 and 4: rescaled and translated back to a mean that matches the unprompted data.]

Figure 2: Steps in the correction process, illustrated using second-person pronouns in the question and first-person singular pronouns in the answer. The blue point represents a particular speech as it is transformed.
The correction method is illustrated in Figure 2. For each question-answer pair, we fit a bivariate Gaussian to the prompted data and translate it so the mean is at (0, 0). We then rotate the distribution using a standard rotation matrix until it is “flat” (one of its axes is parallel to y = 0). We use the smallest possible angle of rotation in either direction.
After that, we rescale the data on the y-axis so that its standard deviation in the y direction equals the standard deviation of the unprompted data, and translate it so that its mean returns to its original value in the x direction and is equal to the mean of the unprompted data in the y direction.
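A condensed sketch of these four steps, reconstructed from the prose above (our own NumPy reconstruction, not the authors' Matlab code; axis conventions follow Figure 1):

```python
import numpy as np

def correct_prompting(points, unprompted_mean, unprompted_std):
    """Sketch of the four correction steps: translate, rotate,
    rescale, translate back. Reconstructed from the report's
    description; not the original implementation.

    points: (n, 2) array of (x, y) window rates.
    """
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)

    # Step 1: translate the fitted distribution's mean to the origin.
    centered = pts - mean

    # Step 2: rotate by the smallest angle that makes the principal
    # axis of the fitted Gaussian parallel to y = 0.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]      # principal axis direction
    angle = np.arctan2(major[1], major[0])
    if angle > np.pi / 2:
        angle -= np.pi                          # choose the smaller rotation
    elif angle < -np.pi / 2:
        angle += np.pi
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = centered @ np.array([[c, -s], [s, c]]).T

    # Step 3: rescale y so its spread matches the unprompted data.
    y_std = rotated[:, 1].std()
    if y_std > 0:
        rotated[:, 1] *= unprompted_std / y_std

    # Step 4: translate back - original mean in x, unprompted mean in y.
    rotated[:, 0] += mean[0]
    rotated[:, 1] += unprompted_mean
    return rotated
```

After this transformation the prompted cloud has no slope, and its spread and mean in the corrected direction match the unprompted distribution.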
We have included a blue dot in Figure 2 representing an example window (in this case, the respondent is Newt Gingrich) so that it can be traced through each step of the process. In the uncorrected data, Gingrich is heavily prompted and responds with a number of first-person singular pronouns that is about average for the data as a whole, but much less than what might be expected given the general trend towards increased first-person singular pronouns with more prompting. After the data is corrected, Gingrich's first-person singular pronoun rate is low, as expected.
During this process, some windows can be assigned slightly negative values. Obviously it is impossible for a person to say fewer than zero words in response to a question. We treat the negative values as very emphatic zeroes, but do not correct them until the end of the process, since a subsequent correction for another word pair might return them to positive values.
We performed these transformations for each question-answer pair in the data, feeding the output from one transformation into the next so that, for each response word, a correction is made for each prompting word in turn. For question-answer pairs where the prompting word had negligible effect on the response word rate, the effect of this correction would also be negligible. There was a risk of these small effects producing slight noise in the data, but we accepted this risk because we did not wish to choose an arbitrary threshold separating effects that counted from effects that did not.
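The sequential application, with negative values clamped only at the very end, might look like the following sketch. The per-pair correction functions here are hypothetical placeholders, not the report's actual corrections.

```python
import numpy as np

def apply_corrections(answer_rates, corrections):
    """Apply a chain of per-pair corrections, feeding the output of one into
    the next, and clamp negative rates to zero only at the very end, so a
    later correction may return an intermediate negative to a positive value."""
    rates = np.asarray(answer_rates, dtype=float).copy()
    for correct in corrections:      # one correction per prompting word
        rates = correct(rates)
    return np.maximum(rates, 0.0)    # remaining negatives become "emphatic zeroes"

# Toy chain of two illustrative shifts; the intermediate negative value for the
# second window is preserved until the final clamp.
out = apply_corrections([0.02, 0.001],
                        [lambda r: r - 0.005, lambda r: r + 0.002])
```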
Examples of the changes to rate statistics are shown in Figure 3. It is useful to see how large a change in word frequencies such a correction represents. On average, an absolute value of three words was added to or removed from the FPS category in each window; at an average window size of 188 words, about six of which on average were FPS, this means that about half of these FPS pronouns, according to our model, are caused by prompting. "But" and action words changed by an average of one word per window, but some windows change in a positive direction and some in a negative one, so the average signed change is much less than one word per window. The other categories, on average, were barely changed.
To examine the sensitivity of the correction method, we reran the corrections for each category of answer words after removing the 2.5% highest rates and the 2.5% lowest rates for each category (5% of the data in total). In this restricted analysis, the average absolute value of the difference after correction was still about three words for FPS and about one word for action words. However, the average change in FPS value was reduced to only two words, and the effect on "but" disappeared. This suggests that the correction method is somewhat sensitive to outliers, but that its results for FPS and action words are largely valid.
3.5 Window Sizes
In many settings, the available window sizes are much smaller than 50 words. For example, in the Simpson depositions, the average question and answer together contain fewer than 20 words, and vanishingly few met the criterion of having 50 words or more in both the question and the answer. So we investigated ways to apply our model to smaller windows.
We experimented with the Republican dataset, using versions with smaller minimum window sizes: 30, 10, and 1. We also tried merging adjacent questions and answers, provided that they involved the same questioner and respondent, repeating this until all windows either met the minimum size for both question and answer or could not be merged because no adjacent windows remained. We then removed any windows that still did not meet the minimum size. We call these composite windows.
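The merging procedure can be sketched as follows. The window representation (questioner, respondent, and word lists) is an assumed structure for illustration, not the report's actual data format.

```python
def make_composites(windows, min_size):
    """Greedily merge adjacent question-answer windows that share the same
    questioner and respondent until both sides meet min_size; windows that
    still fall short at the end are dropped. Each window is a dict with keys
    'questioner', 'respondent', 'q_words', 'a_words' (assumed structure)."""
    merged = []
    for w in windows:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["questioner"] == w["questioner"]
                and prev["respondent"] == w["respondent"]
                and (len(prev["q_words"]) < min_size
                     or len(prev["a_words"]) < min_size)):
            # Extend the still-too-small composite with the adjacent window.
            prev["q_words"] = prev["q_words"] + w["q_words"]
            prev["a_words"] = prev["a_words"] + w["a_words"]
        else:
            merged.append(dict(w))
    # Discard composites that still do not meet the minimum on both sides.
    return [w for w in merged
            if len(w["q_words"]) >= min_size and len(w["a_words"]) >= min_size]

# Toy exchange: two short windows from the same pair merge; the final window
# has a long question but a too-short answer and is discarded.
windows = [
    {"questioner": "Q1", "respondent": "R1",
     "q_words": ["a"] * 3, "a_words": ["b"] * 4},
    {"questioner": "Q1", "respondent": "R1",
     "q_words": ["c"] * 3, "a_words": ["d"] * 3},
    {"questioner": "Q2", "respondent": "R1",
     "q_words": ["e"] * 10, "a_words": ["f"] * 2},
]
composites = make_composites(windows, 5)
```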
(a) Minimum window size 50
(b) Minimum window size 30
(c) Minimum window size 10
(d) Composites totaling size 50

Figure 3: Color maps showing the average change in answer rates as the result of corrections for each question-and-answer pair at each minimum window size. Prompting words are the columns and response words the rows; bright colors indicate that question words force lower rates of answer words, so answer word rates are corrected upward; dark colors indicate the opposite. All maps are on the same color scale.
We then applied the corrections, measured the average change induced at each stage of the correction, and compared it to the average change in the original data. We created color maps that show the average correction to each question-answer pair (Figure 3). The patterns in 50-word windows degrade as the minimum window size is lowered, particularly below 30. The most notable degradation happened in the most promising question-answer pair: second-person pronouns prompting first-person singular pronouns. The composite windows also showed slight degradation, comparable to that in the 30-word windows.
Table 2 shows how much of each dataset is usable at each window size. For Nuremberg and Simpson, small windows predominate, and there are many stretches where one individual answered many questions in a row, making them particularly amenable to the use of composite windows. With Nuremberg and Simpson, the composite window technique allowed analysis of much more data than 30-word windows. Furthermore, it caused less degradation than the small window sizes which would otherwise have been needed to cover this much data. In Republican, windows tended to be larger, so the effect was less pronounced, but
        Republican            Nuremberg             Simpson
        Q         A           Q         A           Q         A
Full    65,480    236,411     109,551   201,548     219,262   193,123
>= 10   55,161    159,450     75,313    155,395     45,162    71,357
>= 30   43,876    107,836     31,452    51,735      3,280     4,967
>= 50   33,193    72,078      15,394    22,366      323       350
Comp    44,759    101,211     105,423   170,706     219,197   192,537

Table 2: Number of words that could be included for each minimum window size.
the composite window technique still gave coverage and degradation comparable to those of the 30-word windows. For these reasons, we performed the rest of our analysis (in all three datasets) with composite windows.
3.6 Validation: Nuremberg and SVD
We expected corrected data to show a sharper distinction than the original data between the truthful and the deceptive. Using composite windows, we encoded the Nuremberg data and checked that it had similar properties to the Republican data. The magnitude of correction in FPS and action words was quite similar in the two datasets, although the effect on "but" in the Nuremberg data was much smaller.
We performed singular value decomposition on the answer data before and after performing the correction on the Nuremberg dataset. This reduced the 6-category deception model to 3 dimensions. We then made a scatter plot showing each window in the resulting semantic space, expecting an increase in the distance between truthful and deceptive subgroups.
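A minimal sketch of this reduction, assuming rows are windows and columns are the six response-word category rates (our illustration of the standard technique, not the report's code):

```python
import numpy as np

def svd_reduce(rate_matrix, k=3):
    """Project windows (rows) described by six response-word category rates
    (columns) into a k-dimensional semantic space via singular value
    decomposition of the mean-centered matrix."""
    X = np.asarray(rate_matrix, dtype=float)
    X = X - X.mean(axis=0)                  # center so the SVD captures variation
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]                 # coordinates of each window

# Toy matrix: 100 windows, six word-category rates.
coords = svd_reduce(np.random.default_rng(2).random((100, 6)))
```

Each window then has three coordinates, ordered by decreasing variance, which can be scatter-plotted and colored by subgroup as in Figure 4.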
Figure 4 shows the result of singular value decomposition on the Nuremberg data before and after correction. The effect is the opposite of what was expected: the correction erases most of the spatial distinction that existed between the groups. This suggests that, rather than being a confounding effect that should be removed to improve the detection of deception, the prompting effect contains important information about deception.
3.7 Validation: Subgroups in the Nuremberg Data
We analyzed each of the three Nuremberg subgroups separately, applying a correction to each. Figure 5 shows color maps of this data. The differences between Defendants and Untrustworthy witnesses (both Nazi subgroups) are not large, but the response patterns of the Trustworthy witnesses subgroup are quite different.
Note that these color maps are on the same scale as the earlier Republican color maps. We chose to present them this way in order to make comparisons easy across the different color maps. However, by using a scale that works for the Republican data, we have obscured an important fact about the Nuremberg data: namely, that the average change in first-person singular pronouns prompted by second-person pronouns, for both the Defendants
(a) Before correction
(b) After correction

Figure 4: Singular value decomposition of the responses in the Nuremberg dataset. Defendants are marked in red, Trustworthy witnesses in blue, and Untrustworthy witnesses in green. Before correction, the Trustworthy witnesses are concentrated on one side of the semantic space, but this is less the case after the correction.
and Untrustworthy witnesses, is literally off the scale. Both of these are more than twice the maximum amount shown by the color map; the Defendants' change is somewhat larger than that of the Untrustworthy witnesses. (Since we were using composite windows for this, the Republican average change is also slightly higher than it was in the previous color maps, which were made with 50-word minimum windows.) Figure 6 shows this more clearly. (This does not contradict our earlier claim that first-person pronoun corrections, in words per window, were similar in both the Nuremberg and Republican datasets. The Nuremberg data had smaller windows, and it contained subgroups with both much higher and much lower rate statistics, on first-person pronouns, than the Republican data.)
Seeing these large differences prompted us to look at individual distributions more closely. We overlaid the distribution from one subgroup onto the corresponding distributions for the other subgroups. This confirmed that different subgroups responded to prompting differently. They were not simply being prompted at different rates, or showing different rates of response independently of the prompt: many question-word/response-word pairs showed completely different distributions. In general, Defendants and Untrustworthy witnesses, the two deceptive groups, were similar, but the Trustworthy witnesses group was different. The strongest effects tended to involve first-person pronouns or action words. Four of the most striking sets of distributions are shown in Figure 7.
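Overlaying subgroup distributions amounts to fitting a bivariate Gaussian per subgroup and comparing the fitted parameters. A minimal sketch, with hypothetical toy subgroups that respond to prompting at different strengths:

```python
import numpy as np

def subgroup_gaussians(points, labels):
    """Fit a bivariate Gaussian (mean vector and covariance matrix) to the
    (question rate, answer rate) points of each subgroup, so the resulting
    distributions can be overlaid and compared (illustrative interface)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return {group: (points[labels == group].mean(axis=0),
                    np.cov(points[labels == group].T))
            for group in np.unique(labels)}

# Toy subgroups: deceptive speakers respond to prompting much more strongly.
rng = np.random.default_rng(5)
q = rng.uniform(0.0, 0.1, 400)
truthful = np.column_stack([q[:200],
                            0.2 * q[:200] + rng.normal(0, 0.005, 200)])
deceptive = np.column_stack([q[200:],
                             0.8 * q[200:] + rng.normal(0, 0.005, 200)])
points = np.vstack([truthful, deceptive])
labels = np.array(["truthful"] * 200 + ["deceptive"] * 200)
fits = subgroup_gaussians(points, labels)
```

Plotting the density contours of each fitted Gaussian on common axes then reproduces the kind of comparison shown in Figure 7.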
The prompting effect must be receiver-state-dependent: deceptive individuals respond to prompting one way, and truthful individuals another. One of the particularly strong differences involves second-person pronouns prompting first-person singular pronouns, which was already one of the strongest effects. Correcting for prompting reduced, rather than increased, the difference between these groups, because the prompting itself is a factor that
(a) Defendants
(b) Untrustworthy
(c) Trustworthy

Figure 5: Corrections for the three Nuremberg subgroups analyzed separately.
distinguishes them from each other. Rather than having a base rate of word use (due to deception or lack thereof) which is modulated by prompting, the different subgroups actually experience different kinds of prompting, which means that paying attention to the way in which prompting occurs ought to further elucidate differences between them.
These results explain the success of the ad hoc coding scheme used by Little and Skillicorn. Without knowledge of the rates of prompting words in questions, deceivers appear to increase their rates of first-person singular pronouns and exclusive words relative to less deceptive respondents. In fact, similar levels of prompting in questions to both groups produce these differentiated responses.
Figure 6: Average change in rates of first-person singular pronouns prompted by second-person pronouns in the Republican and Nuremberg datasets (Republican, Defendants, Untrustworthy, and Trustworthy), all using composite windows.
3.8 Validation: Random Forests
Does information in the question actually improve the detection of deception, or is that information present in the answer but in some way not captured by the Pennebaker model? We used random forests [2] to estimate the significance of question words. A random forest not only classifies data but estimates the importance of each attribute in the data by counting how often each attribute is selected to act as the split point for an internal node of a decision tree in the forest.
We trained two random forests: one with the rates of only the six response word categories in the answers, and one with these plus the rates of all the stimulus words in the questions. For simplicity, we included only the Defendants and Trustworthy witnesses subgroups.
Because not every statement by a Defendant is a lie, we did not expect high accuracy from either of these forests. Rather, we wanted to compare them to each other to see if the question words made a difference. If both questions and answers were predictive of deception, then the model that uses both should make better predictions, and it should select the words that appear the most promising (i.e., first-person singular pronouns in answers and second-person pronouns in questions). If only the words in answers were relevant to deception, then both models should perform about the same.
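The two-forest comparison can be sketched on synthetic data. This uses scikit-learn as an assumed implementation (the report cites Breiman's random forests but does not name one), and the toy data is constructed so that the answer rate alone carries little signal while deceptive and truthful speakers respond to question prompting in opposite ways, mimicking the receiver-state-dependent effect described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for labeled windows (assumed layout, not the real data).
rng = np.random.default_rng(3)
n = 600
deceptive = rng.integers(0, 2, n)            # 1 = deceptive window
q_rate = rng.uniform(0.0, 0.1, n)            # e.g. second-person pronouns in question
a_rate = np.where(deceptive == 1,
                  0.08 - 0.6 * q_rate,       # deceptive response to prompting
                  0.02 + 0.6 * q_rate)       # truthful response to prompting
a_rate = a_rate + rng.normal(0.0, 0.01, n)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
acc_answers = cross_val_score(forest, a_rate.reshape(-1, 1),
                              deceptive, cv=5).mean()
acc_both = cross_val_score(forest, np.column_stack([q_rate, a_rate]),
                           deceptive, cv=5).mean()
```

Because the two groups' answer-rate marginals are nearly identical by construction, the answers-only forest hovers near chance, while the forest given both question and answer rates can exploit their interaction.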
Figure 8 shows the performance of both random forests. While the answer-words-only random forest performs above chance, it shows a strong bias: its accuracy on Trustworthy witnesses is much lower than its accuracy on Defendants; that is, it tends to classify windows of both kinds as Defendants. Adding the question words improves overall accuracy by more than 10 percentage points, an even more dramatic result than we expected. Moreover, it reduces bias by an even larger amount, producing an improvement on the Trustworthy witnesses without reducing the accuracy on Defendants.
Table 3 shows the results of attribute ranking with the best-performing random forest.
(a) Second-person pronouns prompting first-person singular pronouns
(b) "Or" prompting first-person singular pronouns
(c) First-person plural pronouns prompting "but"
(d) "Or" prompting action words

Figure 7: Gaussian distributions for the Nuremberg dataset, separated by subgroup. Defendants are red, Untrustworthy witnesses are magenta, and Trustworthy witnesses are blue. Not all of the figures are on the same scale.
Figure 8: Prediction accuracy of random forests trained on the Nuremberg data (chance vs. answers only vs. questions and answers; Nuremberg overall, Defendants, and Trustworthy).
The most influential words in the question-and-answer random forest were as predicted: first-person singular pronouns in the answer and second-person pronouns in the question, with similar frequencies. It is likely that the interaction between these two categories drove many of the decision trees in the forest.
Word category                            Splits
A  first-person singular pronouns        10771
Q  second-person pronouns                10563
Q  "wh" words                             9581
Q  "these", "those", and "to"             9277
Q  first-person singular pronouns         6839
A  "but"                                  6584
A  action words                           6581
Q  "or"                                   4934
A  "or"                                   4692
Q  action words                           4165
Q  third-person plural pronouns           3973
Q  first-person plural pronouns           3641
A  misc exclusive words                   3570
Q  third-person singular pronouns         3383
Q  "but"                                  3119
A  negative emotion words                 2254
Q  misc exclusive words                   1538
Q  negative emotion words                  871

Table 3: Word categories in the random forest trained with question-and-answer data, ranked by the number of splits in the model that use each word.
After the top two spots, the three next most important word categories were all question words, suggesting that the forest not only supplemented its reasoning with question words but actually made more decisions based on question words than on answer words. However, a small overrepresentation of question words is to be expected, given that we are counting a larger number of word categories in the question than in the answer. Also, in some cases a question word may give context to an answer word and thus needs to be considered first.
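The split-counting behind Table 3 can be sketched as follows, using scikit-learn's tree internals (an assumed implementation; the report does not name one). In the toy example the label depends only on the first feature, so it should dominate the counts.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def split_counts(forest, feature_names):
    """Rank attributes as in Table 3: count how often each feature is chosen
    as the split point of an internal node, across every tree in the forest."""
    counts = Counter()
    for tree in forest.estimators_:
        feats = tree.tree_.feature           # scikit-learn marks leaves with -2
        for f in feats[feats >= 0]:          # keep only internal-node splits
            counts[feature_names[f]] += 1
    return counts.most_common()

# Toy forest on three features; only the first determines the label.
X = np.random.default_rng(4).random((200, 3))
y = (X[:, 0] > 0.5).astype(int)
rf = RandomForestClassifier(n_estimators=10, max_features=None,
                            random_state=0).fit(X, y)
ranked = split_counts(rf, ["A FPS", "Q SP", "Q wh"])
```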
3.9 Validation: the Simpson dataset
Our final task was to see whether these validations generalized. We looked at the Simpson dataset and performed the same analysis.
The SVD results from Nuremberg generalized to the Simpson data: while a modest visual distinction existed in semantic space between Simpson and the Plaintiffs, applying our correction method reduced this distinction (Figure 9). The average change in words per window for each answer word category was very similar to that of the Nuremberg data.
The distinction between subgroups did not generalize as well. While Simpson's deposition was somewhat different from those of the Plaintiffs (Figure 10), the two color maps did not differ in the same ways that the Nuremberg color maps differed. The picture from the overlaid subgroup distributions for individual question-answer pairs (Figure 11) was also different: there was
(a) Before correction
(b) After correction

Figure 9: Singular value decomposition of responses in the Simpson dataset before and after correction, with O.J. Simpson's responses in red and Plaintiffs in blue. The difference is very small.
(a) Simpson
(b) Plaintiffs

Figure 10: Corrections for the two O.J. Simpson subgroups analyzed separately.
evidence of differences between the two groups, as there had been with Nuremberg, but not in quite the same way; they tended to be slightly weaker. Moreover, the strongest differences between subgroups in the Simpson data did not usually correspond with the strongest differences in the Nuremberg data. In particular, the relationship between second-person pronouns in the question and first-person singular pronouns in the answer appeared much weaker between subgroups: Simpson used more first-person singular pronouns but was also
(a) Second-person pronouns prompting first-person singular pronouns
(b) "Or" prompting first-person singular pronouns
(c) Second-person pronouns prompting "but"
(d) Third-person singular pronouns prompting action words

Figure 11: Gaussian distributions for the Simpson dataset, separated by subgroup. Simpson is red and Plaintiffs are blue. Not all of the figures are on the same scale.
prompted more.
Figure 12: Prediction accuracy of random forests trained on the Simpson data (chance vs. answers only vs. questions and answers; Overall, Simpson, and Plaintiffs).
Finally, we constructed the same two random forests using the Simpson data. The random forest trained with both question and response words showed an increase in accuracy, but a smaller one than that for the Nuremberg data (Figure 12). In general, the Simpson data supports the view that prompting effects exist, but it is less clear where the prompting effects actually are.
4 Discussion
At present, neither humans nor computers are particularly good at detecting deception. Models that perform with 70% accuracy (as in Mihalcea and Strapparava's LIWC-based study [38]) are considered promising. But we can hardly afford for three in ten accused criminals to be falsely classified as deceptive. We also cannot rely on our own human ability to detect deception, since that ability is quite weak. Improvements to existing deception models are desperately needed.
We have increased the understanding of the processes that occur when a deceptive person is questioned. A prompting effect influences the rates of word usage in answers to questions, but the effect is receiver-state-dependent.
We developed a method to correct for prompting effects and so remove their influence. This method contains two components: a corrected representation of an utterance, and the size of the corrective change that is made. Each of these components might be useful in different situations. In our setting, the corrected representation was not useful because it removed some of the distinction between deceptive and truthful subgroups. However, tracking the size of the change was useful, as it allowed us to see at a glance the similarities and differences in prompting effects between different subgroups.
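As an illustration only, the two components can be sketched as follows, assuming per-utterance word-category rate vectors for questions and answers and a linear model of the prompting effect. The variable names, random data, and least-squares fit are assumptions for the sketch, not the report's actual procedure.

```python
import numpy as np

# Hypothetical setup: one row per question-answer pair, one column per
# word category. The data here is random and purely illustrative.
rng = np.random.default_rng(0)
Q = rng.random((50, 10))   # question word-category rates
A = rng.random((50, 10))   # answer word-category rates

# Assume a linear prompting effect: question rates predict answer rates.
W, *_ = np.linalg.lstsq(Q, A, rcond=None)

# Component 1: a corrected representation of each answer
# (the answer minus the contribution predicted from the question).
A_corrected = A - Q @ W

# Component 2: the size of the corrective change made to each answer.
change_size = np.linalg.norm(Q @ W, axis=1)

print(A_corrected.shape, change_size.shape)  # → (50, 10) (50,)
```

Keeping both outputs separate mirrors the finding above: the corrected representation may discard useful signal, while the magnitude of the correction is itself informative.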
The way in which respondents react to the prompting effect is, in itself, a cue to deception. We support this conclusion by building random forests to classify deceptive and truthful subgroups, finding that the random forests performed substantially better when they were given both question data and answer data. In this situation, the random forests performed at 70–80% overall accuracy.
This suggests a number of avenues for ongoing research in deception. First, as the random forest results show, paying attention to both questions and answers will improve word-based models – even without a detailed understanding of what the effect of question language is. Other classification approaches to deception will almost certainly benefit from including the words of questions in the data.
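A minimal sketch of this comparison, using scikit-learn's RandomForestClassifier on synthetic word-category rates: the data, feature construction, and labels below are invented for illustration (the label is made to depend on a question-answer interaction, mimicking a receiver-state-dependent prompting effect), and this is not the report's actual experimental setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
answers = rng.random((n, 10))     # answer word-category rates (synthetic)
questions = rng.random((n, 10))   # question word-category rates (synthetic)

# Invented labels that depend on how a question category and an answer
# category interact, so the answer alone carries only part of the signal.
labels = (answers[:, 0] * questions[:, 0] > 0.25).astype(int)

# Forest trained on answer words only.
answers_only = cross_val_score(
    RandomForestClassifier(random_state=0), answers, labels, cv=5).mean()

# Forest trained on both question and answer words.
both = cross_val_score(
    RandomForestClassifier(random_state=0),
    np.hstack([questions, answers]), labels, cv=5).mean()

print(f"answers only: {answers_only:.2f}, questions+answers: {both:.2f}")
```

With labels constructed this way, the combined feature set lets the forest model the interaction directly, which is the mechanism the report's results point to.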
Beyond this, these results suggest what we ought to look for in the questions. We know that the prompting effect is receiver-state-dependent. So tracking the nature of the prompting effect across different groups of respondents in similar situations should illuminate differences between respondents. We have made a start at showing which relationships between words are most useful in such analysis – in particular, second-person pronouns prompting first-person singular pronouns.
Deception is not the only setting in which first-person pronouns are relevant. Any variable studied using bag-of-words approaches might be assessed in dialogue settings. In any such setting, it seems probable that paying attention to both question words and answer words will improve accuracy. It may be the case that people of different personalities, sexes, or ages respond to prompting differently – which means that taking question words into account will improve our ability to profile people in dialogues, for example, if a participant's identity in a computer-mediated chat setting is uncertain. Meanwhile, in the study of relationships – useful, for example, in analysis of social networks – there may be rich and interesting effects to uncover from the association between relationship style or quality and the way the people involved respond to each other's prompting. On the other hand, there may be settings of this nature in which all the interesting subgroups respond to prompting in the same way. In that case, our method for removing the influence of the question may prove useful.
4.1 Limitations
Our analysis is biased towards common word categories. The "average change" metric measures not only the strength of a relationship between a pair of word categories but how frequently that relationship actually appears. We defend this choice of metric by pointing out that the more a model relies on common words, the more applicable it will be to small windows.
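A toy calculation makes the bias concrete: with invented numbers, a weak relationship involving a common prompt category can yield a larger average change than a strong relationship involving a rare one, because the average is taken over all utterances.

```python
# All values here are invented for illustration, not taken from the report.
strong_effect = 0.9   # per-occurrence effect of a rare prompt category
rare_rate = 0.01      # fraction of utterances in which it occurs

weak_effect = 0.1     # per-occurrence effect of a common prompt category
common_rate = 0.30    # fraction of utterances in which it occurs

# Averaging over all utterances multiplies strength by frequency.
avg_change_rare = strong_effect * rare_rate      # 0.009
avg_change_common = weak_effect * common_rate    # 0.03

print(avg_change_rare < avg_change_common)  # → True
```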
There are almost certainly important words that do not appear in this analysis. The Pennebaker model has been extensively validated, but other bag-of-words models have emerged. Hauch et al.'s recent meta-analysis [28] supports all the categories of the Pennebaker model but suggests several other cues to deception: increased positive and overall emotion words (as Zhou et al. found [60]), increased negations ("not", "never"), decreased third-person pronouns, and slightly decreased tentative and time-related words.
The use of bag-of-words implies the treatment of words as forms without any semantics. There are therefore potential confounds associated both with polysemy, and with the use of function words in a stylized, rather than active, way (for example, whether "thank you" is considered to contain an active second-person pronoun or not).
The prompting effects in the Simpson dataset were somewhat different from those in the Nuremberg dataset, and generally smaller. The random forest model also showed a smaller improvement when question data was added. Intuitively, either the Simpson dataset underrepresents the difference between truthful and deceptive testimony, or the Nuremberg dataset overrepresents it – or both. This illustrates the difficulty of comparing deceptive and truthful people across different contexts. In the Simpson dataset, it is plausible that the data underrepresents the difference between deception and truthfulness because the Plaintiffs were possibly not entirely truthful. They had a stake in the outcome, and were possibly motivated towards persona deception. Despite DePaulo et al.'s recommendations, little research has been done on the communication of people who are probably truthful, but highly motivated to "spin" the truth. Until such research is done, applying deception models to civil suits remains problematic.
In the Nuremberg dataset, the difference between truthfulness and deception may be exaggerated because of other differences between the Nazis and the Trustworthy witnesses. These groups generally spoke different first languages and were asked different types of questions; the Trustworthy witnesses were often given perfunctory cross-examination. Care must be taken in interpreting these differences in questions. As Hancock et al. [27] discovered, deception is a process involving both the questioner and the respondent. If truthful and deceptive respondents are questioned differently, it may partly be because the questioner is biased – or it may be that the questioner is responding unconsciously to deceptive speech patterns. To make matters worse, it may be both. As for demographic differences such as language and nationality, these cannot be ruled out as a potential confound, but their influence is probably small: researchers such as Fornaciari et al. [23], who tried to restrict their datasets to remove such confounds, found that doing so did not increase the accuracy of their models. Nevertheless, the testimony of many of these witnesses was translated, and this may have distorted the language patterns, and certainly distorted the timing of questions and responses.
Any deception model used in the field will be faced with this type of confound: it will be used on men and women, people from different cultures, people experiencing different emotions and confronted by different kinds of questioning, even people with mental illnesses or disabilities that affect their word choice. No deception model should be considered usable for practical purposes unless it has been tested across all these different kinds of categories.
References
[1] American Broadcasting Company. Full transcript: ABC News Iowa Republican debate, December 11 2011. Accessed in winter 2012 at http://abcnews.go.com/Politics/full-transcript-abc-news-iowa-republican-debate/.
[2] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[3] A.S. Brown and D.R. Murphy. Cryptomnesia: Delineating inadvertent plagiarism. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(3):432–442, 1989.
[4] Cable News Network. Full transcript of CNN-Tea Party Republican debate, 20:00-22:00, September 12 2011. Accessed in fall 2011 at http://transcripts.cnn.com/TRANSCRIPTS/1109/12/se.06.html.
[5] Cable News Network. Republican debate, June 13 2011. Accessed in fall 2011 at http://transcripts.cnn.com/TRANSCRIPTS/1106/13/se.02.html.
[6] Cable News Network. Full transcript of CNN Arizona Republican presidential debate, February 22 2012. Accessed in winter 2012 at http://archives.cnn.com/TRANSCRIPTS/1202/22/se.05.html.
[7] Cable News Network. Full transcript of CNN Florida Republican Presidential debate, January 26 2012. Accessed in winter 2012 at http://archives.cnn.com/TRANSCRIPTS/1201/26/se.05.html.
[8] J.R. Carlson, J.F. George, J.K. Burgoon, M. Adkins, and C.H. White. Deception in computer-mediated communication. Group Decision and Negotiation, 13:5–28, 2004. doi:10.1023/B:GRUP.0000011942.31158.d8.
[9] T.L. Chartrand and R. van Baaren. Human mimicry. Advances in Experimental Social Psychology, 41, 2009.
[10] The Chicago Sun-Times. CNN Republican debate, Nov. 22, 2011. Transcript. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/11/cnn_republican_debate_nov_22_2.html.
[11] The Chicago Sun-Times. CBS/National Journal GOP debate. Transcript, video, November 13 2011. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/11/_cbsnational_journal_gop_debat.html.
[12] The Chicago Sun-Times. CNBC Republican debate. Transcript, video highlights, November 9 2011. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/11/cnbc_republican_debate_transcr.html.
[13] The Chicago Sun-Times. Republican Las Vegas CNN debate: Transcript, October 19 2011. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/10/republican_las_vegas_cnn_debat.html.
[14] The Chicago Sun-Times. GOP NH ABC/Yahoo News debate: Transcript, January 8 2012. Accessed in winter 2012 at http://blogs.suntimes.com/sweet/2012/01/gop_nh_abcyahoo_news_debate_tr.html.
[15] The Chicago Sun-Times. GOP NH NBC's Meet the Press/Facebook debate: Transcript, January 8 2012. Accessed in winter 2012 at http://blogs.suntimes.com/sweet/2012/01/gop_nh_nbcs_meet_the_pressface.html.
[16] The Chicago Sun-Times. South Carolina GOP CNN debate, Jan. 19, 2012. Transcript, January 20 2012. Accessed in winter 2012 at http://blogs.suntimes.com/sweet/2012/01/south_carolina_gop_cnn_debate_.html.
[17] C. Chung and J. Pennebaker. The psychological functions of function words. In K. Fiedler, editor, Social Communication, pages 343–359. New York: Psychology Press, 2007.
[18] Council on Foreign Relations. Republican Debate Transcript, Tampa, Florida, January 2012. Accessed in winter 2012 at http://www.cfr.org/us-election-2012/republican-debate-transcript-tampa-florida-january-2012/p27180.
[19] S. Deerwester, S.T. Dumais, G.W. Furnas, and T.K. Landauer. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.
[20] B.M. DePaulo, D.A. Kashy, S.E. Kirkendol, M.M. Wyer, and J.A. Epstein. Lying in everyday life. Journal of Personality and Social Psychology, 70(5):979–95, 1996.
[21] B.M. DePaulo, J.J. Lindsay, B.E. Malone, L. Muhlenbruck, K. Charlton, and H. Cooper. Cues to deception. Psychological Bulletin, 129:74–118, 2003.
[22] P. Ekman and M. O’Sullivan. Who can catch a liar? American Psychologist, 46(9):913–920, September 1991.
[23] T. Fornaciari and M. Poesio. On the use of homogenous sets of subjects in deceptive language analysis. In Proceedings of the Workshop on Computational Approaches to Deception Detection, 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 39–47, April 23 2012.
[24] Fox News. Complete text of the Iowa Republican debate on Fox News channel, August 12 2011. Accessed in fall 2011 at http://foxnewsinsider.com/2011/08/12/full-transcript-complete-text-o