Detecting Deception in Interrogation Settings
C.E. Lamb and D.B. Skillicorn
December 2012
External Technical Report ISSN-0836-0227-2012-600
School of Computing, Queen's University
Kingston, Ontario, Canada K7L 3N6
Document prepared December 18, 2012. Copyright © 2012 C.E. Lamb and D.B. Skillicorn
School of Computing, Queen's University
{carolyn,skill}@cs.queensu.ca
Abstract
Bag-of-words deception detection systems outperform humans, but are still not always accurate enough to be useful. In interrogation settings, present models do not take into account the potential influence of the words in a question on the words in the answer. Because of the requirements of many languages, including English, as well as the theory of verbal mimicry, such influence ought to exist. We show that it does: words in a question can “prompt” other words in the answer. However, the effect is receiver-state-dependent: deceptive and truthful subjects respond to prompting in different ways. The accuracy of a bag-of-words deception detector can be improved by training a predictor on both question words and answer words, allowing it to detect the relationships between these words. This approach should generalize to other bag-of-words models of psychological states in dialogues.
Detecting Deception in Interrogation Settings
C.E. Lamb and D.B. Skillicorn, School of Computing, Queen's University
Technical Report 2012-600
1 Introduction
From court cases to airport security, many high-stakes real-world situations rely on the human ability to distinguish truth from deception. Sadly, humans do not do this well. Automating the detection of deception would thus bring many benefits. If an automated method outperformed humans, it could be used to augment human judgement in high-stakes situations; and it could also be used in lower-stakes situations where expert human judgement is infeasible. For instance, a filter alerting users to potential deception could be used with online job postings.
One obstacle in the way of automation is that deception works differently in different situations. One important factor mediating cues to deception is the difference between monologues and dialogues. Many deception models are based on monologues such as speeches or written statements. In many interesting real-world cases, the deceptive person is in a dialogue, and each participant responds and adjusts to the linguistic style of the other. This makes deception detection in a dialogue complicated. If one participant shows linguistic signs of deception, are they being deceptive, or are they responding to some aspect of the other participant's prompting?
We have undertaken an exploratory study of these issues by analyzing archival transcripts of real-world interrogations and applying a deception model developed by Newman et al. [43]. We found that word classes relevant to the model were vulnerable to a prompting effect: using certain words in the question increased the frequency of other words in the answer. Therefore the model cannot be applied to dialogue data in its present form, because the prompting effect is a potential confound.
Our goal, therefore, was to expand the model's applicability by removing the effect of the question from each answer. However, using this correction on real-world data revealed that our assumptions had been overly simplistic. Deceptive and non-deceptive respondents in the data responded to prompting in different ways, and removing the prompting effect also removed some of the signals of difference between these subgroups. Instead of removing the effect of the question, the relationship between question and answer words needs to be taken into account. Tests with a predictor confirmed that a model based on question and answer words, rather than just answer words, is indeed more accurate.
We conclude that, in deception, it is not the case that the respondent has an underlying mental state whose effects are distorted in a uniform way by prompting. Instead, prompting occurs, but the respondent's mental state determines its effect. Fortunately, this second-order behavior is itself detectable. We suspect that this effect will generalize to other word-based psychological models in dialogues, helping reveal other aspects of the relationship between
questioner and respondent.
2 Background
2.1 Stop Words
In text mining it is common practice to remove certain “stop words”: words like “the”, “I”, and “if”, which are thought to be too common to be worth analyzing [31]. It is thought that these stop words have only a grammatical function, and thus are unnecessary in a model that does not take word order into account. However, stop words carry interesting meanings of their own. Aside from their grammatical function, they have contextual, social meanings. (The word “he”, for example, refers to a man, but we need context in order to figure out which man.) They also serve a stylistic function. For this reason they are also called “style words” or “function words”.
Function words are particularly useful for the discovery of a speaker's mental state for two reasons. First, they are more plentiful than content words. Although normal English speakers have only 500 function words in their vocabulary (out of 100,000 total English words), 55% of all the words they speak or write on a given day will be function words [55]. Second, these words are produced in a different part of the brain than content words [39] and can “leak” information about a person's mental state even without their awareness [17].
Social psychologists in the past twenty years have been analyzing people's speaking and writing styles by counting both function words and common content words. One important tool for this analysis is Pennebaker's [45] Linguistic Inquiry and Word Count program (LIWC). With this program, researchers have discovered correlations between function word use and everything from depression [49] to sexual orientation [25], quality of a relationship [50], and, our focus here, deception [43]. Function words have also been shown to distinguish among different kinds of Islamist language [35].
2.2 Deception Detection
Unaided humans are poor at detecting deception. Although individual proficiency varies, even groups of trained humans such as police officers rarely perform above chance [22]. Many attempts have been made to construct empirical models of deception, relying on objective cues, but this is difficult because different studies produce conflicting results. In fact, no single cue is reliably indicative of deception across studies [8]. One reason for this is that different studies address the issue of deception in different situations, and the relevant cues in these situations may also be different. For example, DePaulo et al.'s meta-analysis found that effect sizes were different depending on the deceptive person's motivation, the amount of interactivity in the setting, and whether a social transgression was involved [21]. Meanwhile, computer-mediated communication appears to function along different lines from face-to-face communication, perhaps because each person has time to rehearse what they will say [61].
2.3 The Pennebaker Deception Model
The model of deception we use comes from Newman et al. [43]. The authors asked college students to speak or write, either deceptively or truthfully, about a variety of topics (their opinion on abortion, their feelings about their friends, and a mock theft). Then they analyzed all the truthful and deceptive statements with LIWC. Four linguistic signs of deception emerged across categories:
• First person singular pronouns decrease in the deceptive condition. Deceptive people have less personal experience with their subject matter than truthful people, and are less emotionally willing to commit to what they are saying, so they focus on and refer to themselves less [43].
• Exclusive words decrease in the deceptive condition. Such words introduce increased complexity into sentences. Deception causes a higher cognitive load than truthfulness because of the increased difficulty of monitoring a deceptive performance [21], which produces an increase in visible signs of effort [62], even when the deceptive person has had time to prepare [56].
• Negative emotion words increase in the deceptive condition, consistent with DePaulo et al.'s [21] meta-analysis. Deceptive people are thought to do this as a sign of unconscious discomfort [43]. Alternatively, increased emotion can be a sign of trying too hard to persuade the listener, as in Zhou et al. [61].
• Motion verbs (“go”, “run”) increase in the deceptive condition. This is the result of increased cognitive load and perhaps a need to keep the story moving and discourage second thoughts on the part of the reader/hearer.
Lowered self-reference is consistent with other models (e.g. Zhou et al. [59]) and persists regardless of motivation [27], but DePaulo et al. [21] and Zuckerman et al. [62], in their meta-analyses, did not find a reliable, statistically significant decrease in self-references across studies.
DePaulo et al. [21] point out that increases in emotion should be treated with caution. Most lies in the real world are “white lies”, told to smooth over social interactions, and people telling them experience very little distress [20]. Moreover, some truthful statements, such as confessions of wrongdoing, might lead to high levels of negative emotions. For similar reasons, this part of the model should not be applied to psychopaths, who experience no discomfort when breaking moral rules [47].
The effect of cognitive load should also be treated with caution: some white lies are easy to tell, and some potentially volatile truths may be as mentally effortful as a lie [21].
Since the original work, the Pennebaker model of deception has been extensively validated across a large number of populations. The words used in these criteria are summarized in Table 1.
The LIWC model performed significantly better than human raters in distinguishing truth from deception [43]. Gupta and Skillicorn [26] suggest that the LIWC model detects not only outright falsehood, but also “spin” or “persona deception”, in which a person does not make factually false statements, but consciously projects an image of themselves that they know to be inaccurate. Skillicorn and Leuprecht have used this reasoning to apply LIWC to the speeches of politicians [51].

Table 1: Words used in the LIWC deception model

First-person pronouns: I, me, my, mine, myself, I'd, I'll, I'm, I've
Exclusive words: but, except, without, although, besides, however, nor, or, rather, unless, whereas
Negative emotion words: hate, anger, enemy, despise, dislike, abandon, afraid, agony, anguish, bastard, bitch, boring, crazy, dumb, disappointed, disappointing, f-word, suspicious, stressed, sorry, jerk, tragedy, weak, worthless, ignorant, inadequate, inferior, jerked, lie, lied, lies, lonely, loss, terrible, hated, hates, greed, fear, devil, lame, vain, wicked
Motion verbs: walk, move, go, carry, run, lead, going, taking, action, arrive, arrives, arrived, bringing, driven, carrying, fled, flew, follow, followed, look, take, moved, goes, drive
2.4 Fine-Tuning the LIWC Model
Keila and Skillicorn [34] applied LIWC to a dataset consisting of internal emails at Enron in the period just before its fraudulent accounting scandal. They combined LIWC with singular value decomposition (SVD). SVD is a useful tool for eliciting correlations from multidimensional data. An SVD is a factorization of a real matrix A into three other matrices:

A = U S V^T

such that U and V^T are unitary matrices and S is a diagonal matrix [53]. The matrices U and V represent the data as vectors in a many-dimensional space. These axes appear in the matrices in decreasing order of size: the first rows of V^T represent the axes of maximal variation in the data. The nonzero values of the matrix S, meanwhile, correspond to the amount of variation along each axis. One can therefore truncate a set of SVD matrices at an arbitrary number of dimensions and be confident that the truncated version is the most faithful possible representation of A in the chosen number of dimensions. The S matrix values provide an indication of how much variation is being lost [52].
Reducing the number of dimensions in this manner not only compresses the data but elucidates correlations between different parts of it, because documents or variables that vary together will be represented as closely aligned vectors in the SVD space [19].
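The truncation described above is a few lines in any numerical library. The following sketch (NumPy, with random stand-in data rather than any dataset from this report) keeps the top k singular values and reports how much variation they retain:

```python
import numpy as np

# Illustrative sketch only: truncate an SVD to k dimensions and measure
# how much variation is retained. The matrix A here is random stand-in
# data, e.g. 20 documents by 6 word-category rates.
rng = np.random.default_rng(0)
A = rng.random((20, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                      # keep the two axes of maximal variation
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The singular values indicate how much variation each axis carries,
# so the retained fraction is the ratio of their squared sums.
retained = (s[:k] ** 2).sum() / (s ** 2).sum()

# A_k is the best rank-k approximation of A in the least-squares sense.
err = np.linalg.norm(A - A_k)
```

The truncated product A_k is what "closely aligned vectors in the SVD space" refers to: documents are compared by their coordinates in the first k rows/columns of U and V.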
Keila and Skillicorn [34] found that LIWC performed well when combined with SVD, because words within a given category (except exclusive words) were correlated with each other and formed sensible structures in semantic space. The top ranked emails by a geometrical measure in reduced-dimension semantic space were all deceptive. However, in such a large dataset, it is difficult to tell how many false negatives there are.
Skillicorn and Little [52] repeated the SVD-based analysis on transcripts from the Gomery commission, in which former Canadian government officials were examined regarding alleged corruption. They found that in this data, deception was associated with increases – not decreases – in first-person singular pronouns and exclusive words. Altering the model to look for increases in all categories, they produced results in rough agreement with media estimates of who was being deceptive and who was not. (Although ground truth for the Gomery data was not available, the most deceptive people according to the model were people saying they did not remember basic facts about their own employment, such as who they were working for. Meanwhile, the least deceptive people were witnesses called in to explain purely technical matters.) Hence the standard Pennebaker model does not appear to be effective in dialogue settings.
If deception decreases first-person pronoun use in some situations and increases it in others, this explains DePaulo et al.'s [21] and Zuckerman et al.'s [62] inability to find a significant effect for self-references across many studies. However, it raises the more vexing question of what causes these words to behave in this way. Skillicorn and Little [52] suggest that first-person singular pronouns and exclusive words increased with deception in the context of the Gomery commission because it was not an emotionally charged situation. As we shall see, there is a better explanation.
2.5 Forcing Word Use in Questions and Answers
There are two processes of interaction between the language of questions and the language of answers that alter the word use in both. The first is the technical requirements of the language itself; the second is the largely unconscious mimicry that participants in a conversation fall into.
In most languages, the function word patterns in questions force some corresponding structure into a responsive answer. For example, a question containing the phrase “Did you . . . ” requires either a first-person singular or first-person plural pronoun (“Yes, we did . . . ”; “No, I didn't”) or a passive verb. A respondent therefore does not have the same freedom of word use in a dialogue that they have in an unforced setting.
Two people in conversation, even if they are strangers, will imitate each other in everything from facial expression and body language [9] to volume [40], pitch [33], and speech rates [58]. People speaking to each other also mimic each other's words and phrases [36]. This mimicry is generally not conscious [9]. Subjects mimic phrases they have heard even when consciously trying not to [3] and when their conscious working memory is filled with a distractor task [36].
LIWC can be used to measure verbal mimicry at the level of categories of words, where it is called Linguistic Style Matching (LSM). LSM is measured by comparing LIWC counts on all function word categories between both partners in an interaction. This comparison can be done through product-moment correlation [44] or a weighted difference score [32].
The LSM metric is internally consistent: if a dyad matches in their use of one function word category, they will probably match to the same degree on all others. It also generalizes well across different contexts, from online chat conversations [44] to letters between well-known colleagues [32] and face-to-face discussions [44].
LSM appears to a greater or lesser degree in different circumstances. Some linguistic styles are more easily matched than others, and different people will match a given style with more or less ease [32]. However, while greater LSM is associated with greater social cohesion, the matching occurs even between strangers who dislike each other [44]. We would thus expect a degree of matching to occur in any context – even a courtroom interrogation.
3 Method and Results
The prompting effect we expect to see is a generalization of style matching because, unlike pure mimicry, we expect that some categories of question words prompt different categories of response words. This is particularly obvious in the case of pronouns. Our first hypothesis is therefore:
H1. The function words in an answer to a question are not independent of the question, but are prompted (positively or negatively) by the function words in the question. At least some of the words affected in this way will be relevant words from the Pennebaker deception model.
Our second hypothesis follows:

H2. If word frequencies from the LIWC model are affected by the words in the question, then this potentially distorts the results of the Pennebaker model. Removing the effects of the question from the answer before applying the LIWC model should result in greater accuracy.
Our plan of attack has three parts. First, we gather archival question-and-answer data and visualize any relevant changes in frequency that could be caused by H1. Second, we develop a method to correct for these changes. Third, although not every answer in our data can be labeled as entirely truthful or entirely deceptive, we use some straightforward methods to validate the corrections and to investigate whether they do, in fact, improve the separation between truthful and deceptive responses.
3.1 Datasets
We created three datasets by taking archival transcriptions of real-life, high-stakes question-and-answer interactions.
The Republican dataset comprises transcripts of each of the Republican primary debates leading up to the American presidential election of 2012. These involved a total of ten candidates and were televised and transcribed online by various news organizations [1, 4–7, 10–16, 18, 24, 30, 42, 46, 48, 57].
We used the Republican dataset to investigate overall patterns of interaction between question and answer words. We did not rate the Republican presidential candidates as
deceptive or non-deceptive, since there is no reliable estimate of ground truth about the candidates' honesty. Fact checking websites may show that specific statements are true or false, but they do not help with the issue of persona deception, and there is no reliable way to judge one candidate overall as more deceptive than another. However, we did judge the debates in general as a forum in which all candidates would be motivated towards persona deception as defined by Gupta and Skillicorn [26]. Presenting themselves and the facts in the most favorable possible light is essential to getting elected, and much of the time this favorable light does not correspond exactly with reality. The full Republican dataset contained 2118 question-answer pairs and 301,539 total words.
The Nuremberg dataset comprised selected examinations and cross-examinations of witnesses from the Nuremberg trials of 1945–1946. Most of these examinations were taken from the Trial of German Major War Criminals, transcribed at the Holocaust memorial website nizkor.org [29]. Two were instead taken from the Nuremberg Medical Trial, which is partially transcribed online at the website of the Harvard Law Library [41]. Unlike the Republican dataset, the Nuremberg dataset contained obvious subgroups with markedly different motivations towards deception. The first group, Defendants, contained two Nazi war criminals testifying in their own defense who were eventually found guilty on all counts and executed. These men were highly motivated towards deception: they were guilty, and their lives depended on convincing the tribunal that they were not. The second group, Untrustworthy witnesses, contained ten lower-ranking Nazis who were not themselves on trial. While this group was not at immediate risk of conviction, it seems reasonable to suppose that their accounts would be moderately deceptive. Most of them would be motivated to absolve themselves either by minimizing Nazi war crimes as a whole or by minimizing their own involvement. The third group, Trustworthy witnesses, contained nineteen survivors of Nazi war crimes who testified about those crimes. Fourteen of these were Holocaust survivors, while the other five were civilians who reported on more general conditions in Nazi-occupied countries. We did not consider any of these witnesses deceptive.
Some witnesses spoke at great length about their experiences, so we chose to truncate answers in the Nuremberg dataset at 500 words, matching the maximum size that occurred naturally in the Republican dataset. This affected less than 1 percent of the answers. The full Nuremberg dataset contained 4159 question-answer pairs (1355 from Defendants, 1826 from Untrustworthy witnesses, and 978 from Trustworthy witnesses). It contained a total of 311,099 words.
We also make brief use of a third dataset, Simpson. This contains depositions from the civil trial of O.J. Simpson for the wrongful death of his wife, Nicole Brown, and another man. Extensive transcripts of this material were available online [54]. Simpson contains Simpson's deposition in his own defense, which we considered deceptive, as well as the depositions of family and friends of the deceased, which we considered largely truthful. (We chose the civil trial rather than the criminal trial because it was the first time Simpson testified directly in his own defense; it also had the advantage of being a trial at which he was found liable.) Since Simpson's own deposition was three times as long as the rest of the dataset, we used only every third question-answer pair from him. The Simpson dataset contained 20,810
question and answer pairs, totalling 412,385 words.
3.2 Model Words
When processing this data, we are most interested in what occurs at the level of a single question and answer pair. Thus, we looked at “windows” in the data, which we defined as a single question followed by its answer. We recorded the length (in words) of each question and answer, along with the count of the number of words in these categories:
FPS First person singular pronouns. The Pennebaker model predicts that deceptive people should use a lower rate of first-person pronouns than truthful people. However, in the Gomery commission data, Little and Skillicorn [37] found that deceptive people used a higher rate.
but Lower rates of this and other exclusive words are expected in deception. The Pennebaker model puts “but” and “or” in the same exclusive words category, but Little and Skillicorn's results [37] suggest that the two words often have quite different properties in semantic space. We therefore counted “but” and “or” separately.
or Another exclusive word.
excl The other exclusive words from the model, for example, “unless” and “whereas”.
neg Negative emotion words. Higher rates of these are expected in deception.
action Action verbs. Higher rates of these are expected in deception.
FPP First person plural pronouns. These can be substituted for singular in some circumstances (the “royal we”), and a questioner using the “royal we” might encourage a respondent to do likewise. Alternatively, a respondent who is forced grammatically to use a first person pronoun might choose a plural one instead of a singular as a distancing technique. Similarly, if the questioner is referring to a group to which both they and the respondent belong, this might prompt the respondent to continue talking about this group. Focusing on a group leaves less time to focus on oneself, so we expected higher FPP rates in the question to prompt lower FPS rates in the answer.
SPP Second person pronouns. A question containing the word “you” is probably about the respondent, and the respondent is expected to respond by giving information about themselves. Thus, we expected higher SPP rates to prompt higher FPS rates in the answer.
“Wh” Who, what, when, where, why, and how. Questions containing these words prompt the respondent for a specific fact. Questions about specific facts should make it harder for the respondent to be evasive. We expected higher “wh” word rates to prompt higher rates of exclusive words due to an increase in cognitive complexity.
TPS Third person singular pronouns. These words are references to a person other than the questioner or respondent. Being asked about a third person should induce the respondent to talk about that person, and thus give them less opportunity to talk about themselves. When talking about another person – not oneself, and not the questioner – it might also be easier to make disparaging or negative remarks. We expected higher TPS rates to prompt lower FPS rates and higher negative emotion rates.
TPP Third person plural pronouns. We expected higher TPP rates to prompt lower FPS rates and higher negative emotion rates, for the same reason as above.
these An early analysis with a part-of-speech tagging program showed that rates of “these”, “those”, and “to” in the question were weakly correlated with rates of FPS in the answer. This early analysis did not yield other interesting results.
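A minimal sketch of turning one question or answer into per-category counts (our own illustration; the word lists here are abbreviated, hypothetical stand-ins for the full model categories, not the exact lists used in this report):

```python
# Sketch of per-window counting. Word lists are abbreviated stand-ins.
CATEGORIES = {
    "FPS": {"i", "me", "my", "mine", "myself"},
    "but": {"but"},
    "or": {"or"},
    "excl": {"except", "without", "although", "unless", "whereas"},
    "SPP": {"you", "your", "yours", "yourself"},
}

def window_counts(text: str) -> dict:
    """Count category words in one question or one answer."""
    words = text.lower().split()
    counts = {name: sum(w.strip(".,?!") in wordset for w in words)
              for name, wordset in CATEGORIES.items()}
    counts["length"] = len(words)  # needed later to form rate statistics
    return counts

q = window_counts("Did you see him, or were you without your keys?")
a = window_counts("I did, but I had left my keys at home.")
```

Dividing each count by the recorded length gives the rate statistics used in the next section.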
3.3 Gaussian Distributions
A 500-word answer with five action words is statistically and linguistically different from a 15-word answer with five action words. Therefore, using raw word counts in our statistical analysis would be inappropriate. For all our analyses, we divided the number of words of each type in each single question or answer by the total number of words it contains, giving a rate statistic for each type of word.
We expected the shortest questions and answers to be difficult to analyze. In a five-word answer with one “but”, the rate statistic for “but” would be 0.2 – unusually large. It isn't clear that the answer has all the properties associated with use of the word “but” to a greater degree than, say, a 16-word answer with one “but”, which would have a rate of only 0.0625. Based on this reasoning and some early inspection of the data, we began the analysis with a minimum window size of 50 words in both question and answer.
We separated answers into two categories for each pair of question word and response word. The “unprompted” category contained all answers where the matching question did not contain the relevant question word. We assume that rates in such answers reflect their natural rates without prompting. The “prompted” category contained all answers in which the relevant question word appeared at least once.
For each pair of question word and answer word in the prompted category, we counted the rates in all windows and fitted a two-dimensional Gaussian distribution to the data, using the FitFunc toolbox for Matlab. For the unprompted category we computed a one-dimensional Gaussian fitting the rates in the answers.
These Gaussian distributions provide an indicator of the relationship between the question word and the answer word pair. A distribution with a positive slope and a mean higher than that of the unprompted data indicates that the question word prompts higher rates of the answer word. A distribution with a negative slope and a mean lower than that of the unprompted data indicates that the question word prompts lower rates of the answer
word. A flat or near-spherical distribution with a mean close to that of the unprompted datasuggests that there is no relationship between the two word categories.
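As a sketch of this fitting step (using NumPy rather than the Matlab FitFunc toolbox the analysis actually used, and with synthetic stand-in data), a bivariate Gaussian fit reduces to estimating a mean vector and covariance matrix; the sign of the off-diagonal covariance term gives the slope of the fitted distribution:

```python
import numpy as np

# Sketch only: fit a bivariate Gaussian to (answer-word rate,
# question-word rate) pairs by maximum likelihood, i.e. the sample
# mean and covariance. Synthetic data stands in for real windows.
rng = np.random.default_rng(1)
q_rates = rng.uniform(0.0, 0.1, 200)                  # question-word rates
a_rates = 0.5 * q_rates + rng.normal(0, 0.005, 200)   # positively prompted

data = np.column_stack([a_rates, q_rates])  # x = answer, y = question
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# A positive off-diagonal term means the fitted Gaussian slopes upward:
# higher question-word rates go with higher answer-word rates.
slope_is_positive = cov[0, 1] > 0
```

A near-zero off-diagonal term, with a mean close to the unprompted mean, corresponds to the "no relationship" case described above.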
[Figure 1 appears here as nine panels of Gaussian distribution plots: (a) first-person plural pronouns prompting first-person singular pronouns; (b) second-person pronouns prompting first-person singular pronouns; (c) “wh” words prompting “or” (the other exclusive word categories looked similar); (d) third-person singular pronouns prompting first-person singular pronouns; (e) third-person plural pronouns prompting first-person singular pronouns; (f) “these”, “those”, and “to” prompting first-person singular pronouns; (g) first-person singular pronouns prompting “but”; (h) “but” prompting “or”; (i) third-person plural pronouns prompting action words.]

Figure 1: Gaussian distributions from the Republican dataset – minimum window size of 50 words. x-axis = rates for answer words, y-axis = rates for question words; the red line parallel to the y-axis shows the one-dimensional unprompted distribution out to 1 standard deviation.
Our initial estimates of the question word categories that would have significant prompting effects had mixed success. We did not find evidence of many of the relationships we had expected (Figure 1). However, we did find that higher rates of second-person pronouns in the question, as expected, were associated with higher rates of first-person singular pronouns in the answer. So were “these”, “those”, and “to”. With third-person plural pronouns, we found a weak effect in the opposite direction from what we expected: higher third-person plural pronoun rates prompted very slightly higher first-person singular pronoun rates. Although most pairs of question and answer word categories produced no effect, a few pairs for which we had no hypothesis also showed a prompting effect.
3.4 Data Correction
To remove the effect of prompting words, we carried out a series of affine transformations on the points in a bivariate Gaussian model in Cartesian space – producing corrected rates for each word category.
[Figure 2 appears here as three panels: (a) the original data; (b) steps 1 and 2: translated to a mean of (0, 0) and rotated; (c) steps 3 and 4: rescaled and translated back to a mean that matches the unprompted data.]

Figure 2: Steps in the correction process, illustrated using second-person pronouns in the question and first-person singular pronouns in the answer. The blue point represents a particular speech as it is transformed.
The correction method is illustrated in Figure 2. For each question-answer pair, we fit a bivariate Gaussian to the prompted data and translate it so the mean is at (0, 0). We then rotate the distribution using a standard rotation matrix until it is “flat” (one of its axes is parallel to y = 0). We use the smallest possible angle of rotation in either direction.
After that, we rescale the data on the y-axis so that its standard deviation in the y direction equals the standard deviation of the unprompted data, and translate it so that its mean returns to its original value in the x direction and is equal to the mean of the unprompted data in the y direction.
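A condensed sketch of these four steps, reconstructed from the prose above (our own NumPy reconstruction, not the authors' Matlab code; axis conventions follow Figure 1):

```python
import numpy as np

def correct_prompting(points, unprompted_mean, unprompted_std):
    """Sketch of the four correction steps: translate, rotate,
    rescale, translate back. Reconstructed from the report's
    description; not the original implementation.

    points: (n, 2) array of (x, y) window rates.
    """
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)

    # Step 1: translate the fitted distribution's mean to the origin.
    centered = pts - mean

    # Step 2: rotate by the smallest angle that makes the principal
    # axis of the fitted Gaussian parallel to y = 0.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]      # principal axis direction
    angle = np.arctan2(major[1], major[0])
    if angle > np.pi / 2:
        angle -= np.pi                          # choose the smaller rotation
    elif angle < -np.pi / 2:
        angle += np.pi
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = centered @ np.array([[c, -s], [s, c]]).T

    # Step 3: rescale y so its spread matches the unprompted data.
    y_std = rotated[:, 1].std()
    if y_std > 0:
        rotated[:, 1] *= unprompted_std / y_std

    # Step 4: translate back - original mean in x, unprompted mean in y.
    rotated[:, 0] += mean[0]
    rotated[:, 1] += unprompted_mean
    return rotated
```

After this transformation the prompted cloud has no slope, and its spread and mean in the corrected direction match the unprompted distribution.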
We have included a blue dot in Figure 2 representing an example window (in this case, the respondent is Newt Gingrich) so that it can be traced through each step of the process. In the uncorrected data, Gingrich is heavily prompted and responds with a number of first-person singular pronouns that is about average for the data as a whole, but much less than what might be expected given the general trend towards increased first-person singular pronouns with more prompting. After the data is corrected, Gingrich's first-person singular pronoun rate is low, as expected.
During this process, some windows can be assigned slightly negative values. Obviously it is impossible for a person to say fewer than zero words in response to a question. We treat the negative values as very emphatic zeroes, but do not correct them until the end of the process, since a subsequent correction for another word pair might return them to positive values.
We performed these transformations for each question-answer pair in the data, feeding the output from one transformation into the next so that, for each response word, a correction is made for each prompting word in turn. For question-answer pairs where the prompting word had negligible effect on the response word rate, the effect of this correction would also be negligible. There was a risk of these small effects producing slight noise in the data, but we accepted this risk because we did not wish to choose an arbitrary threshold separating effects that counted from effects that did not.
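The sequential application, with negative values clamped only at the very end, might look like the following sketch. The per-pair correction functions here are hypothetical placeholders, not the report's actual corrections.

```python
import numpy as np

def apply_corrections(answer_rates, corrections):
    """Apply a chain of per-pair corrections, feeding the output of one into
    the next, and clamp negative rates to zero only at the very end, so a
    later correction may return an intermediate negative to a positive value."""
    rates = np.asarray(answer_rates, dtype=float).copy()
    for correct in corrections:      # one correction per prompting word
        rates = correct(rates)
    return np.maximum(rates, 0.0)    # remaining negatives become "emphatic zeroes"

# Toy chain of two illustrative shifts; the intermediate negative value for the
# second window is preserved until the final clamp.
out = apply_corrections([0.02, 0.001],
                        [lambda r: r - 0.005, lambda r: r + 0.002])
```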
Examples of the changes to rate statistics are shown in Figure 3. It is useful to see how large a change in word frequencies such a correction represents. On average, an absolute value of three words was added to or removed from the FPS category in each window; at an average window size of 188 words, about six of which on average were FPS, this means that about half of these FPS pronouns, according to our model, are caused by prompting. "But" and action words changed by an average of one word per window, but some windows change in a positive direction and some in a negative one, so the average signed change is much less than one word per window. The other categories, on average, were barely changed.
To examine the sensitivity of the correction method, we reran the corrections for each category of answer words after removing the 2.5% highest rates and the 2.5% lowest rates for each category (5% of the data in total). In this restricted analysis, the average absolute value of the difference after correction was still about three words for FPS and about one word for action words. However, the average change in FPS value was reduced to only two words, and the effect on "but" disappeared. This suggests that the correction method is somewhat sensitive to outliers, but that its results for FPS and action words are largely valid.
3.5 Window Sizes
In many settings, the available window sizes are much smaller than 50 words. For example, in the Simpson depositions, the average question and answer together contain fewer than 20 words, and vanishingly few met the criterion of having 50 words or more in both the question and the answer. So we investigated ways to apply our model to smaller windows.
We experimented with the Republican dataset, using versions with smaller minimum window sizes: 30, 10, and 1. We also tried merging adjacent questions and answers, provided that they involved the same questioner and respondent, repeating this until all windows either met the minimum size for both question and answer or could not be merged because no adjacent windows remained. We then removed any windows that still did not meet the minimum size. We call these composite windows.
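The merging procedure can be sketched as follows. The window representation (questioner, respondent, and word lists) is an assumed structure for illustration, not the report's actual data format.

```python
def make_composites(windows, min_size):
    """Greedily merge adjacent question-answer windows that share the same
    questioner and respondent until both sides meet min_size; windows that
    still fall short at the end are dropped. Each window is a dict with keys
    'questioner', 'respondent', 'q_words', 'a_words' (assumed structure)."""
    merged = []
    for w in windows:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["questioner"] == w["questioner"]
                and prev["respondent"] == w["respondent"]
                and (len(prev["q_words"]) < min_size
                     or len(prev["a_words"]) < min_size)):
            # Extend the still-too-small composite with the adjacent window.
            prev["q_words"] = prev["q_words"] + w["q_words"]
            prev["a_words"] = prev["a_words"] + w["a_words"]
        else:
            merged.append(dict(w))
    # Discard composites that still do not meet the minimum on both sides.
    return [w for w in merged
            if len(w["q_words"]) >= min_size and len(w["a_words"]) >= min_size]

# Toy exchange: two short windows from the same pair merge; the final window
# has a long question but a too-short answer and is discarded.
windows = [
    {"questioner": "Q1", "respondent": "R1",
     "q_words": ["a"] * 3, "a_words": ["b"] * 4},
    {"questioner": "Q1", "respondent": "R1",
     "q_words": ["c"] * 3, "a_words": ["d"] * 3},
    {"questioner": "Q2", "respondent": "R1",
     "q_words": ["e"] * 10, "a_words": ["f"] * 2},
]
composites = make_composites(windows, 5)
```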
(a) Minimum window size 50
(b) Minimum window size 30
(c) Minimum window size 10
(d) Composites totaling size 50

Figure 3: Color maps showing the average change in answer rates as the result of corrections for each question-and-answer pair at each minimum window size. Prompting words are the columns and response words the rows; bright colors indicate that question words force lower rates of answer words, so answer word rates are corrected upward; dark colors indicate the opposite. All maps are on the same color scale.
We then applied the corrections, measured the average change induced at each stage of the correction, and compared it to the average change in the original data. We created color maps that show the average correction to each question-answer pair (Figure 3). The patterns in 50-word windows degrade as the minimum window size is lowered, particularly below 30. The most notable degradation happened in the most promising question-answer pair: second-person pronouns prompting first-person singular pronouns. The composite windows also showed slight degradation, comparable to that in the 30-word windows.
Table 2 shows how much of each dataset is usable at each window size. For Nuremberg and Simpson, small windows predominate, and there are many stretches where one individual answered many questions in a row, making them particularly amenable to the use of composite windows. With Nuremberg and Simpson, the composite window technique allowed analysis of much more data than 30-word windows. Furthermore, it caused less degradation than the small window sizes which would otherwise have been needed to cover this much data. In Republican, windows tended to be larger, so the effect was less pronounced, but
        Republican            Nuremberg             Simpson
        Q         A           Q         A           Q         A
Full    65,480    236,411     109,551   201,548     219,262   193,123
>= 10   55,161    159,450     75,313    155,395     45,162    71,357
>= 30   43,876    107,836     31,452    51,735      3,280     4,967
>= 50   33,193    72,078      15,394    22,366      323       350
Comp    44,759    101,211     105,423   170,706     219,197   192,537

Table 2: Number of words that could be included for each minimum window size.
the composite window technique still gave coverage and degradation comparable to those of the 30-word windows. For these reasons, we performed the rest of our analysis (in all three datasets) with composite windows.
3.6 Validation: Nuremberg and SVD
We expected corrected data to show a sharper distinction than the original data between the truthful and the deceptive. Using composite windows, we encoded the Nuremberg data and checked that it had similar properties to the Republican data. The magnitude of correction in FPS and action words was quite similar in the two datasets, although the effect on "but" in the Nuremberg data was much smaller.
We performed singular value decomposition on the answer data before and after performing the correction on the Nuremberg dataset. This reduced the 6-category deception model to 3 dimensions. We then made a scatter plot showing each window in the resulting semantic space, expecting an increase in the distance between truthful and deceptive subgroups.
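A minimal sketch of this reduction, assuming rows are windows and columns are the six response-word category rates (our illustration of the standard technique, not the report's code):

```python
import numpy as np

def svd_reduce(rate_matrix, k=3):
    """Project windows (rows) described by six response-word category rates
    (columns) into a k-dimensional semantic space via singular value
    decomposition of the mean-centered matrix."""
    X = np.asarray(rate_matrix, dtype=float)
    X = X - X.mean(axis=0)                  # center so the SVD captures variation
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]                 # coordinates of each window

# Toy matrix: 100 windows, six word-category rates.
coords = svd_reduce(np.random.default_rng(2).random((100, 6)))
```

Each window then has three coordinates, ordered by decreasing variance, which can be scatter-plotted and colored by subgroup as in Figure 4.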
Figure 4 shows the result of singular value decomposition on the Nuremberg data before and after correction. The effect is the opposite of what was expected: the correction erases most of the spatial distinction that existed between the groups. This suggests that, rather than being a confounding effect that should be removed to improve the detection of deception, the prompting effect contains important information about deception.
3.7 Validation: Subgroups in the Nuremberg Data
We analyzed each of the three Nuremberg subgroups separately, applying a correction to each. Figure 5 shows color maps of this data. The differences between Defendants and Untrustworthy witnesses (both Nazi subgroups) are not large, but the response patterns of the Trustworthy witnesses subgroup are quite different.
Note that these color maps are on the same scale as the earlier Republican color maps. We chose to present them this way in order to make comparisons easy across the different color maps. However, by using a scale that works for the Republican data, we have obscured an important fact about the Nuremberg data: namely, that the average change in first-person singular pronouns prompted by second-person pronouns, for both the Defendants
(a) Before correction
(b) After correction

Figure 4: Singular value decomposition of the responses in the Nuremberg dataset. Defendants are marked in red, Trustworthy witnesses in blue, and Untrustworthy witnesses in green. Before correction, the Trustworthy witnesses are concentrated on one side of the semantic space, but this is less the case after the correction.
and Untrustworthy witnesses, is literally off the scale. Both of these are more than twice the maximum amount shown by the color map; the Defendants' change is somewhat larger than that of the Untrustworthy witnesses. (Since we were using composite windows for this, the Republican average change is also slightly higher than it was in the previous color maps, which were made with 50-word minimum windows.) Figure 6 shows this more clearly. (This does not contradict our earlier claim that first-person pronoun corrections, in words per window, were similar in both the Nuremberg and Republican datasets. The Nuremberg data had smaller windows, and it contained subgroups with both much higher and much lower rate statistics, on first-person pronouns, than the Republican data.)
Seeing these large differences prompted us to look at individual distributions more closely. We overlaid the distribution from one subgroup onto the corresponding distributions for the other subgroups. This confirmed that different subgroups responded to prompting differently. They were not simply being prompted at different rates, or showing different rates of response independently of the prompt: many question-word/response-word pairs showed completely different distributions. In general, Defendants and Untrustworthy witnesses, the two deceptive groups, were similar, but the Trustworthy witnesses group was different. The strongest effects tended to involve first-person pronouns or action words. Four of the most striking sets of distributions are shown in Figure 7.
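Overlaying subgroup distributions amounts to fitting a bivariate Gaussian per subgroup and comparing the fitted parameters. A minimal sketch, with hypothetical toy subgroups that respond to prompting at different strengths:

```python
import numpy as np

def subgroup_gaussians(points, labels):
    """Fit a bivariate Gaussian (mean vector and covariance matrix) to the
    (question rate, answer rate) points of each subgroup, so the resulting
    distributions can be overlaid and compared (illustrative interface)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return {group: (points[labels == group].mean(axis=0),
                    np.cov(points[labels == group].T))
            for group in np.unique(labels)}

# Toy subgroups: deceptive speakers respond to prompting much more strongly.
rng = np.random.default_rng(5)
q = rng.uniform(0.0, 0.1, 400)
truthful = np.column_stack([q[:200],
                            0.2 * q[:200] + rng.normal(0, 0.005, 200)])
deceptive = np.column_stack([q[200:],
                             0.8 * q[200:] + rng.normal(0, 0.005, 200)])
points = np.vstack([truthful, deceptive])
labels = np.array(["truthful"] * 200 + ["deceptive"] * 200)
fits = subgroup_gaussians(points, labels)
```

Plotting the density contours of each fitted Gaussian on common axes then reproduces the kind of comparison shown in Figure 7.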
The prompting effect must be receiver-state-dependent: deceptive individuals respond to prompting one way, and truthful individuals another. One of the particularly strong differences involves second-person pronouns prompting first-person singular pronouns, which was already one of the strongest effects. Correcting for prompting reduced, rather than increased, the difference between these groups, because the prompting itself is a factor that
(a) Defendants
(b) Untrustworthy
(c) Trustworthy

Figure 5: Corrections for the three Nuremberg subgroups analyzed separately.
distinguishes them from each other. Rather than having a base rate of word use (due to deception or lack thereof) which is modulated by prompting, the different subgroups actually experience different kinds of prompting, which means that paying attention to the way in which prompting occurs ought to further elucidate differences between them.
These results explain the success of the ad hoc coding scheme used by Little and Skillicorn. Without knowledge of the rates of prompting words in questions, deceivers appear to increase their rates of first-person singular pronouns and exclusive words relative to less deceptive respondents. In fact, similar levels of prompting in questions to both groups produce these differentiated responses.
Figure 6: Average change in rates of first-person singular pronouns prompted by second-person pronouns in the Republican and Nuremberg datasets (Republican, Defendants, Untrustworthy, and Trustworthy), all using composite windows.
3.8 Validation: Random Forests
Does information in the question actually improve the detection of deception, or is that information present in the answer but in some way not captured by the Pennebaker model? We used random forests [2] to estimate the significance of question words. A random forest not only classifies data but estimates the importance of each attribute in the data by counting how often each attribute is selected to act as the split point for an internal node of a decision tree in the forest.
We trained two random forests: one with the rates of only the six response word categories in the answers, and one with these plus the rates of all the stimulus words in the questions. For simplicity, we included only the Defendants and Trustworthy witnesses subgroups.
Because not every statement by a Defendant is a lie, we did not expect high accuracy from either of these forests. Rather, we wanted to compare them to each other to see if the question words made a difference. If both questions and answers were predictive of deception, then the model that uses both should make better predictions, and it should select the words that appear the most promising (i.e., first-person singular pronouns in answers and second-person pronouns in questions). If only the words in answers were relevant to deception, then both models should perform about the same.
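The two-forest comparison can be sketched on synthetic data. This uses scikit-learn as an assumed implementation (the report cites Breiman's random forests but does not name one), and the toy data is constructed so that the answer rate alone carries little signal while deceptive and truthful speakers respond to question prompting in opposite ways, mimicking the receiver-state-dependent effect described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for labeled windows (assumed layout, not the real data).
rng = np.random.default_rng(3)
n = 600
deceptive = rng.integers(0, 2, n)            # 1 = deceptive window
q_rate = rng.uniform(0.0, 0.1, n)            # e.g. second-person pronouns in question
a_rate = np.where(deceptive == 1,
                  0.08 - 0.6 * q_rate,       # deceptive response to prompting
                  0.02 + 0.6 * q_rate)       # truthful response to prompting
a_rate = a_rate + rng.normal(0.0, 0.01, n)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
acc_answers = cross_val_score(forest, a_rate.reshape(-1, 1),
                              deceptive, cv=5).mean()
acc_both = cross_val_score(forest, np.column_stack([q_rate, a_rate]),
                           deceptive, cv=5).mean()
```

Because the two groups' answer-rate marginals are nearly identical by construction, the answers-only forest hovers near chance, while the forest given both question and answer rates can exploit their interaction.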
Figure 8 shows the performance of both random forests. While the answer-words-only random forest performs above chance, it shows a strong bias: its accuracy on Trustworthy witnesses is much lower than its accuracy on Defendants; that is, it tends to classify windows of both kinds as Defendants. Adding the question words improves overall accuracy by more than 10 percentage points, an even more dramatic result than we expected. Moreover, it reduces bias by an even larger amount, producing an improvement on the Trustworthy witnesses without reducing the accuracy on Defendants.
Table 3 shows the results of attribute ranking with the best-performing random forest.
(a) Second-person pronouns prompting first-person singular pronouns
(b) "Or" prompting first-person singular pronouns
(c) First-person plural pronouns prompting "but"
(d) "Or" prompting action words

Figure 7: Gaussian distributions for the Nuremberg dataset, separated by subgroup. Defendants are red, Untrustworthy witnesses are magenta, and Trustworthy witnesses are blue. Not all of the figures are on the same scale.
Figure 8: Prediction accuracy of random forests trained on the Nuremberg data (chance vs. answers only vs. questions and answers; Nuremberg overall, Defendants, and Trustworthy).
The most influential words in the question-and-answer random forest were as predicted: first-person singular pronouns in the answer and second-person pronouns in the question, with similar frequencies. It is likely that the interaction between these two categories drove many of the decision trees in the forest.
Word category                            Splits
A  first-person singular pronouns        10771
Q  second-person pronouns                10563
Q  "wh" words                             9581
Q  "these", "those", and "to"             9277
Q  first-person singular pronouns         6839
A  "but"                                  6584
A  action words                           6581
Q  "or"                                   4934
A  "or"                                   4692
Q  action words                           4165
Q  third-person plural pronouns           3973
Q  first-person plural pronouns           3641
A  misc exclusive words                   3570
Q  third-person singular pronouns         3383
Q  "but"                                  3119
A  negative emotion words                 2254
Q  misc exclusive words                   1538
Q  negative emotion words                  871

Table 3: Word categories in the random forest trained with question-and-answer data, ranked by the number of splits in the model that use each word.
After the top two spots, the three next most important word categories were all question words, suggesting that the forest not only supplemented its reasoning with question words but actually made more decisions based on question words than on answer words. However, a small overrepresentation of question words is to be expected, given that we are counting a larger number of word categories in the question than in the answer. Also, in some cases a question word may give context to an answer word and thus needs to be considered first.
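The split-counting behind Table 3 can be sketched as follows, using scikit-learn's tree internals (an assumed implementation; the report does not name one). In the toy example the label depends only on the first feature, so it should dominate the counts.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def split_counts(forest, feature_names):
    """Rank attributes as in Table 3: count how often each feature is chosen
    as the split point of an internal node, across every tree in the forest."""
    counts = Counter()
    for tree in forest.estimators_:
        feats = tree.tree_.feature           # scikit-learn marks leaves with -2
        for f in feats[feats >= 0]:          # keep only internal-node splits
            counts[feature_names[f]] += 1
    return counts.most_common()

# Toy forest on three features; only the first determines the label.
X = np.random.default_rng(4).random((200, 3))
y = (X[:, 0] > 0.5).astype(int)
rf = RandomForestClassifier(n_estimators=10, max_features=None,
                            random_state=0).fit(X, y)
ranked = split_counts(rf, ["A FPS", "Q SP", "Q wh"])
```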
3.9 Validation: the Simpson dataset
Our final task was to see whether these validations generalized. We looked at the Simpson dataset and performed the same analysis.
The SVD results from Nuremberg generalized to the Simpson data: while a modest visual distinction existed in semantic space between Simpson and the Plaintiffs, applying our correction method reduced this distinction (Figure 9). The average change in words per window for each answer word category was very similar to that of the Nuremberg data.
The distinction between subgroups did not generalize as well. While Simpson's deposition was somewhat different from those of the Plaintiffs (Figure 10), the two color maps did not differ in the same ways that the Nuremberg color maps differed. The picture from the overlaid subgroup distributions for individual question-answer pairs (Figure 11) was also different: there was
(a) Before correction
(b) After correction

Figure 9: Singular value decomposition of responses in the Simpson dataset before and after correction, with O.J. Simpson's responses in red and Plaintiffs in blue. The difference is very small.
(a) Simpson
(b) Plaintiffs

Figure 10: Corrections for the two O.J. Simpson subgroups analyzed separately.
evidence of differences between the two groups, as there had been with Nuremberg, but not in quite the same way; they tended to be slightly weaker. Moreover, the strongest differences between subgroups in the Simpson data did not usually correspond with the strongest differences in the Nuremberg data. In particular, the relationship between second-person pronouns in the question and first-person singular pronouns in the answer appeared much weaker between subgroups: Simpson used more first-person singular pronouns but was also
(a) Second-person pronouns prompting first-person singular pronouns
(b) "Or" prompting first-person singular pronouns
(c) Second-person pronouns prompting "but"
(d) Third-person singular pronouns prompting action words

Figure 11: Gaussian distributions for the Simpson dataset, separated by subgroup. Simpson is red and Plaintiffs are blue. Not all of the figures are on the same scale.
prompted more.
Figure 12: Prediction accuracy of random forests trained on the Simpson data (chance vs. answers only vs. questions and answers; Overall, Simpson, and Plaintiffs).
Finally, we constructed the same two random forests using the Simpson data. The random forest trained with both question and response words showed an increase in accuracy, but a smaller one than that for the Nuremberg data (Figure 12). In general, the Simpson data supports the view that prompting effects exist, but it is less clear where the prompting effects actually are.
4 Discussion
At present, neither humans nor computers are particularly good at detecting deception. Models that perform with 70% accuracy (as in Mihalcea and Strapparava's LIWC-based study [38]) are considered promising. But we can hardly afford for three in ten accused criminals to be falsely classified as deceptive. We also cannot rely on our own human ability to detect deception, since that ability is quite weak. Improvements to existing deception models are desperately needed.
We have increased the understanding of the processes that occur when a deceptive person is questioned. A prompting effect influences the rates of word usage in answers to questions, but the effect is receiver-state-dependent.
We developed a method to correct for prompting effects and so remove their influence. This method contains two components: a corrected representation of an utterance, and the size of the corrective change that is made. Each of these components might be useful in different situations. In our setting, the corrected representation was not useful because it removed some of the distinction between deceptive and truthful subgroups. However, tracking the size of the change was useful, as it allowed us to see at a glance the similarities and differences in prompting effects between different subgroups.
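As an illustration only, the two components can be sketched as follows, assuming per-utterance word-category rate vectors for questions and answers and a linear model of the prompting effect. The variable names, random data, and least-squares fit are assumptions for the sketch, not the report's actual procedure.

```python
import numpy as np

# Hypothetical setup: one row per question-answer pair, one column per
# word category. The data here is random and purely illustrative.
rng = np.random.default_rng(0)
Q = rng.random((50, 10))   # question word-category rates
A = rng.random((50, 10))   # answer word-category rates

# Assume a linear prompting effect: question rates predict answer rates.
W, *_ = np.linalg.lstsq(Q, A, rcond=None)

# Component 1: a corrected representation of each answer
# (the answer minus the contribution predicted from the question).
A_corrected = A - Q @ W

# Component 2: the size of the corrective change made to each answer.
change_size = np.linalg.norm(Q @ W, axis=1)

print(A_corrected.shape, change_size.shape)  # → (50, 10) (50,)
```

Keeping both outputs separate mirrors the finding above: the corrected representation may discard useful signal, while the magnitude of the correction is itself informative.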
The way in which respondents react to the prompting effect is, in itself, a cue to deception. We support this conclusion by building random forests to classify deceptive and truthful subgroups, finding that the random forests performed substantially better when they were given both question data and answer data. In this situation, the random forests performed at 70–80% overall accuracy.
This suggests a number of avenues for ongoing research in deception. First, as the random forest results show, paying attention to both questions and answers will improve word-based models – even without a detailed understanding of what the effect of question language is. Other classification approaches to deception will almost certainly benefit from including the words of questions in the data.
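A minimal sketch of this comparison, using scikit-learn's RandomForestClassifier on synthetic word-category rates: the data, feature construction, and labels below are invented for illustration (the label is made to depend on a question-answer interaction, mimicking a receiver-state-dependent prompting effect), and this is not the report's actual experimental setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
answers = rng.random((n, 10))     # answer word-category rates (synthetic)
questions = rng.random((n, 10))   # question word-category rates (synthetic)

# Invented labels that depend on how a question category and an answer
# category interact, so the answer alone carries only part of the signal.
labels = (answers[:, 0] * questions[:, 0] > 0.25).astype(int)

# Forest trained on answer words only.
answers_only = cross_val_score(
    RandomForestClassifier(random_state=0), answers, labels, cv=5).mean()

# Forest trained on both question and answer words.
both = cross_val_score(
    RandomForestClassifier(random_state=0),
    np.hstack([questions, answers]), labels, cv=5).mean()

print(f"answers only: {answers_only:.2f}, questions+answers: {both:.2f}")
```

With labels constructed this way, the combined feature set lets the forest model the interaction directly, which is the mechanism the report's results point to.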
Beyond this, these results suggest what we ought to look for in the questions. We know that the prompting effect is receiver-state-dependent. So tracking the nature of the prompting effect across different groups of respondents in similar situations should illuminate differences between respondents. We have made a start at showing which relationships between words are most useful in such analysis – in particular, second-person pronouns prompting first-person singular pronouns.
Deception is not the only setting in which first-person pronouns are relevant. Any variable studied using bag-of-words approaches might be assessed in dialogue settings. In any such setting, it seems probable that paying attention to both question words and answer words will improve accuracy. It may be the case that people of different personalities, sexes, or ages respond to prompting differently – which means that taking question words into account will improve our ability to profile people in dialogues, for example, if a participant's identity in a computer-mediated chat setting is uncertain. Meanwhile, in the study of relationships – useful, for example, in analysis of social networks – there may be rich and interesting effects to uncover from the association between relationship style or quality and the way the people involved respond to each other's prompting. On the other hand, there may be settings of this nature in which all the interesting subgroups respond to prompting in the same way. In that case, our method for removing the influence of the question may prove useful.
4.1 Limitations
Our analysis is biased towards common word categories. The "average change" metric measures not only the strength of a relationship between a pair of word categories but how frequently that relationship actually appears. We defend this choice of metric by pointing out that the more a model relies on common words, the more applicable it will be to small windows.
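A toy calculation makes the bias concrete: with invented numbers, a weak relationship involving a common prompt category can yield a larger average change than a strong relationship involving a rare one, because the average is taken over all utterances.

```python
# All values here are invented for illustration, not taken from the report.
strong_effect = 0.9   # per-occurrence effect of a rare prompt category
rare_rate = 0.01      # fraction of utterances in which it occurs

weak_effect = 0.1     # per-occurrence effect of a common prompt category
common_rate = 0.30    # fraction of utterances in which it occurs

# Averaging over all utterances multiplies strength by frequency.
avg_change_rare = strong_effect * rare_rate      # 0.009
avg_change_common = weak_effect * common_rate    # 0.03

print(avg_change_rare < avg_change_common)  # → True
```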
There are almost certainly important words that do not appear in this analysis. The Pennebaker model has been extensively validated, but other bag-of-words models have emerged. Hauch et al.'s recent meta-analysis [28] supports all the categories of the Pennebaker model but suggests several other cues to deception: increased positive and overall emotion words (as Zhou et al. found [60]), increased negations ("not", "never"), decreased third-person pronouns, and slightly decreased tentative and time-related words.
The use of bag-of-words implies the treatment of words as forms without any semantics. There are therefore potential confounds associated both with polysemy, and with the use of function words in a stylized, rather than active, way (for example, whether "thank you" is considered to contain an active second-person pronoun or not).
The prompting effects in the Simpson dataset were somewhat different from those in the Nuremberg dataset, and generally smaller. The random forest model also showed a smaller improvement when question data was added. Intuitively, either the Simpson dataset underrepresents the difference between truthful and deceptive testimony, or the Nuremberg dataset overrepresents it – or both. This illustrates the difficulty of comparing deceptive and truthful people across different contexts. In the Simpson dataset, it is plausible that the data underrepresents the difference between deception and truthfulness because the Plaintiffs were possibly not entirely truthful. They had a stake in the outcome, and were possibly motivated towards persona deception. Despite DePaulo et al.'s recommendations, little research has been done on the communication of people who are probably truthful, but highly motivated to "spin" the truth. Until such research is done, applying deception models to civil suits remains problematic.
In the Nuremberg dataset, the difference between truthfulness and deception may be exaggerated because of other differences between the Nazis and the Trustworthy witnesses. These groups generally spoke different first languages and were asked different types of questions; the Trustworthy witnesses were often given perfunctory cross-examination. Care must be taken in interpreting these differences in questions. As Hancock et al. [27] discovered, deception is a process involving both the questioner and the respondent. If truthful and deceptive respondents are questioned differently, it may partly be because the questioner is biased – or it may be that the questioner is responding unconsciously to deceptive speech patterns. To make matters worse, it may be both. As for demographic differences such as language and nationality, these cannot be ruled out as a potential confound, but their influence is probably small: researchers such as Fornaciari et al. [23], who tried to restrict their datasets to remove such confounds, found that doing so did not increase the accuracy of their models. Nevertheless, the testimony of many of these witnesses was translated, and this may have distorted the language patterns, and certainly distorted the timing of questions and responses.
Any deception model used in the field will be faced with this type of confound: it will be used on men and women, people from different cultures, people experiencing different emotions and confronted by different kinds of questioning, even people with mental illnesses or disabilities that affect their word choice. No deception model should be considered usable for practical purposes unless it has been tested across all these different kinds of categories.
References
[1] American Broadcasting Company. Full transcript: ABC News Iowa Republican debate, December 11 2011. Accessed in winter 2012 at http://abcnews.go.com/Politics/full-transcript-abc-news-iowa-republican-debate/.
[2] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[3] A.S. Brown and D.R. Murphy. Cryptomnesia: Delineating inadvertent plagiarism. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(3):432–442, 1989.
[4] Cable News Network. Full transcript of CNN-Tea Party Republican debate, 20:00-22:00, September 12 2011. Accessed in fall 2011 at http://transcripts.cnn.com/TRANSCRIPTS/1109/12/se.06.html.
[5] Cable News Network. Republican debate, June 13 2011. Accessed in fall 2011 at http://transcripts.cnn.com/TRANSCRIPTS/1106/13/se.02.html.
[6] Cable News Network. Full transcript of CNN Arizona Republican presidential debate, February 22 2012. Accessed in winter 2012 at http://archives.cnn.com/TRANSCRIPTS/1202/22/se.05.html.
[7] Cable News Network. Full transcript of CNN Florida Republican Presidential debate, January 26 2012. Accessed in winter 2012 at http://archives.cnn.com/TRANSCRIPTS/1201/26/se.05.html.
[8] J.R. Carlson, J.F. George, J.K. Burgoon, M. Adkins, and C.H. White. Deception in computer-mediated communication. Group Decision and Negotiation, 13:5–28, 2004. doi:10.1023/B:GRUP.0000011942.31158.d8.
[9] T.L. Chartrand and R. van Baaren. Human mimicry. Advances in Experimental Social Psychology, 41, 2009.
[10] The Chicago Sun-Times. CNN Republican debate, Nov. 22, 2011. Transcript. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/11/cnn_republican_debate_nov_22_2.html.
[11] The Chicago Sun-Times. CBS/National Journal GOP debate. Transcript, video, November 13 2011. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/11/_cbsnational_journal_gop_debat.html.
[12] The Chicago Sun-Times. CNBC Republican debate. Transcript, video highlights, November 9 2011. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/11/cnbc_republican_debate_transcr.html.
[13] The Chicago Sun-Times. Republican Las Vegas CNN debate: Transcript, October 19 2011. Accessed in fall 2011 at http://blogs.suntimes.com/sweet/2011/10/republican_las_vegas_cnn_debat.html.
[14] The Chicago Sun-Times. GOP NH ABC/Yahoo News debate: Transcript, January 8 2012. Accessed in winter 2012 at http://blogs.suntimes.com/sweet/2012/01/gop_nh_abcyahoo_news_debate_tr.html.
[15] The Chicago Sun-Times. GOP NH NBC's Meet the Press/Facebook debate: Transcript, January 8 2012. Accessed in winter 2012 at http://blogs.suntimes.com/sweet/2012/01/gop_nh_nbcs_meet_the_pressface.html.
[16] The Chicago Sun-Times. South Carolina GOP CNN debate, Jan. 19, 2012. Transcript, January 20 2012. Accessed in winter 2012 at http://blogs.suntimes.com/sweet/2012/01/south_carolina_gop_cnn_debate_.html.
[17] C. Chung and J. Pennebaker. The psychological functions of function words. In K. Fiedler, editor, Social Communication, pages 343–359. New York: Psychology Press, 2007.
[18] Council on Foreign Relations. Republican Debate Transcript, Tampa, Florida, January 2012. Accessed in winter 2012 at http://www.cfr.org/us-election-2012/republican-debate-transcript-tampa-florida-january-2012/p27180.
[19] S. Deerwester, S.T. Dumais, G.W. Furnas, and T.K. Landauer. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.
[20] B.M. DePaulo, D.A. Kashy, S.E. Kirkendol, M.M. Wyer, and J.A. Epstein. Lying in everyday life. Journal of Personality and Social Psychology, 70(5):979–95, 1996.
[21] B.M. DePaulo, J.J. Lindsay, B.E. Malone, L. Muhlenbruck, K. Charlton, and H. Cooper. Cues to deception. Psychological Bulletin, 129:74–118, 2003.
[22] P. Ekman and M. O’Sullivan. Who can catch a liar? American Psychologist, 46(9):913–920, September 1991.
[23] T. Fornaciari and M. Poesio. On the use of homogenous sets of subjects in deceptive language analysis. In Proceedings of the Workshop on Computational Approaches to Deception Detection, 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 39–47, April 23 2012.
[24] Fox News. Complete text of the Iowa Republican debate on Fox News channel, August 12 2011. Accessed in fall 2011 at http://foxnewsinsider.com/2011/08/12/full-transcript-complete-text-o