
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 309-319, Portland, Oregon, June 19-24, 2011. ©2011 Association for Computational Linguistics

    Finding Deceptive Opinion Spam by Any Stretch of the Imagination

    Myle Ott Yejin Choi Claire Cardie

Department of Computer Science, Cornell University, Ithaca, NY 14853

    {myleott,ychoi,cardie}@cs.cornell.edu

    Jeffrey T. Hancock

Department of Communication, Cornell University, Ithaca, NY 14853

    [email protected]

    Abstract

Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam: fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.

    1 Introduction

With the ever-increasing popularity of review websites that feature user-generated opinions (e.g., TripAdvisor1 and Yelp2), there comes an increasing potential for monetary gain through opinion spam: inappropriate or fraudulent reviews. Opinion spam can range from annoying self-promotion of an unrelated website or blog to deliberate review fraud, as in the recent case3 of a Belkin employee who hired people to write positive reviews for an otherwise poorly reviewed product.4

1 http://tripadvisor.com
2 http://yelp.com
3 http://news.cnet.com/8301-1001_3-10145399-92.html
4 It is also possible for opinion spam to be negative, potentially in order to sully the reputation of a competitor.

While other kinds of spam have received considerable computational attention, regrettably there has been little work to date (see Section 2) on opinion spam detection. Furthermore, most previous work in the area has focused on the detection of DISRUPTIVE OPINION SPAM: uncontroversial instances of spam that are easily identified by a human reader, e.g., advertisements, questions, and other irrelevant or non-opinion text (Jindal and Liu, 2008). And while the presence of disruptive opinion spam is certainly a nuisance, the risk it poses to the user is minimal, since the user can always choose to ignore it.

We focus here on a potentially more insidious type of opinion spam: DECEPTIVE OPINION SPAM, fictitious opinions that have been deliberately written to sound authentic, in order to deceive the reader. For example, one of the following two hotel reviews is truthful and the other is deceptive opinion spam:

1. I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.

2. My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn't ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.

Typically, these deceptive opinions are neither easily ignored nor even identifiable by a human reader;5 consequently, there are few good sources of labeled data for this research. Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation. In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available6 dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.

To obtain a deeper understanding of the nature of deceptive opinion spam, we explore the relative utility of three potentially complementary framings of our problem. Specifically, we view the task as: (a) a standard text categorization task, in which we use n-gram-based classifiers to label opinions as either deceptive or truthful (Joachims, 1998; Sebastiani, 2002); (b) an instance of psycholinguistic deception detection, in which we expect deceptive statements to exemplify the psychological effects of lying, such as increased negative emotion and psychological distancing (Hancock et al., 2008; Newman et al., 2003); and, (c) a problem of genre identification, in which we view deceptive and truthful writing as sub-genres of imaginative and informative writing, respectively (Biber et al., 1999; Rayson et al., 2001).

We compare the performance of each approach on our novel dataset. Particularly, we find that machine learning classifiers trained on features traditionally employed in (a) psychological studies of deception and (b) genre identification are both outperformed at statistically significant levels by n-gram-based text categorization techniques. Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance, a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).

5 The second example review is deceptive opinion spam.
6 Available by request at: http://www.cs.cornell.edu/myleott/op_spam

Additionally, we make several theoretical contributions based on an examination of the feature weights learned by our machine learning classifiers. Specifically, we shed light on an ongoing debate in the deception literature regarding the importance of considering the context and motivation of a deception, rather than simply identifying a universal set of deception cues. We also present findings that are consistent with recent work highlighting the difficulties that liars have encoding spatial information (Vrij et al., 2009). Lastly, our study of deceptive opinion spam detection as a genre identification problem reveals relationships between deceptive opinions and imaginative writing, and between truthful opinions and informative writing.

The rest of this paper is organized as follows: in Section 2, we summarize related work; in Section 3, we explain our methodology for gathering data and evaluate human performance; in Section 4, we describe the features and classifiers employed by our three automated detection approaches; in Section 5, we present and discuss experimental results; finally, conclusions and directions for future work are given in Section 6.

    2 Related Work

Spam has historically been studied in the contexts of e-mail (Drucker et al., 2002), and the Web (Gyongyi et al., 2004; Ntoulas et al., 2006). Recently, researchers have begun to look at opinion spam as well (Jindal and Liu, 2008; Wu et al., 2010; Yoo and Gretzel, 2009).

Jindal and Liu (2008) find that opinion spam is both widespread and different in nature from either e-mail or Web spam. Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions7 (considered deceptive spam) and non-duplicate opinions (considered truthful). Wu et al. (2010) propose an alternative strategy for detecting deceptive opinion spam in the absence of gold-standard data, based on the distortion of popularity rankings. Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions.

7 Duplicate (or near-duplicate) opinions are opinions that appear more than once in the corpus with the same (or similar) text. While these opinions are likely to be deceptive, they are unlikely to be representative of deceptive opinion spam in general. Moreover, they are potentially detectable via off-the-shelf plagiarism detection software.

Yoo and Gretzel (2009) gather 40 truthful and 42 deceptive hotel reviews and, using a standard statistical test, manually compare the psychologically relevant linguistic differences between them. In contrast, we create a much larger dataset of 800 opinions that we use to develop and evaluate automated deception classifiers.

Research has also been conducted on the related task of psycholinguistic deception detection. Newman et al. (2003), and later Mihalcea and Strapparava (2009), ask participants to give both their true and untrue views on personal issues (e.g., their stance on the death penalty). Zhou et al. (2004; 2008) consider computer-mediated deception in role-playing games designed to be played over instant messaging and e-mail. However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).

Lastly, automatic approaches to determining review quality have been studied, both directly (Weimer et al., 2007) and in the contexts of helpfulness (Danescu-Niculescu-Mizil et al., 2009; Kim et al., 2006; O'Mahony and Smyth, 2009) and credibility (Weerkamp and De Rijke, 2008). Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.

3 Dataset Construction and Human Performance

While truthful opinions are ubiquitous online, deceptive opinions are difficult to obtain without resorting to heuristic methods (Jindal and Liu, 2008; Wu et al., 2010). In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.

Following the work of Yoo and Gretzel (2009), we compare truthful and deceptive positive reviews for hotels found on TripAdvisor. Specifically, we mine all 5-star truthful reviews from the 20 most popular hotels on TripAdvisor8 in the Chicago area.9 Deceptive opinions are gathered for those same 20 hotels using Amazon Mechanical Turk10 (AMT). Below, we provide details of the collection methodologies for deceptive (Section 3.1) and truthful opinions (Section 3.2). Ultimately, we collect 20 truthful and 20 deceptive opinions for each of the 20 chosen hotels (800 opinions total).

8 TripAdvisor utilizes a proprietary ranking system to assess hotel popularity. We chose the 20 hotels with the greatest number of reviews, irrespective of the TripAdvisor ranking.
9 It has been hypothesized that popular offerings are less likely to become targets of deceptive opinion spam, since the relative impact of the spam in such cases is small (Jindal and Liu, 2008; Lim et al., 2010). By considering only the most popular hotels, we hope to minimize the risk of mining opinion spam and labeling it as truthful.
10 http://mturk.com

    3.1 Deceptive opinions via Mechanical Turk

Crowdsourcing services such as AMT have made large-scale data annotation and collection efforts financially affordable by granting anyone with basic programming skills access to a marketplace of anonymous online workers (known as Turkers) willing to complete small tasks.

To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITs) and allocate them evenly across our 20 chosen hotels. To ensure that opinions are written by unique authors, we allow only a single submission per Turker. We also restrict our task to Turkers who are located in the United States, and who maintain an approval rating of at least 90%. Turkers are allowed a maximum of 30 minutes to work on the HIT, and are paid one US dollar for an accepted submission.

Each HIT presents the Turker with the name and website of a hotel. The HIT instructions ask the Turker to assume that they work for the hotel's marketing department, and to pretend that their boss wants them to write a fake review (as if they were a customer) to be posted on a travel review website; additionally, the review needs to sound realistic and portray the hotel in a positive light. A disclaimer indicates that any submission found to be of insufficient quality (e.g., written for the wrong hotel, unintelligible, unreasonably short,11 plagiarized,12 etc.) will be rejected.



Time spent t (minutes)
  All submissions       count: 400   t_min: 0.08   t_max: 29.78   mean: 8.06   s: 6.32

Length (words)
  All submissions       min: 25   max: 425   mean: 115.75   s: 61.30
  Time spent t < 1      count: 47    min: 39   max: 407   mean: 113.94   s: 66.24
  Time spent t >= 1     count: 353   min: 25   max: 425   mean: 115.99   s: 60.71

Table 1: Descriptive statistics for the 400 deceptive opinion spam submissions gathered using AMT. s corresponds to the sample standard deviation.


It took approximately 14 days to collect 400 satisfactory deceptive opinions. Descriptive statistics appear in Table 1. Submissions vary quite dramatically both in length and in time spent on the task. Particularly, nearly 12% of the submissions were completed in under one minute. Surprisingly, an independent two-tailed t-test between the mean length of these submissions (t < 1) and the remaining submissions (t >= 1) fails to reject the null hypothesis that the two groups have the same mean length.
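This test can be reproduced directly from the summary statistics in Table 1. A minimal sketch using SciPy (the pooled-variance form of the test is an assumption):

    # Sketch: two-tailed independent t-test on submission length,
    # using only the summary statistics reported in Table 1.
    from scipy.stats import ttest_ind_from_stats

    t_stat, p_value = ttest_ind_from_stats(
        mean1=113.94, std1=66.24, nobs1=47,   # submissions with t < 1 minute
        mean2=115.99, std2=60.71, nobs2=353,  # submissions with t >= 1 minute
    )
    print(t_stat, p_value)  # p is large, so equal mean lengths are not rejected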


3.3 Human performance

                      TRUTHFUL              DECEPTIVE
           Accuracy   P      R      F       P      R      F
HUMAN
  JUDGE 1    61.9%*   57.9   87.5   69.7*   74.4   36.3   48.7
  JUDGE 2    56.9%    53.9   95.0*  68.8    78.9*  18.8   30.3
  JUDGE 3    53.1%    52.3   70.0   59.9    54.7   36.3   43.6
META
  MAJORITY   58.1%    54.8   92.5   68.8    76.0   23.8   36.2
  SKEPTIC    60.6%    60.8*  60.0   60.4    60.5   61.3*  60.9*

Table 2: Performance of three human judges and two meta-judges on a subset of 160 opinions, corresponding to the first fold of our cross-validation experiments in Section 5. Starred entries are the largest value in each column.

The subset of 160 opinions evaluated by our human judges, corresponding to the first fold of the cross-validation experiments described in Section 5, contains all 40 reviews from each of four randomly chosen hotels. Unlike the Turkers, our student volunteers are not offered a monetary reward. Consequently, we consider their judgements to be more honest than those obtained via AMT.

Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. Specifically, the MAJORITY meta-judge predicts deceptive when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts deceptive when any human judge believes the review to be deceptive.
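The two meta-judges are simple vote aggregators; a minimal sketch (function names are illustrative):

    # Sketch: MAJORITY and SKEPTIC meta-judges. Each vote is True if
    # a human judge labeled the review deceptive.
    def majority_judge(votes):
        # Deceptive when at least two of the three judges say deceptive.
        return sum(votes) >= 2

    def skeptic_judge(votes):
        # Deceptive when any judge says deceptive.
        return any(votes)

    votes = [True, False, False]  # hypothetical judgments for one review
    print(majority_judge(votes), skeptic_judge(votes))  # False True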

Human and meta-judge performance is given in Table 2. It is clear from the results that human judges are not particularly effective at this task. Indeed, a two-tailed binomial test fails to reject the null hypothesis that JUDGE 2 and JUDGE 3 perform at-chance (p = 0.003, 0.10, 0.48 for the three judges, respectively). Furthermore, all three judges suffer from truth-bias (Vrij, 2008), a common finding in deception detection research in which human judges are more likely to classify an opinion as truthful than deceptive. In fact, JUDGE 2 classified fewer than 12% of the opinions as deceptive! Interestingly, this bias is effectively smoothed by the SKEPTIC meta-judge, which produces nearly perfectly class-balanced predictions. A subsequent reevaluation of human performance on this task suggests that the truth-bias can be reduced if judges are given the class-proportions in advance, although such prior knowledge is unrealistic; and ultimately, performance remains similar to that of Table 2.

Inter-annotator agreement among the three judges, computed using Fleiss' kappa, is 0.11. While there is no precise rule for interpreting kappa scores, Landis and Koch (1977) suggest that scores in the range (0.00, 0.20] correspond to slight agreement between annotators. The largest pairwise Cohen's kappa is 0.12, between JUDGE 2 and JUDGE 3, a value far below generally accepted pairwise agreement levels. We suspect that agreement among our human judges is so low precisely because humans are poor judges of deception (Vrij, 2008), and therefore they perform nearly at-chance respective to one another.
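Fleiss' kappa is computable directly from per-review category counts; a self-contained sketch for the three-judge, two-category setting used here:

    # Sketch: Fleiss' kappa. Each row gives, for one review, how many
    # judges chose each category (here: [truthful, deceptive]).
    def fleiss_kappa(tables):
        n = len(tables)     # number of reviews
        k = sum(tables[0])  # judges per review
        m = len(tables[0])  # number of categories
        # Per-review agreement P_i and overall category proportions p_j.
        p_i = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in tables]
        p_j = [sum(row[j] for row in tables) / (n * k) for j in range(m)]
        p_bar = sum(p_i) / n           # mean observed agreement
        p_e = sum(p * p for p in p_j)  # expected chance agreement
        return (p_bar - p_e) / (1 - p_e)

    print(fleiss_kappa([[3, 0], [2, 1], [1, 2], [0, 3]]))  # ~0.33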

4 Automated Approaches to Deceptive Opinion Spam Detection

We consider three automated approaches to detecting deceptive opinion spam, each of which utilizes classifiers (described in Section 4.4) trained on the dataset of Section 3. The features employed by each strategy are outlined here.

    4.1 Genre identification

Work in computational linguistics has shown that the frequency distribution of part-of-speech (POS) tags in a text is often dependent on the genre of the text (Biber et al., 1999; Rayson et al., 2001). In our genre identification approach to deceptive opinion spam detection, we test if such a relationship exists for truthful and deceptive reviews by constructing, for each review, features based on the frequencies of each POS tag.15 These features are also intended to provide a good baseline with which to compare our other automated approaches.

15 We use the Stanford Parser (Klein and Manning, 2003) to obtain the relative POS frequencies.
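A minimal sketch of the per-review POS features (NLTK's tagger is used here only as a convenient stand-in for the Stanford Parser named in footnote 15):

    # Sketch: relative POS-tag frequencies as features for one review.
    # Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    from collections import Counter
    import nltk

    def pos_features(review_text):
        tokens = nltk.word_tokenize(review_text)
        counts = Counter(tag for _, tag in nltk.pos_tag(tokens))
        total = sum(counts.values())
        # One feature per POS tag: its relative frequency in the review.
        return {tag: c / total for tag, c in counts.items()}

    print(pos_features("The rooms are modern and very comfortable."))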

    4.2 Psycholinguistic deception detection

The Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al., 2007) is a popular automated text analysis tool used widely in the social sciences. It has been used to detect personality traits (Mairesse et al., 2007), to study tutoring dynamics (Cade et al., 2010), and, most relevantly, to analyze deception (Hancock et al., 2008; Mihalcea and Strapparava, 2009; Vrij et al., 2007).

While LIWC does not include a text classifier, we can create one with features derived from the LIWC output. In particular, LIWC counts and groups the number of instances of nearly 4,500 keywords into 80 psychologically meaningful dimensions. We construct one feature for each of the 80 LIWC dimensions, which can be summarized broadly under the following four categories:

1. Linguistic processes: Functional aspects of text (e.g., the average number of words per sentence, the rate of misspelling, swearing, etc.)

2. Psychological processes: Includes all social, emotional, cognitive, perceptual and biological processes, as well as anything related to time or space.

3. Personal concerns: Any references to work, leisure, money, religion, etc.

4. Spoken categories: Primarily filler and agreement words.

While other features have been considered in past deception detection work, notably those of Zhou et al. (2004), early experiments found LIWC features to perform best. Indeed, the LIWC2007 software used in our experiments subsumes most of the features introduced in other work. Thus, we focus our psycholinguistic approach to deception detection on LIWC-based features.
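Because the LIWC lexicon itself is proprietary, the feature construction can only be illustrated schematically. A sketch with a two-dimension, made-up stand-in lexicon (the real LIWC2007 lexicon maps nearly 4,500 keywords, many with wildcard prefixes, into 80 dimensions):

    # Sketch: LIWC-style features as normalized keyword-hit counts per
    # dimension. LEXICON below is a tiny hypothetical stand-in.
    LEXICON = {
        "posemo": {"fantastic", "wonderful", "great", "love"},
        "space": {"small", "floor", "location", "on"},
    }

    def liwc_features(review_text):
        words = review_text.lower().split()
        return {dim: sum(w in keywords for w in words) / max(1, len(words))
                for dim, keywords in LEXICON.items()}

    print(liwc_features("the location is great and we love the small floor plan"))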

    4.3 Text categorization

In contrast to the other strategies just discussed, our text categorization approach to deception detection allows us to model both content and context with n-gram features. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, and TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
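A minimal sketch of the UNIGRAMS and BIGRAMS+ feature sets using scikit-learn (an assumed stand-in for the feature extraction actually used):

    # Sketch: lowercased, unstemmed n-gram count features.
    # BIGRAMS+ subsumes UNIGRAMS via ngram_range=(1, 2); TRIGRAMS+
    # would use ngram_range=(1, 3).
    from sklearn.feature_extraction.text import CountVectorizer

    unigrams = CountVectorizer(ngram_range=(1, 1), lowercase=True)
    bigrams_plus = CountVectorizer(ngram_range=(1, 2), lowercase=True)

    X = bigrams_plus.fit_transform(["My husband and I stayed at the James."])
    print(bigrams_plus.get_feature_names_out())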

    4.4 Classifiers

Features from the three approaches just introduced are used to train Naive Bayes and Support Vector Machine classifiers, both of which have performed well in related work (Jindal and Liu, 2008; Mihalcea and Strapparava, 2009; Zhou et al., 2008).

For a document x, with label y, the Naive Bayes (NB) classifier gives us the following decision rule:

    ŷ = argmax_c Pr(y = c) Pr(x | y = c)    (1)

When the class prior is uniform, for example when the classes are balanced (as in our case), (1) can be simplified to the maximum likelihood classifier (Peng and Schuurmans, 2003):

    ŷ = argmax_c Pr(x | y = c)    (2)

Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al. (2008) are equivalent. Thus, following Zhou et al. (2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
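A minimal sketch of decision rule (2) with one language model per class; add-one-smoothed unigram models stand in here for the interpolated Kneser-Ney models estimated with SRILM, and the toy document lists are hypothetical:

    # Sketch: maximum likelihood classification with per-class
    # language models, as in decision rule (2).
    import math
    from collections import Counter

    def train_lm(docs):
        counts = Counter(w for d in docs for w in d.lower().split())
        total = sum(counts.values())
        v = len(counts) + 1  # vocabulary size, +1 for unseen words
        # Add-one-smoothed unigram log-probability.
        return lambda w: math.log((counts[w] + 1) / (total + v))

    def classify(doc, lms):
        words = doc.lower().split()
        # Pick the class whose model gives the highest log-likelihood.
        return max(lms, key=lambda c: sum(lms[c](w) for w in words))

    lms = {"truthful": train_lm(["the location is perfect", "rooms are modern"]),
           "deceptive": train_lm(["this place is fantastic", "we loved it"])}
    print(classify("the rooms are very modern", lms))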

We also train Support Vector Machine (SVM) classifiers, which find a high-dimensional separating hyperplane between two groups of data. To simplify feature analysis in Section 5, we restrict our evaluation to linear SVMs, which learn a weight vector w and bias term b, such that a document x can be classified by:

    ŷ = sign(w · x + b)    (3)

We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+. We also evaluate every combination of these features, but for brevity include only LIWC+BIGRAMS+, which performs best. Following standard practice, document vectors are normalized to unit-length. For LIWC+BIGRAMS+, we unit-length normalize LIWC and BIGRAMS+ features individually before combining them.
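A minimal sketch of this setup (scikit-learn's LinearSVC stands in for SVMlight, and the feature matrices are random stand-ins):

    # Sketch: linear SVM over unit-length document vectors, with the
    # LIWC and BIGRAMS+ blocks normalized individually before combining.
    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.preprocessing import normalize
    from sklearn.svm import LinearSVC

    X_liwc = csr_matrix(np.random.rand(8, 5))      # 80 dims in the paper
    X_bigrams = csr_matrix(np.random.rand(8, 20))  # n-gram counts in the paper
    y = np.array([0, 1] * 4)                       # 0 = truthful, 1 = deceptive

    # LIWC+BIGRAMS+: unit-length normalize each block, then concatenate.
    X = hstack([normalize(X_liwc), normalize(X_bigrams)]).tocsr()

    clf = LinearSVC(C=1.0)  # C would be tuned on the training folds
    clf.fit(X, y)
    print(clf.predict(X[:2]))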


                                                    TRUTHFUL              DECEPTIVE
Approach                Features          Accuracy  P     R     F         P     R     F
GENRE IDENTIFICATION    POS_SVM             73.0%   75.3  68.5  71.7      71.1  77.5  74.2
PSYCHOLINGUISTIC
DECEPTION DETECTION     LIWC_SVM            76.8%   77.2  76.0  76.6      76.4  77.5  76.9
TEXT CATEGORIZATION     UNIGRAMS_SVM        88.4%   89.9  86.5  88.2      87.0  90.3  88.6
                        BIGRAMS+_SVM        89.6%   90.1  89.0  89.6      89.1  90.3  89.7
                        LIWC+BIGRAMS+_SVM   89.8%   89.8  89.8  89.8      89.8  89.8  89.8
                        TRIGRAMS+_SVM       89.0%   89.0  89.0  89.0      89.0  89.0  89.0
                        UNIGRAMS_NB         88.4%   92.5  83.5  87.8      85.0  93.3  88.9
                        BIGRAMS+_NB         88.9%   89.8  87.8  88.7      88.0  90.0  89.0
                        TRIGRAMS+_NB        87.6%   87.7  87.5  87.6      87.5  87.8  87.6
HUMAN / META            JUDGE 1             61.9%   57.9  87.5  69.7      74.4  36.3  48.7
                        JUDGE 2             56.9%   53.9  95.0  68.8      78.9  18.8  30.3
                        SKEPTIC             60.6%   60.8  60.0  60.4      60.5  61.3  60.9

Table 3: Automated classifier performance for three approaches based on nested 5-fold cross-validation experiments. Reported precision, recall and F-score are computed using a micro-average, i.e., from the aggregate true positive, false positive and false negative rates, as suggested by Forman and Scholz (2009). Human performance is repeated here for JUDGE 1, JUDGE 2 and the SKEPTIC meta-judge, although they cannot be directly compared since the 160-opinion subset on which they are assessed only corresponds to the first cross-validation fold.
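The micro-averaged scores in Table 3 follow directly from aggregate counts; a minimal sketch:

    # Sketch: micro-averaged precision, recall and F-score for one class,
    # from true-positive, false-positive and false-negative counts
    # aggregated over all cross-validation folds.
    def micro_prf(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_score = 2 * precision * recall / (precision + recall)
        return precision, recall, f_score

    print(micro_prf(tp=361, fp=39, fn=39))  # hypothetical aggregate counts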

    5 Results and Discussion

The deception detection strategies described in Section 4 are evaluated using a 5-fold nested cross-validation (CV) procedure (Quadrianto et al., 2009), where model parameters are selected for each test fold based on standard CV experiments on the training folds. Folds are selected so that each contains all reviews from four hotels; thus, learned models are always evaluated on reviews from unseen hotels.
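A minimal sketch of the fold construction (using scikit-learn's GroupKFold with hypothetical data; the inner parameter-selection loop is elided):

    # Sketch: 5-fold cross-validation grouped by hotel, so that each
    # model is always evaluated on reviews from four unseen hotels.
    import numpy as np
    from sklearn.model_selection import GroupKFold
    from sklearn.svm import LinearSVC

    X = np.random.rand(800, 10)               # one row per review (toy features)
    y = np.tile([0, 1], 400)                  # 0 = truthful, 1 = deceptive
    hotel_ids = np.repeat(np.arange(20), 40)  # 40 reviews per hotel

    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=hotel_ids):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])  # tune parameters here
        print(clf.score(X[test_idx], y[test_idx]))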

Results appear in Table 3. We observe that automated classifiers outperform human judges for every metric, except truthful recall where JUDGE 2 performs best.16 However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008). For example, one study examining deception in online dating found that humans perform at-chance detecting deceptive profiles because they rely on text-based cues that are unrelated to deception, such as second-person pronouns (Toma and Hancock, In Press).

16 As mentioned in Section 3.3, JUDGE 2 classified fewer than 12% of opinions as deceptive. While achieving 95% truthful recall, this judge's corresponding precision was not significantly better than chance (two-tailed binomial p = 0.4).

Among the automated classifiers, baseline performance is given by the simple genre identification approach (POS_SVM) proposed in Section 4.1. Surprisingly, we find that even this simple automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold). This result is best explained by theories of reality monitoring (Johnson and Raye, 1981), which suggest that truthful and deceptive opinions might be classified into informative and imaginative genres, respectively. Work by Rayson et al. (2001) has found strong distributional differences between informative and imaginative writing, namely that the former typically consists of more nouns, adjectives, prepositions, determiners, and coordinating conjunctions, while the latter consists of more verbs,17 adverbs,18 pronouns, and pre-determiners. Indeed, we find that the weights learned by POS_SVM (found in Table 4) are largely in agreement with these findings, notably except for adjective and adverb superlatives, the latter of which was found to be an exception by Rayson et al. (2001). However, that deceptive opinions contain more superlatives is not unexpected, since deceptive writing (but not necessarily imaginative writing in general) often contains exaggerated language (Buller and Burgoon, 1996; Hancock et al., 2008).

17 Past participle verbs were an exception.
18 Superlative adverbs were an exception.



TRUTHFUL/INFORMATIVE                       DECEPTIVE/IMAGINATIVE
Category       Variant           Weight    Category          Variant              Weight
NOUNS          Singular           0.008    VERBS             Base                -0.057
               Plural             0.002                      Past tense           0.041*
               Proper, singular  -0.041*                     Present participle  -0.089
               Proper, plural     0.091                      Singular, present   -0.031
ADJECTIVES     General            0.002                      Third person         0.026*
               Comparative        0.058                        singular, present
               Superlative       -0.164*                     Modal               -0.063
PREPOSITIONS   General            0.064    ADVERBS           General              0.001*
DETERMINERS    General            0.009                      Comparative         -0.035
COORD. CONJ.   General            0.094    PRONOUNS          Personal            -0.098
VERBS          Past participle    0.053                      Possessive          -0.303
ADVERBS        Superlative       -0.094*   PRE-DETERMINERS   General              0.017*

Table 4: Average feature weights learned by POS_SVM. Based on work by Rayson et al. (2001), we expect weights on the left to be positive (predictive of truthful opinions), and weights on the right to be negative (predictive of deceptive opinions). Starred entries are at odds with these expectations. We report average feature weights of unit-normalized weight vectors, rather than raw weight vectors, to account for potential differences in magnitude between the folds.

Both remaining automated approaches to detecting deceptive opinion spam outperform the simple genre identification baseline just discussed. Specifically, the psycholinguistic approach (LIWC_SVM) proposed in Section 4.2 performs 3.8% more accurately (one-tailed sign test p = 0.02), and the standard text categorization approach proposed in Section 4.3 performs between 14.6% and 16.6% more accurately. However, best performance overall is achieved by combining features from these two approaches. Particularly, the combined model LIWC+BIGRAMS+_SVM is 89.8% accurate at detecting deceptive opinion spam.19

19 The result is not significantly better than BIGRAMS+_SVM.

Surprisingly, models trained only on UNIGRAMS, the simplest n-gram feature set, outperform all non-text-categorization approaches, and models trained on BIGRAMS+ perform even better (one-tailed sign test p = 0.07). This suggests that a universal set of keyword-based deception cues (e.g., LIWC) is not the best approach to detecting deception, and a context-sensitive approach (e.g., BIGRAMS+) might be necessary to achieve state-of-the-art deception detection performance.

To better understand the models learned by these automated approaches, we report in Table 5 the top 15 highest weighted features for each class (truthful and deceptive) as learned by LIWC+BIGRAMS+_SVM and LIWC_SVM. In agreement with theories of reality monitoring (Johnson and Raye, 1981), we observe that truthful opinions tend to include more sensorial and concrete language than deceptive opinions; in particular, truthful opinions are more specific about spatial configurations (e.g., small, bathroom, on, location). This finding is also supported by recent work by Vrij et al. (2009) suggesting that liars have considerable difficulty encoding spatial information into their lies. Accordingly, we observe an increased focus in deceptive opinions on aspects external to the hotel being reviewed (e.g., husband, business, vacation).

      LIWC+BIGRAMS+_SVM                 LIWC_SVM
TRUTHFUL        DECEPTIVE       TRUTHFUL      DECEPTIVE
-               chicago         hear          i
...             my              number        family
on              hotel           allpunct      perspron
location        , and           negemo        see
)               luxury          dash          pronoun
allpunct_LIWC   experience      exclusive     leisure
floor           hilton          we            exclampunct
(               business        sexual        sixletters
the hotel       vacation        period        posemo
bathroom        i               otherpunct    comma
small           spa             space         cause
helpful         looking         human         auxverb
$               while           past          future
hotel .         husband         inhibition    perceptual
other           my husband      assent        feel

Table 5: Top 15 highest weighted truthful and deceptive features learned by LIWC+BIGRAMS+_SVM and LIWC_SVM. Ambiguous features are subscripted to indicate the source of the feature. LIWC features correspond to groups of keywords as explained in Section 4.2; more details about LIWC and the LIWC categories are available at http://liwc.net.


We also acknowledge several findings that, on the surface, are in contrast to previous psycholinguistic studies of deception (Hancock et al., 2008; Newman et al., 2003). For instance, while deception is often associated with negative emotion terms, our deceptive reviews have more positive and fewer negative emotion terms. This pattern makes sense when one considers the goal of our deceivers, namely to create a positive review (Buller and Burgoon, 1996).

Deception has also previously been associated with decreased usage of first person singular, an effect attributed to psychological distancing (Newman et al., 2003). In contrast, we find increased first person singular to be among the largest indicators of deception, which we speculate is due to our deceivers attempting to enhance the credibility of their reviews by emphasizing their own presence in the review. Additional work is required, but these findings further suggest the importance of moving beyond a universal set of deceptive language features (e.g., LIWC) by considering both the contextual (e.g., BIGRAMS+) and motivational parameters underlying a deception as well.

    6 Conclusion and Future Work

In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam. With it, we have shown that the detection of deceptive opinion spam is well beyond the capabilities of human judges, most of whom perform roughly at-chance. Accordingly, we have introduced three automated approaches to deceptive opinion spam detection, based on insights coming from research in computational linguistics and psychology. We find that while standard n-gram-based text categorization is the best individual detection approach, a combination approach using psycholinguistically-motivated features and n-gram features can perform slightly better.

Finally, we have made several theoretical contributions. Specifically, our findings suggest the importance of considering both the context (e.g., BIGRAMS+) and motivations underlying a deception, rather than strictly adhering to a universal set of deception cues (e.g., LIWC). We have also presented results based on the feature weights learned by our classifiers that illustrate the difficulties faced by liars in encoding spatial information. Lastly, we have discovered a plausible relationship between deceptive opinion spam and imaginative writing, based on POS distributional similarities.

Possible directions for future work include an extended evaluation of the methods proposed in this work to both negative opinions, as well as opinions coming from other domains. Many additional approaches to detecting deceptive opinion spam are also possible, and a focus on approaches with high deceptive precision might be useful for production environments.

    Acknowledgments

This work was supported in part by National Science Foundation Grants BCS-0624277, BCS-0904822, HSD-0624267, IIS-0968450, and NSCC-0904822, as well as a gift from Google, and the Jack Kent Cooke Foundation. We also thank, alphabetically, Rachel Boochever, Cristian Danescu-Niculescu-Mizil, Alicia Granstein, Ulrike Gretzel, Danielle Kirshenblat, Lillian Lee, Bin Lu, Jack Newton, Melissa Sackler, Mark Thomas, and Angie Yoo, as well as members of the Cornell NLP seminar group and the ACL reviewers for their insightful comments, suggestions and advice on various aspects of this work.

    References

C. Akkaya, A. Conrad, J. Wiebe, and R. Mihalcea. 2010. Amazon Mechanical Turk for subjectivity word sense disambiguation. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Los Angeles, pages 195-203.

D. Biber, S. Johansson, G. Leech, S. Conrad, E. Finegan, and R. Quirk. 1999. Longman grammar of spoken and written English, volume 2. MIT Press.

C.F. Bond and B.M. DePaulo. 2006. Accuracy of deception judgments. Personality and Social Psychology Review, 10(3):214.

D.B. Buller and J.K. Burgoon. 1996. Interpersonal deception theory. Communication Theory, 6(3):203-242.

W.L. Cade, B.A. Lehman, and A. Olney. 2010. An exploration of off topic conversation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 669-672. Association for Computational Linguistics.

S.F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 310-318. Association for Computational Linguistics.

C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee. 2009. How opinions are received by online communities: a case study on amazon.com helpfulness votes. In Proceedings of the 18th International Conference on World Wide Web, pages 141-150. ACM.

H. Drucker, D. Wu, and V.N. Vapnik. 2002. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5):1048-1054.

G. Forman and M. Scholz. 2009. Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. ACM SIGKDD Explorations, 12(1):49-57.

Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. 2004. Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 576-587. VLDB Endowment.

J.T. Hancock, L.E. Curry, S. Goorha, and M. Woodworth. 2008. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes, 45(1):1-23.

J. Jansen. 2010. Online product research. Pew Internet & American Life Project Report.

N. Jindal and B. Liu. 2008. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219-230. ACM.

T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, pages 137-142.

T. Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in Kernel Methods, page 184. MIT Press.

M.K. Johnson and C.L. Raye. 1981. Reality monitoring. Psychological Review, 88(1):67-85.

S.M. Kim, P. Pantel, T. Chklovski, and M. Pennacchiotti. 2006. Automatically assessing review helpfulness. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 423-430. Association for Computational Linguistics.

D. Klein and C.D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423-430. Association for Computational Linguistics.

J.R. Landis and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159.

E.P. Lim, V.A. Nguyen, N. Jindal, B. Liu, and H.W. Lauw. 2010. Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 939-948. ACM.

S.W. Litvin, R.E. Goldsmith, and B. Pan. 2008. Electronic word-of-mouth in hospitality and tourism management. Tourism Management, 29(3):458-468.

F. Mairesse, M.A. Walker, M.R. Mehl, and R.K. Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457-500.

R. Mihalcea and C. Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309-312. Association for Computational Linguistics.

M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5):665.

A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, pages 83-92. ACM.

M.P. O'Mahony and B. Smyth. 2009. Learning to recommend helpful hotel reviews. In Proceedings of the Third ACM Conference on Recommender Systems, pages 305-308. ACM.

F. Peng and D. Schuurmans. 2003. Combining naive Bayes and n-gram language models for text classification. Advances in Information Retrieval, pages 547-547.

J.W. Pennebaker, C.K. Chung, M. Ireland, A. Gonzales, and R.J. Booth. 2007. The development and psychometric properties of LIWC2007. Austin, TX, LIWC.Net.

N. Quadrianto, A.J. Smola, T.S. Caetano, and Q.V. Le. 2009. Estimating labels from label proportions. The Journal of Machine Learning Research, 10:2349-2374.

P. Rayson, A. Wilson, and G. Leech. 2001. Grammatical word class variation within the British National Corpus sampler. Language and Computers, 36(1):295-306.

R.A. Rigby and D.M. Stasinopoulos. 2005. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507-554.

F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47.

M.A. Serrano, A. Flammini, and F. Menczer. 2009. Modeling statistical properties of written text. PLoS ONE, 4(4):5372.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing, volume 3, pages 901-904. Citeseer.

C. Toma and J.T. Hancock. In press. What lies beneath: The linguistic traces of deception in online dating profiles. Journal of Communication.

A. Vrij, S. Mann, S. Kristen, and R.P. Fisher. 2007. Cues to deception and ability to detect lies as a function of police interview styles. Law and Human Behavior, 31(5):499-518.

A. Vrij, S. Leal, P.A. Granhag, S. Mann, R.P. Fisher, J. Hillman, and K. Sperry. 2009. Outsmarting the liars: The benefit of asking unanticipated questions. Law and Human Behavior, 33(2):159-166.

A. Vrij. 2008. Detecting lies and deceit: Pitfalls and opportunities. Wiley-Interscience.

W. Weerkamp and M. De Rijke. 2008. Credibility improves topical blog post retrieval. ACL-08: HLT, pages 923-931.

M. Weimer, I. Gurevych, and M. Muhlhauser. 2007. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 125-128. Association for Computational Linguistics.

G. Wu, D. Greene, B. Smyth, and P. Cunningham. 2010. Distortion as a validation criterion in the identification of suspicious reviews. Technical report, UCD-CSI-2010-04, University College Dublin.

K.H. Yoo and U. Gretzel. 2009. Comparison of deceptive and truthful travel reviews. Information and Communication Technologies in Tourism 2009, pages 37-47.

L. Zhou, J.K. Burgoon, D.P. Twitchell, T. Qin, and J.F. Nunamaker Jr. 2004. A comparison of classification methods for predicting deception in computer-mediated communication. Journal of Management Information Systems, 20(4):139-166.

L. Zhou, Y. Shi, and D. Zhang. 2008. A statistical language modeling approach to online deception detection. IEEE Transactions on Knowledge and Data Engineering, 20(8):1077-1081.