A Corpus for Entity Profiling in Microblog Posts

A Corpus for Entity Profiling in Microblog Posts. Damiano Spina (UNED NLP & IR Group, Madrid, Spain); Edgar Meij, Andrei Oghina, Minh T. Bui, Mathias Breuss, Maarten de Rijke (ISLA, University of Amsterdam, Amsterdam, The Netherlands). LREC Workshop on Language Engineering for Online Reputation Management, May 26th, 2012, Istanbul, Turkey.

description

Microblogs have become an invaluable source of information for online reputation management. Streams of microblogs are of great value because of their direct and real-time nature. An emerging problem is to identify not only microblog posts (such as tweets) that are relevant for a given entity, but also the specific aspects that people discuss. Determining such aspects can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. In this paper we present two manually annotated corpora to evaluate the task of identifying aspects on Twitter, both of them based upon the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity. Human assessors have labeled each candidate as relevant or not. The second corpus is more fine-grained and contains opinion targets. Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.

Transcript of A Corpus for Entity Profiling in Microblog Posts

Page 1: A Corpus for Entity Profiling in Microblog Posts

A Corpus for Entity Profiling in Microblog Posts

Damiano Spina

UNED NLP & IR Group

Madrid, Spain

Edgar Meij, Andrei Oghina, Minh T. Bui, Mathias Breuss, Maarten de Rijke

ISLA, University of Amsterdam

Amsterdam, The Netherlands

LREC Workshop on Language Engineering for Online Reputation Management

May 26th, 2012 - Istanbul, Turkey

Page 2: A Corpus for Entity Profiling in Microblog Posts

Introduction

• Online Reputation Management

– Public image of an entity in Online Media

– Entity = { brand, organization, company, person, product }

• Microblogging services (e.g. Twitter)

– People sharing thoughts about an entity

– Dynamic, Real-Time

• Human Language Technologies

– Aid to reputation managers

– Retrieval and Analysis of entity mentions

Page 3: A Corpus for Entity Profiling in Microblog Posts

Sentiment vs. Profiling

• Sentiment analysis

• Entity Profiling – “hot” topics that people talk about in the context of an entity

Page 4: A Corpus for Entity Profiling in Microblog Posts

Our task: Aspect identification

• @xbox_news here we go again, microsoft being jealous of sony again.

• I lov big Sony headphones .. I lov my #music 2 b more beautiful

• not surprising that @graypowell was out and about - he used to be a ’Field Verification & Operator Acceptance Engineer’ at Sony

Page 6: A Corpus for Entity Profiling in Microblog Posts

Goal

• Build manually annotated corpora

– Evaluate the task of entity profiling in microblog streams

Page 7: A Corpus for Entity Profiling in Microblog Posts

WePS-3 ORM Corpus

• Collection of tweets

• Disambiguated company names (e.g. apple fruit vs. Apple Inc.)

Page 8: A Corpus for Entity Profiling in Microblog Posts

WePS-3 ORM Corpus

Pooling Aspects

Tweet annotation

Opinion targets

Page 10: A Corpus for Entity Profiling in Microblog Posts

Approach I: Pooling aspects

• Pooling methodology

– 4 Ranking Methods:

• TF.IDF [Salton and Buckley, 1988]

• Log-Likelihood Ratio [Dunning, 1993]

• Parsimonious Language Model [Hiemstra et al., 2004]

• Opinion target extraction using topic-specific subjective lexicons [Jijkoun et al., 2010]

– Top 10 terms from each method, pooled as candidate aspects

• Manual annotation
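The pooling step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`tfidf_scores`, `llr_scores`, `pool_candidates`), the tokenized-tweet input format, and the use of a background tweet collection are hypothetical, not the authors' implementation. Only two of the four ranking methods are shown, and the log-likelihood score uses a common simplified two-term form of the statistic.

```python
import math
from collections import Counter

def tfidf_scores(entity_tweets, background_tweets):
    """Entity term frequency times IDF, with document frequency
    counted over all tweets (entity plus background)."""
    tf = Counter(t for tweet in entity_tweets for t in tweet)
    all_tweets = entity_tweets + background_tweets
    df = Counter(t for tweet in all_tweets for t in set(tweet))
    n = len(all_tweets)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

def llr_scores(entity_tweets, background_tweets):
    """Simplified log-likelihood score comparing entity and background
    term counts; terms not over-represented for the entity get 0."""
    fg = Counter(t for tweet in entity_tweets for t in tweet)
    bg = Counter(t for tweet in background_tweets for t in tweet)
    c, d = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, a in fg.items():
        b = bg[term]
        e1 = c * (a + b) / (c + d)  # expected count in entity tweets
        e2 = d * (a + b) / (c + d)  # expected count in background tweets
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[term] = g2 if a * d > b * c else 0.0
    return scores

def pool_candidates(rankers, entity_tweets, background_tweets, k=10):
    """Union of the top-k terms of every ranker: the pool that
    human assessors would then label."""
    pool = set()
    for ranker in rankers:
        scores = ranker(entity_tweets, background_tweets)
        top = sorted(scores, key=lambda t: (-scores[t], t))[:k]
        pool.update(top)
    return pool

# Toy, made-up data (not from the corpus).
entity = [["sony", "headphones", "music"],
          ["sony", "headphones", "great"],
          ["sony", "playstation"]]
background = [["apple", "music"], ["great", "day"], ["music", "day"]]
pool = pool_candidates([tfidf_scores, llr_scores], entity, background, k=2)
```

Because the pool is a set union, a term proposed by several methods is judged only once, which keeps the manual annotation effort bounded by k times the number of methods.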

Page 11: A Corpus for Entity Profiling in Microblog Posts

Aspects dataset: annotation example

Page 12: A Corpus for Entity Profiling in Microblog Posts

Aspects dataset: outcome

• Three annotators, substantial agreement (> 0.6 Cohen/Fleiss’ kappa)

• 94 entities, 17775 tweets, ≈177 tweets/entity

• 2455 terms, 1304 aspects (54.11%)
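For readers unfamiliar with the agreement measures named on this slide, the sketch below computes Cohen's kappa (two annotators) and Fleiss' kappa (a fixed number of annotators per item) from their standard definitions. The function names and the toy labels are assumptions for illustration; they are not the corpus annotations, and the slides do not say how the reported values were computed.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.
    Undefined (division by zero) if chance agreement is exactly 1."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` is a list of per-item label lists,
    each holding one label per annotator (same count for every item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of annotator pairs that agree on this item.
        agreement += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar = agreement / n_items
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy relevance judgments (1 = aspect, 0 = not an aspect), made up.
a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa_ab = cohen_kappa(a, b)
```

In this toy example the two annotators agree on 8 of 10 items while chance agreement is 0.5, so kappa is 0.6, right at the threshold the slide calls substantial agreement.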

Page 13: A Corpus for Entity Profiling in Microblog Posts

Approach II: Tweet annotation

• Opinion targets dataset

• Tweet-level annotation

– Is the tweet subjective?

• Phrase-level annotation

– Subjective phrase

– Opinion target phrase p:

• p is an aspect of the entity

• p is included in a sentence that contains a direct subjective phrase

• p is the target of the expressed opinion
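One way to picture the two annotation levels is as a small record per tweet. The sketch below is hypothetical: the class name, the field names, and the consistency rule are assumptions made for illustration and do not describe the released file format. The example labels are invented for this sketch, not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TweetAnnotation:
    entity: str
    text: str
    is_subjective: bool                      # tweet-level judgment
    subjective_phrase: Optional[str] = None  # phrase-level, if subjective
    opinion_target: Optional[str] = None     # phrase-level, may be absent

    def is_consistent(self) -> bool:
        """Phrase-level labels only make sense for subjective tweets,
        and every labeled phrase must occur in the tweet text."""
        if not self.is_subjective:
            return self.subjective_phrase is None and self.opinion_target is None
        if self.subjective_phrase is None or self.subjective_phrase not in self.text:
            return False
        if self.opinion_target is not None and self.opinion_target not in self.text:
            return False
        return True

# Invented labels on one of the example tweets from the slides.
example = TweetAnnotation(
    entity="Sony",
    text="I lov big Sony headphones .. I lov my #music 2 b more beautiful",
    is_subjective=True,
    subjective_phrase="I lov",
    opinion_target="headphones",
)
```

Keeping the opinion target optional mirrors the slide's wording that a subjective tweet has a target only "if any".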

Page 14: A Corpus for Entity Profiling in Microblog Posts

Opinion Targets dataset: annotation example

Page 15: A Corpus for Entity Profiling in Microblog Posts

Opinion targets dataset: outcome

• 59 entities, 9396 tweets, ≈159 tweets/entity

• 15.16% of tweets with subjective phrases

• 13.82% of tweets with opinion targets

Page 17: A Corpus for Entity Profiling in Microblog Posts

Aspects vs. Opinion targets

[Figure: Venn diagram of Aspects and Terms in Opinion Targets. Values shown: 1650, 783, 270; 26.69%, 12.67%]

Page 18: A Corpus for Entity Profiling in Microblog Posts

• Available at http://bitly.com/profilingTwitter

WePS-3 ORM Corpus

• Pooling: Aspects dataset

– 94 entities, 17,775 tweets, ≈177 tweets/entity

– 2,455 terms, 1,304 aspects (54.11%)

• Tweet annotation: Opinion targets dataset

– 59 entities, 9,396 tweets, ≈159 tweets/entity

– 15.16% of tweets with subjective phrases

– 13.82% of tweets with opinion targets