
Page 1

WWW 2014, Seoul, April 8th

SNOW 2014 Data Challenge

Two-level message clustering for topic detection in Twitter
Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris

Centre for Research and Technology Hellas (CERTH)

Page 2

Overview

• Applied approach:
  – Pre-processing
  – Topic detection
  – Ranking
  – Title extraction
  – Keyword extraction
  – Representative tweets selection
  – Relevant image extraction

• Evaluation
• Conclusions


Page 3

Pre-processing

• Duplicate tweet aggregation:
  – Performed via simple hashing (very fast, but does not capture near-duplicates such as some retweets)
  – Counts kept for subsequent processing

• Language-based filtering:
  – Only content in English is kept
  – Public Java implementation used (https://code.google.com/p/language-detection)

• Significant computational benefit for subsequent steps, e.g., for the first timeslot:
  – Originally: 15,090 tweets
  – After duplicate aggregation: 7,546 unique tweets
  – After language filtering: 6,359 unique tweets written in English
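
A minimal Java sketch of the duplicate-aggregation step. The slides only say "simple hashing", so the normalization rule (lowercasing plus whitespace collapsing) and all names below are illustrative assumptions:

    import java.util.HashMap;
    import java.util.Map;

    // Aggregates exact duplicates via a hash map keyed on normalized text
    // and keeps per-text counts for subsequent processing steps.
    public class DuplicateAggregator {
        private final Map<String, Integer> counts = new HashMap<>();

        // Assumed normalization; the slides do not specify the scheme.
        private static String normalize(String text) {
            return text.toLowerCase().replaceAll("\\s+", " ").trim();
        }

        /** Returns true if this tweet text has not been seen before. */
        public boolean add(String tweet) {
            String key = normalize(tweet);
            Integer previous = counts.put(key, counts.getOrDefault(key, 0) + 1);
            return previous == null;
        }

        public int countOf(String tweet) {
            return counts.getOrDefault(normalize(tweet), 0);
        }
    }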


Page 4

Topic detection (1/2)

• Different types of topic detection algorithms:

  – Feature-pivot
  – Document-pivot
  – Probabilistic

• We opt for a document-pivot approach as we recognize that:
  – A reply tweet typically refers to the same topic as the tweet to which it is a reply
  – Tweets that include the same URL refer to the same topic

• Such information is readily available and cannot be easily taken into account by other types of topic detection algorithms

• We generate first-level clusters by grouping together tweets based on the above relationships, using a Union-Find algorithm (see the sketch below)
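
A minimal sketch of this first-level grouping with Union-Find (path compression only). Tweet ids, reply ids, and URL sets are assumed inputs; the names are chosen for illustration:

    import java.util.HashMap;
    import java.util.Map;

    // Union-Find over tweet ids; tweets connected by reply links or by a
    // shared URL end up in the same first-level cluster.
    public class FirstLevelClusters {
        private final Map<Long, Long> parent = new HashMap<>();
        private final Map<String, Long> urlToTweet = new HashMap<>();

        private long find(long x) {
            parent.putIfAbsent(x, x);
            long root = parent.get(x);
            if (root != x) {
                root = find(root);
                parent.put(x, root); // path compression
            }
            return root;
        }

        private void union(long a, long b) {
            parent.put(find(a), find(b));
        }

        /** replyTo may be null for tweets that are not replies. */
        public void addTweet(long id, Long replyTo, Iterable<String> urls) {
            find(id); // register the tweet
            if (replyTo != null) union(id, replyTo);
            for (String url : urls) {
                Long first = urlToTweet.putIfAbsent(url, id);
                if (first != null) union(id, first);
            }
        }

        public long clusterOf(long id) { return find(id); }
    }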


Page 5

Topic detection (2/2)

• Not all tweets will belong to a first-level cluster; thus, we perform a second-level clustering.

• We perform an incremental, threshold-based clustering procedure that utilizes LSH (see the sketch below):
  – For each tweet, find the best matching item among those already examined. If its similarity to that item (using tf-idf and cosine similarity) is above some threshold, assign it to the same cluster; otherwise, create a new cluster.
  – If the examined tweet belongs to a first-level cluster, assign the other tweets of that first-level cluster to the same second-level cluster (either existing or new) and do not consider these tweets further.

• Additionally, in order to reduce fragmentation:
  – We use the lemmatized version of terms (Stanford), instead of their raw form
  – We boost entities and hashtags by a constant factor (=1.5)

• Each second-level cluster is treated as a (fine-grained) topic.
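
A sketch of the second-level pass. For brevity, a brute-force nearest-neighbour search stands in for the LSH index, tf-idf vectors are assumed precomputed, and the propagation of first-level co-cluster members is left out:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Incremental threshold-based clustering over tf-idf vectors.
    public class SecondLevelClustering {
        private final List<Map<String, Double>> seen = new ArrayList<>();
        private final List<Integer> clusterOf = new ArrayList<>();
        private int nextCluster = 0;
        private final double threshold;

        public SecondLevelClustering(double threshold) { this.threshold = threshold; }

        private static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                na += e.getValue() * e.getValue();
            }
            for (double v : b.values()) nb += v * v;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        /** Assigns a tweet's tf-idf vector to a cluster; returns the cluster id. */
        public int add(Map<String, Double> tfidf) {
            int best = -1;
            double bestSim = 0;
            for (int i = 0; i < seen.size(); i++) {
                double sim = cosine(tfidf, seen.get(i));
                if (sim > bestSim) { bestSim = sim; best = i; }
            }
            int cluster = (best >= 0 && bestSim >= threshold)
                    ? clusterOf.get(best) : nextCluster++;
            seen.add(tfidf);
            clusterOf.add(cluster);
            return cluster;
        }
    }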


Page 6

Ranking

• A very large number of topics per timeslot is produced (e.g., 2,669 for the first timeslot), but we need to return only 10 per timeslot
• We recognize that the granularity and hierarchy of topics are important for ranking: fine-grain subtopics of popular coarse-grain topics should be ranked higher than fine-grain topics that are not subtopics of a popular coarse-grain topic

• To cater for this, we:
  – Detect coarse-grain topics by running the document-pivot procedure again (i.e., a third clustering process), this time boosting entities and hashtags further (not by a constant factor, but by a factor linear in their frequency)
  – Map each fine-grain topic to a coarse-grain topic to obtain a two-level hierarchy
  – Rank the coarse-grain topics by the number of tweets they contain
  – Rank the fine-grain topics within each coarse-grain topic, again by the number of tweets they contain

• Apply a simple heuristic procedure to select the first few fine-grain topics from the first few coarse-grain topics (one possible instantiation is sketched below)
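
The slides do not spell this heuristic out; the round-robin sketch below is one plausible instantiation and an assumption, not necessarily the authors' exact procedure:

    import java.util.ArrayList;
    import java.util.List;

    // One plausible selection heuristic (assumption): take the top fine-grain
    // topic of each coarse-grain topic in rank order, then the second-best
    // ones, and so on, until k topics are collected.
    public class TopicSelector {
        /** fineByCoarse: fine-grain topics grouped by coarse-grain topic,
         *  with coarse groups ordered by size and each group's fine-grain
         *  topics ordered by size, as described above. */
        public static <T> List<T> selectTop(List<List<T>> fineByCoarse, int k) {
            List<T> selected = new ArrayList<>();
            for (int round = 0; selected.size() < k; round++) {
                boolean tookAny = false;
                for (List<T> fines : fineByCoarse) {
                    if (round < fines.size() && selected.size() < k) {
                        selected.add(fines.get(round));
                        tookAny = true;
                    }
                }
                if (!tookAny) break; // every group exhausted
            }
            return selected;
        }
    }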


Page 7

Title extraction

• For each topic, we obtain a set of candidate titles by splitting the assigned tweets into sentences (Stanford NLP library used)

• Each candidate title gets a score depending on its frequency and the average likelihood of appearance of its words in an independent corpus

• Rank candidate titles and return the one with the highest score
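
The slides name the two scoring ingredients but not how they are combined; the sketch below multiplies candidate frequency by the average corpus probability of its words, which is an assumption:

    import java.util.List;
    import java.util.Map;

    // Scores a candidate title from its frequency among the topic's tweets
    // and the average likelihood of its words in an independent corpus.
    // The multiplicative combination is an assumption, not from the slides.
    public class TitleScorer {
        private final Map<String, Double> corpusWordProb; // word -> probability

        public TitleScorer(Map<String, Double> corpusWordProb) {
            this.corpusWordProb = corpusWordProb;
        }

        public double score(List<String> words, int frequency) {
            double sum = 0;
            for (String w : words) {
                sum += corpusWordProb.getOrDefault(w.toLowerCase(), 0.0);
            }
            double avgLikelihood = words.isEmpty() ? 0 : sum / words.size();
            return frequency * avgLikelihood;
        }
    }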


Page 8

Keyword extraction

• We opt for phrases rather than unigrams, because phrases are more descriptive and less ambiguous

• For each topic, we obtain a set of candidate keywords by detecting the noun phrases and verb phrases in the assigned tweets

• As in the case of titles, each candidate keyword gets a score depending on its frequency and the average likelihood of appearance of its words in an independent corpus

• Rank candidate keywords
• Find the position in the ranked list with the largest score gap and select the keywords up to that point (see the sketch below)
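
A minimal sketch of the score-gap cutoff, assuming scores are already sorted in descending order:

    import java.util.List;

    // Cuts the ranked list at the largest drop between consecutive scores;
    // items before the gap are the selected keywords.
    public class GapCutoff {
        /** scores must be sorted in descending order. */
        public static int cutIndex(List<Double> scores) {
            int cut = scores.size();
            double largestGap = 0;
            for (int i = 0; i + 1 < scores.size(); i++) {
                double gap = scores.get(i) - scores.get(i + 1);
                if (gap > largestGap) {
                    largestGap = gap;
                    cut = i + 1; // keep elements 0..i
                }
            }
            return cut;
        }
    }

The selected keywords are then keywords.subList(0, GapCutoff.cutIndex(scores)).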


Page 9

Representative tweets selection

• Related tweets for each topic are readily available since we apply a document-pivot approach

• Satisfactory diversity is achieved by not considering duplicates (pre-processing) and by considering replies (as part of the core topic detection procedure)

• Selection: first take the most popular tweet, then all replies, and then continue by popularity until 10 tweets are gathered (see the sketch below).
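
One reading of this selection rule, written as a sketch; the exact interleaving is an assumption, and the set takes care of duplicates across the two input lists:

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Assumed order: most popular tweet, then the topic's replies, then
    // keep filling by popularity until k tweets are selected.
    public class RepresentativeTweets {
        public static <T> List<T> select(List<T> byPopularity, List<T> replies, int k) {
            Set<T> picked = new LinkedHashSet<>();
            if (!byPopularity.isEmpty()) picked.add(byPopularity.get(0));
            for (T r : replies) {
                if (picked.size() >= k) break;
                picked.add(r);
            }
            for (T t : byPopularity) {
                if (picked.size() >= k) break;
                picked.add(t);
            }
            return new ArrayList<>(picked);
        }
    }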


Page 10

Relevant image extraction

Three cases:
• If there are images in the tweets assigned to the topic, return the most frequent image

• If not, query the Google search API with the title and return the first image returned

• If a result is not fetched (possibly because the title is too limiting), query the Google search API again, but this time with the most popular keyword (the fallback chain is sketched below)
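
A sketch of the three-case fallback chain; searchFirstImage is a hypothetical stand-in for an image-search call, not a real Google API method:

    import java.util.Optional;

    // Fallback chain for picking a topic image, mirroring the three cases.
    public class TopicImage {
        public static Optional<String> pick(Optional<String> mostFrequentTweetImage,
                                            String title, String topKeyword) {
            if (mostFrequentTweetImage.isPresent()) return mostFrequentTweetImage;
            Optional<String> byTitle = searchFirstImage(title);
            if (byTitle.isPresent()) return byTitle;
            return searchFirstImage(topKeyword);
        }

        // Hypothetical stub; a real implementation would call an image
        // search service and return the first result's URL.
        static Optional<String> searchFirstImage(String query) {
            return Optional.empty();
        }
    }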


Page 11

Evaluation (1/2)

• Significant computational benefit from pre-processing steps
• Typically, a few hundred first-level clusters


Page 12

Evaluation (2/2)


[Figure: annotated examples of detected topics, marking a missing keyword ("Hague"), irrelevant multimedia, and a marginally newsworthy topic]