Recommending #-Tags in Twitter
Uploaded by evazangerle
Transcript of Recommending #-Tags in Twitter
3
Hashtags
• Tags for tweets
• (Manual) categorization of conversations
• Follow streams of conversation
4
Motivation
• Only 20% of tweets contain hashtags
• Hashtags can be chosen freely
  – #umap2011? #umap11? #umap?
  – Synonymous hashtags
  – Heterogeneity
  – Search capability limited
6
Goals
• Recommend suitable hashtags while the user is entering a tweet
• Encourage use of hashtags
  – Improve search capabilities
  – Better categorization
• Fight heterogeneity
  – Avoid use of synonymous hashtags
7
Approach
• First attempt
• Crawl set of tweets containing hashtags
• Analysis of dataset
• Can it be done based on content?
• Compare entered tweet to existing tweets
8
Content-based Approach
User enters message
Retrieve 500 most similar messages
Retrieve candidate-set of Hashtags
Ranking of Hashtags
Top-k Recommendations
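The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the smoothed idf formula, and the function names (`tokenize`, `most_similar`, `candidate_hashtags`) are assumptions, and a real system over millions of tweets would use an inverted index rather than a linear scan.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; URLs and hashtags are dropped first (an assumption)."""
    text = re.sub(r"https?://\S+|#\w+", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)


def tfidf_vectors(token_lists):
    """Build one sparse tf-idf vector (dict) per document, plus the idf table."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed idf; the exact weighting used in the talk is not stated.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, corpus, n=500):
    """corpus: list of (text, hashtags). Returns up to n (similarity, hashtags) pairs."""
    vectors, idf = tfidf_vectors([tokenize(text) for text, _ in corpus])
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), tags) for vec, (_, tags) in zip(vectors, corpus)]
    scored = [pair for pair in scored if pair[0] > 0.0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def candidate_hashtags(similar):
    """Union of all hashtags attached to the retrieved tweets."""
    return {tag for _, tags in similar for tag in tags}


# Toy corpus, invented for illustration only.
corpus = [
    ("looking for a java developer job in london", ["#jobs"]),
    ("listening to the new radiohead album", ["#nowplaying", "#music"]),
    ("breaking news about the election", ["#news"]),
]
similar = most_similar("new job opening for a python developer", corpus, n=2)
candidates = candidate_hashtags(similar)
```

The candidate set is then handed to one of the ranking methods, and the top k hashtags are shown to the user.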
9
Crawled Dataset
• Crawled July 2010 – February 2011
• 16,034,195 messages in total
• 3,209,281 messages containing hashtags (20%) -> used as dataset for evaluation
• The top five hashtags appear in 8% of all messages containing hashtags (#jobs, #nowplaying, #zodiacfacts, #news and #fb)
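As a rough illustration of how such dataset statistics could be computed from crawled messages, the sketch below extracts hashtags with a simple regular expression. The pattern `#\w+`, the case-folding, and the function name `hashtag_stats` are assumptions; the crawler and hashtag definition used for the talk are not described.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")  # simplified hashtag pattern (assumption)


def hashtag_stats(messages, top_n=5):
    """Share of messages with hashtags, the top_n hashtags, and their coverage."""
    tagged = [m for m in messages if HASHTAG.search(m)]
    if not tagged:
        return {"share_tagged": 0.0, "top": [], "top_coverage": 0.0}
    # Count each hashtag at most once per message, ignoring case.
    counts = Counter(
        tag for m in tagged for tag in {t.lower() for t in HASHTAG.findall(m)}
    )
    top = [tag for tag, _ in counts.most_common(top_n)]
    covered = sum(
        1 for m in tagged if any(t.lower() in top for t in HASHTAG.findall(m))
    )
    return {
        "share_tagged": len(tagged) / len(messages),
        "top": top,
        "top_coverage": covered / len(tagged),
    }


# Toy messages, invented for illustration only.
messages = [
    "new opening #jobs",
    "no tags here",
    "listening #nowplaying",
    "plain text",
    "another role #Jobs #news",
]
stats = hashtag_stats(messages, top_n=1)
```

For the reported figures, 3,209,281 / 16,034,195 is approximately 0.200, which matches the 20% on the slide.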
11
Hashtags per Tweet
RT @Bhupesh tweet: #Quad #loop-http://bit.ly/ciHX2U #retweet #India #Jobs #World #news #canada #ad #win #USA #tdf #oea #hacking #icantstop #sdcc #game
13
Ranking Methods
Input: Set of candidate hashtags (from 500 similar tweets)
Output: Ranked candidate list -> top k shown
1. Similarity Rank
  – Use similarity measure of tweets for ranking (tf/idf cosine similarity)
  – The higher the similarity of the tweets, the higher the ranking of the corresponding hashtags
2. Overall Popularity Rank
  – Most popular hashtags over whole dataset
  – The more popular, the higher the ranking within the candidate hashtags
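The first two ranking methods might look like the sketch below. It assumes the retrieval step yields `(similarity, hashtags)` pairs and that dataset-wide hashtag counts are available. The slides do not say how a hashtag occurring in several similar tweets is scored, so taking the highest similarity, and breaking ties alphabetically, are assumptions of this example.

```python
from collections import Counter


def similarity_rank(similar):
    """Rank hashtags by the best similarity of any retrieved tweet containing them."""
    best = {}
    for sim, tags in similar:
        for tag in tags:
            best[tag] = max(best.get(tag, 0.0), sim)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(best, key=lambda tag: (-best[tag], tag))


def overall_popularity_rank(similar, global_counts):
    """Rank candidate hashtags by how often they occur in the whole dataset."""
    candidates = {tag for _, tags in similar for tag in tags}
    return sorted(candidates, key=lambda tag: (-global_counts.get(tag, 0), tag))


# Toy input, invented for illustration only.
similar = [
    (0.9, ["#umap2011"]),
    (0.7, ["#recsys", "#news"]),
    (0.4, ["#umap2011", "#jobs"]),
]
global_counts = Counter(
    {"#jobs": 5000, "#news": 3000, "#recsys": 40, "#umap2011": 12}
)

by_similarity = similarity_rank(similar)
by_popularity = overall_popularity_rank(similar, global_counts)
```

In this toy input the two methods disagree sharply: the similarity rank puts #umap2011 first, while the overall popularity rank puts the globally frequent #jobs first.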
14
Ranking Methods
Input: Set of candidate hashtags (from 500 similar tweets)
Output: Ranked candidate list -> top k shown
3. Recommendation Popularity Rank
  – Count the number of occurrences of each hashtag within the candidate list
  – The more similar tweets feature the hashtag, the higher the rank of the hashtag
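A minimal sketch of the third method, assuming the input is the list of hashtag lists taken from the retrieved similar tweets. Counting a hashtag at most once per tweet and breaking ties alphabetically are assumptions, as is the function name.

```python
from collections import Counter


def recommendation_popularity_rank(similar_tweet_hashtags, k=5):
    """Rank hashtags by the number of similar tweets that contain them."""
    counts = Counter(
        tag for tags in similar_tweet_hashtags for tag in set(tags)
    )
    ranked = sorted(counts, key=lambda tag: (-counts[tag], tag))
    return ranked[:k]


# Toy input, invented for illustration only.
similar_tweet_hashtags = [
    ["#umap2011", "#recsys"],
    ["#umap2011"],
    ["#news", "#recsys"],
    ["#umap2011", "#jobs"],
]
top3 = recommendation_popularity_rank(similar_tweet_hashtags, k=3)
```

Unlike the overall popularity rank, this count is local to the retrieved tweets, so a globally rare hashtag can still come out on top.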
15
Evaluation
Randomly select tweet t from dataset
Remove hashtags from t
Use t as input for recommendation algorithm
Compute hashtag recommendations for t
Use three proposed ranking methods
Compare top-k recommendations
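The evaluation procedure can be sketched as below. Recall@k is taken here to be the fraction of a tweet's removed hashtags that appear among the top k recommendations, which is an assumption since the slides do not give the formula. The recommenders are passed in as plain functions, and the retweet and five-hashtag filters from the test setup are not implemented.

```python
import random
import re

HASHTAG = re.compile(r"#\w+")  # simplified hashtag pattern (assumption)


def strip_hashtags(text):
    """Return (text without hashtags, set of removed hashtags in lowercase)."""
    tags = {tag.lower() for tag in HASHTAG.findall(text)}
    return " ".join(HASHTAG.sub(" ", text).split()), tags


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant hashtags found in the top k recommendations."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)


def evaluate(tweets, recommenders, k=3, sample_size=10, seed=0):
    """Average recall@k per recommender over a random sample of tweets."""
    rng = random.Random(seed)
    sample = rng.sample(tweets, min(sample_size, len(tweets)))
    totals = {name: 0.0 for name in recommenders}
    if not sample:
        return totals
    for tweet in sample:
        text, truth = strip_hashtags(tweet)  # hide the original hashtags
        for name, recommend in recommenders.items():
            totals[name] += recall_at_k(recommend(text), truth, k)
    return {name: total / len(sample) for name, total in totals.items()}


# Toy data and a trivial baseline recommender, invented for illustration only.
tweets = ["hiring a developer #jobs", "great talk today #umap2011 #recsys"]
recommenders = {"always_popular": lambda text: ["#jobs", "#news", "#fb"]}
scores = evaluate(tweets, recommenders, k=3, sample_size=2)
```

With the three ranking methods plugged in as recommenders, the same loop yields one averaged recall@k figure per method.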
16
Evaluation
• Dataset
  – 3,209,281 messages
  – 5,097,545 hashtags
  – 510,170 distinct hashtags
• Test run
  – 10,000 randomly chosen tweets (max. 5 hashtags)
  – Retweets excluded
  – 30,000 test runs (3 ranking methods)
19
What we showed…
• Motivation for recommendation of hashtags
• Content-based recommendations
• Simple, straightforward approach
• 40% Recall@3
• … so it can be done!
20
{ "coordinates": null, "favorited": false, "created_at": "Thu Jul 15 23:26:04 +0000 2010", "truncated": false, "text": "RT @ApeyBaby44: Labels r run by lawyer & accountants. http://tl.gd/2hkmas", "contributors": null, "id": 18639444000, "geo": null, "in_reply_to_user_id": null, "place": null, "in_reply_to_screen_name": null, "user": { "name": "DIGGZ", "profile_sidebar_border_color": "F2E195", "profile_background_tile": true, "profile_sidebar_fill_color": "FFF7CC", "created_at": "Fri Apr 03 03:16:01 +0000 2009", "profile_image_url": "http://a3.twimg.com/profile_images/1079346239/untitled_normal.JPG", "location": "ATL, NC, VA, NY all day!", "profile_link_color": "FF0000", "follow_request_sent": null, "url": "http://thisisseriousbiz.com", "favourites_count": 42, "contributors_enabled": false, "utc_offset": -18000, "id": 28489988, "profile_use_background_image": true, "profile_text_color": "0C3E53", "protected": false, "followers_count": 588, "lang": "en", "notifications": null, "time_zone": "Quito", "verified": false, "profile_background_color": "BADFCD", "geo_enabled": true, "description": "Half of Production duo SeriousBIZ circa 2008\r\n#teamSERIOUSBIZ\r\n#teamblackberry PIN 315442C9\r\n#teamfollowback", "friends_count": 477, "statuses_count": 6269, "profile_background_image_url": "http://a1.twimg.com/profile_background_images/118926256/_MG_43571.JPG", "following": null, "screen_name": "DIGGZSeriousBIZ" }, "source": "<a href=\"http://www.ubertwitter.com/bb/download.php\" rel=\"nofollow\">\u00dcberTwitter</a>", "in_reply_to_status_id": null }
We've barely scratched the surface…
• Exploited only a small fraction of the available information
• 90% of it is metadata