Zen & the art of data mining
-
Upload
heinestien -
Category
Science
-
view
1.658 -
download
10
Embed Size (px)
description
Transcript of Zen & the art of data mining

1
Old Dominion UniversityDepartment of Computer Science
Hany SalahEldeen
Hany SalahEldeen Khalil [email protected]
Zen & the Art of Data Mining
07-08-14
Social Media Data Collection and the path to Modeling & Predicting User Intention
Web Science & Digital Libraries Lab

2
Before we start..here is a lil bit about me…
Hany SalahEldeen

3
Hany SalahELdeen
Education:• PhD Candidate• Web Science and Digital Libraries Group
• Masters Degree in Computer Vision and Artificial Intelligence• Universitat Autonoma de Barcelona
• Bachelors of Computer Systems Engineering• University of Alexandria
Hany SalahEldeen

4
Research & Technical Experience• Microsoft Research Cairo• Google GmBH Zurich • Microsoft Inc. Mountain View• National University of Singapore
Hany SalahEldeen

5Hany SalahEldeen
Detecting, Modeling, & Predicting User Temporal Intention
in Social Media
Web Mining Pattern Analysis Machine Learning
Human Behavioral AnalysisSocial Media Analysis
So what am I investigating?

6
Publications
Hany SalahEldeen
Shanghai CIKM 2014 Conference- 1 first author paper- 1 second author paper
London DL 2014 Conference- 1 third author paper
Malta TPDL 2013 Conference- 1 first author paper

7
Publications
Hany SalahEldeen
Indianapolis JCDL 2013 Conference- 1 first author paper
Rio de Janeiro WWW 2013 Conference- 1 first author paper
Cyprus TPDL 2012 Conference- 1 first author paper

8
Beside the perks of travelling, our research has been popular…
Hany SalahEldeen

9
MIT Technology Review
Hany SalahEldeen

10
MIT Technology Review
Hany SalahEldeen

11
MIT Technology Review
Hany SalahEldeen

12
Mashable
Hany SalahEldeen

13
Popular Mechanics
Hany SalahEldeen

14
BBC
Hany SalahEldeen

15
The Virginian Pilot
Hany SalahEldeen

16
Our Research’s Popularity
Hany SalahEldeen
• Local newspaper: The Virginia Pilot• 4 x MIT Technology Review• BBC• Mashable• The Atlantic• Yahoo News• Articles in > 11 different languages
• We have been called:• The Internet Archeologists• Web Time Travelers

17
My goal:Detect, model, and predict
user intention in social media
Hany SalahEldeen

18
Ok hold on, let’s go back to the basics…
Hany SalahEldeen

19
Web 2.0Definition: Web 2.0 is a concept that takes the network as a platform for information sharing, interoperability, user-centered design, and collaboration on the World Wide Web.*
* http://en.wikipedia.org/wiki/Web_2.0Hany SalahEldeen

20
Web 2.0• Yes, Web 2.0 is about “user-generated
content”• But explicit content contributed by
users is just 20% of what “matters”• 80% is in the implicitly contributed
data*
Hany SalahEldeen
*Toby Segaran, Programming Collective Intelligence, 2007

21
Systems & Web 2.0• Google: Utilizes PageRank which is a
technique for extracting intelligence from the link structure
• Flickr: Utilizes “interestingness” algorithm• Amazon: Utilizes “people who bought this
product also bought” feature• Pandora: Utilizes “similar artist radio”• eBay: Utilizes “reputation system”
Hany SalahEldeen

22
So why do we even care about all that?
Hany SalahEldeen

23
Power to the People!
Hany SalahEldeen

24
Power to the People!• Because analyzing a huge dataset of
millions of users will yield a lot of potential insights into: • User Experience• Marketing• Personal Taste• Human Behavior in general.
Hany SalahEldeen

25
So what is Data Mining?
Hany SalahEldeen

26
Data Mining• Definition: It is the computational process of
discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
http://en.wikipedia.org/wiki/Data_mining
Hany SalahEldeen

27
Back to my goal:
Hany SalahEldeen
Detecting, Modeling, & Predicting User Temporal Intention
in Social Media

28
Let’s breakdown the title first…
Hany SalahEldeen
Detecting, Modeling, & Predicting User Temporal Intention
in Social Media

29
Let’s breakdown the title first…
Hany SalahEldeen
Detecting, Modeling, & Predicting User Temporal Intention
in Social Media

30
Scenario 1:Jenny reading Jeff’s tweets
Hany SalahEldeen

31
Michael Jackson Dies
Hany SalahEldeen
Snapshot on: June 25th 2009http://web.archive.org/web/20090625232522/http://www.cnn.com/

32
Jeff tweets about it…
Hany SalahEldeen
Published on: June 25th 2009https://twitter.com/mdnitehk/status/2333993907

33
Jeff’s friend Jenny was on a vacation in Hawaii for a month
Jenny is off the grid…
Hany SalahEldeen

34
When she came back she checked Jeff’s tweets and was shocked!
Jenny starts catching up a month later
Hany SalahEldeen
Read on: July26th 2009!https://twitter.com/mdnitehk/status/2333993907

35
She quickly clicked on the link in the tweet…
Jenny follows the link on July 26th
Hany SalahEldeenhttp://web.archive.org/web/20090726234411/http://www.cnn.com/
CNN page on: July 26th 2009

36
• Implication:• Jenny thought Jeff is making a joke about her
favorite singer and she got mad at him
• Problem:• The tweet and the resource the tweet links
to have become unsynchronized.
Jenny is confused!
Hany SalahEldeen

37
Scenario 2:The Egyptian Revolution
Hany SalahEldeen

38
The Egyptian Revolution Jan 2011
Hany SalahEldeen

39
Reading about it in Storify.com a year later in March 2012
Hany SalahEldeen
http://storify.com/maq4sure/egypts-revolution

40
I noticed some shared images are missing
Hany SalahEldeenhttp://storify.com/maq4sure/egypts-revolution

41
Some tweets are still intact
Hany SalahEldeen
https://twitter.com/miss_amy_qb/status/32477898581483521

42
…and some lost their meaning with the disappearance of the images
Hany SalahEldeen
Missing ?https://twitter.com/aishes/status/32485352102952960
https://twitter.com/omar_chaaban/status/32203697597452289

43
The tweet remains but the shared image disappeared…
Hany SalahEldeen
http://yfrog.com/h5923xrvbqqvgzj

44
• Implication:• The reader cannot understand what the
author of the tweet meant because the image is not available.
• Problem:• The post is available but the linked resource
(image) is completely missing.
Cairo….we have a problem!
Hany SalahEldeen

45
…back to the title
Hany SalahEldeen
Detecting, Modeling, & Predicting User Temporal Intention
in Social Media

46
…back to the title
Hany SalahEldeen
Detecting, Modeling, & Predicting User Temporal Intention
in Social Media

4747
The Anatomy of a Tweet
Hany SalahEldeen

4848
The Anatomy of a TweetAuthor’s username
Other user mention
Tweet Body
Hash TagShortened URL to resource
Publishing timestamp
SocialPost
Shared Resource
Interactionoptions
Hany SalahEldeen

4949
3 URIs = 3 Chances to fail
Hany SalahEldeen
http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/
https://twitter.com/KentEiler/status/195535749754527745

5050
…t1
t4
t2
t3 t5t7 t8 t9 tn
t6
Explanation in MJ’s example

5151
If I click on a link in a tweet, which version should I get?
ttweet or tclick ?
Hany SalahEldeen

5252
Sometimes you want a previous version
The Correct Temporal Intention
CNN.com at the closest time to the tweet: 25th June 2009 ~ 7pmHany SalahEldeen

5353
Sometimes you want the current version
The Correct Temporal Intention
In this case the current state of the press releases pageHany SalahEldeen

5454
Research Question
Can we estimate the users’ intention at the time of posting
and reading to predict and maintain temporal consistency?
Hany SalahEldeen

5555
People rely on social media for most updated information
Hany SalahEldeen

56Hany SalahEldeen
So if you are posting a tweet about your cat…
…No one cares!

57Hany SalahEldeen
Regardless how cool your cat was!

58
All tweets are equal…
…but some are more equal than the others
Hany SalahEldeen

59
Preliminary Research Questions:
1. How long would these last?2. And if lost, are they archived?3. Is this what the author intended?
Hany SalahEldeen

6060
Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.
Hany SalahEldeen
Historical Integrity

6161
The life cycle of a social post
Hany SalahEldeen

6262
The life cycle of a social post
tweets
Hany SalahEldeen

6363
The life cycle of a social post
tweets Links to
Hany SalahEldeen

6464
The life cycle of a social post
tweets
What the reader
receives
Links to
Same state the author intended
Hany SalahEldeen

6565
The life cycle of a social post
tweets
What the reader
receives
Links to
Same state the author intended
Hany SalahEldeen
The resource has disappeared

6666
The life cycle of a social post
tweets
What the reader
receives
Links to
Same state the author intended
The resource has disappeared
The resource has changed
Hany SalahEldeen

6767
Same state the author intended
The Resource’s Possibilities
a bigger problem since the reader might not know.
What the reader
receives
The resource has disappeared
The resource has changed
Hany SalahEldeen

6868
We could lose the linked resource
Hany SalahEldeen

6969
The attack on the embassy was in February 2013
Or the resource could change
Hany SalahEldeen

7070
Why do we want to detect the Author’s Temporal Intention?
• Match: and convey the intended information.• Notify:– the author that the resource is prone to change.– the reader that the resource has changed.
• Preserve: the resource by pushing snapshots into the archive automatically.
• Retrieve: the closest archived version to maintain the consistency.
Hany SalahEldeen

7171
Our investigation angles
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen

7272
Our investigation angles
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen

7373
Estimating Web Archiving Coverage• Goal: Estimate how much of the public web is present in the public archives and
how many copies are available?• Action:
– Getting 4 different datasets from 4 different sources:• Search Engines Indices• Bit.ly• DMOZ• Delicious.
• Results: *
• Publications: – How much of the web is archived? JCDL '11– http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.htmlHany SalahEldeen
16%-79% Archived according to the source

7474
Our investigation angles
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen

7575
The timeline of the resource
Hany SalahEldeen
http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html

7676
Timestamps Accumulation
Hany SalahEldeen

7777
Actual Vs. Estimated Dates
Hany SalahEldeen
• Successfully estimated the creation date >75% of the resources
• >33% we estimated the exact date

7878
Our investigation angles
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen

79
• From Twitter, Websites, Books:• The Egyptian revolution
• From Twitter Only:• Stanford’s SNAP dataset:• Iranian elections• H1N1 virus outbreak• Michael Jackson’s death• Obama’s Nobel Peace Prize
• Twitter API:• The Syrian uprising
Six Socially Significant Events
Hany SalahEldeen

80
Resources Missing & Archived
Hany SalahEldeen

81
Revisiting after a year…
Hany SalahEldeen
• There is a nearly linear relationship between the amount missing from the web and time.
• After 1 year ~11% is gone, and 0.02% is lost every day

82
Measured Vs. Predicted
Hany SalahEldeen

83
First Attempts to Shared Content Replacement
Hany SalahEldeen
• We performed an experiment to gauge how many of the resources that are missing could be replaced with other similar resources.
• Collected a dataset with available resources which we assumed to be missing
• Used our method to extract the replacement resources
• Measured the similarity with the original resource

84
First Attempts to Shared Content Replacement
Hany SalahEldeen
We were able to extract another resource with >70% similarity to the missing resource in >40% of the cases

8585
Our investigation angles
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen

8686
Temporal Intention Relevancy Model(TIRM)
Between ttweet and tclick:
The linked resource could have:• Changed• Not changed
The tweet and the linked resource could be:• Still relevant• No longer relevant
Hany SalahEldeen

8787
Resource is changed but relevant
• The resource changed• But it is still relevant
Intention: need the current version of the resource at any time
Hany SalahEldeen

8888
Relevancy and Intention Mapping
Current
Hany SalahEldeen

8989
Resource is changed and not relevant
Intention: need the past version of the resource at any time
• The resource changed• But it is no longer relevant
Hany SalahEldeen

9090
Past
Relevancy and Intention Mapping
Current
Hany SalahEldeen

9191
Resource is not changed and relevant
Intention: need the past version of the resource at any time
• The resource is not changed• And it is relevant
Hany SalahEldeen

9292
Past
Relevancy and Intention Mapping
Current
Past
Hany SalahEldeen

9393
Resource is not changed and not relevant
Intention: I am not sure which version of the resource I need
• The resource is not changed• But it is not relevant
Hany SalahEldeen

9494
Past
Relevancy and Intention Mapping
Current
Past Not Sure
Hany SalahEldeen

9595
Our investigation angles
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen

9696
Feature extraction
• For each tweet we perform:– Link analysis– Social Media Mining– Archival Existence– Sentiment Analysis– Content Similarity– Entity Identification
Hany SalahEldeen

9797
1- Link analysis
• Since the tweets have embedded resources shortened by Bit.ly we can extract:– Total number of clicks– Hourly click logs– Creation dates– Referring websites– Referring countries
• We calculate the depth of the resource in relation to its domain (either it is a leaf node or a root page)– We calculated the number of backslashes in the resource’s URI
Hany SalahEldeen

9898
2- Social Media Mining
• Twitter:– Using Topsy.com’s API to
extract:• Total number of tweets.• The most recent 500.• Number of tweets by
influential users.
The collection of tweets extracted provided an extended context of the resource authored by users in the twittersphere.
Hany SalahEldeen

9999
2- Social Media Mining• Facebook:– Mined too for likes, shares, posts, and clicks related to each
resource.
Hany SalahEldeen

100100
3- Archival Existence• Using Memento Time
Maps we get:– Total mementos
available– Different archives count.– The closest archived
version to the tweet time.
Hany SalahEldeen

101101
4- Sentiment Analysis• Using NLTK libraries of natural language text processing• Extract the most prominent sentiment in the text
Hany SalahEldeen

102102
5- Content Similarity• Steps:– We download the content HTML using Lynx browser.– We apply boilerplate removal algorithm and full text extraction.– Calculate the cosine similarity between the two pages.
70% similarity
Hany SalahEldeen

103103
6- Entity Identification• By visual inspection we observed that the majority of tweets about
celebrities are related to current events.• We harvested Wikipedia for lists of actors, politicians, and athletes.• Checked the existence of a celebrity mention in the tweets.
Actor: Johnny Depp
Hany SalahEldeen

104104
The trained classifier
• From the feature extraction phase we extracted 39 different features to train the classifier.
• Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%
Hany SalahEldeen

105105
What’s Next for Hany?
• Finish up my dissertation• Defend.• Get a research/Data scientist position• Interests:– L3S Research Center Germany– Microsoft Research
Hany SalahEldeen

106106
1. The state of the archived content2. The age of the shared resource 3. The states of the resource:
1. Missing from the live web2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset5. Model this intention6. Create a time-based navigation tool to match the predicted
intention
Hany SalahEldeen
Summary:
Email: [email protected]: 3102Website: http://www.cs.odu.edu/~hany/Twitter: @hanysalaheldeen