Zen & the art of data mining


description

A talk I gave at Old Dominion University to new students from PES University in Bangalore

Transcript of Zen & the art of data mining

Page 1: Zen & the art of data mining

Old Dominion University, Department of Computer Science

Hany SalahEldeen

Hany SalahEldeen Khalil [email protected]

Zen & the Art of Data Mining

07-08-14

Social Media Data Collection and the path to Modeling & Predicting User Intention

Web Science & Digital Libraries Lab

Page 2: Zen & the art of data mining

Before we start, here is a little bit about me…

Page 3: Zen & the art of data mining

Hany SalahEldeen

Education:
• PhD Candidate
  – Web Science and Digital Libraries Group
• Master's Degree in Computer Vision and Artificial Intelligence
  – Universitat Autonoma de Barcelona
• Bachelor's of Computer Systems Engineering
  – University of Alexandria

Page 4: Zen & the art of data mining

Research & Technical Experience
• Microsoft Research Cairo
• Google GmbH Zurich
• Microsoft Inc. Mountain View
• National University of Singapore

Page 5: Zen & the art of data mining

So what am I investigating?

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Web Mining, Pattern Analysis, Machine Learning, Human Behavioral Analysis, Social Media Analysis

Page 6: Zen & the art of data mining

Publications

• Shanghai, CIKM 2014 Conference
  – 1 first author paper
  – 1 second author paper
• London, DL 2014 Conference
  – 1 third author paper
• Malta, TPDL 2013 Conference
  – 1 first author paper

Page 7: Zen & the art of data mining

Publications

• Indianapolis, JCDL 2013 Conference
  – 1 first author paper
• Rio de Janeiro, WWW 2013 Conference
  – 1 first author paper
• Cyprus, TPDL 2012 Conference
  – 1 first author paper

Page 8: Zen & the art of data mining

Besides the perks of travelling, our research has been popular…

Page 9: Zen & the art of data mining

MIT Technology Review

Page 10: Zen & the art of data mining

MIT Technology Review

Page 11: Zen & the art of data mining

MIT Technology Review

Page 12: Zen & the art of data mining

Mashable

Page 13: Zen & the art of data mining

Popular Mechanics

Page 14: Zen & the art of data mining

BBC

Page 15: Zen & the art of data mining

The Virginian-Pilot

Page 16: Zen & the art of data mining

Our Research’s Popularity

• Local newspaper: The Virginian-Pilot
• 4 x MIT Technology Review
• BBC
• Mashable
• The Atlantic
• Yahoo News
• Articles in > 11 different languages

• We have been called:
  – The Internet Archeologists
  – Web Time Travelers

Page 17: Zen & the art of data mining

My goal: Detect, model, and predict user intention in social media

Page 18: Zen & the art of data mining

OK, hold on, let’s go back to the basics…

Page 19: Zen & the art of data mining

Web 2.0

Definition: Web 2.0 is a concept that takes the network as a platform for information sharing, interoperability, user-centered design, and collaboration on the World Wide Web.*

* http://en.wikipedia.org/wiki/Web_2.0

Page 20: Zen & the art of data mining

Web 2.0

• Yes, Web 2.0 is about “user-generated content”
• But explicit content contributed by users is just 20% of what “matters”
• 80% is in the implicitly contributed data*

* Toby Segaran, Programming Collective Intelligence, 2007

Page 21: Zen & the art of data mining

Systems & Web 2.0

• Google: Utilizes PageRank, which is a technique for extracting intelligence from the link structure
• Flickr: Utilizes an “interestingness” algorithm
• Amazon: Utilizes a “people who bought this product also bought” feature
• Pandora: Utilizes “similar artist radio”
• eBay: Utilizes a “reputation system”

Page 22: Zen & the art of data mining

So why do we even care about all that?

Page 23: Zen & the art of data mining

Power to the People!

Page 24: Zen & the art of data mining

Power to the People!

• Because analyzing a huge dataset of millions of users will yield a lot of potential insights into:
  – User Experience
  – Marketing
  – Personal Taste
  – Human Behavior in general

Page 25: Zen & the art of data mining

So what is Data Mining?

Page 26: Zen & the art of data mining

Data Mining

• Definition: It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

http://en.wikipedia.org/wiki/Data_mining

Page 27: Zen & the art of data mining

Back to my goal:

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Page 28: Zen & the art of data mining

Let’s break down the title first…

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Page 30: Zen & the art of data mining

Scenario 1: Jenny reading Jeff’s tweets

Page 31: Zen & the art of data mining

Michael Jackson Dies

Snapshot on: June 25th 2009
http://web.archive.org/web/20090625232522/http://www.cnn.com/

Page 32: Zen & the art of data mining

Jeff tweets about it…

Published on: June 25th 2009
https://twitter.com/mdnitehk/status/2333993907

Page 33: Zen & the art of data mining

Jenny is off the grid…

Jeff’s friend Jenny was on vacation in Hawaii for a month

Page 34: Zen & the art of data mining

Jenny starts catching up a month later

When she came back she checked Jeff’s tweets and was shocked!

Read on: July 26th 2009!
https://twitter.com/mdnitehk/status/2333993907

Page 35: Zen & the art of data mining

Jenny follows the link on July 26th

She quickly clicked on the link in the tweet…

CNN page on: July 26th 2009
http://web.archive.org/web/20090726234411/http://www.cnn.com/

Page 36: Zen & the art of data mining

Jenny is confused!

• Implication:
  – Jenny thought Jeff was making a joke about her favorite singer, and she got mad at him.
• Problem:
  – The tweet and the resource the tweet links to have become unsynchronized.

Page 37: Zen & the art of data mining

Scenario 2: The Egyptian Revolution

Page 38: Zen & the art of data mining

The Egyptian Revolution Jan 2011

Page 39: Zen & the art of data mining

Reading about it on Storify.com a year later, in March 2012

http://storify.com/maq4sure/egypts-revolution

Page 40: Zen & the art of data mining

I noticed some shared images are missing

http://storify.com/maq4sure/egypts-revolution

Page 41: Zen & the art of data mining

Some tweets are still intact

https://twitter.com/miss_amy_qb/status/32477898581483521

Page 42: Zen & the art of data mining

…and some lost their meaning with the disappearance of the images

Missing?
https://twitter.com/aishes/status/32485352102952960
https://twitter.com/omar_chaaban/status/32203697597452289

Page 43: Zen & the art of data mining

The tweet remains but the shared image disappeared…

http://yfrog.com/h5923xrvbqqvgzj

Page 44: Zen & the art of data mining

Cairo… we have a problem!

• Implication:
  – The reader cannot understand what the author of the tweet meant because the image is not available.
• Problem:
  – The post is available but the linked resource (image) is completely missing.

Page 45: Zen & the art of data mining

…back to the title

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Page 47: Zen & the art of data mining

The Anatomy of a Tweet

Page 48: Zen & the art of data mining

The Anatomy of a Tweet

[Figure: annotated tweet with the following labels]
• Author’s username
• Other user mention
• Tweet body
• Hashtag
• Shortened URL to resource
• Publishing timestamp
• Social post
• Shared resource
• Interaction options

Page 49: Zen & the art of data mining

3 URIs = 3 Chances to fail

http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/

https://twitter.com/KentEiler/status/195535749754527745

Page 50: Zen & the art of data mining

[Figure: timeline with points t1, t2, t3, …, tn]

Explanation in MJ’s example

Page 51: Zen & the art of data mining

If I click on a link in a tweet, which version should I get?

t_tweet or t_click?

Page 52: Zen & the art of data mining

Sometimes you want a previous version

The Correct Temporal Intention

CNN.com at the closest time to the tweet: 25th June 2009 ~ 7pm

Page 53: Zen & the art of data mining

Sometimes you want the current version

The Correct Temporal Intention

In this case, the current state of the press releases page

Page 54: Zen & the art of data mining

Research Question

Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?

Page 55: Zen & the art of data mining

People rely on social media for the most up-to-date information

Page 56: Zen & the art of data mining

So if you are posting a tweet about your cat…

…No one cares!

Page 57: Zen & the art of data mining

Regardless of how cool your cat was!

Page 58: Zen & the art of data mining

All tweets are equal…

…but some are more equal than others

Page 59: Zen & the art of data mining

Preliminary Research Questions:

1. How long would these last?
2. And if lost, are they archived?
3. Is this what the author intended?

Page 60: Zen & the art of data mining

Historical Integrity

Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.

Page 61: Zen & the art of data mining

The life cycle of a social post

[Figure: the author tweets; the tweet links to a resource; what the reader receives is one of the following]
• Same state the author intended
• The resource has disappeared
• The resource has changed

Page 67: Zen & the art of data mining

The Resource’s Possibilities

What the reader receives:
• Same state the author intended
• The resource has disappeared
• The resource has changed: a bigger problem, since the reader might not know.

Page 68: Zen & the art of data mining

We could lose the linked resource

Page 69: Zen & the art of data mining

Or the resource could change

The attack on the embassy was in February 2013

Page 70: Zen & the art of data mining

Why do we want to detect the Author’s Temporal Intention?

• Match: convey the information the author intended.
• Notify:
  – the author that the resource is prone to change.
  – the reader that the resource has changed.
• Preserve: the resource by pushing snapshots into the archive automatically.
• Retrieve: the closest archived version to maintain the consistency.

Page 71: Zen & the art of data mining

Our investigation angles

1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
   1. Missing from the live web
   2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the predicted intention

Page 73: Zen & the art of data mining

Estimating Web Archiving Coverage

• Goal: Estimate how much of the public web is present in the public archives, and how many copies are available.
• Action:
  – Getting 4 different datasets from 4 different sources:
    • Search engine indices
    • Bit.ly
    • DMOZ
    • Delicious
• Results: 16%-79% archived, depending on the source
• Publications:
  – How much of the web is archived? JCDL '11
  – http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html

Page 75: Zen & the art of data mining

The timeline of the resource

http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html

Page 76: Zen & the art of data mining

Timestamps Accumulation

Page 77: Zen & the art of data mining

Actual Vs. Estimated Dates

• Successfully estimated the creation date for >75% of the resources
• For >33% we estimated the exact date

Page 79: Zen & the art of data mining

Six Socially Significant Events

• From Twitter, Websites, Books:
  – The Egyptian revolution
• From Twitter Only:
  – Stanford’s SNAP dataset:
    • Iranian elections
    • H1N1 virus outbreak
    • Michael Jackson’s death
    • Obama’s Nobel Peace Prize
• Twitter API:
  – The Syrian uprising

Page 80: Zen & the art of data mining

Resources Missing & Archived

Page 81: Zen & the art of data mining

Revisiting after a year…

• There is a nearly linear relationship between the amount missing from the web and time.

• After 1 year ~11% is gone, and 0.02% is lost every day
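
The two figures on this slide are enough for a back-of-the-envelope linear estimate. The sketch below assumes a straight line through the ~11% one-year point with a slope of 0.02% per day; the intercept is only what those two numbers imply, not a fitted coefficient from the study, and the function name is hypothetical.

```python
# Rough linear estimate built only from the two slide figures:
# ~11% missing after one year, and 0.02% more lost every day.
DAILY_LOSS_PCT = 0.02       # from the slide
ONE_YEAR_LOSS_PCT = 11.0    # from the slide (approximate)

# Assumption: the intercept implied by the two figures above,
# not a value reported in the talk.
INTERCEPT_PCT = ONE_YEAR_LOSS_PCT - DAILY_LOSS_PCT * 365


def estimated_missing_pct(age_days: float) -> float:
    """Estimate the percentage of shared resources missing after age_days."""
    pct = INTERCEPT_PCT + DAILY_LOSS_PCT * age_days
    # Clamp so the estimate stays a valid percentage.
    return max(0.0, min(100.0, pct))


if __name__ == "__main__":
    for days in (365, 730, 1095):
        print(days, "days:", round(estimated_missing_pct(days), 1), "% missing")
```

A straight line like this is only reasonable over the range the data covers; extrapolating many years ahead would need the measured curve.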

Page 82: Zen & the art of data mining

Measured Vs. Predicted

Page 83: Zen & the art of data mining

First Attempts at Shared Content Replacement

• We performed an experiment to gauge how many of the resources that are missing could be replaced with other similar resources.
• Collected a dataset of resources that are still available and treated them as if they were missing
• Used our method to extract the replacement resources
• Measured the similarity with the original resource

Page 84: Zen & the art of data mining

First Attempts at Shared Content Replacement

We were able to extract another resource with >70% similarity to the missing resource in >40% of the cases

Page 86: Zen & the art of data mining

Temporal Intention Relevancy Model (TIRM)

Between t_tweet and t_click:

The linked resource could have:
• Changed
• Not changed

The tweet and the linked resource could be:
• Still relevant
• No longer relevant

Page 87: Zen & the art of data mining

Resource is changed but relevant

• The resource changed
• But it is still relevant

Intention: need the current version of the resource at any time

Page 88: Zen & the art of data mining

Relevancy and Intention Mapping

• Changed and relevant: Current

Page 89: Zen & the art of data mining

Resource is changed and not relevant

• The resource changed
• And it is no longer relevant

Intention: need the past version of the resource at any time

Page 90: Zen & the art of data mining

Relevancy and Intention Mapping

• Changed and relevant: Current
• Changed and not relevant: Past

Page 91: Zen & the art of data mining

Resource is not changed and relevant

• The resource is not changed
• And it is relevant

Intention: need the past version of the resource at any time

Page 92: Zen & the art of data mining

Relevancy and Intention Mapping

• Changed and relevant: Current
• Changed and not relevant: Past
• Not changed and relevant: Past

Page 93: Zen & the art of data mining

Resource is not changed and not relevant

• The resource is not changed
• But it is not relevant

Intention: I am not sure which version of the resource I need

Page 94: Zen & the art of data mining

Relevancy and Intention Mapping

• Changed and relevant: Current
• Changed and not relevant: Past
• Not changed and relevant: Past
• Not changed and not relevant: Not sure
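
The four cases in the mapping can be written down as a small lookup. This is a minimal sketch of the mapping as the slides state it; the `Intention` enum and `map_intention` function are hypothetical names used for illustration, not part of any published tool.

```python
from enum import Enum


class Intention(Enum):
    """The three outcomes named on the mapping slides."""
    CURRENT = "current"
    PAST = "past"
    NOT_SURE = "not sure"


def map_intention(changed: bool, relevant: bool) -> Intention:
    """Map (resource changed?, still relevant?) to the intended version."""
    if changed and relevant:
        # Changed but still relevant: the reader needs the current version.
        return Intention.CURRENT
    if changed and not relevant:
        # Changed and no longer relevant: the reader needs the past version.
        return Intention.PAST
    if not changed and relevant:
        # Unchanged and relevant: the slides map this case to the past version.
        return Intention.PAST
    # Unchanged and not relevant: the slides leave this case undecided.
    return Intention.NOT_SURE


if __name__ == "__main__":
    for changed in (True, False):
        for relevant in (True, False):
            print(changed, relevant, map_intention(changed, relevant).value)
```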

Page 96: Zen & the art of data mining

Feature extraction

• For each tweet we perform:
  – Link analysis
  – Social Media Mining
  – Archival Existence
  – Sentiment Analysis
  – Content Similarity
  – Entity Identification

Page 97: Zen & the art of data mining

1- Link analysis

• Since the tweets have embedded resources shortened by Bit.ly, we can extract:
  – Total number of clicks
  – Hourly click logs
  – Creation dates
  – Referring websites
  – Referring countries
• We calculate the depth of the resource in relation to its domain (whether it is a leaf node or a root page):
  – We count the number of slashes in the resource’s URI
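
The depth feature can be sketched with the standard library alone. The version below counts non-empty path segments rather than raw slash characters, which is an assumption made here so that a trailing slash does not change the result; `uri_depth` is a hypothetical name.

```python
from urllib.parse import urlparse


def uri_depth(uri: str) -> int:
    """Return how deep a resource sits below its domain root.

    Assumption: depth is the number of non-empty path segments,
    so a root page has depth 0 and a trailing slash is ignored.
    """
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])


if __name__ == "__main__":
    # A root page versus a leaf article, both taken from the slides.
    print(uri_depth("http://www.cnn.com/"))
    print(uri_depth(
        "http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/"
    ))
```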

Page 98: Zen & the art of data mining

2- Social Media Mining

• Twitter:
  – Using Topsy.com’s API to extract:
    • Total number of tweets
    • The most recent 500
    • Number of tweets by influential users

The collection of tweets extracted provided an extended context of the resource, authored by users in the twittersphere.

Page 99: Zen & the art of data mining

2- Social Media Mining

• Facebook:
  – Also mined for likes, shares, posts, and clicks related to each resource.

Page 100: Zen & the art of data mining

3- Archival Existence

• Using Memento TimeMaps we get:
  – Total mementos available
  – Count of different archives
  – The closest archived version to the tweet time
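
Picking the closest archived version reduces to a nearest-datetime search once the memento timestamps are known. The sketch below assumes the timestamps have already been collected (no network call is made) and uses the two CNN snapshot times from the Michael Jackson scenario; `closest_memento` is a hypothetical helper, and the 7 pm tweet time is the approximate value from the slides.

```python
from datetime import datetime
from typing import Iterable, Optional


def closest_memento(memento_times: Iterable[datetime],
                    tweet_time: datetime) -> Optional[datetime]:
    """Return the memento datetime nearest to tweet_time, or None if empty."""
    times = list(memento_times)
    if not times:
        return None
    # Smallest absolute time difference wins, whether before or after the tweet.
    return min(times, key=lambda m: abs(m - tweet_time))


if __name__ == "__main__":
    # 14-digit archive timestamps as they appear in the snapshot URLs.
    snapshots = [
        datetime.strptime("20090625232522", "%Y%m%d%H%M%S"),
        datetime.strptime("20090726234411", "%Y%m%d%H%M%S"),
    ]
    tweet_time = datetime(2009, 6, 25, 19, 0)  # approximate, from the slides
    print(closest_memento(snapshots, tweet_time))
```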

Page 101: Zen & the art of data mining

4- Sentiment Analysis

• Using the NLTK libraries for natural language text processing
• Extract the most prominent sentiment in the text

Page 102: Zen & the art of data mining

5- Content Similarity

• Steps:
  – We download the content HTML using the Lynx browser.
  – We apply a boilerplate removal algorithm and full text extraction.
  – We calculate the cosine similarity between the two pages.

70% similarity
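
The last step, cosine similarity, can be illustrated on plain text. This is a minimal sketch that assumes the HTML download and boilerplate removal have already happened, and it uses simple word-count vectors; the tokenization and weighting used in the actual study are not stated on the slides.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors of two texts (0.0 to 1.0)."""
    # Assumption: lowercase word tokens with raw term counts as weights.
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))

    # Dot product over the terms the two texts share.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    old_page = "Michael Jackson dies at 50"
    new_page = "Michael Jackson memorial held in Los Angeles"
    print(round(cosine_similarity(old_page, new_page), 2))
```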

Page 103: Zen & the art of data mining

6- Entity Identification

• By visual inspection we observed that the majority of tweets about celebrities are related to current events.
• We harvested Wikipedia for lists of actors, politicians, and athletes.
• We checked for the existence of a celebrity mention in the tweets.

Actor: Johnny Depp
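
Checking a tweet for a celebrity mention can be sketched as a whole-phrase match against a name list. The list below is a tiny stand-in for the harvested Wikipedia lists, and `mentioned_celebrities` is a hypothetical helper; the matching rule (case-insensitive, word boundaries) is an assumption.

```python
import re
from typing import Iterable, List


def mentioned_celebrities(tweet: str, names: Iterable[str]) -> List[str]:
    """Return the names from `names` that appear in the tweet text."""
    text = tweet.lower()
    found = []
    for name in names:
        # Word boundaries avoid matching a name inside a longer word.
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, text):
            found.append(name)
    return found


if __name__ == "__main__":
    celebrity_names = ["Johnny Depp", "Michael Jackson"]  # stand-in list
    print(mentioned_celebrities("Just saw Johnny Depp downtown!", celebrity_names))
```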

Page 104: Zen & the art of data mining

The trained classifier

• From the feature extraction phase we extracted 39 different features to train the classifier.

• Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%

Page 105: Zen & the art of data mining

What’s Next for Hany?

• Finish up my dissertation
• Defend
• Get a research/data scientist position
• Interests:
  – L3S Research Center, Germany
  – Microsoft Research

Page 106: Zen & the art of data mining

Summary:

1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
   1. Missing from the live web
   2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the predicted intention

Email: [email protected]: 3102
Website: http://www.cs.odu.edu/~hany/
Twitter: @hanysalaheldeen