Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 1 Detecting, Modeling, & Predicting User Temporal Intention in Social Media Hany M. SalahEldeen Doctor of Philosophy Dissertation Defense Old Dominion University Department of Computer Science Advisor: Dr. Michael L. Nelson Dr. Michele C. Weigle Dr. Hussein M. Abdel-Wahab Dr. M’Hammed Abdous Committee: May 5 th , 2015

Transcript of Doctoral Defense: Hany SalahEldeen

Page 1: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 1

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Hany M. SalahEldeen
Doctor of Philosophy Dissertation Defense

Old Dominion University
Department of Computer Science

Advisor: Dr. Michael L. Nelson

Committee: Dr. Michele C. Weigle, Dr. Hussein M. Abdel-Wahab, Dr. M’Hammed Abdous

May 5th, 2015

Page 2: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 2

All tweets are equal…

…but some are more equal than the others

Page 3: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 3

It is imperative to know…

1. How long would these last?
2. And if lost, is there a backup somewhere?
3. Is this what the author intended?

Page 4: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 4

To maintain historical integrity

Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.

Page 5: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 5

Motivation

Background

Related Research

Research Question

User-Time-Shared Resource

Conclusions

Page 6: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 6

People rely on social media for the most up-to-date information

Page 7: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 7

Social media is more than kitty photos

Marie Colvin: January 12, 1956 – February 22, 2012

Rémi Ochlik: 16 October 1983 – 22 February 2012

Ahmed Assem: 1987 – July 8, 2013

Page 8: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 8

For the web is dark, and full of missing content…

Accessed in July 2014

3 out of 8 external links on Rémi’s Wikipedia page return 404

Page 9: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 9

even for content shared in social media

Accessed in July 2014

Page 10: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 10

News sites are also prone to change

Accessed in July 2014

Page 11: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 11

So are specialized sites

Accessed in July 2014

Page 12: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 12

Research Problem: Author’s Intention ≠ Reader’s Experience

Page 13: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 13

Research Implication: Author’s Intention ≠ Reader’s Experience

Broken, Inconsistent Web and Historical Records

Page 14: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 14

Motivation

Background

Related Research

Research Question

User-Time-Shared Resource

Conclusions

Page 15: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 15

Social Post

Page 16: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 16

The anatomy of a tweet

Author’s username

Other user mention

Tweet Body

Hash Tag

Shortened URL to resource

Publishing timestamp

Social Post

Shared Resource

Interaction options

Page 17: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 17

3 URIs = 3 Chances to fail

Page 18: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 18

URL shortening and aliasing

curl -L -I http://bit.ly/losing_revolution

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Mon, 07 Jul 2014 18:19:48 GMT
Cache-Control: private; max-age=90
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Mime-Version: 1.0
Set-Cookie: _bit=53bae4c4-00328-04f10-cb1cf10a;domain=.bit.ly;expires=Sat Jan 3 18:19:48 2015;path=/; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 167

HTTP/1.1 200 OK
Expires: Mon, 07 Jul 2014 18:19:52 GMT
Date: Mon, 07 Jul 2014 18:19:52 GMT
Cache-Control: private, max-age=0
Last-Modified: Mon, 07 Jul 2014 18:19:07 GMT
ETag: "e3555826-b103-4daa-a3f2-d0509ebab51f"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
Alternate-Protocol: 80:quic
Content-Type: text/html;charset=UTF-8
Content-Length: 0
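The header dump on this slide contains two responses: the 301 from bit.ly and the final 200 from the blog. As a minimal sketch (not the tooling used in the dissertation), such `curl -L -I` output can be split into a redirect chain like this. The function name `parse_redirect_chain` and the shortened `SAMPLE` text are illustrative assumptions.

```python
import re
from typing import List, Optional, Tuple


def parse_redirect_chain(header_dump: str) -> List[Tuple[int, Optional[str]]]:
    """Return (status code, Location value or None) for each response in the dump."""
    chain: List[Tuple[int, Optional[str]]] = []
    for raw_line in header_dump.splitlines():
        line = raw_line.strip()
        status = re.match(r"HTTP/\d(?:\.\d)?\s+(\d{3})", line)
        if status:
            # A status line starts a new response block.
            chain.append((int(status.group(1)), None))
        elif chain and line.lower().startswith("location:"):
            # Attach the redirect target to the response it belongs to.
            code, _ = chain[-1]
            chain[-1] = (code, line.split(":", 1)[1].strip())
    return chain


# Abbreviated version of the headers shown on the slide.
SAMPLE = """HTTP/1.1 301 Moved Permanently
Server: nginx
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

HTTP/1.1 200 OK
Server: GSE
"""

for status_code, location in parse_redirect_chain(SAMPLE):
    print(status_code, location)
```

Each hop in the chain is one more URI that has to stay alive for the reader to reach the shared resource.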

Page 19: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 19

Life cycle of a social post

Page 20: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 20

Life cycle of a social post

tweets

Page 21: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 21

Life cycle of a social post

tweets Links to

Page 22: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 22

Life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

Page 23: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 23

Life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

Ideally!

Page 24: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 24

Life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

After a period of time

Page 25: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 25

Life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

The resource has disappeared

After a period of time

Page 26: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 26

Life cycle of a social post

tweets

What the reader

receives

Links to

Same state the author intended

The resource has disappeared

The resource has changed

After a period of time

Page 27: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 27

Memento framework

* http://mementoweb.org/guide/rfc/

Page 28: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 28

Motivation

Background

Related Research

Research Question

User-Time-Shared Resource

Conclusions

Page 29: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 29

Related Work

• Social media analysis:
  • Understanding Microblogging: Zhao 2009, Yang 2010, Newman 2003, Kwak 2010, Java 2007, Cha 2009
  • History Narration: Vieweg 2010, Starbird 2010-2012, Qu 2011, Neubig 2011, Lehman and Lalmas 2012-2013
• User’s Web Search Intention: Ashkan 2009, Lee 2005, Loser 2008, Azzopardi 2009, Baeza-Yates 2006, Dai 2011
• Commercial Intention: Guo 2010, Benczur 2007
• Sentiment Analysis: Mishne 2006, Bollen 2011
• Access to Archives: Van de Sompel 2009
• Persistence of shared resources: Nelson 2002, Sanderson 2011, McCown 2007
• URL Shortening: Antoniades 2011
• Tweeting, Micro-blogging and Popularity: Wu 2011, Java 2007, Kwak 2010
• Social Networks Growth and Evolution: Meeder 2011

Further details: refer to chapter 3

Page 30: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 30

Motivation

Background

Related Research

Research Question

User-Time-Shared Resource

Conclusions

Page 31: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 31

Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?

Page 32: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 32

Research Goals

• Detect the temporal intention of the:

1. The author at sharing time

2. The reader at dereferencing time

• Model this intention as a function of time, nature of the resource, and its context.

• Predict how resources change with time and the intention behind sharing them to minimize inconsistency.

• Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web.

Further details: refer to chapter 6

Further details: refer to chapter 7

Further details: refer to chapter 8

Further details: refer to chapter 9

Page 33: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 33

Motivation

Background

Related Research

Research Question

User-Time-Shared Resource

Conclusions

Page 34: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 34

Shared Resource Time User

Our analysis covers three angles

Page 35: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 35

Shared Resource Time User

Loss and Persistence of Shared Resources

Page 36: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 36

Shared Resource Time User

Alive

First: Estimate social media content loss

Page 37: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 37

Six socially significant events

Event Source Year

Iranian Election SNAP Dataset 2009

H1N1 Virus Outbreak SNAP Dataset 2009

Michael Jackson’s Death SNAP Dataset 2009

Obama’s Nobel Peace Prize SNAP Dataset 2009

The Egyptian Revolution Twitter, Websites, Books 2011

The Syrian Uprising Twitter API 2012

Page 38: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 38

Twitter tag expansion and filtration

Page 39: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 39

Twitter tag expansion increases precision

Page 40: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 40

What are people sharing?

Page 41: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 41

Existence on the live web and in the archives

• For each unique URL we resolved the final HTTP response and considered 2 classes:
  • Success: 200 OK
  • Failure: 4XX, 50X families and the 30X loop redirects or soft 404s.
• Utilize the memento aggregator:
  • Archived: if it has at least one memento in the timemap
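The two rules on this slide can be sketched as a pair of small functions. This is an illustration under stated assumptions, not the code used in the study: the names `classify_live` and `is_archived` are hypothetical, redirect loops and soft 404s are assumed to have been detected by the caller, and any final status other than 200 is assumed to count as a failure.

```python
from typing import Sequence


def classify_live(status_code: int, redirect_loop: bool = False, soft_404: bool = False) -> str:
    """Map the final HTTP response for a URL to the slide's two classes."""
    if redirect_loop or soft_404:
        # 30X loops and soft 404s count as failures even if a 200 is eventually returned.
        return "failure"
    # Assumption: only 200 OK is a success; every other final status is a failure.
    return "success" if status_code == 200 else "failure"


def is_archived(timemap_mementos: Sequence[str]) -> bool:
    """A resource is archived if its timemap lists at least one memento."""
    return len(timemap_mementos) >= 1


print(classify_live(200), classify_live(404), is_archived([]))
```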

Page 42: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 42

Resources Missing and Archived

Collection          Percentage Missing    Percentage Archived
H1N1 Outbreak       23.49%                41.65%
Michael Jackson     36.24%                39.45%
Iran                26.98%                43.08%
Obama               24.59%                47.87%
Egypt               10.48%                20.18%
Syria               7.04%                 5.35%
(label lost)        31.62%                30.78%
(label lost)        24.47%                36.26%
(label lost)        25.64%                43.87%
(label lost)        26.15%                46.15%

Page 43: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 43

Shared Resource Time User

Alive

Missing

Second: Can we measure existence and disappearance as a function of time?

Page 44: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 44

Resources Missing and Archived

Collection          Percentage Missing    Percentage Archived
H1N1 Outbreak       23.49%                41.65%
Michael Jackson     36.24%                39.45%
Iran                26.98%                43.08%
Obama               24.59%                47.87%
Egypt               10.48%                20.18%
Syria               7.04%                 5.35%
(label lost)        31.62%                30.78%
(label lost)        24.47%                36.26%
(label lost)        25.64%                43.87%
(label lost)        26.15%                46.15%

Page 45: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 45

Timeline of Events

Page 46: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 46

Timeline of Events

Page 47: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 47

Social Events Having a Bimodal Time Distribution

Page 48: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 48

Timeline of Events

Page 49: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 49

Social Events Having a Bimodal Time Distribution

Page 50: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 50

Existence as a function of time

Page 51: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 51

Existence as a function of time

Page 52: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 52

Conclusion: Existence could be estimated as a function of time

• Publications and Articles:
1. H. M. SalahEldeen. Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html , 2012.
2. H. M. SalahEldeen and M. L. Nelson. Losing my revolution: how many resources shared on social media have been lost? In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12, 2012.

• Results:
  • Measured 21,625 resources from 6 data sets in archives & live web.
  • One year after publishing, about 11% of content shared on social media will be gone.
  • After that, we are losing roughly 0.02% daily.
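The two loss figures on this slide (about 11% after the first year, then roughly 0.02% per day) can be read as a simple linear estimate. The sketch below is only that reading: the function name is hypothetical, and because the slide gives no figure for the first 365 days, the function refuses ages below one year rather than guessing.

```python
FIRST_YEAR_LOSS = 11.0  # percent of shared content gone after one year (from the slide)
DAILY_LOSS = 0.02       # percent lost per additional day (from the slide)


def estimated_percent_missing(age_days: int) -> float:
    """Estimate the percentage of shared resources missing at a given age."""
    if age_days < 365:
        # Assumption: the slide only states figures from one year onward.
        raise ValueError("no figure is given for resources younger than one year")
    return min(100.0, FIRST_YEAR_LOSS + DAILY_LOSS * (age_days - 365))


# Two years after publishing: 11% plus 365 further days at 0.02% per day.
print(round(estimated_percent_missing(730), 2))
```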

Page 53: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 53

Revisiting Existence after a year

Missing

MJ Iran H1N1 Obama Egypt Syria
Measured   37.10%  37.50%  28.17%  30.56%  26.29%  31.62%  32.47%  24.64%   7.55%  12.68%
Predicted  31.72%  31.42%  31.96%  30.98%  30.16%  29.68%  29.60%  28.36%  19.80%  11.54%
Error       5.38%   6.08%   3.79%   0.42%   3.87%   1.94%   2.87%   3.72%  12.25%   1.14%

Average Prediction Error = 4.15%

In all cases, our missing predictions were acceptable.

Archived

MJ Iran H1N1 Obama Egypt Syria
Measured   48.61%  40.32%  60.80%  55.04%  47.97%  52.14%  48.38%  40.58%  23.73%   0.56%
Predicted  61.78%  61.18%  62.26%  60.30%  58.66%  57.70%  57.54%  55.06%  37.94%  21.42%
Error      13.17%  20.86%   1.46%   5.26%  10.69%   5.56%   9.16%  14.48%  14.21%  20.86%

Average Prediction Error = 11.57%

In all cases, our archival predictions were too optimistic.
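The two average prediction errors on this slide are the mean of the absolute differences between the measured and predicted rows. A short check of that arithmetic, using the values copied from the slide (the variable and function names are illustrative):

```python
MISSING_MEASURED = [37.10, 37.50, 28.17, 30.56, 26.29, 31.62, 32.47, 24.64, 7.55, 12.68]
MISSING_PREDICTED = [31.72, 31.42, 31.96, 30.98, 30.16, 29.68, 29.60, 28.36, 19.80, 11.54]
ARCHIVED_MEASURED = [48.61, 40.32, 60.80, 55.04, 47.97, 52.14, 48.38, 40.58, 23.73, 0.56]
ARCHIVED_PREDICTED = [61.78, 61.18, 62.26, 60.30, 58.66, 57.70, 57.54, 55.06, 37.94, 21.42]


def mean_absolute_error(measured, predicted):
    """Average of |measured - predicted| over paired observations."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


print(round(mean_absolute_error(MISSING_MEASURED, MISSING_PREDICTED), 2))    # 4.15
print(round(mean_absolute_error(ARCHIVED_MEASURED, ARCHIVED_PREDICTED), 2))  # 11.57
```

Every predicted archived value exceeds its measured value, which is what "too optimistic" refers to.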

Page 54: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 54

Shared Resource Time User

Alive

Missing

Replaced

Third: Can we use social context to find replacements of missing resources?

Page 55: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 55

Context discovery and shared resource replacement

Problem:

The 140-character limit restricts how much a tweet can say about the linked resource. If the resource goes missing, can we get the next best thing?

Solution:

• Shared links typically have several tweets, responses, and retweets

• We can mine these traces for context and viable replacements

Page 56: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 56

Context Discovery

Linking to: http://beta.18daysinegypt.com/

Page 57: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 57

What if the resource disappeared?

Linking to: http://beta.18daysinegypt.com/

Page 58: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 58

Use Topsy to discover tweets sharing the same link

Page 59: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 59

Social Context Extraction

{
  "URI": "http://beta.18daysinegypt.com/",
  "Related Tweet Count": 500,
  "Related Hashtags": "#tran #citizensx #arabspring #visualstorytelling #collaborativerevolution #feb11http://t.co/qxusp70 ...",
  "Users who talked about this": "@petra_stienen: @waleedrashed: @omarsamra @ungormite: @dcisbusy @webdocumentario: ...",
  "All associated unique links:": "http://t.co/63X1f3f1 http://t.co/reBh6c4V http://t.co/B3GuhQN4 http://t.co/X2sjf4Rf http://t.co/P9iR28fH http://t.co/1C4EPh8h ...",
  "All other links associated:": "http://vimeo.com/35368376 http://mashable.com/2012/01/21/18daysinegypt-2/ ",
  "Most frequent link appearing:": "http://t.co/2ke0rEjP",
  "Number of times the Most frequent link appearing:": 49,
  "Most frequent tweet posted and reposted:": "Check out 18DaysInEgypt - A crowd sourced documentary project ================= via @18daysinegypt",
  "Number of times the Most frequent tweet appearing:": 46,
  "The longest common phrase appearing:": "RT 2ke0rEjP is an interactive documentary website that YOU can help create Get your Jan25 stories ready! Pl RT",
  "Number of times the Most common phrase appearing:": 18
}

Page 60: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 60

Build a Tweet Document

A tweet document represents the concatenation of all extracted tweets:

“do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”

Page 61: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 61

Tweet Signature

Tweet Document:

“do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”

Tweet Signature = top 5 most frequent terms from Tweet Document

documentary project daysinegypt check sourced
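The signature step above (top 5 most frequent terms of the tweet document) can be sketched as follows. The slides do not say how terms were tokenized or which stop words were dropped, so the regular expression and the tiny `STOP_WORDS` set here are assumptions for illustration, as is the function name.

```python
import re
from collections import Counter

# Assumption: a small illustrative stop-word list; the actual list is not given in the slides.
STOP_WORDS = {
    "a", "an", "the", "to", "of", "and", "or", "on", "it", "is",
    "you", "your", "do", "have", "about", "out", "via", "with",
}


def tweet_signature(tweet_document: str, k: int = 5) -> list:
    """Return the k most frequent non-stop-word terms of a tweet document."""
    terms = re.findall(r"[a-z0-9']+", tweet_document.lower())
    counts = Counter(term for term in terms if term not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]


toy_document = "documentary documentary documentary project project egypt check out the project"
print(tweet_signature(toy_document, k=2))
```

The resulting terms are joined into a single query string for the search engine.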

Page 62: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 62

Query Google with the Tweet Signature

Page 63: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 63

Search Engine Results

The original resource

Page 64: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 64

Search Engine Results

The original resource

The others are good replacement candidates

Page 65: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 65

Recommendation Evaluation

We extract a dataset of resources that are currently available:
• Pretend these resources no longer exist (for a baseline)
• Each resource is text-based
• Each resource has at least 30 retrievable tweets

Extracted 731 unique resources

We use a boilerplate removal library to remove the template from the:
• linked resources
• top 10 retrieved results from Google

We use cosine similarity to compare the documents
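As a rough illustration of the comparison step, cosine similarity between two boilerplate-free documents can be computed over term-frequency vectors. The slides do not state the term weighting that was used, so raw term counts are an assumption here, and the function name is illustrative.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the raw term-count vectors of two texts."""
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "crowd sourced documentary project about the egyptian revolution"
candidate = "documentary project crowdsourcing stories of the egyptian revolution"
print(cosine_similarity(original, candidate) >= 0.70)
```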

Page 66: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 66

Similarity measures in resource replacement

----70% similarity----

In 41% of the cases we found a replacement with >=70% similarity

Page 67: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 67

Conclusion: We can find viable replacements for missing shared resources

• Results:
  • In 41% of the test cases we can find a replacement page with at least 70% similarity to the original missing resource
  • The search results provide a mean reciprocal rank of 0.43

• Publications:
1. H. SalahEldeen and M. L. Nelson. Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
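For reference, the mean reciprocal rank reported on this slide averages 1/rank of the target page over all queries, counting a query as 0 when the target does not appear in the results. A minimal sketch of that standard definition, with a hypothetical function name and toy ranks rather than the study's data:

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """ranks holds the 1-based position of the target in each result list, or None if absent."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Toy example: found first, found second, not found at all.
print(mean_reciprocal_rank([1, 2, None]))  # 0.5
```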

Page 68: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 68

Now that we have finished analyzing the shared resource… what’s next?

Page 69: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 69

Shared Resource Time User

Alive

Missing

Replaced

Footprints on the web

Page 70: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 70

The tweet, the resource…and time

time

Posted a tweet

Read the tweet

Relevancy of the resource to the tweet changed through time

we need to measure that

Another tweet posted

And another

We need to measure tweet relevance through time

Page 71: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 71

Shared Resource Time User

Alive

Missing

Replaced

Rate of Change

Longitudinal Study: Rate of change of shared content

Page 72: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 72

Pilot 1: Resource change in the first 80 hours after tweeting

Page 73: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 73

Pilot 2: Delta days from Bitly creation for just tweeted content

Dataset size = 4,000

Page 74: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 74

Pilot 3: Dataset of 1,000 freshly created Bitlys

http://www.cnn.com depth = 0

http://www.cnn.com/world depth = 1

http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson depth = 6
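The three examples above suggest that depth is the number of non-empty path segments in the URL. A small sketch under that assumption (the function name is illustrative):

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


for example in (
    "http://www.cnn.com",
    "http://www.cnn.com/world",
    "http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson",
):
    print(example, "depth =", url_depth(example))
```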

Page 75: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 75

What domains do users link to?

Page 76: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 76

What categories* do users link to?

* Extracted from Alexa.com

Page 77: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 77

Summation of Intention in Social Content Through Time

Longitudinal study: We record the change over an extended period of time:
• Content: we download a snapshot of the resource every 45 minutes
• Metadata: we collect metadata about the resource:
  • Facebook likes, posts
  • Tweets in the last hour
  • Bitly clicklogs and shares
• Average data size: ~1 TB per month

Page 78: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 78

Hourly analysis over an extended period of time

Page 79: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 79

There is a difference between t_tweet and t_click

• After just one hour, 4% of the resources have changed by 30%.
• After six hours, the share doubles: 8% of the resources have changed by 40%.
• After a day, the change rate slows: 12% of the resources have changed by 40%.
• After that, it almost stabilizes at 17% of the resources changed by 40%.

Page 80: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 80

Shared Resource Time User

Alive

Missing

Replaced

Rate of Change

Archive & Creation

First: Resource – Time – Public Archives

Page 81: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 81

Revisited: Resources Missing and Archived

Collection          Percentage Missing    Percentage Archived
H1N1 Outbreak       23.49%                41.65%
Michael Jackson     36.24%                39.45%
Iran                26.98%                43.08%
Obama               24.59%                47.87%
Egypt               10.48%                20.18%
Syria               7.04%                 5.35%
(label lost)        31.62%                30.78%
(label lost)        24.47%                36.26%
(label lost)        25.64%                43.87%
(label lost)        26.15%                46.15%

Page 82: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 82

But more generally, we want to know…

Page 83: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 83

How much of the web is archived?

• Goal: Estimate how much of the public web is present in the public archives, and how many copies are available.

• Action: Get 4 different datasets from 4 different sources:
  • Search Engine Indices
  • Bit.ly
  • DMOZ
  • Delicious

Page 84: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 84

Conclusion: It depends on the source

• Results:

• Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM.

Page 85: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 85

Conclusion: It depends on the source

• Results:

• Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM.

Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives

2013

95%

92%

23%

26%

Page 86: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 86

Side Experiment: Analyzing the quality of the archives and the archived content

• Goal:
  • Assessing the quality of the web archives
  • Better discussed in Justin Brunelle’s work

• Publications:
1. J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014 (Best student paper award).

Page 87: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 87

A question emerged: When did a certain resource first appear on the web?

Page 88: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 88

Shared Resource Time User

Alive

Missing

Replaced

Rate of Change

Archive & Creation

Second: When was the resource created?

Page 89: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 89

Idea

Web pages leave trails as well since the day they were created…

Page 90: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 90

WebResource

Web trails

A web page could leave a trail of one of the following denoting its existence:

• References

• Links (anchors)

• Social media likes and interactions.

• URL shortening.

• Backlinks

• The creation date of any of the associated events/trails could be an estimate of the creation date.

Page 91: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 91

Resource’s timeline

Page 92: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 92

Observations Recorded

1. Last modified date from the response header.
2. First appearance of a backlink.
3. First tweet published.
4. First Bitly shortened URL created.
5. Time stamp of first memento in the archives.
6. Date of the last crawl by the search engine.

Page 93: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 93

Carbon Date service

Page 94: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 94

Carbon Dating API

{
  "self": "http://cd.cs.odu.edu/cd?url=http://www.cnn.com",
  "URI": "http://www.cnn.com",
  "Estimated Creation Date": "1998-12-06T04:02:33",
  "Last Modified": "",
  "Bitly.com": "2008-06-08T12:00:00",
  "Topsy.com": "2015-01-25T23:31:42",
  "Backlinks": "2003-03-12T05:35:44",
  "Google.com": "2005-01-11T00:00:00",
  "Archives": [
    ["Earliest", "1998-12-06T04:02:33"],
    ["By_Archive", {
      "http://archive.today/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://arquivo.pt/wayback/wayback/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://wayback.vefsafn.is/wayback/20011106102722/http://www.cnn.com/": "1998-12-06T04:02:33",
      "http://web.archive.org/web/20131218180509/http://www.cnn.com/": "2013-12-18T18:05:09"
    }]
  ]
}
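In the response above, the estimated creation date equals the earliest timestamp among the individual sources. A minimal sketch of that selection step, assuming the estimate is simply the minimum over all non-empty observations; the function name and the flattened `CNN_OBSERVATIONS` dictionary are illustrative, not the service's actual code.

```python
from datetime import datetime
from typing import Dict, Optional


def estimate_creation_date(observations: Dict[str, str]) -> Optional[str]:
    """Return the earliest ISO timestamp among the sources, ignoring empty values."""
    dates = [datetime.fromisoformat(value) for value in observations.values() if value]
    return min(dates).isoformat() if dates else None


# Values copied from the example response for http://www.cnn.com.
CNN_OBSERVATIONS = {
    "Last Modified": "",
    "Bitly.com": "2008-06-08T12:00:00",
    "Topsy.com": "2015-01-25T23:31:42",
    "Backlinks": "2003-03-12T05:35:44",
    "Google.com": "2005-01-11T00:00:00",
    "Archives (earliest)": "1998-12-06T04:02:33",
}

print(estimate_creation_date(CNN_OBSERVATIONS))  # 1998-12-06T04:02:33
```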

Page 95: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 95

Evaluation Dataset

From each we randomly selected 100 unique URLs to create our gold standard dataset

Page 96: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 96

Evaluation

• Applied our 6 methods on 1200 resources.

• Get leftmost estimate.

                           Number of Resources    Percentage
An estimate found          910                    76%
Exact matching estimate    393                    33%
No estimate found          290                    24%
Total Resources            1200                   100%

Page 97: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 97

Actual Vs. Estimated Dates

Page 98: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 98

Conclusion: We can estimate the creation date of resources correctly

• Results:
  • Succeeded in estimating the creation date accurately in 75.90% of the resources.

• Publications:
1. H. M. SalahEldeen and M. L. Nelson. Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion, TempWeb03, WWW '13, 2013.

Page 99: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 99

Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities.

http://cd.cs.odu.edu/

Page 100: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 100

Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities.

Yes, it’s better than mine… I admit it

Page 101: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 101

Shared Resource Time User

Alive

Missing

Replaced

Rate of Change

Archive & Creation

User’s Temporal Intention

Page 102: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 102

Problem: There is an inconsistency between what the tweet’s author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.

Page 103: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 103

Shared Resource Time User

Alive

Missing

Replaced

Rate of Change

Archive & Creation

Detecting

What is Intention and how to detect it?

Page 104: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 104

Amazon’s Mechanical Turk

• Crowdsourcing Internet marketplace

• Co-ordinates the use of human intelligence to perform tasks that computers are currently unable to do.*

* http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk

Page 105: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 105

Goal: Understand and collect user intention data via MT

Tweets dataset Intention Classification Tasks User Intention Data

Classifier

Train

Page 106: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 106

Goal: Understand and collect user intention data via MT

Tweets dataset Intention Classification Tasks User Intention Data

Classifier

Train

• Problem:• It is not as easy as it seems!

Page 107: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 107

How NOT to classify temporal intention 101

• The tweet is presented along with the two snapshots:

at t_tweet          at t_click

Page 108: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 108

And compared MT results with Experts

• Experts: Manually assigning a version to each tweet via a face to face meeting with WS-DL members.

• For 9 MT assignments per tweet:
  • If we allowed 4-5 splits, we had a 58% match with WS-DL.
  • If we allowed only 3-6 splits or better, we got a 31% match.

Which is worse than flipping a coin!
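One way to read the split thresholds above: with 9 worker votes per tweet, accepting a 4-5 split means a bare majority of 5 is enough, while requiring a 3-6 split or better means the winning label needs at least 6 votes. The sketch below encodes that reading only; the function name and the "past"/"current" labels are assumptions for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_majority: int) -> Optional[str]:
    """Return the most common label if it has at least min_majority votes, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_majority else None


votes = ["past"] * 5 + ["current"] * 4   # a 4-5 split among 9 workers
print(majority_label(votes, min_majority=5))  # accepted under the lenient rule
print(majority_label(votes, min_majority=6))  # rejected under the stricter rule
```

The accepted worker label would then be compared with the version the experts assigned to the same tweet.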

Page 109: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 109

Idea: We need to transform the problem from intention to relevance.

Page 110: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 110

Relevance tasks are simpler

• MT workers are more accustomed to classification tasks, which require a minimal amount of explanation

• Transform a hard problem to an easy one

Is that a cat?

- Yes

- No

Page 111: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 111

Temporal Intention Relevancy Model (TIRM)

Between t_tweet and t_click:

The linked resource could have:
• Changed
• Not changed

The tweet and the linked resource could be:
• Still relevant
• No longer relevant

Page 112: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 112

Resource is changed but relevant

• The resource changed
• But it is still relevant

Intention: need the current version of the resource at any time

Page 113: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 113

Relevancy and Intention mapping

• Changed and relevant: Current

Page 114: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 114

Resource is changed and not relevant

Intention: need the past version of the resource at any time

• The resource changed
• But it is no longer relevant

Page 115: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 115

Relevancy and Intention mapping

• Changed and relevant: Current
• Changed and not relevant: Past

Page 116: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 116

Resource is not changed and relevant

Intention: need the past version of the resource at any time

• The resource is not changed
• And it is relevant

Page 117: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 117

Relevancy and Intention mapping

• Changed and relevant: Current
• Changed and not relevant: Past
• Not changed and relevant: Past

Page 118: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 118

Resource is not changed and not relevant

Intention: I am not sure which version of the resource I need

• The resource is not changed
• But it is not relevant

Page 119: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 119

Relevancy and Intention mapping

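• Changed and relevant: Current
• Changed and not relevant: Past
• Not changed and relevant: Past
• Not changed and not relevant: Not Sure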
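The mapping built up over the previous slides can be sketched as a small lookup function. This is a minimal illustration, not the dissertation's code: the names `Intention` and `map_intention` are hypothetical, and the two boolean inputs are assumed to come from a change check and a relevance judgment made elsewhere.

```python
from enum import Enum


class Intention(Enum):
    CURRENT = "current"
    PAST = "past"
    NOT_SURE = "not sure"


def map_intention(changed: bool, relevant: bool) -> Intention:
    """Map the two TIRM observations (hypothetical inputs) to an intention."""
    if changed and relevant:
        # The resource changed but still fits the tweet: current version.
        return Intention.CURRENT
    if not changed and not relevant:
        # Nothing changed, yet the resource does not fit the tweet.
        return Intention.NOT_SURE
    # Changed and not relevant, or not changed and relevant: past version.
    return Intention.PAST
```

For example, `map_intention(changed=True, relevant=False)` yields `Intention.PAST`, matching the "changed and not relevant" slide.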

Page 120: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 120

Validation: Update the MT experiment

• MT workers' judgments ≡ judgments of the experts (WS-DL members)

Is the content still relevant to the tweet?

Page 121: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 121

Mechanical Turk Workers Vs. Experts

• For 100 tweets, agreement between MT workers and WS-DL members:

Agreement in 3-2 split or more votes: 93%
Agreement in 4-1 split or more votes: 80%
Agreement with 5-0 votes: 60%

• Cohen’s κ = 0.854 (almost perfect agreement)
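As a rough illustration of how such agreement figures can be computed, the sketch below aggregates five binary worker votes into a majority label with its split, and computes Cohen's kappa between two label sequences. The function names and label strings are hypothetical, and the slides do not state exactly which pair of raters the reported kappa compares, so this shows only the standard two-rater formula.

```python
from collections import Counter


def majority_vote(votes):
    """Return (winning_label, split) for an odd number of binary votes."""
    label, top = Counter(votes).most_common(1)[0]
    return label, f"{top}-{len(votes) - top}"


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A tweet judged relevant by four of five workers would produce `("relevant", "4-1")`, which is the kind of split counted in the 4-1 row above.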

Page 122: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 122

[Diagram: Shared Resource, Time, User. Labels: Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling]

Can we model this temporal intention?

Page 123: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 123

Data Collection

• From SNAP dataset we extracted:
  • Tweets in English
  • Each has an embedded URI pointing to an external resource.
  • The embedded URI is shortened via Bit.ly

• The external resource:
  • Still persists.
  • Has at least 10 mementos.
  • Is unique.

We extracted 5,937 unique instances
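The selection criteria above can be sketched as a filter over candidate records. Everything here is an assumption made for illustration: the `Candidate` fields, treating "still persists" as an HTTP 200 response, and treating "is unique" as de-duplication on the final (unshortened) URI are not taken from the slides.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record layout for one tweet and its linked resource.
    tweet_lang: str
    short_uri: str
    final_uri: str
    http_status: int
    memento_count: int


def select_instances(candidates):
    """Keep candidates that satisfy all of the listed criteria."""
    seen = set()
    selected = []
    for c in candidates:
        if c.tweet_lang != "en":
            continue  # tweets in English only
        if urlparse(c.short_uri).netloc.lower() != "bit.ly":
            continue  # embedded URI must be shortened via Bit.ly
        if c.http_status != 200:
            continue  # assumption: "still persists" means HTTP 200
        if c.memento_count < 10:
            continue  # at least 10 mementos
        if c.final_uri in seen:
            continue  # assumption: uniqueness is by final URI
        seen.add(c.final_uri)
        selected.append(c)
    return selected
```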

Page 124: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 124

Time delta between the tweet and the closest memento

Randomly selected 1,124 instances
Time delta range: 3.07 minutes to 56.04 hours
Average: 25.79 hours ~ 1 day

[Figure: time delta of the closest memento relative to tweet time, before and after tweet time]
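Finding the closest memento and its signed time delta can be sketched as follows. The function name and the sign convention (negative for captures before the tweet, positive for captures after) are assumptions for illustration; the list of capture datetimes is assumed to have been retrieved from an archive beforehand.

```python
from datetime import datetime, timedelta


def closest_memento(tweet_time, memento_times):
    """Return (memento_time, delta) for the capture nearest to tweet_time.

    delta is memento_time - tweet_time, so it is negative when the
    closest capture was made before the tweet was posted.
    """
    best = min(memento_times, key=lambda m: abs(m - tweet_time))
    return best, best - tweet_time
```

With a tweet posted at noon and captures one day earlier, three hours later, and a week later, the function picks the capture three hours after the tweet.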

Page 125: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 125

Training Dataset

• R_current: The state of the resource at current time.

• R_click: The state of the resource at click time.

Relevant Assignments: 929 (82.65%)
Non-Relevant Assignments: 195 (17.35%)

5 MT workers agreeing (5-0 split): 589 (52.40%)
4 MT workers agreeing (4-1 split): 309 (27.49%)
3 MT workers agreeing (3-2 close call split): 226 (20.11%)

Page 127: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 127

Intention modeling: Feature extraction

• For each tweet we perform:
  • Link analysis
  • Social media mining
  • Archival existence
  • Sentiment analysis
  • Content similarity
  • Entity identification

Page 128: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 128

Training the classifier

• From the feature extraction phase we extracted 39 different features to train the classifier.

• Using 10-fold cross-validation, the cost-sensitive classifier based on random forests gave the highest success rate: 90.32%

Page 129: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 129

Most significant features sorted by information gain

Rank Feature Gain Ratio

1 Existence of celebrities in tweets 0.149

2 Number of mementos 0.090

3 Tweet similarity with current page 0.071

4 Similarity: Current & past page 0.053

5 Similarity: Tweet & past page 0.044

6 Original URI’s depth 0.032
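For readers unfamiliar with the ranking metric, the sketch below computes information gain and gain ratio for a single discrete feature. It is a textbook formulation, not the tooling used for the table above; the function names are hypothetical, and continuous features (such as similarity scores) would need to be discretized before applying it.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalizes features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0
```

A binary feature that separates the two classes perfectly scores 1.0, while a feature unrelated to the class scores 0.0.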

Page 130: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 130

Testing the model

• We tested against:
  • The remaining 4,813 from the original 5,937 instances after extracting the 1,124 used in training.
  • The Tweet Collections based on historic events. (MJ, Obama, Iran, Syria, & H1N1)

Dataset                    Status 200   Status 404 or other   Relevant %   Non-Relevant %
Extended 4,813 instances   96.77%       3.23%                 96.74%       3.26%
MJ’s Death                 57.54%       42.46%                93.24%       6.76%
H1N1 Outbreak              8.96%        91.04%                97.48%       2.52%
Iran Elections             68.21%       31.79%                94.69%       5.31%
Obama’s Nobel Prize        62.86%       37.14%                93.89%       6.11%
Syrian Uprising            80.80%       19.20%                70.26%       29.75%

Page 131: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 131

Idea: We need to transform the problem from intention to relevance.

Now we need to transform it back!

Recap…

Page 132: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 132

Recap: Relevancy and Intention mapping

Past: reading the wrong history

Page 133: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 133

Mapping TIRM

• We used 70% similarity as a threshold of relevancy.

Reading the wrong history in up to 25% of the cases
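A threshold test of this kind can be sketched as follows. The slide does not say which similarity measure or which pair of texts the 70% threshold was applied to, so the cosine similarity over word counts used here, and the names `cosine_similarity` and `is_relevant`, are assumptions chosen only to make the idea concrete.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors of two texts."""
    terms_a = Counter(re.findall(r"\w+", text_a.lower()))
    terms_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(terms_a[t] * terms_b[t] for t in terms_a.keys() & terms_b.keys())
    norm_a = math.sqrt(sum(c * c for c in terms_a.values()))
    norm_b = math.sqrt(sum(c * c for c in terms_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no word tokens
    return dot / (norm_a * norm_b)


def is_relevant(text_a: str, text_b: str, threshold: float = 0.7) -> bool:
    """Treat the pair as relevant when similarity reaches the threshold."""
    return cosine_similarity(text_a, text_b) >= threshold
```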

Page 134: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 134

Conclusion: We can model users’ temporal intention accurately and efficiently

• Results:
  • We successfully transformed the complicated problem of intention to a simpler one of relevance.
  • We successfully collected a gold standard dataset of temporal user intention.
  • We found a temporal inconsistency in the shared resource in up to 25% of the cases according to the dataset.

• Publications:
  1. H. M. SalahEldeen and M. L. Nelson. Reading the correct history?: Modeling temporal intention in resource sharing. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013.

Page 135: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 135

So we modeled intention… can we make it better?

Page 136: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 136

Most significant features sorted by information gain

Rank Feature Gain Ratio

1 Existence of celebrities in tweets 0.149

2 Number of mementos 0.090

3 Tweet similarity with current page 0.071

4 Similarity: Current & past page 0.0527

5 Similarity: Tweet & past page 0.04401

6 Original URI’s depth 0.0324

Page 138: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 138

Enhancing TIRM

• Extending and tuning the features:
  • Linguistic feature analysis
  • Semantic similarity analysis using latent topic modeling
  • Dataset balancing
  • Feature selection and minimization

Page 139: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 139

A whole lot of features! 65 different features in extended TIRM (up from 39)

Further details: refer to chapter 7

Page 140: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 140

TIRM enhancement and minimization results

Page 141: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 141

From binary to probabilistic strength

[Figure labels: Point of Confusion (C), Point of Certainty (S), Strongest Current Intention]

Further details: refer to chapter 7

Page 142: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 142

Intention strength formulation

Intention strength magnitude of the new resource:

Generalization with regard to class:

Page 143: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 143

Intention strength across instances in dataset

Page 144: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 144

Page 145: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 145

[Diagram: Shared Resource, Time, User. Labels: Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling, Predicting]

Can we find a relation between the modeled intention and time… to predict it?

Page 146: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 146

Remember: Data Collection

• From SNAP dataset we extracted:
  • Tweets in English
  • Each has an embedded URI pointing to an external resource.
  • The embedded URI is shortened via Bit.ly

• The external resource:
  • Still persists.
  • Has at least 10 mementos.
  • Is unique.

We extracted 5,937 unique instances

Page 147: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 147

Intention strength across time

[Timeline: from resource = closest memento to resource = current version]

We have 10 mementos of the resource uniformly distributed

We can calculate intention strength at every point

Page 148: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 148

Intention strength across time

Dataset collection and calculation framework

Page 149: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 149

Behavior of instances in different classes

[Figure: three panels of intention strength over time; surviving panel titles: Steady Current Intention, Steady Past Intention]

Page 150: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 150

Behavior of instances in different classes

Page 151: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 151

Given the features we already collected, can we classify tweets according to their behavioral class?

Page 152: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 152

Classifying intention behavior across time

Page 153: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 153

If we can limit the features to the ones that exist before tweet time, can we perform a prediction?

Page 154: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 154

Classifying intention behavior across time

We can perform a prediction!

Page 155: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 155

Intention behavior prediction classifier

Page 156: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 156

Conclusion: We can predict the author’s temporal intention

• Results:
  • We can predict for the author, with 77% accuracy, whether the intention conveyed to the readers will stay consistent or change.

• Publications:
  1. H. M. SalahEldeen and M. L. Nelson. Predicting Temporal Intention in Resource Sharing. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '15, 2015.

Page 157: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 157

At this point, we have successfully detected, modeled, and predicted users’ temporal intention in shared content

Page 158: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 158

[Diagram: Shared Resource, Time, User. Labels: Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling, Predicting, User Temporal Intention]

Temporal Intention Model

Page 159: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 159

So we built an awesome prediction model for Temporal Intention… what next?

Page 160: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 160

A Framework of Temporal Intention

time

Posted a tweet

Read the tweet

• Tools for authors
• Enrich the archives with current content for posterity

Page 161: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 161

Prediction API

Page 162: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 162

Tools for Authors

Page 163: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 163

Temporal Intention Implementation

time

Posted a tweet

Read the tweet

• Tools for readers
• Maintain the temporal consistency of content

Page 164: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 164

Tools for readers

Page 165: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 165

Tools for readers

1. Temporal preservation of vulnerable content

2. Version recommendation based on temporal intention estimation

Target Publication: Utilizing Temporal Intention Prediction for Just-in-time Preservation and Recommendation of Vulnerable Social Media Content. WSDM 2016

Page 166: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 166

Motivation

Background

Related Research

Research Question

User-Time-Shared Resource

Conclusions

Page 167: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 167

Accomplished Goals

• Detect the temporal intention of:
  1. The author at sharing time
  2. The reader at dereferencing time
  (Further details: refer to chapter 6)

• Model this intention as a function of time, the nature of the resource, and its context. (Further details: refer to chapter 7)

• Predict how resources change with time and the intention behind sharing them to minimize inconsistency. (Further details: refer to chapter 8)

• Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web. (Further details: refer to chapter 9)

Page 168: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 168

Also, our work reached fame…

Page 169: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 169

The Virginian Pilot

Page 170: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 170

http://www.bbc.com/future/story/20120927-the-decaying-web

BBC.com

Page 171: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 171

Popular Mechanics
February 2014 issue, page 20

Page 172: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 172

3 x MIT Technology Review

http://www.technologyreview.com/view/513996/how-to-carbon-date-a-web-page/

http://www.technologyreview.com/view/519391/internet-archaeologists-reconstruct-lost-web-pages/

http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

Page 173: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 173

Mashable

Page 174: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 174

Mashable

Yes, I am Indiana Jones of the internet

Page 175: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 175

Publications

Published Submitted In preparation Planned

JCDL 2011 TPDL 2015 WWW 2016 IJDL 2016

TPDL 2012 SIGIR 2016 WSDM 2016

JCDL 2013

TPDL 2013

WWW 2013

DL 2014

AAAI 2015

IJDL 2015

JCDL 2015

Page 176: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 176

Remember Rémi Ochlik?

Rémi Ochlik
16 October 1983 – 22 February 2012

Page 177: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 177

… and the missing content about him?

Accessed in July 2014

Page 178: Doctoral Defense: Hany SalahEldeen

2015 Hany SalahEldeen Dissertation Defense 178

We can maintain the consistency of history

Our Temporal Intention Relevancy Model