Tpdl Doctoral consortium 2012


Transcript of Tpdl Doctoral consortium 2012

Page 1: Tpdl Doctoral consortium 2012

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Hany SalahEldeen & Michael Nelson Doctoral Consortium

Hany M. SalahEldeen

TPDL ’12 Doctoral Consortium

Old Dominion University, Department of Computer Science

Advisor: Dr. Michael L. Nelson

Page 2: Tpdl Doctoral consortium 2012

Let’s break down the title first…

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Page 5: Tpdl Doctoral consortium 2012

Scenario 1: Jenny reading Jeff’s tweets

Page 6: Tpdl Doctoral consortium 2012

Michael Jackson Dies

Snapshot on: June 25th 2009

http://web.archive.org/web/20090625232522/http://www.cnn.com/

Page 7: Tpdl Doctoral consortium 2012

Jeff tweets about it…

Published on: June 25th 2009

https://twitter.com/mdnitehk/status/2333993907

Page 8: Tpdl Doctoral consortium 2012

Jeff’s friend Jenny was on vacation in Hawaii for a month

Jenny is off the grid…

Page 9: Tpdl Doctoral consortium 2012

When she came back she checked Jeff’s tweets and was shocked!

Jenny starts catching up a month later

Read on: July 26th 2009!

https://twitter.com/mdnitehk/status/2333993907

Page 10: Tpdl Doctoral consortium 2012

She quickly clicked on the link in the tweet…

Jenny follows the link on July 26th


http://web.archive.org/web/20090726234411/http://www.cnn.com/

CNN page on: July 26th 2009

Page 11: Tpdl Doctoral consortium 2012

• Implication: Jenny thought Jeff was making a joke about her favorite singer, and she got mad at him.

• Problem: The tweet and the resource the tweet links to have become unsynchronized.

Jenny is confused!

Page 12: Tpdl Doctoral consortium 2012

Scenario 2: The Egyptian Revolution

Page 13: Tpdl Doctoral consortium 2012

The Egyptian Revolution, Jan 2011

Page 14: Tpdl Doctoral consortium 2012

Reading about it on Storify.com a year later, in March 2012

http://storify.com/maq4sure/egypts-revolution

Page 15: Tpdl Doctoral consortium 2012

I noticed some shared images are missing


http://storify.com/maq4sure/egypts-revolution

Page 16: Tpdl Doctoral consortium 2012

Some tweets are still intact


https://twitter.com/miss_amy_qb/status/32477898581483521

Page 17: Tpdl Doctoral consortium 2012

…and some lost their meaning with the disappearance of the images

Missing?

https://twitter.com/aishes/status/32485352102952960

https://twitter.com/omar_chaaban/status/32203697597452289

Page 18: Tpdl Doctoral consortium 2012

The tweet remains but the shared image disappeared…


http://yfrog.com/h5923xrvbqqvgzj

Page 19: Tpdl Doctoral consortium 2012

• Implication: The reader cannot understand what the author of the tweet meant because the image is not available.

• Problem: The post is available, but the linked resource (image) is completely missing.

Cairo… we have a problem!

Page 20: Tpdl Doctoral consortium 2012

…back to the title

Detecting, Modeling, & Predicting User Temporal Intention in Social Media

Page 22: Tpdl Doctoral consortium 2012

The Anatomy of a Tweet


Page 23: Tpdl Doctoral consortium 2012

The Anatomy of a Tweet

• Author’s username
• Other user mention
• Tweet body
• Hashtag
• Shortened URL to resource
• Publishing timestamp
• Social post
• Shared resource
• Interaction options
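The parts of a tweet that live in its text (mentions, hashtags, and shortened URLs) can be pulled out with a few regular expressions. The following is only a minimal sketch: the patterns, the `TweetParts` name, and the sample tweet are illustrative assumptions rather than anything taken from the slides, and the author's username and publishing timestamp are assumed to come from tweet metadata, not from the text.

```python
import re
from dataclasses import dataclass
from typing import List

# Simplified patterns (assumptions); real tweet entity parsing is more involved.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


@dataclass
class TweetParts:
    mentions: List[str]   # other users mentioned in the text
    hashtags: List[str]   # hashtags without the leading '#'
    urls: List[str]       # (shortened) URLs pointing to shared resources
    body: str             # text with URLs removed and whitespace collapsed


def parse_tweet(text: str) -> TweetParts:
    """Split tweet text into the parts named on the slide."""
    urls = URL_RE.findall(text)
    # Remove URLs first so '#fragment' or '@' inside a URL is not miscounted.
    without_urls = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(without_urls)
    hashtags = HASHTAG_RE.findall(without_urls)
    body = " ".join(without_urls.split())
    return TweetParts(mentions, hashtags, urls, body)


# Hypothetical tweet text, not the actual tweet shown in the deck.
parts = parse_tweet("@jenny Michael Jackson dies http://bit.ly/abc123 #MJ")
print(parts)
```

Stripping the URLs before matching mentions and hashtags is a deliberate choice, so that a URL fragment such as `#section` is not reported as a hashtag.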

Page 24: Tpdl Doctoral consortium 2012

3 URIs = 3 Chances to fail


Page 25: Tpdl Doctoral consortium 2012

[Timeline figure: time points t1, t2, …, tn]

Explanation in MJ’s example

Page 26: Tpdl Doctoral consortium 2012

User’s Temporal Intention

Share time Implicit Explicit

Click time Implicit Explicit

Engineering problem: solved by providing tools

The focus of our research

Out of our scope: purview of Facebook, Twitter, Google, etc.

Instrumented shortener

Instrumented web client

Page 27: Tpdl Doctoral consortium 2012

Sometimes you want a previous version

The Correct Temporal Intention

CNN.com at the closest time to the tweet: 25th June 2009, ~7pm
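Picking the archived copy closest to the tweet time can be sketched as follows. The archive URLs in this deck follow the Wayback Machine pattern of a 14-digit `YYYYMMDDhhmmss` timestamp followed by the original URL, and the two snapshot times below are taken from those URLs. The function names are illustrative, and the tweet time is an assumption based on the "~7pm" on this slide (time zone unknown, so naive datetimes are used).

```python
from datetime import datetime
from typing import Iterable

WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for original_url at the given time."""
    return f"{WAYBACK_PREFIX}{when.strftime('%Y%m%d%H%M%S')}/{original_url}"


def closest_snapshot(snapshots: Iterable[datetime], target: datetime) -> datetime:
    """Return the snapshot time with the smallest absolute distance to target."""
    return min(snapshots, key=lambda s: abs(s - target))


# Snapshot times read off the two CNN archive URLs cited in the deck.
snapshots = [
    datetime(2009, 6, 25, 23, 25, 22),
    datetime(2009, 7, 26, 23, 44, 11),
]
# Assumed tweet time: 25th June 2009, about 7pm.
tweet_time = datetime(2009, 6, 25, 19, 0, 0)

best = closest_snapshot(snapshots, tweet_time)
print(wayback_url("http://www.cnn.com/", best))
# prints http://web.archive.org/web/20090625232522/http://www.cnn.com/
```

In practice the list of snapshot times would come from the archive's timemap for the resource; here it is hard-coded to keep the sketch self-contained.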

Page 28: Tpdl Doctoral consortium 2012

Sometimes you want the current version

The Correct Temporal Intention

In this case, the current state of the press releases page

Page 29: Tpdl Doctoral consortium 2012

Research Question

Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?

Page 30: Tpdl Doctoral consortium 2012

Research Goals

• Detect the temporal intention of:
  1. The author at sharing time
  2. The reader at dereferencing time

• Model this intention as a function of time, the nature of the resource, and its context.

• Predict how resources change with time, and the intention behind sharing them, to minimize inconsistency.

• Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss.

• Create an environment implementing this framework that provides smooth temporal navigation of the social web.

Page 31: Tpdl Doctoral consortium 2012

Related Work

• User’s Web Search Intention
  – A. Ashkan, ECIR ’09
  – C. Lee, AINA ’05
  – A. Loser, IRSW ’08
  – L. Azzopardi, ECIR ’09
  – R. Baeza-Yates, SPIR ’06
  – N. Dai, HT ’11

• Commercial Intention
  – Q. Guo, SIGIR ’10
  – A. Benczur, AIRWeb ’07

• Sentiment Analysis
  – G. Mishne, AAAI ’06
  – J. Bollen, JCS ’11

• Access to Archives
  – H. Van de Sompel, OR ’09

• Persistence of Shared Resources
  – M. Nelson, D-Lib ’02
  – R. Sanderson, OR ’11
  – F. McCown, JCDL ’07

• URL Shortening
  – D. Antoniades, WWW ’11

• Tweeting, Micro-blogging and Popularity
  – S. Wu, WWW ’11
  – A. Java, SNA-KDD ’07
  – H. Kwak, WWW ’10

• Social Networks Growth and Evolution
  – B. Meeder, WWW ’11

Page 32: Tpdl Doctoral consortium 2012

Dissertation Plan

BEGIN

• Read Literature
• Collect Datasets
• Analyze Archives Coverage
• Analyze Shortened URIs
• Prototype Application
• Analyze Shared Resources Persistence and Coverage
• Analyze Contextual Intention
• Create Intention-based dataset
• Extract Intention Features
• Train a Parametric Model to predict intention
• Evaluate, test, cross-validate the model
• Create a mockup application
• Extend the model to induce preservation
• Finish Writing the Dissertation

PhD Defense

Current State

Page 33: Tpdl Doctoral consortium 2012

Dissertation Plan

BEGIN

• Read Literature
• Collect Datasets
• Analyze Archives Coverage
• Analyze Shortened URIs
• Prototype Application
• Analyze Shared Resources Persistence and Coverage
• Analyze Contextual Intention
• Create Intention-based dataset
• Extract Intention Features
• Train a Parametric Model to predict intention
• Evaluate, test, cross-validate the model
• Create a mockup application
• Extend the model to induce preservation
• Finish Writing the Dissertation

PhD Defense

Page 34: Tpdl Doctoral consortium 2012

Estimating Web Archiving Coverage

• Goal: Estimate how much of the public web is present in the public archives, and how many copies are available.

• Action:
  – Get 4 different datasets from 4 different sources:
    • Search engine indices
    • Bit.ly
    • DMOZ
    • Delicious

• Results: *

• Publications:
  – How much of the web is archived? JCDL '11

* Table courtesy of Ahmed AlSum, JCDL 2011

Page 35: Tpdl Doctoral consortium 2012

Dissertation Plan

BEGIN

• Read Literature
• Collect Datasets
• Analyze Archives Coverage
• Analyze Shortened URIs
• Prototype Application
• Analyze Shared Resources Persistence and Coverage
• Analyze Contextual Intention
• Create Intention-based dataset
• Extract Intention Features
• Train a Parametric Model to predict intention
• Evaluate, test, cross-validate the model
• Create a mockup application
• Extend the model to induce preservation
• Finish Writing the Dissertation

PhD Defense

Page 36: Tpdl Doctoral consortium 2012

Shortened URI Analysis

• Goal: Better understand URI shortening and resolving, the effect of time on this process, and the correlation between a page’s features and characteristics and its resolution.

• Action:
  – Fresh Bit.lys
  – Get hourly click logs, rate of change, social networking spread, and other contextual information
  – Longitudinal study

• Evaluation:
  – Compare results with the frequency-of-change analysis of Cho and Garcia-Molina.
  – Compare results with Antoniades et al., WWW 2011.

Page 37: Tpdl Doctoral consortium 2012

Dissertation Plan

BEGIN

• Read Literature
• Collect Datasets
• Analyze Archives Coverage
• Analyze Shortened URIs
• Prototype Application
• Analyze Shared Resources Persistence and Coverage
• Analyze Contextual Intention
• Create Intention-based dataset
• Extract Intention Features
• Train a Parametric Model to predict intention
• Evaluate, test, cross-validate the model
• Create a mockup application
• Extend the model to induce preservation
• Finish Writing the Dissertation

PhD Defense

Page 38: Tpdl Doctoral consortium 2012

Estimating Loss of Shared Resources in Social Media

• Goal: Estimate how much of the public web is present in the public archives, and how many copies are available.

• Action:
  – Sampling from 6 public events
  – Events spanning 3 years
  – Existence in the current web
  – Existence in the public archives
  – Find relation with time

• Results:
  – After the 1st year, ~11% will be lost
  – After that, we will continue losing 0.02% daily

• Publications:
  – A year after the Egyptian revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  – Losing my revolution: How Many Resources Shared on Social Media Have Been Lost? TPDL '12
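The two reported figures (about 11% lost after the first year, then 0.02% per day) can be turned into a rough estimator. This is a sketch under stated assumptions: it reads the daily rate as percentage points added linearly, and it does not model the first year, for which the slide gives only the end value. The function name is hypothetical.

```python
def estimated_loss_percent(days_since_sharing: int) -> float:
    """Rough percentage of shared resources lost after a given number of days.

    Assumes ~11% lost at day 365 and 0.02 percentage points per day after
    that (a linear reading of the slide's figures).
    """
    if days_since_sharing < 365:
        # The slide gives no rate inside the first year, so none is guessed.
        raise ValueError("estimate is only defined from day 365 onward")
    return 11.0 + 0.02 * (days_since_sharing - 365)


print(estimated_loss_percent(365))      # 11.0
print(estimated_loss_percent(2 * 365))  # roughly 18.3 under these assumptions
```

Under this reading, each additional year after the first adds about 7.3 percentage points of loss.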

Page 39: Tpdl Doctoral consortium 2012

Dissertation Plan

BEGIN

• Read Literature
• Collect Datasets
• Analyze Archives Coverage
• Analyze Shortened URIs
• Prototype Application
• Analyze Shared Resources Persistence and Coverage
• User Intention Analysis
• Create Intention-based dataset
• Extract Intention Features
• Train a Parametric Model to predict intention
• Evaluate, test, cross-validate the model
• Create a mockup application
• Extend the model to induce preservation
• Finish Writing the Dissertation

PhD Defense

Page 40: Tpdl Doctoral consortium 2012

User Intention Analysis

• Goal: Better understand user intention and the factors that affect it. Also create a new testing and training set.

• Action:
  – Get a sample set of tweets selected at random
  – Extract the URIs
  – Get the closest Memento
  – Download the snapshot and the current version
  – Use Amazon’s Mechanical Turk to choose the best version

• Evaluation:
  – Measure cross-rater agreement and confidence.
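One way the evaluation step could be carried out is sketched below. The slide does not name an agreement statistic, so Fleiss' kappa is used here purely as an assumption, together with a simple majority vote per tweet. The labels "archived" and "current" and the sample ratings are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_label(labels: Sequence[str]) -> str:
    """Most frequent label chosen by the raters for one item."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa for items each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        return 1.0  # only one category was ever used, so agreement is total
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical Mechanical Turk answers: three raters per tweet.
ratings = [
    ["archived", "archived", "archived"],
    ["current", "current", "archived"],
    ["current", "current", "current"],
]
print([majority_label(item) for item in ratings])
print(fleiss_kappa(ratings))
```

A kappa near 1 indicates strong agreement between raters, while values near 0 suggest agreement no better than chance.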

Page 41: Tpdl Doctoral consortium 2012

Proposed Work

• Data Gathering
• Feature Extraction
• Modeling the intention engine
• Evaluation
• Application: Prediction and Preservation

Page 42: Tpdl Doctoral consortium 2012

Possible Solution for Jenny


Page 43: Tpdl Doctoral consortium 2012

Possible Solution for Jenny

The resource has changed since the last time it was shared. Do you wish to see the version the author intended or the current version?

Current Version Intended Version
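The decision to show this prompt could be driven by a simple change check between the archived copy and the live copy. The sketch below assumes a byte-exact comparison via SHA-256, which is a simplification (real pages often differ in ads or timestamps without changing meaning); the function names are hypothetical and not part of the proposed system.

```python
import hashlib
from typing import Optional

PROMPT = (
    "The resource has changed since the last time it was shared. "
    "Do you wish to see the version the author intended or the current version?"
)


def content_digest(content: bytes) -> str:
    """SHA-256 hex digest of a page's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def prompt_if_changed(archived: bytes, current: bytes) -> Optional[str]:
    """Return the prompt text when the two versions differ, else None."""
    if content_digest(archived) != content_digest(current):
        return PROMPT
    return None


# Made-up page contents standing in for the archived and live versions.
print(prompt_if_changed(b"<h1>Michael Jackson dies</h1>", b"<h1>Other news</h1>"))
print(prompt_if_changed(b"<h1>Same page</h1>", b"<h1>Same page</h1>"))
```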


Page 44: Tpdl Doctoral consortium 2012

Current Version

Archived Version

Proposed Framework

Feature Extraction Classifier

Example Features:
- Tweet Content
- Click Logs
- Other Tweets
- Shared Resource
- Timemaps

Page 45: Tpdl Doctoral consortium 2012
Page 46: Tpdl Doctoral consortium 2012

Extra Slides


Page 47: Tpdl Doctoral consortium 2012

Archive Shortener Application


Page 48: Tpdl Doctoral consortium 2012

Estimating Shared Resources Loss in Social Media


Page 49: Tpdl Doctoral consortium 2012

Estimating Shared Resources Loss in Social Media


Page 50: Tpdl Doctoral consortium 2012

My Publications

• S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, pages 133–136, 2011.

• H. SalahEldeen and M. L. Nelson. Losing my revolution: How much social media content has been lost? Accepted at TPDL 2012.

• H. SalahEldeen and M. L. Nelson. Losing my revolution: A year after the Egyptian revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

Page 51: Tpdl Doctoral consortium 2012

References

• D. Antoniades, I. Polakis, G. Kontaxis, E. Athanasopoulos, S. Ioannidis, E. P. Markatos, and T. Karagiannis. we.b: the web of short urls. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 715–724, New York, NY, USA, 2011. ACM.

• A. Ashkan, C. L. Clarke, E. Agichtein, and Q. Guo. Classifying and characterizing query intent. In Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 578–586, Berlin, Heidelberg, 2009. Springer-Verlag.

• L. Azzopardi and M. de Rijke. Query intention acquisition: A case study on automatically inferring structured queries. In Proceedings DIR-2006, 2006.

• R. Baeza-Yates, L. Calderon-Benavides, and C. Gonzalez-Caro. The intention behind web queries. In F. Crestani, P. Ferragina, and M. Sanderson, editors, String Processing and Information Retrieval, volume 4209 of Lecture Notes in Computer Science, pages 98–109. Springer Berlin / Heidelberg, 2006. 10.1007/11880561_9.

• A. Benczur, I. Bro, K. Csalogany, and T. Sarlos. Web spam detection via commercial intent analysis. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, AIRWeb '07, pages 89–92, New York, NY, USA, 2007. ACM.

• J. Bollen, H. Mao, and X.-J. Zeng. Twitter mood predicts the stock market. CoRR, abs/1010.3003, 2010.

• N. Dai, X. Qi, and B. D. Davison. Bridging link and query intent to enhance web search. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia, HT '11, pages 17–26, New York, NY, USA, 2011. ACM.

• N. Dai, X. Qi, and B. D. Davison. Enhancing web search with entity intent. In Proceedings of the 20th international conference companion on World wide web, WWW '11, pages 29–30, New York, NY, USA, 2011. ACM.

• K. Durant and M. Smith. Predicting the political sentiment of web log posts using supervised machine learning techniques coupled with feature selection. In O. Nasraoui, M. Spiliopoulou, J. Srivastava, B. Mobasher, and B. Masand, editors, Advances in Web Mining and Web Usage Analysis, volume 4811 of Lecture Notes in Computer Science, pages 187–206. Springer Berlin / Heidelberg, 2007. 10.1007/978-3-540-77485-3_11.

Page 52: Tpdl Doctoral consortium 2012

References

• Q. Guo and E. Agichtein. Ready to buy or just browsing?: detecting web searcher goals from interaction data. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pages 130–137, New York, NY, USA, 2010. ACM.

• A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, WebKDD/SNA-KDD '07, pages 56–65, New York, NY, USA, 2007. ACM.

• H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, WWW '10, pages 591–600, New York, NY, USA, 2010. ACM.

• C.-H. L. Lee and A. Liu. Modeling the query intention with goals. In Proceedings of the 19th International Conference on Advanced Information Networking and Applications - Volume 2, AINA '05, pages 535–540, Washington, DC, USA, 2005. IEEE Computer Society.

• A. Loser, W. M. Barczynski, and F. Brauer. What's the intention behind your query? a few observations from a large developer community. In IRSW, 2008.

• F. McCown, N. Diawara, and M. L. Nelson. Factors affecting website reconstruction from the web infrastructure. In JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 39–48, 2007.

• B. Meeder, B. Karrer, A. Sayedi, R. Ravi, C. Borgs, and J. Chayes. We know who you followed last summer: inferring social link creation times in twitter. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 517–526, New York, NY, USA, 2011. ACM.

• G. Mishne. Predicting movie sales from blogger sentiment. In AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW), 2006.

• M. L. Nelson and B. D. Allen. Object persistence and availability in digital libraries. D-Lib Magazine, 8(1), 2002.

• R. Sanderson, M. Phillips, and H. Van de Sompel. Analyzing the persistence of referenced web resources with memento. CoRR, abs/1105.3459, 2011.

• H. Van de Sompel, M. L. Nelson, R. Sanderson, L. Balakireva, S. Ainsworth, and H. Shankar. Memento: Time travel for the web. CoRR, abs/0911.1112, 2009.

• S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on twitter. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 705–714, New York, NY, USA, 2011. ACM.