Zen & the art of data mining

  • date post

    23-Aug-2014

description

A talk I gave at Old Dominion University to new students from PES University in Bangalore

Transcript of Zen & the art of data mining

  • Zen & the Art of Data Mining: Social Media Data Collection and the Path to Modeling & Predicting User Intention. Hany SalahEldeen Khalil (hany@cs.odu.edu), Web Science & Digital Libraries Lab, Department of Computer Science, Old Dominion University. 07-08-14
  • Before we start, here is a little bit about me
  • Education: PhD candidate, Web Science and Digital Libraries Group; Master's degree in Computer Vision and Artificial Intelligence, Universitat Autònoma de Barcelona; Bachelor's degree in Computer Systems Engineering, University of Alexandria
  • Research & Technical Experience: Microsoft Research, Cairo; Google GmbH, Zurich; Microsoft Inc., Mountain View; National University of Singapore
  • So what am I investigating? Detecting, Modeling, & Predicting User Temporal Intention in Social Media: Web Mining, Pattern Analysis, Machine Learning, Human Behavioral Analysis, Social Media Analysis
  • Publications: CIKM 2014 Conference, Shanghai (1 first-author paper, 1 second-author paper); DL 2014 Conference, London (1 third-author paper); TPDL 2013 Conference, Malta (1 first-author paper)
  • Publications: JCDL 2013 Conference, Indianapolis (1 first-author paper); WWW 2013 Conference, Rio de Janeiro (1 first-author paper); TPDL 2012 Conference, Cyprus (1 first-author paper)
  • Besides the perks of travelling, our research has been popular
  • MIT Technology Review
  • Mashable
  • Popular Mechanics
  • BBC
  • The Virginian-Pilot
  • Our Research's Popularity: local newspaper (The Virginian-Pilot), 4 x MIT Technology Review, BBC, Mashable, The Atlantic, Yahoo News; articles in > 11 different languages. We have been called: The Internet Archeologists, Web Time Travelers
  • My goal: Detect, model, and predict user intention in social media
  • OK, hold on, let's go back to the basics
  • Web 2.0 Definition: Web 2.0 is a concept that takes the network as a platform for information sharing, interoperability, user-centered design, and collaboration on the World Wide Web.* (* http://en.wikipedia.org/wiki/Web_2.0)
  • Web 2.0: Yes, Web 2.0 is about user-generated content. But explicit content contributed by users is just 20% of what matters; 80% is in the implicitly contributed data.* (* Toby Segaran, Programming Collective Intelligence, 2007)
  • Systems & Web 2.0: Google uses PageRank, a technique for extracting intelligence from the link structure; Flickr uses an interestingness algorithm; Amazon uses a "people who bought this product also bought" feature; Pandora uses similar-artist radio; eBay uses a reputation system
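
The idea of extracting intelligence from link structure can be illustrated with a simplified PageRank power iteration. This is only a minimal sketch of the textbook formulation, not Google's production system; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy graph in the usage note are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For a toy graph such as `pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})`, page `c` ends up with the highest score because two pages link to it, while `b`, which nobody links to, ends up with the lowest.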
  • So why do we even care about all that?
  • Power to the People! Because analyzing a huge dataset of millions of users will yield a lot of potential insights into user experience, marketing, personal taste, and human behavior in general.
  • So what is Data Mining?
  • Data Mining Definition: It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. (http://en.wikipedia.org/wiki/Data_mining)
  • Back to my goal: Detecting, Modeling, & Predicting User Temporal Intention in Social Media
  • Let's break down the title first: Detecting, Modeling, & Predicting User Temporal Intention in Social Media
  • Scenario 1: Jenny reading Jeff's tweets
  • Michael Jackson dies. Snapshot on: June 25th 2009 http://web.archive.org/web/20090625232522/http://www.cnn.com/
  • Jeff tweets about it. Published on: June 25th 2009 https://twitter.com/mdnitehk/status/2333993907
  • Jenny is off the grid: Jeff's friend Jenny was on vacation in Hawaii for a month
  • Jenny starts catching up a month later: when she came back she checked Jeff's tweets and was shocked! Read on: July 26th 2009 https://twitter.com/mdnitehk/status/2333993907
  • Jenny follows the link on July 26th: she quickly clicked on the link in the tweet. CNN page on: July 26th 2009 http://web.archive.org/web/20090726234411/http://www.cnn.com/
  • Jenny is confused! Implication: Jenny thought Jeff was making a joke about her favorite singer and she got mad at him. Problem: The tweet and the resource the tweet links to have become unsynchronized.
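
The two archived CNN URLs in this scenario carry their capture time as a 14-digit `YYYYMMDDhhmmss` timestamp, so the gap between what Jeff saw and what Jenny saw can be measured directly from the URLs. The following is a minimal sketch under that assumption; the helper names `snapshot_time` and `drift_days` are hypothetical and are not part of any archive API.

```python
import re
from datetime import datetime

# Matches the 14-digit capture timestamp in a web.archive.org snapshot URL.
WAYBACK_TIMESTAMP = re.compile(r"web\.archive\.org/web/(\d{14})/")


def snapshot_time(wayback_url):
    """Return the capture time encoded in an archived-snapshot URL."""
    match = WAYBACK_TIMESTAMP.search(wayback_url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in: " + wayback_url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def drift_days(first_url, second_url):
    """Whole days between the capture times of two snapshots."""
    return abs(snapshot_time(second_url) - snapshot_time(first_url)).days


# The two snapshots from Scenario 1.
shared = "http://web.archive.org/web/20090625232522/http://www.cnn.com/"
read = "http://web.archive.org/web/20090726234411/http://www.cnn.com/"
print(drift_days(shared, read))  # 31
```

A gap of this size is the kind of signal that suggests the page Jenny reads may no longer match what Jeff intended to share.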
  • Scenario 2: The Egyptian Revolution
  • The Egyptian Revolution, Jan 2011
  • Reading about it on Storify.com a year later, in March 2012 http://storify.com/maq4sure/egypts-revolution
  • I noticed some shared images are missing http://storify.com/maq4sure/egypts-revolution
  • Some tweets are still intact https://twitter.com/miss_amy_qb/status/32477898581483521
  • ... and some lost their meaning with the disappearance of the images. Missing? https://twitter.com/aishes/status/32485352102952960 https://twitter.com/omar_chaaban/status/32203697597452289
  • The tweet remains but the shared image has disappeared http://yfrog.com/h5923xrvbqqvgzj
  • Cairo, we have a problem! Implication: The reader cannot understand what the author of the tweet meant because the image is not available. Problem: The post is available but the linked resource (image) is completely missing.
  • Back to the title: Detecting, Modeling, & Predicting User Temporal Intention in Social Media
  • The Anatomy of a Tweet: author's username, other user mention, tweet body, hashtag, shortened URL to resource, publishing timestamp, social post, shared resource, interaction options
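
The parts named in the anatomy slide can be pulled out of raw tweet text with a few regular expressions. This is a rough sketch only: the patterns are simplified assumptions (they are not Twitter's own entity-extraction rules), and `TweetParts`, `parse_tweet`, and the sample tweet in the usage note are hypothetical names and data made up for illustration.

```python
import re
from dataclasses import dataclass

# Simplified patterns; real tweet entities follow stricter rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


@dataclass
class TweetParts:
    author: str      # author's username
    mentions: list   # other users mentioned in the body
    hashtags: list   # hashtags without the leading '#'
    urls: list       # (shortened) URLs pointing to shared resources
    body: str        # tweet text with the URLs removed


def parse_tweet(author, text):
    """Split raw tweet text into the parts listed on the anatomy slide."""
    urls = URL.findall(text)
    # Remove URLs first so a '#' or '@' inside a link is not miscounted.
    without_urls = URL.sub("", text)
    return TweetParts(
        author=author,
        mentions=MENTION.findall(without_urls),
        hashtags=HASHTAG.findall(without_urls),
        urls=urls,
        body=" ".join(without_urls.split()),
    )
```

Calling `parse_tweet("jeff", "Sad news, @jenny: http://bit.ly/abc123 #music")` yields one mention (`jenny`), one hashtag (`music`), and one URL. The publishing timestamp and interaction options are not part of the text itself, so they are left out of this sketch.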
  • 3 URIs = 3 Chances to fail