Using Data Science on Internet Search Behavior as a Proxy for Human Behavior

Juan Miguel Lavista

Post on 15-Mar-2016



Agenda

Context

Problem definition

Examples

Summary

Context

17,293,822,600,000,000,000 bytes [1]

15 exabytes = 1.5 million times the size of all books in the Library of Congress [2]

[1] Rick Smolan, Jennifer Erwitt. The Human Face of Big Data, 2012. ISBN-10: 1454908270

[2] Peter Lyman, Hal R. Varian (2000-10-18). "How Much Information?"
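The byte count and the 15-exabyte figure above agree if an exabyte is taken as 2^60 bytes. That reading is an inference, not something the slide states. A quick check of the arithmetic:

```python
# Convert the byte count on the slide into exabytes, using both the
# binary (2**60) and the decimal (10**18) definition of the unit.
total_bytes = 17_293_822_600_000_000_000

binary_exabytes = total_bytes / 2**60    # exbibytes (EiB)
decimal_exabytes = total_bytes / 10**18  # exabytes (EB)

print(f"{binary_exabytes:.2f} EiB")
print(f"{decimal_exabytes:.2f} EB")
```

With binary units the result rounds to 15, which matches the slide; with decimal units it is a little over 17.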

Cost of storage of every single book ever written (~130 million books [4])

1984: US$1 billion [3]

2014: US$3,000

[3] Matthew Komorowski. A history of storage cost, 2009

[4] Wallace Yovetich. There are 130 Million Books in the World, How Many Have You Read?, 2009

Cost of processing power [6]

1996: ASCI Red supercomputer (6,000 Pentium Pro processors), $67,000,000

2014: Xbox One, $399

[6] The history of supercomputers, Sebastian Anthony, 2012

Concepts

Research

Information is only useful if it's accessible…

1989 – Tim Berners-Lee writes his initial proposal for the web

August 1991 – First website from CERN goes online, including the first index

Circa 1992 – Index discontinued

All 29 websites!

Web – circa 1992

“If you notice something incorrect or have any comment which you don't think is a FAQ, feel free to mail me”

Phone +1 (617)253 5702, fax +1 (617)258 8682, email: timbl@w3.org

History behind

http://www.

www.cern.ch info.cern.ch

The web started growing, and there was a need to search it

Archie, circa 1990

By Alan Emtage and Peter J. Deutsch. It simply contacted a list of FTP archives on a regular basis and stored their file listings locally

Search functionality used the Unix grep command
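The idea of a grep-style search over locally stored listings can be sketched in a few lines. This is only an illustration of the approach, not Archie's actual code; the host names, paths, and the `search` function are made up for the example.

```python
import re

# Hypothetical local copy of file listings gathered from FTP archives.
# Host names and paths are invented for illustration.
LISTINGS = {
    "ftp.example.org": ["/pub/gnu/emacs-18.59.tar.Z", "/pub/docs/README"],
    "ftp.example.edu": ["/mirrors/x11/xterm.tar.Z", "/pub/gnu/gcc-1.40.tar.Z"],
}

def search(pattern, listings=LISTINGS):
    """Return (host, path) pairs whose path matches the regular expression."""
    regex = re.compile(pattern)
    return [
        (host, path)
        for host, paths in listings.items()
        for path in paths
        if regex.search(path)
    ]

for host, path in search(r"gnu/.*\.tar"):
    print(host, path)
```

The search never touches the remote servers at query time; it only scans the local copy, which is why the listings have to be refreshed on a regular basis.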

24 Years Later…

2 trillion queries per year

2.8 billion Users

Indexable web is ~ 40 trillion pages

A couple of weeks to read..

5700 web pages per person

This is just 1 search (we make 2 trillion searches per year)

A lot more time to complete a search…


Problem definition

Search focus: relevance and performance

What can we learn from what people are searching for?


Examples

Breaking News

Drug Interactions

Wake up time

Seasonal Flu

Breaking News Detection

Daily traffic follows a very stable pattern

We build a model to predict query volume on a per-minute basis

In the absence of rare events, query volume during the day can be predicted very accurately

The model works, with some variation, at the country, state, or city level

We compare actual daily traffic against the prediction and measure how much they deviate.

Anomaly detection Problem

Z-Score +7

Spike Location: Boston
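The comparison of actual against predicted traffic can be sketched as a z-score on the prediction error. This is a minimal sketch under stated assumptions: the talk does not describe the forecasting model or the threshold, so the predicted values, the baseline residuals, the threshold of 3, and the function names below are all illustrative.

```python
import statistics

def z_scores(actual, predicted, baseline_residuals):
    """Z-score of each (actual - predicted) residual against a baseline.

    baseline_residuals: residuals from ordinary periods, used to estimate
    the typical size of the prediction error.
    """
    mean = statistics.mean(baseline_residuals)
    stdev = statistics.stdev(baseline_residuals)
    return [((a - p) - mean) / stdev for a, p in zip(actual, predicted)]

def find_spikes(actual, predicted, baseline_residuals, threshold=3.0):
    """Indices (minutes) whose z-score exceeds the threshold."""
    scores = z_scores(actual, predicted, baseline_residuals)
    return [i for i, z in enumerate(scores) if z > threshold]

# Made-up numbers for illustration only.
baseline = [-10, 5, 0, 8, -3, 2, -6, 4]     # residuals on normal days
predicted = [1000, 1010, 1020, 1030, 1040]  # model output per minute
actual = [1003, 1006, 1024, 1090, 1045]     # observed queries per minute

print(find_spikes(actual, predicted, baseline))
```

Running the same check separately per country, state, or city is what would allow a spike to be localized, as in the Boston example above.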

Wake-up time

Methodology

We calculated the time at which each metro area reaches 50% of its daily peak traffic, in its local time zone. The 25 cities follow the same general curve across all seven days of the week. While the patterns are the same, we saw a 43-minute shift between the earliest and the latest risers.
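The 50%-of-peak rule above can be sketched as follows. The traffic series is synthetic, and the assumption that the series starts at local midnight with traffic already below the threshold is mine, not the talk's; `wake_up_minute` and `as_clock` are hypothetical helper names.

```python
def wake_up_minute(traffic_per_minute, fraction=0.5):
    """First minute at which traffic reaches `fraction` of the daily peak.

    traffic_per_minute: query counts indexed by minute of the local day.
    Assumes index 0 is local midnight and that overnight traffic is
    already below the threshold at that point.
    """
    if not traffic_per_minute:
        return None
    threshold = fraction * max(traffic_per_minute)
    for minute, volume in enumerate(traffic_per_minute):
        if volume >= threshold:
            return minute

def as_clock(minute):
    """Format a minute-of-day index as H:MM."""
    return f"{minute // 60}:{minute % 60:02d}"

# Synthetic day: flat overnight, linear morning ramp, daytime plateau.
traffic = (
    [100] * 360                                        # 0:00-5:59
    + [100 + 5 * (m - 360) for m in range(360, 540)]   # 6:00-8:59
    + [1000] * 900                                     # 9:00-23:59
)

print(as_clock(wake_up_minute(traffic)))
```

Because each metro area is measured against its own peak and in its own time zone, the resulting times are comparable across cities of very different sizes.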

Figure: wake-up time by metro area. Values shown: 6:43, 6:55, 7:10, 7:15, 7:28, 7:32. City label shown: San Francisco.

Wake-up time during the week

At what time do we wake up during the week?

Figure: wake-up time by weekday (Monday to Friday). Values shown: 7:06, 7:10, 7:01, 6:48, 7:05.

Detecting Seasonal Influenza Using Search Logs

Epidemics of seasonal influenza are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year

Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza

Polgreen, P. M., Chen, Y., Pennock, D. M. & Forrest, N. D. Using internet searches for influenza surveillance. Clinical Infectious Diseases 47, 1443–1448 (2008)

Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski & Larry Brilliant. Detecting influenza epidemics using search engine query data

How does it work?

CDC publishes national and regional data from these surveillance systems on a weekly basis, typically with a 1-2 week reporting lag


Controversy


Signal is definitely relevant

Model can be improved

“All models are wrong but some are useful” (George Box)

This is NOT a failure for Big Data

We need to be careful of [all data] [no-science] approaches

Article by Chris Anderson , Wired Magazine, 2008 [13]

“… faced with massive data, this approach to science —hypothesis, model, test — is becoming obsolete”

“The new availability of huge amounts of data [...] offers a whole new way of understanding the world. Correlation supersedes causation”

“There is now a better way. Petabytes allow us to say: Correlation is enough.”

“With enough data, the numbers speak for themselves.”

[13] http://edge.org/3rd_culture/anderson08/anderson08_index.html

All data, no science? Discussion

All-data, no-science approach

0.81 correlation between Flu Trends and gun-related queries.

0.82 correlation between CDC flu data and Les Misérables-related queries

“Torture the data enough and it will confess..” (Ronald Coase)

Fooled by randomness
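Why such correlations turn up can be illustrated with pure noise: if enough unrelated series are screened against a target, some will correlate strongly by chance. The sketch below is an illustration only; the series length, the number of candidates, and the seed are arbitrary choices, and real query series are seasonal, which tends to make the effect stronger.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_spurious_correlation(n_points=20, n_candidates=5000, seed=0):
    """Highest correlation between a random target and unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, pearson(target, candidate))
    return best

print(f"best correlation among unrelated series: {best_spurious_correlation():.2f}")
```

None of the candidate series has any relationship to the target, yet the best match looks convincing. A hypothesis about why a query should track the flu is what separates a useful signal from this kind of artifact.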


Detecting Adverse Drug Interactions

Context: Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market.

In the US alone, adverse drug events cause thousands of deaths annually, and their associated medical treatment costs billions of dollars

Testing the impact of a drug by the FDA

For each drug, the FDA does a randomized controlled experiment before releasing it in order to understand the impact of the drug

What are interactions?

Drug A: OK

Drug B: OK

Drug A + Drug B: not OK

Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz. Web-scale pharmacovigilance: listening to signals from the crowd

Hypothesized: Internet users may provide early clues about adverse drug events via their online information-seeking


Test case scenario

Paroxetine (an antidepressant)

Pravastatin (a cholesterol-lowering drug)

The interaction between the two was reported to cause hyperglycemia

Hyperglycemia, or high blood sugar, is a condition in which an excessive amount of glucose circulates in the blood plasma.

Methodology

Method: By examining the words used in user queries (using logs from 2010), they sought evidence that people who searched for both pravastatin and paroxetine over time were more likely to also search for hyperglycemia-associated words than people who searched for only one of the drugs
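The comparison described above can be sketched as a per-user grouping followed by a rate calculation. This is a simplified illustration, not the study's pipeline: the term list, the log format of `(user_id, query)` pairs, and the `rate_by_group` function are assumptions made for the example, and the sample log is invented.

```python
from collections import defaultdict

# Hypothetical term list; the study used its own set of symptom terms.
HYPERGLYCEMIA_TERMS = {"hyperglycemia", "high blood sugar", "blurry vision"}

def rate_by_group(query_log):
    """Share of users in each group with a hyperglycemia-related query.

    query_log: iterable of (user_id, query) pairs.
    Groups: 'both', 'paroxetine only', 'pravastatin only'.
    """
    queries = defaultdict(list)
    for user, query in query_log:
        queries[user].append(query.lower())

    totals, hits = defaultdict(int), defaultdict(int)
    for user_queries in queries.values():
        text = " | ".join(user_queries)
        has_parox = "paroxetine" in text
        has_prava = "pravastatin" in text
        if has_parox and has_prava:
            group = "both"
        elif has_parox:
            group = "paroxetine only"
        elif has_prava:
            group = "pravastatin only"
        else:
            continue  # user searched for neither drug
        totals[group] += 1
        if any(term in text for term in HYPERGLYCEMIA_TERMS):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Invented sample log for illustration only.
log = [
    ("u1", "paroxetine side effects"), ("u1", "pravastatin dosage"),
    ("u1", "high blood sugar symptoms"),
    ("u2", "paroxetine"), ("u2", "pravastatin and grapefruit"),
    ("u3", "paroxetine withdrawal"),
    ("u4", "pravastatin"), ("u4", "hyperglycemia"),
    ("u5", "pravastatin cost"),
    ("u6", "weather"),
]
print(rate_by_group(log))
```

A higher rate in the "both" group than in either single-drug group is the kind of signal the hypothesis predicts; with real logs the difference would also need a significance test.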


Results

The figure shows that people who search for both paroxetine and pravastatin over the 12-month period are more likely to perform searches on the terms associated with hyperglycemia

The study shows that signals concerning drug interactions can be mined directly from search logs and confirms the findings of laboratory studies as well as prior known associations.



Summary


Search logs are a very powerful data set that can be used not only to improve the relevance of search results, but also as a unique data source to solve other problems. This is only a small subset of problems; we believe it is the tip of the iceberg of this data source's potential

We live in an amazing era, and it is too soon to realize how big the impact of the web on humankind is.

We are living in this era.

Too soon to realize how big the impact of the internet on humankind is.

We are at an inflection point in the history of the world.

Thank you!

@BDataScientist

jlavista@microsoft.com