Practical Data Scienceamitsome/datascience/201516/... · • The rules governing the conversation...

Practical Data Science Yossi Attas


About me

Yossi Attas – Principal R&D Manager

Microsoft Application Insights

[email protected]


About this course

Goals

1.  Understand the processes, methodology and tools of generating and communicating actionable insights on top of big data*

2.  Learn the practical use of telemetry data as a key enabler of application success

* To simplify the course, we reduced the volume of data to enable easier insight production with just a laptop


Agenda for today

1.  What is data science? (or why you should be here?)

2.  The data set for the course

3.  The project

4.  Homework


1. What is Data Science?


Data Science is hot…

•  “Data scientists are the new superheroes”

•  KPMG survey of C-level executives: “99% said analysis of big data was important to their strategy for next year”

•  McKinsey: “by 2018 U.S. alone may face a 50%-60% gap between supply and demand of deep analytic talent”


What happened?

•  Exponential growth of data generated and collected:

•  2.5 quintillion bytes of data are created daily

•  90% of the data in the world today has been created in the last two years alone.

•  Enterprise-generated data is expected to exceed 240 exabytes daily by 2020

•  A single connected car generates 25GB data per hour

•  Data storage prices are dropping

•  Over the last 30 years, space per unit cost has doubled roughly every 14 months

•  Affordable tools to analyze massive volumes of data
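The storage claim above can be turned into a quick calculation. This is only a back-of-the-envelope sketch: it assumes the 14-month doubling held steadily for the full 30 years, and the function name is made up for illustration.

```python
# Assumption: space per unit cost doubles every 14 months, steadily, for 30 years.
MONTHS = 30 * 12
DOUBLING_PERIOD_MONTHS = 14

def growth_factor(months, doubling_period):
    """How many times larger a quantity gets if it doubles every doubling_period months."""
    return 2 ** (months / doubling_period)

factor = growth_factor(MONTHS, DOUBLING_PERIOD_MONTHS)
print(f"Space per unit cost grew roughly {factor:,.0f}-fold over 30 years")
```

Under these assumptions the same budget buys tens of millions of times more storage than it did 30 years earlier.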


Storage prices are dropping

Hard drive storage prices are dropping

“Cloud wars” are driving cloud storage prices even further down


How big is Big Data?

•  Can you imagine a petabyte? Exabyte?

•  1 PB == if you counted one byte per second, it would take 35.7 million years

•  200 PB == the entire written works of mankind, from the beginning of history, in all languages

•  5 EB == all words ever spoken by human beings

•  And yet this is today’s reality:

•  Google processes 100 petabytes of data every single day

•  Stores 15 EB of data on 3 million servers

•  Microsoft’s Cosmos (internal analytical Map/Reduce system) – one of many systems

•  3 exabytes, 160K servers, 200K+ jobs/day

•  Not just Internet companies - Walmart processes 40PB / day
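The "one byte per second" figure can be checked with a few lines of arithmetic. The sketch below assumes a binary petabyte (2**50 bytes) and a 365-day year; those two choices are assumptions, not something the slide states, but they are what makes the result land near 35.7 million years.

```python
# Assumptions: 1 PB = 2**50 bytes (binary petabyte), 1 year = 365 days.
BYTES_PER_PB = 2 ** 50
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def years_to_count(num_bytes, bytes_per_second=1):
    """Years needed to count num_bytes at the given rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_YEAR

pb_years = years_to_count(BYTES_PER_PB)
print(f"1 PB at one byte per second: {pb_years / 1e6:.1f} million years")
```

With a decimal petabyte (10**15 bytes) the same calculation gives a somewhat smaller number, so the exact figure depends on which definition is used.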


Big Data Technologies

•  Can you master them all?

•  Hint: no-one can

•  …or needs to


What is Data Science?

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

•  HOW: Hacking skills (aka: master the technology)

•  WHY: Math & Statistical Knowledge (aka: correct interpretation of your findings)

•  Running an ML algorithm today is as easy as calling a function or pushing a button… but beware!

•  WHAT: Substantive expertise (aka: domain knowledge)


Goal: Actionable Insight

•  Insight == actionable, data-driven finding that creates business value

•  Metrics are easy, insights are hard

•  To be actionable, insight needs to be:

•  Useful/Valuable – clear business gain

•  Accessible – easily understood by relevant stakeholders

•  Non-trivial – “tell me something I don’t know”


Insight – Example (Walmart, 2012)

•  Improving page performance by 1-2 sec results in significantly better conversion (% of customers completing a purchase) – worth many millions of $$

http://www.webperformancetoday.com/2012/02/28/4-awesome-slides-showing-how-page-speed-correlates-to-business-metrics-at-walmart-com/


Getting to the Insight – Two approaches

1.  Top down, hypothesis-driven

•  Given the problem → formulate hypothesis → work with data to prove/disprove it

•  Example:

•  Problem: “Should we invest in improving performance”

•  Hypothesis: “Poor performance causes lower conversion”

•  Insight: previous slide

2.  Bottom up, data-driven

•  Given the data, explore it to find new insights

•  Example:


Getting to the Insight - Process

Formulate the hypothesis

Acquire the data

“Learn” the data

Cleanse the data

Produce the insight

Validate

Visualize / Communicate
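The steps above can be sketched end to end on a toy data set. Everything here is illustrative: the sessions are made up, the 2-second threshold is arbitrary, and the helper names are hypothetical. The hypothesis being probed is the one from the earlier example, that slow pages go together with lower conversion.

```python
# Acquire: hypothetical sessions as (page load time in seconds, purchase completed?)
sessions = [
    (1.2, True), (0.8, True), (1.9, False), (1.5, True),
    (3.4, False), (4.1, False), (2.8, True), (5.0, False),
    (None, False), (-1.0, True),  # broken measurements, to be cleansed
]

def cleanse(rows):
    """Keep only rows with a known, non-negative load time."""
    return [(t, c) for t, c in rows if t is not None and t >= 0]

def conversion_by_bucket(rows, threshold=2.0):
    """Conversion rate for fast (< threshold) and slow (>= threshold) page loads."""
    buckets = {"fast": [], "slow": []}
    for load_time, converted in rows:
        buckets["fast" if load_time < threshold else "slow"].append(converted)
    return {name: sum(flags) / len(flags) for name, flags in buckets.items() if flags}

# Cleanse, then produce the candidate insight
clean = cleanse(sessions)
rates = conversion_by_bucket(clean)

# Communicate: a two-line summary a stakeholder can read
for name, rate in rates.items():
    print(f"{name} pages: {rate:.0%} conversion")
```

A real validation step would also check that each bucket holds enough sessions to trust the difference; eight rows are only enough to show the shape of the workflow.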


Summary

A data scientist must:

1.  Focus on the right problem

2.  Get the data

3.  Produce the insight

4.  Communicate

In this course:

1.  You pick the problem (but we can help)

2.  We give you the data

3.  Use Python with ML packages; use Excel to explore the data

4.  Use Excel to visualize your findings


2. The Data Set


The problem

•  “By 2017, 94.5% of downloads will be for free apps; Less than 0.01% of consumer mobile apps will be considered a financial success”

-Gartner


Situation: building successful apps is hard

•  Fierce Competition: User retention requires constant improvements of apps and services

•  Constant Evolution: Web services & mobile apps need to evolve rapidly to survive & grow

•  Continuous Delivery: Most major services push updates as often as every day


What is telemetry data?

•  Telemetry data tracks the behavior of the application to establish

•  Operational KPIs

•  Availability

•  Performance

•  COGs

•  Business KPIs

•  Adoption

•  Engagement

•  Retention

•  Conversion funnels

requires instrumenting client and server code


What is Application Insights?

1.  Telemetry is collected at each tier, incl. browser and server-side

2.  Telemetry arrives in the Application Insights service in the cloud where it is processed & stored

3.  Get a 360° view of the application including availability, performance and usage patterns


Application Insights


Data set for this course

•  Requests – observed app behavior

•  Capture the details of HTTP request processed by web server, e.g.:

•  URL, success/failure (incl. response code), duration + info about device sending request

•  Can be used to understand: reliability of the site (how many requests succeed); performance of site or certain pages; volumes

•  PageView – observed user behavior

•  Capture the details of HTML page viewed by the user, e.g.:

•  URL, many details on the devices used (location, OS, browser, screen size, …)

•  Can be used to understand: usage patterns; audience segmentation

•  AJAX events

•  Capture the detailed interactions of the specific page with the server, both system and user originated

•  Exceptions

•  Also, all telemetry types are linked to a user and user session
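As a rough illustration of what request telemetry can answer, the sketch below computes volume, reliability and a typical duration per request name. The rows are made up; the field names loosely follow the request example in this deck, and the flat "duration" field (with no stated unit) is an assumption.

```python
from statistics import median

# Hypothetical request rows (not taken from the course data set)
requests = [
    {"name": "GET Home/Index", "responseCode": 200, "success": True, "duration": 120.0},
    {"name": "GET Home/Index", "responseCode": 200, "success": True, "duration": 180.0},
    {"name": "GET Cart/View", "responseCode": 500, "success": False, "duration": 950.0},
    {"name": "GET Cart/View", "responseCode": 200, "success": True, "duration": 310.0},
]

def summarize(rows):
    """Per request name: volume, share of successful requests, median duration."""
    by_name = {}
    for row in rows:
        by_name.setdefault(row["name"], []).append(row)
    summary = {}
    for name, group in by_name.items():
        summary[name] = {
            "volume": len(group),
            "success_rate": sum(r["success"] for r in group) / len(group),
            "median_duration": median(r["duration"] for r in group),
        }
    return summary

summary = summarize(requests)
for name, stats in summary.items():
    print(name, stats)
```

The median is used instead of the mean because a handful of very slow requests can otherwise dominate the picture.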


The simplest web application

http://www.site.com/index.html

Hello, world

Browser communicates with Web server over HTTP; fetches and renders HTML pages into beautiful screens

Web server accepts HTTP requests, performs business logic and serves (dynamic) HTML pages


An HTTP conversation

Client: I would like to open a connection

Server: OK

Client: GET http://www.site.com/index.html

Server: Send page or error message

Client: Display response

Client: Close connection

Server: OK

<!DOCTYPE html> <html> <body> <h1>My First Page</h1> <p>Hello, world!</p> </body> </html>
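To make the conversation concrete, here is a minimal sketch of the text a client sends and of how a reply is taken apart. No network is involved: the server reply is a canned string written for this illustration, and both helper functions are hypothetical names, not part of any library.

```python
def build_get_request(host, path):
    """Text of a minimal HTTP/1.1 GET request, as the client would send it."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Connection: close", "", ""]
    return "\r\n".join(lines)

def parse_response(raw):
    """Split a raw HTTP response into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()
    return int(code), headers, body

request_text = build_get_request("www.site.com", "/index.html")

# Canned reply standing in for the server (an assumption for illustration)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<!DOCTYPE html><html><body><p>Hello, world!</p></body></html>"
)
status, headers, body = parse_response(raw_response)
print(status, headers["content-type"])
```

The status code and the headers are exactly the kind of metadata that later shows up in request telemetry.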


HTTP vs HTML

Both were invented at the same time by the same person: Sir Tim Berners-Lee, 1989

•  HTTP: hypertext transfer protocol

•  The rules governing the conversation between a Web client and a Web server

•  How messages are formatted and transmitted; what actions web servers and browsers should take in response to various commands

•  HTML: hypertext markup language

•  Tag-based language for describing web pages

•  Instructs the browser how to render a page / what actions to perform on certain events


HTTP headers

Accept: text/html, application/xhtml+xml, image/jxr, */*
Accept-Encoding: gzip, deflate, peerdist
Accept-Language: en-US, en; q=0.7, he; q=0.3
Connection: Keep-Alive
Cookie: _ga=GA1.2.1161181038.1455475184; __gads=ID=63f97d8b6f522032:T=1455475184:S=ALNI_MbHACjONbrZtk3Et5JqdFyl_Lg9ow; _gat=1
Host: www.w3schools.com
Referer: http://www.w3schools.com/html/html_examples.asp
User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
X-P2P-PeerDist: Version=1.1
X-P2P-PeerDistEx: MinContentInformation=1.0, MaxContentInformation=1.0

•  A lot of information is passed in HTTP headers (metadata) – the primary source for telemetry
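A small sketch of how headers like these turn into telemetry fields. It parses a header block into a dictionary, ranks the languages in Accept-Language by their q weight, and makes a crude device guess from User-Agent. The helper names and the token list are assumptions for illustration; real user-agent parsing is far more involved.

```python
HEADER_BLOCK = """
Accept-Language: en-US, en; q=0.7, he; q=0.3
Host: www.w3schools.com
Referer: http://www.w3schools.com/html/html_examples.asp
User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
"""

def parse_headers(block):
    """Turn 'Name: value' lines into a dict (split on the first colon only)."""
    headers = {}
    for line in block.strip().splitlines():
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

def preferred_languages(accept_language):
    """Language tags ordered by q weight; a missing q means weight 1.0."""
    weighted = []
    for item in accept_language.split(","):
        tag, _, params = item.strip().partition(";")
        params = params.strip()
        weight = float(params[2:]) if params.startswith("q=") else 1.0
        weighted.append((weight, tag.strip()))
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return [tag for _, tag in weighted]

def device_hint(user_agent):
    """Very rough device guess based on well-known substrings."""
    for token in ("iPad", "iPhone", "Android", "Windows"):
        if token in user_agent:
            return token
    return "unknown"

headers = parse_headers(HEADER_BLOCK)
languages = preferred_languages(headers["Accept-Language"])
device = device_hint(headers["User-Agent"])
print(languages, device, headers["Referer"])
```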


AJAX

•  The “classical” web model: for every request you receive back a new HTML page and re-render the entire browser screen

•  Many web sites still work this way today…

•  But we (people) became impatient… we want higher interactivity

•  Example: typing into Google search box, you get instant suggestions…

•  AJAX == Asynchronous JavaScript And XML

•  Send / receive data from a server – asynchronously, in background

•  Still uses HTTP as underlying protocol


Example - Request telemetry

{
  "request": [{
    "id": "4251295413255663004",
    "name": "GET InsightsExtension/Index",
    "responseCode": 200,
    "success": true,
    "durationMetric": { "value": 39751.0 },
    "url": "https://stamp2.app.insightsportal.visualstudio.com/InsightsExtension",
    "urlData": {
      "protocol": "https",
      "host": "stamp2.app.insightsportal.visualstudio.com",
      "base": "/InsightsExtension"
    }
  }],
  …
  "context": {
    "data": { "eventTime": "2015-08-01T00:48:35.3821824Z" },
    "device": {
      "os": "Windows",
      "osVersion": "Windows 7",
      "browser": "Internet Explorer",
      "browserVersion": "Internet Explorer 9.0",
      "locale": "en-US",
      "userAgent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; AppInsights)"
    },
    "user": { "anonId": "us-il-ch1-t4t-edge" },
    "session": { "id": "3dd36639-bf4c-4271-9814-6be1ad1f31b8" },
    "location": {
      "continent": "North America",
      "country": "United States",
      "point": { "lat": 47.674, "lon": -122.1215 },
      "clientip": "0.46.14.57",
      "province": "Washington",
      "city": "Redmond"
    }
  }
}
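Nested records like this one are awkward to analyze directly, so a common first step is to flatten them into one row of columns. The sketch below does that for a trimmed copy of the request example; the dotted column names are a convention chosen here, not necessarily the one used in the course files.

```python
import json

# Trimmed copy of the request example (elided parts simply left out)
record_text = """
{
  "request": [{
    "name": "GET InsightsExtension/Index",
    "responseCode": 200,
    "success": true,
    "durationMetric": {"value": 39751.0}
  }],
  "context": {
    "data": {"eventTime": "2015-08-01T00:48:35.3821824Z"},
    "device": {"os": "Windows", "browser": "Internet Explorer"},
    "location": {"country": "United States", "city": "Redmond"}
  }
}
"""

def flatten(value, prefix=""):
    """Turn nested dicts and lists into a flat {dotted.path: scalar} mapping."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = value
    return flat

row = flatten(json.loads(record_text))
for column, value in row.items():
    print(f"{column} = {value}")
```

Each flattened key can then become a column header in a tabular file.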


Example - PageView telemetry

{
  "view": [{
    "urlData": {
      "host": "stamp2.app.insightsportal.visualstudio.com",
      "protocol": "https",
      "base": "/InsightsExtension"
    },
    "name": "AspNetOverview",
    "url": "https://stamp2.app.insightsportal.visualstudio.com/InsightsExtension?sessionId=a9689a39acdf4918996610ba31b2944f&extensionName=AppInsightsExtension&shellVersion=5.0.302.65%20(production%23ede3859.150729-2229)&traceStr=&l=en.en-us&trustedAuthority=portal.azure.com%3A"
  }],
  "context": {
    "device": {
      "os": "Windows",
      "osVersion": "Windows 10",
      "type": "PC",
      "browser": "Internet Explorer",
      "browserVersion": "Internet Explorer 12.10240",
      "screenResolution": { "value": "1707X960" },
      "locale": "en-us",
      "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
    },
    …
    "location": {
      "continent": "North America",
      "country": "United States",
      "point": { "lat": 27.7362, "lon": -82.6691 },
      "clientip": "0.185.245.49",
      "province": "Florida",
      "city": "Saint Petersburg"
    },
    "data": { "eventTime": "2015-08-01T00:57:10.489Z" },
    "user": { "anonId": "ac0cd123-b7c8-4658-b17d-7a0d4b552d1e" },
    "custom": {
      "dimensions": [
        { "Prod": "5.0.302.65 (production#ede3859.150729-2229)" },
        { "AppInsightsVersion": "1.0.5688.17982" }
      ]
    },
    "session": { "id": "FA79174B-2622-4FD7-945E-56F0E716D905" }
  }
}


Data set for this course

Good news: we simplified it for you! – smaller size, cleaner and in convenient shape

•  The data set is a representative sample of pageviews, requests and exceptions recorded on Application Insights site during Oct 2015 (full month)

•  Total of: XXX requests, YYY pageviews; ZZZ exceptions

•  Data set in Comma Separated Values (CSV) format

•  Separate file for each type, for each date
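With one file per type and per date, a first loading step could look like the sketch below. The file naming pattern ("requests_2015-10-01.csv") and the column names are assumptions made for this example; check the actual file names in the course data before relying on them. The demo writes two tiny made-up files to a temporary folder so that it runs on its own.

```python
import csv
import tempfile
from pathlib import Path

def load_rows(folder, telemetry_type):
    """Read every '<type>_<date>.csv' file in folder, tagging rows with the file's date."""
    rows = []
    for path in sorted(Path(folder).glob(f"{telemetry_type}_*.csv")):
        date = path.stem.split("_", 1)[1]
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["file_date"] = date
                rows.append(row)
    return rows

# Demo with two made-up files
with tempfile.TemporaryDirectory() as folder:
    for day, code in [("2015-10-01", "200"), ("2015-10-02", "500")]:
        Path(folder, f"requests_{day}.csv").write_text(
            "url,responseCode\n/index.html," + code + "\n", encoding="utf-8"
        )
    demo_rows = load_rows(folder, "requests")

print(len(demo_rows), demo_rows[0]["file_date"])
```

Note that csv.DictReader returns every value as a string, so numeric columns such as response codes need converting before any arithmetic.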


Telemetry data is…

… Exciting

•  Lots of rich information

•  Many possible questions and insights

•  High potential impact on site/app success

… Hard

•  Data may be complex and confusing; the previous examples were simplified :-)

•  High volumes of data

•  100M pageviews/month for a medium site; billions of requests

•  Frequently requires specialized tools / coding skills

•  Non-friendly format

•  Data is never as clean as you wish it to be


3. The Project


Pick a business goal / a problem

Area | Main business goals | Problems

Operational Intelligence | Keep the site up and running, with minimum downtime, | Detect the problems early (or even before they occur); isolate the problem efficiently

Operational Intelligence | …good performance | Detect performance degradation

Operational Intelligence | … and optimal cost | Optimize capacity to usage patterns

Customer Intelligence | Grow the customer base, | Discover customers that are about to leave (churn)

Customer Intelligence | optimize customer acquisition costs | Discover customer segments you should advertise to

Customer Intelligence | monetize better | Predict customer “stickiness” based on their first session

These are just examples!!!


Generate an insight

•  Formulate a hypothesis you are trying to prove

•  Make sure you understand the data and its semantics

•  Make sure you have sufficient data for the experiment

•  Consult us early when in doubt

•  Use Python/ML to produce the insight

•  Beware of the difference between correlation and causality
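The last point is easy to illustrate with numbers. In the made-up data below, daily traffic drives both slower pages and more errors, so page load time and error count end up strongly correlated even though, in this invented scenario, neither causes the other. The figures and the helper function are purely illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up daily figures: visits is the hidden common driver
daily_visits = [100, 200, 300, 400, 500]
page_load_sec = [1.0, 1.5, 2.1, 2.4, 3.0]
error_count = [2, 4, 7, 8, 10]

r = pearson(page_load_sec, error_count)
print(f"correlation(load time, errors) = {r:.2f}")
print(f"correlation(visits, load time) = {pearson(daily_visits, page_load_sec):.2f}")
```

A coefficient close to 1 only says the two series move together; deciding what causes what takes domain knowledge or a controlled experiment.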


“Selling” your insight

•  Your target audience is not data scientists

•  You need to present complex insight as…

•  Easy to understand

•  Using their language (business domain)

•  Visually appealing (people love beautiful graphs)

•  Be prepared to “go interactive” – every time you give an answer, more questions will follow

•  Ideally – something they can play with on their own

•  Excel is your best friend here


4. Homework


Homework

1.  Get access to the data, download to your PC

2.  Familiarize yourself with how data is organized

•  Can you find requests? Pageviews?

3. “Sniff it”

•  Open some of the files in text editor (Excel is even better)

•  Do you understand some of the columns? Most of the columns? [Don’t worry if you don’t for now]

•  If you want to dig more, read the docs about data structure, prepare questions when unclear
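If you prefer to "sniff" a file from Python rather than Excel, a quick column profile is often enough for a first look. The sketch below counts how many sampled rows have a value in each column and keeps a few example values. The column names and rows are made up, and the sample is held in memory so the snippet runs without the course files.

```python
import csv
import io

def profile_columns(handle, sample_size=1000):
    """Per column: number of sampled rows with a value, plus up to three distinct examples."""
    reader = csv.DictReader(handle)
    profile = {name: {"filled": 0, "examples": []} for name in reader.fieldnames}
    for index, row in enumerate(reader):
        if index >= sample_size:
            break
        for name in reader.fieldnames:
            value = row[name]
            if value:  # skips empty strings and missing trailing fields
                stats = profile[name]
                stats["filled"] += 1
                if value not in stats["examples"] and len(stats["examples"]) < 3:
                    stats["examples"].append(value)
    return profile

# Made-up sample standing in for one of the course files
sample = io.StringIO(
    "url,responseCode,city\n"
    "/index.html,200,Redmond\n"
    "/cart,500,\n"
    "/index.html,200,Haifa\n"
)
profile = profile_columns(sample)
for name, stats in profile.items():
    print(name, stats)
```

To run it on a real file, pass an open file handle (opened with newline="") instead of the in-memory sample.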


Questions?

•  Contact information


Some useful links

•  “14 definitions of a data scientist”: http://bigdata-madesimple.com/what-is-a-data-scientist-14-definitions-of-a-data-scientist/

•  “Doing data science @Twitter” : https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6

•  “Why so many fake data scientists?”: https://www.linkedin.com/pulse/why-so-many-fake-data-scientist-bernard-marr


Thank you


BACKUP


Who is a Data Scientist?

•  Evolution from business or data analyst role?

•  What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization.

•  “Part analyst, part artist”

•  Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It's almost like a Renaissance individual who really wants to learn and bring change to an organization.”

•  Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.


3B minutes of calls daily


Big Data Technologies