DO NOT USE PUBLICLY Big Data and Data Science 101 Headline...


Page 1: DO NOT USE PUBLICLY Big Data and Data Science 101 Headline ...papor.ipower.com/wp-content/uploads/2014/12/BigData-Big-Data-an… · PowerPoint Presentation Author: Drew O'Brien Created


DO NOT USE PUBLICLY PRIOR TO 10/23/12 Big Data and Data Science 101

Todd Lipcon | Software Engineer
(with much credit due our Data Science team, in particular Josh Wills)
December 2013


Introductions

• Engineer at Cloudera
• I build software (Hadoop) for big data storage and analysis
• No background in survey research!
• Some background in statistics and machine learning
• Unlike last year’s PAPOR, this year I’m not going to fake it.


What’s a ‘Data Scientist?’


Another definition of a data scientist

• A person who mixes computer science, statistics, and data visualization to analyze sets of data
• Often “funny looking” sets of data
• More complex analyses than “slice and dice” summary statistics (e.g., machine learning)


The Humble Sales Dashboard


(an aside on pie charts)


Prepping for Hurricane Charley at Wal-Mart

Image credit: Sam Dundon

Another definition of a data scientist

• A person who mixes computer science, statistics, and data visualization to build analytical applications
• Rich visualizations
• Interactive analysis that lets the consumer explore the data themselves
• Things which make our lives better


A Case Study

Developing Analytical Applications


2012: The Predicting of the President


RealClearPolitics

• Simple Average of Polls
• Transparent
• Simple Interactions
• “what if” analysis on state output
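A minimal sketch of the "simple average of polls" idea with a "what if" override on state output. All states, margins, and vote counts below are hypothetical, and `tally` is an illustrative helper, not RealClearPolitics' actual method or code.

```python
from statistics import mean

# Hypothetical states: electoral votes and recent poll margins
# (candidate A minus candidate B, in percentage points).
STATES = {
    "State X": {"votes": 10, "margins": [2.0, -1.0, 5.0]},
    "State Y": {"votes": 20, "margins": [-3.0, -1.0]},
    "State Z": {"votes": 5, "margins": [1.0, 1.0, 4.0]},
}


def tally(states, overrides=None):
    """Sum electoral votes for A and B using the unweighted poll average.

    `overrides` maps a state name to "A" or "B" for what-if scenarios.
    An average margin of exactly zero is (arbitrarily) awarded to B.
    """
    overrides = overrides or {}
    totals = {"A": 0, "B": 0}
    for name, info in states.items():
        winner = overrides.get(name)
        if winner is None:
            winner = "A" if mean(info["margins"]) > 0 else "B"
        totals[winner] += info["votes"]
    return totals


print(tally(STATES))                    # baseline from the poll averages
print(tally(STATES, {"State Y": "A"}))  # "what if" State Y flips to A
```

Forcing one state to the other candidate and re-running the tally is the kind of state-level "what if" the slide refers to; the averaging itself stays fully transparent.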


FiveThirtyEight

• Complex Model
• Many factors (economic, correlations, etc.)
• Opaque
• Secret sauce
• Simple Interactions with a richer UI


Princeton Election Consortium

• Medians and Polynomials
• Transparent
• Rich Interactions
• “What if” a given poll has bias?
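A rough sketch of the two ingredients named above, assuming "medians" means taking the median poll margin per state and "polynomials" means multiplying one factor per state to get the distribution of electoral-vote totals. The function names, the uniform `bias` shift, and all numbers are illustrative assumptions, not the Princeton Election Consortium's published code.

```python
from statistics import median


def median_margin(margins, bias=0.0):
    """Median poll margin (in points) after shifting every poll by `bias`."""
    return median([m + bias for m in margins])


def ev_distribution(states):
    """Return dist where dist[k] is the probability of winning exactly k votes.

    `states` is a list of (electoral_votes, win_probability) pairs. Each state
    contributes a factor (1 - p) + p * x**votes; multiplying the factors
    yields the coefficients of the total distribution.
    """
    dist = [1.0]
    for votes, p in states:
        new = [0.0] * (len(dist) + votes)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)      # state lost: total unchanged
            new[k + votes] += prob * p    # state won: total grows by `votes`
        dist = new
    return dist


# Hypothetical poll margins for one state (candidate A minus candidate B).
margins = [4.0, 1.0, -2.0, 9.0, 2.0]
print(median_margin(margins))             # median of the raw polls
print(median_margin(margins, bias=-3.0))  # "what if" every poll leans 3 points toward A

# Hypothetical two-state race: 3 votes at 50% and 2 votes at 100%.
print(ev_distribution([(3, 0.5), (2, 1.0)]))
```

The median is less sensitive than the mean to a single outlying poll (the 9.0 above), and the `bias` argument is one simple way to ask the "what if" question from the slide.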


How Did They Do?


A Few of These, Because They’re Fun


Here’s the Rub: One Expert Beat Nate (Markos Moulitsas at DailyKos)


Index Funds, Hedge Funds, and Warren Buffett


A Brief Introduction to Big Data and Hadoop


Data Storage in 2001: Databases

• Structured (tabular) data sets
• Intensive processing done where data is stored (SQL)
• Somewhat reliable
• Expensive at scale


And Then, This Happened


Big Data Economics

• No individual record is particularly valuable
• Having every record is incredibly valuable
• Web index
• Recommendation systems
• Market basket analysis
• Online advertising


“In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.”

- Grace Hopper


Data Storage in 2013: Hadoop

• Stores any kind of data
• Many different in-situ processing engines (R, SAS, SQL, Search, etc.)
• Reliable
• Cheap, even at scale
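The "more systems of computers" idea can be sketched as counting locally on each partition and then merging the partial results. This is plain Python standing in for a cluster, not the Hadoop API; the partitions and event names are made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical event records, already split across three "machines".
partitions = [
    ["click", "view", "view"],
    ["view", "purchase"],
    ["click", "view"],
]


def map_partition(records):
    """Count records locally, where the data is stored."""
    return Counter(records)


def merge(a, b):
    """Combine two partial counts into one."""
    return a + b


# Each partition is processed independently, so this step could run in parallel.
partial_counts = [map_partition(p) for p in partitions]

# Only the small partial counts need to move between machines.
totals = reduce(merge, partial_counts, Counter())
print(totals)
```

The design point is that the heavy work happens next to the data, and adding more partitions (more "oxen") does not require a bigger single machine.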


What can you build with big data?


Adverse Drug Events


Medical record analytics


Durkheim Project


A Couple of Themes

1. Interactive applications, not just static “reports”
2. Integrate data from many sources, not just one.
3. Some amount of programming usually necessary, but you don’t always need a CS degree!
