Joost ouwerkerk

Big Data big problems.

Transcript of Joost ouwerkerk

Page 1: Joost ouwerkerk

Big Data big problems.

Page 2: Joost ouwerkerk
Page 3: Joost ouwerkerk

What is Big Data?

•Volume
•Velocity
•Variety

Page 4: Joost ouwerkerk

Volume

Billions of Things:

•Posts, Tweets and Likes
•Web Transactions
•Sensor Readings

Page 5: Joost ouwerkerk

Velocity

Streaming Data:

•Twitter: 500,000,000 TPD
•Walmart: 20,000,000 TPD
•Hopper: 750,000,000 TPD

Page 6: Joost ouwerkerk

Variety

Integrating Many Sources of Data:

•Unstructured Web Content
•Semi-structured Logs
•Relational Databases
•Images, Video, Audio

Page 7: Joost ouwerkerk

So What’s Changed?

•Mobile devices
•Social Web
•Sensors, Metrics
•Digitization of everything

Page 8: Joost ouwerkerk
Page 9: Joost ouwerkerk
Page 10: Joost ouwerkerk
Page 11: Joost ouwerkerk
Page 12: Joost ouwerkerk

Open Source Tools

•Hadoop: distributed processing

•R: predictive analytics for big data

•Hive, Pig: ad-hoc analytics for Hadoop

•Mahout: machine learning for Hadoop

•HBase, Cassandra: distributed databases

•ElasticSearch: distributed search engine

•Storm: distributed processing for data streams

Page 13: Joost ouwerkerk
Page 14: Joost ouwerkerk

"The best minds of my generation are thinking about how to make people click ads"

- Jeff Hammerbacher (Facebook, Accel, Cloudera)

Page 15: Joost ouwerkerk

Big Minds + Big Data

•Aggregate, Summarize
•Detect Patterns
•Model, Simulate
•Forecast, Predict

Page 16: Joost ouwerkerk

Open Data

•Reports
•Request/Response APIs
•Small Data

Page 17: Joost ouwerkerk


Page 18: Joost ouwerkerk

Hack/reduce

Open Hackspace in Boston

Home for Pre-seed projects, Community events

Not-for-profit sponsored by local industry and government

Page 19: Joost ouwerkerk
Page 20: Joost ouwerkerk

Hack/reduce Cluster

240-core cluster sponsored by GoGrid, a cloud computing company.

Available for use at today’s Open Data Day.

Page 21: Joost ouwerkerk

What do you do with a 240-core Cluster?

Use the power of many machines to analyze Big Data sets.

Page 22: Joost ouwerkerk

How do you get computers to work together like that?

That’s what Hadoop is for.

Page 23: Joost ouwerkerk

An Example

Daily Hansard: transcripts of the Canadian Parliament since 1994

Swearwords.txt (http://www.bannedwordlist.com)

Who are the most foul-mouthed Federal MPs?
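A minimal sketch of how a question like this maps onto the MapReduce pattern that Hadoop implements. It runs in plain Python on one machine, so it only illustrates the map, shuffle, and reduce steps, not Hadoop itself. The word list, the speaker names, and the (speaker, statement) record layout are all hypothetical placeholders; the real Swearwords.txt and Hansard format are not reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for Swearwords.txt (mild placeholders only).
SWEARWORDS = {"darn", "heck"}

# Hypothetical records: (speaker, statement text).
STATEMENTS = [
    ("MP A", "Well, darn it, this bill is a mess."),
    ("MP B", "I thank the honourable member."),
    ("MP A", "What the heck? Darn!"),
]


def mapper(speaker, text):
    """Map step: emit (speaker, 1) for every swearword in one statement."""
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in SWEARWORDS:
            yield speaker, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(speaker, counts):
    """Reduce step: total the swearword count for one speaker."""
    return speaker, sum(counts)


def run_job(statements):
    """Run map, shuffle, and reduce, then rank speakers by total (highest first)."""
    pairs = [kv for speaker, text in statements for kv in mapper(speaker, text)]
    totals = dict(reducer(key, values) for key, values in shuffle(pairs).items())
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for speaker, total in run_job(STATEMENTS):
        print(speaker, total)
```

On a cluster, each mapper would see only its own slice of the transcripts, and the framework would route all pairs for a given speaker to the same reducer. Speakers with no matches emit nothing and so do not appear in the ranking.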

Page 24: Joost ouwerkerk
Page 25: Joost ouwerkerk
Page 26: Joost ouwerkerk

Results

•20 years of House of Commons statements

•511,341 Statements analyzed

•121,985,310 Words spoken

•3,839 Swearwords spoken

•1 in 133 statements has a swearword
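The "1 in 133" figure can be reproduced from the totals listed above. This is a small arithmetic check, assuming the ratio was computed as statements divided by swearwords (which treats every swearword as falling in a different statement). The words-per-swearword figure is an extra derived value that is not on the slide.

```python
# Totals as listed on the Results slide.
statements = 511_341
words = 121_985_310
swearwords = 3_839

# Statements per swearword: the "1 in N statements" figure.
statements_per_swearword = statements / swearwords

# Words per swearword: derived here for scale, not stated in the slides.
words_per_swearword = words / swearwords

print(round(statements_per_swearword))  # 133
print(round(words_per_swearword))       # 31775
```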

Page 27: Joost ouwerkerk

Top 5 Swearers (absolute)

MP                Party            Swearwords
Pat Martin        NDP              98
Randy White       Conservative     88
Alexa McDonough   NDP              52
Jim Silye         Conservative     50
Yvan Loubier      Bloc Quebecois   49

Page 28: Joost ouwerkerk

Top 5 Swearers (relative)

MP              Party          Rate     Swearwords   Words
Randy White     Conservative   0.037%   88           299,114
Dennis Mills    Liberal        0.023%   14           62,221
Gerry Ritz      Conservative   0.022%   22           99,037
John McCallum   Conservative   0.017%   38           226,155
John McKay      Liberal        0.016%   44           268,188
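The relative ranking appears to be swearwords divided by words spoken, expressed as a percentage. The sketch below recomputes that rate from the counts as transcribed above; the formula is an assumption inferred from the numbers, not something the slides state. Four of the five rows reproduce the listed rate to three decimals. For Randy White the transcribed counts give roughly 0.029% rather than 0.037%, so one of those figures may have been garbled in transcription.

```python
# (MP, swearwords, words spoken), as transcribed from the slide.
ROWS = [
    ("Randy White", 88, 299_114),
    ("Dennis Mills", 14, 62_221),
    ("Gerry Ritz", 22, 99_037),
    ("John McCallum", 38, 226_155),
    ("John McKay", 44, 268_188),
]


def rate_percent(swearwords, words):
    """Swearwords as a percentage of all words spoken (assumed formula)."""
    return 100.0 * swearwords / words


if __name__ == "__main__":
    for name, swearwords, words in ROWS:
        print(f"{name}: {rate_percent(swearwords, words):.3f}%")
```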

Page 29: Joost ouwerkerk

Top 5 Words Spoken

Paul Szabo 1,482,106

Pat Martin 1,053,365

Don Boudria 867,204

Yvan Loubier 861,888

Peter MacKay 844,130

Page 30: Joost ouwerkerk

Prime Ministers

Prime Minister   Swearwords   Words
Jean Chrétien    11           604,431
Paul Martin      6            485,990
Stephen Harper   22           620,999

Page 31: Joost ouwerkerk

"The best minds of my generation are thinking about how to make people click ads"

- Jeff Hammerbacher (Facebook, Accel, Cloudera)

Page 32: Joost ouwerkerk