The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases


Wednesday 27 March 13

David Coallier (@davidcoallier)

Data Scientist at Engine Yard (.com)

RDBMS

Structure, Restrictions, Safety

[Table: columns id, name, age, address. Rows 1 to 7 and onwards; names david, divad, foo, bar, john, jack, jill, ...; age and address are integer columns.]

What If?

[Table: columns id, name, age, address, phone. Rows 1 to 7 and onwards; names david, divad, foo, bar, john, jack, jill, ...; address now holds country codes (IE, US, IE, CA, NZ, DK, IE, ...) and a phone column has been added.]

Before Moving On

JSON

What is JSON?

{
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": {
    "streetAddress": "Mansfield House",
    "city": "Crosshaven"
  },
  "phoneNumbers": [
    { "type": "mobile", "number": "0863299999" }
  ]
}
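As a minimal sketch of how such a document is consumed, the snippet below parses the example with the standard `JSON.parse` and reads a few nested fields. The variable names (`raw`, `person`, `summary`) are illustrative only. It also checks that strict JSON rejects a trailing comma after the last member of an object.

```javascript
// The example document from the slide, as a JSON string.
const raw = `{
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": { "streetAddress": "Mansfield House", "city": "Crosshaven" },
  "phoneNumbers": [ { "type": "mobile", "number": "0863299999" } ]
}`;

// JSON.parse turns the text into plain objects, arrays, strings and numbers.
const person = JSON.parse(raw);
const summary = person.firstName + " " + person.lastName + ", " + person.address.city;
console.log(summary);
console.log(person.phoneNumbers[0].number);

// Strict JSON has no trailing commas: this input raises a SyntaxError.
let trailingCommaRejected = false;
try {
  JSON.parse('{ "city": "Crosshaven", }');
} catch (err) {
  trailingCommaRejected = err instanceof SyntaxError;
}
console.log(trailingCommaRejected);
```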

What is HTTP?

What is a Schema?

Alternative

Schema-less

Does NOT Mean Structure-less

Documents and K-V Buckets

CouchDB
Cluster of unreliable commodity hardware

Replication
Attachments
Generated “random” ids
Dictionary
Revisions?
JSON Objects
HTTP CRUD
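To make "HTTP CRUD" concrete, here is a rough sketch of how the four operations map onto HTTP requests against CouchDB's document API. The helper `couchRequest` and the database name `people` are hypothetical; the function only describes the request (method, path, body) and sends nothing over the network.

```javascript
// Hypothetical helper: describes the HTTP request for one CRUD operation
// on a CouchDB document. It builds a description only; nothing is sent.
function couchRequest(operation, db, doc) {
  const dbPath = "/" + encodeURIComponent(db);
  const docPath = doc._id ? dbPath + "/" + encodeURIComponent(doc._id) : dbPath;

  if (operation !== "create" && !doc._id) {
    throw new Error(operation + " needs a document _id");
  }

  switch (operation) {
    case "create":
      // With no _id, POST to the database and let CouchDB generate one.
      return { method: doc._id ? "PUT" : "POST", path: docPath, body: JSON.stringify(doc) };
    case "read":
      return { method: "GET", path: docPath };
    case "update":
      // The body must carry the current _rev, otherwise CouchDB reports a conflict.
      return { method: "PUT", path: docPath, body: JSON.stringify(doc) };
    case "delete":
      return { method: "DELETE", path: docPath + "?rev=" + encodeURIComponent(doc._rev) };
    default:
      throw new Error("unknown operation: " + operation);
  }
}

const stored = { _id: "131dafsd1vasd", _rev: "12-fva32asdf", firstName: "David" };
console.log(couchRequest("read", "people", stored));
console.log(couchRequest("delete", "people", stored));
console.log(couchRequest("create", "people", { firstName: "Jill" }));
```

Every write names the revision it is based on, which is how the `_rev` field takes the place of locking.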

Documents

{
  "_id": "131dafsd1vasd",
  "_rev": "12-fva32asdf",
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": {
    "streetAddress": "Mansfield House",
    "city": "Crosshaven"
  },
  "phoneNumbers": [
    { "type": "mobile", "number": "0863299999" }
  ]
}

How Do You Find Anything?

Map/Reduce

...

Riak

Dynamo Paper

CAP Theorem

Key-Value Buckets

Differences?

                 CouchDB                Riak
Storage Model    append-only            bitcask
Access           HTTP                   HTTP, PB
Retrieval        Views (M/R)            M/R, Indexes, Search
Versioning       Eventual Consistency   Vector Clocks
Concurrency      No Locking             Client Resolution
Replication      master/master/slave    replication, clustering
Scaling In/Out   BigCouch               Built-in
Management       Futon/Fauxton          Riak Control

http://downloads.basho.com/papers/bitcask-intro.pdf
http://guide.couchdb.org

Map/Reduce

Mapper: Executed on document
Reducer: Receives output from mappers

{ "_id": "...", "_rev": "...", "age": "26" }

{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

{ "_id": "...", "_rev": "...", "age": "42" }

{ "_id": "...", "_rev": "...", "age": "17" }

Map: find-ages

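function find_ages(doc) {
  // typeof yields a string, so compare against "undefined", not undefined.
  if (typeof doc.age !== "undefined") {
    // Ages are stored as strings in the sample documents; emit them as numbers.
    emit(doc._id, Number(doc.age));
  }
}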

26 32 42 17

Reduce: sum

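function reduce_sum(keys, values, rereduce) {
  // sum() is a helper built into CouchDB's JavaScript view server.
  // Naming this function "sum" as well would make it call itself forever.
  return sum(values);
}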

Map: find-ages

26 32 42 17

Reduce: sum

117
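The whole pipeline can be tried outside CouchDB with a small in-memory stand-in. This is only a sketch: `runView` is a hypothetical harness, the `_id` values are placeholders (the slides elide them), and `emit` is passed in as an argument, whereas CouchDB provides it as a global.

```javascript
// Sample documents from the slides; the _id values are hypothetical placeholders.
const docs = [
  { _id: "doc-a", age: "26" },
  { _id: "doc-b", age: "32", heads: "3" },
  { _id: "doc-c", age: "42" },
  { _id: "doc-d", age: "17" },
];

// Map step: runs once per document and emits (key, value) pairs.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, Number(doc.age));
  }
}

// Reduce step: adds up every value the mappers emitted.
function sumValues(keys, values) {
  return values.reduce((total, value) => total + value, 0);
}

// Hypothetical stand-in for a view: map every document, then reduce once.
function runView(documents, mapFn, reduceFn) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  documents.forEach((doc) => mapFn(doc, emit));
  const result = reduceFn(rows.map((r) => r.key), rows.map((r) => r.value));
  return { rows, result };
}

const view = runView(docs, findAges, sumValues);
console.log(view.rows.map((r) => r.value)); // [ 26, 32, 42, 17 ]
console.log(view.result); // 117
```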

So What?

The Machines
They Lurn.

The Problem

Statistics Example

Mean, Std. Deviation: Age

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

Mapper: Executed on document
Reducer: Receives output from mappers

Mapper: Retrieve values, pre-process
Reducer: Receive, process further.

{ "_id": "...", "_rev": "...", "age": "26" }

{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

{ "_id": "...", "_rev": "...", "age": "42" }

{ "_id": "...", "_rev": "...", "age": "17" }

[
  [26, 676],
  [32, 1024],
  [42, 1764],
  [17, 289]
]
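Why pre-process each age into the pair $(x_i, x_i^2)$? Expanding the square inside the standard deviation shows that the sum of the values and the sum of their squares are all the reducer needs. The worked numbers use the four sample ages.

```latex
\begin{align*}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
          = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\mu\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \mu^2
          = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2 \\[4pt]
\sum x_i &= 26+32+42+17 = 117, \qquad \sum x_i^2 = 676+1024+1764+289 = 3753 \\[4pt]
\mu &= \frac{117}{4} = 29.25, \qquad
\sigma = \sqrt{\frac{3753}{4} - 29.25^2} = \sqrt{82.6875} \approx 9.09
\end{align*}
```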

/**
 * Our mapper function.
 */
map: function (doc) {
  // Ages are stored as strings in the sample documents, so coerce first.
  var age = Number(doc.age);
  emit(null, [age, age * age]);
}

/**
 * Our reducer...
 * Note: rereduce is ignored, so the result is only correct when all
 * mapped rows reach the reducer in a single pass.
 */
reduce: function (keys, values, rereduce) {
  var N = 0;
  var summed = 0;
  var summedSquare = 0;

  for (var i = 0; i < values.length; i++) {
    N += 1;
    summed += values[i][0];
    summedSquare += values[i][1];
  }

  var mean = summed / N;
  var standard_deviation = Math.sqrt((summedSquare / N) - (mean * mean));

  return [mean, standard_deviation];
}

/**
 * Our mapper function.
 */
map: function (doc) {
  // Ages are stored as strings in the sample documents, so coerce first.
  var age = Number(doc.age);
  emit(null, [age, age * age]);
}

/**
 * Our reducer... (same result, using CouchDB's built-in sum() helper)
 */
reduce: function (keys, values, rereduce) {
  var N = values.length;
  var summed = sum(values.map(function (v) { return v[0]; }));
  var summedSquares = sum(values.map(function (v) { return v[1]; }));

  var mean = summed / N;
  var standard_deviation = Math.sqrt((summedSquares / N) - (mean * mean));

  return [mean, standard_deviation];
}
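One caveat: CouchDB may call a reduce function again on its own earlier outputs (with `rereduce` set to true), and averaging `[mean, standard_deviation]` pairs would then give wrong numbers. A possible workaround, sketched here under the assumption that the final division is done by the client after reading the view, is to carry `[count, sum, sum of squares]` through every pass. The names `mapAge`, `reduceStats`, `finalize` and `mapAll` are illustrative, and `emit` is passed in explicitly so the snippet runs outside CouchDB.

```javascript
const docs = [
  { age: "26" },
  { age: "32", heads: "3" },
  { age: "42" },
  { age: "17" },
];

// Mapper: emit [count, x, x^2] so partial results can be added together.
function mapAge(doc, emit) {
  const age = Number(doc.age);
  if (!Number.isNaN(age)) {
    emit(null, [1, age, age * age]);
  }
}

// Reducer: input and output have the same shape, so the same component-wise
// addition is valid for the first pass and for every rereduce pass.
function reduceStats(keys, values, rereduce) {
  return values.reduce(
    (acc, v) => [acc[0] + v[0], acc[1] + v[1], acc[2] + v[2]],
    [0, 0, 0]
  );
}

// Final step (assumed to run on the client): turn the totals into statistics.
function finalize(totals) {
  const n = totals[0];
  const mean = totals[1] / n;
  const std = Math.sqrt(totals[2] / n - mean * mean);
  return { mean, std };
}

// Helper that collects the mapped values for a list of documents.
function mapAll(documents) {
  const values = [];
  documents.forEach((doc) => mapAge(doc, (key, value) => values.push(value)));
  return values;
}

// Simulate two partial reductions followed by a rereduce over their outputs.
const left = reduceStats(null, mapAll(docs.slice(0, 2)), false);
const right = reduceStats(null, mapAll(docs.slice(2)), false);
const combined = reduceStats(null, [left, right], true);
const stats = finalize(combined);
console.log(stats.mean); // 29.25
console.log(stats.std);
```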

Naive Bayes

Real Life: Fraud

$$P(x_j = k \mid y = \text{fraudulent})$$
$$P(x_j = k \mid y = \text{normal})$$
$$P(y)$$

We need to: sum $x_j = k$, for each $y$, to calculate $P(x \mid y)$

We need: more than 1 mapper.

We need 4 mappers

Mapper #1: $\sum 1_i \, P(x_j = k \mid y = \text{fraudulent})$

Mapper #2: $\sum 1_i \, P(x_j = k \mid y = \text{normal})$

Mapper #3: $\sum 1_i \, P(y = \text{fraudulent})$

Mapper #4: $\sum 1_i \, P(y = \text{normal})$

Reducer: Sums up results for parameters
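One way to read the four mappers is as four counters, each emitting 1 when its condition holds for a record, with the reducer adding the ones up. The sketch below follows that reading. The transaction records, the `country` feature, and all function names are hypothetical, and no smoothing is applied to the estimates.

```javascript
// Hypothetical labelled transactions with a single feature, "country".
const transactions = [
  { country: "IE", label: "normal" },
  { country: "IE", label: "normal" },
  { country: "US", label: "fraudulent" },
  { country: "IE", label: "fraudulent" },
  { country: "US", label: "normal" },
];

// Four mappers, each an indicator that returns 1 when its condition holds.
function makeMappers(feature, value) {
  return {
    xGivenFraud: (t) => (t.label === "fraudulent" && t[feature] === value ? 1 : 0),
    xGivenNormal: (t) => (t.label === "normal" && t[feature] === value ? 1 : 0),
    fraud: (t) => (t.label === "fraudulent" ? 1 : 0),
    normal: (t) => (t.label === "normal" ? 1 : 0),
  };
}

// Reducer: sums up the indicator values for one parameter.
function reduceSum(values) {
  return values.reduce((total, v) => total + v, 0);
}

// Run each mapper over all records, reduce, then form the probabilities.
function estimate(records, feature, value) {
  const mappers = makeMappers(feature, value);
  const counts = {};
  for (const name of Object.keys(mappers)) {
    counts[name] = reduceSum(records.map((t) => mappers[name](t)));
  }
  return {
    pXGivenFraud: counts.xGivenFraud / counts.fraud,
    pXGivenNormal: counts.xGivenNormal / counts.normal,
    pFraud: counts.fraud / records.length,
    pNormal: counts.normal / records.length,
  };
}

const params = estimate(transactions, "country", "IE");
console.log(params);
```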

Cluster Analysis

k-means

Mapper: Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up.
Reducer: Sum up the sums, get new centroids.
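A single k-means iteration in this style might look like the sketch below. It is one interpretation of the slide, not a definitive implementation: each mapper handles one subgroup of vectors, assigns every vector to its nearest centroid by Euclidean distance d(p, q), and keeps per-centroid partial sums; the reducer adds the partial sums and divides by the counts. The data points, starting centroids, and function names are all hypothetical.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  return Math.sqrt(p.reduce((s, pi, i) => s + (pi - q[i]) ** 2, 0));
}

// Index of the centroid closest to the given point.
function nearest(point, centroids) {
  let best = 0;
  for (let k = 1; k < centroids.length; k++) {
    if (distance(point, centroids[k]) < distance(point, centroids[best])) best = k;
  }
  return best;
}

// Mapper: for one subgroup, sum the vectors assigned to each centroid.
function mapChunk(points, centroids) {
  const partial = centroids.map((c) => ({ sum: c.map(() => 0), count: 0 }));
  for (const p of points) {
    const k = nearest(p, centroids);
    partial[k].count += 1;
    p.forEach((v, i) => { partial[k].sum[i] += v; });
  }
  return partial;
}

// Reducer: sum up the sums from all mappers and derive the new centroids.
function reduceCentroids(partials, centroids) {
  return centroids.map((old, k) => {
    const count = partials.reduce((s, part) => s + part[k].count, 0);
    if (count === 0) return old; // keep a centroid that attracted no points
    return old.map((_, i) => partials.reduce((s, part) => s + part[k].sum[i], 0) / count);
  });
}

// Hypothetical data: two subgroups of 2-D vectors and two starting centroids.
const chunks = [
  [[1, 1], [2, 1]],
  [[8, 8], [9, 10], [1, 2]],
];
const centroids = [[0, 0], [10, 10]];

const partials = chunks.map((chunk) => mapChunk(chunk, centroids));
const newCentroids = reduceCentroids(partials, centroids);
console.log(newCentroids);
```

Repeating the map and reduce steps with `newCentroids` as input until the centroids stop moving gives the usual k-means loop.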
