Schema Design by Gary Murakami

Post on 29-May-2015

Transcript of Schema Design by Gary Murakami

Lead Engineer / Evangelist

Gary J. Murakami, Ph.D.

#MongoDB

Schema Design

Schema Design – Gary Murakami

Agenda

• What is a Record?

• Core Concepts

• What is an Entity?

• Associating Entities

• General Recommendations

• Questions

All application development is Schema Design

Success comes from Proper Data Structure

What is a Record?

Key → Value

• One-dimensional

• Single value is a blob

• Query on key only

• No schema

• Value cannot be updated, only replaced

Key Blob

Relational

• Two-dimensional (tuples)

• Each field is a single value

• Query on any field

• Very structured schema (table)

• In-place updates *

• Normalization requires many tables, joins, and indexes, and leads to poor data locality and performance

Primary Key

Document

• N-dimensional

• Each field can contain 0, 1, many, or embedded values

• Query on any field & level

• Flexible schema

• Inline updates *

• Embedding related data has optimal data locality, requires fewer indexes, and has better performance

_id

Core Concepts

Traditional Schema Design

Focus on data storage

Document Schema Design

Focus on data use

Another way to think about it

Traditional: What answers do I have?

Document: What questions do I have?

Three Building Blocks of Document Schema Design

1 – Flexibility

• Choices for schema design

• Each record can have different fields

• Field names consistent for programming

• Common structure can be enforced by application

• Easy to evolve as needed
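
A minimal sketch of the "common structure can be enforced by application" point, using plain Python dicts to stand in for documents. REQUIRED_FIELDS and validate_contact are hypothetical names chosen for illustration; they are not part of any MongoDB driver.

```python
# Hypothetical application-level rule: every contact needs these fields.
REQUIRED_FIELDS = {"name": str, "phone": str}

def validate_contact(doc):
    """Return a list of problems; fields beyond the required ones are allowed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

# Two records with different optional fields, plus one that breaks the rule.
jobs = {"name": "Steven Jobs", "phone": "408-996-1010", "company": "Apple Computer"}
page = {"name": "Larry Page", "phone": "650-618-1499", "fax": "650-330-0100"}
incomplete = {"name": "No Phone"}
```

Both complete records pass even though their optional fields differ, which is the flexibility the slide describes; only the shared core is checked.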

2 – Arrays

Multiple Values per Field

• Each field can be:
  – Absent
  – Set to null
  – Set to a single value
  – Set to an array of many values

• Query for any matching value
  – Can be indexed, and each value in the array is in the index
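
The four field states and the "each value in the array is in the index" idea can be sketched in plain Python. This is an illustration of the concept only, not MongoDB's matching or indexing code; the sample contacts and the helper names field_matches and build_multikey_index are assumptions.

```python
from collections import defaultdict

contacts = [
    {"_id": 1, "name": "Ann", "tags": ["friend", "work"]},  # array of many values
    {"_id": 2, "name": "Bob", "tags": "work"},              # single value
    {"_id": 3, "name": "Cy", "tags": None},                 # set to null
    {"_id": 4, "name": "Di"},                               # field absent
]

def field_matches(doc, field, value):
    """True if the field equals value, or is an array containing value."""
    if field not in doc:
        return False
    stored = doc[field]
    if isinstance(stored, list):
        return value in stored
    return stored == value

def build_multikey_index(docs, field):
    """Map every individual value (array elements included) to document ids."""
    index = defaultdict(set)
    for doc in docs:
        if field not in doc:
            continue
        stored = doc[field]
        values = stored if isinstance(stored, list) else [stored]
        for value in values:
            index[value].add(doc["_id"])
    return index

tag_index = build_multikey_index(contacts, "tags")
```

Looking up "work" in tag_index finds both the array-valued and the single-valued contact without scanning every document.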

3 – Embedded Documents

• Any value can be a document

• Nested documents provide structure

• Query any field at any level
  – Can be indexed
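
To make "query any field at any level" concrete, here is a small sketch that follows a dotted path through nested dicts. The get_path helper is hypothetical and only mimics the dotted-path idea; the sample contact is a trimmed illustration.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as "address.city" through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

contact = {
    "name": "Steven Jobs",
    "address": {"street": "10260 Bandley Dr", "city": "Cupertino", "state": "CA"},
}
```

A top-level field and a nested field are reached the same way: get_path(contact, "name") and get_path(contact, "address.city").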

What is an Entity?

An Entity

• Object in your model

• Associations with other entities

Referencing (Relational)     Embedding (Document)
has_one                      embeds_one
belongs_to                   embedded_in
has_many                     embeds_many
has_and_belongs_to_many

MongoDB has both referencing and embedding for universal coverage

Let's model something together

How about a business card?

Business Card

Referencing

Contacts

{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "phone": "408-996-1010",
  "address_id": 1
}

Addresses

{
  "_id": 1,
  "street": "10260 Bandley Dr",
  "city": "Cupertino",
  "state": "CA",
  "zip_code": "95014",
  "country": "USA"
}

Embedding

Contacts

{
  "_id": 2,
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014",
    "country": "USA"
  },
  "phone": "408-996-1010"
}
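
The practical difference between the two layouts is the number of fetches needed to assemble a full contact. A rough sketch, with Python dicts keyed by _id standing in for collections; load_referenced and load_embedded are hypothetical helpers, and the fetch counts are simply counted by hand.

```python
# "Collections" keyed by _id; sample data trimmed to two address fields.
addresses = {1: {"street": "10260 Bandley Dr", "city": "Cupertino"}}
contacts_referencing = {2: {"name": "Steven Jobs", "address_id": 1}}
contacts_embedding = {
    2: {"name": "Steven Jobs",
        "address": {"street": "10260 Bandley Dr", "city": "Cupertino"}}
}

def load_referenced(contact_id):
    """Two fetches: the contact, then the address it points to."""
    contact = dict(contacts_referencing[contact_id])            # fetch 1
    contact["address"] = addresses[contact.pop("address_id")]   # fetch 2
    return contact, 2

def load_embedded(contact_id):
    """One fetch: the address travels with the contact."""
    return contacts_embedding[contact_id], 1
```

Both functions return the same assembled contact; the embedded layout gets there in a single read.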

Relational Schema

Contact
• name
• company
• title
• phone

Address
• street
• city
• state
• zip_code

Document Schema

Contact
• name
• company
• address
  – street
  – city
  – state
  – zip_code
• title
• phone

How are they different? Why?

[Diagram: the relational Contact and Address tables next to the document Contact with its embedded address]

Schema Flexibility

{
  "name": "Steven Jobs",
  "title": "VP, New Product Development",
  "company": "Apple Computer",
  "address": {
    "street": "10260 Bandley Dr",
    "city": "Cupertino",
    "state": "CA",
    "zip_code": "95014"
  },
  "phone": "408-996-1010"
}

{
  "name": "Larry Page",
  "url": "http://google.com/",
  "title": "CEO",
  "company": "Google!",
  "email": "larry@google.com",
  "address": {
    "street": "555 Bryant, #106",
    "city": "Palo Alto",
    "state": "CA",
    "zip_code": "94301"
  },
  "phone": "650-618-1499",
  "fax": "650-330-0100"
}

Example: Longest "Database Endgame" Mate

• Augment schema with metadata
  – Distance to mate (DTM)
  – Distance to conversion (DTC)

• Retrograde analysis of DB

• Longest checkmate
  – 6 piece – 262 moves, KRNKNN
  – 7 piece – 517 moves, so far
    • Completion by 2015

Let’s Look at an Address Book

Address Book

• What questions do I have?

• What are my entities?

• What are my associations?

Address Book Entity-Relationship

Contacts (name, company, title) is associated with:

• Addresses (type, street, city, state, zip_code): 1 to N
• Phones (type, number): 1 to N
• Emails (type, address): 1 to N
• Thumbnails (mime_type, data): 1 to 1
• Portraits (mime_type, data): 1 to 1
• Twitters (name, location, web, bio): 1 to 1
• Groups (name): N to N

Associating Entities

One to One

One to One: Schema Design Choices

• contact holds twitter_id, referencing twitter (1 to 1)
  – Saves a fetch if no twitter
• twitter holds contact_id, referencing contact (1 to 1)
• Redundant to track relationship on both sides
  – Both references must be updated for consistency
• Contact embeds twitter (1)

One to One: General Recommendation

• Full contact info all at once
  – Contact embeds twitter
• Parent-child relationship
  – "contains"
• No additional data duplication
• Can query or index on embedded field
  – e.g., "twitter.name"

[Diagram: Contact embeds twitter (1)]

One to Many

One to Many: Schema Design Choices

• contact holds phone_ids: [ ], referencing phones (1 to N)
  – Not possible in relational DBs
  – Saves a fetch if no phones
• phone holds contact_id, referencing contact (1 to N)
• Redundant to track relationship on both sides
  – Both references must be updated for consistency
• Contact embeds phones (N)

One to Many: General Recommendation

• Full contact info all at once
  – Contact embeds multiple phones
• Parent-children relationship
  – "contains"
• No additional data duplication
• Can query or index on any field
  – e.g., { "phones.type": "mobile" }

[Diagram: Contact embeds phones (N)]
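
A query such as { "phones.type": "mobile" } has to look inside every element of the embedded array. The sketch below imitates that fan-out in plain Python; values_at and matches are hypothetical helpers that support only equality on dotted paths, and the sample contacts are made up.

```python
def values_at(doc, dotted):
    """Collect every value reachable via a dotted path, fanning out over arrays."""
    nodes = [doc]
    for key in dotted.split("."):
        next_nodes = []
        for node in nodes:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    found = []
    for node in nodes:
        # A final array contributes each of its elements.
        found.extend(node if isinstance(node, list) else [node])
    return found

def matches(doc, query):
    """True if every dotted path in the query reaches the expected value."""
    return all(expected in values_at(doc, path) for path, expected in query.items())

contacts = [
    {"name": "Ann", "phones": [{"type": "work", "number": "555-0100"},
                               {"type": "mobile", "number": "555-0101"}]},
    {"name": "Bob", "phones": [{"type": "home", "number": "555-0102"}]},
    {"name": "Cy"},  # no phones at all
]
mobile_owners = [c["name"] for c in contacts if matches(c, {"phones.type": "mobile"})]
```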

Many to Many

Many to Many: Traditional Relational Association

• Contacts (name, company, title, phone)
• Groups (name)
• Join table: GroupContacts (group_id, contact_id), crossed out on the slide

Use arrays instead

Many to Many: Schema Design Choices

• group holds contact_ids: [ ], referencing contacts (N to N)
• contact holds group_ids: [ ], referencing groups (N to N)
• Redundant to track relationship on both sides
  – Both references must be updated for consistency
• group embeds contacts (N)
• contact embeds groups (N)
• Redundant to track relationship on both sides
  – Duplicated data must be updated for consistency

Many to Many: General Recommendation

• Depends on use case
  1. Simple address book
     • Contact references groups
  2. Corporate email groups
     • Group embeds contacts for performance

[Diagram: contact holds group_ids: [ ], referencing groups (N to N)]
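
For the simple address book case, a single array of group ids on the contact replaces the join table and still answers questions in both directions. A plain-Python sketch, with made-up sample data and hypothetical helper names groups_of and members_of:

```python
groups = [
    {"_id": 10, "name": "Family"},
    {"_id": 11, "name": "Chess Club"},
]
contacts = [
    {"_id": 1, "name": "Ann", "group_ids": [10, 11]},
    {"_id": 2, "name": "Bob", "group_ids": [11]},
    {"_id": 3, "name": "Cy"},  # belongs to no group, so the field is absent
]

def groups_of(contact):
    """Contact to groups: resolve the ids stored on the contact."""
    wanted = set(contact.get("group_ids", []))
    return [g["name"] for g in groups if g["_id"] in wanted]

def members_of(group_id):
    """Group to contacts: match the id against each contact's array."""
    return [c["name"] for c in contacts if group_id in c.get("group_ids", [])]
```

Only the contact side stores the relationship, so there is one place to update when membership changes.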

Document model - holistic and efficient representation

Contacts (name, company, title)

• Embedded:
  – addresses (type, street, city, state, zip_code): N
  – phones (type, number): N
  – emails (type, address): N
  – thumbnail (mime_type, data): 1
  – twitter (name, location, web, bio): 1
• Referenced:
  – Portraits (mime_type, data): 1 to 1
  – Groups (name): N to N

Contact document example

{
  "name": "Gary J. Murakami, Ph.D.",
  "company": "10gen (the MongoDB company)",
  "title": "Lead Engineer and Ruby Evangelist",
  "twitter": {
    "name": "GaryMurakami",
    "location": "New Providence, NJ",
    "web": "http://www.nobell.org"
  },
  "portrait_id": 1,
  "addresses": [
    { "type": "work", "street": "229 W 43rd St.", "city": "New York", "zip_code": "10036" }
  ],
  "phones": [
    { "type": "work", "number": "1-866-237-8815 x8015" }
  ],
  "emails": [
    { "type": "work", "address": "gary.murakami@10gen.com" },
    { "type": "home", "address": "gjm@nobell.org" }
  ]
}

Can We Solve Chess One Day?

• Chess tablebase problem
  – Chess programs often play worse
  – Search is not localized, poor cache performance, seeks
  – Working set too large for memory

• Endgame database size – big data
  – 5 piece: 7 GB compressed 75%
    • 157 MB Shredderbase – 1000x
    • 441 MB Shredderbase – 10,000x
  – 6 piece: 1.2 TB compressed
  – 7 piece: 70 TB estimated by 2015

Working Set

1. To reduce the working set
  – reference less-used data instead of embedding
    • extract into referenced child document
  – reference bulk data, e.g., portrait

2. To increase resources
  – read from secondaries in a replica set
  – use sharding
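
Point 1 can be sketched as moving the bulky portrait out of the contact and leaving only an id behind, so the frequently read contact stays small. Plain Python dicts stand in for collections; extract_portrait and the sample data are hypothetical.

```python
def extract_portrait(contact, portraits, portrait_id):
    """Move an embedded portrait into its own collection.

    Returns a slimmer copy of the contact that carries only portrait_id.
    """
    slim = {key: value for key, value in contact.items() if key != "portrait"}
    if "portrait" in contact:
        portraits[portrait_id] = contact["portrait"]
        slim["portrait_id"] = portrait_id
    return slim

portraits = {}  # stands in for a separate portraits collection
contact = {
    "name": "Ann",
    "portrait": {"mime_type": "image/jpeg", "data": b"\x00" * 1000},  # bulk data
}
slim_contact = extract_portrait(contact, portraits, portrait_id=1)
```

The portrait is fetched only when a screen actually shows it; listing contacts touches the slim documents alone.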

General Recommendations

Embedding over Referencing

• Embed
  – When "one" or "many" objects are viewed with their parent
  – For performance
  – For atomicity

• Reference
  – When you need more scaling: max document size is 16MB
  – For easy "many to many" associations
  – For smaller parent documents and working set

Legacy Migration

1. Copy existing schema & some data to MongoDB

2. Iterate schema design
   1. Measure performance and find bottlenecks
   2. Denormalize by embedding
      1. one to one associations first
      2. one to many associations next
      3. many to many associations last

3. Examine, measure and analyze, review concerns, scaling
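
The "denormalize by embedding" step might look like the following when applied to rows copied from relational tables, taking one to one first and one to many next. The row layouts, the contact_id foreign key name, and the denormalize helper are assumptions made for this sketch.

```python
contact_rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
twitter_rows = [{"contact_id": 1, "name": "ann_tw"}]
phone_rows = [
    {"contact_id": 1, "type": "work", "number": "555-0100"},
    {"contact_id": 1, "type": "mobile", "number": "555-0101"},
    {"contact_id": 2, "type": "home", "number": "555-0102"},
]

def strip_fk(row):
    """Drop the foreign key; embedding makes it redundant."""
    return {key: value for key, value in row.items() if key != "contact_id"}

def denormalize(contacts, twitters, phones):
    docs = {row["id"]: {"_id": row["id"], "name": row["name"]} for row in contacts}
    for row in twitters:   # one to one first: a single embedded document
        docs[row["contact_id"]]["twitter"] = strip_fk(row)
    for row in phones:     # one to many next: an array of embedded documents
        docs[row["contact_id"]].setdefault("phones", []).append(strip_fk(row))
    return list(docs.values())

documents = denormalize(contact_rows, twitter_rows, phone_rows)
```

Many to many associations are left for last because, as the earlier slides note, the right choice there depends on the use case.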

New Application

1. Focus on your application
   1. Requests
   2. Responses
   3. Business-domain model objects / data structures

2. Then persist language object data to MongoDB
   1. Collections
   2. Associations
   3. Refactor for optimization and add indices
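
Starting from the application's own objects, the document shape largely falls out of the object shape. A minimal sketch using Python dataclasses; the Contact and Phone classes and to_document are hypothetical, and the resulting dict is what would be handed to a driver's insert call.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Phone:
    type: str
    number: str

@dataclass
class Contact:
    name: str
    phones: List[Phone] = field(default_factory=list)

def to_document(contact):
    """Turn the domain object into a nested dict ready to persist."""
    return asdict(contact)  # asdict recurses into nested dataclasses and lists

ann = Contact("Ann", [Phone("work", "555-0100")])
ann_document = to_document(ann)
```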

It’s All About Your Application

• Your schema is the impedance matcher
  – Design choices: normalize/denormalize, reference/embed
  – Melds programming with MongoDB for best of both
  – Flexible for development and change

• Programs+Databases = (Big) Data Applications

• Programs×MongoDB = Great Big Data Applications

• Play chess with God

• Play music with God – AAC

Lead Engineer / Evangelist

Gary J. Murakami, Ph.D.

#MongoDB

Questions?

"His pattern indicates two-dimensional thinking." - Spock, Star Trek II: The Wrath of Khan

www.3dchessfederation.com

Thank you so much to our community who made An Evening with MongoDB Minneapolis possible:

• David Hussman
• Josh Kennedy
• Matthew Chimento
• Jeffrey Lemmerman
• Dan Chamberlain
• Christopher Rueber
• Erin Newkirk

Thank you DevJam for hosting our event!