Webinar: Schema Design and Performance Implications
-
Upload
mongodb -
Category
Technology
-
view
2.489 -
download
1
Transcript of Webinar: Schema Design and Performance Implications
Schema Design(and its performance implications)
Jay RunkelPrincipal Solutions [email protected]@jayrunkel
2
Agenda
1. Today’s Example
2. MongoDB Schema Design vs. Relational
3. Modeling Relationships
4. Schema Design and Performance
Today’s Example
4
Medical Records• Collects all patient information in a central repository• Provide central point of access for
– Patients– Care providers: physicians, nurses, etc.– Billing– Insurance reconciliation
• Hospitals, physicians, patients, procedures, records
PatientRecords
Medications
Lab Results
Procedures
Hospital Records
Physicians
Patients
Nurses
Billing
5
Medical Record Data
• Hospitals – have physicians
• Physicians– Have patients– Perform procedures– Belong to hospitals
• Patients– Have physicians– Are the subject of procedures
• Procedures– Associated with a patient– Associated with a physician– Have a record– Variable meta data
• Records– Associated with a procedure– Binary data– Variable fields
6
Lot of Variability
Relational View
Schema Design:
MongoDB vs. Relational
MongoDB Relational
Collections Tables
Documents Rows
Data Use Data Storage
What questions do I have? What answers do I have?
MongoDB versus Relational
Attribute MongoDB Relational
Storage N-dimensional Two-dimensional
Field Values 0, 1, many, or embed Single value
Query Any field or level Any field
Schema Flexible Very structured
Complex Normalized Schemas
Complex Normalized Schemas
13
Documents are Rich Data Structures{ first_name: ‘Paul’, surname: ‘Miller’, cell: ‘+447557505611’ city: ‘London’, location: [45.123,47.232], Profession: [banking, finance, trader], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } ]}
Fields can contain an array of sub-documents
Fields
Typed field values
Fields can contain arrays
String
Number
Geo-Coordinates
Relationships
Modeling One-to-One Relationships
16
Referencing
Procedure• patient• date• type• physician• type
Results• dataType• size• content: {…}
Use two collections with a reference
Similar to relational
17
Procedure• patient• date• type• results
• equipmentId• data1• data2
• physician
• Results• type• size• content: {…}
Embedding
Document Schema
18
Referencing
Procedure
{ "_id" : 333, "date" : "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : 134}
Results
{ “_id” : 134 "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }
19
EmbeddingProcedure{ "_id" : 333, "date" : "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : { "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }}
20
Embedding
• Advantages– Retrieve all relevant information in a single query/document– Avoid implementing joins in application code– Update related information as a single atomic operation
• MongoDB doesn’t offer multi-document transactions
• Limitations– Large documents mean more overhead if most fields are not relevant– 16 MB document size limit
21
Atomicity
• Document operations are atomicdb.patients.update({_id: 12345},
{$inc : {numProcedures : 1}, $push : {procedures : “proc123”}, $set : {addr.state : “TX”}})
• No multi-document transactions
db.beginTransaction();db.patients.update({_id: 12345}, …);db.procedure.insert({_id: “proc123”, …});db.records.insert({_id: “rec123”, …});db.endTransaction();
22
Embedding
• Advantages– Retrieve all relevant information in a single query/document– Avoid implementing joins in application code– Update related information as a single atomic operation
• MongoDB doesn’t offer multi-document transactions
• Limitations– Large documents mean more overhead if most fields are not relevant– 16 MB document size limit
23
Referencing
• Advantages– Smaller documents– Less likely to reach 16 MB document limit– Infrequently accessed information not accessed on every query– No duplication of data
• Limitations– Two queries required to retrieve information– Cannot update related information atomically
24
One to One: General Recommendations
• Embed– No additional data duplication– Can query or index on
embedded field• e.g., “result.type”
• Exceptional cases…• Embedding results in large
documents• Set of infrequently access
fields
{ "_id" : 333, "date" : "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : { "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }}
Modeling One-to-Many Relationships
26
{ _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [ { id: 12345, date: 2015-02-15, type: “Cat scan”,
…}, { id: 12346, date: 2015-02-15, type: “blood test”,
…}]}
Pat
ient
s
Embed
One-to-Many RelationshipsModeled in 2 possible ways
{ _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [12345, 12346]}
{ _id: 12345, date: 2015-02-15, type: “Cat scan”, …} { _id: 12346, date: 2015-02-15, type: “blood test”, …}
Pat
ient
s
Reference
Pro
cedu
res
27
One to Many: General Recommendations
• Embed, when possible– Access all information in a single query– Take advantage of update atomicity– No additional data duplication– Can query or index on any field
• e.g., { “phones.type”: “mobile” }
• Exceptional cases:– 16 MB document size– Large number of infrequently accessed fields
{ _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [ { id: 12345, date: 2015-02-15, type: “Cat scan”,
…}, { id: 12346, date: 2015-02-15, type: “blood test”,
…}]}
Modeling Many-to-Many Relationships
29
Many to ManyTraditional Relational Association
Join table
Physiciansnamespecialtyphone
Hospitalsname
HosPhysicanRelhospitalIdphysicianIdX
Use arrays instead
30
{ _id: 1, name: “Oak Valley Hospital”, city: “New York”, beds: 131, physicians: [ { id: 12345, name: “Joe Doctor”, address: {…},
…}, { id: 12346, name: “Mary Well”, address: {…},
…}]}
Many-to-Many RelationshipsEmbedding physicians in hospitals collection
{ _id: 2, name: “Plainmont Hospital”, city: “Omaha”, beds: 85, physicians: [ { id: 63633, name: “Harold Green”, address: {…},
…}, { id: 12345, name: “Joe Doctor”, address: {…},
…}]}
Data Duplication
31
{ _id: 1, name: “Oak Valley Hospital”, city: “New York”, beds: 131, physicians: [12345, 12346]}
Many-to-Many RelationshipsReferencing
{ id: 63633, name: “Harold Green”, address: {…}, …}
Hospitals
{ _id: 2, name: “Plainmont Hospital”, city: “Omaha”, beds: 85, physicians: [63633, 12345]}
Physicians
{ id: 12345, name: “Joe Doctor”, address: {…}, …}
{ id: 12346, name: “Mary Well”, address: {…}, …}
32
Many to ManyGeneral Recommendation
• Use case determines whether to reference or embed:1. Data Duplication
• Embedding may result in data duplication
• Duplication may be okay if reads dominate updates
2. Referencing may be required if many related items
3. Hybrid approach• Potentially do both
{ _id: 2, name: “Oak Valley Hospital”, city: “New York”, beds: 131, physicians: [12345, 12346]}
{ _id: 12345, name: “Joe Doctor”, address: {…}, …} { _id: 12346, name: “Mary Well”, address: {…}, …}
Hos
pita
ls
Reference
Phy
sici
ans
What If I Want to Store Large Files in MongoDB?
34
GridFS
Driv
erGridFS APIdoc.jpg(meta data)
doc.jpg(1)doc.jpg
(1)doc.jpg(1)
fs.files fs.chunksdoc.jpg
mongofiles utility provides command line GridFS interface
Schema Design and Performance
Two Examples
Example 1: Hybrid Approach
Embed and Reference
37
Healthcare Example
patients
procedures
Tailor Schema to Queries (cont.)
{ "_id" : 593340651, "first" : "Gregorio", "last" : "Lang", "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 }, "physicians" : [10387 33456], "procedures” : ["551ac”, “343fs”]}
{ "_id" : "551ac”, "date" :"2000-04-26”, "hospital" : 161, "patient" : 593340651, "physician" : 10387, "type" : "Chest X-ray", "records" : [ “67bc6”]}
Patient Procedure
Find all patients from NH that have had chest x-rays
Tailor Schema to Queries (cont.)
{ "_id" : 593340651, "first" : "Gregorio", "last" : "Lang", "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 }, "physicians" : [10387 33456], "procedures” : [ {id : "551ac”, type : “Chest X-ray”}, {id : “343fs”, type : “Blood Test”}]}
{ "_id" : "551ac”, "date" :"2000-04-26”, "hospital" : 161, "patient" : 593340651, "physician" : 10387, "type" : "Chest X-ray", "records" : [ “67bc6”]}
Patient Procedure
Find all patients from NH that have had chest x-rays
Example 2: Time Series Data
Medical Devices
41
Vital Sign Monitoring Device
Vital Signs Measured:• Blood Pressure• Pulse• Blood Oxygen Levels
Produces data at regular intervals• Once per minute
42
We have a hospital(s) of devices
43
Data From Vital Signs Monitoring Device
{ deviceId: 123456, spO2: 88, pulse: 74, bp: [128, 80], ts: ISODate("2013-10-16T22:07:00.000-0500")}
• One document per minute per device
• Relational approach
44
Document Per Hour (By minute)
{ deviceId: 123456, spO2: { 0: 88, 1: 90, …, 59: 92}, pulse: { 0: 74, 1: 76, …, 59: 72}, bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]}, ts: ISODate("2013-10-16T22:00:00.000-0500")}
• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
45
Characterizing Write Differences
• Example: data generated every minute• Recording the data for 1 patient for 1 hour:
Document Per Event60 inserts
Document Per Hour1 insert, 59 updates
46
Characterizing Read Differences
• Want to graph 24 hour of vital signs for a patient:
• Read performance is greatly improved
Document Per Event 1440 reads
Document Per Hour24 reads
47
Characterizing Memory and Storage Differences
Document Per Minute Document Per HourNumber Documents 52.6 B 876 M
Total Index Size 6364 GB 106 GB
_id index 1468 GB 24.5 GB
{ts: 1, deviceId: 1} 4895 GB 81.6 GB
Document Size 92 Bytes 758 Bytes
Database Size 4503 GB 618 GB
• 100K Devices • 1 years worth of data
100000 * 365 * 24 * 60
100000 * 365 * 24
100000 * 365 * 24 * 60 * 130
100000 * 365 * 24 * 130
100000 * 365 * 24 * 60 * 92
100000 * 365 * 24 * 758
48
Summary• Relationships can be modeled by embedding or references
• Decision should be made in context of application data and query workload– Tailor schema to application workload
• It is okay recommended to violate RDBMS schema design principles– No duplication of data– Normalization
• Different schemas may result in dramatically different– Query performance– Hardware requirements