Schema Design

Big Data Schema Design Deepak

description

This presentation was given by Deepak at the QBurst Architects Meet, held at the hotel Beaumonde The Fern on Thursday, March 14, 2013.

Transcript of Schema Design

Page 1: Schema Design

Big Data Schema Design

Deepak

Page 2: Schema Design

Overview
• Schema design is vital for performance.
• Keywords: non-relational, NoSQL, distributed
• Underlying file systems: GFS, HDFS
• Examples: Hadoop, GFS, HBase, Bigtable, etc.
• Example implementations: Facebook, Walmart, etc.

Page 3: Schema Design

When to use
• Typically with systems holding hundreds of millions to billions of rows
• Data volumes on the order of hundreds or thousands of TB
• No advanced query language needed
• Typed columns or other RDBMS features not needed

Page 4: Schema Design

Hadoop Architecture

Page 5: Schema Design

Hadoop Ecosystem

Page 6: Schema Design

HBase Architecture

Page 7: Schema Design

Overview
• HBase runs on top of HDFS
• HDFS was chosen because of its fault tolerance, checksumming, and failover properties
• Access is through the native Java client or the REST API
• The Master (HMaster) manages the cluster; Region Servers manage the data

Page 8: Schema Design

HBase Data Model
• Table: design-time namespace, has many rows.
• Row: atomic key/value container, with one row key
• Column Family: divides columns into physical files
• Column: a key in the k/v container inside a row
• Timestamp: long milliseconds, sorted descending
• Value: a time-versioned value in the k/v container
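The nesting described on this slide (table, row, column family, column, timestamp, value) can be pictured with a small in-memory model. The sketch below is only an illustration in plain Python, not the HBase client API; the class name `ToyTable` and the sample row key, family, and values are made up for the example.

```python
import time


class ToyTable:
    """In-memory stand-in for one HBase-style table (illustration only)."""

    def __init__(self, families):
        # Column families are fixed when the table is designed.
        self.families = set(families)
        # row key -> column family -> column -> {timestamp: value}
        self.rows = {}

    def put(self, row_key, family, column, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if ts is None:
            ts = int(time.time() * 1000)  # timestamp as long milliseconds
        cell = (self.rows.setdefault(row_key, {})
                         .setdefault(family, {})
                         .setdefault(column, {}))
        cell[ts] = value

    def get(self, row_key, family, column, max_versions=1):
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        # Versions come back newest first (timestamps sorted descending).
        return sorted(versions.items(), reverse=True)[:max_versions]


table = ToyTable(["info"])
table.put("user#42", "info", "email", "old@example.com", ts=1000)
table.put("user#42", "info", "email", "new@example.com", ts=2000)
latest = table.get("user#42", "info", "email")
history = table.get("user#42", "info", "email", max_versions=2)
```

Here `latest` holds only the newest version of the cell, while `history` returns both versions with the most recent one first.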

Page 9: Schema Design

Distribution

Page 10: Schema Design

More distribution

Page 11: Schema Design

Thoughts on the logical view
• The unit of scalability is the Region.
• Rows are not tied to a server. They may be moved around for load balancing.
• Add nodes so that we do not have too many regions per node
• Too many regions per node will work against distribution
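One way to picture why rows are not tied to a server: a row key maps to a region by key range, and a region maps to a server through a separate assignment that the balancer can change. The sketch below assumes hypothetical split points and server names (`rs1`, `rs2`, `rs3`); it is not how HBase stores its region metadata.

```python
import bisect

# Hypothetical split points: region i holds row keys in
# [REGION_START_KEYS[i], REGION_START_KEYS[i + 1]).
REGION_START_KEYS = ["", "g", "n", "t"]

# Hypothetical region-to-server assignment; load balancing may change it.
region_to_server = {0: "rs1", 1: "rs2", 2: "rs1", 3: "rs3"}


def find_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def find_server(row_key):
    """Return the server currently hosting the region for row_key."""
    return region_to_server[find_region(row_key)]


before = find_server("apple")   # region 0 is served by rs1
region_to_server[0] = "rs2"     # the balancer moves region 0
after = find_server("apple")    # same row, same region, different server
```

The row key to region mapping does not change when a region moves; only the region to server assignment does.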

Page 12: Schema Design

Column Family
• Each Column Family represents a physical storage unit (a directory)
• Data that is queried together should be stored together.
• Features such as compression can be enabled per Column Family

Page 13: Schema Design

Bloom Filter
• Generated automatically when an HFile is flushed to disk
• Available in primary memory
• Contains row keys
• CK can be stored as part of RK, but that might overload the memory.
• Can filter based on what is stored.
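To make the idea concrete, here is a minimal Bloom filter sketch over row keys. It is a generic textbook version, not HBase's implementation; the bit-array size, the number of hash functions, and the use of SHA-256 are assumptions chosen for readability. A negative answer means the key is definitely absent, so a reader can skip the file; a positive answer only means the key may be present.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter (illustration only, not HBase's format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple at the cost of memory.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False: the key was never added. True: it may have been added.
        return all(self.bits[pos] for pos in self._positions(key))


row_keys = BloomFilter()
for key in ("row-001", "row-002"):
    row_keys.add(key)
```

Calling `row_keys.might_contain("row-001")` returns `True`. Storing more in the filter (for example, row key plus column) lets it answer finer-grained questions, but every extra entry sets more bits, which is the memory trade-off the slide mentions.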

Page 14: Schema Design

Physical View

Page 15: Schema Design

Key Cardinality

Page 16: Schema Design

Tall vs Fat Tables
• Fat tables hold large amounts of data in each column.
• Tall tables hold large numbers of rows.
• Tall is good for searches or scans
• Fat is good for fetches or gets
• Rows don't split
• Atomicity is guaranteed only at the row level; when an entity is spread over several rows through compound keys, atomicity across those rows is not guaranteed
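The same data can be laid out either way. The sketch below uses plain dictionaries and a made-up message inbox example (the user names, the `msg:` column names, and the `#` separator are assumptions) to contrast a fat layout read with a single get against a tall layout read with a prefix scan over sorted row keys.

```python
# Fat layout: one row per user, one column per message.
fat = {
    "alice": {"msg:0001": "hi", "msg:0002": "lunch?"},
    "bob": {"msg:0001": "hello"},
}

# Tall layout: one row per (user, message), using a compound row key.
tall = {
    "alice#0001": {"msg:body": "hi"},
    "alice#0002": {"msg:body": "lunch?"},
    "bob#0001": {"msg:body": "hello"},
}


def fat_get(user):
    """Fetch every message for a user with a single row lookup."""
    return fat[user]


def tall_scan(prefix):
    """Scan rows whose key starts with prefix, in sorted key order."""
    return [(key, tall[key]) for key in sorted(tall) if key.startswith(prefix)]


alice_fat = fat_get("alice")        # one row, two columns
alice_tall = tall_scan("alice#")    # two rows, one column each
```

In the fat layout all of a user's messages live in one row, so an update to them is a single-row operation. In the tall layout the same update touches several rows, which is where row-level atomicity stops helping.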

Page 17: Schema Design

Key Design
• Sequential keys: for example, a timestamp as the key
• With sequential keys, writes keep hotspotting on a single region.
• Salting to distribute the records
• Field promotion
• Random keys
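The key strategies listed on this slide can be compared side by side. The functions below are a sketch under stated assumptions: the bucket count, the `-` separator, the zero-padded timestamp format, and the `host` field are all hypothetical, and SHA-256 stands in for whatever hash a real system would use.

```python
import hashlib

NUM_BUCKETS = 4  # hypothetical number of salt buckets


def sequential_key(ts_millis):
    """Timestamp as key: sorts by time, so new writes hit one region."""
    return f"{ts_millis:013d}"


def salted_key(ts_millis, num_buckets=NUM_BUCKETS):
    """Prefix a bucket derived from the key itself, so readers can recompute it."""
    base = sequential_key(ts_millis)
    digest = hashlib.sha256(base.encode()).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return f"{bucket:02d}-{base}"


def promoted_key(host, ts_millis):
    """Field promotion: put another field in front of the timestamp."""
    return f"{host}-{sequential_key(ts_millis)}"


def random_key(ts_millis):
    """Hashed key: spreads writes evenly but gives up time-ordered scans."""
    return hashlib.sha256(sequential_key(ts_millis).encode()).hexdigest()


# Count how many distinct salt prefixes 1000 consecutive timestamps produce.
buckets_used = {salted_key(ts)[:2] for ts in range(1000)}
```

With salting, a time-range read has to query every bucket and merge the results. With field promotion, scans for a single `host` stay contiguous. With fully random keys, range scans by time are no longer possible.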

Page 18: Schema Design

Key Design Performance

Page 19: Schema Design

Summary
• Think twice before you decide on NoSQL technologies
• Avoid hotspots
• Store values at appropriate places
• Choose the right keys
• Store inferences in an RDBMS if necessary

Page 20: Schema Design

Visit us:

Facebook: http://www.facebook.com/QBurst
Twitter: http://twitter.com/qburst
Google+: https://plus.google.com/+qburst/posts
LinkedIn: http://www.linkedin.com/company/qburst
YouTube: http://www.youtube.com/QBurstVideos

www.qburst.com