Big Table: Distributed Storage System For …courses.cs.vt.edu/cs5204/fall10-kafura-NVC/Student...2...


Dennis Kafura – CS5204 – Operating Systems

Big Table: Distributed Storage System For Structured Data

Sergejs Melderis


Unstructured Data vs. Structured Data

Unstructured data refers to computerized information that does not have a data model
  examples: plain text, audio

Structured data can be described by a data model
  Flat
  Hierarchical
  Network
  Relational
  Dimensional
  Object-relational


Relational Model and RDBMS

most popular model for organizing structured data
model based on first-order predicate logic
provides a declarative method for specifying data and queries via SQL
data is organized in tables of fixed-length records
variety of open source and commercial implementations
provides ACID properties


NoSQL

not a relational database
no fixed table schemas
no join operations
no SQL
flexible data model, or no data model
usually do not provide ACID properties
scale horizontally


BigTable

distributed, high-performance, fault-tolerant NoSQL storage system built on top of the Google File System
designed to scale to a very large size on low-cost commodity hardware
designed by Google and used in various projects (web indexing)
the paper was published in 2006
related implementations
  HBase
  Hypertable
  Apache Cassandra
  Neptune


BigTable Data Model

sparse, distributed, persistent multi-dimensional sorted map
map is indexed by a row key, column family, column key, and a timestamp

{
  row : {
    column_family : {
      column : {
        timestamp : value
      }
    }
  }
}
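The nested map above can be sketched with ordinary Python dictionaries. This is only an in-memory illustration, not how Bigtable stores data; the helper names make_table, put, get_latest and scan_rows are made up for this example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the map
# (row, column_family, column, timestamp) -> value.
def make_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(table, row, family, column, timestamp, value):
    """Store one version of one cell; untouched cells take no space (sparse)."""
    table[row][family][column][timestamp] = value

def get_latest(table, row, family, column):
    """Return the value with the largest timestamp, or None if the cell is empty."""
    versions = table.get(row, {}).get(family, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]

def scan_rows(table):
    """Return row keys in lexicographic order, as a sorted map would."""
    return sorted(table)
```

A cell that was never written simply has no entry, which is what "sparse" means here.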


Webtable

Row key: “com.cnn.www”

Column key            Timestamp   Value
“contents”            t6          “<html>...”
“anchor:cnnsi.com”    t9          “CNN”
“anchor:my.look.ca”   t9          “CNN.com”


Relational Data Model

Student
  student_id (PK)
  first_name
  last_name
  birthday
  major
  academic_level

Course
  crn (PK)
  course
  title
  type
  instructor_id
  seats

StudentCourse
  student_id
  crn


Student table

Row Key: student_id

Column Family: info
  Column Qualifiers: last_name, first_name, birthday, major, academic_level

Column Family: course
  Column Qualifier: <crn>


Course table

Row Key: crn

Column Family: info
  Column Qualifiers: course, title, type, instructor_id, seats

Column Family: students
  Column Qualifier: <student_id>


Example

Row “905514” (Student table)
  info:first_name      “Sergejs”
  info:last_name       “Melderis”
  info:major           “Computer Science”
  courses:96322        “YES”
  courses:96320        “NO”

Row “96322” (Course table)
  info:course          “CS5204”
  info:title           “Operating Systems”
  info:instructor_id   “1983943”
  students:905514      “YES”
  students:905520      “YES”


Students data view in JSON

{
  "905514": {
    "info": {
      "first_name": { "t1": "Sergejs" },
      "last_name": { "t1": "Melderis" },
      "major": { "t1": "Comp Science" }
    },
    "courses": {
      "96322": { "t1": "YES" },
      "96320": { "t2": "NO" }
    }
  }
}
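One way to get from the relational model to this view is to fold the StudentCourse join table into a column family of the student row. The sketch below assumes student records arrive as dictionaries and enrollments as (student_id, crn) pairs; the function name student_row and the "YES" marker value are illustrative choices, not something the slides prescribe.

```python
def student_row(student, enrollments, timestamp):
    """Build one Bigtable-style row from a Student record and its StudentCourse rows."""
    # Every non-key attribute becomes a column in the "info" family.
    info = {
        field: {timestamp: value}
        for field, value in student.items()
        if field != "student_id"
    }
    # Every enrollment becomes a column named after the crn in the "courses" family.
    courses = {
        crn: {timestamp: "YES"}
        for sid, crn in enrollments
        if sid == student["student_id"]
    }
    return {student["student_id"]: {"info": info, "courses": courses}}
```

Because the enrollments live inside the row, listing a student's courses needs no join, only a read of one row.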


Rows

row keys are arbitrary strings up to 64 KB
reads and writes of data under a single row key are atomic
rows are kept in lexicographic order by row key
the row range is dynamically partitioned into blocks called tablets
tablets are the units of distribution and load balancing
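A rough sketch of how sorted row keys map onto tablets follows. It splits by row count for simplicity, whereas Bigtable splits tablets by size; split_into_tablets and find_tablet are hypothetical helper names.

```python
import bisect

def split_into_tablets(sorted_row_keys, max_rows_per_tablet):
    """Partition sorted row keys into contiguous ranges (tablets)."""
    return [
        sorted_row_keys[i:i + max_rows_per_tablet]
        for i in range(0, len(sorted_row_keys), max_rows_per_tablet)
    ]

def find_tablet(tablets, row_key):
    """Return the index of the tablet whose key range covers row_key."""
    # Each tablet is identified by its last (end) row key.
    end_keys = [tablet[-1] for tablet in tablets]
    index = bisect.bisect_left(end_keys, row_key)
    # Keys beyond the last end key fall into the last tablet.
    return min(index, len(tablets) - 1)
```

Because rows are sorted, a lookup is a binary search over tablet end keys, and rows with nearby keys land in the same tablet.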


Columns

Column keys are grouped into column families
A column family is the basic unit of access control
All data stored in a column family is usually of the same type
The number of column families should be small
There can be an unlimited number of columns
A column key is named using family:qualifier
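The family:qualifier naming and the per-family access check might look like the sketch below. The set of readable families stands in for an access control list; both function names are invented for illustration.

```python
def parse_column_key(column_key):
    """Split 'family:qualifier' into its two parts; the qualifier may be empty."""
    family, _, qualifier = column_key.partition(":")
    if not family:
        raise ValueError(f"column key {column_key!r} has no family")
    return family, qualifier

def can_read(readable_families, column_key):
    """Access is decided per column family, never per individual column."""
    family, _ = parse_column_key(column_key)
    return family in readable_families
```

Since permissions attach to the family, adding new columns such as another courses:<crn> entry needs no permission change.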


Timestamps

Bigtable can contain multiple versions of the same data
timestamps are 64-bit integers assigned by Bigtable or by the client
client can specify that only the last n versions of data be kept
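A minimal sketch of the keep-the-last-n-versions rule for a single cell is shown here. It trims old versions at write time, which is a simplification, and write_cell is a hypothetical helper; a server-assigned timestamp is assumed to be the current time in microseconds.

```python
import time

def write_cell(versions, value, max_versions, timestamp=None):
    """Add a version to a cell and keep only the newest max_versions entries."""
    if timestamp is None:
        # Assumption: server-assigned timestamps are microseconds since the epoch.
        timestamp = time.time_ns() // 1000
    versions[timestamp] = value
    # Drop every version older than the newest max_versions.
    for old in sorted(versions, reverse=True)[max_versions:]:
        del versions[old]
    return versions
```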


Implementation

client library
one master server
distributed lock service called Chubby
many tablet servers, each containing several tablets
tablet server handles read and write requests
tablet server automatically splits tablets that have grown too large (100 - 200 MB)
client data goes directly to the tablet server


Tablet Location

three-level hierarchy to store tablet locations
first level is stored in the lock service
root tablet contains the locations of the METADATA tablets
METADATA tablets contain the locations of the user tablets

Figure: Lock Service -> Root tablet -> METADATA tablets -> UserTable1, UserTable2
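The three-level lookup can be sketched as two binary searches after one read from the lock service. This assumes a single user table and models each index as a sorted list of (end_row_key, target) pairs; the names lookup and locate_tablet and the dictionary standing in for the lock service are illustrative only.

```python
import bisect

def lookup(index, row_key):
    """index is a sorted list of (end_row_key, target) pairs; return the
    target of the first entry whose end_row_key is >= row_key."""
    end_keys = [end for end, _ in index]
    position = bisect.bisect_left(end_keys, row_key)
    if position == len(index):
        raise KeyError(row_key)
    return index[position][1]

def locate_tablet(lock_service, row_key):
    root_tablet = lock_service["root_tablet"]       # level 1: lock service
    metadata_tablet = lookup(root_tablet, row_key)  # level 2: root tablet
    return lookup(metadata_tablet, row_key)         # level 3: METADATA tablet
```

For example, with root_tablet = [("m", meta_a), ...] and meta_a = [("f", "server1"), ("m", "server2")], the key "grape" resolves to "server2".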


Distribution of data

One master server
Chubby distributed lock service
Hundreds or thousands of tablet servers
Each tablet contains a contiguous range of rows
Master distributes tablets across the servers
Each tablet server contains tablets with different ranges
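As a toy illustration of the master spreading tablets over servers, the sketch below hands each tablet to whichever server currently holds the fewest. The least-loaded policy and the name assign_tablets are assumptions made for this example; the slides do not say how the master chooses.

```python
def assign_tablets(tablets, servers):
    """Give each tablet to the server that currently holds the fewest tablets."""
    assignment = {server: [] for server in servers}
    for tablet in tablets:
        # On a tie, min() picks the server listed first.
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(tablet)
    return assignment
```

Each tablet is a (start_key, end_key) range here, so one server ends up holding several non-adjacent ranges.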


Tablet Representation

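Figure: tablet representation
  Memory: memtable
  GFS: tablet log, SSTable files
  Write Op goes to the tablet log and the memtable
  Read Op reads the memtable and the SSTables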


Compactions

compaction is the process of writing the memtable to an SSTable
minor compaction writes the memtable to a new SSTable
  shrinks the memory usage of the tablet server
  reduces the amount of commit log that must be read during recovery
merging compaction merges several SSTables into a new SSTable
major compaction rewrites all SSTables into exactly one SSTable
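The write path, read path and compactions can be put together in a small in-memory sketch. The Tablet class below is hypothetical: lists and dictionaries stand in for the log and SSTable files in GFS, and deletions and merging compactions are left out.

```python
class Tablet:
    def __init__(self):
        self.log = []        # stand-in for the tablet (commit) log in GFS
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # immutable sorted files, oldest first

    def write(self, key, value):
        # Log first, then apply the write to the memtable.
        self.log.append((key, value))
        self.memtable[key] = value

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def minor_compaction(self):
        # Freeze the memtable into a new SSTable; its log entries are no longer needed.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.log = []

    def major_compaction(self):
        # Rewrite all SSTables into exactly one; newer values overwrite older ones.
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]
```

After a minor compaction the memtable and log are empty, which is the memory and recovery saving the slide refers to; a major compaction leaves a single SSTable so reads touch fewer files.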


API

create and delete tables and column families
write or delete values
look up values from individual rows
scan over a subset of the data in a table
