7/29/2019 Google's Database
1/51
Bigtable: A Distributed Storage System forStructured Data
Presenter: Guangdong LiuJan 24 th , 2012
7/29/2019 Google's Database
2/51
Dec 8 th , 2011Dec 8 th , 2011
Googles Motivation: Scale!
Consider Google Lots of data
Copies of the web, satellite data, user data,email and USENET, Subversion backing store
Millions of machines Different projects/applications Hundreds of millions of users
Many incoming requests
7/29/2019 Google's Database
3/51
Dec 8 th , 2011Dec 8 th , 2011
Why not A DBMS?
Few DBMSs support the requisite scale Required DB with wide scalability, wide
applicability, high performance and highavailability
Couldnt afford it if there was one Most DBMSs require very expensive
infrastructure DBMSs provide more than Google needs
E.g., full transactions, SQL Google has highly optimized lower-level systems
that could be exploited
GFS, Chubby, MapReduce, Job scheduling
7/29/2019 Google's Database
4/51
Dec 8 th , 2011Dec 8 th , 2011
What is a Bigtable?
A BigTable is a sparse, distributed,persistent multidimensional sorted map. Themap is indexed by a row key, a column key,and a timestamp; each value in the map is anuninterpreted array of bytes.
7/29/2019 Google's Database
5/51
Dec 8 th , 2011Dec 8 th , 2011
Relative to a DBMS, BigTable provides Simplified data retrieval mechanism
A map -> string
No relational operators Atomic updates only possible at row level Arbitrary number of columns per row Arbitrary data type for each column Designed for Googles application set Provides extremely large scale (data,
throughput) at extremely small cost
7/29/2019 Google's Database
6/51
Dec 8 th , 2011Dec 8 th , 2011
Data model: Row
Row keys are arbitrary strings Row is the unit of transactional consistency
Every read or write of data under a single rowis atomic
Data is maintained in lexicographic order by rowkey
Rows with consecutive keys (Row Range) aregrouped together as tablets . Unit of distribution and load-balancing reads of short row ranges are efficient and
typically require communication with only asmall number of machines
7/29/2019 Google's Database
7/51
Dec 8 th , 2011Dec 8 th , 2011
Data model: Column
Column keys are grouped into sets called column families , which form the unit of access control.
Data stored under a column family is usually of thesame type
A column family must be created before data canbe stored in a column key
After a family has been created, any column keywithin the family can be used for queries
Column key is named using the following syntax:family :qualifier
Access control and disk/memory accounting areperformed at column family level
7/29/2019 Google's Database
8/51
Dec 8 th , 2011Dec 8 th , 2011
Data model: timestamps
Each cell in Bigtable can contain multiple versionsof data, each indexed by timestamp Timestamps are 64-bit integers Assigned by:
Bigtable: real-time in microseconds Client application: when unique timestamps are
a necessity Data is stored in decreasing timestamp order, so
that most recent data is easily accessed Application specifies how many versions (n) of data items
are maintained in a cell Bigtable garbage collects obsolete versions
7/29/2019 Google's Database
9/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Example: Zoo
7/29/2019 Google's Database
10/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Example: Zoo
row key col. key timestamp
7/29/2019 Google's Database
11/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Example: Zoo
row key col. key timestamp
- (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs- (zebras, weight, 2006) --> 620 lbs
7/29/2019 Google's Database
12/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Example: Zoo
row key col. key timestamp
- (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs- (zebras, weight, 2006) --> 620 lbs
Each key is sorted inLexicographic order
7/29/2019 Google's Database
13/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Example: Zoo
row key col. key timestamp
- (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs- (zebras, weight, 2006) --> 620 lbs
Timestamp orderingis defined as mostrecent appearsfirst
7/29/2019 Google's Database
14/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Example: Web Indexing
7/29/2019 Google's Database
15/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
7/29/2019 Google's Database
16/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Row
7/29/2019 Google's Database
17/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Columns
7/29/2019 Google's Database
18/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Cells
7/29/2019 Google's Database
19/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
timestamps
7/29/2019 Google's Database
20/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Column family
7/29/2019 Google's Database
21/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Column famil y
family: qualifier
7/29/2019 Google's Database
22/51
Dec 8 th , 2011Dec 8 th , 2011
Data Model
Column family
family: qualifier
7/29/2019 Google's Database
23/51
Dec 8 th , 2011Dec 8 th , 2011
Client APIs
Bigtable APIs provide functions for: Creating/deleting tables, column families Changing cluster , table and column family
metadata such as access control rights Support of single row transactions Allowing cells to be used as integer counters Executing client supplied scripts in the address
space of servers
7/29/2019 Google's Database
24/51
Dec 8 th , 2011Dec 8 th , 2011
Building Blocks
Bigtable uses the distributed Google FileSystem (GFS) to store log and data files
The Google SSTable file format is usedinternally to store Bigtable data
An SSTable provides a persistent , orderedimmutable map from keys to values Operations are provided to look up the value
associated with a specified key, and to iterate overall key/value pairs in a specified key range
7/29/2019 Google's Database
25/51
Dec 8 th , 2011Dec 8 th , 2011
Building Blocks
Bigtable relies on a highly-available andpersistent distributed lock service calledChubby
Chubby provides a namespace that consistsof directories and small files. Eachdirectory or file can be used as a lock Consists of 5 active replicas, one replica is the
master and serves requests Service is functional when majority of thereplicas are running and in communication withone another when there is a quorum
7/29/2019 Google's Database
26/51
Dec 8 th , 2011Dec 8 th , 2011
Building Blocks
Shared pool of machines that also run other distributed applications
7/29/2019 Google's Database
27/51
Dec 8 th , 2011Dec 8 th , 2011
SSTable
An SSTable provides a persistent , orderedimmutable map from keys to values
Each SSTable contains a sequence of blocks A block index (stored at the end of SSTable)
is used to locate blocks Each SSTable contains a sequence of blocks A block index (stored at the end of SSTable)
is used to locate blocks The index is loaded into memory when the
SSTable is open One disk access to get the block
7/29/2019 Google's Database
28/51
Dec 8 th , 2011Dec 8 th , 2011
SSTable
64KBBlock
64KBBlock
64KBBlock
Index (block ranges)
7/29/2019 Google's Database
29/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet
Contains some range of rows of the table Built out of multiple SSTables
Index
64Kblock
64Kblock
64Kblock
SSTable
Index
64Kblock
64Kblock
64Kblock
SSTable
Tablet Start :aardvark End:apple
7/29/2019 Google's Database
30/51
Dec 8 th , 2011Dec 8 th , 2011
Table
Multiple tablets make up the table SSTables can be shared Tablets do not overlap, SSTables can overlap
SSTable SSTable SSTable SSTable
Tabletaardvark apple
Tabletapple boat
7/29/2019 Google's Database
31/51
Dec 8 th , 2011Dec 8 th , 2011
Organization
A Bigtable cluster stores tables Each table consists of tablets Initially each table consists of one tablet As a table grows it is automatically split into
multiple tablets Tablets are assigned to tablet servers
Multiple tablets per server. Each tablet is 100-200 MB
Each tablet lives at only one server
7/29/2019 Google's Database
32/51
Dec 8 th , 2011Dec 8 th , 2011
Implementation: Three major components
A library that is linked into every client One master server
Assigning tablets to tablet servers Detecting the addition and deletion of tablet servers
Balancing tablet-server load Garbage collection of files in GFS
Many tablet servers Tablet servers manage tablets Tablet server splits tablets that get too big
Client communicates directly with tablet serverfor reads/writes.
7/29/2019 Google's Database
33/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Location
Since tablets move around from server toserver, given a row, how do clients find theright machine? Need to find a tablet whose row range covers
the target row
7/29/2019 Google's Database
34/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Location
A 3-level hierarchy analogous to that of a B+-tree to store tablet location information : A file stored in chubby contains location of the root
tablet Root tablet contains location of Metadata tablets
The root tablet never splits Each meta-data tablet contains the locations of a set of
user tablets
Client reads the Chubby file that points to the roottablet This starts the location process
Client library caches tablet locations
Moves up the hierarchy if location N/A
7/29/2019 Google's Database
35/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet AssignmentBigTable
BigTable Master
Performs metadata opsand load balancing
BigTable Tablet Server BigTable Tablet Server
Serves data Serves data
Cluster scheduling system GFS Chubby
Holds tabletdata, logs
Holds metadata, handlesmaster election
Handles failover,monitoring
BigTable Client
BigTable ClientLibrary
7/29/2019 Google's Database
36/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Server
When a tablet server starts, it creates andacquires exclusive lock on, a uniquely-namedfile in a specific Chubby directory Call this servers directory
A tablet server stops serving its tablets ifit loses its exclusive lock This may happen if there is a network
connection failure that causes the tablet serverto lose its Chubby session
7/29/2019 Google's Database
37/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Server
A tablet server will attempt to reacquirean exclusive lock on its file as long as thefile still exists
If the file no longer exists then the tabletserver will never be able to serve again Kills itself At some point it can restart; it goes to a pool of
unassigned tablet servers
7/29/2019 Google's Database
38/51
Dec 8 th , 2011Dec 8 th , 2011
Master Startup Operation
Upon start up the master needs to discoverthe current tablet assignment. Grabs unique master lock in Chubby
Prevents concurrent master instantiations Scans servers directory in Chubby for live servers Communicates with every live tablet server
Discover all tablets Scans METADATA table to learn the set of
tablets Unassigned tablets are marked for assignment
7/29/2019 Google's Database
39/51
Dec 8 th , 2011Dec 8 th , 2011
Master Operation
Detect tablet server failures/resumption Master periodically asks each tablet server
for the status of its lock
7/29/2019 Google's Database
40/51
Dec 8 th , 2011Dec 8 th , 2011
Master Operation
Tablet server lost its lock or mastercannot contact tablet server: Master attempts to acquire exclusive lock on
the servers file in the servers directory
If master acquires the lock then the tabletsassigned to the tablet server are assigned toothers
Master deletes the servers file in the
servers directory Assignment of tablets should be balanced If master loses its Chubby session then it
kills itself An election can take place to find a new master
7/29/2019 Google's Database
41/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Server Failure
Tablet server
GFS Chunkserver
SSTable SSTable SSTable
Tablet Tablet Tablet
Tablet server
GFS Chunkserver
SSTable
(replica)SSTable
SSTable
Tablet Tablet Tablet
(replica)SSTable
Logicalview:
Physicallayout:
SSTable
Chubby Server Master
7/29/2019 Google's Database
42/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Server Failure
Chubby Server
Tablet server
GFS Chunkserver
SSTable SSTable SSTable
Tablet Tablet Tablet
Tablet server
GFS Chunkserver
SSTable
(replica)SSTable
SSTable
Tablet Tablet Tablet
(replica)SSTable
Logicalview:
Physicallayout: SSTable
X
X X X X
Master
7/29/2019 Google's Database
43/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Server Failure
Tablet server
GFS Chunkserver
SSTable SSTable SSTable
Tablet Tablet Tablet
(replica)SSTable
Logicalview:
Physicallayout:
Tablet
(other tablet serversdrafted to serve other abandoned tablets )
Backup copy of tabletmade primary
Message sent to tabletserver by master
Extra replica of tablet created automatically by GFS
Chubby Server Master
7/29/2019 Google's Database
44/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Serving
Commit log stores the updates that aremade to the data
Recent updates are stored in memtable
Older updates are stored in SStable files
7/29/2019 Google's Database
45/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Serving
Recovery process Reads/Writes that arrive at tablet server
o Is the request Well-formed?o Authorization: Chubby holds the permission
fileo If a mutation occurs it is wrote to commit log
and finally a group commit is used
7/29/2019 Google's Database
46/51
Dec 8 th , 2011Dec 8 th , 2011
Tablet Serving
Tablet recovery processo Read metadata containing SSTABLES and redo
points
o Redo points are pointers into any commit logso Apply redo points
7/29/2019 Google's Database
47/51
Dec 8 th , 2011Dec 8 th , 2011
Compactions
Minor compaction convert the memtableinto an SSTable Reduce memory usage Reduce log traffic on restart
Merging compaction Reduce number of SSTables Good place to apply policy keep only N
versions
Major compaction Merging compaction that results in only oneSSTable
No deletion records, only live data
7/29/2019 Google's Database
48/51
Dec 8 th , 2011Dec 8 th , 2011
Refinements
Locality groups Clients can group multiple column families
together into a locality group. Compression
Compression applied to each SSTable blockseparately
Uses Bentley and McIlroy's scheme and fast compression algorithm
Caching for read performance Uses Scan Cache and Block Cache
Bloom filters Reduce the number of disk accesses
7/29/2019 Google's Database
49/51
Dec 8 th , 2011Dec 8 th , 2011
Refinements
Commit-log implementation Suppose one log per tablet rather have one logper tablet server
Exploiting SSTable immutability No need to synchronize accesses to file system
when reading SSTables Concurrency control over rows efficient Deletes work like garbage collection on
removing obsolete SSTables Enables quick tablet split: parent SSTables
used by children
7/29/2019 Google's Database
50/51
Dec 8 th , 2011Dec 8 th , 2011
Source
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C.Hsieh, Deborah A. Wallach, Mike Burrows, TusharChandra, Andrew Fikes, and Robert E. Gruber.Bigtable: A Distributed Storage System forStructured Data. In 7 th OSDI (2006)
http://www.csd.uwo.ca/courses/CS9842/Lectures /google- bigtable .ppt
http://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppt7/29/2019 Google's Database
51/51
Dec 8 th , 2011
2007 The Board of Regents of the University of Nebraska All rights reserved
Thanks
Top Related