Extreme computing Databases and cloud computing · Hive and Pig Stratis D. Viglas . Databases and...

Extreme computing
Databases and cloud computing

Stratis D. Viglas

School of Informatics, University of Edinburgh

Stratis D. Viglas www.inf.ed.ac.uk

Databases and cloud computing Overview

Outline

Databases and cloud computing
• Overview
• Relational databases
• Relational data processing on Hadoop MR
• NoSQL databases
• BigTable
• Hive and Pig



Where’s your data?

• Unprecedented dataset scale
  • Petabyte-scale is ubiquitous (e.g., eBay, Facebook, CERN and scientific data in general)
  • Produced at a scale of terabytes per day
  • Power law: a few very large datasets and a lot of small ones
[Figure: power-law curve; axes: scale vs. number of datasets]
• Most datasets are structured
  • Query logs, click logs, sale records, user preferences
• Objective: large-scale data analytics
  • Relational databases meet MapReduce



Relational databases vs. MapReduce

• Designed and optimised for solving different problems
• Common ground, but also great differences

Relational DBs
• Long- and short-running queries
• Read and write workloads
• Transactional semantics (ACID)
• Fixed schema, integrity constraints
• 35 years of tools, extensions, data types
• SQL for declarative query processing, query optimisation

MapReduce
• Cluster-based data processing, fault tolerance
• No schema; up to the application to interpret data
• Imperative paradigm
• No standard query language; anything goes as long as it maps to the MR dataflow
• Programmer has complete control



Typical database workloads

• Online transaction processing (OLTP)
  • Real-time, low latency, highly-concurrent
  • Relatively small set of fixed transactional queries
  • Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• Online analytical processing (OLAP)
  • Batch workloads, less concurrency
  • Complex long-running analytical queries, often ad-hoc
  • Data access pattern: table scans, large amounts of data involved per query
• Typically, organisations use two DB instances
  • OLTP frontend → OLAP backend
  • Frontend optimised for transactions, backend optimised for analytics


Databases and cloud computing Relational databases






Three basic building blocks

• Attribute (aka field)
  • A (name, value) pair
  • Example: SID = 123-ABC
• Tuple (aka record, row)
  • A set of attributes
  • Example: (SID: 123-ABC, Name: Mary Jones, ..., Year: 4)
• Relation (aka table)
  • A set of tuples with the same schema

SID      Name        ...  Year
123-ABC  Mary Jones  ...  4
456-DEF  John Smith  ...  3
...      ...         ...  ...
999-XYZ  Jack Black  ...  4



Data manipulation

• Isolate a subset of a single relation: selection (σ), projection (π)
• Set operations: intersection, union, cross product, set difference
• More complex operations: joins (⋈), semi-joins, ...

Operators shown in the figure: σ_{year=3}, π_{name}, Student × Course, Student ⋈_{student.year = course.year} Course

Student
SID      Name        Year
123-ABC  Mary Jones  4
456-DEF  John Smith  3
999-XYZ  Jack Black  4

Course
CID   Name            Year
ADBS  Adv. Databases  4
QSX   Querying XML    4

Student ⋈_{student.year = course.year} Course
SID      Name        Year  CID   Name            Year
123-ABC  Mary Jones  4     ADBS  Adv. Databases  4
123-ABC  Mary Jones  4     QSX   Querying XML    4
999-XYZ  Jack Black  4     ADBS  Adv. Databases  4
999-XYZ  Jack Black  4     QSX   Querying XML    4



What can we do with MapReduce?

• MapReduce is a dataflow framework
  • But writing a Java program to compute an average is time-consuming, verbose, and kind of dumb
  • Can’t ask an analyst to do that; can’t ask an IT department to implement one on demand
• Lessons from relational DBs
  • Declarative query processing: specify what should be retrieved, not how
  • Leave it to the system to optimise processing
  • High-level data models and processing languages
• Other options, revisiting database issues and more tailored for large-scale distribution
  • NoSQL: non-relational approaches to storing and retrieving data
  • BigTable (and HBase): a different data and physical model, more tailored towards large-scale analytics and distributed processing
  • Hive and Pig: data processing languages


Databases and cloud computing Relational data processing on Hadoop MR






Selections and projections

• Basically free in MR
  • Scan input and process it during the map phase
  • For selections, test predicate; for projections, drop fields
  • No need for a reduce phase
• Only limited by how quickly HDFS can stream data
  • Computational load is minimal; network I/O is the highest cost
  • Compression also helps
• Kind of like using a nuclear bomb to kill a mosquito
  • For example, selections are usually evaluated through indexes
  • The key is not to identify which parts of the input satisfy the predicate, but to avoid reading the irrelevant parts in the first place
• In a schema-less world, however, it makes sense
  • Evaluating σ_{age>25}(T) is different if we know that every t ∈ T has an attribute age in the 4th position than if we don’t know which position it is in, or whether all rows have it
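A map-only selection and projection can be sketched in a few lines of Python. This is a minimal sketch, not Hadoop code: it assumes comma-separated text rows with the name in position 1 and age in position 4, and the function name `map_select_project` is made up for illustration. Rows that lack the field or hold a non-numeric age are skipped, which is the schema-less case mentioned above.

```python
from typing import Iterable, Iterator

def map_select_project(lines: Iterable[str]) -> Iterator[str]:
    """Map-only job: keep rows with age > 25 and emit only (name, age)."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Schema-less input: a row may have no 4th field at all.
        if len(fields) < 4:
            continue
        try:
            age = int(fields[3])
        except ValueError:
            continue                      # 4th field is not a number
        if age > 25:                      # selection: test the predicate
            yield f"{fields[0]},{age}"    # projection: drop the other fields

# Hypothetical input rows, including two malformed ones.
rows = ["ann,x,y,31", "bob,x,y,19", "cat,x", "dan,x,y,n/a", "eve,x,y,26"]
result = list(map_select_project(rows))
```

No reduce step appears because every output row depends on a single input row.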



Sorting

• One of the most fundamental operations in any type of data processing
• MapReduce will always sort input to reducers by group key
  • Values within a group are arbitrarily sorted
• What if we want to sort by value also?
  • For example, k → (v1, r), (v3, r2), (v4, r), ...
• Easy way out: store values in memory and sort them
  • Does not scale; what if the elements of a group do not fit in memory?



Secondary sorting

• Working solution: value-to-key conversion
  • Also known as secondary sorting
• Form a composite intermediate key and let the framework do the sorting
  • Key becomes the (k, v) pair and not simply k
• Before: k → (v1, r), (v8, r2), (v4, r), (v3, r), ...
  • Values from the same group arrive in arbitrary order
• After:
  (k, v1) → (v1, r)
  (k, v3) → (v3, r)
  (k, v4) → (v4, r)
  (k, v8) → (v8, r)
  ...
  • Values from the same group arrive in sorted order
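The value-to-key conversion can be simulated in plain Python. This is a sketch under assumptions: the shuffle is imitated with an in-memory sort, and the function name and sample records are invented for illustration. In Hadoop the same effect also needs a partitioner and a grouping comparator that look only at k, so that all (k, v) keys of one group reach the same reduce call.

```python
from itertools import groupby

def secondary_sort(records):
    """Simulate the shuffle with a composite (k, v) key; records are (k, v, r)."""
    # Map: move the value into the key.
    intermediate = [((k, v), (v, r)) for k, v, r in records]
    # Shuffle: the framework sorts by the full composite key ...
    intermediate.sort(key=lambda kv: kv[0])
    # ... but groups on k alone, so each reducer call sees one k.
    for k, group in groupby(intermediate, key=lambda kv: kv[0][0]):
        yield k, [value for _, value in group]

# Hypothetical records: values of group "k" arrive out of order.
records = [("k", 8, "r2"), ("k", 1, "r1"), ("k", 4, "r3"), ("j", 3, "r4"), ("k", 3, "r5")]
result = list(secondary_sort(records))
```

The reducer never buffers a whole group to sort it; it simply consumes values in the order the framework delivers them.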



Aggregation

• Type of query MapReduce has been designed for
  • In SQL: select url, avg(time) from visits group by url
• Easy to perform in MapReduce
  • Map over records, use grouping attribute(s) as the key (url in the previous example)
  • MapReduce will automatically group values by keys
  • Compute the aggregate (average in the example) in reduce phase
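The query above maps onto a map function that emits (url, time) and a reduce function that averages each group. The sketch below imitates the grouping step with a dictionary; the function names and the sample visits are assumptions made for illustration, not part of any Hadoop API.

```python
from collections import defaultdict

def map_visit(record):
    """Emit the grouping attribute (url) as the key and time as the value."""
    url, time = record
    yield url, time

def reduce_avg(url, times):
    """Compute the aggregate for one group."""
    return url, sum(times) / len(times)

def run(visits):
    groups = defaultdict(list)
    for record in visits:                      # map phase
        for key, value in map_visit(record):
            groups[key].append(value)          # grouping done by the framework
    return dict(reduce_avg(url, times) for url, times in sorted(groups.items()))

# Hypothetical visit records: (url, time).
visits = [("a.com", 10), ("b.com", 4), ("a.com", 20), ("b.com", 6), ("a.com", 30)]
averages = run(visits)
```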



Relational joins

• The join operation is ubiquitous in DB query processing
  • Any single query with two or more sources will need to have a join (even in the form of a cross product)
• Any DBMS spends most of its time evaluating joins
  • Probably the most optimised physical operator
  • Radically different join evaluation algorithms
  • More so when moving to a distributed environment
• MR comes with its own join algorithms
  • Pretty far from a distributed or parallel DB join algorithm
• Choosing a join algorithm is not straightforward
  • The choice might depend on the size of the input, its properties, available memory



Reduce-side join

• Group by join key
  • Map over both sets of tuples
  • Tag each tuple with an input identifier
    • So we can identify where each tuple came from
  • Emit tuple as value with join key as the intermediate key
  • Runtime brings together tuples sharing the same key
  • Perform actual join in reducer
• Similar to a sort-merge join in relational database terminology
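The steps above can be sketched as follows. This is a minimal in-memory simulation, not Hadoop code: the tags "R" and "S", the function names, and the use of the Student and Course relations as inputs are choices made for illustration.

```python
from collections import defaultdict

def map_tagged(relation_name, tuples, key_index):
    """Emit (join key, (tag, tuple)) so the reducer knows each tuple's origin."""
    for t in tuples:
        yield t[key_index], (relation_name, t)

def reduce_join(key, tagged):
    """Join all R tuples with all S tuples that share this key."""
    r_side = [t for tag, t in tagged if tag == "R"]   # buffered in memory
    s_side = [t for tag, t in tagged if tag == "S"]
    return [(r, s) for r in r_side for s in s_side]

def reduce_side_join(R, S, r_key, s_key):
    groups = defaultdict(list)
    emitted = list(map_tagged("R", R, r_key)) + list(map_tagged("S", S, s_key))
    for key, value in emitted:                 # shuffle: group by join key
        groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reduce_join(key, groups[key]))
    return out

students = [("123-ABC", "Mary Jones", 4), ("456-DEF", "John Smith", 3), ("999-XYZ", "Jack Black", 4)]
courses = [("ADBS", "Adv. Databases", 4), ("QSX", "Querying XML", 4)]
pairs = reduce_side_join(students, courses, 2, 2)   # join on year
```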



Reduce-side join

• In this example, assume |R| < |S|
• Everything takes place in the reducer
• Buffer R groups in main memory
• Scan corresponding S partition forward to compute join pairs
• What if groups don’t fit in memory?
[Figure: interleaved R and S records (S5, R6, S3, S10, S8, R9, R12); labels: "keep in main memory" and "scan forward and cross-reference with records from other set"]



Map-side join

[Figure: two sorted inputs, R (R1, R2, R3, R4, R5, R8, R10) and S (S1, S3, S5, S6, S9, S11, S12); label: "scan forward to compute join"]
• Relational merge-join
  • If both inputs are sorted on join key, the join can be computed in one sequential scan
• Partition and sort both inputs in parallel
  • Partition inputs consistently in terms of ranges
  • E.g., 0-30, 31-60, 61-90, ...
• If both inputs are already partitioned, join can be computed in the Map phase
  • Reduce phase not necessary
• Keep inputs pre-partitioned on the join key
  • Similar to clustering (or even indexing) in relational databases
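A map-side join amounts to running a merge join independently on each pair of matching partitions. The sketch below assumes each partition is a list of (key, payload) pairs already sorted on the key, and that partition i of R covers the same key range as partition i of S; the function names and sample data are hypothetical.

```python
def merge_join(r_sorted, s_sorted):
    """Join two lists of (key, payload) pairs, both sorted on key."""
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][0], s_sorted[j][0]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Find the run of equal keys in S, then pair it with every matching R.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][0] == rk:
                j_end += 1
            while i < len(r_sorted) and r_sorted[i][0] == rk:
                out.extend((rk, r_sorted[i][1], s[1]) for s in s_sorted[j:j_end])
                i += 1
            j = j_end
    return out

def map_side_join(r_partitions, s_partitions):
    """Each (R_i, S_i) pair would be handled by one mapper; no reduce phase."""
    out = []
    for r_part, s_part in zip(r_partitions, s_partitions):
        out.extend(merge_join(r_part, s_part))
    return out

# Two range partitions: keys 0-30 and 31-60.
r_parts = [[(3, "r3"), (5, "r5"), (5, "r5b")], [(31, "r31"), (40, "r40")]]
s_parts = [[(1, "s1"), (5, "s5")], [(40, "s40"), (40, "s40b"), (60, "s60")]]
joined = map_side_join(r_parts, s_parts)
```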



In-memory join

• Scenario: two relations R and S where |R| ≪ |S| and R fits into main memory
• Typical case: a key-foreign key join in normalised schemata, or a facts-dimensions join in a data warehouse
• MapReduce implementation is a variant of map-side join, based on replication (no need for a reduce phase)
  1. Distribute R to all workers
  2. Run map phase over S; each mapper loads R in memory and builds a hash table for it
  3. For every s ∈ S, probe the hash table for R for matches and output each matching ⟨s, r⟩, r ∈ R pair
• If R does not fit into main memory
  • Divide it into n subsets R_i, i = 1, 2, 3, ..., n, such that each R_i fits in main memory
  • R ⋈ S = ⋃_{i=1}^{n} (R_i ⋈ S)
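The three steps, plus the fallback of splitting R into chunks, might look like the sketch below. It is an in-memory illustration under assumptions: "distributing R" is represented by passing the same list to every mapper call, and all names (`build_hash_table`, `map_probe`, `chunk_size`) and the sample relations are hypothetical.

```python
from collections import defaultdict

def build_hash_table(R, key_index):
    """Step 2: each mapper builds a hash table over its copy of R."""
    table = defaultdict(list)
    for r in R:
        table[r[key_index]].append(r)
    return table

def map_probe(s_split, table, key_index):
    """Step 3: probe the table for every s and emit matching (s, r) pairs."""
    for s in s_split:
        for r in table.get(s[key_index], []):
            yield s, r

def in_memory_join(R, s_splits, r_key, s_key, chunk_size=None):
    """Join R with S; if chunk_size is given, process R as R_1, ..., R_n."""
    chunk_size = chunk_size or max(len(R), 1)
    out = []
    for start in range(0, len(R), chunk_size):
        table = build_hash_table(R[start:start + chunk_size], r_key)
        for split in s_splits:                 # one mapper per split of S
            out.extend(map_probe(split, table, s_key))
    return out

courses = [("ADBS", "Adv. Databases"), ("QSX", "Querying XML")]           # small R
enrolments = [[("123-ABC", "ADBS"), ("999-XYZ", "QSX")],                   # splits of S
              [("456-DEF", "ADBS"), ("456-DEF", "NONE")]]
whole = in_memory_join(courses, enrolments, 0, 1)
chunked = in_memory_join(courses, enrolments, 0, 1, chunk_size=1)
```

Chunking changes the order of the output but not its contents, which is what the union over R_i ⋈ S states.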



Which join to use when?

• If there is enough memory to hold the smaller relation, use in-memory join
• If both inputs are sorted and pre-partitioned consistently on the join attributes, use map-side join
• If map-side joins are not applicable, use a reduce-side join since it is the most general and always applicable


Databases and cloud computing NoSQL databases






Motivation

• Two potential bottlenecks with RDBMSs
  1. Schema rigidity: not optimised for evolving and/or non-uniform schemata
  2. Scale-out: sharding and partitioning work great but are hard to get right
• Three driving application scenarios
  1. In the majority of (Web) applications we only need a key-value interface
     • The rest of the information is relatively free-form
  2. Data consistency is not critical
     • Critical data will be managed by a persistent transactional engine
  3. Automatic scale-out
     • Adding both data and hardware should be transparent
• Hence, NoSQL stores were introduced¹

¹ Term evolution: started as ‘No means no’, became ‘No means not-only’, now ‘NewSQL’ is picking up traction.



Assumptions and use-cases

• Datasets
  • Data does not fit in one server or a single rack, and SANs (Storage Area Networks²) are too expensive
  • Data partitioning is imperative
• Reliability
  • System must be continuously available to serve data
  • Machines and disks will fail; data and availability should not be compromised
  • Data replication is imperative
• Performance and trade-offs
  • Commodity boxes and disks
  • Good performance and availability on straightforward setups

² Dedicated network that provides consolidated access to block-level storage.


Classification

• Key-value stores
  • Basic association maps
• (Wide-)Column stores
  • Each key is associated with a large number of attributes (columns)
  • Provide a relational-like interface
  • BigTable is the typical example
• Document-centric stores
  • Semi-structured documents (used to be XML, the hip new kid is JSON)
  • Implementation is usually coupled with a high-level dataflow engine (e.g., MapReduce)
• Graph databases
  • Programming language constructs mapped to persistent objects
  • Focus is on object interconnections as opposed to lookups
  • Do not typically scale as well
  • More RDBMS-like in their use-cases



Distributed hash tables

• Started from peer-to-peer systems and file-sharing
  • A lot of your favourite P2P applications work this way
  • Optimised for binary objects
• Evolved into a general distributed way of storing and retrieving key-value associations
• Best-known example: Chord
  • Most other implementations are some permutation of its algorithms
  • Caters for dynamic data, node joining and leaving, and fault tolerance
  • Provides performance guarantees



Distributed hash table basics

• Data is assigned to network peers
• Hash functions are applied on the identifiers of both data and peers
  • Hash functions have a common domain (typical domain size is 2^160 values)
• The closest successor of a data item in the domain becomes responsible for the item
[Figure: the hash domain with a data item d placed at h_d(d.id), a node n placed at h_n(n.id), and each data item assigned to its closest successor node]
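The closest-successor rule can be illustrated with a small Python class. This sketch makes two simplifying assumptions that are not in the slides: a single hash function (SHA-1, whose output happens to be 160 bits) is used for both data and node identifiers, and the whole ring is known to one process. The class name `Ring` and the node names are hypothetical.

```python
import hashlib
from bisect import bisect_left

def h(identifier: str) -> int:
    """Map an identifier into the common 160-bit domain (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(identifier.encode()).digest(), "big")

class Ring:
    def __init__(self, node_ids):
        # Nodes sorted by their position in the hash domain.
        self.points = sorted((h(n), n) for n in node_ids)

    def successor(self, data_id: str) -> str:
        """Return the first node at or after the item's position, wrapping around."""
        idx = bisect_left(self.points, (h(data_id), ""))
        return self.points[idx % len(self.points)][1]

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes)
owner = ring.successor("item-1")
```

Removing a node that is not the owner of an item leaves that item's assignment untouched, which is why this scheme copes well with peers joining and leaving.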



The Chord ring

[Figure: Chord ring with peers n1 to n8]
• N peers ordered on a ring
• Peer n maintains an i-connection to the peer 2^i mod N positions ahead of it on the ring
• Any peer can locate the peer responsible for any data item in log N hops
• Specialised protocol for peers joining and leaving the network
  • Normal operation: only predecessors and successors affected
• Heartbeat messages can test the liveness of a peer
• Data is replicated across a node’s successors
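To see why log N hops suffice, the sketch below routes greedily over connections that are 1, 2, 4, ... positions ahead. It is a deliberate simplification and not the Chord protocol itself: it assumes every ring position 0..N-1 is occupied by a peer, and the function names `fingers` and `route` are invented for illustration.

```python
def fingers(n: int, N: int):
    """Peers that n links to: 2^i positions ahead, for every 2^i < N."""
    bits = max(1, (N - 1).bit_length())
    return [(n + 2 ** i) % N for i in range(bits)]

def route(start: int, target: int, N: int):
    """Greedy lookup: always take the farthest connection that does not overshoot."""
    path, current = [start], start
    while current != target:
        remaining = (target - current) % N
        candidates = [f for f in fingers(current, N) if (f - current) % N <= remaining]
        current = max(candidates, key=lambda f: (f - current) % N)
        path.append(current)
    return path

path = route(1, 0, 8)   # worst case on 8 peers: 7 positions away
```

Each hop removes the largest power of two from the remaining distance, so the number of hops is at most the number of bits needed to write N - 1.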



Variants and usage

• Every node in the system can serve a request, so long as it knows where to propagate it
• Pure Chord implementation uses a progressive propagation algorithm: send the request to the “farthest” node in the identifier space to which there is an immediate connection
• Variants include consistent hashing (only one potential location for a hash) or a directory service
• Amazon’s offerings (SimpleDB and DynamoDB), and LinkedIn’s Project Voldemort³ are the typical examples

³ http://project-voldemort.com



What about consistency?

• We need to have a consistent view of the sequence of updates
  • Say data item x is available at nodes m and n
  • Client a updates the copy at m; some time t passes
  • Client b reads the copy from n; what value does it read?
• Strict consistency: the system should always return the last write
  • Either a single node is responsible for each individual data item
  • Or there is a distributed transaction protocol in place (e.g., 2-phase commit)
  • Neither option scales well
  • Remember the CAP theorem?
• Eventual consistency: as time t → ∞ all nodes will eventually have the latest version
  • You would never run a production database with this consistency level, but it’s good enough for your list of Facebook friends



Wide-column stores

Row-stores
• Great for locality of access: row read/write is a single I/O
• Bad if only interested in a small subset of columns
[Figure: records stored one after another: (John, 25, student), (Julia, 30, entrepreneur), (Justin, 18, joke), ...]

Column-stores
• Single-column data stored sequentially
• Single-row scans are problematic
[Figure: the same values stored column by column: (John, Julia, Justin), (25, 30, 18), (student, entrepreneur, joke)]

Column families and locality groups
• Columns exist by themselves, but can be organised into independent families (or locality groups)
• Row-based within a group
• Column-based across groups
[Figure: the same records split into a multi-column family and a single-column family]



Document stores

• Assume there is some structure associated with the dataset
• The dataset is a document
  • Arbitrarily nested key-value sets
  • Embedded into the document
• The database is a collection of such documents, indexed by key in a B-tree
  • Different portions of the B-tree at different nodes
  • Collection partitioned and replicated at document level
• Ability to index on document attributes

{'user_id': objectid('123456789'),
 'line_items': [
   {'sku': 'jc_123', 'name': 'The best CD ever', 'price': 1099},
   {'sku': 'mi_0', 'name': 'Paper maps for iOS 6', 'price': 395}
 ],
 'shipping': {
   'street': 'Princes street',
   'city': 'Edinburgh',
   'country': 'UK',
   'note': 'First bench on left'
 },
 'subtotal': 1494,
 'tax': 268,
 'total': 1763
}



(De)normalisation

• The first thing you were taught in your undergraduate database course: schema design
• Normalisation is central to this notion: keep separate things separate
• For instance: students taking courses
  • If all information about all courses students take is inlined into their records, then what happens if a course changes information?
  • Must update all student records referring to that course
  • Local changes are not localised
• NoSQL systems usually argue for denormalisation
  • Related things will be retrieved together
  • Updates will be infrequent



Data design for NoSQL

• Data design is not based on functional dependencies, as in relational databases
• Workload-driven design
  • Figure out the use cases and appropriately design your data
  • In the previous example, if the workload usually requests the students and the courses they take, then embed course list in student record
  • If the workload requests the students taking specific courses, then embed student records in courses
  • If both, use both
• Query languages and queries are not as expressive in NoSQL stores
  • Or rather, if the intention is to retrieve anything other than what the representation was designed for, you’re in trouble



Embedded documents

{title: "Schema design",
 content: "A long post on schema design for NoSQL DBs",
 comments: [
   {username: 'noob',
    text: 'How do you add nested comments?'
   },
   {username: 'expert',
    text: 'Hit ctrl+enter at the end of your comment.'
   },
   {username: 'noob',
    text: 'Thanks!'
   }
 ]
}



Arbitrarily nested embedded documents

{title: "Schema design",
 content: "A long post on schema design for NoSQL DBs",
 comments: [
   {username: 'noob',
    text: 'How do you add nested comments?',
    comments: [
      {username: 'expert',
       text: 'Hit ctrl+enter at the end of your comment.',
       comments: [
         {username: 'noob',
          text: 'Thanks!'
         }
       ]
      }
    ]
   }
 ]
}



How much should we (de)normalise?

• One extreme is complete normalisation, the other extreme is complete denormalisation
• More denormalised design
  • Larger document size
  • Harder and less efficient updates
  • More complex representation
  • Faster queries
• More normalised design
  • Maximum flexibility
  • Maximum update-ability
  • Simplified representation
  • More complicated and potentially slower queries
• Most NoSQL databases cannot do joins
  • And cannot really do much apart from path queries and selections
• The answer is, as usual, “it depends”
  • With NoSQL databases, data design plays a central role
  • No clean interface between conceptual and physical design and querying, as with relational databases


Databases and cloud computing BigTable






A different data model

• BigTable’s data model is not relational
  • A table is “a sparse, distributed, persistent multidimensional sorted map”
  • The map is indexed by a triplet
    • (row:string, column:string, time:int64)
    • row and column are keys, time is a timestamp
• BigTables are mutable at the row level
  • Support for insertions, deletions, lookups
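A toy version of such a map can be written with a sorted list keyed by the (row, column, time) triplet. This is a single-process sketch under assumptions, with nothing distributed or persistent about it: the class name `TinyTable`, its methods, and the choice to store the negated timestamp so that newer versions sort first are all made up for illustration. The sample cell mirrors the com.cnn.www example.

```python
from bisect import insort

class TinyTable:
    """Sorted map from (row, column, time) to value; newest version first."""

    def __init__(self):
        self.cells = []   # sorted list of ((row, column, -time), value)

    def put(self, row, column, time, value):
        insort(self.cells, ((row, column, -time), value))

    def get(self, row, column, time=None):
        """Latest value at or before `time` (latest overall if time is None)."""
        for (r, c, neg_t), value in self.cells:
            if (r, c) == (row, column) and (time is None or -neg_t <= time):
                return value
        return None   # sparse: missing cells simply do not exist

    def scan(self, start_row, end_row):
        """All cells whose row key lies in [start_row, end_row)."""
        return [(k[0], k[1], -k[2], v) for k, v in self.cells if start_row <= k[0] < end_row]

table = TinyTable()
for t in (3, 5, 6):
    table.put("com.cnn.www", "contents:", t, f"<html>v{t}")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
latest = table.get("com.cnn.www", "contents:")
```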



Rows and columns in more detail

Row key: com.cnn.www
Column              Versions
contents:           "<html>..." at t3, t5, t6
anchor:cnnsi.com    "CNN" at t9
anchor:my.look.ca   "CNN.com" at t8

• Rows are maintained in sorted lexicographic order
  • Applications can exploit this property for efficient row scans
  • Row ranges dynamically partitioned into tablets
• Columns grouped into column families
  • Column key = family:qualifier
  • Column families provide locality hints
  • Unbounded number of columns per table



Building blocks: SSTable

• The smallest and most basic building block
  • Persistent immutable map from keys to values
  • Stored in GFS
  • Sequence of disk blocks with a (persistent) index for lookup
  • Memory-mapped for fast operation
• Two supported operations
  • Given a key, look up the value associated with it
  • Iterate over key/value pairs within a given key range
[Figure: an SSTable as a sequence of 64kB blocks plus an index]
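The two operations can be illustrated with an in-memory stand-in. This is a sketch under assumptions: blocks hold a fixed number of entries rather than 64kB of data, the index stores the last key of each block (a choice made here, not taken from the slides), and nothing is written to disk. The class and method names are hypothetical.

```python
from bisect import bisect_left

class SSTable:
    """Immutable sorted map split into blocks, with an index over block end keys."""

    def __init__(self, items, block_size=2):
        pairs = sorted(dict(items).items())
        self.blocks = [pairs[i:i + block_size] for i in range(0, len(pairs), block_size)]
        self.index = [block[-1][0] for block in self.blocks]   # last key per block

    def lookup(self, key):
        """Use the index to pick the only block that can hold the key."""
        b = bisect_left(self.index, key)
        if b == len(self.blocks):
            return None
        return dict(self.blocks[b]).get(key)

    def scan(self, start, end):
        """Iterate over key/value pairs with start <= key < end."""
        b = bisect_left(self.index, start)
        for block in self.blocks[b:]:
            for k, v in block:
                if k >= end:
                    return
                if k >= start:
                    yield k, v

sst = SSTable({"apple": 1, "aardvark": 2, "boat": 3, "applepie": 4, "ant": 5})
in_range = list(sst.scan("ant", "applepie"))
```

There are no update methods on purpose: changes go to newer SSTables, never into an existing one.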



Building blocks: Tablets and Tables

• Dynamically partitioned range of rows
• Built from multiple SSTables
[Figure: a tablet (start: aardvark, end: apple) built from two SSTables, each a sequence of 64kB blocks plus an index]
• Multiple tablets make up a table
• SSTables can be shared between tablets
[Figure: two tablets (aardvark to apple, and applepie to boat) built over four SSTables]



Notes on the architecture

• Similar to GFS
• Single master server, multiple tablet servers
• BigTable master
  • Assigns tablets to tablet servers
  • Detects addition and expiration of tablet servers
  • Balances tablet server load
  • Handles garbage collection
  • Handles schema evolution
• BigTable tablet servers
  • Each tablet server manages a set of tablets
    • Typically between ten and a thousand tablets
    • Each 100-200MB by default
  • Handles read and write requests to the tablets
  • Splits tablets when they grow too large



Location dereferencing

[Figure: metadata hierarchy: Chubby file (master file) → root tablet (1st metadata level) → other metadata tablets → user tables 1 to n]
• Chubby: replicated, persistent lock service; maintains tablet server locations
• Root tablet: root of the metadata tree
• At most three levels in the metadata hierarchy
• B-tree-like structure, indexed by table identifier and end row



Tablet assignment

• Master keeps track of
  • Set of live tablet servers
  • Assignment of tablets to tablet servers
  • Unassigned tablets
• Each tablet is assigned to one tablet server at a time
  • Tablet server maintains an exclusive lock on a file in Chubby
  • Master monitors tablet servers and handles assignment
• Changes to tablet structure
  • Table creation/deletion (master initiated)
  • Tablet merging (master initiated)
  • Tablet splitting (tablet server initiated)



Tablet serving and I/O flow

[Figure: a tablet server with the memtable in memory, and the tablet log and SSTables in GFS; arrows for read and write]
• Write operations are logged (in redo records)
• Recent updates are kept sorted in main memory (the memtable)
• Memtable and SSTables are merged to serve the read request



Tablet management

• Minor compaction
  • Converts the memtable into an SSTable
  • Reduces memory footprint and log traffic on restart
• Merging compaction
  • Reads the contents of a few SSTables and the memtable, and writes out a new SSTable
  • Reduces number of SSTables
• Major compaction
  • Merging compaction that results in only one SSTable
  • No deletion records, only live data
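The write path, the merged read, and the minor and major compactions fit in one small class. This is an in-memory sketch under assumptions: plain dictionaries stand in for SSTables, a Python list stands in for the tablet log on GFS, and the `TOMBSTONE` marker and all method names are invented for illustration.

```python
TOMBSTONE = object()   # deletion record (hypothetical marker)

class Tablet:
    def __init__(self):
        self.log = []        # redo records; stands in for the tablet log
        self.memtable = {}   # recent updates, held in memory
        self.sstables = []   # immutable dicts, oldest first

    def write(self, key, value):
        self.log.append((key, value))   # log first, then apply
        self.memtable[key] = value

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def read(self, key):
        """Merged view: memtable first, then SSTables from newest to oldest."""
        for source in [self.memtable] + self.sstables[::-1]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

    def minor_compaction(self):
        """Freeze the memtable into a new SSTable; its log records are no longer needed."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.log.clear()

    def major_compaction(self):
        """Rewrite everything into one SSTable that holds only live data."""
        self.minor_compaction()
        merged = {}
        for table in self.sstables:     # oldest to newest, so newer values win
            merged.update(table)
        self.sstables = [{k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}]

tablet = Tablet()
tablet.write("a", 1)
tablet.write("b", 2)
tablet.minor_compaction()
tablet.write("a", 3)
tablet.delete("b")
```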


Databases and cloud computing Hive and Pig






High-level data processing

• Hive: data warehousing application in Hadoop
  • Query language is HQL, a variant of SQL
  • Tables stored on HDFS as flat files
  • Developed by Facebook, now open source
• Pig: large-scale data processing system
  • Scripts are written in Pig Latin, a dataflow language
  • Developed by Yahoo!, now open source
  • Roughly 1/3 of all Yahoo! internal jobs
• Common idea
  • Provide higher-level language to facilitate large-data processing
  • Higher-level language is compiled to Hadoop jobs



Hive: background and components

• Started at Facebook⁴
  • Data was collected by nightly cron jobs into Oracle DB
  • Extract-transform-load (ETL) via hand-coded Python
  • Grew from 10s of GBs (2006) to 1TB/day new data (2007), now 10x that
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata processing)
• Metastore: schema, location in HDFS, SerDe

⁴ It had to be good for something apart from wasting my PhD students’ time


Logical and physical models

• Tables
  • Typed columns (int, float, string, boolean)
  • Also: list, map
• Partitions
  • For example, range-partition tables by date
• Buckets
  • Hash partitions within ranges (useful for sampling, join optimization)
• Metastore
  • Database: namespace containing a set of tables
  • Holds table definitions (column types, physical layout)
  • Holds partitioning information
  • Can be stored in Derby, MySQL, and many other relational databases



Hive processing

• Hive uses HQL, a declarative query language close to SQL
• HQL statements are translated into a syntax tree
• Syntax tree is compiled into an execution plan of MapReduce jobs, executed by Hadoop

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC
LIMIT 10;

[Figure: HQL query → abstract syntax tree → MapReduce plan (a sequence of map and reduce stages)]



Pig and Pig Latin

• Similar idea to Hive, but more tailored towards efficiency and a DB-like setting
• Script interface to deploy MapReduce jobs
  • Maintains schema and performs type checking
• Rudimentary optimiser to translate Pig scripts into an efficient physical dataflow
  • Sequence of one or more MapReduce jobs
  • Exploit heuristics and cost model to reduce intermediate data
• Dataflow is scheduled and executed
  • Runtime tracks job progress and any errors



Example Pig Latin script

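-- Load the visit log and canonicalise each URL
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;

-- Load the page ranks and join them with the visits on url
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;

-- Average pagerank of the pages each user visited
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;

-- Keep and store the users whose average exceeds 0.5
GoodUsers = filter UserPageranks by avgpr > '0.5';
store GoodUsers into '/data/good_users';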



Java vs. Pig Latin

[Figure: two bar charts comparing Hadoop and Pig: lines of code (axis 20 to 180) and development time in minutes (axis 50 to 300)]
• Performance on par with raw Hadoop
  • But with 1/20 of the lines of code
  • And with 1/16 of the development time

