
Hadoop Tuning
Philippe Bonnet – [email protected]

(c) Philippe Bonnet 2014

Processing High Volumes of Data

• Need for high throughput and low latency
• Only one solution:

PARTITIONING


Distributed Architecture

[Figure omitted: picture courtesy of Virtual Geek]

• Distributed shared nothing
• On premise or in the cloud (PaaS, SaaS – e.g., AWS or Pivotal)


Partitioned Processing

1. Partitioning (sharding)
   • Defining data partitions + data distribution
2. Resource management
   • OS support on each node + cluster file system
3. Job scheduling
   • Scheduling and distribution of the programs to be executed on partitions
4. Failure handling / recovery
   • Handling node/network failures
5. Programming model
   • Supporting search/exploration/cleaning/integration/learning/mining/OLAP


Partitioning/Sharding

Three ways to partition data:
1. Round robin
2. Hash partitioning
3. Range partitioning

Objectives:
• Fast partitioning
• Minimize skew
• Efficient support for scan/point/range queries

              Fast   Skew   Scan   Point   Range
Round Robin    -      +      +      -       -
Hash           +      -      +      +       -
Range          ~      -      +      ~       +
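
The table above can be made concrete with a small sketch. The following Python functions are illustrative only (not taken from any particular system) and assume rows are (key, value) tuples:

from bisect import bisect_right

def round_robin_partition(rows, n_partitions):
    # Spread rows evenly regardless of key: minimal skew, but point and
    # range queries must consult every partition.
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts

def hash_partition(rows, n_partitions):
    # Partition by hash of the key: a point query touches one partition,
    # but a range query still touches all of them.
    parts = [[] for _ in range(n_partitions)]
    for key, value in rows:
        parts[hash(key) % n_partitions].append((key, value))
    return parts

def range_partition(rows, split_points):
    # Partition by sorted split points: range queries touch few partitions,
    # but a skewed key distribution yields uneven partition sizes.
    parts = [[] for _ in range(len(split_points) + 1)]
    for key, value in rows:
        parts[bisect_right(split_points, key)].append((key, value))
    return parts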


Resource Management

• Processing abstraction
  • Per-node processing
    • Single- vs. multi-threaded
    • With or without HW acceleration (GPU)
• Communication abstraction
  • Within a rack
  • Across racks
  • Across data centers
• Storage abstraction
  • Single address space: sharding is hidden from the programmer
  vs.
  • Multiple address spaces: sharding is explicitly performed by the programmer
• Cluster management
  • Resource allocation (RAM/disk/CPU/network)
  • Admission control


Job scheduling

• Execution is synchronized across multiple nodes in the cluster
  • A batch completes when the slowest node has processed its partition
  • Processing needs to be restarted in case of node failure
    • The entire batch / a job on a single node
    • Jobs restarted on the same / a different node
• Batch processing vs. interactive sessions
• Architecture:
  • Master–slave (simple, single point of failure)
    • Separates the control (master) and data (slaves) planes
  vs.
  • Peer-to-peer (robust, complex)
    • Control must be distributed/arbitrated across a set of nodes


Failure Handling

• Google data:
  • 2009, DRAM: 25K to 70K errors per billion device hours per Mbit; 8% of DIMMs affected by errors
  • 2007, disk drives: annualized failure rate from 1.7% for 1-year-old devices to 8.6% for 3-year-old devices
• Failures are the rule, not the exception!
• Types of failures:
  • Maintenance, HW, SW, poor job scheduling, overload
  • Permanent / transient

Check out the MapReduce tutorial at Sigmetrics 2009.


Failure Handling

• Improving fault tolerance:
  • Introducing redundancy
    • HW redundancy in the data center: might reduce, but does not eliminate, failures
    • Replication
      • Partitions replicated within/across racks/data centers
      • Impacts both availability and performance (a job performed on several replicas in parallel completes when the fastest is done)
      • Introduces a consistency problem
      • Replication model: master–slave (single point of failure) vs. peer-to-peer (needs arbitration)
  • Fast replay (see the sketch after this list)
    • If a failure can be detected quickly on a node, and it is possible to determine which jobs are impacted by this failure, then these jobs can be replayed on the same node (if the failure is transient) or on a different node (if the failure is permanent)
    • This requires an infrastructure for (i) detecting failures (easy when a job fails, hard when a job hangs) and (ii) restricting their impact (e.g., with software containers)
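
A minimal sketch of the fast-replay policy described above, assuming the failure detector already tells us which jobs are impacted and whether the failure is transient; the node list and run_job callback are hypothetical:

def replay_impacted_jobs(impacted_jobs, failed_node, transient, nodes, run_job):
    # Transient failure: retry on the same node.
    # Permanent failure: re-schedule on another node that holds (or can
    # fetch) a replica of the job's partition.
    for job in impacted_jobs:
        if transient:
            run_job(job, failed_node)
        else:
            replacement = next(n for n in nodes if n != failed_node)
            run_job(job, replacement)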


Replica Consistency

• Problem: consider a data item that is replicated. Ideally, all observers see the effects of an update as soon as it is performed. In practice, it takes time to propagate the update to all replicas, and in the meantime, observers of different replicas might get inconsistent readings.
• CAP theorem (E. Brewer, early 00s)
  • Consider the following three properties of large-scale partitioned systems:
    • Availability
    • Single-copy consistency
    • Tolerance to network partitions (e.g., across data centers)
  • Of these three properties, only two can be guaranteed at the same time.

Check out Werner Vogels' blog post on the topic.


Replica Consistency

• Atomic (or external) consistency: the effects of an update are immediately seen by all processes. There is no inconsistency, but reads of a replica might be delayed.
• Strong consistency: all updates are seen by all processes in the same order. As a result, the effects of an update are seen by all observers. There is no inconsistency.
• Weak consistency: observers might see inconsistencies among replicas.
  • Eventual consistency: a form of weak consistency where, at some point, if there are no failures, all replicas reflect the last update.


Programming Model

• Dataflow paradigm: pioneered by parallel database systems (Gamma, early 90s) in the context of shared-nothing clusters, and by the River designs explored by Jim Gray (early 00s).
• A SQL abstraction is provided to the programmer; the system automatically deals with partitioning.
• Operations are either (see the sketch after this list):
  • Algebraic: take one or several partitions as input and produce an output that is re-partitioned across nodes
  • Scalar: take one partition as input and produce a scalar
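
A minimal sketch of the two operator kinds, assuming partitions are plain lists of (key, value) tuples; the function names are illustrative, not Gamma or River APIs:

def algebraic_group_by(partitions, n_out):
    # Algebraic operator: consumes one or several input partitions and
    # produces an output that is re-partitioned (here by hash of the key).
    out = [dict() for _ in range(n_out)]
    for part in partitions:
        for key, value in part:
            bucket = out[hash(key) % n_out]
            bucket[key] = bucket.get(key, 0) + value
    return [sorted(bucket.items()) for bucket in out]

def scalar_count(partition):
    # Scalar operator: consumes one partition and produces a single value.
    return len(partition)

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
print(algebraic_group_by(partitions, 2))      # re-partitioned, aggregated output
print([scalar_count(p) for p in partitions])  # one scalar per input partition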


Programming Model

• Map-Reduce: introduced by Jeff Dean and Sanjay Ghemawat at Google (mid 00s).
• A form of dataflow program composed of two operations derived from functional programming: a unary algebraic operation (map) and a scalar operation (reduce).
  • Mapper: takes a partition as input and generates (key, value) pairs
  • Reducer: takes the list of values for a given key and generates an output
• The system takes care of partition management.


The “hello world” of Map Reduce

Word Count pseudo-code:

Map(document):
  for each word w in document:
    emit(w, "1")

Reduce(string key, Iterator intermediateValues):
  int result = 0
  for each v in intermediateValues:
    result += fromStringToInt(v)
  emit(key, fromIntToString(result))
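
The pseudo-code above can be run almost verbatim through Hadoop's streaming interface (mentioned on the next slide). A minimal Python sketch, assuming the standard streaming contract (the mapper reads raw lines on stdin; the reducer receives the mapper output sorted by key):

# mapper.py
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")          # emit (word, 1) as a tab-separated pair

# reducer.py
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        # Input is sorted by key, so a new word means the previous one is done.
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

One common way to submit it (the exact path of the streaming jar depends on the distribution):
hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input <in> -output <out>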


The Hadoop Ecosystem

• A dense cluster of Apache projects
  • Inspired by Google's groundbreaking work on MapReduce, GFS, Sawzall, BigTable
  • Contributions from industry (Yahoo, Facebook, …) and academia (UC Berkeley's AMPLab and related startups, e.g., Databricks)
• Common, HDFS, YARN, MapReduce
  • Partition management: HDFS is a distributed file system that provides a single address space and manages replication (strong consistency)
  • YARN provides resource management and job scheduling
  • MapReduce provides the programming model (Mapper, Reducer, and a few other interfaces to customize partition management, including checkpointing) in Java, with a streaming interface to incorporate operators implemented in other languages
• Cassandra
  • An alternative/complement to the Hadoop stack for multi-master database management (no single point of failure, eventual consistency, unlike Hadoop)
• HBase, Hive, Pig, Tez, Mahout
  • HBase is a version of Google's BigTable (running on top of Hadoop Common/HDFS)
  • Hive supports OLAP on top of HDFS, easily combined with MapReduce
  • Pig is a dataflow language on top of MapReduce
  • Tez is a dataflow framework on top of YARN
  • Mahout is a library of data mining/machine learning algorithms on top of MapReduce (now moving to Spark)
• Twill
  • A high-level abstraction to configure and manage YARN applications
• Spark
  • A new stack, more performant and more flexible, compatible with Hadoop
  • Based on work led by I. Stoica and S. Shenker at UC Berkeley's AMPLab


Who uses Hadoop?

Check out an extensive list. [Excerpts omitted.]


Massively Parallel Databases

• Also called MPP, for Massively Parallel Processing
• OLAP workloads at scale, on large clusters
• SQL interface, relational schema
• Automatic partition management based on the dataflow paradigm
• Data is loaded into a database system
  • Pros: column store, compression, indexing, operation pipelining (on a node)
  • Con: transaction-level recovery (operators can be marked recoverable so that their state is stored and can be used for fast replay)
• Systems: HP Vertica (based on MIT's C-Store), EMC Pivotal Greenplum (based on PostgreSQL)


Map-Reduce vs. MPP

• A fierce argument: Mike Stonebraker/David DeWitt vs. Jeff Dean
  • Series of CACM articles and blog posts in 2010
    • Map-Reduce and Parallel Databases: Friends or Foes?
    • Map-Reduce: a major step backwards
    • Map-Reduce: a flexible data processing tool
• Hive: SQL interface on top of Map-Reduce (see the Hive Tuning slides)
• New generation of systems transcending the differences
  • Apache Spark


Spark

[Figure omitted: the Berkeley Analytics Stack]


Spark

• The core concept underlying Spark (implemented in Tachyon) is the Resilient Distributed Dataset (RDD).
• Hadoop is not good at reusing data across Mappers (when several Map-Reduce operations are chained), which is common in machine learning and in OLAP (with sequences of interactive queries).
• RDDs provide efficient fault tolerance by (i) constraining the possible transformations on partitions (map, filter, and join), and (ii) managing lineage for each RDD (the set of transformations that led to the current state of a partition). Note that each RDD is immutable.
• See the research papers for details.
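
A minimal PySpark sketch of chained RDD transformations, assuming a local Spark installation; the input path is hypothetical. Each transformation extends the RDD's lineage, which Spark can replay to rebuild a lost partition:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lineage-demo")

lines = sc.textFile("hdfs:///data/access.log")   # base RDD, one partition per HDFS block
errors = lines.filter(lambda l: "ERROR" in l)    # transformation: filter
pairs = errors.map(lambda l: (l.split()[0], 1))  # transformation: map to (key, 1)
counts = pairs.reduceByKey(lambda a, b: a + b)   # transformation: aggregate by key

print(counts.toDebugString())                    # prints the RDD's lineage
print(counts.take(10))                           # action: triggers the actual computation

sc.stop()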