MapReduce Improvements in MapR Hadoop
1
MapReduce Improvements in the MapR Hadoop Distribution
Adam Bordelon, Senior Software Engineer at MapR
Big Data Madison meetup - 9/26/2013
2
What's this all about?
● Background on Hadoop
● Big Data: Distributed Filesystems
● Big Compute:
– MapReduce
– Beyond MapReduce
● Q&A
3
Hadoop History
http://s.wsj.net/public/resources/images/MI-BX925_GOOGLE_G_20130818173254.jpg
4
Big Data: Distributed FileSystems
Volume, Variety, Velocity: Can't have big data without a scalable filesystem
http://www.lbisoftware.com/blog/wp-content/uploads/2013/06/data_mountain1.jpg
5
HDFS Architecture
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
6
HDFS Architectural Flaws
● Created for storing crawled web-page data
● Files cannot be modified once written/closed
– Write-once; append-only
● Files cannot be read before they are closed
– Must batch-load data
● NameNode stores (in memory):
– Directory/file tree, file->block mapping
– Block replica locations
● NameNode only scales to ~100 million files
– Some users run jobs to concatenate small files (a sketch of this workaround follows below)
● Written in Java; slows during GC
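Since the small-files workaround comes up often, here is a minimal sketch of it: packing a directory of small files into one SequenceFile so the NameNode tracks a single file instead of thousands. This is a generic illustration, not a MapR tool; the class name and paths are made up.

```java
// Minimal sketch (hypothetical helper, not from MapR): pack small files
// into a single SequenceFile keyed by original file name.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);  // directory full of small files
    Path packed = new Path(args[1]);  // single output SequenceFile
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (FileStatus st : fs.listStatus(srcDir)) {
        if (st.isDir()) continue;
        // assumes each file is small enough to buffer in memory
        byte[] buf = new byte[(int) st.getLen()];
        FSDataInputStream in = fs.open(st.getPath());
        try { in.readFully(buf); } finally { in.close(); }
        // key = original file name, value = raw file contents
        writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}
```

One packed file replaces thousands of NameNode entries; downstream map tasks read SequenceFile records instead of opening each tiny file.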
7
Solution: MapR FileSystem
● Visionary CTO/Co-Founder: M.C. Srivas
– Ran Google search infrastructure team
– Chief Storage Architect at Spinnaker Networks
● Take a step back: what kind of DFS do we need in Hadoop, a distributed computer?
– Easy, Scalable, Reliable
● Want traditional apps to work with DFS
– Support random Read/Write
– Standard FS interface (NFS)
● HDFS compatible
– Drop-in replacement, no recompile
8
Easy: POSIX-compliant NFS
9
Easy: MapR Volumes
Groups related files/directories into a single tree structure so they can be easily organized, managed, and secured.
● Replication factor
● Scheduled snapshots, mirroring
● Data placement control
– By device-type, rack, or geographic location
● Quotas and usage tracking
● Administrative permissions
100K+ Volumes are okay
10
Scalable: Containers
● Files/directories are sharded into blocks, which are placed into mini-NNs (containers) on disks
● Containers are 16-32 GB disk segments, placed on nodes
● Each container contains:
– Directories & files
– Data blocks
● Replicated on servers
● No need to manage directly; use MapR Volumes
11
Scalable: Container Location DB
[diagram: CLDB maps each container to its hosting nodes, e.g. N1, N2, N3]
● The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order
● Each container has a replication chain
● Updates are transactional
● Failures are handled by rearranging replication
● Clients cache container locations
12
Scalability Statistics
● Containers represent 16-32 GB of data
– Each can hold up to 1 billion files and directories
– 100M containers ≈ 2 exabytes (a very large cluster)
● 250 bytes of DRAM to cache a container
– 25 GB to cache all containers for a 2 EB cluster
– But not necessary; can page to disk
– A typical large 10 PB cluster needs 2 GB
● Container reports are 100x-1000x smaller than HDFS block reports
– Serve 100x more data nodes
– Increase container size to 64 GB to serve a 4 EB cluster
● MapReduce performance not affected
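A quick back-of-envelope check of those figures, assuming a ~20 GB average container size:

10^8 containers × 20 GB ≈ 2 × 10^18 bytes = 2 EB
10^8 containers × 250 bytes ≈ 25 GB of DRAM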
13
Record-breaking Speed

Benchmark                                        | MapR 2.1.1    | CDH 4.1.1     | MapR Speed Increase
Terasort (1x replication, compression disabled)  |               |               |
  Total                                          | 13m 35s       | 26m 6s        | 2X
  Map                                            | 7m 58s        | 21m 8s        | 3X
  Reduce                                         | 13m 32s       | 23m 37s       | 1.8X
DFSIO throughput/node                            |               |               |
  Read                                           | 1003 MB/s     | 656 MB/s      | 1.5X
  Write                                          | 924 MB/s      | 654 MB/s      | 1.4X
YCSB (50% read, 50% update)                      |               |               |
  Throughput                                     | 36,584.4 op/s | 12,500.5 op/s | 2.9X
  Runtime                                        | 3.80 hr       | 11.11 hr      | 2.9X
YCSB (95% read, 5% update)                       |               |               |
  Throughput                                     | 24,704.3 op/s | 10,776.4 op/s | 2.3X
  Runtime                                        | 0.56 hr       | 1.29 hr       | 2.3X

Benchmark hardware configuration: 10 servers, 12x2 cores (2.4 GHz), 12x2 TB disks, 48 GB RAM, 1x10GbE.

NEW WORLD RECORD: broke the TeraSort minute barrier
        | MapR w/ Google | Apache Hadoop
  Time  | 54s            | 62s
  Nodes | 1003           | 1460
  Disks | 1003           | 5840
  Cores | 4012           | 11680
14
Reliable: CLDB High Availability
● As easy as installing the CLDB role on more nodes
– Writes go to the CLDB master, replicated to slaves
– CLDB slaves can serve reads
● Container metadata is distributed, so the CLDB only stores/recovers container locations
– Instant restart (<2 seconds), no single point of failure
● Shared-nothing architecture
● (Multinode NFS HA too)
15
vs. Federated NN, NN HA
● Federated NameNodes
– Statically partition namespaces (like Volumes)
– Need an additional NN (plus a standby) for each namespace
– Federated NN only in Hadoop 2.x (beta)
● NameNode HA
– The NameNode is responsible for both fs-namespace (metadata) info and block locations; more data to checkpoint/recover
– Starting a standby NN from cold state can take tens of minutes for metadata, an hour for block locations; need a hot standby
– Metadata state:
● All namespace edits are logged to shared (NFS/NAS) R/W storage, which must also be HA; the standby polls the edit log for changes
● Or use the Quorum Journal Manager, a separate service on separate nodes
– Block locations:
● Data nodes send block reports, location updates, and heartbeats to both NNs
16
Reliable: Consistent Snapshots
● Automatic de-duplication
● Saves space by sharing blocks
● Lightning fast
● Zero performance loss on writing to original
● Scheduled, or on-demand
● Easy recovery with drag and drop
17
Reliable: Mirroring
18
MapR Filesystem Summary
● Easy
– Direct Access NFS
– MapR Volumes
● Fast
– C++ vs. Java
– Direct disk access, no layered filesystems
– Lockless transactions
– High-speed RPC
– Native compression
● Scalable
– Containers, distributed metadata
– Container Location DB
● Reliable
– CLDB High Availability
– Snapshots
– Mirroring
19
Big Compute: MapReduce
http://developer.yahoo.com/hadoop/tutorial/module4.html
20
Fast: Direct Shuffle
● Apache Shuffle
– Write map-outputs/spills to local file system
– Merge partitions for a map output into one file, index into it
– Reducers request partitions from mappers' HTTP servlets
● MapR Direct Shuffle
– Write to Local Volume in MapR FS (rebalancing)
– Map-output file per reducer (no index file)
– Send shuffleRootFid with MapTaskCompletion on heartbeat
– Direct RPC from Reducer to Mapper using Fid
– Copy is just a file-system copy; no HTTP overhead (see the toy sketch after this list)
– More copy threads, wider merges
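To make the contrast concrete, here is a toy sketch; this is emphatically not MapR's shuffle code, and the paths and the shuffleRoot handle are stand-ins for the real shuffleRootFid mechanism. The point it shows: once map outputs live in a distributed filesystem, a reducer's copy phase collapses to a plain file read.

```java
// Toy illustration only - not MapR's shuffle implementation.
// With map outputs in a distributed FS, fetching a partition is a plain
// file read rather than an HTTP request to the mapper's servlet.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ShuffleFetchSketch {
  // shuffleRoot stands in for the shuffleRootFid a TaskTracker reports
  // with map completion; one output file per reducer, no index file.
  static void fetchPartition(FileSystem fs, Path shuffleRoot,
                             int mapId, int reduceId,
                             FSDataOutputStream mergeSink) throws IOException {
    Path part = new Path(shuffleRoot, "map_" + mapId + "/reduce_" + reduceId);
    FSDataInputStream in = fs.open(part);
    try {
      IOUtils.copyBytes(in, mergeSink, 64 * 1024, false);  // plain FS copy
    } finally {
      in.close();
    }
  }
}
```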
21
Fast: Express Lane
● Long-running jobs shouldn't hog all the slots in the cluster and starve small, fast jobs (e.g. Hive queries)
● One or more small slots reserved on each node for running small jobs
● Small jobs: <10 maps/reduces, small input, a time limit (an illustrative config sketch follows)
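The thresholds that define a "small" job are configurable. A minimal sketch, assuming the mapr.fairscheduler.smalljob.* property names from MapR 2.x-era documentation; treat every name and value below as unverified and check the official docs:

```java
// Hedged sketch: Express Lane tuning via Hadoop Configuration.
// All property names here are assumptions from MapR 2.x-era docs.
import org.apache.hadoop.conf.Configuration;

public class ExpressLaneConfig {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // enable small-job scheduling (assumed property name)
    conf.setBoolean("mapr.fairscheduler.smalljob.schedule.enable", true);
    // limits under which a job qualifies as "small" (assumed names)
    conf.setInt("mapr.fairscheduler.smalljob.max.maps", 10);
    conf.setInt("mapr.fairscheduler.smalljob.max.reducers", 10);
    conf.setLong("mapr.fairscheduler.smalljob.max.inputsize",
                 10L * 1024 * 1024 * 1024);  // ~10 GB input cap
    return conf;
  }
}
```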
22
Reliable: JobTracker HA
23
Easy: Label-based Scheduling
● Assign labels to nodes, or to regex/glob expressions matching node names
– perfnode1* → “production”
– /.*ssd[0-9]*/ → “fast_ssd”
● Create label expressions for jobs/queues
– Queue “fast_prod” → “production && fast_ssd”
● Tasks from these jobs/queues will only be assigned to nodes whose labels match the expression
● Combine with Data Placement policies for data and compute locality
● No static partitioning necessary
– The labels file is refreshed frequently
– New nodes automatically fall into the appropriate regex/glob labels
– New jobs can specify a label expression, use the queue's, or both (a hedged example follows this list)
● http://www.mapr.com/doc/display/MapR/Placing+Jobs+on+Specified+Nodes
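A minimal sketch of requesting a label from a job, assuming the mapred.job.label property and a whitespace-separated node-labels file as described in MapR's MRv1 docs; both names here are assumptions, and the URL above is the authoritative reference:

```java
// Hedged sketch: restrict a job to nodes whose labels satisfy an expression.
// "mapred.job.label" is an assumed property name from MapR MRv1-era docs.
//
// Assumed node-labels file, one glob/regex and label per line:
//   perfnode1*     production
//   /.*ssd[0-9]*/  fast_ssd
import org.apache.hadoop.mapred.JobConf;

public class LabeledJob {
  public static JobConf withLabel(JobConf conf) {
    // tasks only run on nodes matching this label expression
    conf.set("mapred.job.label", "production && fast_ssd");
    return conf;
  }
}
```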
24
Other Improvements
● Parallel Split Computations in JobClient
– Might as well multi-thread it! (a generic sketch follows this list)
● Runaway Job Protection
– One user's fork-bomb shouldn't degrade others' performance
– CPU/memory firewalls protect system processes
● Map-side join locality
– Files in the same directory/container follow the same replication chain
– Same key ranges are likely to be co-located on the same node
● Zero-config XML
– XML parsing takes too much time
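For the parallel split computation, a generic illustration of the idea (not MapR's actual JobClient code): the per-file metadata RPCs that dominate split calculation can be issued from a thread pool instead of a single sequential loop.

```java
// Generic illustration (not MapR's implementation): fetch block locations
// for many input files concurrently, hiding per-file RPC latency.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelSplitLister {
  public static List<BlockLocation[]> locateAll(Configuration conf, List<Path> files)
      throws Exception {
    final FileSystem fs = FileSystem.get(conf);
    ExecutorService pool = Executors.newFixedThreadPool(16);  // tuning knob
    List<Future<BlockLocation[]>> futures = new ArrayList<Future<BlockLocation[]>>();
    for (final Path p : files) {
      futures.add(pool.submit(new Callable<BlockLocation[]>() {
        public BlockLocation[] call() throws Exception {
          // one metadata round-trip per file; run them concurrently
          FileStatus st = fs.getFileStatus(p);
          return fs.getFileBlockLocations(st, 0, st.getLen());
        }
      }));
    }
    List<BlockLocation[]> result = new ArrayList<BlockLocation[]>();
    for (Future<BlockLocation[]> f : futures) {
      result.add(f.get());
    }
    pool.shutdown();
    return result;
  }
}
```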
25
MapR MapReduce Summary
● Fast
– Direct Shuffle
– Express Lane
– Parallel Split Computation
– Map-side Join Locality
– Zero-config XML
● Reliable
– JobTracker HA
– Runaway Job Protection
● Easy
– Label-based Scheduling
26
Beyond MapReduce...
http://www.nasa.gov/sites/default/files/potw1335a_0.jpg
27
M7: Enterprise-Grade HBase
[diagram: other distributions stack HBase (JVM) on a DFS (JVM) on ext3 on disks; MapR M7 serves the same data from one unified layer directly on disks]

Easy                      | Dependable                         | Fast
No RegionServers          | No compactions                     | Consistent low latency
Seamless splits           | Instant recovery from node failure | Real-time in-memory configuration
Automatic merges          | Snapshots                          | Disk and network compression
In-memory column families | Mirroring                          | Reduced I/O to disk

Unified Data Platform · Increased Performance · Simplified Administration
28
Apache Drill
Interactive analysis of Big Data using standard SQL
Based on Google Dremel

● Interactive queries (data analysts, reporting; 100 ms to 20 min): Drill
● Data mining, modeling, large ETL (20 min to 20 hr): MapReduce, Hive, Pig

Fast
● Low latency queries
● Columnar execution
● Complements native interfaces and MapReduce/Hive/Pig

Open
● Community-driven open source project
● Under the Apache Software Foundation

Modern
● Standard ANSI SQL:2003 (select/into)
● Nested/hierarchical data support
● Schema is optional
● Supports RDBMS, Hadoop, and NoSQL
29
Apache YARN aka MR2
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
31
Contact Us!
I'm not in Sales, so go to mapr.com to learn more:
– Integrations with AWS, GCE, Ubuntu, Lucidworks
– Partnerships, Customers
– Support, Training, Pricing
– Ecosystem Components
We're hiring! University of Wisconsin-Madison Career Fair tomorrow
Email me at: [email protected]
32
Questions?