MapReduce Improvements in MapR Hadoop
1
MapReduce Improvements in the MapR Hadoop Distribution
Adam Bordelon, Senior Software Engineer at MapR
Big Data Madison meetup - 9/26/2013
2
What's this all about?
● Background on Hadoop
● Big Data: Distributed Filesystems
● Big Compute:
– MapReduce
– Beyond MapReduce
● Q&A
3
Hadoop History
http://s.wsj.net/public/resources/images/MI-BX925_GOOGLE_G_20130818173254.jpg
4
Big Data: Distributed FileSystems
Volume, Variety, Velocity: Can't have big data without a scalable filesystem
http://www.lbisoftware.com/blog/wp-content/uploads/2013/06/data_mountain1.jpg
5
HDFS Architecture
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
6
HDFS Architectural Flaws
● Created for storing crawled web-page data
● Files cannot be modified once written/closed
– Write-once; append-only
● Files cannot be read before they are closed
– Must batch-load data
● NameNode stores (in memory):
– Directory/file tree, file->block mapping
– Block replica locations
● NameNode only scales to ~100 million files
– Some users run jobs to concatenate small files (a sketch of this workaround follows below)
● Written in Java; slows during GC
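Since the small-files workaround comes up often, here is a minimal sketch of it: packing a directory of small files into one SequenceFile so the NameNode tracks a single file instead of thousands. This is a generic illustration, not a MapR tool; the class name and paths are made up.

```java
// Minimal sketch (hypothetical helper, not from MapR): pack small files
// into a single SequenceFile keyed by original file name.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);  // directory full of small files
    Path packed = new Path(args[1]);  // single output SequenceFile
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (FileStatus st : fs.listStatus(srcDir)) {
        if (st.isDir()) continue;
        // assumes each file is small enough to buffer in memory
        byte[] buf = new byte[(int) st.getLen()];
        FSDataInputStream in = fs.open(st.getPath());
        try { in.readFully(buf); } finally { in.close(); }
        // key = original file name, value = raw file contents
        writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}
```

One packed file replaces thousands of NameNode entries; downstream map tasks read SequenceFile records instead of opening each tiny file.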
7
Solution: MapR FileSystem
● Visionary CTO/Co-Founder: M.C. Srivas
– Ran Google search infrastructure team
– Chief Storage Architect at Spinnaker Networks
● Take a step back: what kind of DFS do we need in Hadoop, a distributed computer?
– Easy, Scalable, Reliable
● Want traditional apps to work with DFS
– Support random Read/Write
– Standard FS interface (NFS)
● HDFS compatible
– Drop-in replacement, no recompile
8
Easy: POSIX-compliant NFS
9
Easy: MapR Volumes
Groups related files/directories into a single tree structure so they can be easily organized, managed, and secured.
● Replication factor
● Scheduled snapshots, mirroring
● Data placement control
– By device-type, rack, or geographic location
● Quotas and usage tracking
● Administrative permissions
100K+ Volumes are okay
10
Scalable: Containers
● Files/directories are sharded into blocks, which are placed into mini-NNs (containers) on disks
● Containers are 16-32 GB disk segments, placed on nodes
● Each container contains:
– Directories & files
– Data blocks
● Replicated on servers
● No need to manage directly; use MapR Volumes
11
Scalable: Container Location DB
[diagram: CLDB maps each container to its hosting nodes, e.g. N1, N2, N3]
● The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order
● Each container has a replication chain
● Updates are transactional
● Failures are handled by rearranging replication
● Clients cache container locations
12
Scalability Statistics
● Containers represent 16-32 GB of data
– Each can hold up to 1 billion files and directories
– 100M containers ≈ 2 exabytes (a very large cluster)
● 250 bytes of DRAM to cache a container
– 25 GB to cache all containers for a 2 EB cluster
– But not necessary; can page to disk
– A typical large 10 PB cluster needs 2 GB
● Container reports are 100x-1000x smaller than HDFS block reports
– Serve 100x more data nodes
– Increase container size to 64 GB to serve a 4 EB cluster
● MapReduce performance not affected
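A quick back-of-envelope check of those figures, assuming a ~20 GB average container size:

10^8 containers × 20 GB ≈ 2 × 10^18 bytes = 2 EB
10^8 containers × 250 bytes ≈ 25 GB of DRAM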
13
Record-breaking Speed

Benchmark                                        | MapR 2.1.1    | CDH 4.1.1     | MapR Speed Increase
Terasort (1x replication, compression disabled)  |               |               |
  Total                                          | 13m 35s       | 26m 6s        | 2X
  Map                                            | 7m 58s        | 21m 8s        | 3X
  Reduce                                         | 13m 32s       | 23m 37s       | 1.8X
DFSIO throughput/node                            |               |               |
  Read                                           | 1003 MB/s     | 656 MB/s      | 1.5X
  Write                                          | 924 MB/s      | 654 MB/s      | 1.4X
YCSB (50% read, 50% update)                      |               |               |
  Throughput                                     | 36,584.4 op/s | 12,500.5 op/s | 2.9X
  Runtime                                        | 3.80 hr       | 11.11 hr      | 2.9X
YCSB (95% read, 5% update)                       |               |               |
  Throughput                                     | 24,704.3 op/s | 10,776.4 op/s | 2.3X
  Runtime                                        | 0.56 hr       | 1.29 hr       | 2.3X

Benchmark hardware configuration: 10 servers, 12x2 cores (2.4 GHz), 12x2 TB disks, 48 GB RAM, 1x10GbE.

NEW WORLD RECORD: broke the TeraSort minute barrier
        | MapR w/ Google | Apache Hadoop
  Time  | 54s            | 62s
  Nodes | 1003           | 1460
  Disks | 1003           | 5840
  Cores | 4012           | 11680
14
Reliable: CLDB High Availability
● As easy as installing the CLDB role on more nodes
– Writes go to the CLDB master, replicated to slaves
– CLDB slaves can serve reads
● Container metadata is distributed, so the CLDB only stores/recovers container locations
– Instant restart (<2 seconds), no single point of failure
● Shared-nothing architecture
● (Multinode NFS HA too)
15
vs. Federated NN, NN HA
● Federated NameNodes
– Statically partition namespaces (like Volumes)
– Need an additional NN (plus a standby) for each namespace
– Federated NN only in Hadoop 2.x (beta)
● NameNode HA
– The NameNode is responsible for both fs-namespace (metadata) info and block locations; more data to checkpoint/recover
– Starting a standby NN from cold state can take tens of minutes for metadata, an hour for block locations; need a hot standby
– Metadata state:
● All namespace edits are logged to shared (NFS/NAS) R/W storage, which must also be HA; the standby polls the edit log for changes
● Or use the Quorum Journal Manager, a separate service on separate nodes
– Block locations:
● Data nodes send block reports, location updates, and heartbeats to both NNs
16
Reliable: Consistent Snapshots
● Automatic de-duplication
● Saves space by sharing blocks
● Lightning fast
● Zero performance loss on writing to original
● Scheduled, or on-demand
● Easy recovery with drag and drop
17
Reliable: Mirroring
18
MapR Filesystem Summary
● Easy
– Direct Access NFS
– MapR Volumes
● Fast
– C++ vs. Java
– Direct disk access, no layered filesystems
– Lockless transactions
– High-speed RPC
– Native compression
● Scalable
– Containers, distributed metadata
– Container Location DB
● Reliable
– CLDB High Availability
– Snapshots
– Mirroring
19
Big Compute: MapReduce
http://developer.yahoo.com/hadoop/tutorial/module4.html
20
Fast: Direct Shuffle
● Apache Shuffle
– Write map-outputs/spills to local file system
– Merge partitions for a map output into one file, index into it
– Reducers request partitions from mappers' HTTP servlets
● MapR Direct Shuffle
– Write to Local Volume in MapR FS (rebalancing)
– Map-output file per reducer (no index file)
– Send shuffleRootFid with MapTaskCompletion on heartbeat
– Direct RPC from Reducer to Mapper using Fid
– Copy is just a file-system copy; no HTTP overhead (see the toy sketch after this list)
– More copy threads, wider merges
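To make the contrast concrete, here is a toy sketch; this is emphatically not MapR's shuffle code, and the paths and the shuffleRoot handle are stand-ins for the real shuffleRootFid mechanism. The point it shows: once map outputs live in a distributed filesystem, a reducer's copy phase collapses to a plain file read.

```java
// Toy illustration only - not MapR's shuffle implementation.
// With map outputs in a distributed FS, fetching a partition is a plain
// file read rather than an HTTP request to the mapper's servlet.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ShuffleFetchSketch {
  // shuffleRoot stands in for the shuffleRootFid a TaskTracker reports
  // with map completion; one output file per reducer, no index file.
  static void fetchPartition(FileSystem fs, Path shuffleRoot,
                             int mapId, int reduceId,
                             FSDataOutputStream mergeSink) throws IOException {
    Path part = new Path(shuffleRoot, "map_" + mapId + "/reduce_" + reduceId);
    FSDataInputStream in = fs.open(part);
    try {
      IOUtils.copyBytes(in, mergeSink, 64 * 1024, false);  // plain FS copy
    } finally {
      in.close();
    }
  }
}
```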
21
Fast: Express Lane
● Long-running jobs shouldn't hog all the slots in the cluster and starve small, fast jobs (e.g. Hive queries)
● One or more small slots reserved on each node for running small jobs
● Small jobs: <10 maps/reduces, small input, a time limit (an illustrative config sketch follows)
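The thresholds that define a "small" job are configurable. A minimal sketch, assuming the mapr.fairscheduler.smalljob.* property names from MapR 2.x-era documentation; treat every name and value below as unverified and check the official docs:

```java
// Hedged sketch: Express Lane tuning via Hadoop Configuration.
// All property names here are assumptions from MapR 2.x-era docs.
import org.apache.hadoop.conf.Configuration;

public class ExpressLaneConfig {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // enable small-job scheduling (assumed property name)
    conf.setBoolean("mapr.fairscheduler.smalljob.schedule.enable", true);
    // limits under which a job qualifies as "small" (assumed names)
    conf.setInt("mapr.fairscheduler.smalljob.max.maps", 10);
    conf.setInt("mapr.fairscheduler.smalljob.max.reducers", 10);
    conf.setLong("mapr.fairscheduler.smalljob.max.inputsize",
                 10L * 1024 * 1024 * 1024);  // ~10 GB input cap
    return conf;
  }
}
```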
22
Reliable: JobTracker HA
23
Easy: Label-based Scheduling
● Assign labels to nodes, or to regex/glob expressions matching node names
– perfnode1* → “production”
– /.*ssd[0-9]*/ → “fast_ssd”
● Create label expressions for jobs/queues
– Queue “fast_prod” → “production && fast_ssd”
● Tasks from these jobs/queues will only be assigned to nodes whose labels match the expression
● Combine with Data Placement policies for data and compute locality
● No static partitioning necessary
– The labels file is refreshed frequently
– New nodes automatically fall into the appropriate regex/glob labels
– New jobs can specify a label expression, use the queue's, or both (a hedged example follows this list)
● http://www.mapr.com/doc/display/MapR/Placing+Jobs+on+Specified+Nodes
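A minimal sketch of requesting a label from a job, assuming the mapred.job.label property and a whitespace-separated node-labels file as described in MapR's MRv1 docs; both names here are assumptions, and the URL above is the authoritative reference:

```java
// Hedged sketch: restrict a job to nodes whose labels satisfy an expression.
// "mapred.job.label" is an assumed property name from MapR MRv1-era docs.
//
// Assumed node-labels file, one glob/regex and label per line:
//   perfnode1*     production
//   /.*ssd[0-9]*/  fast_ssd
import org.apache.hadoop.mapred.JobConf;

public class LabeledJob {
  public static JobConf withLabel(JobConf conf) {
    // tasks only run on nodes matching this label expression
    conf.set("mapred.job.label", "production && fast_ssd");
    return conf;
  }
}
```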
24
Other Improvements
● Parallel Split Computations in JobClient
– Might as well multi-thread it! (a generic sketch follows this list)
● Runaway Job Protection
– One user's fork-bomb shouldn't degrade others' performance
– CPU/memory firewalls protect system processes
● Map-side join locality
– Files in the same directory/container follow the same replication chain
– Same key ranges are likely to be co-located on the same node
● Zero-config XML
– XML parsing takes too much time
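For the parallel split computation, a generic illustration of the idea (not MapR's actual JobClient code): the per-file metadata RPCs that dominate split calculation can be issued from a thread pool instead of a single sequential loop.

```java
// Generic illustration (not MapR's implementation): fetch block locations
// for many input files concurrently, hiding per-file RPC latency.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelSplitLister {
  public static List<BlockLocation[]> locateAll(Configuration conf, List<Path> files)
      throws Exception {
    final FileSystem fs = FileSystem.get(conf);
    ExecutorService pool = Executors.newFixedThreadPool(16);  // tuning knob
    List<Future<BlockLocation[]>> futures = new ArrayList<Future<BlockLocation[]>>();
    for (final Path p : files) {
      futures.add(pool.submit(new Callable<BlockLocation[]>() {
        public BlockLocation[] call() throws Exception {
          // one metadata round-trip per file; run them concurrently
          FileStatus st = fs.getFileStatus(p);
          return fs.getFileBlockLocations(st, 0, st.getLen());
        }
      }));
    }
    List<BlockLocation[]> result = new ArrayList<BlockLocation[]>();
    for (Future<BlockLocation[]> f : futures) {
      result.add(f.get());
    }
    pool.shutdown();
    return result;
  }
}
```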
25
MapR MapReduce Summary
● Fast
– Direct Shuffle
– Express Lane
– Parallel Split Computation
– Map-side Join Locality
– Zero-config XML
● Reliable
– JobTracker HA
– Runaway Job Protection
● Easy
– Label-based Scheduling
26
Beyond MapReduce...
http://www.nasa.gov/sites/default/files/potw1335a_0.jpg
27
M7: Enterprise-Grade HBase
[diagram: other distributions stack HBase (JVM) on a DFS (JVM) on ext3 on disks; MapR M7 serves the same data from one unified layer directly on disks]

Easy                      | Dependable                         | Fast
No RegionServers          | No compactions                     | Consistent low latency
Seamless splits           | Instant recovery from node failure | Real-time in-memory configuration
Automatic merges          | Snapshots                          | Disk and network compression
In-memory column families | Mirroring                          | Reduced I/O to disk

Unified Data Platform · Increased Performance · Simplified Administration
28
Apache Drill
Interactive analysis of Big Data using standard SQL
Based on Google Dremel

● Interactive queries (data analysts, reporting; 100 ms to 20 min): Drill
● Data mining, modeling, large ETL (20 min to 20 hr): MapReduce, Hive, Pig

Fast
● Low latency queries
● Columnar execution
● Complements native interfaces and MapReduce/Hive/Pig

Open
● Community-driven open source project
● Under the Apache Software Foundation

Modern
● Standard ANSI SQL:2003 (select/into)
● Nested/hierarchical data support
● Schema is optional
● Supports RDBMS, Hadoop, and NoSQL
29
Apache YARN aka MR2
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
31
Contact Us!
I'm not in Sales, so go to mapr.com to learn more:
– Integrations with AWS, GCE, Ubuntu, Lucidworks
– Partnerships, Customers
– Support, Training, Pricing
– Ecosystem Components
We're hiring! University of Wisconsin-Madison Career Fair tomorrow
Email me at: [email protected]
32
Questions?