Post on 30-Dec-2015
Outline
- MapReduce performance tuning
- Comparing MapReduce with DBMS
- Negative arguments from the DB community
- Experimental study (DB vs. MapReduce)
Web interfaces
Tracking MapReduce jobs:
- http://localhost:50030/ - web UI for the JobTracker
- http://localhost:50060/ - web UI for task tracker(s)
- http://localhost:50070/ - web UI for HDFS name node(s)
From the command line: lynx http://localhost:50030/
Configuration files
Directory: /usr/local/hadoop/conf/
- hdfs-site.xml
- mapred-site.xml
- core-site.xml
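As a sketch of what lives in these files, the fragment below uses standard Hadoop 1.x property names; the values are illustrative assumptions, not recommendations:

```xml
<!-- mapred-site.xml: illustrative tuning values, not recommendations -->
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value> <!-- number of reduce tasks per job -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value> <!-- map-side sort buffer, in MB -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value> <!-- compress intermediate map output -->
  </property>
</configuration>
```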
Some parameter tuning: http://www.youtube.com/watch?v=hIXmpS-jpoA
Some tips: http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
7 tips
1. Configure your cluster correctly
2. Use LZO compression
3. Tune the number of maps/reduces
4. Use a combiner
5. Use appropriate and compact Writables
6. Reuse Writables
7. Profile your tasks
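Tip 4 can be sketched in plain Python, assuming a word-count-style job; `combine` here stands in for the Hadoop combiner stage, which pre-aggregates on the map side so fewer pairs are shuffled over the network:

```python
from collections import Counter

def map_phase(line):
    """Map: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: locally pre-aggregate map output, shrinking the
    number of (key, value) pairs sent across the network."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

# One mapper's output before and after local combining.
pairs = map_phase("to be or not to be")
combined = combine(pairs)
print(len(pairs), len(combined))  # 6 pairs shrink to 4
```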
Debates on MapReduce
Most opinions are from the database camp:
- Database techniques for distributed data stores
- Parallel databases
- Traditional database approaches: schema, SQL
DB guys' initial response to MapReduce: "MapReduce: a giant step backward"
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all -- it represents a specific implementation of well-known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMSs
- Incompatible with all of the tools DBMS users have come to depend on
A giant step backward in the programming paradigm for large-scale data-intensive applications
- Schema: separating schema from code
- High-level language
Responses:
- MR handles large data that has no schema; it takes time to clean large data and pump it into a DB
- High-level languages have been developed: Pig, Hive, etc.
- There are some problems that SQL cannot handle
- Unix-style programming (pipelined processing) is used by many users
A sub-optimal implementation, in that it uses brute force instead of indexing
- No index
- Data skew: some reducers take a longer time
- High cost in the reduce stage: disk I/O
Responses:
- Google's experience has shown it can scale well
- An index is not possible if the data has no schema
- MapReduce is used to generate the web index
- Writing back to disk increases fault tolerance
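The data-skew point above has a common mitigation, key salting: split a hot key across several reducers and merge the partial results in a later pass. This is a generic sketch, not something from the debate itself, and `NUM_SALTS` is an assumed parameter:

```python
import random

NUM_SALTS = 4  # assumed fan-out; spreads one hot key over up to 4 reducers

def salted_key(key):
    """Append a random salt so a hot key no longer lands on a single
    reducer; a second aggregation pass merges the partial counts."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

# Without salting, every 'the' record goes to one reducer;
# with salting, the records spread across up to NUM_SALTS partitions.
records = ["the"] * 1000
partitions = {salted_key(r) for r in records}
print(sorted(partitions))
```

The trade-off is an extra merge step in exchange for balanced reduce times.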
Not novel at all -- it represents a specific implementation of well-known techniques developed nearly 25 years ago
- Hash-based join
- Teradata
Responses:
- The authors do not claim it is novel; many users are already using similar ideas in their own distributed solutions
- MapReduce serves as a well-developed library
Missing most of the features that are routinely included in current DBMSs
- Bulk loader, transactions, views, integrity constraints, …
Responses:
- MapReduce is not a DBMS; it is designed for different purposes
- In practice, this does not prevent engineers from implementing solutions quickly
- Engineers usually take more time to learn a DBMS
- DBMSs do not scale to the level of MapReduce applications
Incompatible with all of the tools DBMS users have come to depend on
Responses:
- Again, it is not a DBMS
- DBMS systems and tools have become obstacles to data analytics
Experimental study
SIGMOD'09: "A comparison of approaches to large-scale data analysis"
Compares parallel SQL DBMSs (anonymous DBMS-X and Vertica) with MapReduce (Hadoop)
Tasks
- Grep
- Typical DB tasks: selection, aggregation, join, UDF
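The Grep task can be sketched as a Hadoop Streaming-style, map-only job in Python; the three-character pattern "XYZ" below is an assumption for illustration:

```python
import re

PATTERN = re.compile(r"XYZ")  # assumed search pattern for the grep task

def grep_mapper(lines):
    """Map-only grep: emit every input line containing the pattern.
    No reduce stage is needed, so the job is a pure scan -- the case
    where an index-less, brute-force approach is least penalized."""
    return [line for line in lines if PATTERN.search(line)]

data = ["aaaXYZbbb", "no match here", "XYZ at start"]
print(grep_mapper(data))  # ['aaaXYZbbb', 'XYZ at start']
```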
UDF task
- Extract URLs from documents, then count them
- Vertica does not support UDF functions; it uses a special program instead
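A minimal Python sketch of the UDF task's two phases (extract URLs, then count per URL); the regex is a simplifying assumption, not the one used in the study:

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://[^\s\"'>]+")  # assumed, simplified URL pattern

def extract_urls(document):
    """Phase 1 (the 'UDF'): pull every URL out of a document."""
    return URL_RE.findall(document)

def count_urls(documents):
    """Phase 2: aggregate -- count how often each URL appears."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_urls(doc))
    return counts

docs = [
    "see http://a.example/x and http://b.example/y",
    "again http://a.example/x",
]
print(count_urls(docs))
```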
Discussion: system level
- MR is easy to install/configure; installing a parallel DBMS is more challenging
- Performance-tuning tools are available for parallel DBMSs
- MR takes more time in task start-up
- Compression does not help MR
- MR has a short loading time, while DBMSs take some time to reorganize the input data
- MR has better fault tolerance