Hadoop and HBase experiences in perf log project

Hadoop & HBase Experiences in Perf-Log Project
Eric Geng & Gary Zhao, Performance Team, Platform, 11/24/2011


An introduction to Hadoop and HBase, and a summary of how we use them to analyze performance metrics and logs

Transcript of Hadoop and HBase experiences in perf log project

Page 1: Hadoop and HBase experiences in perf log project


Hadoop & HBase Experiences in Perf-Log Project

Eric Geng & Gary Zhao
Performance Team, Platform 11/24/2011

Page 2: Hadoop and HBase experiences in perf log project



Page 3: Hadoop and HBase experiences in perf log project


Architecture of Perf-Log Project

Page 4: Hadoop and HBase experiences in perf log project


Perf Log Format

• Event Level

• Request Level

Page 5: Hadoop and HBase experiences in perf log project


Event Configuration

Page 6: Hadoop and HBase experiences in perf log project


Reports and Charts

Page 7: Hadoop and HBase experiences in perf log project


Log Lookup

Page 8: Hadoop and HBase experiences in perf log project


HDFS Architecture

Page 9: Hadoop and HBase experiences in perf log project



Page 10: Hadoop and HBase experiences in perf log project


HBase Architecture

Page 11: Hadoop and HBase experiences in perf log project


Yahoo! Cloud Serving Benchmark

• 3 HBase nodes on Solaris zones








Write    1808 writes/s    1.6 ms    0.02% > 1s (due to region

Read     9846 reads/s     0.3 ms    45 ms

Page 12: Hadoop and HBase experiences in perf log project


Hadoop Configuration Overview

Page 13: Hadoop and HBase experiences in perf log project


Setting up Hadoop

• Supported Platforms
    • Linux – best
    • Solaris – OK, just works
    • Windows – not recommended

• Required Software
    • JDK 1.6.x

• Packages
    • Cloudera

Page 14: Hadoop and HBase experiences in perf log project


Match Hadoop & HBase Version

Hadoop version              HBase version    Compatible?
0.20.3 release              0.90.x           NO
0.20-append                 0.90.x           YES
0.20.5 release              0.90.x           YES
0.21.0 release              0.90.x           NO
0.22.x (in development)     0.90.x           NO

Page 15: Hadoop and HBase experiences in perf log project


Running Modes of Hadoop

• Standalone Operation
By default, Hadoop runs in a non-distributed mode, as a single Java process, which is useful for debugging.

• Pseudo-Distributed Operation
Hadoop runs on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

• Fully-Distributed Operation
Hadoop runs on a cluster; this is the real production environment.

Page 16: Hadoop and HBase experiences in perf log project


Web Access to Hadoop

Page 17: Hadoop and HBase experiences in perf log project


Map Reduce Job Guide

Page 18: Hadoop and HBase experiences in perf log project


Map Reduce Job

MapReduce is a programming model for data processing on Hadoop. It works by breaking the

processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs

as input and output, the types of which may be chosen by the programmer.

• Mapper

A Mapper usually processes its input one line at a time. It ignores the useless lines and collects the useful

information from the data into <Key, Value> pairs.

• Reducer

A Reducer receives the <Key, <Value1, Value2, …>> pairs from the Mappers, computes statistics over the values,

and writes the results as <Key, Value> pairs.

Page 19: Hadoop and HBase experiences in perf log project


Data Flow

Page 20: Hadoop and HBase experiences in perf log project


Serialization in Hadoop

Java type    Writable class
int          IntWritable
long         LongWritable
boolean      BooleanWritable
byte         ByteWritable
float        FloatWritable
double       DoubleWritable
String       Text
null         NullWritable
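
The wrapper types above exist because Hadoop serializes keys and values itself instead of relying on Java serialization. As a rough sketch of the idea, the class below mimics the two methods of Hadoop's Writable interface, write(DataOutput) and readFields(DataInput), using only java.io. EventCount and its fields are hypothetical names chosen for illustration, and the class deliberately does not depend on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type. In a real job it would implement org.apache.hadoop.io.Writable.
class EventCount {
    private String eventName = "";
    private int count;

    // Writable-style types need a no-argument constructor so the framework can create them.
    EventCount() { }

    EventCount(String eventName, int count) {
        this.eventName = eventName;
        this.count = count;
    }

    // Same shape as Writable.write: write the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(eventName);
        out.writeInt(count);
    }

    // Same shape as Writable.readFields: read the fields back in the same order.
    void readFields(DataInput in) throws IOException {
        eventName = in.readUTF();
        count = in.readInt();
    }

    String getEventName() { return eventName; }

    int getCount() { return count; }

    // Serializes a value to a byte array and reads it back into a fresh instance.
    static EventCount roundTrip(EventCount original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            EventCount copy = new EventCount();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing and reading the fields in the same fixed order is the whole contract; no class metadata is stored, which keeps the serialized form compact.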

Page 21: Hadoop and HBase experiences in perf log project


Example: WordCount

Before we jump into the details, let's walk through an example MapReduce application to get a flavour of how such

applications work. WordCount is a simple application that counts the number of occurrences of each word in a given input set.

// MapReduceBase must be extended and Mapper implemented.
// Type parameters: <input key, input value, output key, output value>.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Put the word as key and the occurrence (1) as value into the collector.
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

// The input key-value types must match the output types of the Mapper.
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Page 22: Hadoop and HBase experiences in perf log project


MapReduce Job Configuration

Before running a MapReduce job, the following fields should be set:

• Mapper Class

The mapper class you wrote for the job.

• Reducer Class

The reducer class you wrote for the job.

• Input Format & Output Format

Define the format of the job's inputs and outputs. The Hadoop library ships with a large number of formats.

• OutputKeyClass & OutputValueClass

The key and value types of the job's output. Unless setMapOutputKeyClass and setMapOutputValueClass are set,

they also apply to the pairs that Mappers send to Reducers.

Page 23: Hadoop and HBase experiences in perf log project


Example: WordCount

Code to run the job

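import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // The Map and Reduce classes from the earlier slide are nested here.

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);

        // Set output key & value class
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Set Mapper & Reducer class
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // Set InputFormat & OutputFormat class
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Set input & output path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job and wait for it to finish
        JobClient.runJob(conf);
    }
}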

Page 24: Hadoop and HBase experiences in perf log project


Example: WordCount

Input

Hello World, Bye World

Hello Hadoop, Goodbye Hadoop

Map 1

< Hello, 1>

< World, 1>

< Bye, 1>

< World, 1>

Map 2

< Hello, 1>

< Hadoop, 1>

< Goodbye, 1>

< Hadoop, 1>

Shuffle (auto)

< Bye, <1>>

< Goodbye, <1>>

< Hadoop, <1,1>>

< Hello, <1,1>>

< World, <1,1>>

Reduce

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>
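
The three stages of this data flow can be imitated in plain Java without a cluster. The sketch below is only an in-memory illustration, not how Hadoop implements the shuffle; WordCountFlow and its method names are hypothetical. It also assumes words are split on any non-letter character, so that "World," yields the key World as in the flow above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountFlow {
    // Map stage: emit one entry per word; each entry stands for a <word, 1> pair.
    static List<String> mapLine(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Shuffle stage: group the 1s by word, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> mappedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mappedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce stage: sum the values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs all three stages over the given input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<String> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(mapLine(line));
        }
        return reduce(shuffle(mapped));
    }
}
```

Calling WordCountFlow.run on the two input lines above gives the same five counts as the Reduce stage.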

Page 25: Hadoop and HBase experiences in perf log project


Example in perf-log

Here is an example of using MapReduce to analyze the log files in the perf-log project.

The log files contain two types of record, and each record is a single line.

Event Level

Request Level

Page 26: Hadoop and HBase experiences in perf log project


Example Using MapReduce

Here we use a MapReduce job to find the most used event of each day. The Map phase collects all

the event records, and the Reduce phase counts them and picks the most used event per day.


Input log

request record…

request record…

event PM_HOME

request record…

request record…

request record…

request record…

request record…

request record…

request record…

Map

(11/12, PLT_LOGIN)

(11/12, PM_HOME)

(11/12, PLT_LOGIN)

(11/12, PM_LOGOUT)

(11/13, CDP_LOGIN)

(11/13, CDP_LOGIN)

Shuffle (auto)

(11/12, [PLT_LOGIN,

(11/13, [CDP_LOGIN,

Reduce

(11/12, PLT_LOGIN)

(11/13, PM_HOME)

(11/14, CDP_HOME)

(11/15, PLT_HOME)
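
The reduce logic of this job can be sketched in plain Java. This is an in-memory illustration under assumptions, not the project's actual job: MostUsedEvent is a hypothetical name, the map output is taken to be (date, event name) pairs as in the flow above, and ties are broken in favour of the event that reaches the highest count first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MostUsedEvent {
    // Shuffle stage: group event names by date (Hadoop does this between map and reduce).
    static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        Map<String, List<String>> byDate = new TreeMap<>();
        for (String[] pair : mapOutput) {
            byDate.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return byDate;
    }

    // Reduce stage for one date: count every event and return the most frequent one.
    static String reduce(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String event : events) {
            int count = counts.merge(event, 1, Integer::sum);
            if (count > bestCount) {
                best = event;
                bestCount = count;
            }
        }
        return best;
    }

    // Runs shuffle and reduce over (date, event) pairs and returns date -> most used event.
    static Map<String, String> run(List<String[]> mapOutput) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : shuffle(mapOutput).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }
}
```

Using the date as the key is what makes the framework deliver all events of one day to the same reduce call.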

Page 27: Hadoop and HBase experiences in perf log project


HBase API Guide

Page 28: Hadoop and HBase experiences in perf log project


Table Structure

Tables in HBase have the following features:

1. They are large, sparsely populated tables.

2. Each row has a row key.

3. Table rows are sorted by row key, the table’s primary key.

By default, the sort is byte-ordered.

4. Row columns are grouped into column families. A table’s

column families must be specified up front as part of the

table schema definition and are hard to change afterwards

(the table has to be disabled before its schema can be altered).

5. New column family members (column qualifiers) can be added on demand.
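
Byte-ordered sorting has a practical consequence for row key design: numeric suffixes do not sort numerically unless they are zero-padded. The following sketch shows the effect with a plain unsigned-byte comparison; RowKeyOrder and the sample keys are hypothetical, and the comparator only imitates the ordering rule rather than calling any HBase code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RowKeyOrder {
    // Lexicographic comparison of unsigned bytes; a shorter key sorts before its extensions.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    // Sorts string keys by the byte order of their UTF-8 encoding.
    static List<String> sortAsRowKeys(List<String> keys) {
        List<byte[]> raw = new ArrayList<>();
        for (String key : keys) {
            raw.add(key.getBytes(StandardCharsets.UTF_8));
        }
        raw.sort(BYTE_ORDER);
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : raw) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }
}
```

With the keys row2, row10 and row1, the sorted order is row1, row10, row2. Zero-padding the numbers (row01, row02, row10) restores the numeric order.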

Page 29: Hadoop and HBase experiences in perf log project


Table Structure

Here is the table structure of “perflog” in the perf-log project:

             column family: event           column family: req

row key      event_name  event_id  …       req1  req1_id  …    req2  req2_id  …      (column qualifiers)

row1         xxx         xxx       …       xxx   xxx      xxx  xxx   xxx      …      (column values)

row2         xxx         xxx       …       xxx   xxx      xxx  xxx   xxx      …      (column values)

Page 30: Hadoop and HBase experiences in perf log project


Column Design

When designing column families and qualifiers, pay

attention to the following two points:

1. Keep the number of column families in your schema low.

HBase currently does not do well with anything above two or three column families.

2. Keep the names of column families and qualifiers as short as possible.

Operating on a table in HBase causes thousands and thousands of comparisons on

column names, so short names improve performance.

Page 31: Hadoop and HBase experiences in perf log project


HBase Command Shell

HBase provides a command shell to operate the

system. Here are some example commands :

• Status

• Create

• List

• Put

• Scan

• Disable & Drop

Page 32: Hadoop and HBase experiences in perf log project


HBase Command Shell

Page 33: Hadoop and HBase experiences in perf log project


API to Operate Tables in HBase

There are four main methods for operating on a table in HBase:


• Get

• Put

• Scan

• Delete

** Put and Scan are widely used in the perf-log project.

Page 34: Hadoop and HBase experiences in perf log project


Using Put & Scan in HBase

When using Put in HBase, pay attention to:

• AutoFlush

• WAL on Puts

When using Scan in HBase, pay attention to:

• Scan Attribute Selection

• Scan Caching

Page 35: Hadoop and HBase experiences in perf log project


Using Scan with Filter

HBase filters are a powerful feature that can greatly enhance your effectiveness

when working with data stored in tables. Four filters are used in the perf-log project:

• SingleColumnValueFilter

You can use this filter when you have exactly one column that decides if an entire row

should be returned or not.

• RowFilter

This filter gives you the ability to filter data based on row keys.

• PageFilter

This filter lets you paginate through rows.

• FilterList

This filter lets you combine several filters in one scan.

Page 36: Hadoop and HBase experiences in perf log project


Using Scan with Filter

• PageFilter

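There is a fundamental issue with filtering on physically separate servers. Filters run on

different region servers in parallel and cannot retain or communicate their current state

across those boundaries, so each filter has to scan at least pageCount rows before ending

its part of the scan. As a result, you may get more rows than you actually want.

    // Assumes `table` is an open HTable.
    // POSTFIX is appended to the last row key so that the next page starts right after it.
    final byte[] POSTFIX = new byte[] { 0x00 };

    Filter filter = new PageFilter(5); // 5 is the pageCount
    int totalRows = 0;
    byte[] lastRow = null;

    while (true) {
        Scan scan = new Scan();
        scan.setFilter(filter);
        if (lastRow != null) {
            // Continue after the last row of the previous page.
            scan.setStartRow(Bytes.add(lastRow, POSTFIX));
        }

        ResultScanner scanner = table.getScanner(scan);
        int localRows = 0;
        Result result;
        while ((result = scanner.next()) != null) {
            localRows++;
            totalRows++;
            lastRow = result.getRow();
        }
        scanner.close();

        // An empty page means the end of the table was reached.
        if (localRows == 0) break;
    }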
Page 37: Hadoop and HBase experiences in perf log project


Using Scan with Filter

• FilterList

When using multiple filters with FilterList, note that adding the filters to the FilterList

in a different order produces different results.

    Filter pageFilter = new PageFilter(5);
    Filter singleColumnValueFilter = new SingleColumnValueFilter(
            Bytes.toBytes("event"), Bytes.toBytes("name"), CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN"));

Take the first 5 records and then return those whose event name is “PLT_LOGIN”:

    FilterList filterList = new FilterList();
    filterList.addFilter(pageFilter);
    filterList.addFilter(singleColumnValueFilter);

Take all the records whose event name is “PLT_LOGIN” and then return the first 5 of them:

    FilterList filterList = new FilterList();
    filterList.addFilter(singleColumnValueFilter);
    filterList.addFilter(pageFilter);
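
Why the order matters can be seen with an analogy in plain Java streams. This is only an analogy under assumptions, not HBase's FilterList evaluation: FilterOrder and its method names are hypothetical, limit stands in for PageFilter and filter stands in for SingleColumnValueFilter.

```java
import java.util.List;
import java.util.stream.Collectors;

class FilterOrder {
    // Page first, then match: only the first pageSize records are ever checked.
    static List<String> pageThenMatch(List<String> events, int pageSize, String wanted) {
        return events.stream()
                .limit(pageSize)
                .filter(wanted::equals)
                .collect(Collectors.toList());
    }

    // Match first, then page: up to pageSize matching records are returned.
    static List<String> matchThenPage(List<String> events, int pageSize, String wanted) {
        return events.stream()
                .filter(wanted::equals)
                .limit(pageSize)
                .collect(Collectors.toList());
    }
}
```

On the same input, the first variant can return fewer than pageSize records even when enough matching records exist further down the table.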



Page 38: Hadoop and HBase experiences in perf log project


Map Reduce with HBase

Here is an example:

static class MyMapper<K, V> extends MapReduceBase implements Mapper<LongWritable, Text, K, V> {

    private HTable table;

    public void configure(JobConf jc) {
        super.configure(jc);
        // Create the HTable client once up front instead of on every map() call.
        try {
            this.table = new HTable(HBaseConfiguration.create(), "table_name");
        } catch (IOException e) {
            throw new RuntimeException("Failed HTable construction", e);
        }
    }

    public void close() throws IOException {
        super.close();
        table.close();
    }

    public void map(LongWritable key, Text value, OutputCollector<K, V> output, Reporter reporter) throws IOException {
        // rowKey is a placeholder: the byte[] row key you build from the input record.
        Put p = new Put(rowKey);
        … // Set your own put.
        table.put(p);
    }
}

Page 39: Hadoop and HBase experiences in perf log project


Bulk Load

HBase includes several methods of loading data into tables. The most

straightforward method is to either use a MapReduce job, or use the normal

client APIs; however, these are not always the most efficient methods.

The bulk load feature uses a MapReduce job to output table data in HBase's

internal data format, and then directly loads the data files into a running

cluster. Using bulk load will use less CPU and network resources than simply

using the HBase API.






HFiles HBase

Page 40: Hadoop and HBase experiences in perf log project


Bulk Load

Note that we use HFileOutputFormat as the output format of the MapReduce

job that generates the HFiles. However, the HFileOutputFormat

provided by the HBase library does NOT support writing multiple column

families into an HFile.

A version of HFileOutputFormat that supports multiple column families can be

found HERE:


Page 41: Hadoop and HBase experiences in perf log project


Thank You, and Questions

See More About Hadoop & HBase:

