Estimating Language Models Using Hadoop and Hbase

Estimating Language Models Using Hadoop

and Hbase

Xiaoyang YuT

HE

U N I V E RS

IT

Y

OF

ED I N B U

RG

H

Master of Science

Artificial Intelligence

School of Informatics

University of Edinburgh

2008

Abstract

This thesis presents the work of building a large scale distributed ngram language

model using a MapReduce platform named Hadoop and a distributed database called

Hbase. We propose a method focusing on the time cost and storage size of the model,

exploring different Hbase table structures and compression approaches. The method

is applied to build up training and testing processes using Hadoop MapReduce frame-

work and Hbase. The experiments evaluate and compare different table structures on

training 100 million words for unigram, bigram and trigram models, and the results

suggest a table based on half ngram structure is a good choice for distributed language

model. The results of this work can be applied and further developed in machine trans-

lation and other large scale distributed language processing areas.

i

Acknowledgements

Many thanks to my supervisor Miles Osborne for the numerous advices, great supports

and for inspiring new ideas about this project during our meetings. I would also like to

thank my parent for their trusts in me and encouragements. Thanks a lot to Zhao Rui

for her great suggestions for the thesis.

ii

Declaration

I declare that this thesis was composed by myself, that the work contained herein is

my own except where explicitly stated otherwise in the text, and that this work has not

been submitted for any other degree or professional qualification except as specified.

(Xiaoyang Yu)

iii

Table of Contents

1 Introduction 1

2 Background 32.1 Ngram Language Model . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Distributed Language Modeling . . . . . . . . . . . . . . . . . . . . 5

2.3 Hadoop MapReduce Framework . . . . . . . . . . . . . . . . . . . . 6

2.4 Hbase Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Estimating Language Model using Map Reduce 103.1 Generate Word Counts . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Generate Count of Counts . . . . . . . . . . . . . . . . . . . . . . . 14

3.3 Generate Good-Turing Smoothing Counts . . . . . . . . . . . . . . . 15

3.4 Generate Ngram Probability . . . . . . . . . . . . . . . . . . . . . . 16

3.5 Generate Hbase Table . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.5.1 n-Gram Based Structure . . . . . . . . . . . . . . . . . . . . 19

3.5.2 Current Word Based Structure . . . . . . . . . . . . . . . . . 20

3.5.3 Context Based Structure . . . . . . . . . . . . . . . . . . . . 21

3.5.4 Half Ngram Based Structure . . . . . . . . . . . . . . . . . . 22

3.5.5 Integer Based Structure . . . . . . . . . . . . . . . . . . . . . 24

3.6 Direct Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.7 Caching Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Evaluation Methods 314.1 Time and Space for Building LM . . . . . . . . . . . . . . . . . . . . 31

4.2 LM Perplexity Comparison . . . . . . . . . . . . . . . . . . . . . . . 32

5 Experiments 335.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

iv

5.2 Ngram Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Table Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6 Discussion 436.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Bibliography 47

A Source Code 49A.1 NgramCount.java . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

A.2 TableGenerator.java . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

A.3 NgramModel.java . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

v

Chapter 1

Introduction

In statistical natural language processing, ngram language model is widely used

in various areas like machine translation and speech recognition etc. Ngram model is

trained using large sets of unlabelled texts to estimate a probability model. Generally

speaking, more data gives better model. As we can easily get huge training texts from

corpus or web, the computational power and storage space of single computer are both

limited. Distributed language models are proposed to meet the needs of processing vast

amount of data, trying to solve above problems. In the field of distributed language

model processing, most of the relevant works mainly concentrated on establishing the

methods for model training and testing, but no detailed discussions about the storage

structure. The objective of our work is to efficiently construct and store the model in a

distributed cluster environment. Moreover, our interest is to explore the capability of

coping different database table structures with language models.

The novel aspects of our project are as follows. We show that a distributed database

can be well integrated with distributed language modeling, providing the input/output

data sink for both the distributed training and testing processes. Using distributed

database helps to reduce the computational and storage cost. Meanwhile, the choice of

database table structure affects the efficiency a lot. We find that half ngram based table

structure is a good choice for distributed language models from the aspects of time and

space cost, comparing with other structures in our experiments.

We use a distributed computing platform called Hadoop, which is an open source

implementation of the MapReduce framework proposed by Google [3]. A distributed

database named Hbase is used on top of Hadoop, providing a model similar to Google’s

Bigtable storage structure [2]. The training methods are based on Google’s work on

large language model in machine translation [1], and we propose a different method

1

Chapter 1. Introduction 2

using Good-Turing smoothing with a back-off model and pruning. We store the ngram

model in the distributed database with 5 different table structures, and a back-off esti-

mation is performed in a testing process using MapReduce. We propose different table

structures based on ngram, current word, context, half ngram and converted integer.

Using Hadoop and Hbase, we build up unigram, bigram and trigram models with 100

million words from British National Corpus. All the table structures are evaluated and

compared for each ngram order with a testing set of 35,000 words. The training and

testing processes are split into several steps. For each step the time and space cost

are compared in the experiments. The perplexity of testing set is also compared with

traditional language modeling package SRILM [6]. The results are discussed about the

choice of proper table structure for the consideration of efficiency.

The rest of this thesis is organised as following: Chapter 2 introduces the ngram

language model, relative works on distributed language modeling, the Hadoop MapRe-

duce framework and Hbase distributed database; Chapter 3 gives the detail of our meth-

ods, illustrating all the steps in MapReduce training and testing tasks, as well as all the

different table structures we proposed; Chapter 4 describes the evaluation methods we

use; Chapter 5 shows all the experiments and results; Chapter 6 discusses about the

choice of table structure for Hadoop/Hbase and possible future works.

Chapter 2

Background

In statistical language processing, it is essential to have a language model which

measures the probability of how likely a sequence of words may occur in some context.

Ngram language model is the major language modeling method, along with smoothing

and back-off methods to deal with the problem of data sparseness. Estimating this in

a distributed environment enables us to process vast amount of data and get a better

language model. Hadoop is a distributed computing framework that can be used for

language modeling, and Hbase is a distributed database which may store the model

data as database tables and integrate with Hadoop platform.

2.1 Ngram Language Model

Ngram model is the leading method for statistical language processing. It tries to

predict the next word wn given n-1 words context w1,w2, ...,wn−1 by estimating the

probability function:

P(wn|w1, ...,wn−1) (2.1)

Usually a Markov assumption is applied that only the prior local context - the last

n-1 words - affects the next word. The probability function can be expressed by the

frequency of words occurrence in a corpus using Maximum Likelihood Estimation

without smoothing:

p(wn|w1, ...,wn−1) =f (w1, ...,wn)

f (w1, ...wn−1)(2.2)

where f (w1, ...,wn) is the counts of how many times we see the sentence w1, ...,wn

in the corpus. One important aspect is the count smoothing, which adjusts the empiri-

cal counts collected from training texts to the expected counts for ngrams. Considering

3

Chapter 2. Background 4

ngrams that don’t appear in the training set, but are seen in testing text, for a maximum

likelihood estimation the probability would be zero, so we need to find a better es-

timation for the expected counts. A popular smoothing method called Good-Turing

smoothing is based on the count of ngram counts, the expected counts are adjusted

with the formula:

r∗= (r +1)Nr+1

Nr(2.3)

where r is the actual counts, Nr is the count of ngrams that occur exactly r times.

For words that are seen frequently, r can be large numbers, and Nr+1 is likely to be

zero, in this case derivations can be made to look up Nr+2,Nr+3, ...Nr+n, find a non-

zero count and use that value instead.

The idea of count smoothing is to better estimate probabilities for ngrams, mean-

while, for unseen ngrams, we can to do a back-off estimation using the probability of

lower order tokens, which is usually more robust in the model. For a ngram w1, ...,wn

that doesn’t appear in training set, we estimate the probability of n-1 gram w2, ...,wn

instead:

P(wn|w1, ...,wn−1) =

p(wn|w1, ...,wn−1) if found (w1, ....,wn)

λ(w1, ....,wn−1)∗ p(wn|w2, ...,wn−1) otherwise(2.4)

where λ(w1, ...,wn−1) is the back-off weight. In general, back-off requires more

lookups and computations. Modern back-off smoothing techniques like Kneser-Ney

smoothing [5] use more parameters to estimate each ngram probability instead of a

simple Maximum Likelihood Estimation.

To evaluate the language model, we can calculate the cross entropy and perplexity

for testing ngram. A good language model should give a higher probability for one

ngram prediction. The cross entropy is the average entropy of each word prediction in

the ngram:

H(pLM) = −1n

pLM(w1,w2, ...,wn)

= −1n

n

∑i=1

logpLM(wn|w1, ...,wn−1) (2.5)

Perplexity is defined based on cross entropy:

PP = 2H(pLM)

= 2−1

n

n

∑i=1

logpLM(wn|w1, ...,wn−1)(2.6)


Generally speaking, perplexity is the average number of choices at each word pre-

diction. So the lower perplexity we get, the higher probability of the prediction is,

meaning a better language model.

2.2 Distributed Language Modeling

large scale distributed language modeling is quite a new topic. Typically using a

server/client diagram as Figure 2.1, we need to efficiently manipulate on large data sets,

to communicate with cluster workers and organise their works. The server controls

and distributes tasks to all the workers, and clients send requests to server to execute

queries or commands. Relative ideals can be seen in Distributed Information Retrieval

[7]. Recent works include a method to split a large corpus into many non-overlapping

chunks, and make each worker in the cluster load one of the chunks with its suffix

array index [8]. The use of suffix array index helps to quickly find the proper context

we want, and we can count the words occurrence in each worker simultaneously. In

this kind of approach, the raw word counts are stored and served in the distributed

system, and the clients collect needed counts and then compute the probability. The

system is applied on N-best list re-ranking problem and shows a nice improvement on

a 2.97 billion-word corpus.

Figure 2.1: Server/Client Architecture

A similar architecture is proposed later with interpolated models [4]. The corpus


is also split into chunks along with their suffix array index. Also a smoothed n-gram

model is computed and stored separately in some other workers. Then the client re-

quests both the raw word counts and the smoothed probabilities from different workers,

and computes the final probability by linear weighted blending. The authors apply this

approach on N-best list re-ranking and integrate with machine translation decoder to

show good improvement in translation quality trained on 4 billion words.

Previous two methods solve the problem to store large corpus and provide word

counts for clients to estimate probabilities. Later works [1] describe a method to store

only the smoothed probabilities for distributed n-gram models. In previous methods,

the client needs to look up each worker to find proper context using suffix array index.

On the other hand this method result in exactly one worker being contacted per n-gram,

and in exactly two workers for context-depended backoff [1]. The authors propose a

new backoff method called Stupid Backoff, which is a simpler scheme using only the

frequencies and an empirical chosen back-off weight.

P(wn|w1, ....,wn−1) =

f (w1, ...,wn)

f (w1, ....wn−1)if f (w1, ....,wn) > 0

α∗ f (w2, ...,wn)f (w2, ....wn−1)

otherwise

(2.7)

where α is set to 0.4 based on their earlier experiments [1]. The experiments di-

rectly store 5-grams model using different sources. According to their results, 1.8T

tokens from web include a 16M vocabulary, and generate a 300G n-grams with a 1.8T

language model with their Stupid Backoff.

These previous works are the theoretical principles of our project. The last work

used Google’s distributed programming model MapReduce [3], but the authors didn’t

describe the way to store the data. We adapt the MapReduce model for our work, be-

cause it contains a clear workflow and has already been proved as a mature model and

widely used in various applications by Google, Yahoo and other companies. Although

Google’s own implementation of MapReduce is proprietary, we can still choose open

source implementation of this model.

2.3 Hadoop MapReduce Framework

Hadoop is an open source implementation of MapReduce programming model.

It is based on Java and uses the Hadoop Distributed File System (HDFS) to create

multiple replicas of data blocks for reliability, distributing them around the clusters

and splitting the task into small blocks. According to their website, Hadoop has been

demonstrated on 2,000 nodes and is designed up to support 10,000 node clusters, so it

enables us to extend our clusters in the future.

A general MapReduce architecture1can be illustrated as Figure 2.2. At first the

input files are split into small blocks named FileSplits, and the Map operation is created

parallelized with one task per FileSplit.

Figure 2.2: MapReduce Architecture

Input and Output types of a Map-Reduce job:

(input) <k1, v1>→map→<k2, v2>→ combine→<k2, v2>→ reduce→<k3,

v3> (output)

The FileSplit input is treated as a key/value pair, and user specifies a Map func-

tion to process the key/value pair to generate a set of intermediate key/value pairs [3].

When the Map operation is finished, the output is passed to Partitioner, usually a hash

function, so all the pairs shared with same key can be collected together later on. After

the intermediate pairs are generated, a Combine function is called to do a reduce-like

job in each Map node to speed up the processing. Then a Reduce function merges

all intermediate values associated with the same intermediate key and writes to output

files. Map and Reduce operations are working independently on small blocks of data.

The final output will be one file per executed reduce task, and Hadoop stores the output

files in HDFS.

For input text files, each line is parsed as one “value” string, so the Map function

is processing at the sentence level; for output files, the format is one key/value pair per

one record, and thus if we want to reprocess the output files, the task is working at the

record pair level. For language model training using MapReduce, our original inputs

1http://wiki.apache.org/hadoop/HadoopMapReduce

are text files, and what we will finally get from Hadoop are ngram/probability pairs.

It means that, theoretically we can only use Hadoop to build language model. This

chapter gives a brief overview of ngram language models, the smoothing and back-off

methods, the relevant works on distributed language modeling, the Hadoop framework

and Hbase database.

2.4 Hbase Database

Hbase is the distributed database on top of HDFS, whose structure is very similar to

Google’s Bigtable model. The reason we look into using Hbase with language model

is, although we can do our work all in Hadoop and store all outputs as text or binary

files in HDFS, it will consume more time and extra computations for parsing input

records. The bottleneck of Hadoop MapReduce is that the input/output is file based,

either plain text or binary file, which is reasonable for storing large amount of data

but not suitable for query process. For example, if we want to query probability for

one ngram, we have to load all the files into map function, parse all of them, and

compare the key with ngram to find the probability value. Basically we need to do this

comparison for each ngram in test texts, which will cost quite a long time.

Instead of parsing files, we can make use of a database structure such as Hbase,

to store the ngram probabilities in database table. The advantage is obvious: database

structure is designed to meet the needs of multiple queries; the data is indexed and

compressed, reducing the storage size; tables can be easily created, modified, updated

or deleted. As in distributed database, the table is stored in distributed filesystem,

providing scalable and reliable storage. Meanwhile, for language modeling, the model

data is highly structured. The basic format is ngram/probability pair, which can be

easily constructed into more organised and compact structures. Considering we may

get huge amount of data, the compressed structures are essential from both time and

storage aspects.

Hbase stores data in labelled tables. The table is designed to have a sparse struc-

ture. Data is stored in table rows, and each row has an unique key with arbitary

number of columns. So maybe one row can have thousands of columns, while an-

other row can have only one column. In addition, a column name is a string with the

<family>:<label> form, where <family> is a column family name assigned to a

group of columns, and label can be any string you like. The concept of column fami-

lies is that only administrative operations can modify family names, but user can create


arbitary labels on demand. A conceptual view of Hbase table can be illustrated below:

Row Key Column color: Column prize: Column size:

car color:red Mini Cooper medium

color:green Volkswagen Beetle small

BMW 2.6k

... ... ...

Another important aspect of Hbase is that the table is column oriented, which

means physically the tables are stored per column. Each column is stored in one file

split, and each column family is stored closely in HDFS, all the empty cells in a col-

umn are not stored. This feature actually implies that in Hbase it is less expensive to

retrieve a column instead of a row, because to retrieve a row the client must request all

the column splits, meanwhile to retrieve a column only one column split is requested,

which is basically one single file in HDFS. On the other hand, writing a table is row

based. Only one row is locked for updating, so all the writes among clusters can be

atomic by default.

The relationship between Hadoop, Hbase and HDFS can be illustrated as Figure

2.3:

Figure 2.3: Relationship between Hadoop, Hbase and HDFS

Above overviews are the basic techniques and tools used in this project. The next

chapter describes our methods to build up language modeling process for ngram model

with smoothing and back-off using Hadoop and Hbase.

Chapter 3

Estimating Language Model using

Map Reduce

The distributed training process described in Google’s work[1] is split into three

steps: convert words to ids, generate ngrams per sentence, and compute ngram prob-

abilities. We extend it to do a Good-Turing smoothing estimation, so extra steps are

included to calculate the count of ngram counts, store the count in HDFS, and then

try to fetch the data to adjust the raw counts. We decide to store the ngram string di-

rectly, considering more training data can be added and updated later on. We estimate

a back-off model to compute the probability, and several database table structures are

designed for comparison. The testing process is also using MapReduce, acting like a

distributed decoder, so we can process mulitiple testing texts together.

Figure 3.1 is the flow chart for training process. Figure 3.2 shows a simplified

testing process. The rest of this chapter explains the details of each step in the training

and testing process.

3.1 Generate Word Counts

The first step is to parse training text, find all the ngrams, and emit their counts.

The map function reads one text line as input. The key is the docid, and the value is

the text. For each line, it is split into all the unigrams, bigrams, trigrams up to ngrams.

These ngrams are the intermediate keys, and the values are a single count 1. Then a

combiner function sums up all the values for the same key within Map tasks. And a

reduce function same as the combiner collects all the output from combiner, sums up

values for the same key. The final key is the same with map function’s output, which

10

Chapter 3. Estimating Language Model using Map Reduce 11

Figure 3.1: Training Process

Figure 3.2: Testing Process

is the ngram, and the value is the raw counts of this ngram throughout the training data

set. A partitioner based on the hashcode of first two words is used, which makes sure

that not only all the values with a same key goes into one reduce function, but also the

average load is balanced [1]. Also we include a pruning count, any raw counts below

this pruning count will be dropped. The simplified map and reduce functions can be

illustrated as below:

map(LongWritabe docid, Text line,

OutputCollector<Text, IntWritable> output) {

words[] = line.split(blank space or punctuation)

for i=1..order

for k = 0..(words.length - i)

output.collect(words[k..(k+i-1)], 1)

}

reduce(Text key, Iterator<IntWritable> values,

OutputCollector<Text, IntWritable> output ) {

int sum=0;

while (values.hasNext())

sum += values.next().get();


if sum > prune

output.collect(key, sum);

}

Figure 3.3 describes the process of map - combine -reduce functions given some

sample text. The combine function starts right after map, so it inherits the same key/-

value pairs from its previous map task. The output from reduce function is the raw

counts, also the keys are sorted, implying some index processes. This feature can be

used in enhancements for fast lookups.

Figure 3.3: Generate Raw Counts

Because this step generates all the ngrams, it is possible to collect the total number

of unigram counts, bigram counts etc. These numbers are necessary for smoothing

techniques. Here we only collect the unigram counts for Good-Turing smoothing. It is

easy to collect the total bigram or trigram counts in the similar way, which is be needed

for Kneser-Ney smoothing.

enum MyCounter {INPUT_WORDS};


OutputCollector<Text, IntWritable> output,

Reporter reporter) {

.....

if sum>prune

output.collect(key, sum);

if key is unigram

reporter.incrCounter(MyCounter.INPUT_WORDS, 1);

}

3.2 Generate Count of Counts

Good-Turing smoothing is based on the count of ngram counts. We need to collect

all the count of counts for unigram, bigram, trigram up to ngram. To do this, all the

raw counts are loaded into a new MapReduce job. For each ngram, the map function

emits one count of the raw counts along with the ngram order, so the output key has

the format of <ngram- order raw-counts>, and the value is <count-of-counts>. The

combine and reduce function merge all the count of counts with the same key. The final

output should be fairly small, normally a single file is enough to store all the count of

counts. The simplified map function is like this:

// input key is ngram, input value is the raw counts

public void map(Text key, IntWritable value,

OutputCollector<IntWritable, IntWritable> output){

words[] = toStringArray(key);

// words.length is the ngram order

// combine the order with the counts

String combine = words.length + value.toString();

output.collect(toText(combine), one);

}

The combine and reduce function are the same with previous step. An example of

the MapReduce job is illustrated as Figure 3.4.

One important point about the count of counts is that for higher counts, the count of

counts is usually incontinuous. This always happens for frequent ngrams, which have

a quite high number of counts. We don’t have to store or estimate these empty count

of counts, instead we can adjust the smoothing formula being used in the next step.


Figure 3.4: Generate Count of Counts

3.3 Generate Good-Turing Smoothing Counts

Now we have both the raw counts and the count of counts, following the Good-

Turing smoothing formula, we can estimate the smoothed counts for each ngram. In

this step, the inputs are still the ngrams and their raw counts, and each map function

reads in the count of counts file, stores all the data in a HashMap structure and com-

putes the smoothed counts. The basic formula is :

r∗= (r +1)Nr+1

Nr(3.1)

if we can find both Nr+1 and Nr in the HashMap, then the above formula can be

applied directly. Otherwise, we try to look up the “nearest” count of counts, e.g. if we

can’t find Nr+1 , we try to look up Nr+2, then Nr+3, Nr+4 ..etc. We decide to look up at

most 5 counts Nr+1 .. Nr+5. If we can’t find any of these counts, we use the raw counts

intead. In this situation, the raw counts are typically very large, meaning the ngram has

a relatively high probability, so we don’t have to adjust the counts. For each ngram,

the smoothing process is needed only once, so we actually don’t need any combine or

reduce function for count smoothing. The map function can be illustrated as:

HashMap countHash;

void configure(){

reader = openFile(count-of-counts);

//k is the ’ngram-order ngram-counts’, v is the count of counts

while (reader.next(k, v)) {

order = k[0];

counts = k[1..length-1];

countHash.put(order, (counts, v));

}

reader.close();

}

//input key is the ngram, value is the raw counts

void map(Text key, IntWritable value,

OutputCollector<Text, Text> output) {

for i=1..5

if exists countHash.get(key.length).get(value) &&

exists countHash.get(key.length).get(value+i) {

Nr= countHash.get(key.length).get(value);

Nr1= countHash.get(key.length).get(value+i);

c=(value+1)*Nr1/Nr;

break;

}

else

c=value;

//c is the smoothed counts for ngram ’key’

}

Figure 3.5 shows an example of the Good-Turing smoothing.

3.4 Generate Ngram Probability

To estimate the probability of one ngram w1,w2, ...,wn, we need the counts of

w1..wn and w1..wn−1. Because one MapReduce function, either map or reduce, is

working based on one key, in order to collect both above counts, we need to emit

the two counts to one reduce function. The approach is to use the current context

w1,w2, ...,wn−1 as the key, and combine the current word wn with the counts as the

Figure 3.5: Good-Turing Smoothing

value. This can be done in the map function of previous step. Notice that except for

the highest order ngram, all lower order ngrams also need to be emitted with their own

smoothed counts, providing the counts for being context as themselves. Figure 3.6

gives an example of the emitted context with current word format.

...

//c is the smoothed counts for ngram ’key’

//suffix is the current word, the last item in string array

suffix = key[length-1];

context = key[0..length-2];

output.collect(context, (’\t’+suffix+’\t’+c));

if order< highest order

output.collect(key, c);

...

Now the reduce function can receive both the counts of the context and the counts

with current word, and computes the conditional probability for ngram based on the

formula:

p(wn|w1, ...,wn−1) =f (w1, ...,wn)

f (w1, ...wn−1)(3.2)

//the input k is the context

void reduce(Text k, Iterator v ) {


Figure 3.6: Emit context with current word

while (v.hasnext) {

value=v.get;

//if has current word

if (value starts with ’\t’) {

get current word w

get counts C

}

//if no current word

else

get base counts Cbase

//compute the probability

prob= C/Cbase;

if prob > 1.0

prob=1.0;

}

...

}

After Good-Turing smoothing, some counts might be quite small, so the probability

might be over 1.0. In this case we need to adjust it down to 1.0. For a back-off model,

we use the simple scheme proposed by Google [1], in which the back-off weight is

set to 0.4. The number 0.4 is chosen based on empirical experiments, and is proved

as a stable choice in previous works. If we want to estimate dynamic back-off weight


Figure 3.7: Estimate probability

for each ngram, more steps are required but people have discussed that the choice of

specific back-off or smoothing method is less relevant as the training corpora becomes

large [4].

In this step, we get all the probabilities for ngrams, and with the back-off weight,

we can estimate probability for testing ngram based on queries. So the next important

step is to store these probabilities in the distributed environment.

3.5 Generate Hbase Table

Hbase can be used as the data input/output sink in Hadoop MapReduce jobs, so

it is straight-forward to integrate it within previous reduce function. Essential modi-

fications are needed, because writing Hbase table is row based, each time we need to

generate a key with some context as the column. There are several different choices,

either a simple scheme of each ngram per row, or more structured rows based on cur-

rent word and context. Two major aspects are considered: the write/query speed and

the table storage size.

3.5.1 n-Gram Based Structure

An initial structure is very simple, similar to the text output format. Each n-gram

is stored in a separate row, so the table has a flat structure with one single column. For

one row, the key is ngram itself, and the column stores its probability. Table 3.1 is an

example of this structure. This structure is easy to implement and maintain, yet it is

only sorted but uncompressed. Considering there might be lots of same probabilities

among different ngrams, e.g. higher order ngrams always appear once, the table is not

capable to represent all the same probabilities in an efficient way. Also since we only

have one column, the advantages of distributed database, arbitary sparse columns, have

less effects for this table.

key column family:label

(gt:prob)

a 0.11

a big 0.67

a big house 1.0

buy 0.11

buy a 0.67

... ...

Table 3.1: n-Gram based table structure

3.5.2 Current Word Based Structure

Alternatively, for all the ngrams w1,w2, ...,wn that share the same current word wn,

we can store them in one row with the key wn. All the possible context are stored in

seperate columns with the column name format <column family:context>. Table 3.2

is an example. This table is a sparse column structure. For some word with lots of

context, the row could be quite long, while for some word with less context the row

could be short. In this table we reduce the number of rows, and expand all context

into seperate columns. So intead of one single column split, we have lots of small

column splits. From the view of distributed database, the data is sparsely stored, but

from the view of data structure, it is still somewhat uncompressed, e.g. if we have two

current words in two rows, and both of them have the same context, or have the same

probability for some context, we still store them seperately. This results in multiple

column splits that have very similar structures, which means some kind of redundance.

We also need to collect the unigram probability for each current word and store it

in a separate column. The process to create table rows can be illustrated as:

//k is the ngram, c is the probability

gt:unigram gt:a gt:a big gt:i gt:buy ..

a 0.15 0.667

big 0.057 0.667

house 0.3 0.667 1.0

buy 0.15 1.0

... ... ... ... ... ... ...

Table 3.2: Word based table structure

for each ngram (k, c) {

word=k[length-1];

if k.length == 1

column="gt:unigram";

else{

context=k[0..length-2];

column="gt:context";

}

output.collect(w, (column, c) );

}

3.5.3 Context Based Structure

Similar to current word based structure, we can also use the context w1,w2, ...,wn−1

as the key for each row, and store all the possible following words wn in seperate

columns with the format <column family:word>. This table will have more rows

compared with word based table, but still less than ngram based table. For large data set

or higher order ngrams, the context can be quite large, on the other hand the columns

can be reduced. The column splits are separated by different words, and for all words

that only occur once, the split is still very small - only one column value per split.

Generally speaking, if we have n unigrams in the data set, we will have around n

columns in the table. For a training set containing 100 million words, the number of

unigram is around 30,000 , so the table could be really sparse. Example of this table

structure is illustrated as Table 3.3.

To avoid redundancy, only unigram keys are stored with their own probabilities in

gt:unigram gt:a gt:big gt:i gt:buy gt:the gt:house ..

a 0.11 0.67 0.67

a big 1.0

buy 0.11 0.67 0.67

buy a 1.0

i 0.04 1.0

... ... ... ... ... ... ... ... ...

Table 3.3: Context based table structure

<gt:unigram>, since we can also store the probability of “a big” in <gt:unigram>

which is redundant with row “a” column <gt:big>.



context=k[length-1];

if k.length == 1


else{

word=k[length-1];

context=k[0..length-2];

column="gt:word";

}

output.collect(context, (column, c) );

}

Since there may be many columns that only appear once and have the same value,

typically seen in higher order ngrams, possible compression can be made to combine

these columns together, reducing the column splits.

3.5.4 Half Ngram Based Structure

For the previous two structures, we can get either large number of rows, or large

number of columns. So there is a possible trade off between the rows and the columns.

We can combine the word based and context based structure together, balancing the

number of rows and columns. Our method is to split the n grams into n/2 grams, and

use n/2 gram as the row key, the rest n/2 gram as the column label. For example, for a

4-gram model (w1,w2,w3,w4), the row key is (w1w2), and the column is <gt:w3w4 >.

An example for 4-gram model is illustrated as Table 3.4:


gt:unigram gt:a gt:big gt:house gt:big house gt:new house ..

a 0.11 0.67 0.67

a big 1.0 0.01

buy 0.11 0.67

buy a 1.0 0.04

i 0.04

... ... ... ... ... ... ... ...

Table 3.4: Half ngram based table structure

For higher order ngrams, this structure can reduce lots of rows and insert them as

columns. Theoretically the costs of splitting ngram into word-context and context-

word are identical, but n/2 gram-n/2 gram will require a bit more parsing. Similar to

previous method, the process can be written as:



half = int(length/2);

context = k[0..half];

if k.length == 1


else{

word=k[half..length-1];

column="gt:word";

}


}


3.5.5 Integer Based Structure

Instead of store all the strings, we can also convert all the words into integers, and

store the integers in table. Extra steps are needed to convert each unigram into an

unique integer number and keep the convert unigram-integer map in the distributed

filesystem. The advantage of using integers is the size will be smaller than strings,

because of storing a long length string with a single integer number. But on the other

hand we need another encoding/decoding process to do the converting, and this results

in more computational time. This method is a trade off between computational time

and the storage size. Also this structure can be combined with previous methods for

better compressions.

For the simple ngram per row scheme, integer based structure can be illustrated as

Table 3.5:

unigram integer

a 1

big 2

house 3

buy 4


(gt:prob)

1 0.11

1 2 0.67

1 2 3 1.0

4 0.11

4 1 0.67

... ...

Table 3.5: Integer based table structure

Notice that if we store “a big” as 12, it might conflict with another word with

number 12 in the convert map. So we have to add a space between the numbers,

similar to the ngram strings.

We need an extra MapReduce job to reprocess the raw counts, convert unigram into

integer, and store in a table:

//extra step to convert unigram into unique integer

//input key is the ngram, value is the raw counts

int id;

column=’’convert:integer’’;

map(Text k, IntWritable v,

OutputCollector<Text, MapWritabe> output) {

if (k.length==1) {

id++;

output.collect(k, (column, id));

}

}

...

Then we can query the convert table to store integer instead of strings:

column=’’convert:integer’’;


for each ngram (k,c) {

for each k[i] in k

//query the convert table or file,

//find the integer for each k[i]

intK[i]=get(k[i], column);

column="gt:prob";

row = combineInteger(intK);


}

3.6 Direct Query

The next process is the testing function. The actual back-off is performed here in

the query. Based on the back-off model, for each testing ngram, we need to query the

ngram, if not found, then find n-1 gram, until we reach unigram. For different table

structures, we just need to generate different row and column names. The advantage of

using MapReduce for the testing is that, we can put multiple testing texts into HDFS,

and a MapReduce job can process all of them to generate the raw counts, just the

same as we have done in the training process, then for each ngram with its counts,

we directly estimate the probability using back-off model, and multiply by the counts.

In such a method, each different ngram is processed only once, which speeds up the

whole process especially for lower order ngrams.

We call this method Direct Query because we query each ngram directly from

Hbase table, so the more testing ngrams we have, the more time it will cost. Also the

perplexity of the estimation is computed and collected as an evaluation value for the

language model.

An example using the simple ngram based table structure is:

//the job for raw counts

//input key is the docid, value is one line in the text

map(LongWritabe docid, Text line,

OutputCollector<Text, IntWritable> output) {

words[] = line.split(blank space or punctuation)

for i=1..order

for k = 0..(words.length - i)

output.collect(words[k..(k+i-1)], 1)

}


OutputCollector<Text, IntWritable> output ) {

...

}

//the job for estimating probability

//the input key is the ngram, value is the raw counts

column=’’gt:prob’’;

map(Text k, IntWritable v,

OutputCollector<Text, FloatWritable> output) {

// a back-off calculation

row=k.toString;

while (finished==false) {

value=table.get(row, column);

if (value != null) {

//find the probability

prob=alpha*value;


finished=true;

} else {

//unseen unigram

if k.length==1 {

prob=unseen_prob;

finished=true;

}

else {

//back-off to n-1 gram

ngram=row;

row=ngram[0..length-2];

alpha=alpha*0.4;

finished=false;

}

}

}

//now we have the probability in prob

count+=v;

if prob >1.0

prob = 1.0

H+=v*log(prob)/log(2);

output.collect(k, prob);

}

//compute the total perplexity

void close(){

perplexity = pow(2.0, -H / count);

...

}

Figure 3.8 is an example of this direct query process. If the estimated probability is

above 1.0, we need to adjust it down to 1.0. The map function collects the total counts

and computes the total perplexity in the overrided close method. As a reference, the

probability for each ngram is collected in HDFS as the output of final reduce job.


Figure 3.8: Direct Query

3.7 Caching Query

There is a little more we can do to speed up the query process. Considering queries

for different ngram that share the same n-1 gram, in a back-off model we query the

ngram first, then if not found, we move on to n-1 gram. Suppose we need to back-off

for each ngram, the n-1 gram will be requested for multiple times. Here is where the

caching steps in. For each back-off step, we store the n-1 gram probability as HashMap

in memory inside the working node. Everytime when node comes to a new n-1 back-

off query, it will first look up in the HashMap, if not found, then ask the Hbase table,

and add the new n-1 gram into HashMap.

The simplified scheme can be written as:

...

//k is the ngram

HashMap cache;

while not finished {

prob = table.get(k, column);

if prob != null

found probability, finished

else {

let row be the n-1 gram from k

if cache.exists(row)

prob = cache.get(row)

else {

prob = table.get(row, column);

if prob !=null {

found probability, finished

if number of cache.keys < maxlimit

cache.add(row, prob);

else

cache.clear();

cache.add(row, prob);

}

}

}

}

...

Figure 3.9 is an example of this caching query process. We don’t need to store

probabilities for ngrams, only the n-1 grams. Also there is a maximum limit of the

number of keys in the HashMap. We can’t store all the n-1 grams into HashMap,

otherwise it will become huge and eat up all the working node’s memory. So we

only store up to maxlimit n-1 grams, and when counts are over the limit, the previous

HashMap is dropped and filled in new items. It is like an updating process, and another

alternative is to delete the first key in HashMap and push in new one.

Above methods establish the whole process of distributed language model training

and testing. The next chapter describes the evaluation methods we use.


Figure 3.9: Caching Query

Chapter 4

Evaluation Methods

Our major interest is to explore the computational and storage cost building dis-

tributed language model using Hadoop MapReduce framework and Hbase distributed

database. The evaluation focuses on the comparison of time and space using different

table structures. Also as language model the perplexity for a testing set is evaluated

and compared with traditional language modeling tools.

4.1 Time and Space for Building LM

There are two major processes we can evaluate, the training process and testing

process. Because we are experimenting different table structures, we split the training

process into two steps: the first one is generating raw counts and collecting Good-

Turing smoothing parameters, the second one is generating table. Apparently the first

step is the same for all different tables, so we can focus on the second step for compar-

ison.

The comparison of time cost is based on the average value of program running time

taken from multiple runs to avoid deviations. Network latency and other disturbance

may vary the result, but the error should be remaining in 1-2 minutes level. The pro-

gram calculates its running time by comparing the current system time before and after

the MapReduce job is submitted and executed. Each table structure is compared with

the same ngram order, also different ngram order for one table is compared to see the

relationship between the order and the time cost in each method.

To collect the size of the language model, we can use command-line programs

provided in Hadoop framework. Because the Hbase table is actually stored in HDFS

filesystem, we can directly calculate the size of the directory created by Hbase. Typ-

31

Chapter 4. Evaluation Methods 32

ically, Hbase creates a directory called “/hbase” in HDFS, and creates a sub directory

with the name of the table, e.g. if we create a table named “trigram”, then we can

calculate the size of the directory “/hbase/trigram” as the size of the table. It is not

a perfectly accurate estimation of the table data because the directory contains other

meta info files but since these files are usually quite small, we can estimate them to-

gether. Another point of view is that the table has two parts: the data and the info, so

we can calculate these two parts together.

4.2 LM Perplexity Comparison

For a language model, the perplexity for testing set is a common evaluation to see

how good the model is. The perplexity means the average choice for each ngram.

Generally speaking, the better model it is, the lower perplexity it will get. The order

of ngram also affects the perplexity for the same model. For a normal size training

set, higher order ngram model will always get lower perplexity. Meanwhile, the more

training set we have, the better model we will get, so the perplexity will become lower.

We can also compare the distributed language model with traditional language

modeling tools like SRILM [6]. SRILM is a rich functional package for building and

evaluating language models. The package is written in C++, and because it runs lo-

cally in single computer, the processing speed is fast. The shortcoming of SRILM is

it will eat up memory and even overflow when processing huge amount of data. Still

we can compare SRILM with our distributed language model on the same training

set which is not so huge. The smoothing methods need to be nearly the same, e.g.

Good-Turing smoothing. The specific parameters may vary, but the similar smoothing

method can show whether the distributed language model is stable enough compared

with traditional language modeling package.

Applying these evaluation methods, the experiments and results are illustrated in

the next chapter.

Chapter 5

Experiments

The experiments are done in a cluster environment with 2 working nodes and 1

master server. The working nodes are running Hadoop , HDFS and Hbase slaves, and

the master server controls all of them. For each experiment, we repeat three times and

choose the average value as the result. The results are shown as figures and tables

for different ngram orders of all the table structures, the time and space cost, also the

perplexity for testing set.

We first compare the size and number of unique ngrams for training and testing

data taken from a 100 million-word corpus. The time and space cost for generating raw

counts, generating table structures, finally the testing queries are compared separately.

We also compare the unigram, bigram and trigram models for each step. All of the 5

table structures we proposed are compared on time cost for both training and testing

process, as well as the table size of these different structures. Finally the perplexity for

each ngram order is compared with SRILM, and the model data size from SRILM is

calculated as a reference.

5.1 Data

The data is taken from British National Corpus, which is around 100 million words.

We choose some random texts from the corpus as the testing set. The testing set is

about 35,000 words. All the remaining texts are used as training set.

The corpus is constructed as one sentence per line. For completeness we include

all the punctuation in the sentence as the part of ngam, e.g. <the house’s door> is

parsed as <the> <house> <’s> <door>.

The approximate words and file size for training and testing data is:

33

Chapter 5. Experiments 34

training data testing data

tokens 110 million 40,000

data size 541MB 202KB

Table 5.1: Data

Notice that the tokens contain both the words and the punctuation, slightly increas-

ing the total number of ngrams.

5.2 Ngram Order

We choose to train up to trigram model for the training process. A counts pruning

number of 1 is applied for all the models. Table 5.2 shows the number of unique

ngrams for each order on the training and testing data set. Figure 5.1 draws the unique

ngram number for training and testing data. We can see from the figure that the training

data has a sharper line. Considering the different number of tokens, it shows that larger

data set results in more trigrams, meaning more varieties.



unigram 284,921 7,303

bigram 4,321,467 25,862

trigram 9,090,713 33,120

Table 5.2: Ngram numbers for different order

For all the different table structures, the steps 1 & 2 of raw counting and Good-

Turing parameter estimating are identical. So we can evaluate this part first, apart from

choosing table types. The time and space costs are compared in Table 5.3 and Table

5.4. For testing process only the raw counting step is required. All the raw counting

outputs are stored in HDFS as compressed binary files.

The trend line is shown in Figure 5.2. There is a big difference between the two

lines. As we can see from Table 5.3, when the tokens are relatively small, the process-

ing time is nearly the same, but when the tokens become large, the difference between

unigram, bigram and trigram model is largely increasing. Another aspect affecting the


Figure 5.1: Unique ngram number

Figure 5.2: Time cost in raw counting



raw counting & parameter estimating raw counting


unigram 15 min 38 sec 19 sec

bigram 41 min 6 sec 20 sec

trigram 75 min 41 sec 22 sec

Table 5.3: Time cost in raw counting

time cost is that during the training process, a second MapReduce job is needed to

estimate the count of counts for Good-Turing smoothing. For this job, only one reduce

task is launched in the working nodes, producing a single output file. It is easy to know

that as the ngram order increases, the input records also become larger, requiring more

processing time.


raw counting & parameter estimating raw counting


unigram 1.5 MB 35 KB

bigram 22 MB 141 KB

trigram 49 MB 234 KB

Table 5.4: Space cost in raw counting

Table 5.4 shows the size of raw counts file for each ngram order. We can see the

size is increasing rapidly, implying the size of the table will also increase sharply.

Notice that in Table 5.4 the size is only for each ngram order, but in our table, we need

to store all the orders from unigram, bigram up to ngram. So the table’s size is the total

sum, further increasing the number. Figure 5.3 suggests that training and testing data

have the similar up going trend line, meaning the space cost is less relevant with the

number of tokens, and it is like a monotone increasing function with the ngram order.


Figure 5.3: Space cost in raw counting

5.3 Table Structures

For each of the 5 table structures, we generate the tables with unigram, bigram and

trigram model. Table 5.5 and 5.6 show the time and space cost for training process

for each order in generating table ( step 3 ); Table 5.7 and 5.8 shows the time cost for

testing process in the querying step with two different methods.

unigram bigram trigram

type 1: ngram based 5 min 40 sec 62 min 21 sec 300 min 26 sec

type 2: word based 5 min 41 sec 64 min 42 sec 200 min 30 sec

type 3: context based 6 min 40 sec 64 min 39 sec 185 min 20 sec

type 4: half ngram based 6 min 7 sec 60 min 37 sec 190 min 40 sec

type 5: integer based 16 min 26 sec >240 min >10 hours

Table 5.5: Time cost in training step 3, generating table

The size of different table structure is calculated based on the size of each directory

of the table in HDFS, including the data and meta table info. For type 5, an extra step

of converting ngram into integer costs about 5 minutes, this is added in the time cost



type 1: ngram based 2.4 MB 44 MB 173 MB

type 2: word based 2.4 MB 60 MB 185 MB

type 3: context based 2.4 MB 44 MB 135 MB

type 4: half ngram based 2.4 MB 44 MB 129 MB

type 5: integer based 4.1 MB ∼40 MB ∼120 MB

Table 5.6: Space cost of different table structures

for type 5. The size of the convertid table is around 2.4MB, this figure is also added

in the space cost for type5. Taking into account for the calculation errors caused by

network latency, transmission failure, I/O delay and error, the time cost may vary with

an error range of±1−2 minutes. Figure 5.4 and 5.5 compares the time and space cost

for these five table types.

All the tables are created with same compression option in Hbase. As we can see

from Figure 5.4 and 5.5, type 5 costs much longer time than other table types, but the

space cost reduces. For unigram models, all the first 4 tables behave similarly, which

is easy to understand because for unigram all these structures are identical. For bigram

models, the training time keeps almost the same except type 5. The space cost keeps

reducing through type 3 - 5 but type 2 slightly increases the space cost. The trend

is more obvious with trigram models. Type 2 - 4 all make good effort to reduce the

training time, type 2 slightly increases the space cost but type 3 - 5 greatly reduce the

space cost.

Figure 5.4: Time cost in training step 3, generating table


Figure 5.5: Space cost of different table structures

Another comparison can be made to see the trend as the ngram order grows, how

the time and space cost will become. Figure 5.6 and 5.7 illustrate the trend line for time

cost and space cost with different ngram orders. Because type 5 needs extra MapRe-

duce job and uses two tables, the time cost is much bigger than other types. Type 1

costs a bit more time at trigram model compared with type 2 - 4. The lowest number

is from type 4, about 5 minutes less than type 4, that is 3.1% smaller. Meanwhile, the

space cost shows a different trend. For trigram model type 1 and 2 climb up to higher

numbers while the others are smaller. The lowest number is from type 4, about 4.4%

smaller than type 3.

Figure 5.6: Time cost for each ngram order

The next step is probability query for testing data. The time costs for both direct

query and caching query are illustrated in Table 5.7 and 5.8. Type 2, the word based

structure, costs a bit more compared with type 1, 3 and 4. Similar to training process,


Figure 5.7: Space cost for each ngram order

type 5 is still heavy time consuming. Type 1 shows a good result this time, not quite

like the performance in training process. Still, we need to tolerate an error range of

±1-2 minutes because of possible interruptions like network problem etc.






type 5: integer based 10 min 39 sec > 50 min > 1.5 hour

Table 5.7: Time cost in testing, direct query

Figure 5.8 shows the differences between type 1, 3 and 4 are relatively small, con-

sidering the deviations that might affect the value. In Figure 5.9 it has the similar trend.

Comparing direct query and caching query, we can see that the caching query doesn’t

improve much on the time cost as we expected. For bigram and trigram models the

time cost is even higher than direct query. But still the comparison between different

table types is similar to Figure 5.8, type 3 and type 4 perform better, type 1 is good

while type 2 and type 5 fall behind others.

Also the perplexity of testing processes compared with SRILM is shown in Table

5.9. For the SRILM package, it uses Good-Turing smoothing as default, and the per-

plexity is computed using default parameters for unigram, bigram and trigram models.

We can see the perplexity from SRILM tools is higher than our model, especially for







type 5: integer based 11 min 2 sec > 50 min >1.5 hour

Table 5.8: Time cost in testing, caching query

Figure 5.8: Time cost in direct query

Figure 5.9: Time cost in caching query


lower ngram order. Another figure from SRILM is the size of the language model

data, which is by default 683 MB without compression, but since in Hbase we have

used compression, we try to compress the file with gzip and the compressed size is

only 161MB, which is a 76.5% compression. Although we can’t compare the original

number directly because SRILM also stores other information in that plain text file,

obviously increasing the original size, the compressed file can be compared as a refer-

ence which is under the similar situation. We can see that the table size of type 3, 4, 5

is still smaller.

Hadoop+Hbase 2544.732 599.171 517.3108

SRILM 3564.03 742.269 573.563

Table 5.9: Perplexity for testing data

From our experiments, we can see there are some efficiency differences between

different table structures and ngram orders. The next chapter discusses about the proper

choice of table structure and future works based on our results.

Chapter 6

Discussion

6.1 Conclusion

From our experiments, we think the table of half ngram based structure is a good

choice for distributed language model using Hadoop and Hbase. Considering both

the time and space cost, the ngram based structure is too simple to take advantage of

distributed features, although this flat structure is fast and easy to manipulate locally

and is widely used in traditional language modeling tools like SRILM.

The word based structure and context based structure is basically a transposed table

pair. Say we have r words with c contexts, word based structure is a table of r rows

with c maximum columns, and context based structure is a table of c rows with r max-

imum columns. So the question is which one fits better for Hbase, which is sparsely

distributed and column oriented. In theory the cost of parsing both types are the same,

because we emit the context with all the following words as intermediate keys, there

is no need for extra parsing, so we only need to compare whether larger rows or larger

columns is better. From the experiments we find that context based structure is better,

meaning that more rows is less expensive than more columns. Less columns means

less column splits, since Hbase stores each column split separately, this will reduce the

I/O requests. This may explain why the word based structure has a poor performance

on the testing process. Another aspect is the column label names. For word based

structure the labels are all the possible contexts from unigram, bigram up to n-1 gram.

Meanwhile the context based structure only has all the unigrams as the column labels.

Again because Hbase is column oriented, the table needs to maintain an index of all the

column labels, and it is better to have simpler label names, especially when we come

to higher ngram order like 5-grams.

43

Chapter 6. Discussion 44

The half ngram based structure is a trade-off between word and context based struc-

tures. It moves some of the columns in word based structure to generate new rows. So

we will have less columns than word based structure, and less rows than context based

structure. For lower ngram order, this trade-off doesn’t have significant advantages.

Compared with word and context based structure, it will cost a little more time for the

extra string parsing, which is reconstructing the intermediate keys into half ngrams.

But it benefits from the balancing table structure for the I/O time cost when writing

a table. The time cost is not the lowest, but considering as we expand the cluster to

more machines, the difference of time cost will be smaller but the more compact table

structure will hold more data and give us a better model.

The integer based structure turns out to be much less efficient than we expected.

The idea of converting string into integer comes from Google’s method to store uni-

gram counts locally on every working node. Here we try to store this mapping in the

Hbase table, considering the scalability for huge data sets. But from the experiments

we find out that the cost of manipulating two tables in Hbase is too much even for

trigram models. Still we can get the approximate table size of this structure, and it

indeed shows a decrease as we expected.

Looking at the trends, we expect the half ngram structure will perform better for

higher ngram order and more training data. The context structure is also a possible

choice. However there are many aspects that may affect the training time, e.g. the

network connection problem, the hard disk failure, the cpu load etc. The figure in

our experiments is computed as an average number for several runs, and each run

the training time is different. Generally speaking we can’t avoid the deviation when

calculating the time cost, but we can see the trends as the ngram order increases. For

the space cost, each run the table size is identical, which means the model is stable and

we can rely more on this for evaluation. The table size of half ngram based structure is

also smaller compared with SRILM model data size. From this point of view, the half

ngram structure is a proper choice both on time and space cost.

The caching query doesn’t show big improvements on the time cost. This implies

that the way to store a cache in memory and check both the memory and the Hbase

table is not efficient in a MapReduce job. Because each working node will only receive

some part of the whole data, to store previous data information locally can’t grantee

the information is indeed useful for future data, so the node will actually waste some

time on searching in the local HashMap but find nothing and still have to look up in

the Hbase table.


For the perplexity comparison, our model shows a better result than SRILM, this is

related with the algorithm implementation differences, e.g. the way to deal with unseen

unigram, the way to do pruning etc. SRILM uses a more complex way to compute

the probability for unseen unigrams, which is adding 1 to the total and redistribute

the probability for all existing unigrams, these require more steps using MapReduce

and we think it will be too expensive to add another MapReduce job to update the

existing probabilities, so we choose a simpler scheme of adding 1 to the total and not

redistributing. This works well when there are not so many unseen unigrams but results

poorly for testing data with many unseen testing data.

6.2 Future Works

There are a lot of works that can be done in the future. The Hadoop and Hbase

release newer version with lots of bug fixes and enhancements. The first thing we can

do is to upgrade our program to newer frameworks to see if there are any changes or

improvements. There are lots of aspects that maybe influence the final result, and some

of them may be less relevant but some may be quite important. We can try to adjust

some configurations in Hadoop and Hbase for a better performance, e.g. the maximum

number of map and reduce tasks, the buffer size, the data compression type etc. Also

the experiments can be extended to larger cluster with more machines, increasing the

computational power to reduce the time cost.

It will be very interesting to see and compare the performance for higher ngram

order and more data, e.g. 5-gram model with billions of words. Also, for these table

structures, possible further compression or encoding can be added. For word or context

based structure, there will be many rows that only have one column, and the column

values are the same, so it is possible to combine these rows with the same column value

as a new row. For the integer based structure, maybe storing the mapping file locally

in each working node is better than storing it in Hbase table.

Because of the different implementation of Good-Turing algorithm, our model

shows a better probability as the lower perplexity compared with SRILM tools, still

we can implement more advanced smoothing methods like Kneser-Ney using Hadoop

and Hbase for comparison. More MapReduce tasks are needed to compute parameters,

also these parameters need to be stored either in HDFS or in Hbase table along with

the ngram probability, so the table structures may need improvements, which contains

a lot of potential researches.


The output of this work can be applied in machine translation and other statistical

language processing fields, providing an open source, distributed extension for these

researches and powering large scale language models.

Bibliography

[1] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in

machine translation. In Proceedings of the 2007 Joint Conference on Empirical

Methods in Natural Language Processing and Computational Natural Language

Learning (EMNLP-CoNLL), pages 858–867, 2007.

[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows,

T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system

for structured data. In OSDI ’06, pages 205–218, 2006.

[3] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clus-

ters. In OSDI ’04, pages 137–150, 2004.

[4] A. Emami, K. Papineni, and J. Sorensen. Large-scale distributed language model-

ing. Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE Interna-

tional Conference on, 4:IV–37–IV–40, 15-20 April 2007.

[5] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling.

Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International

Conference on, 1:181–184 vol.1, 9-12 May 1995.

[6] A. Stolcke. Srilm - an extensible language modeling toolkit. In 7th International

Conference on Spoken Language Processing, pages 901–904, September 2002.

[7] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In

SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference

on Research and development in information retrieval, pages 254–261, New York,

NY, USA, 1999. ACM.

[8] Y. Zhang, A. S. Hildebrand, and S. Vogel. Distributed language modeling for n-

best list re-ranking. In Proceedings of the 2006 Conference on Empirical Methods

47

Bibliography 48

in Natural Language Processing, pages 216–223, Sydney, Australia, July 2006.

Association for Computational Linguistics.

Appendix A

Source Code

A.1 NgramCount.java

NgramCount is the class to parse the input texts, generate ngram counts and count

of counts.

package ngram ;

import j a v a . i o . IOExcep t ion ;

import j a v a . u t i l . A r r a y L i s t ;

import j a v a . u t i l . I t e r a t o r ;

import j a v a . u t i l . L i s t ;

import j a v a . u t i l . r e g e x . P a t t e r n ;

import j a v a . u t i l . r e g e x . Matcher ;

import org . apache . hadoop . con f . C o n f i g u r a t i o n ;

import org . apache . hadoop . con f . C o n f i g u r e d ;

import org . apache . hadoop . f s . F i l e S y s t e m ;

import org . apache . hadoop . f s . Pa th ;

import org . apache . hadoop . i o . I n t W r i t a b l e ;

import org . apache . hadoop . i o . LongWr i t ab l e ;

import org . apache . hadoop . i o . Text ;

import org . apache . hadoop . i o . S e q u e n c e F i l e ;

import org . apache . hadoop . i o . S e q u e n c e F i l e . Compress ionType ;

import org . apache . hadoop . mapred . J o b C l i e n t ;

import org . apache . hadoop . mapred . RunningJob ;

49

Appendix A. Source Code 50

import org . apache . hadoop . mapred . JobConf ;

import org . apache . hadoop . mapred . MapReduceBase ;

import org . apache . hadoop . mapred . Mapper ;

import org . apache . hadoop . mapred . O u t p u t C o l l e c t o r ;

import org . apache . hadoop . mapred . P a r t i t i o n e r ;

import org . apache . hadoop . mapred . Reducer ;

import org . apache . hadoop . mapred . R e p o r t e r ;

import org . apache . hadoop . mapred . C o u n t e r s ;

import org . apache . hadoop . mapred . S e q u e n c e F i l e I n p u t F o r m a t ;

import org . apache . hadoop . mapred . S e q u e n c e F i l e O u t p u t F o r m a t ;

import org . apache . hadoop . u t i l . Tool ;

import org . apache . hadoop . u t i l . ToolRunner ;

/∗ ∗∗ p a r s e t h e t e x t f i l e s t o c o u n t a l l t h e ngrams and

∗ e s t i m a t e c o u n t o f c o u n t s f o r Good−Tu r ing smoo th ing

∗∗ @author Xiaoyang Yu

∗∗ /

p u b l i c c l a s s NgramCount ex tends C o n f i g u r e d implementsTool {

p r o t e c t e d f i n a l s t a t i c S t r i n g GTPARA = ” gt−p a r a m e t e r ” ;

p r o t e c t e d f i n a l s t a t i c S t r i n g RAWCOUNTDIR = ” raw−c o u n t s

” ;

/ / u se t h e c o u n t e r t o c o l l e c t t h e t o t a l number o f

un igrams

p r o t e c t e d s t a t i c enum MyCounter {INPUT WORDS

} ;

/∗ ∗∗ t h e map c l a s s t o p a r s e each l i n e i n t o ngrams

∗

∗ /

p u b l i c s t a t i c c l a s s countMap ex tends MapReduceBase

implementsMapper<LongWri tab le , Text , Text , I n t W r i t a b l e > {

p r i v a t e i n t o r d e r ;

p r i v a t e S t r i n g words ;

p r i v a t e f i n a l s t a t i c I n t W r i t a b l e one = newI n t W r i t a b l e ( 1 ) ;

/ / r e t r i e v e ngram o r d e r from j o b c o n f i g u r a t i o n

p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {o r d e r = j o b . g e t I n t ( ” o r d e r ” , 3 ) ;

}

/ / e m i t a l l t h e unigram , bigram . . up t o ngram

p u b l i c vo id map ( LongWr i t ab l e key , Text va lue ,

O u t p u t C o l l e c t o r <Text , I n t W r i t a b l e > o u t p u t ,

R e p o r t e r r e p o r t e r )

throws IOExcep t i on {S t r i n g l i n e = v a l u e . t o S t r i n g ( ) ;

i f ( l i n e == n u l l | | l i n e . t r i m ( ) . l e n g t h ( ) == 0)

re turn ;

P a t t e r n m y P a t t e r n = P a t t e r n . compi l e ( ”\\w+ |\\ p{ Punc t

}+” ) ;

Matcher myMatcher = m y P a t t e r n . ma tche r ( l i n e ) ;

S t r i n g B u f f e r sb = new S t r i n g B u f f e r ( ) ;

whi le ( myMatcher . f i n d ( ) ) {sb . append ( myMatcher . group ( ) ) . append ( ”\ t ” ) ;

}l i n e = sb . t o S t r i n g ( ) ;

S t r i n g [ ] c u r r e n t = l i n e . s p l i t ( ”\\ s+” ) ;

i n t i = 0 ;

i f ( o r d e r > 1)

c u r r e n t = ( ”<s> ” + l i n e ) . s p l i t ( ”\\ s+” ) ;

words = ” ” ;

f o r ( i = 1 ; i <= o r d e r ; i ++) {f o r ( i n t k = 0 ; k <= c u r r e n t . l e n g t h − i ; k ++) {

words = c u r r e n t [ k ] ;

i f ( words . l e n g t h ( ) != 0 ) {f o r ( i n t j = k + 1 ; j < k + i ; j ++)

words = words + ” ” + c u r r e n t [ j ] ;

o u t p u t . c o l l e c t ( new Text ( words ) , one ) ;

}}

}}

}

/∗ ∗∗ A combiner c l a s s t h a t j u s t e m i t s t h e sum o f t h e

∗ i n p u t v a l u e s .

∗ /

p u b l i c s t a t i c c l a s s countCombine ex tends MapReduceBase

implementsReducer<Text , I n t W r i t a b l e , Text , I n t W r i t a b l e > {

p u b l i c vo id r e d u c e ( Text key , I t e r a t o r 

v a l u e s ,



throws IOExcep t i on {i n t sum = 0 ;

whi le ( v a l u e s . hasNext ( ) ) {sum += v a l u e s . n e x t ( ) . g e t ( ) ;

}o u t p u t . c o l l e c t ( key , new I n t W r i t a b l e ( sum ) ) ;

}}

/∗ ∗∗ A r e d u c e r c l a s s t h a t j u s t e m i t s t h e sum o f t h e i n p u t

∗ v a l u e s , a l s o do t h e c o u n t s prun ing , c o l l e c t t h e

∗ t o t a l number o f un igrams .

∗ /

p u b l i c s t a t i c c l a s s countReduce ex tends MapReduceBase


p r i v a t e i n t prune ;

p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {prune = j o b . g e t I n t ( ” p rune ” , 0 ) ;

}


v a l u e s ,





}

i f ( sum > prune ) {o u t p u t . c o l l e c t ( key , new I n t W r i t a b l e ( sum ) ) ;

/∗ ∗∗ c a l c u l a t e t h e t o t a l number , o n l y c o u n t s

∗ unigram , i g n o r e <s>

∗ /

S t r i n g [ ] c u r r e n t = key . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) ;

i f ( c u r r e n t . l e n g t h == 1 && ! c u r r e n t [ 0 ] . e q u a l s ( ”<s

>” ) )

r e p o r t e r . i n c r C o u n t e r ( MyCounter . INPUT WORDS , sum

) ;

}}

}

/∗ ∗∗ a p a r t i t i o n c l a s s t o b a l a n c e t h e load based on f i r s t

∗ two words .

∗∗ /

p u b l i c s t a t i c c l a s s c o u n t P a r t i t i o n ex tendsMapReduceBase implements

P a r t i t i o n e r <Text , I n t W r i t a b l e > {p u b l i c i n t g e t P a r t i t i o n ( Text key , I n t W r i t a b l e va lue ,

i n t n u m P a r t i t i o n s ) {S t r i n g [ ] l i n e = key . t o S t r i n g ( ) . s p l i t ( ” ” ) ;

S t r i n g p r e f i x = ( l i n e . l e n g t h > 1) ? ( l i n e [ 0 ] + l i n e

[ 1 ] ) : l i n e [ 0 ] ;

re turn p r e f i x . hashCode ( ) % n u m P a r t i t i o n s ;

}}

/∗ ∗∗ t h e second t a s k i s t o c o u n t t h e count−of−c o u n t s f o r

∗ Good−T u r i n g s m o o t h i n g .

∗∗ a map c l a s s t o p a r s e t h e raw c o u n t s and e m i t c o u n t

∗ o f c o u n t s w i t h t h e ngram o r d e r

∗ /

p u b l i c s t a t i c c l a s s MapB ex tends MapReduceBase

implementsMapper<Text , I n t W r i t a b l e , I n t W r i t a b l e , I n t W r i t a b l e >

{p r i v a t e f i n a l s t a t i c I n t W r i t a b l e one = new

I n t W r i t a b l e ( 1 ) ;

p u b l i c vo id map ( Text key , I n t W r i t a b l e va lue ,

O u t p u t C o l l e c t o r o u t p u t ,

R e p o r t e r r e p o r t e r ) throws IOExcep t i on {i f ( ! key . t o S t r i n g ( ) . e q u a l s ( ”<s>” ) ) {

S t r i n g [ ] r e s u l t = key . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) ;

S t r i n g combine = r e s u l t . l e n g t h + v a l u e . t o S t r i n g ( )

;

o u t p u t . c o l l e c t ( new I n t W r i t a b l e ( I n t e g e r . p a r s e I n t (

combine ) ) , one ) ;

}}

}

/∗ ∗∗ a re du c e c l a s s t o e m i t t h e sums o f i n p u t v a l u e s

∗∗ /

p u b l i c s t a t i c c l a s s ReduceB ex tends MapReduceBase

implementsReducer {

p u b l i c vo id r e d u c e ( I n t W r i t a b l e key , I t e r a t o r v a l u e s ,

O u t p u t C o l l e c t o r o u t p u t ,

R e p o r t e r r e p o r t e r ) throws IOExcep t i on {i n t sum = 0 ;



}}

s t a t i c i n t p r i n t U s a g e ( ) {System . o u t . p r i n t l n ( ” usage : [ OPTIONS ] ”

+ ”\ n O p t io n s i n c l u d e : ” ) ;

System . o u t

. p r i n t l n ( ”−o r d e r <o r d e r >:\ t t h e max ngram o r d e r \n

\ t d e f a u l t : 3\ n ”

+ ”−prune <prune >:\ t t h e p r u n i n g c o u n t s \n\t d e f a u l t : 0\ n ” ) ;

re turn −1;

}

p u b l i c i n t run ( S t r i n g [ ] a r g s ) throws E x c e p t i o n {i n t o r d e r = 3 ;

i n t prune = 0 ;

L i s t <S t r i n g > o t h e r a r g s = new A r r a y L i s t <S t r i n g >() ;

f o r ( i n t i = 0 ; i < a r g s . l e n g t h ; ++ i ) {t r y {

i f ( ”−o r d e r ” . e q u a l s ( a r g s [ i ] ) ) {o r d e r = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ;

System . o u t . p r i n t l n ( ” t h e o r d e r i s s e t t o : ” +

a r g s [ i ] ) ;

} e l s e i f ( ”−prune ” . e q u a l s ( a r g s [ i ] ) ) {prune = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ;

System . o u t . p r i n t l n ( ” use c o u n t p r u n i n g : ” + a r g s

[ i ] ) ;

} e l s e {o t h e r a r g s . add ( a r g s [ i ] ) ;

}} ca tch ( NumberFormatExcept ion e x c e p t ) {

System . o u t . p r i n t l n ( ”ERROR: I n t e g e r e x p e c t e d

i n s t e a d o f ”

+ a r g s [ i ] ) ;

re turn p r i n t U s a g e ( ) ;

} ca tch ( Ar ray IndexOutOfBoundsExcep t ion e x c e p t ) {System . o u t . p r i n t l n ( ”ERROR: R e q u i r e d p a r a m e t e r

m i s s i n g from ”

+ a r g s [ i − 1 ] ) ;



}}/ / Make s u r e t h e r e are e x a c t l y 1 p a r a m e t e r s l e f t .

i f ( o t h e r a r g s . s i z e ( ) != 1 ) {System . o u t . p r i n t l n ( ”ERROR: Wrong number o f

p a r a m e t e r s : ”

+ a r g s . l e n g t h + ” i n s t e a d o f 1 . ” ) ;


}

JobConf jobA = new JobConf ( ge tConf ( ) , NgramCount .

c l a s s ) ;

jobA . setJobName ( ” ngramcount ” ) ;

/ / t h e k e y s are words ( s t r i n g s )

jobA . s e t O u t p u t K e y C l a s s ( Text . c l a s s ) ;

/ / t h e v a l u e s are c o u n t s ( i n t s )

jobA . s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e . c l a s s ) ;

jobA . s e t M a p p e r C l a s s ( countMap . c l a s s ) ;

jobA . s e t C o m b i n e r C l a s s ( countCombine . c l a s s ) ;

jobA . s e t P a r t i t i o n e r C l a s s ( c o u n t P a r t i t i o n . c l a s s ) ;

jobA . s e t R e d u c e r C l a s s ( coun tReduce . c l a s s ) ;

jobA . s e t I n t ( ” o r d e r ” , o r d e r ) ;

jobA . s e t I n t ( ” p rune ” , p rune ) ;

jobA . s e t I n p u t P a t h ( new Pa th ( o t h e r a r g s . g e t ( 0 ) ) ) ;

jobA . s e t O u t p u t F o r m a t ( S e q u e n c e F i l e O u t p u t F o r m a t . c l a s s ) ;

jobA . s e t O u t p u t P a t h ( new Pa th (RAWCOUNTDIR) ) ;

i f ( F i l e S y s t e m . g e t ( jobA ) . e x i s t s ( new Pa th (RAWCOUNTDIR)

) )

F i l e S y s t e m . g e t ( jobA ) . d e l e t e ( new Pa th (RAWCOUNTDIR) ) ;

S e q u e n c e F i l e O u t p u t F o r m a t . s e t C o m p r e s s O u t p u t ( jobA , t rue) ;


S e q u e n c e F i l e O u t p u t F o r m a t . s e t O u t p u t C o m p r e s s i o n T y p e (

jobA , Compress ionType .BLOCK) ;

/∗ ∗∗ r e c o r d t h e r u n n i n g t i m e

∗ /

long t 1 = System . c u r r e n t T i m e M i l l i s ( ) ;

RunningJob r j = J o b C l i e n t . r un Jo b ( jobA ) ;

C o u n t e r s c s = r j . g e t C o u n t e r s ( ) ;

t r y {Pa th o u t F i l e = new Pa th ( ”num−of−un ig rams ” ) ;

i f ( F i l e S y s t e m . g e t ( jobA ) . e x i s t s ( o u t F i l e ) )

F i l e S y s t e m . g e t ( jobA ) . d e l e t e ( o u t F i l e ) ;

S e q u e n c e F i l e . W r i t e r w r i t e r = S e q u e n c e F i l e .

c r e a t e W r i t e r ( F i l e S y s t e m . g e t ( jobA ) , jobA , o u t F i l e

, Tex t . c l a s s , LongWr i t ab l e . c l a s s ) ;

System . o u t . p r i n t l n ( ” t o t a l i s : ”+ cs . g e t C o u n t e r (

MyCounter . INPUT WORDS) ) ;

w r i t e r . append ( new Text ( ” t o t a l ” ) , new LongWr i t ab l e (

c s . g e t C o u n t e r ( MyCounter . INPUT WORDS) ) ) ;

w r i t e r . c l o s e ( ) ;

} ca tch ( IOExcep t ion e ) {System . o u t . p r i n t l n ( ” Can ’ t w r i t e f i l e : num−of−

un ig rams f o r t h e ngram model . E x i t . ” ) ;

re turn −1;

}System . o u t . p r i n t l n ( ” Job 1 f i n i s h e d . Now im s t a r t i n g

j o b 2 ” ) ;

/∗ ∗∗ t h e second t a s k i s t o c o u n t t h e count−of−c o u n t s

∗ f o r Good−Tu r ing smoo th ing

∗∗ /

JobConf jobB = new JobConf ( ge tConf ( ) , NgramCount .

c l a s s ) ;


jobB . s e t I n t ( ” o r d e r ” , o r d e r ) ;

jobB . s e t I n p u t F o r m a t ( S e q u e n c e F i l e I n p u t F o r m a t . c l a s s ) ;

jobB . s e t O u t p u t F o r m a t ( S e q u e n c e F i l e O u t p u t F o r m a t . c l a s s ) ;

jobB . s e t O u t p u t K e y C l a s s ( I n t W r i t a b l e . c l a s s ) ;

jobB . s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e . c l a s s ) ;

jobB . s e t M a p p e r C l a s s (MapB . c l a s s ) ;

jobB . s e t C o m b i n e r C l a s s ( ReduceB . c l a s s ) ;

jobB . s e t R e d u c e r C l a s s ( ReduceB . c l a s s ) ;

jobB . setNumReduceTasks ( 1 ) ;

/ / i n p u t i s t h e raw c o u n t s

jobB . s e t I n p u t P a t h ( new Pa th (RAWCOUNTDIR) ) ;

/ / o u t p u t i s c o u n t o f c o u n t s

Pa th tmpDir = new Pa th (GTPARA) ;

jobB . s e t O u t p u t P a t h ( tmpDir ) ;

i f ( F i l e S y s t e m . g e t ( jobB ) . e x i s t s ( tmpDir ) )

F i l e S y s t e m . g e t ( jobB ) . d e l e t e ( tmpDir ) ;

S e q u e n c e F i l e O u t p u t F o r m a t . s e t C o m p r e s s O u t p u t ( jobB , t rue) ;


jobB , Compress ionType .BLOCK) ;

J o b C l i e n t . r u nJ ob ( jobB ) ;

System . o u t . p r i n t l n ( ” Job 2 f i n i s h e d ! Now you can s t a r t

hbase t a b l e g e n e r a t o r . ” ) ;


long s e c = ( t 2 − t 1 ) / 1000 ; / / s e c o n d s

System . o u t . p r i n t l n ( ” r u n n i n g t ime i s : ” + s e c / 60 + ”

m i n u t e s ” + s e c% 60 + ” s e c o n d s ” ) ;

re turn 0 ;

}

p u b l i c s t a t i c vo id main ( S t r i n g [ ] a r g s ) throws E x c e p t i o n

{i n t e r rCode = ToolRunner . run ( new C o n f i g u r a t i o n ( ) , new

NgramCount ( ) ,

a r g s ) ;


System . e x i t ( e r rCode ) ;

}}

A.2 TableGenerator.java

TableGenerator is the class to estimate ngram probability and store it in different

Hbase tables.

package ngram ;



import j a v a . u t i l . HashMap ;



import j a v a . u t i l . Map . E n t r y ;





import org . apache . hadoop . hbase . H B a s e C o n f i g u r a t i o n ;

import org . apache . hadoop . hbase . HTable ;

import org . apache . hadoop . hbase . i o . I m m u t a b l e B y t e s W r i t a b l e ;

import org . apache . hadoop . hbase . mapred . TableReduce ;



import org . apache . hadoop . i o . MapWri table ;







/∗ ∗∗ t h e t h i r d t a s k i s t o c a l c u l a t e t h e Good−Tu r ing

∗ smoo th ing c o u n t s f o r t r a i n i n g s e t , e s t i m a t e

∗ t h e c o n d i t i o n a l p r o b a b i l i t y and s t o r e i t i n Hbase

∗ t a b l e


∗ /

p u b l i c c l a s s T a b l e G e n e r a t o r ex tends C o n f i g u r e d implementsTool {

p r o t e c t e d f i n a l s t a t i c S t r i n g GTPARA = ” gt−p a r a m e t e r ” ;

p r o t e c t e d f i n a l s t a t i c S t r i n g RAWCOUNTDIR = ” raw−c o u n t s

” ;

/∗ ∗∗ a map c l a s s t o read i n raw c o u n t s and c o u n t o f

∗ coun t s , compute t h e Good−Tu r ing smoo th ing c o u n t s

∗∗ /

p u b l i c s t a t i c c l a s s es t ima teMap ex tends MapReduceBase

implements Mapper<Text , I n t W r i t a b l e , Text , Text> {i n t o r d e r ;

HashMap> countHash

;

countHash = new HashMap>() ;

Pa th t m p F i l e = new Pa th ( new Pa th (GTPARA) , ” p a r t

−00000” ) ;

/ / open t h e c o u n t o f c o u n t s f i l e t o read a l l c o u n t

/ / o f c o u n t s i n t o hashmap

t r y {F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( j o b ) ;

S e q u e n c e F i l e . Reader r e a d e r = new S e q u e n c e F i l e .

Reader ( f i l e S y s , tmpFi l e , j o b ) ;

I n t W r i t a b l e k = new I n t W r i t a b l e ( ) , v = newI n t W r i t a b l e ( ) ;

whi le ( r e a d e r . n e x t ( k , v ) ) {I n t e g e r o r d e r = I n t e g e r . p a r s e I n t ( k . t o S t r i n g ( ) .

s u b s t r i n g ( 0 , 1 ) ) ;

I n t e g e r key = I n t e g e r . p a r s e I n t ( k . t o S t r i n g ( ) .

s u b s t r i n g ( 1 ) ) ;

i f ( countHash . c o n t a i n s K e y ( o r d e r ) ) {HashMap tmp = countHash . g e t

( o r d e r ) ;

tmp . p u t ( key , new I n t e g e r ( v . g e t ( ) ) ) ;

countHash . p u t ( o r d e r , tmp ) ;

} e l s e {HashMap tmp = new HashMap() ;

tmp . p u t ( key , new I n t e g e r ( v . g e t ( ) ) ) ;

countHash . p u t ( o r d e r , tmp ) ;

}}r e a d e r . c l o s e ( ) ;

} ca tch ( IOExcep t ion e ) {System . o u t . p r i n t l n ( ” c a n t open p a r a m e t e r f i l e ! ” ) ;

};

}

O u t p u t C o l l e c t o r <Text , Text> o u t p u t , R e p o r t e r r )

throws IOExcep t i on {I n t e g e r v = v a l u e . g e t ( ) ;

f l o a t c = v . f l o a t V a l u e ( ) ;

S t r i n g [ ] words = key . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) ;

I n t e g e r o r d e r = words . l e n g t h ;

i f ( o r d e r > t h i s . o r d e r )

re turn ;

/∗ ∗∗ r ∗=( r +1)∗Nr +1/ Nr t h e c o u n t o f c o u n t s Nr must be

∗ e x i s t e d , o t h e r w i s e we use t h e c l o s e s t c o u n t s

∗ i n s t e a d

∗ /

f o r ( i n t i = 1 ; i < 5 ; i ++) {i f ( countHash . g e t ( o r d e r ) . c o n t a i n s K e y ( v )

&& countHash . g e t ( o r d e r ) . c o n t a i n s K e y ( v + i ) ) {I n t e g e r Nr1 = countHash . g e t ( o r d e r ) . g e t ( v + i ) ;

I n t e g e r Nr = countHash . g e t ( o r d e r ) . g e t ( v ) ;

c = ( v . f l o a t V a l u e ( ) + 1 f ) ∗ Nr1 . f l o a t V a l u e ( )

/ Nr . f l o a t V a l u e ( ) ;

break ;

}}/∗ ∗∗ t h e G−T smoothed c o u n t s w i t h T e x t key i s c , t h e n

∗ we e m i t t h e t o k e n based on t h e l a s t n−1 gram as

∗ t h e new key

∗ /

i f ( o r d e r > 1) {S t r i n g s u f f i x = words [ o r d e r − 1 ] ;

S t r i n g word = words [ 0 ] ;

f o r ( i n t i = 1 ; i < o r d e r − 1 ; i ++) {word = word + ” ” + words [ i ] ;

}o u t p u t . c o l l e c t ( new Text ( word ) , new Text ( ”\ t ” +

s u f f i x + ”\ t ” + F l o a t . t o S t r i n g ( c ) ) ) ;

}i f ( o r d e r == 1 | | ( o r d e r > 1 && o r d e r < t h i s . o r d e r )

)

o u t p u t . c o l l e c t ( key , new Text ( F l o a t . t o S t r i n g ( c ) ) ) ;

}}

/∗ ∗∗ a map c l a s s needed by i n t e g e r based t a b l e s t r u c t u r e ,

∗ c o n v e r t t h e un igrams i n t o u n i qu e i n t e g e r s

∗∗ /

p u b l i c s t a t i c c l a s s conver tMap ex tends MapReduceBase

implementsMapper<Text , I n t W r i t a b l e , Text , LongWri tab le > {

Long i d = 0 l ;

/∗ ∗∗ t h e i n p u t key i s ngram , v a l u e i s t h e raw c o u n t s

∗ t h e o u t p u t key i s unigram , v a l u e i s t h e u n iq ue

∗ i n t e g e r i d

∗ /


O u t p u t C o l l e c t o r <Text , LongWri tab le > o u t p u t ,

R e p o r t e r r )

throws IOExcep t i on {S t r i n g [ ] words = key . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) ;

i f ( words . l e n g t h == 1) {o u t p u t . c o l l e c t ( key , new LongWr i t ab l e (++ i d ) ) ;

}}

}

/∗ ∗∗ a re du c e c l a s s needed by i n t e g e r based t a b l e

∗ s t r u c t u r e , s t o r e t h e unigram− i n t e r g e r mapping i n

∗ Hbase t a b l e

∗∗ /

p u b l i c s t a t i c c l a s s c o n v e r t T a b l e ex tends TableReduce<

Text , LongWri tab le > {/ / t h e row key i s t h e unigram T e x t k , t h e column i s

c o n v e r t : i d

p u b l i c vo id r e d u c e ( Text k , I t e r a t o r <LongWri tab le > v ,

O u t p u t C o l l e c t o r <Text , MapWritable> o u t p u t ,

R e p o r t e r r )

throws IOExcep t i on {Long i d = v . n e x t ( ) . g e t ( ) ;

MapWri table mw = new MapWri table ( ) ;

Text column = new Text ( ” c o n v e r t : i d ” ) ;

mw. p u t ( column , new I m m u t a b l e B y t e s W r i t a b l e ( Long .

t o S t r i n g ( i d )

. g e t B y t e s ( ) ) ) ;

o u t p u t . c o l l e c t ( k , mw) ;

}}

/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n ngram based

∗ t a b l e

∗∗ /

p u b l i c s t a t i c c l a s s s i mp leT ab l eR edu ce ex tendsTableReduce<Text , Text> {

Long t o t a l ;

p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {

t o t a l = j o b . ge tLong ( ” t o t a l ” , 1 ) ;

}

p u b l i c vo id r e d u c e ( Text k , I t e r a t o r <Text> v ,


R e p o r t e r r )

throws IOExcep t i on {F l o a t base = 1 f ;

HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g ,

F l o a t >() ;

whi le ( v . hasNext ( ) ) {S t r i n g r e c o r d = v . n e x t ( ) . t o S t r i n g ( ) ;

i f ( r e c o r d . s t a r t s W i t h ( ”\ t ” ) ) {S t r i n g [ ] tmp = r e c o r d . s p l i t ( ”\ t ” ) ;

c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ;

} e l s ebase = F l o a t . p a r s e F l o a t ( r e c o r d ) ;

}MapWri table mw = new MapWri table ( ) ;

/∗ ∗∗ w r i t e t o t h e column <g t : prob > , t h e row i s t h e

∗ ngram

∗ /

i f ( ! c o u n t s . i sEmpty ( ) ) {f o r ( Ent ry<S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t

( ) ) {F l o a t prob = e n t r y . g e t V a l u e ( ) / ba se ;

i f ( p rob > 1 f )

prob = 1 f ;

mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : p rob ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( prob ) .

g e t B y t e s ( ) ) ) ;

o u t p u t . c o l l e c t ( new Text ( k + ” ” + e n t r y . getKey

( ) ) , mw) ;

}}i f ( k . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) . l e n g t h == 1) {

mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : p rob ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( base /

t o t a l ) . g e t B y t e s ( ) ) ) ;


}}

}

/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n word based

∗ t a b l e

∗∗ /

p u b l i c s t a t i c c l a s s wordTableReduce ex tends TableReduce

<Text , Text> {Long t o t a l ;

p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {t o t a l = j o b . ge tLong ( ” t o t a l ” , 1 ) ;

}

/∗ ∗∗ k i s t h e n−1 gram c o n t e x t

∗ /



R e p o r t e r r )



F l o a t >() ;

/∗ ∗∗ w r i t e t o t h e column f a m i l y <g t : > , column l a b e l

∗ i s t h e n−1 gram T e x t k , t h e row i s t h e word from

∗ HashMap c o u n t s

∗ /



i f ( p rob > 1 f )

prob = 1 f ;

mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : ” + k ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( prob ) .

g e t B y t e s ( ) ) ) ;

o u t p u t . c o l l e c t ( new Text ( e n t r y . getKey ( ) ) , mw) ;


mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : unigram ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( base /

t o t a l ) . g e t B y t e s ( ) ) ) ;


}

}}

/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n c o n t e x t based

t a b l e

∗∗ /

p u b l i c s t a t i c c l a s s c o n t e x t T a b l e R e d u c e ex tendsTableReduce<Text , Text> {

Long t o t a l ;

@Override


}


∗ /



R e p o r t e r r )



F l o a t >() ;





}


MapWri table mw = new MapWri table ( ) ;


i s t h e word from

∗ HashMap coun t s , t h e row i s t h e n−1 gram T e x t k

∗ /



i f ( p rob > 1 f )

prob = 1 f ;

mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : ” + e n t r y . getKey ( ) ) ,

new I m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g (

prob )

. g e t B y t e s ( ) ) ) ;



mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : unigram ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t

. t o S t r i n g ( base / t o t a l ) . g e t B y t e s ( ) ) ) ;


}}

}

/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n h a l f ngram

based t a b l e

∗

∗ /

p u b l i c s t a t i c c l a s s combineTableReduce ex tendsTableReduce<Text , Text> {

Long t o t a l ;


}


∗ /



R e p o r t e r r )



F l o a t >() ;







i s t h e l a s t h a l f

∗ ngram c o n t e x t , t h e row i s t h e f i r s t h a l f ngram

c o n t e x t

∗ /

i f ( p rob > 1 f )

prob = 1 f ;

mw. c l e a r ( ) ;

S t r i n g ngram = k . t o S t r i n g ( ) + ” ” + e n t r y .

getKey ( ) ;

S t r i n g [ ] a r r a y = ngram . s p l i t ( ”\\ s+” ) ;

i n t h a l f = a r r a y . l e n g t h / 2 ;

S t r i n g row = a r r a y [ 0 ] ;

S t r i n g l a b e l = a r r a y [ h a l f ] ;

f o r ( i n t i = 1 ; i < h a l f ; i ++) {row = row + ” ” + a r r a y [ i ] ;

}f o r ( i n t i = h a l f + 1 ; i < a r r a y . l e n g t h ; i ++) {

l a b e l = l a b e l + ” ” + a r r a y [ i ] ;

}mw. p u t ( new Text ( ” g t : ” + l a b e l ) , new

I m m u t a b l e B y t e s W r i t a b l e (

F l o a t . t o S t r i n g ( prob ) . g e t B y t e s ( ) ) ) ;

o u t p u t . c o l l e c t ( new Text ( row ) , mw) ;


mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : unigram ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t



}}

}

/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n i n t e g e r based

t a b l e

∗∗ /

p u b l i c s t a t i c c l a s s i n t e g e r T a b l e R e d u c e ex tendsTableReduce<Text , Text> {

Long t o t a l ;

p r i v a t e H B a s e C o n f i g u r a t i o n c o n f i g ;

p r i v a t e HTable t a b l e ;


t r y {c o n f i g = new H B a s e C o n f i g u r a t i o n ( j o b ) ;

t a b l e = new HTable ( c o n f i g , new Text ( ” idmap ” ) ) ;

} ca tch ( IOExcep t ion e ) {System . o u t

. p r i n t l n ( ”Can ’ t open t a b l e : idmap f o r t h e

ngram model . ” ) ;

re turn ;

}}

/∗ ∗∗ c o n v e r t t h e ngram s t r i n g i n t o long i n t e g e r s . Not

found words are

∗ r e t u r n e d as ’0 ’

∗∗ @param ngram t h e ngram s t r i n g

∗ @return t h e i n t e g e r r e p r e s e n t a t i o n o f t h e s t r i n g

∗ /

p r i v a t e S t r i n g c o n v e r t I D ( S t r i n g ngram ) throwsIOExcep t i on {

Text Column = new Text ( ” c o n v e r t : i d ” ) ;

S t r i n g myid = ” 0 ” ;

byte [ ] v a l u e B y t e s = t a b l e . g e t ( new Text ( a r r a y [ 0 ] ) ,

Column ) ;

i f ( v a l u e B y t e s != n u l l ) {myid = new S t r i n g ( v a l u e B y t e s ) ;

} e l s ere turn ” 0 ” ;

i f ( a r r a y . l e n g t h > 1) {f o r ( i n t i = 1 ; i < a r r a y . l e n g t h ; i ++) {

v a l u e B y t e s = t a b l e . g e t ( new Text ( a r r a y [ i ] ) ,

Column ) ;

i f ( v a l u e B y t e s != n u l l ) {myid = myid + ” ” + new S t r i n g ( v a l u e B y t e s ) ;

} e l s ere turn ” 0 ” ;

}}re turn myid ;

}


∗ /



R e p o r t e r r )

throws IOExcep t i on {F l o a t ba se = 1 f ;


F l o a t >() ;

/∗ ∗∗ w r i t e t o t h e column <g t : prob > , t h e row i s t h e

i n t e g e r

∗ r e p r e s e n t a t i o n o f ngram

∗ /



i f ( p rob > 1 f )

prob = 1 f ;

mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : p rob ” ) , newI m m u t a b l e B y t e s W r i t a b l e (

F l o a t . t o S t r i n g ( prob ) . g e t B y t e s ( ) ) ) ;

S t r i n g i d = c o n v e r t I D ( k . t o S t r i n g ( ) + ”\ t ” +

e n t r y . getKey ( ) ) ;

o u t p u t . c o l l e c t ( new Text ( i d ) , mw) ;


mw. c l e a r ( ) ;

mw. p u t ( new Text ( ” g t : p rob ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t


S t r i n g i d = c o n v e r t I D ( k . t o S t r i n g ( ) ) ;

o u t p u t . c o l l e c t ( new Text ( i d ) , mw) ;

}}

}


i n t t y p e = 1 ;




i f ( o r d e r < 1) {System . o u t

. p r i n t l n ( ” o r d e r i s i n v a l i d , u s i n g d e f a u l t

i n s t e a d ” ) ;

o r d e r = 3 ;

}} e l s e i f ( ”−t y p e ” . e q u a l s ( a r g s [ i ] ) ) {

t y p e = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ;

i f ( t y p e < 1 | | t y p e > 5) {System . o u t

. p r i n t l n ( ” t y p e i s o u t o f r a n g e (1−5) ,

u s i n g d e f a u l t i n s t e a d ” ) ;

t y p e = 1 ;

}} e l s e {

o t h e r a r g s . add ( a r g s [ i ] ) ;

}} ca tch ( NumberFormatExcept ion e x c e p t ) {

System . o u t . p r i n t l n ( ”ERROR: I n t e g e r e x p e c t e d


+ a r g s [ i ] ) ;


+ a r g s [ i − 1 ] ) ;





+ a r g s . l e n g t h + ” i n s t e a d o f 1 . ” ) ;


}System . o u t . p r i n t l n ( ” t h e o r d e r i s s e t t o : ” + o r d e r ) ;

System . o u t . p r i n t l n ( ” t h e t a b l e t y p e i s s e t t o : ” + t y p e

) ;

/∗ ∗∗ f o r i n t e g e r based t a b l e , e x t r a MapReduce s t e p i s

r e q u i r e d t o g e n e r a t e

∗ t h e c o n v e r t map

∗ /

i f ( t y p e == 5) {JobConf c o n v e r t J o b = new JobConf ( ge tConf ( ) ,

T a b l e G e n e r a t o r . c l a s s ) ;

c o n v e r t J o b . setJobName ( ” g e n e r a t e c o n v e r t mapping

t a b l e ” ) ;

c o n v e r t J o b . s e t M a p p e r C l a s s ( conver tMap . c l a s s ) ;

c o n v e r t J o b . s e t I n p u t F o r m a t ( S e q u e n c e F i l e I n p u t F o r m a t .

c l a s s ) ;

c o n v e r t J o b . se tMapOutpu tKeyClass ( Text . c l a s s ) ;

c o n v e r t J o b . s e t M a p O u t p u t V a l u e C l a s s ( LongWr i t ab l e .

c l a s s ) ;

c o n v e r t J o b . s e t I n p u t P a t h ( new Pa th (RAWCOUNTDIR) ) ;

c o n v e r t J o b . s e t R e d u c e r C l a s s ( c o n v e r t T a b l e . c l a s s ) ;

/ / i n i t t h e hbase t a b l e w i t h t h e c o n v e r t map t a b l e

name ”idmap”


TableReduce . i n i t J o b ( ” idmap ” , c o n v e r t T a b l e . c l a s s ,

c o n v e r t J o b ) ;


∗ /


J o b C l i e n t . r u nJ ob ( c o n v e r t J o b ) ;



System . o u t . p r i n t l n ( ” r u n n i n g t ime i s : ” + s e c / 60 +

” m i n u t e s ”

+ s e c % 60 + ” s e c o n d s ” ) ;

}/∗ ∗∗ Job 3 c a l c u l a t e s t h e a d j u s t e d Good−Tu r ing c o u n t s

f o r t r a i n i n g da ta

∗ and w r i t e t h e smoothed c o u n t s t o Hbase t a b l e

∗ /

JobConf jobC = new JobConf ( ge tConf ( ) , T a b l e G e n e r a t o r .

c l a s s ) ;

jobC . setJobName ( ” ngram wi th hbase ” ) ;

jobC . s e t I n t ( ” o r d e r ” , o r d e r ) ;

jobC . s e t M a p p e r C l a s s ( es t ima teMap . c l a s s ) ;

jobC . s e t I n p u t F o r m a t ( S e q u e n c e F i l e I n p u t F o r m a t . c l a s s ) ;

jobC . se tMapOutpu tKeyClass ( Text . c l a s s ) ;

jobC . s e t M a p O u t p u t V a l u e C l a s s ( Text . c l a s s ) ;

jobC . s e t I n p u t P a t h ( new Pa th (RAWCOUNTDIR) ) ;

sw i t ch ( t y p e ) {case 1 :

jobC . s e t R e d u c e r C l a s s ( s im p le Ta b l e Re duc e . c l a s s ) ;

/ / i n i t t h e hbase t a b l e w i t h s p e c i f i e d t a b l e name

TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) ,

s i mp leT ab l eR ed uce . c l a s s ,

jobC ) ;

break ;


case 2 :

jobC . s e t R e d u c e r C l a s s ( wordTableReduce . c l a s s ) ;



wordTableReduce . c l a s s , jobC ) ;

break ;

case 3 :

jobC . s e t R e d u c e r C l a s s ( c o n t e x t T a b l e R e d u c e . c l a s s ) ;



c o n t e x t T a b l e R e d u c e . c l a s s ,

jobC ) ;

break ;

case 4 :

jobC . s e t R e d u c e r C l a s s ( combineTableReduce . c l a s s ) ;



combineTableReduce . c l a s s ,

jobC ) ;

break ;

case 5 :

jobC . s e t R e d u c e r C l a s s ( i n t e g e r T a b l e R e d u c e . c l a s s ) ;



i n t e g e r T a b l e R e d u c e . c l a s s ,

jobC ) ;

break ;

d e f a u l t :

jobC . s e t R e d u c e r C l a s s ( s im p le Ta b l e Re duc e . c l a s s ) ;



s i mp leT ab l eR ed uce . c l a s s ,

jobC ) ;

break ;

}


/ / open t h e num−of−unigrams f i l e t o read t h e t o t a l

unigram c o u n t s

t r y {Long t o t a l = 0L ;

Pa th t m p F i l e = new Pa th ( ”num−of−un ig rams ” ) ;

F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( jobC ) ;


Reader ( f i l e S y s ,

tmpFi l e , jobC ) ;

LongWr i t ab l e v a l u e = new LongWr i t ab l e ( ) ;

i f ( r e a d e r . n e x t ( new Text ( ) , v a l u e ) ) {t o t a l = v a l u e . g e t ( ) ;

}r e a d e r . c l o s e ( ) ;

jobC . se tLong ( ” t o t a l ” , t o t a l ) ;

} ca tch ( IOExcep t ion e ) {System . o u t . p r i n t l n ( ” Can ’ t open f i l e : num−of−

un ig rams f o r t h e ngram model . E x i t . ” ) ;

re turn −1;

}/∗ ∗∗ r e c o r d t h e r u n n i n g t i m e

∗ /


J o b C l i e n t . r u nJ ob ( jobC ) ;




m i n u t e s ” + s e c

% 60 + ” s e c o n d s ” ) ;

re turn 0 ;

}

s t a t i c i n t p r i n t U s a g e ( ) {

System . o u t . p r i n t l n ( ” usage : [ OPTIONS ] < t a b l e name>”


System . o u t . p r i n t l n ( ”−o r d e r <o r d e r >:\ t t h e max ngram

o r d e r \n\ t d e f a u l t : 3\ n ”

+ ”−t y p e <type >:\ t t h e t a b l e t y p e \n\ t d e f a u l t

: 1\ n ” ) ;

System . o u t . p r i n t l n ( ” t h e t a b l e t y p e can be chosen from

: ” ) ;

System . o u t . p r i n t l n ( ” 1 . ngram based \n ” + ” 2 . word

based \n ”

+ ” 3 . c o n t e x t based \n ” + ” 4 . h a l f ngram based \n ”

+ ” 5 . i n t e g e r based ” ) ;

re turn −1;

}


{i n t e r rCode = ToolRunner . run ( new C o n f i g u r a t i o n ( ) , new

T a b l e G e n e r a t o r ( ) ,

a r g s ) ;

System . e x i t ( e r rCode ) ;

}}

A.3 NgramModel.java

NgramModel is the class to query probability from Hbase table for testing texts,

and compute the perplexity.

package ngram ;



import j a v a . u t i l . HashMap ;



import j a v a . u t i l . r e g e x . Matcher ;

import j a v a . u t i l . r e g e x . P a t t e r n ;







import org . apache . hadoop . i o . F l o a t W r i t a b l e ;



import org . apache . hadoop . i o . S e q u e n c e F i l e . Compress ionType ;






import org . apache . hadoop . mapred . Reducer ;


import org . apache . hadoop . mapred . Tex tOu tpu tFo rma t ;


import org . apache . hadoop . mapred . S e q u e n c e F i l e O u t p u t F o r m a t ;



import org . apache . hadoop . hbase . HTable ;

import org . apache . hadoop . hbase . H B a s e C o n f i g u r a t i o n ;

/∗ ∗∗ t r y t o e v a l u a t e t h e ngram model , e s t i m a t e p e r p l e x i t y

∗ s c o r e s f o r t e s t i n g s e n t e n c e s .


∗∗ /

p u b l i c c l a s s NgramModel ex tends C o n f i g u r e d implementsTool ,

Mapper<Text , I n t W r i t a b l e , Text , F l o a t W r i t a b l e > {p r i v a t e JobConf myconf ;

p r i v a t e H B a s e C o n f i g u r a t i o n c o n f i g ;

p r i v a t e HTable t a b l e ;

p r i v a t e HTable c o n v e r t T a b l e ;

p r i v a t e HashMap<S t r i n g , F l o a t > mycache ;

p r i v a t e i n t type , cache , o r d e r ;

p r i v a t e Double p e r p l e x i t y = 0 . 0 , H = 0 . 0 ;

p r i v a t e long c o u n t = 0L , t o t a l = 0L ;

/∗ ∗∗ a map c l a s s t o do t h e word c o u n t s

∗∗ /

p u b l i c s t a t i c c l a s s myMap ex tends MapReduceBase

implementsMapper<LongWri tab le , Text , Text , I n t W r i t a b l e > {

p r i v a t e i n t o r d e r ;

p r i v a t e S t r i n g words ;


}

p u b l i c vo id map ( LongWr i t ab l e key , Text va lue ,



throws IOExcep t i on {S t r i n g l i n e = v a l u e . t o S t r i n g ( ) ;

i f ( l i n e == n u l l | | l i n e . t r i m ( ) . l e n g t h ( ) == 0)

re turn ;

P a t t e r n m y P a t t e r n = P a t t e r n . compi l e ( ”\\w+ |\\ p{ Punc t

}+” ) ;

Matcher myMatcher = m y P a t t e r n . ma tche r ( l i n e ) ;

S t r i n g B u f f e r sb = new S t r i n g B u f f e r ( ) ;

whi le ( myMatcher . f i n d ( ) ) {sb . append ( myMatcher . group ( ) ) . append ( ”\ t ” ) ;

}l i n e = sb . t o S t r i n g ( ) ;

S t r i n g [ ] c u r r e n t = l i n e . s p l i t ( ”\\ s+” ) ;

words = ” ” ;

i f ( o r d e r > 1)

c u r r e n t = ( ”<s> ” + l i n e ) . s p l i t ( ”\\ s+” ) ;

f o r ( i n t i = 0 ; i <= c u r r e n t . l e n g t h − o r d e r ; i ++) {words = c u r r e n t [ i ] ;

f o r ( i n t j = i + 1 ; j < i + o r d e r ; j ++)

words = words + ” ” + c u r r e n t [ j ] ;

o u t p u t . c o l l e c t ( new Text ( words ) , new I n t W r i t a b l e

( 1 ) ) ;

}}

}

/∗ ∗∗ A r e d u c e r c l a s s t h a t j u s t e m i t s t h e sum o f t h e i n p u t

v a l u e s .

∗ /

p u b l i c s t a t i c c l a s s myReduce ex tends MapReduceBase



v a l u e s ,

}}

/∗ ∗∗ open t h e Hbase t a b l e f o r query

∗ /

p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {myconf = j o b ;

S t r i n g tableName = j o b . g e t ( ” t a b l e ” ) ;

t y p e = j o b . g e t I n t ( ” t y p e ” , 1 ) ;

cache = j o b . g e t I n t ( ” cache ” , 0 ) ;

t o t a l = j o b . ge tLong ( ” t o t a l ” , 1 ) ;

o r d e r = j o b . g e t I n t ( ” o r d e r ” , 3 ) ;

mycache = new HashMap<S t r i n g , F l o a t >() ;

t r y {c o n f i g = new H B a s e C o n f i g u r a t i o n ( j o b ) ;

t a b l e = new HTable ( c o n f i g , new Text ( tableName ) ) ;

c o n v e r t T a b l e = new HTable ( c o n f i g , new Text ( ” idmap ” )

) ;

} ca tch ( IOExcep t ion e ) {System . o u t . p r i n t l n ( ” Can ’ t open t a b l e : ” + tableName

+ ” f o r t h e ngram model . E x i t . ” ) ;

re turn ;

}}

/∗ ∗∗ query t h e Hbase t a b l e f o r d i f f e r e n t t a b l e t y p e s t o

g e t t h e p r o b a b i l i t y

∗

∗ @param ngram

∗ t h e ngram t o query

∗ @return t h e p r o b a b i l i t y o f t h e i n p u t ngram

∗ @throws I O E x c e p t i o n

∗ /

p r i v a t e byte [ ] q u e r y T a b l e ( S t r i n g ngram ) throwsIOExcep t i on {

S t r i n g row , column ;


i n t l e n g t h = a r r a y . l e n g t h ;

i f ( l e n g t h == 0)

re turn n u l l ;

sw i t ch ( t y p e ) {/ / ngram based

case 1 :

row = ngram ;

column = ” g t : p rob ” ;

break ;

/ / word based

case 2 :

row = a r r a y [ l e n g t h − 1 ] ;

i f ( l e n g t h == 1)

column = ” g t : unigram ” ;

e l s e {column = ” g t : ” + a r r a y [ 0 ] ;

f o r ( i n t i = 1 ; i < l e n g t h − 1 ; i ++) {column = column + ” ” + a r r a y [ i ] ;

}}break ;

/ / c o n t e x t based

case 3 :

row = a r r a y [ 0 ] ;

f o r ( i n t i = 1 ; i < l e n g t h − 1 ; i ++) {row = row + ” ” + a r r a y [ i ] ;

}i f ( l e n g t h == 1)


e l s ecolumn = ” g t : ” + a r r a y [ l e n g t h − 1 ] ;

break ;

/ / h a l f ngram based

case 4 :

i f ( l e n g t h == 1) {row = a r r a y [ 0 ] ;


} e l s e {i n t h a l f = a r r a y . l e n g t h / 2 ;

row = a r r a y [ 0 ] ;

column = ” g t : ” + a r r a y [ h a l f ] ;

f o r ( i n t i = 1 ; i < h a l f ; i ++) {row = row + ” ” + a r r a y [ i ] ;

}f o r ( i n t i = h a l f + 1 ; i < a r r a y . l e n g t h ; i ++) {

column = column + ” ” + a r r a y [ i ] ;

}}break ;

/ / i n t e g e r based

case 5 :

byte [ ] v a l u e B y t e s = c o n v e r t T a b l e . g e t ( new Text ( a r r a y

[ 0 ] ) , new Text (

” c o n v e r t : i d ” ) ) ;

i f ( v a l u e B y t e s != n u l l )

row = new S t r i n g ( v a l u e B y t e s ) ;

e l s ere turn n u l l ;

i f ( a r r a y . l e n g t h > 1) {f o r ( i n t i = 1 ; i < a r r a y . l e n g t h ; i ++) {

v a l u e B y t e s = c o n v e r t T a b l e . g e t ( new Text ( a r r a y [ i

] ) , new Text (

” c o n v e r t : i d ” ) ) ;

i f ( v a l u e B y t e s != n u l l ) {row = row + ” ” + new S t r i n g ( v a l u e B y t e s ) ;

} e l s ere turn n u l l ;

}}column = ” g t : p rob ” ;

break ;

d e f a u l t :

row = ngram ;

column = ” g t : p rob ” ;

break ;

}re turn t a b l e . g e t ( new Text ( row ) , new Text ( column ) ) ;

}


O u t p u t C o l l e c t o r <Text , F l o a t W r i t a b l e > o u t p u t ,


throws IOExcep t i on {F l o a t prob = 0 f , a l p h a = 1 f ;

/∗ ∗∗ a back−o f f c a l c u l a t i o n t o g e t t h e p r o b a b i l i t y

∗ /

boolean f i n i s h e d = f a l s e ;

S t r i n g row = key . t o S t r i n g ( ) ;

whi le ( f i n i s h e d == f a l s e ) {i f ( cache == 1 && mycache . c o n t a i n s K e y ( row ) ) {

prob = a l p h a ∗ mycache . g e t ( row ) ;

f i n i s h e d = t rue ;

} e l s e {

byte [ ] v a l u e B y t e s = q u e r y T a b l e ( row ) ;

i f ( v a l u e B y t e s != n u l l ) {/∗ ∗∗ f i n d t h e p r o b a b i l i t y

∗ /

prob = a l p h a ∗ F l o a t . va lueOf ( new S t r i n g (

v a l u e B y t e s ) ) ;

i f ( cache == 1 && row . s p l i t ( ”\\ s+” ) . l e n g t h <

o r d e r ) {i f ( mycache . s i z e ( ) > 100)

mycache . c l e a r ( ) ;

mycache . p u t ( row , F l o a t . va lueOf ( new S t r i n g (

v a l u e B y t e s ) ) ) ;

}f i n i s h e d = t rue ;

} e l s e {S t r i n g [ ] words = row . s p l i t ( ”\\ s+” ) ;

/∗ ∗∗ unseen unigram

∗ /

i f ( words . l e n g t h == 1) {prob = a l p h a / ( t o t a l + 1 f ) ;

f i n i s h e d = t rue ;

} e l s e {/∗ ∗∗ back−o f f t o n−1 gram words

∗ /

row = words [ 1 ] ;

f o r ( i n t i = 2 ; i < words . l e n g t h ; i ++)

row = row + ” ” + words [ i ] ;

a l p h a = a l p h a ∗ 0 . 4 f ;

f i n i s h e d = f a l s e ;

}}

}

}c o u n t += v a l u e . g e t ( ) ;

i f ( p rob > 1 f )

prob = 1 f ;

i f ( p rob > 0 f )

H += v a l u e . g e t ( ) ∗ Math . l o g ( prob . doub leVa lue ( ) ) /

Math . l o g ( 2 . 0 ) ;

o u t p u t . c o l l e c t ( key , new F l o a t W r i t a b l e ( prob ) ) ;

}

/∗ ∗∗ compute t h e f i n a l p e r p l e x i t y score , w r i t e t o a f i l e

∗ /

p u b l i c vo id c l o s e ( ) throws IOExcep t i on {p e r p l e x i t y = Math . pow ( 2 . 0 , −H / c o u n t ) ;

Pa th o u t F i l e = new Pa th ( ” p e r p l e x i t y ” ) ;

F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( myconf ) ;

i f ( f i l e S y s . e x i s t s ( o u t F i l e ) )

f i l e S y s . d e l e t e ( o u t F i l e ) ;

S e q u e n c e F i l e . W r i t e r w r i t e r = S e q u e n c e F i l e .

c r e a t e W r i t e r ( f i l e S y s , myconf ,

o u t F i l e , Tex t . c l a s s , F l o a t W r i t a b l e . c l a s s ,

Compress ionType .NONE) ;

w r i t e r . append ( new Text ( ” p e r p l e x i t y i s : ” ) , newF l o a t W r i t a b l e ( p e r p l e x i t y

. f l o a t V a l u e ( ) ) ) ;

w r i t e r . c l o s e ( ) ;

}

s t a t i c i n t p r i n t U s a g e ( ) {System . o u t . p r i n t l n ( ” usage : [ OPTIONS ] < t e s t i n p u t d i r >

< t a b l e name>”


System . o u t

. p r i n t l n ( ”−o r d e r <o r d e r >:\ t t h e max ngram o r d e r \n

\ t d e f a u l t : 3\ n ”

+ ”−cache <0|1>:\ t u se t h e c a c h i n g query \n\t d e f a u l t : 0\ n ”

+ ”−t y p e <type >:\ t t h e t a b l e t y p e \n\ t d e f a u l t

: 1\ n ” ) ;

System . o u t . p r i n t l n ( ” t h e t a b l e t y p e can be chosen from

: ” ) ;

System . o u t . p r i n t l n ( ” 1 . ngram based \n ” + ” 2 . word

based \n ”

+ ” 3 . c o n t e x t based \n ” + ” 4 . combined word and

c o n t e x t based \n ”

+ ” 5 . i n t e g e r based ” ) ;

re turn −1;

}


i n t t y p e = 1 ;

i n t cache = 0 ;

JobConf jobA = new JobConf ( ge tConf ( ) , NgramModel .

c l a s s ) ;

jobA . setJobName ( ” raw c o u n t ” ) ;




} e l s e i f ( ”−t y p e ” . e q u a l s ( a r g s [ i ] ) ) {t y p e = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ;

} e l s e i f ( ”−cache ” . e q u a l s ( a r g s [ i ] ) ) {cache = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ;

} e l s e {o t h e r a r g s . add ( a r g s [ i ] ) ;

}


} ca tch ( NumberFormatExcept ion e x c e p t ) {System . o u t . p r i n t l n ( ”ERROR: I n t e g e r e x p e c t e d


+ a r g s [ i ] ) ;




+ a r g s [ i − 1 ] ) ;





+ o t h e r a r g s . s i z e ( ) + ” i n s t e a d o f 2 . ” ) ;


}System . o u t . p r i n t l n ( ” t h e o r d e r i s s e t t o : ” + o r d e r ) ;

System . o u t . p r i n t l n ( ” t h e t a b l e t y p e i s s e t t o : ” + t y p e

) ;

System . o u t . p r i n t l n ( ” use t h e c a c h i n g que ry : ” + ( ( cache

==1) ? ” yes ” : ” no ” ) ) ;

jobA . s e t I n t ( ” o r d e r ” , o r d e r ) ;

Pa th i n p u t D i r = new Pa th ( o t h e r a r g s . g e t ( 0 ) ) ;

jobA . a d d I n p u t P a t h ( i n p u t D i r ) ;

jobA . s e t O u t p u t F o r m a t ( S e q u e n c e F i l e O u t p u t F o r m a t . c l a s s ) ;

S e q u e n c e F i l e O u t p u t F o r m a t . s e t C o m p r e s s O u t p u t ( jobA , t rue) ;


jobA ,

Compress ionType .BLOCK) ;

Pa th tmp = new Pa th ( ” tmp ” ) ;

i f ( F i l e S y s t e m . g e t ( jobA ) . e x i s t s ( tmp ) )


F i l e S y s t e m . g e t ( jobA ) . d e l e t e ( tmp ) ;

jobA . s e t O u t p u t P a t h ( tmp ) ;

jobA . s e t O u t p u t K e y C l a s s ( Text . c l a s s ) ;

jobA . s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e . c l a s s ) ;

jobA . s e t M a p p e r C l a s s (myMap . c l a s s ) ;

jobA . s e t C o m b i n e r C l a s s ( myReduce . c l a s s ) ;

jobA . s e t R e d u c e r C l a s s ( myReduce . c l a s s ) ;


∗ /


J o b C l i e n t . r u nJ ob ( jobA ) ;

JobConf jobB = new JobConf ( ge tConf ( ) , NgramModel .

c l a s s ) ;

jobB . setJobName ( ” e v a l u a t i o n ” ) ;

jobB . s e t ( ” t a b l e ” , o t h e r a r g s . g e t ( 1 ) ) ;

jobB . s e t I n t ( ” o r d e r ” , o r d e r ) ;

jobB . s e t I n t ( ” t y p e ” , t y p e ) ;

jobB . s e t I n t ( ” cache ” , cache ) ;

jobB . s e t I n p u t P a t h ( tmp ) ;

jobB . s e t I n p u t F o r m a t ( S e q u e n c e F i l e I n p u t F o r m a t . c l a s s ) ;

jobB . s e t O u t p u t F o r m a t ( Tex tOu tpu tFo rma t . c l a s s ) ;

jobB . s e t O u t p u t K e y C l a s s ( Text . c l a s s ) ;

jobB . s e t O u t p u t V a l u e C l a s s ( F l o a t W r i t a b l e . c l a s s ) ;

jobB . setNumReduceTasks ( 0 ) ;

Pa th r e s u l t D i r = new Pa th ( ” prob ” ) ;

i f ( F i l e S y s t e m . g e t ( jobB ) . e x i s t s ( r e s u l t D i r ) )

F i l e S y s t e m . g e t ( jobB ) . d e l e t e ( r e s u l t D i r ) ;

jobB . s e t O u t p u t P a t h ( r e s u l t D i r ) ;

jobB . s e t M a p p e r C l a s s ( NgramModel . c l a s s ) ;

long m y t o t a l = 1 ;

t r y {Pa th t m p F i l e = new Pa th ( ”num−of−un ig rams ” ) ;

F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( jobB ) ;



Reader ( f i l e S y s ,

tmpFi l e , jobB ) ;

LongWr i t ab l e v a l u e = new LongWr i t ab l e ( ) ;

i f ( r e a d e r . n e x t ( new Text ( ) , v a l u e ) ) {m y t o t a l = v a l u e . g e t ( ) ;

}r e a d e r . c l o s e ( ) ;

} ca tch ( IOExcep t ion e ) {System . o u t . p r i n t l n ( ” can ’ t open f i l e num−of−unigrams

, e x i t ” ) ;

re turn −1;

}System . o u t . p r i n t l n ( ” t h e t o t a l unigram number i s : ” +

m y t o t a l ) ;

jobB . se tLong ( ” t o t a l ” , m y t o t a l ) ;

J o b C l i e n t . r u nJ ob ( jobB ) ;


JobConf con f = new JobConf ( ge tConf ( ) , NgramModel .

c l a s s ) ;

con f . setJobName ( ” t e s t ” ) ;

Pa th t m p F i l e = new Pa th ( ” p e r p l e x i t y ” ) ;

F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( con f ) ;

S e q u e n c e F i l e . Reader r e a d e r = new S e q u e n c e F i l e . Reader (

f i l e S y s , tmpFi l e ,

con f ) ;

Text k = new Text ( ) ;

F l o a t W r i t a b l e v = new F l o a t W r i t a b l e ( ) ;

whi le ( r e a d e r . n e x t ( k , v ) ) {System . o u t . p r i n t l n ( k . t o S t r i n g ( ) + ” : ” + v . g e t ( ) ) ;

}long s e c = ( t 2 − t 1 ) / 1000 ; / / s e c o n d s


m i n u t e s ” + s e c


% 60 + ” s e c o n d s ” ) ;

re turn 0 ;

}


{i n t r e s = ToolRunner . run ( new C o n f i g u r a t i o n ( ) , new

NgramModel ( ) , a r g s ) ;

System . e x i t ( r e s ) ;

}}

Estimating Language Models Using Hadoop and Hbase

Documents

Transcript of Estimating Language Models Using Hadoop and Hbase