Estimating Language Models Using Hadoop
and Hbase
Xiaoyang Yu
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2008
Abstract
This thesis presents the work of building a large scale distributed ngram language
model using a MapReduce platform named Hadoop and a distributed database called
Hbase. We propose a method focusing on the time cost and storage size of the model,
exploring different Hbase table structures and compression approaches. The method
is applied to build up training and testing processes using Hadoop MapReduce frame-
work and Hbase. The experiments evaluate and compare different table structures on
training 100 million words for unigram, bigram and trigram models, and the results
suggest a table based on half ngram structure is a good choice for distributed language
model. The results of this work can be applied and further developed in machine trans-
lation and other large scale distributed language processing areas.
Acknowledgements
Many thanks to my supervisor Miles Osborne for his advice, his great support,
and for inspiring new ideas about this project during our meetings. I would also
like to thank my parents for their trust in me and their encouragement. Thanks
a lot to Zhao Rui for her great suggestions for the thesis.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Xiaoyang Yu)
Table of Contents
1 Introduction 1
2 Background 3
2.1 Ngram Language Model . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Distributed Language Modeling . . . . . . . . . . . . . . . . . . . . 5
2.3 Hadoop MapReduce Framework . . . . . . . . . . . . . . . . . . . . 6
2.4 Hbase Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Estimating Language Model using Map Reduce 10
3.1 Generate Word Counts . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Generate Count of Counts . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Generate Good-Turing Smoothing Counts . . . . . . . . . . . . . . . 15
3.4 Generate Ngram Probability . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Generate Hbase Table . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5.1 n-Gram Based Structure . . . . . . . . . . . . . . . . . . . . 19
3.5.2 Current Word Based Structure . . . . . . . . . . . . . . . . . 20
3.5.3 Context Based Structure . . . . . . . . . . . . . . . . . . . . 21
3.5.4 Half Ngram Based Structure . . . . . . . . . . . . . . . . . . 22
3.5.5 Integer Based Structure . . . . . . . . . . . . . . . . . . . . . 24
3.6 Direct Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 Caching Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Evaluation Methods 31
4.1 Time and Space for Building LM . . . . . . . . . . . . . . . . . . . . 31
4.2 LM Perplexity Comparison . . . . . . . . . . . . . . . . . . . . . . . 32
5 Experiments 33
5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Ngram Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Table Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6 Discussion 43
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Bibliography 47
A Source Code 49
A.1 NgramCount.java . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.2 TableGenerator.java . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A.3 NgramModel.java . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 1
Introduction
In statistical natural language processing, ngram language models are widely used
in areas such as machine translation and speech recognition. An ngram model is
trained on large sets of unlabelled text to estimate a probability model. Generally
speaking, more data gives a better model. Although we can easily obtain huge amounts
of training text from corpora or the web, the computational power and storage space
of a single computer are limited. Distributed language models have been proposed to
meet the need to process vast amounts of data and to address these problems. In the
field of distributed language modeling, most of the relevant work has concentrated
on methods for model training and testing, with no detailed discussion of the storage
structure. The objective of our work is to efficiently construct and store the model in a
distributed cluster environment. Moreover, we are interested in exploring how
different database table structures cope with language models.
The novel aspects of our project are as follows. We show that a distributed database
can be well integrated with distributed language modeling, providing the input/output
data sink for both the distributed training and testing processes. Using a distributed
database helps to reduce the computational and storage cost. Meanwhile, the choice of
database table structure strongly affects efficiency. Comparing against the other
structures in our experiments, we find that a half-ngram based table structure is a good
choice for distributed language models in terms of both time and space cost.
We use a distributed computing platform called Hadoop, which is an open source
implementation of the MapReduce framework proposed by Google [3]. A distributed
database named Hbase is used on top of Hadoop, providing a model similar to Google’s
Bigtable storage structure [2]. The training methods are based on Google’s work on
large language model in machine translation [1], and we propose a different method
using Good-Turing smoothing with a back-off model and pruning. We store the ngram
model in the distributed database using 5 different table structures, and a back-off
estimation is performed in a testing process using MapReduce. The table structures we
propose are based on the full ngram, the current word, the context, the half ngram and
converted integers. Using Hadoop and Hbase, we build unigram, bigram and trigram
models with 100 million words from the British National Corpus. All the table
structures are evaluated and compared for each ngram order with a testing set of 35,000
words. The training and testing processes are split into several steps, and for each step
the time and space costs are compared in the experiments. The perplexity of the testing
set is also compared with the traditional language modeling package SRILM [6]. The
results inform the choice of a proper table structure from the standpoint of efficiency.
The rest of this thesis is organised as follows: Chapter 2 introduces the ngram
language model, related work on distributed language modeling, the Hadoop MapReduce
framework and the Hbase distributed database; Chapter 3 gives the details of our
methods, illustrating all the steps of the MapReduce training and testing tasks, as well
as all the different table structures we propose; Chapter 4 describes the evaluation
methods we use; Chapter 5 presents all the experiments and results; Chapter 6 discusses
the choice of table structure for Hadoop/Hbase and possible future work.
Chapter 2
Background
In statistical language processing, it is essential to have a language model which
measures how likely a sequence of words is to occur in some context. The ngram
language model is the major language modeling method, used together with smoothing
and back-off methods to deal with the problem of data sparseness. Estimating it in
a distributed environment enables us to process vast amounts of data and get a better
language model. Hadoop is a distributed computing framework that can be used for
language modeling, and Hbase is a distributed database which can store the model
data as database tables and integrates with the Hadoop platform.
2.1 Ngram Language Model
The ngram model is the leading method for statistical language processing. It tries
to predict the next word w_n given the n-1 words of context w_1, w_2, ..., w_{n-1} by
estimating the probability function:

P(w_n | w_1, ..., w_{n-1})   (2.1)

Usually a Markov assumption is applied: only the prior local context, the last
n-1 words, affects the next word. The probability function can be expressed through
the frequency of word occurrences in a corpus, using Maximum Likelihood Estimation
without smoothing:

p(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})   (2.2)
where f(w_1, ..., w_n) is the count of how many times we see the word sequence
w_1, ..., w_n in the corpus. One important aspect is count smoothing, which adjusts the
empirical counts collected from the training texts to the expected counts for ngrams. Considering
ngrams that don't appear in the training set but are seen in the testing text, a maximum
likelihood estimation would give them zero probability, so we need to find a better
estimation of the expected counts. A popular smoothing method called Good-Turing
smoothing is based on the count of ngram counts; the expected counts are adjusted
with the formula:

r* = (r + 1) N_{r+1} / N_r   (2.3)

where r is the actual count and N_r is the number of ngrams that occur exactly r times.
For words that are seen frequently, r can be large and N_{r+1} is likely to be
zero; in this case we can look up N_{r+2}, N_{r+3}, ..., N_{r+n}, find a non-zero count
and use that value instead.
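The adjustment, including the fallback to the nearest non-zero count of counts, can be sketched in Python (an illustrative sketch, not the thesis's Hadoop implementation; the function name is ours, and the five-step lookahead limit follows the policy described in Chapter 3):

```python
from collections import Counter

def good_turing(counts, max_lookahead=5):
    """Good-Turing adjustment r* = (r + 1) * N_{r+1} / N_r.

    counts maps each ngram to its raw count r; N_r is the number of
    ngrams seen exactly r times.  When N_{r+1} is zero we fall back to
    the nearest non-zero N_{r+i}, and keep the raw count if none of
    N_{r+1}..N_{r+max_lookahead} exists.
    """
    nr = Counter(counts.values())  # r -> N_r
    smoothed = {}
    for ngram, r in counts.items():
        adjusted = r  # default: keep the raw count
        for i in range(1, max_lookahead + 1):
            if nr[r + i] > 0:
                adjusted = (r + 1) * nr[r + i] / nr[r]
                break
        smoothed[ngram] = adjusted
    return smoothed
```

For example, with three ngrams seen once and one seen twice, N_1 = 3 and N_2 = 1, so each singleton count is adjusted to 2 * 1/3 ≈ 0.67.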
The idea of count smoothing is to better estimate the probabilities of ngrams.
Meanwhile, for unseen ngrams, we can do a back-off estimation using the probability of
lower order tokens, which is usually more robust in the model. For an ngram w_1, ..., w_n
that doesn't appear in the training set, we estimate the probability of the (n-1)-gram
w_2, ..., w_n instead:

P(w_n | w_1, ..., w_{n-1}) =
    p(w_n | w_1, ..., w_{n-1})                           if w_1, ..., w_n is found
    λ(w_1, ..., w_{n-1}) * p(w_n | w_2, ..., w_{n-1})    otherwise          (2.4)

where λ(w_1, ..., w_{n-1}) is the back-off weight. In general, back-off requires more
lookups and computation. Modern back-off smoothing techniques like Kneser-Ney
smoothing [5] use more parameters to estimate each ngram probability instead of a
simple Maximum Likelihood Estimation.
To evaluate the language model, we can calculate the cross entropy and perplexity
on testing ngrams. A good language model should give a higher probability to each
word prediction. The cross entropy is the average entropy of each word prediction in
the ngram:

H(p_LM) = -(1/n) log p_LM(w_1, w_2, ..., w_n)
        = -(1/n) Σ_{i=1..n} log p_LM(w_i | w_1, ..., w_{i-1})   (2.5)

Perplexity is defined in terms of cross entropy:

PP = 2^{H(p_LM)}
   = 2^{-(1/n) Σ_{i=1..n} log p_LM(w_i | w_1, ..., w_{i-1})}   (2.6)

Generally speaking, perplexity is the average number of choices at each word
prediction, so the lower the perplexity, the higher the probability of the predictions,
meaning a better language model.
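As a sanity check on equation (2.6), perplexity can be computed directly from the per-word conditional probabilities (a small illustrative Python sketch; the function name is ours):

```python
import math

def perplexity(word_probs):
    """Perplexity 2^H of a test sequence, given the conditional
    probabilities p_LM(w_i | w_1..w_{i-1}) of each word in it."""
    n = len(word_probs)
    # cross entropy: average negative log-probability per word
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy
```

A model that always offers four equally likely continuations (p = 0.25 per word) has perplexity 4, matching the "average number of choices" reading.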
2.2 Distributed Language Modeling
Large scale distributed language modeling is quite a new topic. Typically, using a
server/client architecture as in Figure 2.1, we need to efficiently manipulate large data
sets, communicate with the cluster workers, and organise their work. The server controls
and distributes tasks to all the workers, and clients send requests to the server to execute
queries or commands. Related ideas can be seen in Distributed Information Retrieval
[7]. Recent work includes a method that splits a large corpus into many non-overlapping
chunks and makes each worker in the cluster load one of the chunks with its suffix
array index [8]. The suffix array index helps to quickly find the proper context,
and the word occurrences can be counted in each worker simultaneously. In
this kind of approach, the raw word counts are stored and served in the distributed
system, and the clients collect the needed counts and then compute the probabilities.
The system is applied to the N-best list re-ranking problem and shows a nice
improvement on a 2.97 billion-word corpus.
Figure 2.1: Server/Client Architecture
A similar architecture was proposed later with interpolated models [4]. The corpus
is also split into chunks along with their suffix array indexes. In addition, a smoothed
n-gram model is computed and stored separately on other workers. The client then
requests both the raw word counts and the smoothed probabilities from different workers,
and computes the final probability by linear weighted blending. The authors apply this
approach to N-best list re-ranking and integrate it with a machine translation decoder,
showing a good improvement in translation quality when trained on 4 billion words.
The previous two methods solve the problem of storing a large corpus and providing
word counts for clients to estimate probabilities. Later work [1] describes a method to
store only the smoothed probabilities for distributed n-gram models. In the previous
methods, the client needs to look up each worker to find the proper context using the
suffix array index. In contrast, this method results in exactly one worker being contacted
per n-gram, and exactly two workers for context-dependent backoff [1]. The authors
propose a new backoff method called Stupid Backoff, a simpler scheme using only the
frequencies and an empirically chosen back-off weight.
P(w_n | w_1, ..., w_{n-1}) =
    f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})     if f(w_1, ..., w_n) > 0
    α * f(w_2, ..., w_n) / f(w_2, ..., w_{n-1})    otherwise          (2.7)
where α is set to 0.4 based on their earlier experiments [1]. The experiments
directly store 5-gram models built from different sources. According to their results, 1.8T
tokens from the web yield a 16M-word vocabulary and 300G n-grams, producing a 1.8T
language model with their Stupid Backoff.
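The recursion behind equation (2.7) can be sketched as follows; the tuple-keyed count table and the function name below are illustrative, not the authors' implementation:

```python
def stupid_backoff(ngram, counts, alpha=0.4):
    """Stupid Backoff score for a word-tuple ngram.

    counts maps word tuples to frequencies.  The scores are relative
    frequencies combined with a fixed back-off weight alpha, so they
    are not normalised probabilities.
    """
    if len(ngram) == 1:
        # base case: unigram relative frequency
        total = sum(c for k, c in counts.items() if len(k) == 1)
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    # unseen: back off to the shorter ngram with weight alpha
    return alpha * stupid_backoff(ngram[1:], counts, alpha)
```

With counts {("a",): 2, ("b",): 2, ("a", "b"): 1}, the seen bigram ("a", "b") scores 1/2, while the unseen ("b", "a") backs off to 0.4 times the unigram score of "a".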
These previous works provide the theoretical principles of our project. The last work
used Google's distributed programming model MapReduce [3], but the authors didn't
describe the way the data is stored. We adopt the MapReduce model for our work
because it offers a clear workflow and has already proved to be a mature model, widely
used in various applications by Google, Yahoo and other companies. Although
Google's own implementation of MapReduce is proprietary, we can still choose an open
source implementation of this model.
2.3 Hadoop MapReduce Framework
Hadoop is an open source implementation of the MapReduce programming model.
It is based on Java and uses the Hadoop Distributed File System (HDFS) to create
multiple replicas of data blocks for reliability, distributing them around the cluster
and splitting tasks into small blocks. According to its website, Hadoop has been
demonstrated on 2,000 nodes and is designed to support clusters of up to 10,000 nodes,
so it enables us to extend our cluster in the future.
A general MapReduce architecture¹ can be illustrated as in Figure 2.2. At first the
input files are split into small blocks named FileSplits, and Map operations are created
in parallel, with one task per FileSplit.
Figure 2.2: MapReduce Architecture
Input and Output types of a Map-Reduce job:
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
The FileSplit input is treated as key/value pairs, and the user specifies a Map
function to process a key/value pair and generate a set of intermediate key/value pairs [3].
When the Map operation has finished, the output is passed to a Partitioner, usually a hash
function, so that all the pairs sharing the same key can be collected together later on.
After the intermediate pairs are generated, a Combine function is called to do a
reduce-like job in each Map node to speed up processing. Then a Reduce function merges
all intermediate values associated with the same intermediate key and writes the results
to output files. Map and Reduce operations work independently on small blocks of data.
The final output is one file per executed reduce task, and Hadoop stores the output
files in HDFS.
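The <k1, v1> → map → combine → reduce flow above can be imitated in memory. This Python sketch (the names `run_mapreduce` and `wc` are ours, and the combiner here runs over all map output rather than per split, a simplification) shows word counting in that shape:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, combine_fn=None):
    """Minimal in-memory sketch of the Hadoop flow: map each input
    record, optionally combine, shuffle by key, then reduce."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    if combine_fn:  # reduce-like pass on map output before the shuffle
        grouped = defaultdict(list)
        for k, v in intermediate:
            grouped[k].append(v)
        intermediate = [kv for k, vs in grouped.items()
                        for kv in combine_fn(k, vs)]
    shuffled = defaultdict(list)  # partition/sort: same key -> same reduce
    for k, v in intermediate:
        shuffled[k].append(v)
    return dict(kv for k in sorted(shuffled)
                for kv in reduce_fn(k, shuffled[k]))

# word count: map emits (word, 1); combine and reduce both sum
wc = run_mapreduce(
    [(0, "a big house"), (1, "buy a house")],
    map_fn=lambda docid, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, vals: [(key, sum(vals))],
    combine_fn=lambda key, vals: [(key, sum(vals))],
)
```

Running this gives one summed count per distinct word, the same shape of output a real Hadoop job would write to HDFS.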
For input text files, each line is parsed as one “value” string, so the Map function
works at the sentence level; for output files, the format is one key/value pair per
record, and thus if we want to reprocess the output files, the task works at the
record pair level. For language model training using MapReduce, our original inputs
¹ http://wiki.apache.org/hadoop/HadoopMapReduce
are text files, and what we finally get from Hadoop are ngram/probability pairs.
This means that, in theory, we could build the language model using Hadoop alone.
This chapter gives a brief overview of ngram language models, smoothing and back-off
methods, relevant work on distributed language modeling, the Hadoop framework
and the Hbase database.
2.4 Hbase Database
Hbase is a distributed database on top of HDFS, whose structure is very similar to
Google's Bigtable model. The reason we look into using Hbase for the language model
is that, although we could do all our work in Hadoop and store all outputs as text or
binary files in HDFS, that would cost more time and extra computation for parsing
input records. The bottleneck of Hadoop MapReduce is that the input/output is file
based, either plain text or binary, which is reasonable for storing large amounts of data
but not suitable for the query process. For example, if we want to query the probability
of one ngram, we have to load all the files into the map function, parse all of them, and
compare each key with the ngram to find the probability value. We would need to do
this comparison for each ngram in the test texts, which would take quite a long time.
Instead of parsing files, we can make use of a database structure such as Hbase
to store the ngram probabilities in database tables. The advantages are obvious: a
database is designed to meet the needs of repeated queries; the data is indexed and
compressed, reducing the storage size; and tables can be easily created, modified,
updated or deleted. In a distributed database, the table is stored in the distributed
filesystem, providing scalable and reliable storage. Meanwhile, for language modeling,
the model data is highly structured. The basic format is the ngram/probability pair,
which can easily be organised into more compact structures. Considering that we may
get a huge amount of data, compressed structures are essential from both the time and
storage aspects.
Hbase stores data in labelled tables. The table is designed to have a sparse
structure. Data is stored in table rows, and each row has a unique key with an arbitrary
number of columns, so one row can have thousands of columns while another has only
one. In addition, a column name is a string of the form <family>:<label>, where
<family> is a column family name assigned to a group of columns, and the label can
be any string. The idea of column families is that only administrative operations can
modify family names, but users can create arbitrary labels on demand. A conceptual
view of an Hbase table is illustrated below:
Row Key   Column color:        Column prize:        Column size:
car       color:red            Mini Cooper          medium
          color:green          Volkswagen Beetle    small
                               BMW 2.6k
...       ...                  ...                  ...
Another important aspect of Hbase is that the table is column oriented, which
means the tables are physically stored per column. Each column is stored in one file
split, each column family is stored closely in HDFS, and empty cells in a column are
not stored. This implies that in Hbase it is less expensive to retrieve a column than a
row: to retrieve a row the client must request all the column splits, whereas to retrieve
a column only one column split, basically a single file in HDFS, is requested. On the
other hand, writing to a table is row based. Only one row at a time is locked for
updating, so all writes across the cluster are atomic by default.
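The sparse row/column-family layout can be pictured as nested maps from a row key to "family:label" column names; the rows and values in this Python sketch are invented for illustration:

```python
# A sparse, Hbase-style table as nested maps:
# row key -> "family:label" column name -> cell value.
# The rows and probabilities below are made up for illustration.
table = {
    "a big": {"gt:house": 1.0, "gt:deal": 0.5},  # a row with two columns
    "buy":   {"gt:a": 0.67},                     # a row with one column
}

def get_cell(table, row, column):
    """Look up one cell; empty cells are simply absent,
    not stored as nulls."""
    return table.get(row, {}).get(column)
```

Because each row holds only the columns it actually has, one row can carry thousands of columns while its neighbour carries one, exactly the sparsity the text describes.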
The relationship between Hadoop, Hbase and HDFS can be illustrated as Figure
2.3:
Figure 2.3: Relationship between Hadoop, Hbase and HDFS
The overviews above cover the basic techniques and tools used in this project. The
next chapter describes our methods for building the language modeling process for
ngram models with smoothing and back-off using Hadoop and Hbase.
Chapter 3
Estimating Language Model using
Map Reduce
The distributed training process described in Google's work [1] is split into three
steps: convert words to ids, generate ngrams per sentence, and compute ngram
probabilities. We extend it with a Good-Turing smoothing estimation, so extra steps are
included to calculate the count of ngram counts, store the counts in HDFS, and then
fetch the data to adjust the raw counts. We decide to store the ngram strings directly,
considering that more training data can be added and updated later on. We estimate
a back-off model to compute the probabilities, and several database table structures are
designed for comparison. The testing process also uses MapReduce, acting like a
distributed decoder, so we can process multiple testing texts together.
Figure 3.1 is the flow chart of the training process, and Figure 3.2 shows a
simplified testing process. The rest of this chapter explains the details of each step in the
training and testing processes.

Figure 3.1: Training Process

Figure 3.2: Testing Process

3.1 Generate Word Counts

The first step is to parse the training text, find all the ngrams, and emit their counts.
The map function reads one text line as input; the key is the docid, and the value is
the text. Each line is split into all the unigrams, bigrams and trigrams, up to the ngram
order. These ngrams become the intermediate keys, and the values are a single count
of 1. Then a combiner function sums up all the values for the same key within each
Map task, and a reduce function, the same as the combiner, collects the output from
the combiners and sums the values for each key. The final key is the same as the map
function's output, the ngram, and the value is the raw count of this ngram throughout
the training data set. A partitioner based on the hashcode of the first two words is used,
which makes sure not only that all the values with the same key go to one reduce
function, but also that the average load is balanced [1]. We also include a pruning count:
any raw count below this threshold is dropped. The simplified map and reduce functions
can be illustrated as below:
map(LongWritable docid, Text line,
    OutputCollector<Text, IntWritable> output) {
  words[] = line.split(blank space or punctuation)
  for i = 1..order
    for k = 0..(words.length - i)
      output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator<IntWritable> values,
       OutputCollector<Text, IntWritable> output) {
  int sum = 0;
  while (values.hasNext())
    sum += values.next().get();
  if (sum > prune)
    output.collect(key, sum);
}
Figure 3.3 describes the process of the map - combine - reduce functions given some
sample text. The combine function starts right after map, so it inherits the same
key/value pairs from its preceding map task. The output from the reduce function is the
raw counts; the keys are also sorted, implying some indexing process. This feature can
be used in enhancements for fast lookups.
Figure 3.3: Generate Raw Counts
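The per-line ngram generation and count summing can be imitated outside Hadoop. This Python sketch (function names ours; pruning follows the `sum > prune` test in the reduce function above) makes the intermediate keys concrete:

```python
from collections import Counter

def extract_ngrams(line, order):
    """Emit every unigram up to order-gram in one text line, as one
    map call does (ngrams joined with spaces for readability)."""
    words = line.split()
    for n in range(1, order + 1):
        for k in range(len(words) - n + 1):
            yield " ".join(words[k:k + n])

def count_ngrams(lines, order, prune=0):
    """Sum the counts per ngram (the combine/reduce step) and drop
    any count not above the pruning threshold."""
    counts = Counter()
    for line in lines:  # one map call per line
        counts.update(extract_ngrams(line, order))
    return {g: c for g, c in counts.items() if c > prune}
```

For the two lines "buy a house" and "a house" with order 2, the bigram "a house" is counted twice while "buy a" is counted once.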
Because this step generates all the ngrams, it is possible to collect the total number
of unigram counts, bigram counts, etc. These numbers are necessary for smoothing
techniques. Here we only collect the total unigram count for Good-Turing smoothing. It
is easy to collect the total bigram or trigram counts in a similar way, which would be
needed for Kneser-Ney smoothing.
enum MyCounter {INPUT_WORDS};

reduce(Text key, Iterator<IntWritable> values,
       OutputCollector<Text, IntWritable> output,
       Reporter reporter) {
  .....
  if (sum > prune)
    output.collect(key, sum);
  if key is unigram
    reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
}
3.2 Generate Count of Counts
Good-Turing smoothing is based on the count of ngram counts. We need to collect
all the counts of counts for unigrams, bigrams and trigrams, up to the ngram order. To
do this, all the raw counts are loaded into a new MapReduce job. For each ngram, the
map function emits one count of the raw counts along with the ngram order, so the
output key has the format <ngram-order raw-counts>, and the value is <count-of-counts>.
The combine and reduce functions merge all the counts of counts with the same key.
The final output should be fairly small; normally a single file is enough to store all the
counts of counts. The simplified map function is like this:
// input key is ngram, input value is the raw counts
public void map(Text key, IntWritable value,
OutputCollector<IntWritable, IntWritable> output){
words[] = toStringArray(key);
// words.length is the ngram order
// combine the order with the counts
String combine = words.length + value.toString();
output.collect(toText(combine), one);
}
The combine and reduce functions are the same as in the previous step. An example
of the MapReduce job is illustrated in Figure 3.4.

One important point about the count of counts is that for higher counts, the count of
counts is usually discontinuous. This always happens for frequent ngrams, which have
quite a high number of counts. We don't have to store or estimate these empty counts
of counts; instead we can adjust the smoothing formula used in the next step.
Figure 3.4: Generate Count of Counts
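The count-of-counts step can likewise be condensed into plain Python (an illustrative sketch; the keys follow the <ngram-order raw-counts> format of the map output described above):

```python
from collections import Counter

def count_of_counts(ngram_counts):
    """Count how many ngrams of each order share each raw count,
    keyed '<order> <raw-count>' as in the map output."""
    coc = Counter()
    for ngram, r in ngram_counts.items():
        order = len(ngram.split())  # the ngram order
        coc[f"{order} {r}"] += 1
    return dict(coc)
```

Given two unigrams seen twice and one bigram seen once, the output records N_2 = 2 for order 1 and N_1 = 1 for order 2.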
3.3 Generate Good-Turing Smoothing Counts
Now that we have both the raw counts and the count of counts, following the
Good-Turing smoothing formula we can estimate the smoothed counts for each ngram. In
this step, the inputs are still the ngrams and their raw counts; each map function
reads in the count-of-counts file, stores all the data in a HashMap structure, and
computes the smoothed counts. The basic formula is:

r* = (r + 1) N_{r+1} / N_r   (3.1)

If we can find both N_{r+1} and N_r in the HashMap, the formula can be applied
directly. Otherwise, we try to look up the “nearest” count of counts: if we can't find
N_{r+1}, we try N_{r+2}, then N_{r+3}, N_{r+4}, etc. We decide to look up at most 5
counts, N_{r+1} .. N_{r+5}. If we can't find any of these counts, we use the raw count
instead. In this situation the raw count is typically very large, meaning the ngram has
a relatively high probability, so we don't have to adjust the count. For each ngram,
the smoothing process is needed only once, so we actually don't need any combine or
reduce function for count smoothing. The map function can be illustrated as:
HashMap countHash;

void configure() {
  reader = openFile(count-of-counts);
  // k is the '<ngram-order> <raw-counts>' key, v is the count of counts
  while (reader.next(k, v)) {
    order = k[0];
    counts = k[1..length-1];
    countHash.put(order, (counts, v));
  }
  reader.close();
}

// input key is the ngram, value is the raw counts
void map(Text key, IntWritable value,
         OutputCollector<Text, Text> output) {
  c = value;  // fall back to the raw counts if no N_{r+i} is found
  for i = 1..5
    if exists countHash.get(key.length).get(value) &&
       exists countHash.get(key.length).get(value+i) {
      Nr = countHash.get(key.length).get(value);
      Nr1 = countHash.get(key.length).get(value+i);
      c = (value+1)*Nr1/Nr;
      break;
    }
  // c is the smoothed counts for ngram 'key'
}
Figure 3.5 shows an example of Good-Turing smoothing.
3.4 Generate Ngram Probability
To estimate the probability of an ngram w_1, w_2, ..., w_n, we need the counts of
w_1..w_n and w_1..w_{n-1}. Because one MapReduce function, either map or reduce,
works on one key at a time, in order to collect both of the above counts we need to emit
the two counts to one reduce function. The approach is to use the current context
w_1, w_2, ..., w_{n-1} as the key, and combine the current word w_n with the counts as
the value.

Figure 3.5: Good-Turing Smoothing

This can be done in the map function of the previous step. Notice that, except for
the highest order ngrams, all lower order ngrams also need to be emitted with their own
smoothed counts, providing the counts needed when they serve as contexts themselves.
Figure 3.6 gives an example of the emitted context with current word format.
...
//c is the smoothed counts for ngram ’key’
//suffix is the current word, the last item in string array
suffix = key[length-1];
context = key[0..length-2];
output.collect(context, (’\t’+suffix+’\t’+c));
if order < highest order
output.collect(key, c);
...
Now the reduce function can receive both the counts of the context and the counts
with the current word, and compute the conditional probability of the ngram based on
the formula:

p(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})   (3.2)
Figure 3.6: Emit context with current word

// the input k is the context
void reduce(Text k, Iterator v) {
  while (v.hasNext) {
    value = v.next;
    // if the value carries a current word
    if (value starts with '\t') {
      get current word w
      get counts C
    }
    // if not, this is the context's own count
    else
      get base counts Cbase
    // compute the probability for w given context k
    prob = C / Cbase;
    if (prob > 1.0)
      prob = 1.0;
  }
  ...
}
After Good-Turing smoothing, some counts might be quite small, so the probability
can come out above 1.0; in this case we adjust it down to 1.0. For the back-off model,
we use the simple scheme proposed by Google [1], in which the back-off weight is
set to 0.4. This value was chosen based on empirical experiments and has proved a
stable choice in previous work. If we wanted to estimate a dynamic back-off weight
for each ngram, more steps would be required, but it has been argued that the choice of
a specific back-off or smoothing method becomes less relevant as the training corpus
grows large [4].

Figure 3.7: Estimate probability
In this step we get all the probabilities for the ngrams, and with the back-off weight
we can estimate the probability of a testing ngram through queries. So the next
important step is to store these probabilities in the distributed environment.
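The reduce-side division and capping can be condensed into a few lines (an illustrative Python sketch with hypothetical names, assuming the context's own count and the per-word counts have already been separated as in the reduce function above):

```python
def context_probabilities(context_count, word_counts):
    """p(w_n | context) = count(context + w_n) / count(context),
    capped at 1.0 because smoothed counts can push the ratio above 1."""
    return {word: min(c / context_count, 1.0)
            for word, c in word_counts.items()}
```

For a context with smoothed count 4.0 and continuations counted 2.0 and 6.0, the probabilities come out as 0.5 and 1.0 (the second capped).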
3.5 Generate Hbase Table
Hbase can be used as the data input/output sink in Hadoop MapReduce jobs, so
it is straightforward to integrate it with the previous reduce function. Some
modifications are needed: because writing to an Hbase table is row based, each time we need
to generate a row key with some context as the columns. There are several different
choices, either a simple scheme of one ngram per row, or more structured rows based
on the current word and context. Two major aspects are considered: the write/query
speed and the table storage size.
3.5.1 n-Gram Based Structure
An initial structure is very simple, similar to the text output format. Each n-gram
is stored in a separate row, so the table has a flat structure with one single column. For
each row, the key is the ngram itself, and the column stores its probability. Table 3.1 is
an example of this structure. This structure is easy to implement and maintain, yet it is
sorted but uncompressed. Considering that there may be many identical probabilities
among different ngrams, e.g. higher order ngrams mostly appear only once, the table
cannot represent all those identical probabilities efficiently. Also, since we only
have one column, the advantage of a distributed database, arbitrary sparse columns, has
little effect for this table.
key            column family:label (gt:prob)
a              0.11
a big          0.67
a big house    1.0
buy            0.11
buy a          0.67
...            ...

Table 3.1: n-Gram based table structure
3.5.2 Current Word Based Structure
Alternatively, for all the ngrams w1,w2, ...,wn that share the same current word wn,
we can store them in one row with the key wn. All the possible contexts are stored in
separate columns named <column family:context>. Table 3.2 is an example. This table
has a sparse column structure: for a word with many contexts the row can be quite long,
while for a word with few contexts it is short. In this table we reduce the number of
rows and expand all contexts into separate columns, so instead of one single column
split we have many small column splits. From the view of the distributed database the
data is sparsely stored, but from the view of the data structure it is still somewhat
uncompressed: if two current words in two rows share the same context, or the same
probability for some context, we still store them separately. This results in multiple
column splits with very similar structures, which means some redundancy.
We also need to collect the unigram probability for each current word and store it
in a separate column. Table 3.2 shows the resulting layout.

key column family:label
    gt:unigram gt:a gt:a big gt:i gt:buy ..
a 0.15 0.667
big 0.057 0.667
house 0.3 0.667 1.0
buy 0.15 1.0
... ... ... ... ... ... ...

Table 3.2: Word based table structure

The process to create table rows can be illustrated as:

//k is the ngram, c is the probability
for each ngram (k, c) {
    word = k[length-1];
    if k.length == 1
        column = "gt:unigram";
    else {
        context = k[0..length-2];
        column = "gt:" + context;
    }
    output.collect(word, (column, c));
}
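The same row-building logic can be sketched in runnable Python (a plain dict stands in for Hbase; the sample values mirror Table 3.2 under the assumption that 0.667 in row "big" belongs to column gt:a):

```python
# Sketch of the current-word-based layout: rows keyed by the current word,
# contexts as column labels.
from collections import defaultdict

table = defaultdict(dict)   # row key -> {column label: probability}

def collect(ngram, prob):
    words = ngram.split()
    word = words[-1]                              # current word = row key
    if len(words) == 1:
        column = "gt:unigram"
    else:
        column = "gt:" + " ".join(words[:-1])     # context as column label
    table[word][column] = prob

for ngram, prob in [("a", 0.15), ("big", 0.057), ("a big", 0.667)]:
    collect(ngram, prob)

print(table["big"])   # -> {'gt:unigram': 0.057, 'gt:a': 0.667}
```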
3.5.3 Context Based Structure
Similar to the current word based structure, we can instead use the context w1,w2, ...,wn−1
as the key for each row, and store all the possible following words wn in separate
columns named <column family:word>. This table has more rows than the word based
table, but still fewer than the ngram based table. For a large data set or higher order
ngrams the number of contexts can be quite large; on the other hand the number of
columns is reduced. The column splits are separated by different words, and for all
words that occur only once the split is still very small, only one column value per split.
Generally speaking, if we have n unigrams in the data set we will have around n
columns in the table. For a training set containing 100 million words the number of
unigrams is around 30,000, so the table can be really sparse. An example of this table
structure is shown in Table 3.3.
key column family:label
    gt:unigram gt:a gt:big gt:i gt:buy gt:the gt:house ..
a 0.11 0.67 0.67
a big 1.0
buy 0.11 0.67 0.67
buy a 1.0
i 0.04 1.0
... ... ... ... ... ... ... ... ...

Table 3.3: Context based table structure

To avoid redundancy, only unigram keys store their own probabilities in
<gt:unigram>: storing the probability of "a big" under key "a big" would duplicate
the value already kept in row "a", column <gt:big>. The process can be written as:

//k is the ngram, c is the probability
for each ngram (k, c) {
    context = k;
    if k.length == 1
        column = "gt:unigram";
    else {
        word = k[length-1];
        context = k[0..length-2];
        column = "gt:" + word;
    }
    output.collect(context, (column, c));
}
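A Python sketch of this transposed layout (again with a dict standing in for Hbase, values from Table 3.1/3.3):

```python
# Sketch of the context-based layout: rows keyed by the context,
# following words as column labels.
from collections import defaultdict

table = defaultdict(dict)

def collect(ngram, prob):
    words = ngram.split()
    if len(words) == 1:
        table[words[0]]["gt:unigram"] = prob
    else:
        context = " ".join(words[:-1])            # context = row key
        table[context]["gt:" + words[-1]] = prob  # following word as column

for ngram, prob in [("a", 0.11), ("a big", 0.67), ("a big house", 1.0)]:
    collect(ngram, prob)

print(table["a big"])   # -> {'gt:house': 1.0}
```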
Since there may be many columns that appear only once and share the same value,
as is typical for higher order ngrams, a possible compression is to combine these
columns together, reducing the number of column splits.
3.5.4 Half Ngram Based Structure
The previous two structures give us either a large number of rows or a large
number of columns, so there is a possible trade-off between the two. We can combine
the word based and context based structures, balancing the number of rows and
columns. Our method is to split each ngram in half, using the first n/2 gram as the
row key and the remaining n/2 gram as the column label. For example, for a
4-gram model (w1,w2,w3,w4), the row key is (w1 w2) and the column is <gt:w3 w4>.
An example for a 4-gram model is shown in Table 3.4:
key column family:label
gt:unigram gt:a gt:big gt:house gt:big house gt:new house ..
a 0.11 0.67 0.67
a big 1.0 0.01
buy 0.11 0.67
buy a 1.0 0.04
i 0.04
... ... ... ... ... ... ... ...
Table 3.4: Half ngram based table structure
For higher order ngrams this structure removes many rows and inserts them as
columns. Theoretically the costs of splitting an ngram into word-context and context-
word are identical, but the n/2 gram-n/2 gram split requires a bit more parsing. Similar
to the previous methods, the process can be written as:

//k is the ngram, c is the probability
for each ngram (k, c) {
    half = int(length/2);
    if k.length == 1 {
        context = k;
        column = "gt:unigram";
    } else {
        context = k[0..half-1];
        word = k[half..length-1];
        column = "gt:" + word;
    }
    output.collect(context, (column, c));
}
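The split itself can be checked with a small Python helper (a sketch of the rule above, not the thesis implementation):

```python
# The first half of the ngram becomes the row key, the remainder the
# column label; unigrams keep the dedicated gt:unigram column.
def half_split(ngram):
    words = ngram.split()
    if len(words) == 1:
        return words[0], "gt:unigram"
    half = len(words) // 2
    return " ".join(words[:half]), "gt:" + " ".join(words[half:])

print(half_split("w1 w2 w3 w4"))   # -> ('w1 w2', 'gt:w3 w4')
print(half_split("a big house"))   # -> ('a', 'gt:big house')
```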
3.5.5 Integer Based Structure
Instead of storing strings, we can convert all the words into integers and store
the integers in the table. Extra steps are needed to convert each unigram into a
unique integer and to keep the unigram-to-integer mapping in the distributed
filesystem. The advantage of using integers is that they are smaller than strings,
since a long string is replaced by a single integer. On the other hand we need an
extra encoding/decoding process for the conversion, which costs more computational
time. This method is therefore a trade-off between computational time and storage
size. This structure can also be combined with the previous methods for better
compression.
For the simple ngram-per-row scheme, the integer based structure is illustrated
in Table 3.5:
unigram   integer
a         1
big       2
house     3
buy       4

key     column family:label
        (gt:prob)
1       0.11
1 2     0.67
1 2 3   1.0
4       0.11
4 1     0.67
...     ...

Table 3.5: Integer based table structure
Notice that if we store "a big" as the string 12, it could collide with a single word
mapped to the number 12 in the conversion map. So we keep a space between the
numbers, just as in the ngram strings.
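The encoding can be sketched in Python; the ids match Table 3.5 ("a"=1, "big"=2, "house"=3, "buy"=4), and the space-separated join is what keeps "1 2" distinct from a word with id 12:

```python
# Sketch of the unigram-to-integer conversion with space-separated ids.
vocab = {}

def word_id(w):
    # assign the next free id on first sight, as the convert job would
    if w not in vocab:
        vocab[w] = len(vocab) + 1
    return vocab[w]

def encode(ngram):
    # the space avoids the "a big" -> "12" ambiguity noted above
    return " ".join(str(word_id(w)) for w in ngram.split())

for word in ["a", "big", "house", "buy"]:
    word_id(word)

print(encode("buy a"))   # -> 4 1
```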
We need an extra MapReduce job to reprocess the raw counts, convert the unigrams
into integers, and store them in a table:
//extra step to convert each unigram into a unique integer
//input key is the ngram, value is the raw count
int id;
column = "convert:integer";
map(Text k, IntWritable v,
    OutputCollector<Text, MapWritable> output) {
    if (k.length == 1) {
        id++;
        output.collect(k, (column, id));
    }
}
...
Then we can query the convert table to store integers instead of strings:
column=’’convert:integer’’;
//k is the ngram, c is the probability
for each ngram (k,c) {
for each k[i] in k
//query the convert table or file,
//find the integer for each k[i]
intK[i]=get(k[i], column);
column="gt:prob";
row = combineInteger(intK);
output.collect(context, (column, c) );
}
3.6 Direct Query
The next process is the testing function. The actual back-off is performed here in
the query: based on the back-off model, for each testing ngram we query the ngram,
and if it is not found we fall back to the n-1 gram, until we reach the unigram. For the
different table structures we only need to generate different row and column names.
The advantage of using MapReduce for testing is that we can put multiple testing
texts into HDFS and let one MapReduce job process all of them to generate the raw
counts, just as in the training process; then for each ngram with its count we directly
estimate the probability using the back-off model and multiply by the count. This way
each distinct ngram is processed only once, which speeds up the whole process,
especially for lower order ngrams.
We call this method Direct Query because we query each ngram directly from the
Hbase table, so the more testing ngrams we have, the more time it costs. The
perplexity of the estimation is also computed and collected as an evaluation value for
the language model.
An example using the simple ngram based table structure is:
//the job for raw counts
//input key is the docid, value is one line of text
map(LongWritable docid, Text line,
    OutputCollector<Text, IntWritable> output) {
    words[] = line.split(blank space or punctuation);
    for i = 1..order
        for k = 0..(words.length - i)
            output.collect(words[k..(k+i-1)], 1);
}

reduce(Text key, Iterator<IntWritable> values,
       OutputCollector<Text, IntWritable> output) {
    ...
}
//the job for estimating probability
//the input key is the ngram, value is the raw counts
column = "gt:prob";
map(Text k, IntWritable v,
    OutputCollector<Text, FloatWritable> output) {
    //a back-off calculation
    row = k.toString();
    alpha = 1.0;
    finished = false;
    while (finished == false) {
        value = table.get(row, column);
        if (value != null) {
            //found the probability
            prob = alpha * value;
            finished = true;
        } else {
            if (row.length == 1) {
                //unseen unigram
                prob = unseen_prob;
                finished = true;
            } else {
                //back off to the n-1 gram
                ngram = row;
                row = ngram[0..length-2];
                alpha = alpha * 0.4;
            }
        }
    }
    //now we have the probability in prob
    count += v;
    if (prob > 1.0)
        prob = 1.0;
    H += v * log(prob) / log(2);
    output.collect(k, prob);
}

//compute the total perplexity
void close() {
    perplexity = pow(2.0, -H / count);
    ...
}
Figure 3.8 shows an example of this direct query process. If the estimated probability
is above 1.0 we adjust it down to 1.0. The map function accumulates the total counts
and computes the total perplexity in the overridden close method. As a reference, the
probability for each ngram is collected in HDFS as the output of the final reduce job.
Figure 3.8: Direct Query
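The back-off loop above can be simulated in a few lines of Python; a dict stands in for the Hbase table, and the probabilities, the 0.4 back-off factor and the unseen-unigram floor are illustrative:

```python
# Toy simulation of the direct back-off query.
table = {"a": 0.11, "buy": 0.11, "buy a": 0.67}
UNSEEN_PROB = 1e-6   # assumed probability floor for unseen unigrams

def backoff_prob(ngram):
    words = ngram.split()
    alpha = 1.0
    while True:
        prob = table.get(" ".join(words))
        if prob is not None:
            return min(alpha * prob, 1.0)   # clamp to 1.0 as in the text
        if len(words) == 1:
            return UNSEEN_PROB              # unseen unigram
        words = words[:-1]                  # back off, as in the pseudocode
        alpha *= 0.4

print(backoff_prob("buy a"))         # -> 0.67
print(backoff_prob("buy a house"))   # backs off once: 0.4 * 0.67
```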
3.7 Caching Query
There is a little more we can do to speed up the query process. Consider queries
for different ngrams that share the same n-1 gram: in a back-off model we query the
ngram first, and if it is not found we move on to the n-1 gram. If we need to back off
for each ngram, the same n-1 gram will be requested multiple times. This is where
caching steps in. For each back-off step we store the n-1 gram probability in a
HashMap held in memory on the working node. Every time the node issues a new n-1
gram back-off query, it first looks in the HashMap; only if the entry is missing does it
ask the Hbase table and add the new n-1 gram to the HashMap.
The simplified scheme can be written as:
...
//k is the ngram
HashMap cache;
while not finished {
    prob = table.get(k, column);
    if prob != null
        //found the probability, finished
    else {
        let row be the n-1 gram from k
        if cache.exists(row)
            prob = cache.get(row); //found in cache, finished
        else {
            prob = table.get(row, column);
            if prob != null {
                //found the probability, finished
                if number of cache.keys < maxlimit
                    cache.add(row, prob);
                else {
                    cache.clear();
                    cache.add(row, prob);
                }
            }
        }
    }
}
...
Figure 3.9 shows an example of this caching query process. We do not need to cache
probabilities for the full ngrams, only the n-1 grams. There is also a maximum limit
on the number of keys in the HashMap: we cannot store all the n-1 grams, otherwise
the map would grow huge and eat up the working node's memory. So we store up to
maxlimit n-1 grams, and when the count exceeds the limit the previous HashMap is
dropped and refilled with new items. This acts as an updating process; an alternative
would be to evict the oldest key from the HashMap and push in the new one.
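The bounded-cache behaviour can be sketched as follows (Python stand-ins for the Hbase table and the HashMap; sizes and contents are illustrative):

```python
# Check the cache first, fall back to the simulated Hbase table, and drop
# the whole map once it holds maxlimit entries, as described above.
hbase = {"buy": 0.11, "a": 0.11}   # stand-in for the Hbase table
cache = {}
MAXLIMIT = 1000
table_gets = 0                      # counts simulated Hbase round trips

def cached_get(row):
    global table_gets
    if row in cache:                # cache hit: no Hbase request
        return cache[row]
    table_gets += 1
    prob = hbase.get(row)
    if prob is not None:
        if len(cache) >= MAXLIMIT:  # over the limit: drop the whole map
            cache.clear()
        cache[row] = prob
    return prob

cached_get("buy")
cached_get("buy")                   # second lookup is served from cache
print(table_gets)                   # -> 1
```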
The methods above establish the whole process of distributed language model training
and testing. The next chapter describes the evaluation methods we use.
Figure 3.9: Caching Query
Chapter 4
Evaluation Methods
Our major interest is to explore the computational and storage cost of building a
distributed language model using the Hadoop MapReduce framework and the Hbase
distributed database. The evaluation focuses on comparing time and space across the
different table structures. As a language model, the perplexity on a testing set is also
evaluated and compared with traditional language modeling tools.
4.1 Time and Space for Building LM
There are two major processes to evaluate: training and testing. Because we are
experimenting with different table structures, we split the training process into two
steps: the first generates the raw counts and collects the Good-Turing smoothing
parameters, the second generates the table. Since the first step is identical for all
tables, we can focus on the second step for comparison.
The comparison of time cost is based on the average running time over multiple runs,
to smooth out deviations. Network latency and other disturbances may vary the result,
but the error should remain within one to two minutes. The program measures its
running time by comparing the system time before and after the MapReduce job is
submitted and executed. Each table structure is compared at the same ngram order,
and different ngram orders are compared within one table to see the relationship
between the order and the time cost for each method.
To collect the size of the language model we can use the command-line programs
provided with the Hadoop framework. Because an Hbase table is actually stored in
the HDFS filesystem, we can directly calculate the size of the directory created by
Hbase. Typically, Hbase creates a directory called "/hbase" in HDFS with a
sub-directory named after the table; e.g. if we create a table named "trigram", we can
take the size of the directory "/hbase/trigram" as the size of the table. This is not a
perfectly accurate estimate of the table data because the directory also contains other
meta info files, but since these files are usually quite small we can count them
together. Put another way, the table has two parts, the data and the info, and we
calculate these two parts together.
4.2 LM Perplexity Comparison
For a language model, the perplexity on a testing set is a common evaluation of how
good the model is. The perplexity can be read as the average number of choices the
model faces for each ngram. Generally speaking, the better the model, the lower the
perplexity. The ngram order also affects the perplexity of the same model: for a
normal-sized training set, a higher order ngram model will get lower perplexity.
Likewise, the more training data we have, the better the model, and the lower the
perplexity.
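As a worked example of the formula used in the close() method earlier (PP = 2^(-H/N) with H the sum of count-weighted log2 probabilities), with made-up values chosen so the arithmetic is exact:

```python
# Perplexity over four ngram tokens with illustrative probabilities.
import math

ngram_probs = [(0.25, 2), (0.5, 1), (0.125, 1)]    # (probability, count)
H = sum(c * math.log2(p) for p, c in ngram_probs)  # 2*(-2) + (-1) + (-3) = -8
N = sum(c for _, c in ngram_probs)                 # 4 ngram tokens
perplexity = 2.0 ** (-H / N)

print(perplexity)   # -> 4.0
```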
We can also compare the distributed language model with a traditional language
modeling toolkit such as SRILM [6]. SRILM is a feature-rich package for building
and evaluating language models. It is written in C++ and runs locally on a single
computer, so its processing speed is fast; its shortcoming is that it eats up memory
and may even overflow when processing huge amounts of data. Still, we can compare
SRILM with our distributed language model on the same training set, which is not so
huge. The smoothing methods need to be nearly the same, e.g. Good-Turing
smoothing. The specific parameters may vary, but a similar smoothing method can
show whether the distributed language model is stable compared with a traditional
language modeling package.
Applying these evaluation methods, the experiments and results are illustrated in
the next chapter.
Chapter 5
Experiments
The experiments were done in a cluster environment with 2 working nodes and 1
master server. The working nodes run the Hadoop, HDFS and Hbase slaves, and the
master server controls all of them. Each experiment is repeated three times and the
average value is taken as the result. The results are shown as figures and tables for
different ngram orders over all the table structures: the time and space cost, and the
perplexity on the testing set.
We first compare the size and the number of unique ngrams for training and testing
data taken from a 100 million-word corpus. The time and space cost of generating
raw counts, of generating the table structures, and finally of the testing queries are
compared separately. We also compare the unigram, bigram and trigram models at
each step. All 5 proposed table structures are compared on time cost for both the
training and testing processes, as well as on table size. Finally the perplexity for each
ngram order is compared with SRILM, and the model data size from SRILM is
calculated as a reference.
5.1 Data
The data is taken from the British National Corpus, which contains around 100
million words. We choose some random texts from the corpus as the testing set,
about 35,000 words; all the remaining texts are used as the training set.
The corpus is formatted as one sentence per line. For completeness we include
all the punctuation in the sentence as part of the ngram, e.g. <the house's door> is
parsed as <the> <house> <'s> <door>.
The approximate token counts and file sizes for the training and testing data are:
            training data   testing data
tokens      110 million     40,000
data size   541 MB          202 KB

Table 5.1: Data
Notice that the tokens include both words and punctuation, slightly increasing the
total number of ngrams.
5.2 Ngram Order
We train up to a trigram model in the training process. A count pruning threshold
of 1 is applied for all models. Table 5.2 shows the number of unique ngrams at each
order for the training and testing data sets, and Figure 5.1 plots these numbers. The
training data shows a sharper line; considering the different numbers of tokens, this
indicates that a larger data set produces more trigrams, i.e. more variety.
           training data   testing data
tokens     110 million     40,000
unigram    284,921         7,303
bigram     4,321,467       25,862
trigram    9,090,713       33,120

Table 5.2: Ngram numbers for different orders
For all the table structures, steps 1 and 2, raw counting and Good-Turing parameter
estimation, are identical, so we can evaluate this part first, independently of the
chosen table type. The time and space costs are compared in Tables 5.3 and 5.4. For
the testing process only the raw counting step is required. All the raw counting
outputs are stored in HDFS as compressed binary files.
Figure 5.1: Unique ngram number

Figure 5.2: Time cost in raw counting

          training data                         testing data
          raw counting & parameter estimating   raw counting
tokens    110 million                           40,000
unigram   15 min 38 sec                         19 sec
bigram    41 min 6 sec                          20 sec
trigram   75 min 41 sec                         22 sec

Table 5.3: Time cost in raw counting

The trend line is shown in Figure 5.2; there is a big difference between the two
lines. As we can see from Table 5.3, when the token counts are relatively small the
processing times are nearly the same, but as the tokens become large the difference
between the unigram, bigram and trigram models increases greatly. Another aspect
affecting the time cost is that during the training process a second MapReduce job is
needed to estimate the count of counts for Good-Turing smoothing. For this job only
one reduce task is launched on the working nodes, producing a single output file. As
the ngram order increases, the input records of this job also become larger, requiring
more processing time.
          training data                         testing data
          raw counting & parameter estimating   raw counting
tokens    110 million                           40,000
unigram   1.5 MB                                35 KB
bigram    22 MB                                 141 KB
trigram   49 MB                                 234 KB

Table 5.4: Space cost in raw counting
Table 5.4 shows the size of the raw counts file for each ngram order. The size
increases rapidly, implying that the size of the table will also increase sharply. Notice
that Table 5.4 lists each ngram order separately, but the table must store all orders
from unigram and bigram up to ngram, so the table size is their total sum, further
increasing the number. Figure 5.3 suggests that the training and testing data follow a
similar upward trend, meaning the space cost depends less on the number of tokens
and increases monotonically with the ngram order.
Figure 5.3: Space cost in raw counting
5.3 Table Structures
For each of the 5 table structures we generate tables for the unigram, bigram and
trigram models. Tables 5.5 and 5.6 show the time and space cost of the training
process for each order in generating the table (step 3); Tables 5.7 and 5.8 show the
time cost of the testing process in the querying step with the two different methods.
                          unigram        bigram         trigram
type 1: ngram based       5 min 40 sec   62 min 21 sec  300 min 26 sec
type 2: word based        5 min 41 sec   64 min 42 sec  200 min 30 sec
type 3: context based     6 min 40 sec   64 min 39 sec  185 min 20 sec
type 4: half ngram based  6 min 7 sec    60 min 37 sec  190 min 40 sec
type 5: integer based     16 min 26 sec  >240 min       >10 hours

Table 5.5: Time cost in training step 3, generating table
The size of each table structure is calculated from the size of its directory in HDFS,
including the data and the table meta info. For type 5, the extra step of converting
ngrams into integers costs about 5 minutes; this is added to the time cost for type 5.
The size of the convert table is around 2.4 MB, which is likewise added to the space
cost for type 5. Taking into account the errors caused by network latency,
transmission failures and I/O delays, the time cost may vary within a range of ±1-2
minutes. Figures 5.4 and 5.5 compare the time and space cost for these five table
types.

                          unigram    bigram    trigram
type 1: ngram based       2.4 MB     44 MB     173 MB
type 2: word based        2.4 MB     60 MB     185 MB
type 3: context based     2.4 MB     44 MB     135 MB
type 4: half ngram based  2.4 MB     44 MB     129 MB
type 5: integer based     4.1 MB     ~40 MB    ~120 MB

Table 5.6: Space cost of different table structures
All the tables are created with the same compression option in Hbase. As Figures
5.4 and 5.5 show, type 5 takes much longer than the other table types, but its space
cost is lower. For the unigram models the first 4 tables behave almost identically,
which is expected since for unigrams these structures are the same. For the bigram
models the training time stays almost the same except for type 5; the space cost
keeps decreasing through types 3-5, while type 2 slightly increases it. The trend is
more obvious for the trigram models: types 2-4 all reduce the training time, and
while type 2 slightly increases the space cost, types 3-5 reduce it greatly.
Figure 5.4: Time cost in training step 3, generating table
Figure 5.5: Space cost of different table structures
Another comparison shows how the time and space cost develop as the ngram order
grows. Figures 5.6 and 5.7 illustrate the trend lines for time and space cost at
different ngram orders. Because type 5 needs an extra MapReduce job and uses two
tables, its time cost is much higher than the others. Type 1 costs a bit more time for
the trigram model compared with types 2-4. The lowest time is from type 3, about 5
minutes less than type 4, roughly 3% smaller. Meanwhile, the space cost shows a
different trend: for the trigram model types 1 and 2 climb to higher numbers while
the others stay smaller, and the lowest is type 4, about 4.4% smaller than type 3.
Figure 5.6: Time cost for each ngram order
The next step is the probability query for the testing data. The time costs for both
direct query and caching query are given in Tables 5.7 and 5.8. Type 2, the word
based structure, costs somewhat more than types 1, 3 and 4. As in the training
process, type 5 is still very time-consuming. Type 1 shows a good result this time,
unlike its performance in the training process. Again we must tolerate an error range
of ±1-2 minutes because of possible interruptions such as network problems.
                          unigram        bigram         trigram
type 1: ngram based       5 min 5 sec    30 min 20 sec  40 min 2 sec
type 2: word based        11 min 14 sec  48 min 28 sec  88 min 17 sec
type 3: context based     6 min 31 sec   26 min 32 sec  40 min 15 sec
type 4: half ngram based  6 min 33 sec   26 min 4 sec   42 min 4 sec
type 5: integer based     10 min 39 sec  >50 min        >1.5 hours

Table 5.7: Time cost in testing, direct query
Figure 5.8 shows that the differences between types 1, 3 and 4 are relatively small,
considering the deviations that might affect the values, and Figure 5.9 shows a
similar trend. Comparing direct query and caching query, we can see that caching
does not improve the time cost as much as we expected; for the bigram and trigram
models it is even slower than direct query. Still, the comparison between table types
resembles Figure 5.8: types 3 and 4 perform best, type 1 is good, while types 2 and
5 fall behind.
                          unigram        bigram         trigram
type 1: ngram based       5 min 10 sec   24 min 17 sec  33 min 41 sec
type 2: word based        16 min 58 sec  49 min 10 sec  87 min 8 sec
type 3: context based     6 min 36 sec   31 min 13 sec  52 min 56 sec
type 4: half ngram based  6 min 31 sec   29 min 45 sec  51 min 13 sec
type 5: integer based     11 min 2 sec   >50 min        >1.5 hours

Table 5.8: Time cost in testing, caching query

Figure 5.8: Time cost in direct query

Figure 5.9: Time cost in caching query

The perplexity of the testing process compared with SRILM is shown in Table 5.9.
The SRILM package uses Good-Turing smoothing by default, and the perplexity is
computed with default parameters for the unigram, bigram and trigram models. The
perplexity from the SRILM tools is higher than that of our model, especially at
lower ngram orders. Another figure from SRILM is the size of the language model
data, which is 683 MB by default, uncompressed. Since we use compression in
Hbase, we compress the SRILM file with gzip; the compressed size is only 161 MB,
a 76.5% reduction. We cannot compare the uncompressed number directly, because
SRILM also stores other information in that plain text file, which inflates the
original size, but the compressed file can serve as a reference under a similar
situation. The table sizes of types 3, 4 and 5 are still smaller.
              unigram    bigram    trigram
Hadoop+Hbase  2544.732   599.171   517.3108
SRILM         3564.03    742.269   573.563

Table 5.9: Perplexity for testing data
From our experiments we can see some efficiency differences between the table
structures and ngram orders. The next chapter discusses the proper choice of table
structure and future work based on our results.
Chapter 6
Discussion
6.1 Conclusion
From our experiments, we conclude that the half ngram based table structure is a
good choice for a distributed language model using Hadoop and Hbase. Considering
both time and space cost, the ngram based structure is too simple to take advantage
of the distributed features, although this flat structure is fast and easy to manipulate
locally and is widely used in traditional language modeling tools such as SRILM.
The word based and context based structures are essentially a transposed table pair:
if we have r words with c contexts, the word based structure is a table of r rows with
at most c columns, and the context based structure is a table of c rows with at most r
columns. The question is which fits better for Hbase, which is sparsely distributed
and column oriented. In theory the parsing cost of both types is the same: we emit
the context with all its following words as intermediate keys, so no extra parsing is
needed, and we only have to compare whether more rows or more columns is
cheaper. From the experiments we find that the context based structure is better,
meaning that more rows are less expensive than more columns. Fewer columns
mean fewer column splits, and since Hbase stores each column split separately, this
reduces the I/O requests. This may explain why the word based structure performs
poorly in the testing process. Another aspect is the column label names. For the
word based structure the labels are all the possible contexts from unigram and
bigram up to the n-1 gram, while the context based structure only has the unigrams
as column labels. Again because Hbase is column oriented, the table must maintain
an index of all the column labels, and simpler label names are better, especially
when we move to higher ngram orders such as 5-grams.
The half ngram based structure is a trade-off between the word and context based
structures: it moves some of the columns of the word based structure into new rows,
so we have fewer columns than the word based structure and fewer rows than the
context based structure. For lower ngram orders this trade-off has no significant
advantage; compared with the word and context based structures it costs a little
more time for the extra string parsing, i.e. reconstructing the intermediate keys into
half ngrams. But it benefits from the balanced table structure in the I/O cost of
writing the table. Its time cost is not the lowest, but as we expand the cluster to
more machines the time differences will shrink, while the more compact table
structure will hold more data and give us a better model.
The integer based structure turns out to be much less efficient than we expected.
The idea of converting strings into integers comes from Google's method of storing
unigram counts locally on every working node. Here we instead store the mapping
in an Hbase table, with scalability to huge data sets in mind, but the experiments
show that the cost of manipulating two Hbase tables is too high even for trigram
models. Still, we can obtain the approximate table size of this structure, and it does
show the decrease we expected.
Looking at the trends, we expect the half ngram structure to perform even better for
higher ngram orders and more training data; the context structure is also a possible
choice. There are, however, many factors that may affect the training time, e.g.
network connection problems, hard disk failures and CPU load. The figures in our
experiments are averages over several runs, and the training time differs from run to
run; generally speaking we cannot avoid deviations when measuring the time cost,
but we can see the trends as the ngram order increases. For the space cost, the table
size is identical in every run, which means the model is stable, so we can rely more
on it for evaluation. The table size of the half ngram based structure is also smaller
than the SRILM model data size. From this point of view, the half ngram structure
is a proper choice on both time and space cost.
The caching query does not show big improvements in time cost. This implies that
keeping a cache in memory and checking both the memory and the Hbase table is
not efficient in a MapReduce job. Because each working node only receives part of
the whole data, storing previous data locally cannot guarantee the information will
be useful for future data, so the node actually wastes time searching the local
HashMap, finds nothing, and still has to look up the Hbase table.
For the perplexity comparison, our model shows a better result than SRILM. This is
related to implementation differences in the algorithm, e.g. the treatment of unseen
unigrams and the pruning scheme. SRILM computes the probability of unseen
unigrams in a more complex way, adding 1 to the total count and redistributing the
probability over all existing unigrams; this would require more MapReduce steps,
and we think it would be too expensive to add another MapReduce job to update the
existing probabilities, so we choose the simpler scheme of adding 1 to the total
without redistributing. This works well when there are few unseen unigrams, but
performs poorly when the testing data contains many unseen words.
6.2 Future Work
There are a lot of works that can be done in the future. The Hadoop and Hbase
release newer version with lots of bug fixes and enhancements. The first thing we can
do is to upgrade our program to newer frameworks to see if there are any changes or
improvements. There are lots of aspects that maybe influence the final result, and some
of them may be less relevant but some may be quite important. We can try to adjust
some configurations in Hadoop and Hbase for a better performance, e.g. the maximum
number of map and reduce tasks, the buffer size, the data compression type etc. Also
the experiments can be extended to larger cluster with more machines, increasing the
computational power to reduce the time cost.
It would be very interesting to compare the performance for higher ngram orders and more data, e.g. a 5-gram model trained on billions of words. Further compression or encoding could also be added to these table structures. In the word based and context based structures, many rows have only one column and share the same column value, so rows with the same column value could be combined into a single new row. For the integer based structure, storing the mapping file locally on each working node may be better than storing it in an Hbase table.
Because of the different implementation of the Good-Turing algorithm, our model shows better probabilities, i.e. lower perplexity, than the SRILM tools; still, we could implement more advanced smoothing methods such as Kneser-Ney using Hadoop and Hbase for comparison. More MapReduce tasks would be needed to compute the parameters, and these parameters would have to be stored either in HDFS or in the Hbase table along with the ngram probabilities, so the table structures may need improvements, which leaves a lot of potential research.
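As a reference point for what those extra parameters look like, here is a minimal bigram sketch of interpolated Kneser-Ney with invented counts. The discount D, the continuation counts and the per-context back-off weight are exactly the kinds of quantities that would need their own MapReduce passes and table columns:

```java
// Minimal sketch of interpolated Kneser-Ney for bigrams (not thesis code;
// all counts are invented for illustration).
public class KneserNeyBigram {
    // P_cont(w) = (distinct contexts w follows) / (distinct bigram types)
    static float continuation(int distinctLeftContexts, int bigramTypes) {
        return (float) distinctLeftContexts / bigramTypes;
    }

    // P_KN(w | v) = max(c(v,w) - D, 0) / c(v) + lambda(v) * P_cont(w)
    // where lambda(v) = D * (distinct words following v) / c(v)
    static float prob(int bigramCount, int contextCount,
                      int distinctFollowers, float d, float pCont) {
        float discounted = Math.max(bigramCount - d, 0f) / contextCount;
        float lambda = d * distinctFollowers / contextCount;
        return discounted + lambda * pCont;
    }

    public static void main(String[] args) {
        // invented counts: c("san","francisco") = 8, c("san") = 10,
        // 2 distinct words follow "san", and "francisco" ends 1 of 20 bigram types
        float pCont = continuation(1, 20);
        System.out.println(prob(8, 10, 2, 0.75f, pCont));
    }
}
```

Each of the three inputs (bigram counts, context totals, continuation counts) corresponds to a separate aggregation over the corpus, which is why extra MapReduce jobs and extra table columns would be needed.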
The output of this work can be applied in machine translation and other statistical language processing fields, providing an open source, distributed extension for this research and powering large scale language models.
Bibliography
[1] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867, 2007.
[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI '06, pages 205–218, 2006.
[3] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, pages 137–150, 2004.
[4] A. Emami, K. Papineni, and J. Sorensen. Large-scale distributed language modeling. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), volume 4, pages IV-37–IV-40, April 2007.
[5] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), volume 1, pages 181–184, May 1995.
[6] A. Stolcke. SRILM - an extensible language modeling toolkit. In 7th International Conference on Spoken Language Processing, pages 901–904, September 2002.
[7] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 254–261, New York, NY, USA, 1999. ACM.
[8] Y. Zhang, A. S. Hildebrand, and S. Vogel. Distributed language modeling for n-best list re-ranking. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 216–223, Sydney, Australia, July 2006. Association for Computational Linguistics.
Appendix A
Source Code
A.1 NgramCount.java
NgramCount is the class that parses the input texts and generates the ngram counts and counts of counts.
package ngram;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Parse the text files to count all the ngrams and
 * estimate count of counts for Good-Turing smoothing.
 *
 * @author Xiaoyang Yu
 */
public class NgramCount extends Configured implements Tool {

  protected final static String GTPARA = "gt-parameter";
  protected final static String RAWCOUNTDIR = "raw-counts";

  // use the counter to collect the total number of unigrams
  protected static enum MyCounter {
    INPUT_WORDS
  };

  /**
   * The map class to parse each line into ngrams.
   */
  public static class countMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private int order;
    private String words;
    private final static IntWritable one = new IntWritable(1);

    // retrieve the ngram order from the job configuration
    public void configure(JobConf job) {
      order = job.getInt("order", 3);
    }

    // emit all the unigrams, bigrams .. up to ngrams
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      if (line == null || line.trim().length() == 0)
        return;
      Pattern myPattern = Pattern.compile("\\w+|\\p{Punct}+");
      Matcher myMatcher = myPattern.matcher(line);
      StringBuffer sb = new StringBuffer();
      while (myMatcher.find()) {
        sb.append(myMatcher.group()).append("\t");
      }
      line = sb.toString();
      String[] current = line.split("\\s+");
      int i = 0;
      if (order > 1)
        current = ("<s> " + line).split("\\s+");
      words = "";
      for (i = 1; i <= order; i++) {
        for (int k = 0; k <= current.length - i; k++) {
          words = current[k];
          if (words.length() != 0) {
            for (int j = k + 1; j < k + i; j++)
              words = words + " " + current[j];
            output.collect(new Text(words), one);
          }
        }
      }
    }
  }

  /**
   * A combiner class that just emits the sum of the input values.
   */
  public static class countCombine extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  /**
   * A reducer class that just emits the sum of the input values,
   * also does the counts pruning and collects the total number
   * of unigrams.
   */
  public static class countReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    private int prune;

    public void configure(JobConf job) {
      prune = job.getInt("prune", 0);
    }

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      if (sum > prune) {
        output.collect(key, new IntWritable(sum));
        /*
         * calculate the total number, only count unigrams, ignore <s>
         */
        String[] current = key.toString().split("\\s+");
        if (current.length == 1 && !current[0].equals("<s>"))
          reporter.incrCounter(MyCounter.INPUT_WORDS, sum);
      }
    }
  }

  /**
   * A partitioner class to balance the load based on the first
   * two words.
   */
  public static class countPartition extends MapReduceBase
      implements Partitioner<Text, IntWritable> {

    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String[] line = key.toString().split(" ");
      String prefix = (line.length > 1) ? (line[0] + line[1]) : line[0];
      // mask the sign bit so a negative hashCode cannot yield a negative partition
      return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /**
   * The second task is to count the count-of-counts for
   * Good-Turing smoothing.
   *
   * A map class to parse the raw counts and emit count
   * of counts with the ngram order.
   */
  public static class MapB extends MapReduceBase
      implements Mapper<Text, IntWritable, IntWritable, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    public void map(Text key, IntWritable value,
        OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
        throws IOException {
      if (!key.toString().equals("<s>")) {
        String[] result = key.toString().split("\\s+");
        String combine = result.length + value.toString();
        output.collect(new IntWritable(Integer.parseInt(combine)), one);
      }
    }
  }

  /**
   * A reduce class to emit the sums of the input values.
   */
  public static class ReduceB extends MapReduceBase
      implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    public void reduce(IntWritable key, Iterator<IntWritable> values,
        OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  static int printUsage() {
    System.out.println("usage: [OPTIONS] <input dir>"
        + "\nOptions include:");
    System.out
        .println("-order <order>:\tthe max ngram order\n\tdefault: 3\n"
            + "-prune <prune>:\tthe pruning counts\n\tdefault: 0\n");
    return -1;
  }

  public int run(String[] args) throws Exception {
    int order = 3;
    int prune = 0;
    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-order".equals(args[i])) {
          order = Integer.parseInt(args[++i]);
          System.out.println("the order is set to: " + args[i]);
        } else if ("-prune".equals(args[i])) {
          prune = Integer.parseInt(args[++i]);
          System.out.println("use count pruning: " + args[i]);
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from "
            + args[i - 1]);
        return printUsage();
      }
    }
    // Make sure there is exactly one parameter left.
    if (other_args.size() != 1) {
      System.out.println("ERROR: Wrong number of parameters: "
          + args.length + " instead of 1.");
      return printUsage();
    }

    JobConf jobA = new JobConf(getConf(), NgramCount.class);
    jobA.setJobName("ngramcount");
    // the keys are words (strings)
    jobA.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    jobA.setOutputValueClass(IntWritable.class);
    jobA.setMapperClass(countMap.class);
    jobA.setCombinerClass(countCombine.class);
    jobA.setPartitionerClass(countPartition.class);
    jobA.setReducerClass(countReduce.class);
    jobA.setInt("order", order);
    jobA.setInt("prune", prune);
    jobA.setInputPath(new Path(other_args.get(0)));
    jobA.setOutputFormat(SequenceFileOutputFormat.class);
    jobA.setOutputPath(new Path(RAWCOUNTDIR));
    if (FileSystem.get(jobA).exists(new Path(RAWCOUNTDIR)))
      FileSystem.get(jobA).delete(new Path(RAWCOUNTDIR));
    SequenceFileOutputFormat.setCompressOutput(jobA, true);
    SequenceFileOutputFormat.setOutputCompressionType(jobA,
        CompressionType.BLOCK);
    /*
     * record the running time
     */
    long t1 = System.currentTimeMillis();
    RunningJob rj = JobClient.runJob(jobA);
    Counters cs = rj.getCounters();
    try {
      Path outFile = new Path("num-of-unigrams");
      if (FileSystem.get(jobA).exists(outFile))
        FileSystem.get(jobA).delete(outFile);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          FileSystem.get(jobA), jobA, outFile, Text.class,
          LongWritable.class);
      System.out.println("total is: "
          + cs.getCounter(MyCounter.INPUT_WORDS));
      writer.append(new Text("total"),
          new LongWritable(cs.getCounter(MyCounter.INPUT_WORDS)));
      writer.close();
    } catch (IOException e) {
      System.out.println("Can't write file: num-of-unigrams"
          + " for the ngram model. Exit.");
      return -1;
    }
    System.out.println("Job 1 finished. Now starting job 2");
    /*
     * the second task is to count the count-of-counts
     * for Good-Turing smoothing
     */
    JobConf jobB = new JobConf(getConf(), NgramCount.class);
    jobB.setInt("order", order);
    jobB.setInputFormat(SequenceFileInputFormat.class);
    jobB.setOutputFormat(SequenceFileOutputFormat.class);
    jobB.setOutputKeyClass(IntWritable.class);
    jobB.setOutputValueClass(IntWritable.class);
    jobB.setMapperClass(MapB.class);
    jobB.setCombinerClass(ReduceB.class);
    jobB.setReducerClass(ReduceB.class);
    jobB.setNumReduceTasks(1);
    // input is the raw counts
    jobB.setInputPath(new Path(RAWCOUNTDIR));
    // output is count of counts
    Path tmpDir = new Path(GTPARA);
    jobB.setOutputPath(tmpDir);
    if (FileSystem.get(jobB).exists(tmpDir))
      FileSystem.get(jobB).delete(tmpDir);
    SequenceFileOutputFormat.setCompressOutput(jobB, true);
    SequenceFileOutputFormat.setOutputCompressionType(jobB,
        CompressionType.BLOCK);
    JobClient.runJob(jobB);
    System.out.println("Job 2 finished! Now you can start"
        + " the hbase table generator.");
    long t2 = System.currentTimeMillis();
    long sec = (t2 - t1) / 1000; // seconds
    System.out.println("running time is: " + sec / 60 + " minutes "
        + sec % 60 + " seconds");
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int errCode = ToolRunner.run(new Configuration(), new NgramCount(),
        args);
    System.exit(errCode);
  }
}
A.2 TableGenerator.java
TableGenerator is the class that estimates the ngram probabilities and stores them in the different Hbase table structures.
package ngram ;
import j a v a . i o . IOExcep t ion ;
import j a v a . u t i l . A r r a y L i s t ;
import j a v a . u t i l . HashMap ;
import j a v a . u t i l . I t e r a t o r ;
import j a v a . u t i l . L i s t ;
import j a v a . u t i l . Map . E n t r y ;
import org . apache . hadoop . con f . C o n f i g u r a t i o n ;
import org . apache . hadoop . con f . C o n f i g u r e d ;
import org . apache . hadoop . f s . F i l e S y s t e m ;
import org . apache . hadoop . f s . Pa th ;
import org . apache . hadoop . hbase . H B a s e C o n f i g u r a t i o n ;
import org . apache . hadoop . hbase . HTable ;
import org . apache . hadoop . hbase . i o . I m m u t a b l e B y t e s W r i t a b l e ;
import org . apache . hadoop . hbase . mapred . TableReduce ;
import org . apache . hadoop . i o . I n t W r i t a b l e ;
import org . apache . hadoop . i o . LongWr i t ab l e ;
import org . apache . hadoop . i o . MapWri table ;
import org . apache . hadoop . i o . S e q u e n c e F i l e ;
import org . apache . hadoop . i o . Text ;
import org . apache . hadoop . mapred . J o b C l i e n t ;
import org . apache . hadoop . mapred . JobConf ;
import org . apache . hadoop . mapred . MapReduceBase ;
import org . apache . hadoop . mapred . Mapper ;
Appendix A. Source Code 61
import org . apache . hadoop . mapred . O u t p u t C o l l e c t o r ;
import org . apache . hadoop . mapred . R e p o r t e r ;
import org . apache . hadoop . mapred . S e q u e n c e F i l e I n p u t F o r m a t ;
import org . apache . hadoop . u t i l . Tool ;
import org . apache . hadoop . u t i l . ToolRunner ;
/∗ ∗∗ t h e t h i r d t a s k i s t o c a l c u l a t e t h e Good−Tu r ing
∗ smoo th ing c o u n t s f o r t r a i n i n g s e t , e s t i m a t e
∗ t h e c o n d i t i o n a l p r o b a b i l i t y and s t o r e i t i n Hbase
∗ t a b l e
∗∗ @author Xiaoyang Yu
∗ /
p u b l i c c l a s s T a b l e G e n e r a t o r ex tends C o n f i g u r e d implementsTool {
p r o t e c t e d f i n a l s t a t i c S t r i n g GTPARA = ” gt−p a r a m e t e r ” ;
p r o t e c t e d f i n a l s t a t i c S t r i n g RAWCOUNTDIR = ” raw−c o u n t s
” ;
/∗ ∗∗ a map c l a s s t o read i n raw c o u n t s and c o u n t o f
∗ coun t s , compute t h e Good−Tu r ing smoo th ing c o u n t s
∗∗ /
p u b l i c s t a t i c c l a s s es t ima teMap ex tends MapReduceBase
implements Mapper<Text , I n t W r i t a b l e , Text , Text> {i n t o r d e r ;
HashMap<I n t e g e r , HashMap<I n t e g e r , I n t e g e r >> countHash
;
p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {o r d e r = j o b . g e t I n t ( ” o r d e r ” , 3 ) ;
Appendix A. Source Code 62
countHash = new HashMap<I n t e g e r , HashMap<I n t e g e r ,
I n t e g e r >>() ;
Pa th t m p F i l e = new Pa th ( new Pa th (GTPARA) , ” p a r t
−00000” ) ;
/ / open t h e c o u n t o f c o u n t s f i l e t o read a l l c o u n t
/ / o f c o u n t s i n t o hashmap
t r y {F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( j o b ) ;
S e q u e n c e F i l e . Reader r e a d e r = new S e q u e n c e F i l e .
Reader ( f i l e S y s , tmpFi l e , j o b ) ;
I n t W r i t a b l e k = new I n t W r i t a b l e ( ) , v = newI n t W r i t a b l e ( ) ;
whi le ( r e a d e r . n e x t ( k , v ) ) {I n t e g e r o r d e r = I n t e g e r . p a r s e I n t ( k . t o S t r i n g ( ) .
s u b s t r i n g ( 0 , 1 ) ) ;
I n t e g e r key = I n t e g e r . p a r s e I n t ( k . t o S t r i n g ( ) .
s u b s t r i n g ( 1 ) ) ;
i f ( countHash . c o n t a i n s K e y ( o r d e r ) ) {HashMap<I n t e g e r , I n t e g e r > tmp = countHash . g e t
( o r d e r ) ;
tmp . p u t ( key , new I n t e g e r ( v . g e t ( ) ) ) ;
countHash . p u t ( o r d e r , tmp ) ;
} e l s e {HashMap<I n t e g e r , I n t e g e r > tmp = new HashMap<
I n t e g e r , I n t e g e r >() ;
tmp . p u t ( key , new I n t e g e r ( v . g e t ( ) ) ) ;
countHash . p u t ( o r d e r , tmp ) ;
}}r e a d e r . c l o s e ( ) ;
} ca tch ( IOExcep t ion e ) {System . o u t . p r i n t l n ( ” c a n t open p a r a m e t e r f i l e ! ” ) ;
};
}
Appendix A. Source Code 63
p u b l i c vo id map ( Text key , I n t W r i t a b l e va lue ,
O u t p u t C o l l e c t o r <Text , Text> o u t p u t , R e p o r t e r r )
throws IOExcep t i on {I n t e g e r v = v a l u e . g e t ( ) ;
f l o a t c = v . f l o a t V a l u e ( ) ;
S t r i n g [ ] words = key . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) ;
I n t e g e r o r d e r = words . l e n g t h ;
i f ( o r d e r > t h i s . o r d e r )
re turn ;
/∗ ∗∗ r ∗=( r +1)∗Nr +1/ Nr t h e c o u n t o f c o u n t s Nr must be
∗ e x i s t e d , o t h e r w i s e we use t h e c l o s e s t c o u n t s
∗ i n s t e a d
∗ /
f o r ( i n t i = 1 ; i < 5 ; i ++) {i f ( countHash . g e t ( o r d e r ) . c o n t a i n s K e y ( v )
&& countHash . g e t ( o r d e r ) . c o n t a i n s K e y ( v + i ) ) {I n t e g e r Nr1 = countHash . g e t ( o r d e r ) . g e t ( v + i ) ;
I n t e g e r Nr = countHash . g e t ( o r d e r ) . g e t ( v ) ;
c = ( v . f l o a t V a l u e ( ) + 1 f ) ∗ Nr1 . f l o a t V a l u e ( )
/ Nr . f l o a t V a l u e ( ) ;
break ;
}}/∗ ∗∗ t h e G−T smoothed c o u n t s w i t h T e x t key i s c , t h e n
∗ we e m i t t h e t o k e n based on t h e l a s t n−1 gram as
∗ t h e new key
∗ /
i f ( o r d e r > 1) {S t r i n g s u f f i x = words [ o r d e r − 1 ] ;
S t r i n g word = words [ 0 ] ;
f o r ( i n t i = 1 ; i < o r d e r − 1 ; i ++) {word = word + ” ” + words [ i ] ;
Appendix A. Source Code 64
}o u t p u t . c o l l e c t ( new Text ( word ) , new Text ( ”\ t ” +
s u f f i x + ”\ t ” + F l o a t . t o S t r i n g ( c ) ) ) ;
}i f ( o r d e r == 1 | | ( o r d e r > 1 && o r d e r < t h i s . o r d e r )
)
o u t p u t . c o l l e c t ( key , new Text ( F l o a t . t o S t r i n g ( c ) ) ) ;
}}
/∗ ∗∗ a map c l a s s needed by i n t e g e r based t a b l e s t r u c t u r e ,
∗ c o n v e r t t h e un igrams i n t o u n i qu e i n t e g e r s
∗∗ /
p u b l i c s t a t i c c l a s s conver tMap ex tends MapReduceBase
implementsMapper<Text , I n t W r i t a b l e , Text , LongWri tab le > {
Long i d = 0 l ;
/∗ ∗∗ t h e i n p u t key i s ngram , v a l u e i s t h e raw c o u n t s
∗ t h e o u t p u t key i s unigram , v a l u e i s t h e u n iq ue
∗ i n t e g e r i d
∗ /
p u b l i c vo id map ( Text key , I n t W r i t a b l e va lue ,
O u t p u t C o l l e c t o r <Text , LongWri tab le > o u t p u t ,
R e p o r t e r r )
throws IOExcep t i on {S t r i n g [ ] words = key . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) ;
i f ( words . l e n g t h == 1) {o u t p u t . c o l l e c t ( key , new LongWr i t ab l e (++ i d ) ) ;
}}
}
Appendix A. Source Code 65
/∗ ∗∗ a re du c e c l a s s needed by i n t e g e r based t a b l e
∗ s t r u c t u r e , s t o r e t h e unigram− i n t e r g e r mapping i n
∗ Hbase t a b l e
∗∗ /
p u b l i c s t a t i c c l a s s c o n v e r t T a b l e ex tends TableReduce<
Text , LongWri tab le > {/ / t h e row key i s t h e unigram T e x t k , t h e column i s
c o n v e r t : i d
p u b l i c vo id r e d u c e ( Text k , I t e r a t o r <LongWri tab le > v ,
O u t p u t C o l l e c t o r <Text , MapWritable> o u t p u t ,
R e p o r t e r r )
throws IOExcep t i on {Long i d = v . n e x t ( ) . g e t ( ) ;
MapWri table mw = new MapWri table ( ) ;
Text column = new Text ( ” c o n v e r t : i d ” ) ;
mw. p u t ( column , new I m m u t a b l e B y t e s W r i t a b l e ( Long .
t o S t r i n g ( i d )
. g e t B y t e s ( ) ) ) ;
o u t p u t . c o l l e c t ( k , mw) ;
}}
/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n ngram based
∗ t a b l e
∗∗ /
p u b l i c s t a t i c c l a s s s i mp leT ab l eR edu ce ex tendsTableReduce<Text , Text> {
Long t o t a l ;
p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {
Appendix A. Source Code 66
t o t a l = j o b . ge tLong ( ” t o t a l ” , 1 ) ;
}
p u b l i c vo id r e d u c e ( Text k , I t e r a t o r <Text> v ,
O u t p u t C o l l e c t o r <Text , MapWritable> o u t p u t ,
R e p o r t e r r )
throws IOExcep t i on {F l o a t base = 1 f ;
HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g ,
F l o a t >() ;
whi le ( v . hasNext ( ) ) {S t r i n g r e c o r d = v . n e x t ( ) . t o S t r i n g ( ) ;
i f ( r e c o r d . s t a r t s W i t h ( ”\ t ” ) ) {S t r i n g [ ] tmp = r e c o r d . s p l i t ( ”\ t ” ) ;
c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ;
} e l s ebase = F l o a t . p a r s e F l o a t ( r e c o r d ) ;
}MapWri table mw = new MapWri table ( ) ;
/∗ ∗∗ w r i t e t o t h e column <g t : prob > , t h e row i s t h e
∗ ngram
∗ /
i f ( ! c o u n t s . i sEmpty ( ) ) {f o r ( Ent ry<S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t
( ) ) {F l o a t prob = e n t r y . g e t V a l u e ( ) / ba se ;
i f ( p rob > 1 f )
prob = 1 f ;
mw. c l e a r ( ) ;
mw. p u t ( new Text ( ” g t : p rob ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( prob ) .
g e t B y t e s ( ) ) ) ;
o u t p u t . c o l l e c t ( new Text ( k + ” ” + e n t r y . getKey
( ) ) , mw) ;
Appendix A. Source Code 67
}}i f ( k . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) . l e n g t h == 1) {
mw. c l e a r ( ) ;
mw. p u t ( new Text ( ” g t : p rob ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( base /
t o t a l ) . g e t B y t e s ( ) ) ) ;
o u t p u t . c o l l e c t ( k , mw) ;
}}
}
/∗ ∗∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n word based
∗ t a b l e
∗∗ /
p u b l i c s t a t i c c l a s s wordTableReduce ex tends TableReduce
<Text , Text> {Long t o t a l ;
p u b l i c vo id c o n f i g u r e ( JobConf j o b ) {t o t a l = j o b . ge tLong ( ” t o t a l ” , 1 ) ;
}
/∗ ∗∗ k i s t h e n−1 gram c o n t e x t
∗ /
p u b l i c vo id r e d u c e ( Text k , I t e r a t o r <Text> v ,
O u t p u t C o l l e c t o r <Text , MapWritable> o u t p u t ,
R e p o r t e r r )
throws IOExcep t i on {F l o a t base = 1 f ;
HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g ,
F l o a t >() ;
Appendix A. Source Code 68
whi le ( v . hasNext ( ) ) {S t r i n g r e c o r d = v . n e x t ( ) . t o S t r i n g ( ) ;
i f ( r e c o r d . s t a r t s W i t h ( ”\ t ” ) ) {S t r i n g [ ] tmp = r e c o r d . s p l i t ( ”\ t ” ) ;
c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ;
} e l s ebase = F l o a t . p a r s e F l o a t ( r e c o r d ) ;
}MapWri table mw = new MapWri table ( ) ;
/∗ ∗∗ w r i t e t o t h e column f a m i l y <g t : > , column l a b e l
∗ i s t h e n−1 gram T e x t k , t h e row i s t h e word from
∗ HashMap c o u n t s
∗ /
i f ( ! c o u n t s . i sEmpty ( ) ) {f o r ( Ent ry<S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t
( ) ) {F l o a t prob = e n t r y . g e t V a l u e ( ) / ba se ;
i f ( p rob > 1 f )
prob = 1 f ;
mw. c l e a r ( ) ;
mw. p u t ( new Text ( ” g t : ” + k ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( prob ) .
g e t B y t e s ( ) ) ) ;
o u t p u t . c o l l e c t ( new Text ( e n t r y . getKey ( ) ) , mw) ;
}}i f ( k . t o S t r i n g ( ) . s p l i t ( ”\\ s+” ) . l e n g t h == 1) {
mw. c l e a r ( ) ;
mw. p u t ( new Text ( ” g t : unigram ” ) , newI m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( base /
t o t a l ) . g e t B y t e s ( ) ) ) ;
o u t p u t . c o l l e c t ( k , mw) ;
}
Appendix A. Source Code 69
}}
    /**
     * estimate and store the probability in context based table
     */
    public static class contextTableReduce extends TableReduce<Text, Text> {

        Long total;

        @Override
        public void configure(JobConf job) {
            total = job.getLong("total", 1);
        }

        /**
         * k is the n-1 gram context
         */
        public void reduce(Text k, Iterator<Text> v,
                OutputCollector<Text, MapWritable> output, Reporter r)
                throws IOException {
            Float base = 1f;
            HashMap<String, Float> counts = new HashMap<String, Float>();
            while (v.hasNext()) {
                String record = v.next().toString();
                if (record.startsWith("\t")) {
                    String[] tmp = record.split("\t");
                    counts.put(tmp[1], Float.parseFloat(tmp[2]));
                } else
                    base = Float.parseFloat(record);
            }
            MapWritable mw = new MapWritable();
            /**
             * write to the column family <gt:>, column label is the word from
             * HashMap counts, the row is the n-1 gram Text k
             */
            if (!counts.isEmpty()) {
                for (Entry<String, Float> entry : counts.entrySet()) {
                    Float prob = entry.getValue() / base;
                    if (prob > 1f)
                        prob = 1f;
                    mw.clear();
                    mw.put(new Text("gt:" + entry.getKey()),
                            new ImmutableBytesWritable(Float.toString(prob)
                                    .getBytes()));
                    output.collect(k, mw);
                }
            }
            if (k.toString().split("\\s+").length == 1) {
                mw.clear();
                mw.put(new Text("gt:unigram"), new ImmutableBytesWritable(
                        Float.toString(base / total).getBytes()));
                output.collect(k, mw);
            }
        }
    }
    /**
     * estimate and store the probability in half ngram based table
     */
    public static class combineTableReduce extends TableReduce<Text, Text> {

        Long total;

        public void configure(JobConf job) {
            total = job.getLong("total", 1);
        }

        /**
         * k is the n-1 gram context
         */
        public void reduce(Text k, Iterator<Text> v,
                OutputCollector<Text, MapWritable> output, Reporter r)
                throws IOException {
            Float base = 1f;
            HashMap<String, Float> counts = new HashMap<String, Float>();
            while (v.hasNext()) {
                String record = v.next().toString();
                if (record.startsWith("\t")) {
                    String[] tmp = record.split("\t");
                    counts.put(tmp[1], Float.parseFloat(tmp[2]));
                } else
                    base = Float.parseFloat(record);
            }
            MapWritable mw = new MapWritable();
            /**
             * write to the column family <gt:>, column label is the last half
             * ngram context, the row is the first half ngram context
             */
            if (!counts.isEmpty()) {
                for (Entry<String, Float> entry : counts.entrySet()) {
                    Float prob = entry.getValue() / base;
                    if (prob > 1f)
                        prob = 1f;
                    mw.clear();
                    String ngram = k.toString() + " " + entry.getKey();
                    String[] array = ngram.split("\\s+");
                    int half = array.length / 2;
                    String row = array[0];
                    String label = array[half];
                    for (int i = 1; i < half; i++) {
                        row = row + " " + array[i];
                    }
                    for (int i = half + 1; i < array.length; i++) {
                        label = label + " " + array[i];
                    }
                    mw.put(new Text("gt:" + label), new ImmutableBytesWritable(
                            Float.toString(prob).getBytes()));
                    output.collect(new Text(row), mw);
                }
            }
            if (k.toString().split("\\s+").length == 1) {
                mw.clear();
                mw.put(new Text("gt:unigram"), new ImmutableBytesWritable(
                        Float.toString(base / total).getBytes()));
                output.collect(k, mw);
            }
        }
    }
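The row/column split performed by combineTableReduce (and mirrored by queryTable for the half ngram table type) can be isolated in a small standalone sketch. HalfNgramSplit and splitHalf are illustrative names, not part of the thesis code:

```java
// Standalone sketch of the half-ngram split used by combineTableReduce:
// the first half of the ngram becomes the Hbase row key, the second half
// becomes the column label under the "gt:" family.
public class HalfNgramSplit {

    // returns {row, columnLabel} for a whitespace-separated ngram
    static String[] splitHalf(String ngram) {
        String[] array = ngram.split("\\s+");
        int half = array.length / 2;
        String row = array[0];
        String label = array[half];
        for (int i = 1; i < half; i++)
            row = row + " " + array[i];
        for (int i = half + 1; i < array.length; i++)
            label = label + " " + array[i];
        return new String[] { row, "gt:" + label };
    }

    public static void main(String[] args) {
        String[] rc = splitHalf("the quick brown fox");
        System.out.println(rc[0] + " | " + rc[1]); // the quick | gt:brown fox
    }
}
```

For an even-length ngram the two halves are equal in size; for an odd length the column label receives the extra word, matching the two loops in the reducer.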
    /**
     * estimate and store the probability in integer based table
     */
    public static class integerTableReduce extends TableReduce<Text, Text> {

        Long total;
        private HBaseConfiguration config;
        private HTable table;

        public void configure(JobConf job) {
            total = job.getLong("total", 1);
            try {
                config = new HBaseConfiguration(job);
                table = new HTable(config, new Text("idmap"));
            } catch (IOException e) {
                System.out.println("Can't open table: idmap for the ngram model.");
                return;
            }
        }

        /**
         * convert the ngram string into long integers. Not found words are
         * returned as '0'
         *
         * @param ngram the ngram string
         * @return the integer representation of the string
         */
        private String convertID(String ngram) throws IOException {
            String[] array = ngram.split("\\s+");
            Text Column = new Text("convert:id");
            String myid = "0";
            byte[] valueBytes = table.get(new Text(array[0]), Column);
            if (valueBytes != null) {
                myid = new String(valueBytes);
            } else
                return "0";
            if (array.length > 1) {
                for (int i = 1; i < array.length; i++) {
                    valueBytes = table.get(new Text(array[i]), Column);
                    if (valueBytes != null) {
                        myid = myid + " " + new String(valueBytes);
                    } else
                        return "0";
                }
            }
            return myid;
        }

        /**
         * k is the n-1 gram context
         */
        public void reduce(Text k, Iterator<Text> v,
                OutputCollector<Text, MapWritable> output, Reporter r)
                throws IOException {
            Float base = 1f;
            HashMap<String, Float> counts = new HashMap<String, Float>();
            while (v.hasNext()) {
                String record = v.next().toString();
                if (record.startsWith("\t")) {
                    String[] tmp = record.split("\t");
                    counts.put(tmp[1], Float.parseFloat(tmp[2]));
                } else
                    base = Float.parseFloat(record);
            }
            MapWritable mw = new MapWritable();
            /**
             * write to the column <gt:prob>, the row is the integer
             * representation of the ngram
             */
            if (!counts.isEmpty()) {
                for (Entry<String, Float> entry : counts.entrySet()) {
                    Float prob = entry.getValue() / base;
                    if (prob > 1f)
                        prob = 1f;
                    mw.clear();
                    mw.put(new Text("gt:prob"), new ImmutableBytesWritable(
                            Float.toString(prob).getBytes()));
                    String id = convertID(k.toString() + "\t" + entry.getKey());
                    output.collect(new Text(id), mw);
                }
            }
            if (k.toString().split("\\s+").length == 1) {
                mw.clear();
                mw.put(new Text("gt:prob"), new ImmutableBytesWritable(
                        Float.toString(base / total).getBytes()));
                String id = convertID(k.toString());
                output.collect(new Text(id), mw);
            }
        }
    }
    public int run(String[] args) throws Exception {
        int order = 3;
        int type = 1;
        List<String> other_args = new ArrayList<String>();
        for (int i = 0; i < args.length; ++i) {
            try {
                if ("-order".equals(args[i])) {
                    order = Integer.parseInt(args[++i]);
                    if (order < 1) {
                        System.out.println("order is invalid, using default instead");
                        order = 3;
                    }
                } else if ("-type".equals(args[i])) {
                    type = Integer.parseInt(args[++i]);
                    if (type < 1 || type > 5) {
                        System.out.println("type is out of range (1-5), using default instead");
                        type = 1;
                    }
                } else {
                    other_args.add(args[i]);
                }
            } catch (NumberFormatException except) {
                System.out.println("ERROR: Integer expected instead of " + args[i]);
                return printUsage();
            } catch (ArrayIndexOutOfBoundsException except) {
                System.out.println("ERROR: Required parameter missing from "
                        + args[i - 1]);
                return printUsage();
            }
        }
        // Make sure there is exactly 1 parameter left.
        if (other_args.size() != 1) {
            System.out.println("ERROR: Wrong number of parameters: "
                    + args.length + " instead of 1.");
            return printUsage();
        }
        System.out.println("the order is set to: " + order);
        System.out.println("the table type is set to: " + type);
        /**
         * for the integer based table, an extra MapReduce step is required to
         * generate the convert map
         */
        if (type == 5) {
            JobConf convertJob = new JobConf(getConf(), TableGenerator.class);
            convertJob.setJobName("generate convert mapping table");
            convertJob.setMapperClass(convertMap.class);
            convertJob.setInputFormat(SequenceFileInputFormat.class);
            convertJob.setMapOutputKeyClass(Text.class);
            convertJob.setMapOutputValueClass(LongWritable.class);
            convertJob.setInputPath(new Path(RAWCOUNTDIR));
            convertJob.setReducerClass(convertTable.class);
            // init the hbase table with the convert map table name "idmap"
            TableReduce.initJob("idmap", convertTable.class, convertJob);
            /**
             * record the running time
             */
            long t1 = System.currentTimeMillis();
            JobClient.runJob(convertJob);
            long t2 = System.currentTimeMillis();
            long sec = (t2 - t1) / 1000; // seconds
            System.out.println("running time is: " + sec / 60 + " minutes "
                    + sec % 60 + " seconds");
        }
        /**
         * Job 3 calculates the adjusted Good-Turing counts for the training
         * data and writes the smoothed counts to the Hbase table
         */
        JobConf jobC = new JobConf(getConf(), TableGenerator.class);
        jobC.setJobName("ngram with hbase");
        jobC.setInt("order", order);
        jobC.setMapperClass(estimateMap.class);
        jobC.setInputFormat(SequenceFileInputFormat.class);
        jobC.setMapOutputKeyClass(Text.class);
        jobC.setMapOutputValueClass(Text.class);
        jobC.setInputPath(new Path(RAWCOUNTDIR));
        switch (type) {
        case 1:
            jobC.setReducerClass(simpleTableReduce.class);
            // init the hbase table with the specified table name
            TableReduce.initJob(other_args.get(0), simpleTableReduce.class, jobC);
            break;
        case 2:
            jobC.setReducerClass(wordTableReduce.class);
            // init the hbase table with the specified table name
            TableReduce.initJob(other_args.get(0), wordTableReduce.class, jobC);
            break;
        case 3:
            jobC.setReducerClass(contextTableReduce.class);
            // init the hbase table with the specified table name
            TableReduce.initJob(other_args.get(0), contextTableReduce.class, jobC);
            break;
        case 4:
            jobC.setReducerClass(combineTableReduce.class);
            // init the hbase table with the specified table name
            TableReduce.initJob(other_args.get(0), combineTableReduce.class, jobC);
            break;
        case 5:
            jobC.setReducerClass(integerTableReduce.class);
            // init the hbase table with the specified table name
            TableReduce.initJob(other_args.get(0), integerTableReduce.class, jobC);
            break;
        default:
            jobC.setReducerClass(simpleTableReduce.class);
            // init the hbase table with the specified table name
            TableReduce.initJob(other_args.get(0), simpleTableReduce.class, jobC);
            break;
        }
        // open the num-of-unigrams file to read the total unigram counts
        try {
            Long total = 0L;
            Path tmpFile = new Path("num-of-unigrams");
            FileSystem fileSys = FileSystem.get(jobC);
            SequenceFile.Reader reader = new SequenceFile.Reader(fileSys,
                    tmpFile, jobC);
            LongWritable value = new LongWritable();
            if (reader.next(new Text(), value)) {
                total = value.get();
            }
            reader.close();
            jobC.setLong("total", total);
        } catch (IOException e) {
            System.out.println("Can't open file: num-of-unigrams for the ngram model. Exit.");
            return -1;
        }
        /**
         * record the running time
         */
        long t1 = System.currentTimeMillis();
        JobClient.runJob(jobC);
        long t2 = System.currentTimeMillis();
        long sec = (t2 - t1) / 1000; // seconds
        System.out.println("running time is: " + sec / 60 + " minutes " + sec
                % 60 + " seconds");
        return 0;
    }

    static int printUsage() {
        System.out.println("usage: [OPTIONS] <table name>" + "\nOptions include:");
        System.out.println("-order <order>:\tthe max ngram order\n\tdefault: 3\n"
                + "-type <type>:\tthe table type\n\tdefault: 1\n");
        System.out.println("the table type can be chosen from:");
        System.out.println("1. ngram based\n" + "2. word based\n"
                + "3. context based\n" + "4. half ngram based\n"
                + "5. integer based");
        return -1;
    }

    public static void main(String[] args) throws Exception {
        int errCode = ToolRunner.run(new Configuration(), new TableGenerator(),
                args);
        System.exit(errCode);
    }
}
A.3 NgramModel.java
NgramModel is the class that queries probabilities from the Hbase table for the
testing texts and computes their perplexity.
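The perplexity bookkeeping spread across the map() and close() methods of this class can be condensed into a minimal standalone sketch: H accumulates c_i * log2(p_i) over the test ngrams, and the final score is 2^(-H/N) for N total tokens. PerplexitySketch and its method name are illustrative, not part of the thesis code:

```java
// Minimal sketch of the perplexity computation in NgramModel:
// map() accumulates H += count * log2(prob); close() emits 2^(-H / totalCount).
public class PerplexitySketch {

    // probs[i] is the (backed-off) probability of the i-th test ngram,
    // counts[i] its occurrence count in the test set
    public static double perplexity(float[] probs, int[] counts) {
        double H = 0.0;
        long count = 0L;
        for (int i = 0; i < probs.length; i++) {
            count += counts[i];
            if (probs[i] > 0f)
                H += counts[i] * Math.log(probs[i]) / Math.log(2.0);
        }
        return Math.pow(2.0, -H / count);
    }

    public static void main(String[] args) {
        // a uniform model over four outcomes scores ~4.0
        System.out.println(perplexity(
                new float[] { 0.25f, 0.25f, 0.25f, 0.25f },
                new int[] { 1, 1, 1, 1 }));
    }
}
```

Lower perplexity means the model assigns higher probability to the test text, which is the evaluation criterion used throughout the experiments.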
package ngram;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.hbase.HTable;
import org.apache.hadoop.hbase.HBaseConfiguration;
/**
 * try to evaluate the ngram model, estimate perplexity scores for testing
 * sentences.
 *
 * @author Xiaoyang Yu
 */
public class NgramModel extends Configured implements Tool,
        Mapper<Text, IntWritable, Text, FloatWritable> {

    private JobConf myconf;
    private HBaseConfiguration config;
    private HTable table;
    private HTable convertTable;
    private HashMap<String, Float> mycache;
    private int type, cache, order;
    private Double perplexity = 0.0, H = 0.0;
    private long count = 0L, total = 0L;

    /**
     * a map class to do the word counts
     */
    public static class myMap extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {

        private int order;
        private String words;

        public void configure(JobConf job) {
            order = job.getInt("order", 3);
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            if (line == null || line.trim().length() == 0)
                return;
            Pattern myPattern = Pattern.compile("\\w+|\\p{Punct}+");
            Matcher myMatcher = myPattern.matcher(line);
            StringBuffer sb = new StringBuffer();
            while (myMatcher.find()) {
                sb.append(myMatcher.group()).append("\t");
            }
            line = sb.toString();
            String[] current = line.split("\\s+");
            words = "";
            if (order > 1)
                current = ("<s> " + line).split("\\s+");
            for (int i = 0; i <= current.length - order; i++) {
                words = current[i];
                for (int j = i + 1; j < i + order; j++)
                    words = words + " " + current[j];
                output.collect(new Text(words), new IntWritable(1));
            }
        }
    }
    /**
     * A reducer class that just emits the sum of the input values.
     */
    public static class myReduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
    /**
     * open the Hbase table for query
     */
    public void configure(JobConf job) {
        myconf = job;
        String tableName = job.get("table");
        type = job.getInt("type", 1);
        cache = job.getInt("cache", 0);
        total = job.getLong("total", 1);
        order = job.getInt("order", 3);
        mycache = new HashMap<String, Float>();
        try {
            config = new HBaseConfiguration(job);
            table = new HTable(config, new Text(tableName));
            convertTable = new HTable(config, new Text("idmap"));
        } catch (IOException e) {
            System.out.println("Can't open table: " + tableName
                    + " for the ngram model. Exit.");
            return;
        }
    }
    /**
     * query the Hbase table for the different table types to get the
     * probability
     *
     * @param ngram the ngram to query
     * @return the probability of the input ngram
     * @throws IOException
     */
    private byte[] queryTable(String ngram) throws IOException {
        String row, column;
        String[] array = ngram.split("\\s+");
        int length = array.length;
        if (length == 0)
            return null;
        switch (type) {
        // ngram based
        case 1:
            row = ngram;
            column = "gt:prob";
            break;
        // word based
        case 2:
            row = array[length - 1];
            if (length == 1)
                column = "gt:unigram";
            else {
                column = "gt:" + array[0];
                for (int i = 1; i < length - 1; i++) {
                    column = column + " " + array[i];
                }
            }
            break;
        // context based
        case 3:
            row = array[0];
            for (int i = 1; i < length - 1; i++) {
                row = row + " " + array[i];
            }
            if (length == 1)
                column = "gt:unigram";
            else
                column = "gt:" + array[length - 1];
            break;
        // half ngram based
        case 4:
            if (length == 1) {
                row = array[0];
                column = "gt:unigram";
            } else {
                int half = array.length / 2;
                row = array[0];
                column = "gt:" + array[half];
                for (int i = 1; i < half; i++) {
                    row = row + " " + array[i];
                }
                for (int i = half + 1; i < array.length; i++) {
                    column = column + " " + array[i];
                }
            }
            break;
        // integer based
        case 5:
            byte[] valueBytes = convertTable.get(new Text(array[0]),
                    new Text("convert:id"));
            if (valueBytes != null)
                row = new String(valueBytes);
            else
                return null;
            if (array.length > 1) {
                for (int i = 1; i < array.length; i++) {
                    valueBytes = convertTable.get(new Text(array[i]),
                            new Text("convert:id"));
                    if (valueBytes != null) {
                        row = row + " " + new String(valueBytes);
                    } else
                        return null;
                }
            }
            column = "gt:prob";
            break;
        default:
            row = ngram;
            column = "gt:prob";
            break;
        }
        return table.get(new Text(row), new Text(column));
    }
    public void map(Text key, IntWritable value,
            OutputCollector<Text, FloatWritable> output, Reporter reporter)
            throws IOException {
        Float prob = 0f, alpha = 1f;
        /**
         * a back-off calculation to get the probability
         */
        boolean finished = false;
        String row = key.toString();
        while (finished == false) {
            if (cache == 1 && mycache.containsKey(row)) {
                prob = alpha * mycache.get(row);
                finished = true;
            } else {
                byte[] valueBytes = queryTable(row);
                if (valueBytes != null) {
                    /**
                     * find the probability
                     */
                    prob = alpha * Float.valueOf(new String(valueBytes));
                    if (cache == 1 && row.split("\\s+").length < order) {
                        if (mycache.size() > 100)
                            mycache.clear();
                        mycache.put(row, Float.valueOf(new String(valueBytes)));
                    }
                    finished = true;
                } else {
                    String[] words = row.split("\\s+");
                    /**
                     * unseen unigram
                     */
                    if (words.length == 1) {
                        prob = alpha / (total + 1f);
                        finished = true;
                    } else {
                        /**
                         * back-off to the n-1 gram words
                         */
                        row = words[1];
                        for (int i = 2; i < words.length; i++)
                            row = row + " " + words[i];
                        alpha = alpha * 0.4f;
                        finished = false;
                    }
                }
            }
        }
        count += value.get();
        if (prob > 1f)
            prob = 1f;
        if (prob > 0f)
            H += value.get() * Math.log(prob.doubleValue()) / Math.log(2.0);
        output.collect(key, new FloatWritable(prob));
    }
    /**
     * compute the final perplexity score, write it to a file
     */
    public void close() throws IOException {
        perplexity = Math.pow(2.0, -H / count);
        Path outFile = new Path("perplexity");
        FileSystem fileSys = FileSystem.get(myconf);
        if (fileSys.exists(outFile))
            fileSys.delete(outFile);
        SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, myconf,
                outFile, Text.class, FloatWritable.class, CompressionType.NONE);
        writer.append(new Text("perplexity is: "), new FloatWritable(
                perplexity.floatValue()));
        writer.close();
    }
    static int printUsage() {
        System.out.println("usage: [OPTIONS] <test input dir> <table name>"
                + "\nOptions include:");
        System.out.println("-order <order>:\tthe max ngram order\n\tdefault: 3\n"
                + "-cache <0|1>:\tuse the caching query\n\tdefault: 0\n"
                + "-type <type>:\tthe table type\n\tdefault: 1\n");
        System.out.println("the table type can be chosen from:");
        System.out.println("1. ngram based\n" + "2. word based\n"
                + "3. context based\n" + "4. combined word and context based\n"
                + "5. integer based");
        return -1;
    }
    public int run(String[] args) throws Exception {
        int order = 3;
        int type = 1;
        int cache = 0;
        JobConf jobA = new JobConf(getConf(), NgramModel.class);
        jobA.setJobName("raw count");
        List<String> other_args = new ArrayList<String>();
        for (int i = 0; i < args.length; ++i) {
            try {
                if ("-order".equals(args[i])) {
                    order = Integer.parseInt(args[++i]);
                } else if ("-type".equals(args[i])) {
                    type = Integer.parseInt(args[++i]);
                } else if ("-cache".equals(args[i])) {
                    cache = Integer.parseInt(args[++i]);
                } else {
                    other_args.add(args[i]);
                }
            } catch (NumberFormatException except) {
                System.out.println("ERROR: Integer expected instead of " + args[i]);
                return printUsage();
            } catch (ArrayIndexOutOfBoundsException except) {
                System.out.println("ERROR: Required parameter missing from "
                        + args[i - 1]);
                return printUsage();
            }
        }
        // Make sure there are exactly 2 parameters left.
        if (other_args.size() != 2) {
            System.out.println("ERROR: Wrong number of parameters: "
                    + other_args.size() + " instead of 2.");
            return printUsage();
        }
        System.out.println("the order is set to: " + order);
        System.out.println("the table type is set to: " + type);
        System.out.println("use the caching query: "
                + ((cache == 1) ? "yes" : "no"));
        jobA.setInt("order", order);
        Path inputDir = new Path(other_args.get(0));
        jobA.addInputPath(inputDir);
        jobA.setOutputFormat(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(jobA, true);
        SequenceFileOutputFormat.setOutputCompressionType(jobA,
                CompressionType.BLOCK);
        Path tmp = new Path("tmp");
        if (FileSystem.get(jobA).exists(tmp))
            FileSystem.get(jobA).delete(tmp);
        jobA.setOutputPath(tmp);
        jobA.setOutputKeyClass(Text.class);
        jobA.setOutputValueClass(IntWritable.class);
        jobA.setMapperClass(myMap.class);
        jobA.setCombinerClass(myReduce.class);
        jobA.setReducerClass(myReduce.class);
        /**
         * record the running time
         */
        long t1 = System.currentTimeMillis();
        JobClient.runJob(jobA);
        JobConf jobB = new JobConf(getConf(), NgramModel.class);
        jobB.setJobName("evaluation");
        jobB.set("table", other_args.get(1));
        jobB.setInt("order", order);
        jobB.setInt("type", type);
        jobB.setInt("cache", cache);
        jobB.setInputPath(tmp);
        jobB.setInputFormat(SequenceFileInputFormat.class);
        jobB.setOutputFormat(TextOutputFormat.class);
        jobB.setOutputKeyClass(Text.class);
        jobB.setOutputValueClass(FloatWritable.class);
        jobB.setNumReduceTasks(0);
        Path resultDir = new Path("prob");
        if (FileSystem.get(jobB).exists(resultDir))
            FileSystem.get(jobB).delete(resultDir);
        jobB.setOutputPath(resultDir);
        jobB.setMapperClass(NgramModel.class);
        long mytotal = 1;
        try {
            Path tmpFile = new Path("num-of-unigrams");
            FileSystem fileSys = FileSystem.get(jobB);
            SequenceFile.Reader reader = new SequenceFile.Reader(fileSys,
                    tmpFile, jobB);
            LongWritable value = new LongWritable();
            if (reader.next(new Text(), value)) {
                mytotal = value.get();
            }
            reader.close();
        } catch (IOException e) {
            System.out.println("can't open file num-of-unigrams, exit");
            return -1;
        }
        System.out.println("the total unigram number is: " + mytotal);
        jobB.setLong("total", mytotal);
        JobClient.runJob(jobB);
        long t2 = System.currentTimeMillis();
        JobConf conf = new JobConf(getConf(), NgramModel.class);
        conf.setJobName("test");
        Path tmpFile = new Path("perplexity");
        FileSystem fileSys = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, tmpFile,
                conf);
        Text k = new Text();
        FloatWritable v = new FloatWritable();
        while (reader.next(k, v)) {
            System.out.println(k.toString() + ": " + v.get());
        }
        long sec = (t2 - t1) / 1000; // seconds
        System.out.println("running time is: " + sec / 60 + " minutes " + sec
                % 60 + " seconds");
        return 0;
    }
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new NgramModel(), args);
        System.exit(res);
    }
}