Document Similarity with Cloud Computing

Page 1: Document Similarity with Cloud Computing

Document Similarity with Cloud Computing

by Bryan Bende

Page 2: Document Similarity with Cloud Computing

What is Cloud Computing?
"A style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet." - Wikipedia
● Resources could be storage, processing power, applications, etc.
● Third-party providers own the cloud
● Customers rent resources for an affordable price

Page 3: Document Similarity with Cloud Computing

Amazon Web Services
● Amazon provides several web services that utilize cloud computing:
○ Elastic Compute Cloud (EC2)
○ Simple Storage Service (S3)
○ SimpleDB
○ Simple Queue Service (SQS)
○ Elastic MapReduce
● Pay only for what you use - services typically charge based on bandwidth in and out, and hourly or monthly usage; rates are very affordable

Page 4: Document Similarity with Cloud Computing

Amazon Elastic Compute Cloud (EC2)
● Provides resizable computing capacity
● Customer requests a number of instances and the type of OS image to load on each instance
● Instances are allocated on-demand and can be added at any time (more than 20 instances requires approval)
○ Small Instance (Default)
■ 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage
○ Large Instance
■ 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage
○ Extra Large Instance
■ 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage

On-Demand Instances    Linux/UNIX Usage    Windows Usage
Small (Default)        $0.10 per hour      $0.125 per hour
Large                  $0.40 per hour      $0.50 per hour
Extra Large            $0.80 per hour      $1.00 per hour

Also pay $0.10/GB data in, $0.17/GB data out (for the first 10 TB)
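For a rough sense of scale (an illustrative calculation only, assuming each partial instance-hour is billed as a full hour): the 20-small-instance cluster used later in this talk, running for about 1.5 hours, would come to roughly 20 instances × 2 hours × $0.10 = $4.00 of compute, plus data transfer charges.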

Page 5: Document Similarity with Cloud Computing

Amazon Simple Storage Service (S3)
● Provides data storage in the cloud
● Write, read, and delete objects up to 5 GB in size; the number of objects is unlimited
● Each object is stored in a bucket and retrieved via a unique, developer-assigned key

Storage
● $0.150 per GB – first 50 TB / month of storage used

Data Transfer
● $0.10 per GB – all data transfer in
● $0.17 per GB – first 10 TB / month data transfer out

Requests
● $0.01 per 1,000 PUT, COPY, POST, or LIST requests
● $0.01 per 10,000 GET and all other requests*

Page 6: Document Similarity with Cloud Computing

How do we use these services?

Typical Scenario:
1. Transfer data to be processed into S3
2. Launch a cluster of machines on EC2
3. Transfer data from S3 onto the master node of the cluster
4. Launch a job that uses the cluster to process the data
5. Send results back to S3, or SCP back to the local machine
6. Shut down the EC2 instances

All data on the EC2 instances is lost when shutting down

How do we use the cluster to process the data?

Page 7: Document Similarity with Cloud Computing

Map Reduce / Hadoop
● Map Reduce is a software framework to support distributed computing on large data sets
● Does not have to be used with cloud computing; could be used with a personal cluster of machines
● Map task produces key/value pairs from input
● Reduce task receives all the key/value pairs with the same key
● Framework handles distributing the data; developers only write the Map and Reduce operations
● Hadoop is a Java-based open-source implementation of Map Reduce

Diagram from: http://www.sigcrap.org/2008/01/23/mapreduce-a-major-disruptionto-database-dogma/

Page 8: Document Similarity with Cloud Computing

Hadoop Continued...

Map Function:
public void map(LongWritable key, Text value,
    OutputCollector<Text, Tuple> output, Reporter reporter) throws IOException {
  ....
}

Reduce Function:
public void reduce(Text key, Iterator<Tuple> values,
    OutputCollector<Text, Tuple> output, Reporter reporter) throws IOException {
  ....
}

Main Method
● Creates a Job
● Specifies the Map class, Reduce class, Input path, Output path, Number of Map tasks, and Number of Reduce tasks
● Submits the job (a minimal driver sketch follows below)
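As a rough illustration of what that main method might look like on Hadoop 0.17's org.apache.hadoop.mapred API, here is a minimal driver sketch. The class names (InvertedFileJob, InvertedFileMapper, InvertedFileReducer), paths, and task counts are illustrative, not the author's actual code; the mapper and reducer themselves are sketched under Step 1 on Page 10.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class InvertedFileJob {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(InvertedFileJob.class);        // create the job
    conf.setJobName("inverted-file");

    conf.setMapperClass(InvertedFileMapper.class);            // Map class (sketched on Page 10)
    conf.setReducerClass(InvertedFileReducer.class);          // Reduce class (sketched on Page 10)

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input path
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output path

    conf.setOutputKeyClass(Text.class);                       // output key/value types
    conf.setOutputValueClass(Text.class);

    conf.setNumMapTasks(2);                                   // number of Map tasks
    conf.setNumReduceTasks(1);                                // number of Reduce tasks

    JobClient.runJob(conf);                                   // submit the job and wait
  }
}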

Page 9: Document Similarity with Cloud Computing

Experiment: Compute Document Similarity
Motivated by Pairwise Document Similarity in Large Collections with MapReduce by Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

● Score every document in a large collection against every other document in the collection

● Similar to the process of scoring a query against a document, but instead of a query we have another document

● Data Set - Wikipedia Abstracts provided by DBPedia
○ Pre-processed so each abstract is on a single line with the Wikipedia URL at the beginning of each line
○ Used the Wikipedia URL as a document id, rest of the text as the document

Example Data:
<http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry originated in the 1970's and was most intense in the late 1980s - early 1990's, a period when the Bulls' superstar, Michael Jordan, ...

Page 10: Document Similarity with Cloud Computing

Step 1 - Inverted File with Map Reduce
● Each line of the input file gets passed to a Mapper (i.e. each mapper handles one document at a time because of the DBPedia format, which makes everything simpler)
● Mapper tokenizes and normalizes the text
● Produces key/value pairs where each key is a word and the value is a tuple containing the doc id, doc term frequency, and doc length
○ <word1, (doc1, dtf1, docLength)>
○ <word2, (doc1, dtf2, docLength)>
● Each Reducer receives all the records for a single key at one time (handled by the framework)
● Iterates over each record and uses the dtf and doc length to calculate a score for the word in the given document
● Produces a posting list for the word (a sketch of this step follows below)
○ <word1, (doc1, w1), (doc2, w2) ... (docN, wN)>
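To make the shape of Step 1 concrete, here is a minimal sketch of a mapper and reducer against Hadoop 0.17's old mapred API. It is not the author's code: the real implementation used Cloud9's tuple types, whereas this sketch packs the (doc id, dtf, doc length) values into a tab-separated Text value, uses a crude tokenizer, and hard-codes a made-up avdl. Each class would live in its own file.

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: one DBPedia line (doc id + abstract) in, <word, (docId, dtf, docLength)> pairs out.
public class InvertedFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String line = value.toString();
    int split = line.indexOf(' ');
    if (split < 0) return;
    String docId = line.substring(0, split);                 // the Wikipedia URL
    String[] tokens = line.substring(split + 1).toLowerCase().split("\\W+");

    Map<String, Integer> counts = new HashMap<String, Integer>();
    int docLength = 0;
    for (String token : tokens) {
      if (token.length() == 0) continue;
      docLength++;
      Integer c = counts.get(token);
      counts.put(token, c == null ? 1 : c + 1);
    }
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      // <word, (docId, dtf, docLength)>
      output.collect(new Text(e.getKey()),
          new Text(docId + "\t" + e.getValue() + "\t" + docLength));
    }
  }
}

// Reducer: all (docId, dtf, docLength) records for one word in, one posting list out.
public class InvertedFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private static final double AVDL = 50.0;   // pre-computed average document length (made-up value)

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    StringBuilder postings = new StringBuilder();
    while (values.hasNext()) {
      String[] fields = values.next().toString().split("\t");
      double tf = Double.parseDouble(fields[1]);
      double dl = Double.parseDouble(fields[2]);
      double w = tf / (0.5 + 1.5 * (dl / AVDL) + tf);        // weight from the scoring slide
      postings.append(fields[0]).append(':').append(w).append(' ');
    }
    // <word, (doc1, w1) (doc2, w2) ... (docN, wN)>, postings separated by spaces
    output.collect(key, new Text(postings.toString().trim()));
  }
}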

Page 11: Document Similarity with Cloud Computing

Inverted File Example

Page 12: Document Similarity with Cloud Computing

Scoring Function
● Okapi Term Weighting - Variation from the paper by Scott Olsson and Douglas Oard, Improving Text Classification

w(tf, dl) = tf / (0.5 + 1.5(dl / avdl) + tf)

tf = term frequency in the document
dl = document length
avdl = average document length for the collection

● Wrote utility to pre-compute avdl, hard-coded into the Inverted File Reducer (a one-line Java version follows below)
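Translated literally into Java, the weighting function is a single expression; the AVDL constant below is only an illustrative stand-in for the pre-computed value.

// Literal Java translation of the slide's weighting function.
class OkapiWeight {
  static final double AVDL = 50.0;   // pre-computed average document length (made-up value)

  // w(tf, dl) = tf / (0.5 + 1.5 * (dl / avdl) + tf)
  static double weight(double tf, double dl) {
    return tf / (0.5 + 1.5 * (dl / AVDL) + tf);
  }
}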

Page 13: Document Similarity with Cloud Computing

Step 2 - Map Over the Inverted File
● Each Mapper receives one posting list at a time
● For each posting, go to every other posting and produce a tuple where the key contains the doc ids of each posting, and the value contains the product of the weights
○ <(doc1, doc2), combined weight>
○ <(doc1, doc3), combined weight>
○ <(doc1, docN), combined weight>
● Each Reducer receives all records for one pair of doc ids at one time
● Sums all the combined weights to get the total score for doc X vs doc Y (a sketch follows below)
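Continuing the earlier sketch (again with hypothetical class names, and assuming the word<TAB>doc:weight lines produced by the Step 1 sketch with the default text output), Step 2 might look like the following: the mapper expands each posting list into doc-id pairs carrying products of weights, and the reducer sums the products per pair. As written it emits both orderings of a pair, matching this slide; the Third Attempt slide describes avoiding that.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: one posting list ("word \t doc1:w1 doc2:w2 ...") in, <(docX, docY), wX*wY> pairs out.
public class PairwiseSimilarityMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    String[] parts = value.toString().split("\t");
    if (parts.length < 2) return;
    String[] postings = parts[1].split(" ");
    for (int i = 0; i < postings.length; i++) {
      int ci = postings[i].lastIndexOf(':');                  // weight follows the last ':'
      String docI = postings[i].substring(0, ci);
      double wI = Double.parseDouble(postings[i].substring(ci + 1));
      for (int j = 0; j < postings.length; j++) {
        if (j == i) continue;                                 // every other posting
        int cj = postings[j].lastIndexOf(':');
        String docJ = postings[j].substring(0, cj);
        double wJ = Double.parseDouble(postings[j].substring(cj + 1));
        // <(docI, docJ), product of the two weights>
        output.collect(new Text(docI + "\t" + docJ), new DoubleWritable(wI * wJ));
      }
    }
  }
}

// Reducer: all combined weights for one pair of doc ids in, total similarity score out.
public class PairwiseSimilarityReducer extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  public void reduce(Text pair, Iterator<DoubleWritable> values,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double total = 0.0;
    while (values.hasNext()) {
      total += values.next().get();   // sum the products of weights for this doc pair
    }
    output.collect(pair, new DoubleWritable(total));
  }
}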

Page 14: Document Similarity with Cloud Computing

Tools and Technologies
● Amazon EC2 and S3
● Map Reduce / Hadoop 0.17
● Cloud9
○ Library developed by Jimmy Lin at the University of Maryland
○ Helper classes and scripts for working with Hadoop
● JetS3t Cockpit
○ Application to manage S3 buckets
● SmallText
○ Java library for performing external sorting of large files

Page 15: Document Similarity with Cloud Computing

Experiment Steps
1. Transfer DBPedia data into S3
2. Start EC2 cluster using Cloud9 scripts
3. Transfer DBPedia data onto the master node of the cluster
4. Put DBPedia data into Hadoop's Distributed File System (HDFS)
5. Transfer the JAR file containing the Mapper, Reducer, and Main Method onto the master node
6. Submit the Inverted File job
7. Submit the Document Similarity job
8. SCP the results of the Document Similarity job back to the local machine
9. Concatenate all the partial result files into one file
10. Run SmallText to perform an external sort on the concatenated file

Page 16: Document Similarity with Cloud Computing

First Attempt
● Used subset of DBPedia Data
○ First 160K Abstracts from full set
○ 98 MB file
○ 460K unique words
● Inverted File
○ 2 Small EC2 Instances
○ 2 Map Tasks, 1 Reduce Task so all output is one file
○ Completed in approximately 5 minutes
○ 170 MB Inverted File
● Document Similarity
○ 20 Small EC2 Instances
○ 5 Map Tasks per instance (100 total)
○ 1 Reduce Task per instance (20 total)
○ Map phase only 50% complete after 12 hours

What was the problem?

Page 17: Document Similarity with Cloud Computing

Document Frequency Cut (DF-Cut)
● Document Frequency Cut is the process of ignoring the most frequent terms in the collection when generating the inverted file
● Most frequent terms generate the longest posting lists
● Longest posting lists generate the most pairs during the mapping phase
● Common words in the collection aren't a big factor in determining document similarity
● Paper by Elsayed describes using a 1% DF-Cut which ignored the nine-thousand most frequent words
● Used a .5% DF-Cut on the DBPedia data set which ignored approximately 2,300 words (a sketch of applying the cut follows below)
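One way the cut could be wired into the inverted-file reducer sketched earlier is shown below (imports as in the Step 1 sketch, plus java.util.ArrayList and java.util.List). The class name and the MAX_DF threshold are made up for illustration; in practice the threshold would come from the df distribution of the collection, so that it corresponds to the top .5% of terms.

// Sketch of the inverted-file reducer with a df-cut: a term's postings are buffered,
// and the term is dropped entirely if its document frequency exceeds the cut-off.
public class DfCutInvertedFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private static final int MAX_DF = 5000;      // illustrative threshold, not the real value
  private static final double AVDL = 50.0;     // pre-computed average document length (made-up)

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    List<String> records = new ArrayList<String>();
    while (values.hasNext()) {
      records.add(values.next().toString());
      if (records.size() > MAX_DF) {
        return;                                // too common: emit no posting list at all
      }
    }
    StringBuilder postings = new StringBuilder();
    for (String record : records) {
      String[] fields = record.split("\t");
      double tf = Double.parseDouble(fields[1]);
      double dl = Double.parseDouble(fields[2]);
      postings.append(fields[0]).append(':')
              .append(tf / (0.5 + 1.5 * (dl / AVDL) + tf)).append(' ');
    }
    output.collect(key, new Text(postings.toString().trim()));
  }
}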

Page 18: Document Similarity with Cloud Computing

Second Attempt
● Same parameters as first attempt except Inverted File used .5% DF-Cut
● Inverted File
○ Size reduced to around 80 MB
○ Completed slightly faster
● Document Similarity
○ Completed in approximately 1.5 hours
○ Produced 16 GB of output
● SmallText Sorting
○ Completed in approximately 30 mins
● Most Similar Documents
○ African_Broadbill and African_Shrike-flycatcher

Page 19: Document Similarity with Cloud Computing

Third Attempt
● Increased size of data set
○ First 600K Abstracts
○ 400 MB File
○ 1.2 Million Words
● Same number of instances and tasks as previous attempts
● Same DF-Cut, which removed 6K words
● Changed Inverted File to produce posting lists sorted by Document Id
● Changed Document Similarity to not produce both <doc1, doc2> and <doc2, doc1>
● Document Similarity completed in 2.5 hours
● Produced 30 GB of output files
● SmallText completed sorting in 2 hours
● Most similar documents seem inaccurate

Page 20: Document Similarity with Cloud Computing

Conclusions
● Amazon Web Services makes it easy for anyone to use Cloud Computing for data mining tasks
● Map Reduce / Hadoop makes it easy to implement distributed processing, hides complexity from the developer
● Can be hard to debug problems in the cloud
● Need a more efficient way to store and read the inverted file
● Scoring function may not be accurate