Hadoop exercise


Page 1: Hadoop exercise

Exercise

Page 2: Hadoop exercise

Similarity of Documents

• Simple inner product

• Cosine similarity

• Term weights

– Standard problem in IR

– tf-idf, BM25, etc.

• A document d is represented as a vector W_d of term weights w_{t,d}, which indicate the importance of each term t in the document

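The two similarity measures listed above can be sketched as follows. This is a minimal illustration, assuming each document is held as a sparse dict from term to weight; the function names and the sample weights are hypothetical and not part of the exercise code.

```python
import math

def inner_product(wi, wj):
    """Sum of w[t,di] * w[t,dj] over the terms present in both documents."""
    # Iterate over the smaller vector; terms missing from either side add 0.
    if len(wi) > len(wj):
        wi, wj = wj, wi
    return sum(w * wj[t] for t, w in wi.items() if t in wj)

def cosine(wi, wj):
    """Inner product divided by the product of the two vector lengths."""
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    norm_j = math.sqrt(sum(w * w for w in wj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention chosen here for empty vectors
    return inner_product(wi, wj) / (norm_i * norm_j)

# Hypothetical term weights for two documents.
d_i = {"hadoop": 2.0, "map": 1.0}
d_j = {"hadoop": 1.0, "reduce": 3.0}

print(inner_product(d_i, d_j))  # 2.0, only "hadoop" is shared
print(cosine(d_i, d_j))
```

If the vectors are normalized to unit length beforehand, the inner product and the cosine coincide, which is one reason the rest of the exercise can work with plain inner products.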

Page 3: Hadoop exercise

Trivial Solution

• A trivial solution is to take each vector and compute its similarity (dot product) with every other vector in the collection, which means that we load each vector O(n) times and load each term O(df_t^2) times, where n is the number of documents and df_t is the document frequency of term t

• This works for small collections, but our goal is a scalable and efficient solution for large collections
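For comparison, the trivial approach might look like the sketch below. It assumes the whole collection fits in memory as a dict from docid to a sparse weight vector; the function name and the sample data are illustrative only.

```python
from itertools import combinations

def all_pairs_bruteforce(docs):
    """docs: {docid: {term: weight}} -> {(di, dj): score} for non-zero scores."""
    scores = {}
    # Every document is compared with every other one, so each vector is
    # read O(n) times over the whole run.
    for di, dj in combinations(sorted(docs), 2):
        wi, wj = docs[di], docs[dj]
        s = sum(w * wj[t] for t, w in wi.items() if t in wj)
        if s:
            scores[(di, dj)] = s
    return scores

# Hypothetical three-document collection.
docs = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"a": 3.0},
    "d3": {"b": 1.0, "c": 4.0},
}
scores = all_pairs_bruteforce(docs)
print(scores)  # {('d1', 'd2'): 3.0, ('d1', 'd3'): 2.0}
```

The pair (d2, d3) shares no term and is still examined, which is the wasted work the inverted-index approach avoids.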

Page 4: Hadoop exercise

Better Solution

• Load weights for each term once

• Each term contributes O(df_t^2) partial scores

• Allows efficiency tricks

Each term contributes only if it appears in both di and dj
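The reasoning behind these bullets can be written out as a short derivation. The symbols V (the vocabulary) and P_t (the set of documents containing term t) are introduced here for illustration and do not appear on the slides.

```latex
% A term absent from either document has weight 0 there, so the sum over
% the whole vocabulary V reduces to the terms the two documents share.
\[
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]

% Regrouping by term: a term t with postings P_t, |P_t| = df_t, contributes
% one partial score w_{t,d_i} w_{t,d_j} for every pair of documents in P_t.
\[
\bigl|\{(d_i, d_j) : d_i, d_j \in P_t,\ i < j\}\bigr|
  = \binom{df_t}{2}
  = \frac{df_t\,(df_t - 1)}{2}
  = O(df_t^2)
\]
```

Processing one term at a time therefore touches each weight once and produces only the partial scores that are actually non-zero.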

Page 5: Hadoop exercise

Better Solution: Map-Reduce

[Figure: pipeline diagram with stages labeled index, map, and reduce]

Page 6: Hadoop exercise

Solution – parts we have done

• An efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs

• Indexing:

– We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of docids for the documents that contain it, together with the associated term weights.

– Use tf-idf to compute the weights.

– Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value.

– The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
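The indexing job described above can be simulated locally as in the sketch below. It is not the tf-idf or inverted index code handed out with the exercise: the weighting tf * log(N / df) is one common tf-idf variant chosen here as an assumption, all function names are hypothetical, and a plain dict stands in for the grouping that the MapReduce runtime performs.

```python
import math
from collections import Counter, defaultdict

def tfidf_weights(raw_docs):
    """raw_docs: {docid: [token, ...]} -> {docid: {term: tf * log(N / df)}}.

    Assumed weighting; the course's own tf-idf code may differ.
    """
    n = len(raw_docs)
    df = Counter(t for tokens in raw_docs.values() for t in set(tokens))
    return {
        docid: {t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
        for docid, tokens in raw_docs.items()
    }

def index_map(docid, weights):
    """Mapper: emit (term, (docid, weight)) for every term in the document."""
    for term, w in weights.items():
        yield term, (docid, w)

def index_reduce(term, values):
    """Reducer: the grouped tuples for one term form its postings list."""
    return term, sorted(values)

def run_indexing(weighted_docs):
    groups = defaultdict(list)  # local stand-in for the shuffle/group step
    for docid, weights in weighted_docs.items():
        for term, value in index_map(docid, weights):
            groups[term].append(value)
    return dict(index_reduce(t, vs) for t, vs in sorted(groups.items()))

# Hypothetical tokenized documents.
raw_docs = {"d1": ["a", "b", "b"], "d2": ["a"], "d3": ["b", "c"]}
index = run_indexing(tfidf_weights(raw_docs))
for term, postings in index.items():
    print(term, postings)
```

In a real Hadoop job the reducer would write each postings list to disk rather than return it, but the key and value shapes are the same as described in the bullets.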

Page 7: Hadoop exercise

Exercise - tasks

• Task 1 – Write map-reduce pseudo-code to compute sim(di,dj) based on the idea described on the previous page.

– Given the tf-idf code

– Given the inverted index computation code

– Write the similarity code to compute sim(di,dj) based on the pseudo-code designed above.

– Execute code on the provided data set, and test with the provided testing code.

• Group / individual presentation (60 min)

Page 8: Hadoop exercise

Solution

• An efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs

• Indexing:

– We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of docids for the documents that contain it, together with the associated term weights.

– Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value.

– The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.

• Pairwise Similarity:

– Mapping over each posting, the mapper generates key tuples corresponding to pairs of docids in the postings.

– These key tuples are associated with the product of the corresponding term weights.

– They represent the individual term contributions to the final inner product.

– The MapReduce runtime sorts the tuples and then the reducer sums all the individual score contributions for a pair to generate the final similarity score.
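The pairwise similarity job can be sketched in the same local style. This is one possible reading of the description above, not the reference solution: the function names are hypothetical, the postings are made-up weights, and a dict again stands in for the sort and group step of the runtime.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, postings):
    """Mapper: for one postings list, emit ((di, dj), wi * wj) per docid pair."""
    # Sorting gives each pair a canonical order, so (d1, d2) and (d2, d1)
    # never end up under different keys.
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairs_reduce(pair, contributions):
    """Reducer: sum the per-term contributions into the final inner product."""
    return pair, sum(contributions)

def run_pairwise(index):
    groups = defaultdict(list)  # local stand-in for the sort/group step
    for term, postings in index.items():
        for pair, product in pairs_map(term, postings):
            groups[pair].append(product)
    return dict(pairs_reduce(p, c) for p, c in sorted(groups.items()))

# Hypothetical postings: term -> [(docid, weight), ...]
index = {
    "a": [("d1", 1.0), ("d2", 3.0)],
    "b": [("d1", 2.0), ("d3", 1.0)],
    "c": [("d3", 4.0)],
    "e": [("d2", 2.0), ("d1", 0.5)],
}
similarities = run_pairwise(index)
print(similarities)  # {('d1', 'd2'): 4.0, ('d1', 'd3'): 2.0}
```

Term "c" occurs in a single document and emits nothing, while (d1, d2) receives one contribution from "a" and one from "e", which the reducer adds up.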

Page 9: Hadoop exercise

End of session

Day – 2: Exercise