Hadoop exercise

Exercise

Similarity of Documents

• Simple inner product

• Cosine similarity

• Term weights

– Standard problem in IR

– tf-idf, BM25, etc.

• A document d is represented as a vector W_d of term weights w_{t,d}, which indicate the importance of each term t in the document

[Figure: documents di and dj]
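The two similarity measures listed above can be sketched over sparse term-weight vectors. This is a minimal illustration, assuming each document is stored as a dict from term to weight; the function names and the sample weights are hypothetical, not part of the exercise code.

```python
import math

def inner_product(wd_i, wd_j):
    """Sum of w[t, di] * w[t, dj] over the terms present in both documents."""
    # Iterate over the shorter vector; a term missing from either side adds 0.
    if len(wd_i) > len(wd_j):
        wd_i, wd_j = wd_j, wd_i
    return sum(w * wd_j[t] for t, w in wd_i.items() if t in wd_j)

def cosine_similarity(wd_i, wd_j):
    """Inner product divided by the product of the two vector lengths."""
    norm_i = math.sqrt(sum(w * w for w in wd_i.values()))
    norm_j = math.sqrt(sum(w * w for w in wd_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return inner_product(wd_i, wd_j) / (norm_i * norm_j)

# Hypothetical term weights for two tiny documents.
d_i = {"hadoop": 2.0, "index": 1.0}
d_j = {"hadoop": 1.0, "reduce": 3.0}
```

Only the shared term "hadoop" contributes here, which is the property the later slides exploit.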

Trivial Solution

• A trivial solution is to take each vector and compute its similarity (dot product) with every other vector in the collection. This loads each vector O(n) times and each term O(df_t^2) times, where df_t is the number of documents containing term t

• This works for a small collection, but our goal is a scalable and efficient solution for large collections
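For reference, the trivial solution can be sketched in a few lines. This is a hedged illustration assuming the whole collection fits in memory as a dict of sparse vectors; `all_pairs_similarity` and the sample data are hypothetical names and values.

```python
from itertools import combinations

def all_pairs_similarity(vectors):
    """Brute force: compare every document vector with every other one."""
    scores = {}
    for di, dj in combinations(sorted(vectors), 2):
        wi, wj = vectors[di], vectors[dj]
        # Dot product over the terms the two documents share.
        score = sum(w * wj[t] for t, w in wi.items() if t in wj)
        if score:
            scores[(di, dj)] = score
    return scores

# Hypothetical collection of three documents.
vectors = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"a": 3.0, "c": 1.0},
    "d3": {"b": 1.0, "c": 2.0},
}
scores = all_pairs_similarity(vectors)
```

Every vector is touched once per partner document, which is the repeated loading that the slide identifies as the scalability problem.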

• Load weights for each term once

• Each term contributes O(df_t^2) partial scores

• Allows efficiency tricks

• Each term contributes only if it appears in both di and dj
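The quadratic cost per term can be made concrete: a term found in df documents takes part in df(df-1)/2 document pairs. The sketch below uses hypothetical document frequencies to show how strongly frequent terms dominate the work, which is where an efficiency trick (for example, skipping very frequent terms, named here only as one possibility) would have the most effect.

```python
def partial_score_count(df):
    """Number of document pairs that a term with document frequency df contributes to."""
    return df * (df - 1) // 2

# Hypothetical document frequencies for three terms.
doc_freqs = {"the": 1000, "hadoop": 40, "postings": 3}
cost = {term: partial_score_count(df) for term, df in doc_freqs.items()}
```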

Better Solution

[Figure: map, index, and reduce stages]

Better Solution: Map-Reduce

Solution – parts we have done

• An efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs

• Indexing:

– We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of the docids of the documents that contain it, each with its term weight.

– Use tf-idf to compute the weights.

– Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value.

– The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
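The indexing job described above can be simulated in plain Python, without Hadoop, to make the key/value flow visible. This is a sketch under stated assumptions: the weight is one common tf-idf variant, tf * log(N / df); the grouping dict stands in for the shuffle-and-sort phase; and all function names are hypothetical rather than taken from the provided exercise code.

```python
import math
from collections import Counter, defaultdict

def index_map(docid, tokens, doc_freq, n_docs):
    """Mapper: emit (term, (docid, weight)) for each distinct term in one document."""
    for term, tf in Counter(tokens).items():
        # Assumed weighting: tf * log(N / df).
        weight = tf * math.log(n_docs / doc_freq[term])
        yield term, (docid, weight)

def index_reduce(term, values):
    """Reducer: return the postings list for one term, ordered by docid."""
    return term, sorted(values)

def build_index(docs):
    # Document frequencies are computed inline to keep the sketch self-contained.
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    grouped = defaultdict(list)  # stands in for the runtime's grouping by key
    for docid, tokens in docs.items():
        for term, value in index_map(docid, tokens, doc_freq, len(docs)):
            grouped[term].append(value)
    return dict(index_reduce(t, vs) for t, vs in grouped.items())

# Hypothetical tokenised documents.
docs = {
    "d1": ["hadoop", "map", "map"],
    "d2": ["hadoop", "reduce"],
    "d3": ["hadoop", "map"],
}
index = build_index(docs)
```

With this weighting, a term that occurs in every document (here "hadoop") gets weight 0 and so adds nothing to any similarity score.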

Exercise - tasks

• Task 1 – Write map-reduce pseudo-code to compute sim(di,dj) based on the idea described on the previous page.

– Given the tf-idf code

– Given the inverted index computation code

– Write the similarity code to compute sim(di,dj) based on the pseudo-code designed above.

– Execute code on the provided data set, and test with the provided testing code.

• Group / individual presentation (60 min)

Solution

• An efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs

• Indexing:

– We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of the docids of the documents that contain it, each with its term weight.

– Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value.

– The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.

• Pairwise Similarity:

– Mapping over each posting, the mapper generates key tuples corresponding to pairs of docids in the postings.

– These key tuples are associated with the product of the corresponding term weights.

– They represent the individual term contributions to the final inner product.

– The MapReduce runtime sorts the tuples and then the reducer sums all the individual score contributions for a pair to generate the final similarity score.
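The pairwise similarity job can be simulated in the same way. This is a minimal sketch, not the reference solution: it assumes the inverted index is a dict from term to a list of (docid, weight) postings, uses a dict to stand in for the runtime's sort and grouping, and all names and sample weights are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, postings):
    """Mapper: for one postings list, emit ((di, dj), w_i * w_j) for every docid pair."""
    # Sorting by docid keeps each pair key in a canonical (smaller, larger) order.
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairs_reduce(pair, contributions):
    """Reducer: sum the per-term contributions into the final inner product."""
    return pair, sum(contributions)

def pairwise_similarity(index):
    grouped = defaultdict(list)  # stands in for the shuffle-and-sort phase
    for term, postings in index.items():
        for pair, product in pairs_map(term, postings):
            grouped[pair].append(product)
    return dict(pairs_reduce(p, cs) for p, cs in grouped.items())

# Hypothetical inverted index with precomputed term weights.
index = {
    "a": [("d1", 1.0), ("d2", 3.0)],
    "b": [("d1", 2.0), ("d3", 1.0)],
    "c": [("d2", 1.0), ("d3", 2.0), ("d1", 4.0)],
}
similarity = pairwise_similarity(index)
```

Pairs of documents that share no term never appear as a key, so no work is spent on them.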

End of session

Day – 2: Exercise
