Detecting Near Duplicates for Web Crawling

Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma

Presenter: Siyuan Hua

Application and why

Algorithm

Google story

Web Documents

Files in a file system

E-mails

Domain-specific corpora

Web Mirrors

Clustering for “related documents”

Data extraction

Plagiarism

Spam detection

Duplicates in domain-specific corpora

Simhash maps each document to an f-bit fingerprint; each bit is set by a weighted vote over the hashed features of the document

Properties of the simhash value:
◦ The fingerprint of a document is a “hash” of its features
◦ Similar documents have similar hash values
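The two properties above can be illustrated with a minimal sketch of a simhash-style fingerprint. The slides do not give an implementation; the use of MD5 as the per-feature hash, the `features` dictionary of weights, and the function names are assumptions made for illustration only.

```python
import hashlib

def feature_hash(feature, f=64):
    """Hash one feature to an f-bit integer (MD5 is an arbitrary choice here; f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)

def simhash(features, f=64):
    """Combine weighted features into one f-bit fingerprint.

    features: dict mapping feature string -> positive weight (hypothetical input format).
    """
    votes = [0] * f
    for feature, weight in features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            # A 1-bit votes +weight for position i, a 0-bit votes -weight.
            if (h >> i) & 1:
                votes[i] += weight
            else:
                votes[i] -= weight
    fingerprint = 0
    for i in range(f):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

In this sketch a low-weight feature cannot flip any bit when the remaining votes outweigh it, which is one way to see why documents sharing most of their features end up with nearby fingerprints.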

Definition:
◦ Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version there is a set of query fingerprints instead of a single query fingerprint.)

Simple Solution:
◦ Linear search: O(mn) time for m query fingerprints against n existing fingerprints
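As a baseline, the linear search can be sketched as follows. The function names are illustrative, not taken from the paper; the point is only that every query is compared against every existing fingerprint.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def linear_search(existing, queries, k):
    """For each query, list the existing fingerprints within Hamming distance k.

    Cost is len(queries) * len(existing) comparisons.
    """
    return {
        q: [e for e in existing if hamming_distance(q, e) <= k]
        for q in queries
    }
```

For example, `linear_search([0b0000, 0b1111], [0b0001], 1)` reports only `0b0000` as a near-duplicate of the query.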

Scale Problem:
◦ 1M query documents against 8 billion ($2^{34}$) existing web pages in 100 seconds.
◦ The simple solution requires $2^{20} \cdot 2^{34} = 2^{54}$ comparisons! (Impossible in 100 seconds.)
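A quick back-of-the-envelope check of the brute-force cost, taking 1M queries as $2^{20}$ and the existing collection as $2^{34}$ fingerprints (the rounding the slides use). The variable names are illustrative.

```python
queries = 2 ** 20          # about 1M query fingerprints
existing = 2 ** 34         # existing fingerprints, as rounded on the slide
time_budget_seconds = 100

comparisons = queries * existing                     # one comparison per (query, existing) pair
required_rate = comparisons / time_budget_seconds    # comparisons per second needed

print(f"comparisons needed: 2^{comparisons.bit_length() - 1}")
print(f"required rate: {required_rate:.2e} comparisons/second")
```

The required rate comes out on the order of $10^{14}$ comparisons per second, which is why the linear scan is ruled out.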

Observation:
◦ Pre-compute all F’ such that the Hamming distance between F’ and F is at most k. Assume k = 3 and f = 64: about $\binom{64}{3} = 41664$ fingerprints F’ and $41664 \cdot \log(2^{34})$ comparisons! Too much time!
◦ Pre-compute all F’ such that some existing fingerprint is at most Hamming distance k away from F’. Too much space!
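The number of candidate fingerprints F’ near a given F can be checked directly. The sketch below counts them for f = 64 and k = 3 and enumerates them for a small f; the generator name `neighbors` is hypothetical.

```python
from itertools import combinations
from math import comb

def neighbors(fp, f, k):
    """Yield every f-bit value within Hamming distance k of fp (including fp itself)."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield fp ^ mask

exactly_three = comb(64, 3)                          # values at distance exactly 3
at_most_three = sum(comb(64, r) for r in range(4))   # values at distance 0..3

print(exactly_three, at_most_three)
```

Each of these tens of thousands of candidates would need its own lookup in a sorted table of existing fingerprints, for every single query.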

Their solution:
◦ Initialization: They build t tables: $T_1, T_2, \ldots, T_t$. Associated with table $T_i$ are two quantities: an integer $p_i$ and a permutation $\pi_i$ over the f bit-positions.
◦ Given fingerprint F and an integer k, we probe these tables in parallel:
◦ Step 1: Identify all permuted fingerprints in $T_i$ whose top $p_i$ bit-positions match the top $p_i$ bit-positions of $\pi_i(F)$.
◦ Step 2: For each of the permuted fingerprints identified in Step 1, check if it differs from $\pi_i(F)$ in at most k bit positions.

Example:
◦ A 64-bit fingerprint divided into 6 blocks can build 20 tables
◦ Space: $20 \times 64$ GB. Reasonable!
◦ Time: $20 \times (\log(2^{34}) + 8)$. Awesome!
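The table scheme can be sketched as below for the 6-block, 20-table example. This is an illustrative, in-memory sketch under stated assumptions, not the authors' implementation: the block split into near-equal contiguous ranges, the choice of `num_blocks - k` leading blocks per table, and all function names are assumptions. It relies on the fact that k differing bits touch at most k blocks, so at least one table has all its leading blocks in exact agreement with the query.

```python
from bisect import bisect_left
from itertools import combinations

def hamming(a, b):
    return bin(a ^ b).count("1")

def block_ranges(f, num_blocks):
    """Split bit positions 0..f-1 into contiguous blocks of near-equal size."""
    base, extra = divmod(f, num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def permute(fp, order):
    """Reorder bits so that bit order[0] of fp becomes the most significant bit."""
    out = 0
    for pos in order:
        out = (out << 1) | ((fp >> pos) & 1)
    return out

def build_tables(fingerprints, f=64, num_blocks=6, k=3):
    """One sorted table per choice of (num_blocks - k) leading blocks."""
    blocks = block_ranges(f, num_blocks)
    tables = []
    for lead in combinations(range(num_blocks), num_blocks - k):
        lead_bits = [b for i in lead for b in blocks[i]]
        rest_bits = [b for i in range(num_blocks) if i not in lead for b in blocks[i]]
        order = lead_bits + rest_bits
        pairs = sorted((permute(x, order), x) for x in fingerprints)
        keys = [key for key, _ in pairs]
        originals = [x for _, x in pairs]
        tables.append((len(lead_bits), order, keys, originals))
    return tables

def query(tables, fp, f=64, k=3):
    """Return all stored fingerprints within Hamming distance k of fp."""
    matches = set()
    for p, order, keys, originals in tables:
        target = permute(fp, order)
        shift = f - p
        prefix = target >> shift
        # Step 1: binary search for the run of keys sharing the top p bits.
        lo = bisect_left(keys, prefix << shift)
        hi = bisect_left(keys, (prefix + 1) << shift)
        # Step 2: full Hamming check on the few candidates (permutation preserves distance).
        for i in range(lo, hi):
            if hamming(originals[i], fp) <= k:
                matches.add(originals[i])
    return matches
```

With `num_blocks=6` and `k=3` this builds 20 tables, each a full sorted copy of the collection, which matches the space cost of 20 copies quoted in the example.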

Exploration of Design Parameters:
◦ (1) A small set of permutations to avoid blowup in space requirements
◦ (2) Large values for the various $p_i$ to avoid checking too many fingerprints in Step 2.

Tradeoff:
◦ Increasing the number of tables increases $p_i$ and hence reduces the query time. Decreasing the number of tables reduces storage requirements, but reduces $p_i$ and hence increases the query time.
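The tradeoff can be made concrete with a rough calculation. The sketch assumes a single-level block design (split f bits into equal blocks, use all but k of them as the leading bits) and assumes fingerprints behave like random bit strings, so a probe on $p_i$ leading bits returns about $2^{d - p_i}$ candidates out of $2^d$ stored fingerprints. The function `design` and its rounding of $p_i$ are illustrative assumptions.

```python
from math import comb

def design(f, d, k, num_blocks):
    """Return (number of tables, approximate p_i, expected candidates per probe)."""
    lead = num_blocks - k                 # blocks used as the leading bits
    tables = comb(num_blocks, lead)       # one table per choice of leading blocks
    p = (f * lead) // num_blocks          # approximate number of leading bits
    expected = 2 ** max(d - p, 0)         # candidates left for the Step 2 check
    return tables, p, expected

for blocks in (6, 5, 4):
    print(blocks, design(f=64, d=34, k=3, num_blocks=blocks))
```

Under these assumptions, going from 20 tables to 4 cuts storage by a factor of five but raises the number of fingerprints checked per probe from a handful to hundreds of thousands.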

Story:
◦ Assume that existing fingerprints are stored in file F and that the batch of query fingerprints is stored in file Q. With 8B 64-bit fingerprints, file F will occupy 64GB.
◦ They use GFS files, which are broken into 64MB chunks. Each chunk is replicated at three (almost) randomly chosen machines in a cluster, and each chunk is stored as a file in the local file system.
◦ F is divided into 64MB chunks, while Q is kept in its entirety.
◦ MapReduce computes all the near-duplicates in parallel.
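The batch computation can be imitated in a single process to show the shape of the work: each map task takes one chunk of F together with all of Q, and a reduce step merges the per-chunk outputs. This is only a sketch with hypothetical function names; it uses a plain linear scan inside each task and does not use GFS or any real MapReduce API.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def split_into_chunks(fingerprints, chunk_size):
    """Stand-in for the fixed-size chunks of file F."""
    return [fingerprints[i:i + chunk_size]
            for i in range(0, len(fingerprints), chunk_size)]

def map_task(chunk, queries, k):
    """One task: compare a single chunk of F against the whole of Q."""
    return [(q, e) for e in chunk for q in queries if hamming(q, e) <= k]

def reduce_task(partial_results):
    """Merge the per-chunk outputs into one sorted, de-duplicated list."""
    merged = set()
    for part in partial_results:
        merged.update(part)
    return sorted(merged)

def batch_near_duplicates(existing, queries, k, chunk_size):
    chunks = split_into_chunks(existing, chunk_size)
    # In a real cluster these tasks would run on different machines in parallel.
    partials = [map_task(chunk, queries, k) for chunk in chunks]
    return reduce_task(partials)
```

Because the map tasks share no state, the chunks can be processed in any order or concurrently, which is what makes the problem fit this model.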
