Data De-duplication (Spring 2014)


Data Deduplication for Language Documentation

UNDER THE GUIDANCE OF:

DR. JAN CHOMICKI AND DR. JEFF GOOD

PRESENTED BY:

KAUSHAL HAKANI, SHAIL PARIKH, SHASHANK RALLAPALLI

Outline

Introduction

Challenges

Steps followed

Algorithms used

Approach

Experimental Results

Limitations

Conclusions

Introduction

13 Villages

7-9 “languages” spoken

4 local isolates

2 dialect clusters

12,000 people

Localist attitudes

Various classes of people collecting data

Aim

Detect duplicate files in the data obtained by the researchers in Cameroon.

Decide which files to keep and which to remove.

Remove duplicate files (De-duplicate)

Maintain information about the provenance of the deleted data.

Dataset

Initial observations about the dataset reveal that it contains the following types of files:

Audio/Visual

Audio recordings

Video recordings

Photographs/Scanned images

Textual

Transcriptions (some time-aligned, XML)

Questionnaire data

Lexical data (e.g., vocabulary items in a database)

Dataset (continued)

Metadata: contains information about the actual data files

System-generated files: files generated by Mac OS (e.g., .DS_Store)

There were approximately 231 unique file extensions that we observed when we parsed the dataset.

Challenges

Lack of a standard naming convention.

Deciding a suitable basis for de-duplication: file name based or file content based.

Deciding how to make this choice: obtain sample data and run different de-duplication techniques on it.

Challenges (continued)

Deciding which de-duplication methods would be required:

Edit Distance

Jaccard Similarity

Checksum and examination of the data within a file (see the checksum sketch at the end of this slide)

There were a few other challenges that we faced:

Coming up with appropriate factors to decide which files to delete from the dataset

Moving files over different filesystems.
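As a rough illustration of the checksum approach, here is a minimal Python sketch (the function names, chunk size, and choice of SHA-256 are ours, not from the original system) that hashes file contents so byte-identical files can be grouped regardless of their names:

import hashlib
from collections import defaultdict

def file_checksum(path, chunk_size=65536):
    # Hash the file's contents in chunks so large recordings fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def duplicate_groups(paths):
    # Map each checksum to the list of files sharing that exact content.
    groups = defaultdict(list)
    for path in paths:
        groups[file_checksum(path)].append(path)
    # Only groups with more than one file are duplicate candidates.
    return {h: ps for h, ps in groups.items() if len(ps) > 1}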

Steps

Initial Filtering

• Group by file size (see the grouping sketch after this slide)
• Sampling

Sampled Data

• De-duplicate on file name?
• De-duplicate on file content?
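As an illustration of the initial filter, here is a minimal Python sketch (the directory-walk approach and names are our own assumption) that groups files by size, since files of different sizes cannot be byte-identical duplicates:

import os
from collections import defaultdict

def group_by_size(root):
    # Walk the directory tree and bucket file paths by size in bytes.
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[os.path.getsize(path)].append(path)
    # Files with a unique size cannot have an exact duplicate; keep only
    # the size buckets that contain more than one file.
    return {size: paths for size, paths in groups.items() if len(paths) > 1}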

Steps (continued)

Experimental Observation

• De-duplicate based on file name
• Decide the de-duplication techniques to be used

Implementation

• Edit Distance
• Jaccard Similarity
• Custom Methods

Steps (continued)

Test sample data

• Results were satisfactory
• Also got data to compare results against

Ran on Actual Data

• Could potentially remove 384.41 GB out of a total of 928.45 GB. That is about 41.4% of the data.

Algorithms

Used the following standard de-duplication algorithms:

Edit Distance

Jaccard Similarity (Using n-grams)

Also used specialized algorithms:

Copy removal (specific to this dataset)

Bus removal (again, a method specific to this dataset)

Edit-Distance

This algorithm gives the dissimilarity between two strings.

It calculates the cost of converting one given string into the other.

The cost of each insert, delete, and replace operation is set to 1.

For example:

String s1 = “Mail Juice-21.gif”

String s2 = “Mail Juice-18.gif”

Example

String1 = “Mail Juice-21.gif”

String2 = “Mail Juice-18.gif”

Set the cost of insert = 1, delete = 1, and replacement = 1.

The total cost of converting String1 to String2 is 2: replace "2" with "1" and "1" with "8".
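A minimal Python sketch of this computation (a standard dynamic-programming Levenshtein implementation; the function name is ours):

def edit_distance(s1, s2):
    # dp[i][j] = cost of converting s1[:i] into s2[:j]
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s1
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace or match
    return dp[m][n]

print(edit_distance("Mail Juice-21.gif", "Mail Juice-18.gif"))  # prints 2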

Jaccard Coefficient

This algorithm measures the similarity of two strings.

It divides each string into overlapping character grams of a chosen length k.

It then calculates the overlap between the gram set of one string and the gram set of the other.

Jaccard Coefficient = |S1 ∩ S2| / |S1 ∪ S2|

Example

String1 = MailJuice21

String2 = MailJuice18

Grams:-

String1 grams (k = 3, with "_" marking spaces/padding): [Mai, ail, il_, l_J, _Ju, Jui, uic, ice, ce_, e_2, _21, 21_]

String2 grams: [Mai, ail, il_, l_J, _Ju, Jui, uic, ice, ce_, e_1, _18, 18_]

|S1 ∪ S2| = 15

|S1 ∩ S2| = 9

Jaccard Coefficient = 9/15 = 0.6, i.e., the two names are 60% similar.
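A minimal Python sketch of the gram-based Jaccard computation (the normalization of spaces to "_" and the single trailing pad are our inference from the gram lists above):

def ngrams(s, k=3, pad="_"):
    # Replace spaces with the pad character and add one trailing pad,
    # mirroring the gram lists above, then take overlapping k-grams.
    s = s.replace(" ", pad) + pad
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(s1, s2, k=3):
    # |grams(s1) ∩ grams(s2)| / |grams(s1) ∪ grams(s2)|
    g1, g2 = ngrams(s1, k), ngrams(s2, k)
    return len(g1 & g2) / len(g1 | g2)

print(jaccard("Mail Juice 21", "Mail Juice 18"))  # prints 0.6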

Custom Methods

There were certain cases where the files were duplicates but the names were not the same.

For example

FILE NAME              FILE SIZE
FOO50407.JPG           1.7 MB
FOO50407 (COPY).WAV    1.7 MB
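A minimal sketch of the copy-removal idea, assuming (as we infer from the example above) that a duplicate carries the original's base name plus a "(COPY)" marker and has the same size; the pattern and function name are our own:

import os
import re

# Matches stems like "FOO50407 (COPY)" and captures the base "FOO50407".
COPY_PATTERN = re.compile(r"^(?P<base>.+?)\s*\(COPY\)$", re.IGNORECASE)

def is_copy_of(candidate, original):
    # True if candidate's name is "<base> (COPY)" where <base> is the
    # original's stem, and the two files have identical sizes.
    cand_stem = os.path.splitext(os.path.basename(candidate))[0]
    orig_stem = os.path.splitext(os.path.basename(original))[0]
    match = COPY_PATTERN.match(cand_stem)
    return (match is not None
            and match.group("base") == orig_stem
            and os.path.getsize(candidate) == os.path.getsize(original))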

Experimental Results (on sample data)

[Pie chart: deleted file size vs. total deleted file size. WAV: 98%, others: 2%]

Experimental Results (on total data)

[Pie chart: file size deleted / total file size deleted. WAV: 94%, others: 6%]

Generated Log File

The column names, from left to right, are: new file name, old file name, old directory, size, and timestamp.
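A minimal sketch of writing one provenance row per deleted file (the CSV format, function name, and timestamp format are our own assumption; the slide only specifies the column order):

import csv
import time

def log_deletion(log_path, new_name, old_name, old_dir, size):
    # Append one row: new file name, old file name, old directory,
    # size, timestamp.
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([
            new_name, old_name, old_dir, size,
            time.strftime("%Y-%m-%d %H:%M:%S"),
        ])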

Limitations

We have observed a few limitations in the system we built.

Our system does not recognize different date formats appearing within a file name and treats each format as distinct.

Example: 25-05-2008 and 2008-25-5 are treated as different dates.

Our system is also insensitive to abbreviations.

Example: MK is not taken to be similar to MunKen.

So, human observation is still required to completely de-duplicate the data when the ingestion is unstructured.

Conclusion

Data de-duplication is a job-specific, or more precisely, application-specific task.

So, according to the given specifications and our implemented logic, we can safely say our methods have succeeded in de-duplicating a huge amount of data, freeing almost 400 GB of the given 1 TB hard drive.

Thank You! Questions?