
CHAPTER 2

DUPLICATE RECORD DETECTION USING ANFIS

2.1 INTRODUCTION

The problem of duplicate detection is to find out whether the same

real-world object is represented by two or more distinct entries in the

database. Duplicate detection is also known as record linkage or record matching. It is a heavily researched topic of vital importance in fields

such as master data management, data warehousing and ETL (Extraction,

Transformation and Loading), customer relationship management, and data

integration (Ahmed K. Elmagarmid et al 2007).

Theoretically, each candidate record needs to be compared with all the others; the first difficulty is thus the quadratic nature of the problem. Several solutions have been proposed to reduce the number of comparisons and thereby improve the efficiency of duplicate detection. The common objective is to concentrate comparisons on objects that share at least a few readily identifiable similarities and to avoid comparisons of clearly dissimilar objects. A common technique to avoid the costly comparison of all pairs of records is a careful grouping of the records into smaller subsets. Provided that duplicate records fall into the same partition, it then suffices to compare only the pairs within each partition (Uwe Draisbach and Felix Naumann 2009).


Clustering is the categorization of objects into diverse groups, or

more exactly, the division of a data set into subsets (clusters), so that the data

in each subset (ideally) shares some common traits, frequently proximity according to a defined distance measure. Blocking increases the efficiency

of duplicate detection and substantially improves the speed of the process of

comparison (Jebamalar Tamilselvi and Saravanan 2010). The term ‘blocking’

refers to the process of splitting the database into a set of mutually exclusive

subsets (blocks) such that the matches do not occur across blocks.

The next problem of duplicate detection is the definition of a

suitable similarity measure to determine if a given pair is truly a duplicate.

Combinations of edit-distance, TF-IDF weighting and other text-based

similarity measures are employed by conventional methods. Further, by means of a similarity threshold, pairs are classified into duplicates (similarity greater than or equal to the threshold) and non-duplicates (similarity less than the threshold) (Uwe Draisbach and Felix Naumann 2009).

Two common types of duplicate detection methods are the

fingerprint-based and full text-based methods (Hui Yang and Jamie Callan

2006). Textual similarity, typically quantified using a similarity function such as edit distance or cosine similarity, is utilized by most current approaches to determine whether two representations are duplicates.

Hamid Haidarian Shahri and Ahmad Abdollahzadeh Barforush

(2004) have proposed an adjustable fuzzy expert system for the data cleaning

process which identifies and removes fuzzy duplicates while avoiding repetitive manual effort. A few of the important merits of their proposed approach for use in diverse information systems include adaptability, simplicity of use,

extendibility, quick development time and run time efficiency. Eugenio

Cesario et al (2005) have presented an incremental algorithm to cluster

duplicate tuples in large databases, which has permitted the allocation of any


new tuple t to the cluster consisting of database tuples that has the greatest

similarity with t. Allocation of extremely identical objects to the same buckets

has been the objective of the hash-based indexing method at the core of their approach. Experimental validation has shown that the proposed method attains a substantial improvement in efficiency over modern index structures for proximity searches in metric spaces.

Hamid H. Shahri and Saied H. Shahri (2006) have proposed a

duplicate-elimination framework that permits easy and flexible data cleaning

that does not necessitate the execution of any coding by users. The

uncertainty of the problem has been handled inherently by exploiting the

fuzzy inference, and the framework has been enabled to adjust to the specific

concept of similarity suitable for each domain by distinct machine learning

capabilities. The framework, which is extensible and accommodative, has permitted users to operate with or without training data. Moreover, their framework has permitted the fast implementation of several preceding methods for duplicate elimination.

Xiao Mansheng et al (2009) have proposed a grouping based sub-

fuzzy clustering property optimization method. The method has first processed the attributes of each group of records to obtain a representation of the group, effectively reducing the attribute dimensionality, and has then detected approximately duplicated records within groups using a similarity comparison method. Both theoretical analysis and experimental results have shown that the method solves the problem of approximate duplicate record detection in large datasets with improved detection precision and efficiency.

Ying Pei et al (2009) have proposed an enhanced K-medoids

clustering algorithm (IKMC) to address the problem of approximate duplicate

record identification. It has employed the edit-distance method and feature weights, considering each record in the database as a separate data object, and has obtained the similarity measure among the records. It has then identified the duplicate records by clustering these similarity values. By comparing the similarity values with a preset similarity threshold and automatically adjusting the number of clusters, the algorithm has avoided the huge number of I/O operations used by the conventional "sort/merge" algorithm for sequencing. Experiments have shown that the algorithm possesses excellent detection precision and high availability.

Hannaneh Hajishirzi et al (2010) have proposed a near-duplicate

document detection technique that could be easily adjusted for a specific

domain. Each document has been represented as a real-valued sparse k-gram

vector in their method. The weights have been trained to optimize for a

particular similarity function like cosine similarity or Jaccard coefficient. The

enhanced similarity measure has been capable of reliably detecting the

approximate-duplicate documents. In addition, efficient similarity evaluation

has been obtained by using a locality-sensitive hashing scheme to map these vectors to a small number of hash values used as document signatures.

Automatic duplicate detection in a particular domain (Jie Song et al 2010; Jebamalar Tamilselvi et al 2009) is complicated for two reasons. First, duplicate representations are not identical, as they differ slightly in their values. Second, the theoretically required comparison of all pairs of records is impractical for huge quantities of data. In order to overcome these problems, two major methods are available:

(i) Similarity measures can be used for comparing two records to automatically detect duplicates, so that the effectiveness of duplicate detection is improved by carefully selected similarity measures.


(ii) Algorithms can be constructed to search for duplicates in extremely

huge quantities of data in such a way that the efficiency of

duplicate detection is improved by carefully designed algorithms

(Felix Naumann 2010).

Many approaches have been presented to address this problem.

Some of these approaches focus on the efficiency of the process of

comparison and others rely on the accuracy of the resulting records. In this

research, an efficient approach is employed using similarity functions and

ANFIS, which improves the accuracy of the duplicate detection process compared with other well-known approaches. Here, ANFIS is trained to discriminate between pairs of records corresponding to duplicates and non-duplicates, using training vectors generated from the attribute similarities.

This approach adopts ANFIS and similarity functions to improve

duplicate detection in two phases namely training phase and testing phase. In

the training phase, the dataset contains labeled duplicates and non-duplicates, which are used to train ANFIS. First, a pair of duplicate or non-duplicate records is considered and the similarity is computed using three similarity measures to obtain the similarity distance between the two individual records. The similarity values of all the fields are then combined to generate a feature vector. These steps are repeated for all the pairs of records in the training dataset to obtain a set of feature vectors, which are then used to train the ANFIS.

In the duplicate detection phase, an efficient K-means clustering is

used to partition the dataset into small partitions based on some common

features. The efficiency of the duplicate detection process is increased by

reducing the time taken for comparison of records. Hence, the similarity

computations and other processes are performed on the particular cluster in

which the record falls. Thus, to improve the quality of the database the


duplicate and non-duplicate records are identified for the input dataset. The

experimentation is carried out on the real dataset containing duplicate records

and the results demonstrate that the proposed approach improves the accuracy

in duplicate detection.

2.2 ADAPTIVE NEURO-FUZZY INFERENCE SYSTEMS

ANFIS is a neuro-fuzzy system that combines the learning techniques of neural networks with the efficiency of fuzzy inference systems (Esposito et al 2000). ANFIS uses a hybrid learning algorithm, combining the least-squares method with back-propagation gradient descent, to identify the parameters of a Sugeno-type fuzzy inference system. The FIS membership function parameters are trained to model a given training data set, and the model can optionally be validated against a separate checking data set.

The ANFIS structure is a network representation of the Takagi-Sugeno fuzzy model. For a first-order Sugeno fuzzy model with two rules, the rule set is as follows:

Rule 1: If x is A_1 and y is B_1, THEN f_1 = p_1 x + q_1 y + r_1

Rule 2: If x is A_2 and y is B_2, THEN f_2 = p_2 x + q_2 y + r_2

Training of the network involves a forward pass and a backward pass. In the forward pass, the input vector is propagated layer by layer through the network; in the backward pass, the error is propagated back through the network in a manner similar to back-propagation. Figure 2.1 shows the architecture of ANFIS.


Figure 2.1 ANFIS architecture for a two rule Sugeno system

Layer 1 In this layer, every node i is an adaptive node with a node

function,

O_{1,i} = \mu_{A_i}(x),      for i = 1, 2                          (2.1)

O_{1,i} = \mu_{B_{i-2}}(y),  for i = 3, 4                          (2.2)

where x (or y) is the input to node i and A_i (or B_{i-2}) is a linguistic label (such as "small" or "large") associated with this node. In other words, O_{1,i} is the membership grade of a fuzzy set A (= A_1, A_2, B_1 or B_2) and it specifies the degree to which the given input x (or y) satisfies the quantifier A. The generalized bell function is an example of an appropriate parameterized membership function for A:

\mu_A(x) = \frac{1}{1 + \left| \frac{x - c_i}{a_i} \right|^{2 b_i}}          (2.3)


where {a_i, b_i, c_i} is the parameter set. The bell-shaped function varies

according to the values of these parameters. Thus, it presents various forms of

membership function for the fuzzy set A. Parameters used in this layer are

referred to as premise parameters.

Layer 2 In this layer, every node is a fixed node labeled Π, whose output is the product of all the incoming signals,

O_{2,i} = w_i = \mu_{A_i}(x) \, \mu_{B_i}(y),   i = 1, 2           (2.4)

The output of each node represents the firing strength of a rule. Any other T-norm operator that performs fuzzy AND can be used as the node function in this layer.

Layer 3 In this layer, every node is a fixed node labeled N. The i-th node calculates the ratio of the i-th rule's firing strength to the sum of all rules' firing strengths,

O_{3,i} = \bar{w}_i = \frac{w_i}{w_1 + w_2},   i = 1, 2            (2.5)

For convenience, the outputs of this layer are called normalized firing strengths.

Layer 4 In this layer, every node i is an adaptive node with a node function,

O_{4,i} = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)          (2.6)

where \bar{w}_i is the normalized firing strength from Layer 3 and {p_i, q_i, r_i} is the parameter set of this node. Parameters in this layer are referred to as consequent parameters.

Layer 5 The single node in this layer is a fixed node labeled Σ, which computes the overall output as the summation of all incoming signals,

O_{5,1} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}          (2.7)
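To make the layer-by-layer computation concrete, the following Python sketch evaluates the forward pass of a two-rule Sugeno ANFIS according to equations (2.1) to (2.7). It is an illustration only: the premise parameters (a_i, b_i, c_i) and consequent parameters (p_i, q_i, r_i) shown here are arbitrary assumed values, not parameters learned by the system described later in this chapter.

```python
def bell_mf(x, a, b, c):
    """Generalized bell membership function, equation (2.3)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def sugeno_anfis_forward(x, y, premise, consequent):
    """Forward pass of a two-rule Sugeno ANFIS, equations (2.1)-(2.7).

    premise    : bell parameters (a, b, c) for the fuzzy sets A1, A2, B1, B2
    consequent : linear parameters (p, q, r) for rule 1 and rule 2
    """
    # Layer 1: membership grades of the inputs, equations (2.1) and (2.2)
    mu_A = [bell_mf(x, *premise["A1"]), bell_mf(x, *premise["A2"])]
    mu_B = [bell_mf(y, *premise["B1"]), bell_mf(y, *premise["B2"])]

    # Layer 2: firing strengths w_i = mu_Ai(x) * mu_Bi(y), equation (2.4)
    w = [mu_A[i] * mu_B[i] for i in range(2)]

    # Layer 3: normalized firing strengths, equation (2.5)
    w_bar = [wi / sum(w) for wi in w]

    # Layer 4: rule consequents f_i = p_i*x + q_i*y + r_i, equation (2.6)
    f = [p * x + q * y + r for (p, q, r) in consequent]

    # Layer 5: overall output as the weighted sum, equation (2.7)
    return sum(w_bar[i] * f[i] for i in range(2))


# Illustrative (assumed) parameters and inputs
premise = {"A1": (2.0, 2.0, 0.0), "A2": (2.0, 2.0, 5.0),
           "B1": (2.0, 2.0, 0.0), "B2": (2.0, 2.0, 5.0)}
consequent = [(1.0, 1.0, 0.0), (0.5, -0.3, 2.0)]
print(sugeno_anfis_forward(1.5, 2.0, premise, consequent))
```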

2.3 SIMILARITY FUNCTIONS UTILIZED IN THE PROPOSED

APPROACH

Recognizing similarities in large collections of data is a major issue

in the context of duplicate record detection. Similarity between records and fields is determined using 'similarity functions'. Different similarity computation functions are available for different data types; therefore, the user should consider the data type of the attribute, such as numerical or string, while selecting a function (Israr Ahmed et al 2010). Similarity functions appropriate for a given domain are thus essential for obtaining high accuracy. Similarity functions can be categorized into three groups: (i) character-based similarity functions, which allow contiguous sequences of mismatched characters (e.g., edit distance, Jaro distance); (ii) text (token)-based similarity functions, which view strings not as contiguous sequences but as unordered bags of elements (e.g., cosine similarity, Jaccard index); and (iii) fingerprinting techniques.


Given a set of records R = {r_1, r_2, r_3, ..., r_n}, where each record is a set of tokens taken from a database D = {t_1, t_2, t_3, ..., t_m}, the similarity search problem aims at finding all pairs of records r_x, r_y ∈ R such that sim(r_x, r_y) ≥ θ, where θ is a given similarity threshold value. The similarity between two non-empty records r_x and r_y can be measured by using the following functions.

2.3.1 Levenshtein Distance - Character-Based Similarity

Levenshtein Distance is described as the least number of edit

operations necessary to convert one string into another. Insertion, deletion and

substitution of characters are the permitted edit operations and unit cost is

assigned to each of these edit operations. The minimum number of such

operations can be computed using Dynamic Programming in time equal to the

product of the string lengths. For example, a character-based edit distance

between strings s =“company” and t = “corporation” is computed as follows.

There are several edit sequences that transform s into t, but a Levenshtein distance of '7' between s and t implies that a minimum of '7' operations is required to transform s into t, as unit cost is assigned to each operation. The following illustrates the seven edit operations applied to transform s into t.

1. Substitute "m" with "r": "company" → "corpany"

2. Insert "o": "corpany" → "corpoany"

3. Insert "r": "corpoany" → "corporany"

4. Insert "t": "corporany" → "corporatny"

5. Insert "i": "corporatny" → "corporatiny"

6. Insert "o": "corporatiny" → "corporationy"

7. Delete "y": "corporationy" → "corporation".


Thus, the Levenshtein distance function takes two strings, s of length n and t of length m, as inputs and computes the minimum edit distance between them.
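A minimal dynamic-programming sketch of this computation is given below. It is an illustration only, not the implementation used in this work; the function name is assumed for the example.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    (unit cost each) needed to transform s into t."""
    n, m = len(s), len(t)
    # d[i][j] holds the distance between the prefixes s[:i] and t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                              # delete all characters of s[:i]
    for j in range(m + 1):
        d[0][j] = j                              # insert all characters of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution (or match)
    return d[n][m]


print(levenshtein("company", "corporation"))     # 7
print(levenshtein("hariharan", "raghevendra"))   # 9
```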

2.3.2 Cosine Similarity - Text-Based Similarity

Cosine similarity is a similarity measure between two vectors that

measures the cosine of the angle between them. The cosine similarity between

the two vectors of attributes, A and B can be expressed by means of a dot

product and magnitude as discussed in the section 1.2.3.

2.3.3 MD5 - Fingerprint-Based Similarity

For fingerprint computation, a standard hashing approach such as,

MD5 hashing that efficiently generates a fingerprint for the input message is

employed. MD5 (Message-Digest Algorithm 5), which produces a 128-bit (16-byte) hash value, is an extensively used cryptographic hash function.

2.4 AN APPROACH TO DUPLICATE RECORD DETECTION

USING ANFIS

Let D be a database that contains records composed of k different fields, D = {R_1, R_2, ..., R_n}, where each record R_i includes k fields such as Name, Address and Pin Code. The proposed approach, shown in Figure 2.2, has two phases, namely the training phase and the duplicate detection phase.


Figure 2.2 The workflow of the proposed approach

At first, the proposed approach takes the training dataset and

directly computes the similarity values for each field using m different

similarity metrics. The similarity values are then combined to generate a

feature vector which gives the record-level similarity. A feature vector is thus generated for every pair of records using these similarity values and is then given as input to the learning algorithm, ANFIS. Hence, ANFIS is trained on both the duplicate and the non-duplicate record pairs individually, using the generated feature vectors, so that it can discriminate between duplicate and non-duplicate records during the testing phase.

Table 2.1 shows some sample records in the real dataset which is

used to test the proposed approach.


Table 2.1 Sample real dataset

S.No | Name | Address | Pin code | Contact No. | E-mail
1 | Mr.Rajendra Sharma | 403 Vandit, Appartment, Bhaikaka Nagar, Thaltej, Ahmedabad | 380059 | 9829061356 | [email protected]
2 | Mr Bhagirath Patel, Dipali Patel | 6th G/F, Dk House, Mithakali, Ahmedabad | 380006 | | jkssahmedabad@yahoo..com
3 | Mr.Hariharan | 101, Shivam Complex, Nr.Silicon Tower, Law Garden, Ahmedabad | 38000945 | 9376124916 / 9820217300 | hariharian@pacesetters.co.in
4 | | 401, Neelkamal Complex, Nr Shreeji Baug Society, Navrangpura, Ahmedabad | | 9924067888 |
5 | Mr Raghevendra S | 36, Second Floor, 2nd Main, Devi, Park Extension Bangalore | 560003 | 9892531242 | [email protected]
6 | Mr.Soloman Davis | # 2, 14th Main, Oppo Gautham College, Near Shankarmutt, Mahalakshmipuram, Bangalore | 5600 | 9810522125 | [email protected]
7 | Mr Raghevendra S | 36, Second Floor, 2nd Main, Devi, Park Extension Bangalore | 560003 | 9892531242 | [email protected]
8 | Mr Idris Khan | | | 9886060482 |
9 | Mr Pravin Sinha | K No. 25, 16th Cross, 23rd Main, J P Nagar, 5th Phase, Bengalore | 560078 | 9986222999 | exceed_credit@rediffmail
10 | Mr.Hariharan | 102, Shivam Complex, Nr.Silicon Tower, Law Garden, Ahmedabad | 380009 | 9376124916 / 9820217300 | [email protected]

2.4.1 Similarity Computation for All Pairs of Records

Similarity computation is carried out by applying the similarity functions to each record field. Each function compares a field of one record with the corresponding field of the other record and assigns a similarity value for that field.


For better duplicate detection, accurate similarity functions are very important

to calculate the distance between the records. Levenshtein distance, Cosine

similarity and MD5 are the three similarity measures used in the proposed

approach. The three measures are computed for all the attributes of record

pairs because different similarity operations have varying significance in

different domains. Here, the similarity computation process is explained for

two records R_3 and R_5 present in Table 2.1. Three distance measures for the "Name" attribute of these two records are computed as follows.

Levenshtein distance

The chosen name fields of the records are “Hariharan” and

“Raghevendra”. The "Levenshtein distance" is computed by calculating the

minimum number of operations required to transform one string to the other.

Usually these operations are the replacement, insertion or deletion of a character. The Levenshtein distance between the "Name" fields of the two records is 9, since a minimum of nine edit operations is required to change the word "Hariharan" into "Raghevendra".

Cosine similarity

The cosine similarity between the name fields "hariharan" and "raghevendra" of the two records is calculated as follows. First, the dimensions are obtained by taking the union of the characters of the two strings as (a, d, e, g, h, i, n, r, v), and the character-frequency vectors of the two strings are then formed, i.e.

“hariharan” = (3, 0, 0, 0, 2, 1, 1, 2, 0) and “raghevendra”= (2, 1, 2, 1, 1, 0, 1,

2, 1). The dot product and the magnitudes of both vectors are then found. In this example, the dot product is 13 and the magnitudes of the two vectors are 4.3589 and 4.1231 respectively, so the product of the magnitudes is 17.9722. The cosine similarity is therefore 13 / 17.9722 ≈ 0.72, which indicates that the similarity between the two strings is about 72%.
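The worked example above can be reproduced with a short sketch such as the following. It mirrors the character-frequency construction of the example and is illustrative only.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(s: str, t: str) -> float:
    """Cosine similarity between the character-frequency vectors of two strings."""
    cs, ct = Counter(s), Counter(t)
    dims = set(cs) | set(ct)                          # union of characters (dimensions)
    dot = sum(cs[ch] * ct[ch] for ch in dims)         # dot product
    mag_s = sqrt(sum(v * v for v in cs.values()))     # magnitude of the first vector
    mag_t = sqrt(sum(v * v for v in ct.values()))     # magnitude of the second vector
    return dot / (mag_s * mag_t)


print(round(cosine_similarity("hariharan", "raghevendra"), 4))  # approximately 0.7233
```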

MD5

The MD5 hashing technique is used for generating a message digest for every field of every record. First, the 32-character (128-bit) hexadecimal message digest of each of the two field values is computed by hashing. The edit distance between the two generated message digests is then taken as the distance between the fields. For example, the message digests of the name fields "hariharan" and "raghevendra" of the two records are as follows.

The 32-character message digest of "hariharan":

Hash('hariharan', 'MD5') = 0e23acc40a7892013e80ac380276f63a

The 32-character message digest of "raghevendra":

Hash('raghevendra', 'MD5') = b2d6252fda917e2771f6ef765391d397

The distance between the two name fields:

EditDist('0e23acc40a7892013e80ac380276f63a', 'b2d6252fda917e2771f6ef765391d397') = 29

Finally, the distance between the name fields of the two records

is 29. When comparing a pair of records, all three similarity measures are computed for each field to obtain the similarity. Thus, duplicate records can be detected effectively by combining these three measures for every record pair.
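A sketch of this fingerprint-based comparison is shown below. It uses Python's standard hashlib module for the MD5 digest and reuses the levenshtein() function sketched in Section 2.3.1; the helper name is assumed for this illustration.

```python
import hashlib


def md5_fingerprint_distance(a: str, b: str) -> int:
    """Edit distance between the 32-character hexadecimal MD5 digests
    of two field values."""
    digest_a = hashlib.md5(a.encode("utf-8")).hexdigest()   # 32 hex characters
    digest_b = hashlib.md5(b.encode("utf-8")).hexdigest()
    return levenshtein(digest_a, digest_b)                  # levenshtein() from Section 2.3.1


print(md5_fingerprint_distance("hariharan", "raghevendra"))
```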

2.4.2 Combining Similarity Across Multiple Fields

As the computation of similarity between the fields can vary

significantly depending on the domain and specific field under consideration,

the usual similarity functions may fail to find the similarity correctly. To


attain accurate similarity computations, it is therefore necessary to adapt

similarity measures for each field of the database with respect to the particular

data domain. Consequently, these similarity values obtained from different

similarity measures are combined to compute the distance between any two

records. When considering a database D that contains records composed of n

different fields and a set of m distance metrics, similarity between any pair of

records can be represented by an m-length vector, as shown in Figure 2.3. Each component of the vector represents the computed similarity value between the two records, calculated using one of the m distance metrics.

Figure 2.3 Computation of record similarity from individual field

similarities
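As an illustration of how field-level similarities are combined into one vector per record pair, the following sketch applies each similarity function to the corresponding fields of two records and aggregates the field scores by averaging. The field list, the averaging step and the record layout are assumptions made for this example (reusing the similarity sketches from Section 2.3), not the exact combination scheme of the proposed approach.

```python
def record_feature_vector(rec_a, rec_b, fields, metrics):
    """Build an m-length feature vector for a record pair:
    one component per similarity metric, aggregated over the compared fields."""
    vector = []
    for metric in metrics:                                    # one of the m distance metrics
        field_scores = [metric(str(rec_a.get(f, "")), str(rec_b.get(f, "")))
                        for f in fields]                      # field-level similarities
        vector.append(sum(field_scores) / len(field_scores))  # simple average over fields
    return vector


# Hypothetical usage with the measures sketched in Section 2.3
fields = ["Name", "Address", "Pin code"]
metrics = [cosine_similarity, levenshtein, md5_fingerprint_distance]
rec_a = {"Name": "Mr.Hariharan", "Address": "101, Shivam Complex", "Pin code": "38000945"}
rec_b = {"Name": "Mr.Hariharan", "Address": "102, Shivam Complex", "Pin code": "380009"}
print(record_feature_vector(rec_a, rec_b, fields, metrics))
```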


2.4.3 ANFIS Training

The ANFIS structure is trained in the proposed approach by providing the set of feature vectors generated as described in Section 2.4.2. The target output for the feature vectors is a binary matrix that differentiates the duplicate and non-duplicate record pairs based on the threshold value. For efficient classification, fuzzy rules are automatically generated from the neurons to classify the input data.

2.4.4 Clustering of Input Dataset

Clustering is the initial step for detecting duplicate records. The process groups the data records with respect to their similarity, so that the records present in the input dataset are represented by a group of clusters. This enables duplicate detection to be performed only on the data records in the most relevant cluster and automatically reduces the time complexity. The clustering of data records is carried out using k-means clustering, a widely accepted clustering method in the data mining community. The basic steps employed in clustering are given below; a small illustrative sketch follows the list.

(1) Initialize k centroids, one for each cluster

(2) Compute the similarity of each of the k centroids with the data records present in the dataset

(3) Allocate each data record to the cluster C_i with which its similarity measure is highest

(4) Update the k centroids

(5) Repeat steps 2 to 4 until there is no movement of data records between the clusters.
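A compact sketch of these steps on numeric feature vectors is shown below. Euclidean distance is used as the dissimilarity measure, which is one possible choice and an assumption of this illustration rather than a prescription of the approach.

```python
import random


def kmeans(points, k, max_iter=100):
    """Basic k-means clustering; returns a cluster label for each point."""
    centroids = [list(p) for p in random.sample(points, k)]   # step 1: initialize k centroids
    labels = None
    for _ in range(max_iter):
        # steps 2-3: assign every point to the closest (most similar) centroid
        new_labels = [min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(pt, centroids[c])))
                      for pt in points]
        if new_labels == labels:                               # step 5: stop when no point moves
            break
        labels = new_labels
        # step 4: update each centroid as the mean of its assigned points
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


points = [[0.10, 0.20], [0.15, 0.22], [0.90, 0.80], [0.88, 0.79]]
print(kmeans(points, k=2))
```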


2.4.5 ANFIS in Duplicate Detection

In the duplicate detection phase, the comparison is done for each of

the data records present only within the cluster to reduce the number of record

comparisons. Pairs of records that fall under each cluster are the candidates

for a full similarity computation. Record comparison is then performed in the same way as in the training phase: the same similarity metrics are used to calculate the distances between the corresponding fields of each candidate pair, thus creating the distance feature vectors for ANFIS. The feature vectors are then fed to ANFIS as input to obtain a binary matrix that distinguishes the duplicate and the non-duplicate records based on the threshold value.
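The restriction of comparisons to record pairs inside the same cluster can be sketched as follows. The feature_fn and classify_fn arguments stand in for the feature-vector construction and the trained ANFIS (or any other classifier); their names and the threshold value are assumptions of this illustration.

```python
from collections import defaultdict
from itertools import combinations


def detect_duplicates(records, labels, feature_fn, classify_fn, threshold=0.5):
    """Generate candidate pairs only within each cluster, build their feature
    vectors and let the trained classifier mark duplicates."""
    clusters = defaultdict(list)
    for rec, lab in zip(records, labels):
        clusters[lab].append(rec)                        # group records by cluster label

    duplicates = []
    for members in clusters.values():
        for rec_a, rec_b in combinations(members, 2):    # pairs within one cluster only
            features = feature_fn(rec_a, rec_b)          # distance feature vector
            if classify_fn(features) >= threshold:       # e.g. output of the trained ANFIS
                duplicates.append((rec_a, rec_b))
    return duplicates
```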

2.5 RESULTS AND DISCUSSION

The performance of the proposed approach is extensively analyzed on two datasets, the real dataset and the Restaurant dataset, with the help of the following evaluation metrics to ensure the efficiency of the approach.

2.5.1 Evaluation Metrics

A good quality duplicate detection process should have a high

precision and recall and also very low false positives and false negatives. The

performance is evaluated according to the following quality metrics

Precision (P): It is defined as the fraction of identified duplicate

pairs that are correct.

Precision (P) = Number of true duplicate records identified / Total number of duplicate records identified


Recall (R): It is the fraction of actual duplicate pairs that are

identified correctly in the input dataset.

Recall (R) = Number of true duplicate records identified / Total number of duplicate records present in the dataset

Recall would imply a measure of completeness, whereas precision

would imply a measure of exactness or fidelity.

F-measure (F): It gives equal weight to both precision and recall

and it is the harmonic mean of the two. The traditional F-measure or balanced

F-score is computed as,

F = 2(P × R) / (P + R)          (2.8)

False Positives (FP): The percentage of incorrect pairs of records

detected as duplicates relative to the actual number of duplicates is called the

false positive percentage. The false positive percentage can be greater than

100 if the approach produces many incorrect pairs. Lower false positive

percentage results in higher confidence in the approach.

False Positives (FP) = (Number of incorrect pairs of records wrongly identified as duplicates / Total number of duplicate records present in the dataset) × 100

False Negatives (FN): The false negative percentage is the

percentage of undetected duplicates in the input dataset relative to the number

of duplicates. Lower false negative percentage indicates good duplicate

detection.

False Negatives (FN) = (Number of undetected duplicate records / Total number of duplicate records present in the dataset) × 100
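The quality metrics above can be computed from the sets of detected and actual duplicate pairs as in the following sketch; the percentage forms of FP and FN follow the definitions given above, and the example pair sets are purely illustrative.

```python
def quality_metrics(detected_pairs, true_pairs):
    """Precision, recall, F-measure and FP/FN percentages for duplicate detection.

    detected_pairs : set of record-id pairs reported as duplicates
    true_pairs     : set of record-id pairs that are actual duplicates
    """
    true_detected = detected_pairs & true_pairs
    precision = len(true_detected) / len(detected_pairs) if detected_pairs else 0.0
    recall = len(true_detected) / len(true_pairs) if true_pairs else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)                       # equation (2.8)
    fp = 100.0 * len(detected_pairs - true_pairs) / len(true_pairs)    # false positive %
    fn = 100.0 * len(true_pairs - detected_pairs) / len(true_pairs)    # false negative %
    return precision, recall, f_measure, fp, fn


detected = {(1, 2), (3, 5), (6, 7)}
actual = {(1, 2), (3, 10), (6, 7), (4, 9)}
print(quality_metrics(detected, actual))
```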


2.5.2 Experimentation on Real Dataset

The input for the experimentation is the real dataset that is taken

to test the proposed approach. The input dataset contains attributes such as

“name”, “address”, “contact number”, “e-mail id” of clients of a certain

organization. The dataset has a total of 1300 records, of which 700 are used for training and the remaining 600 for testing. At first, the training dataset is converted into feature vectors that are directly given to the ANFIS for training. The structure of the trained ANFIS is shown in Figure 2.4.

Figure 2.4 The structure of the trained ANFIS of the proposed

approach

The ANFIS automatically generates the fuzzy rules that are shown

in the Figure 2.5.


Figure 2.5 Fuzzy rules generated from the ANFIS

2.5.3 Performance Evaluation on Real Dataset

Experiment 2.1: F-measure

The performance of the proposed approach is evaluated using the evaluation metrics discussed in Section 2.5.1.

Table 2.2 Results of quality metrics for the proposed approach

Approach | R (%) | P (%) | F (%) | FP (%) | FN (%)
Proposed work without clustering | 81 | 72 | 76 | 27 | 18
Proposed work after applying K-means clustering | 84 | 75 | 79 | 28 | 16


Figure 2.6 Evaluation metrics obtained for the proposed approach

From Table 2.2 and Figure 2.6, it can be observed that the performance of the proposed technique with clustering is better in terms of high recall and precision and also low false positives and false negatives. The F-measure of the proposed approach with clustering is 3 percentage points higher than that of the work without clustering.

Experiment 2.2: Computation Time

The time taken for duplicate record detection differs for different input records and also varies with the number of records; a larger number of records in the input dataset requires more time for comparison. The main objective of using clustering is to reduce the time taken for comparison. From Table 2.3 and Figure 2.7, it is clear that clustering reduces the computation time.


Table 2.3 Performance in terms of computation time

Records | Time of proposed approach with clustering (sec) | Time of proposed approach without clustering (sec)
200 | 42 | 86
300 | 92 | 126
400 | 136 | 183
500 | 234 | 288

Figure 2.7 Performance in terms of computation time

2.5.4 Performance Evaluation on Restaurant Dataset

The Restaurant dataset used in the proposed approach is taken from the RIDDLE repository for analysis. Table 2.4 and Figure 2.8 show the performance of the proposed approach with and without clustering, evaluated on the basis of accuracy on the Restaurant dataset. The work with K-means clustering has achieved an F-measure approximately 25 to 30 percentage points higher than the approach without clustering.


Table 2.4 F-measure of the proposed approach in the Restaurant dataset

Iterations | F-measure of proposed approach with clustering | F-measure of proposed approach without clustering
50 | 0.8178 | 0.5455
100 | 0.8571 | 0.5488
150 | 0.7926 | 0.5432

Figure 2.8 F-measure of the proposed approach

2.5.5 Comparative Analysis

Table 2.5 shows the comparative analysis of the proposed approach against the approach of N. E. Matsakis (2010). Table 2.5 and Figure 2.9 show that the proposed method outperforms the existing method in the precision and F-measure values: the precision of the proposed approach is approximately 35 percentage points higher than that of the existing method, and the F-measure is approximately 26 percentage points higher.


Table 2.5 Comparative performance of the proposed approach in the Restaurant dataset

Approach | Precision | Recall | F-measure
N. E. Matsakis's approach | 0.262 | 0.975 | 0.4130
Proposed approach | 0.607 | 0.758 | 0.674

Figure 2.9 Comparative analyses on quality metrics

2.6 SUMMARY

This chapter has presented a domain-independent approach to detect duplicate records in large databases. The approach makes use of ANFIS and similarity functions to detect the duplicate records. The K-means clustering method has been used along with the approach to reduce the time taken for each comparison and thereby improve duplicate detection. Furthermore, to accurately identify the duplicate records, the similarity has been computed with the help of the Levenshtein distance, cosine similarity and MD5. Finally, the experimentation has been carried out using a real-life dataset and the Restaurant dataset, and the performance of the proposed approach has been evaluated using the evaluation metrics. The experimental evaluation has ensured that the proposed approach detects duplicates efficiently while also reducing the computation time.