CHAPTER 2
DUPLICATE RECORD DETECTION USING ANFIS
2.1 INTRODUCTION
The problem of duplicate detection is to determine whether the same
real-world object is represented by two or more distinct entries in a
database. Duplicate detection is otherwise known as record linkage or record
matching. It is an extensively researched topic of vital importance in fields
such as master data management, data warehousing and ETL (Extraction,
Transformation and Loading), customer relationship management, and data
integration (Ahmed K. Elmagarmid et al 2007).
In principle, each candidate record must be compared with all the
others, so the first difficulty stems from the quadratic nature of the problem.
Several solutions have been proposed to decrease the number of comparisons
in order to tackle this problem and thereby enhance the efficiency of duplicate
detection. The common objective is to restrict comparison to objects that
have at least a few readily identifiable similarities and to avoid comparisons
of immensely dissimilar objects. A common technique to avoid the costly
comparison of all pairs of records is the careful grouping of the records into
smaller subsets; provided that duplicate records occur in the same partition,
it suffices to compare all the pairs within each partition (Uwe Draisbach and
Felix Naumann 2009).
Clustering is the categorization of objects into diverse groups or,
more exactly, the division of a data set into subsets (clusters), so that the data
in each subset (ideally) share some common traits, frequently proximity
according to some defined distance measure. Blocking increases the efficiency
of duplicate detection and substantially improves the speed of the comparison
process (Jebamalar Tamilselvi and Saravanan 2010). The term 'blocking'
refers to the process of splitting the database into a set of mutually exclusive
subsets (blocks) such that no matches occur across blocks.
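To make the blocking idea concrete, the following is a minimal Python sketch, assuming records are plain dictionaries; blocked_pairs and the name-prefix blocking key are illustrative choices, not part of any cited system.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, blocking_key):
    """Group records by a blocking key and yield candidate pairs
    only within each block, avoiding the quadratic all-pairs scan."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        # Records in different blocks are never compared.
        yield from combinations(block, 2)

# Illustrative blocking key: the lowercased first three letters of the name.
records = [{"name": "Hariharan"}, {"name": "Harihara"}, {"name": "Raghevendra"}]
for a, b in blocked_pairs(records, lambda r: r["name"][:3].lower()):
    print(a["name"], "<->", b["name"])
```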
The next problem of duplicate detection is the definition of a
suitable similarity measure to determine whether a given pair is truly a
duplicate. Conventional methods employ combinations of edit distance,
TF-IDF weighting and other text-based similarity measures. Further, pairs
must be classified by means of a similarity threshold into duplicates, i.e.,
pairs whose similarity is greater than or equal to the threshold, and
non-duplicates, i.e., pairs whose similarity is less than the threshold
(Uwe Draisbach and Felix Naumann 2009).
Two common types of duplicate detection methods are the
fingerprint-based and full-text-based methods (Hui Yang and Jamie Callan
2006). Most of the current approaches use textual similarity, typically
quantified with a similarity function such as edit distance or cosine
similarity, to determine whether two representations are duplicates.
Hamid Haidarian Shahri and Ahmad Abdollahzadeh Barforush
(2004) have proposed an adjustable fuzzy expert system for the data cleaning
process which identifies and removes fuzzy duplicates while avoiding
repetitive work. Some of the important merits of their approach for use in
diverse information systems include adaptability, simplicity of use,
extendibility, quick development time and run-time efficiency. Eugenio
Cesario et al (2005) have presented an incremental algorithm to cluster
duplicate tuples in large databases, which allocates any new tuple t to the
cluster of database tuples that has the greatest similarity with t. The
allocation of highly similar objects to the same buckets has been the
objective of the hash-based indexing method at the core of their approach.
Experimental validation has shown that the proposed method attains a
substantial improvement in efficiency over modern index structures for
proximity searches in metric spaces.
Hamid H. Shahri and Saied H. Shahri (2006) have proposed a
duplicate-elimination framework that permits easy and flexible data cleaning
without requiring users to write any code. The uncertainty of the problem is
handled inherently by exploiting fuzzy inference, and the framework adjusts
to the specific concept of similarity suitable for each domain through distinct
machine learning capabilities. The framework, which is extensible and
accommodating, permits users to operate with or without training data.
Moreover, their framework permits the fast implementation of several
preceding methods for duplicate elimination.
Xiao Mansheng et al (2009) have proposed a grouping-based
sub-fuzzy clustering property optimization method. The method first
processes the properties of the grouped records to obtain a representation of
each group, effectively decreasing the dimension of the properties, and then
detects approximately duplicated records within the groups using a
similarity comparison method. Both theoretical analysis and experimental
results have shown that the method solves the problem of approximate
duplicate record detection in large datasets and improves the precision and
efficiency of detection.
Ying Pei et al (2009) have proposed an enhanced K-medoids
clustering algorithm (IKMC) to address the problem of approximate duplicate
record identification. It employs the edit-distance method and feature
weights, considering each record in the database as a separate data object, to
obtain the similarity measure among the records. It then identifies the
duplicate records by clustering these similarity values. The algorithm avoids
the huge number of I/O operations used by the conventional "sort/merge"
algorithm for sequencing by comparing the similarity value with a preset
similarity threshold and automatically adjusting the number of clusters.
Experiments have shown that the algorithm possesses excellent detection
precision and high availability.
Hannaneh Hajishirzi et al (2010) have proposed a near-duplicate
document detection technique that can be easily adjusted for a specific
domain. In their method, each document is represented as a real-valued
sparse k-gram vector, and the weights are trained to optimize for a particular
similarity function such as cosine similarity or the Jaccard coefficient. The
enhanced similarity measure is capable of reliably detecting near-duplicate
documents. In addition, efficient similarity evaluation is obtained by using a
locality-sensitive hashing scheme to map these vectors to a small number of
hash values serving as document signatures.
Automatic duplicate detection (Jie Song et al 2010, Jebamalar
Tamilselvi et al 2009) for a particular domain is complicated for two reasons.
First, duplicate representations are not identical, as they differ slightly in
their values. Second, the theoretically required comparison of all pairs of
records is impractical for huge quantities of data. In order to overcome these
problems, two major methods are available:
(i) Similarity measures can be used for comparing two records to
automatically detect duplicates, so that the effectiveness of
duplicate detection is improved by carefully selected similarity
measures.
(ii) Algorithms can be constructed to search for duplicates in extremely
huge quantities of data in such a way that the efficiency of
duplicate detection is improved by carefully designed algorithms
(Felix Naumann 2010).
Many approaches have been presented to address this problem.
Some of these approaches focus on the efficiency of the comparison process,
while others concentrate on the accuracy of the resulting records. In this
research, an efficient approach is employed using similarity functions and
ANFIS, which improves the accuracy of the duplicate detection process with
respect to other well-known measures. Here, ANFIS is trained to discriminate
between pairs of records corresponding to duplicates and non-duplicates using
the training vectors generated from these attribute similarities.
This approach adopts ANFIS and similarity functions to improve
duplicate detection in two phases, namely the training phase and the testing
phase. In the training phase, the dataset contains labeled duplicates and
non-duplicates which are trained separately using ANFIS. First, a pair of
duplicate or non-duplicate records is considered and the similarity is
computed using three similarity measures to obtain the similarity distance
between the two individual records. The similarity values of all the fields are
then combined to generate a feature vector. These steps are repeated for all
the pairs of records in the training dataset to obtain a set of feature vectors,
which then train the ANFIS.
In the duplicate detection phase, an efficient K-means clustering is
used to partition the dataset into small partitions based on some common
features. The efficiency of the duplicate detection process is increased by
reducing the time taken for the comparison of records, since the similarity
computations and other processes are performed only on the particular
cluster in which a record falls. Thus, to improve the quality of the database,
the duplicate and non-duplicate records are identified for the input dataset.
The experimentation is carried out on a real dataset containing duplicate
records, and the results demonstrate that the proposed approach improves
the accuracy of duplicate detection.
2.2 ADAPTIVE NEURO-FUZZY INFERENCE SYSTEMS
ANFIS is a neuro-fuzzy system that combines the learning
techniques of neural networks with the efficiency of fuzzy inference systems
(Esposito et al 2000). ANFIS uses a hybrid learning algorithm to identify the
parameters of Sugeno-type fuzzy inference systems: it combines the
least-squares method with the back-propagation gradient descent method.
The FIS membership function parameters are trained to emulate a given
training data set, and ANFIS can be invoked with optional parameters to
validate the model.
The ANFIS structure is similar to the neural network structure and is
based on the Takagi-Sugeno model. According to the Sugeno fuzzy model,
the rule set is as follows:

Rule 1: If $x$ is $A_1$ and $y$ is $B_1$, THEN $f_1 = p_1 x + q_1 y + r_1$

Rule 2: If $x$ is $A_2$ and $y$ is $B_2$, THEN $f_2 = p_2 x + q_2 y + r_2$
The training of the network involves a forward pass and a backward
pass. In the forward pass, the input vector is propagated layer by layer
through the network. In the backward pass, the error is sent back through the
network in a manner similar to back-propagation. Figure 2.1 shows the
architecture of ANFIS.
Figure 2.1 ANFIS architecture for a two rule Sugeno system
Layer 1 In this layer, every node i is an adaptive node with a node
function,

$O_{1,i} = \mu_{A_i}(x), \quad \text{for } i = 1, 2$    (2.1)

$O_{1,i} = \mu_{B_{i-2}}(y), \quad \text{for } i = 3, 4$    (2.2)
where $x$ (or $y$) is the input to node $i$ and $A_i$ (or $B_{i-2}$) is a linguistic label (such
as "small" or "large") associated with this node. In other words, $O_{1,i}$ is the
membership grade of a fuzzy set $A$ ($= A_1$, $A_2$, $B_1$ or $B_2$) and it specifies the
degree to which the given input $x$ (or $y$) satisfies the quantifier $A$. The
generalized bell function is an example of an appropriate parameterized
membership function for $A$.
$\mu_{A}(x) = \dfrac{1}{1 + \left| \dfrac{x - c_i}{a_i} \right|^{2 b_i}}$    (2.3)
where $\{a_i, b_i, c_i\}$ is the parameter set. The bell-shaped function varies
according to the values of these parameters. Thus, it presents various forms of
membership function for the fuzzy set A. Parameters used in this layer are
referred to as premise parameters.
Layer 2 In this layer, every node is a fixed node labeled $\Pi$, whose
output is the product of all the incoming signals,

$O_{2,i} = w_i = \mu_{A_i}(x)\,\mu_{B_i}(y), \quad i = 1, 2$    (2.4)
The output of each node represents the firing strength of a rule, and
any other T-norm operator that performs fuzzy AND can be used as the node
function in this layer.
Layer 3 In this layer, every node is a fixed node labeled $N$. The $i$th
node calculates the ratio of the $i$th rule's firing strength to the sum of all
rules' firing strengths,

$O_{3,i} = \bar{w}_i = \dfrac{w_i}{w_1 + w_2}, \quad i = 1, 2$    (2.5)
For convenience, the outputs of this layer are called normalized
firing strengths.
Layer 4 In layer 4, every node $i$ is an adaptive node with a node
function,

$O_{4,i} = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)$    (2.6)

where $\bar{w}_i$ is the normalized firing strength from layer 3 and $\{p_i, q_i, r_i\}$ is the
parameter set of this node. Parameters in this layer are referred to as
consequent parameters.
Layer 5 The single node in this layer is a fixed node labeled $\Sigma$,
which computes the overall output as the summation of all incoming signals,

$O_{5,i} = \sum_{i} \bar{w}_i f_i = \dfrac{\sum_{i} w_i f_i}{\sum_{i} w_i}$    (2.7)
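The five layers can be traced end to end in a few lines of code. The following is a minimal Python sketch of the forward pass for the two-rule Sugeno system of Figure 2.1, with illustrative (untrained) parameter values; in ANFIS proper, the premise and consequent parameters are fitted by the hybrid learning algorithm described above.

```python
def bell_mf(x, a, b, c):
    # Generalized bell membership function of Equation (2.3).
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def anfis_forward(x, y, premise_A, premise_B, consequent):
    """Forward pass of the two-rule Sugeno ANFIS of Figure 2.1,
    following Equations (2.1)-(2.7)."""
    # Layer 1: membership grades (Equations 2.1-2.2).
    mu_A = [bell_mf(x, *p) for p in premise_A]
    mu_B = [bell_mf(y, *p) for p in premise_B]
    # Layer 2: firing strengths w_i = mu_Ai(x) * mu_Bi(y) (Equation 2.4).
    w = [mu_A[i] * mu_B[i] for i in range(2)]
    # Layer 3: normalized firing strengths (Equation 2.5).
    w_bar = [wi / (w[0] + w[1]) for wi in w]
    # Layer 4: rule outputs p_i*x + q_i*y + r_i, weighted below (Equation 2.6).
    f = [p * x + q * y + r for (p, q, r) in consequent]
    # Layer 5: overall output (Equation 2.7).
    return sum(wb * fi for wb, fi in zip(w_bar, f))

# Illustrative (untrained) parameters; ANFIS fits them with the hybrid
# least-squares / back-propagation algorithm.
premise_A = [(2.0, 2.0, 0.0), (2.0, 2.0, 5.0)]    # (a_i, b_i, c_i) for A_1, A_2
premise_B = [(2.0, 2.0, 0.0), (2.0, 2.0, 5.0)]    # (a_i, b_i, c_i) for B_1, B_2
consequent = [(1.0, 1.0, 0.0), (2.0, -1.0, 1.0)]  # (p_i, q_i, r_i)
print(anfis_forward(1.5, 3.0, premise_A, premise_B, consequent))
```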
2.3 SIMILARITY FUNCTIONS UTILIZED IN THE PROPOSED
APPROACH
Recognizing similarities in large collections of data is a major issue
in the context of duplicate record detection. Similarity between records and
fields is determined using 'similarity functions'. Different similarity
computation functions are available for different data types. Therefore, the
user should consider the data type of the attribute, such as numerical or
string, while selecting the function (Israr Ahmed et al 2010). Thus, similarity
functions appropriate for a given domain are essential for obtaining high
accuracy. Similarity functions can be categorized into three groups, namely
(i) Character-based similarity functions, which allow contiguous
sequences of mismatched characters, e.g., edit distance, Jaro distance,
(ii) Text (token)-based similarity functions, which view strings not as
contiguous sequences but as unordered bags of elements, e.g., cosine
similarity, Jaccard index, and
(iii) Fingerprinting techniques.
Given a set of records $R = \{r_1, r_2, r_3, \ldots, r_n\}$, where each record is a set
of tokens taken from a database $D = \{t_1, t_2, t_3, \ldots, t_m\}$, the similarity search
problem is aimed at finding all pairs of records $r_x, r_y \in R$ such that
$sim(r_x, r_y) \geq \theta$, where $\theta$ is a given similarity threshold value. The similarity
between two non-empty records $r_x$ and $r_y$ can be measured by using the
following functions.
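As a baseline, the similarity search problem can be stated directly in code. The sketch below is the naive quadratic formulation, assuming sim is any of the similarity functions described next; this is exactly the cost that blocking and clustering are later used to avoid.

```python
from itertools import combinations

def similar_pairs(records, sim, theta):
    """Naive similarity search: return all pairs (r_x, r_y) whose
    similarity sim(r_x, r_y) meets the threshold theta."""
    return [(rx, ry) for rx, ry in combinations(records, 2)
            if sim(rx, ry) >= theta]
```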
2.3.1 Levenshtein Distance - Character-Based Similarity
Levenshtein Distance is described as the least number of edit
operations necessary to convert one string into another. Insertion, deletion and
substitution of characters are the permitted edit operations and unit cost is
assigned to each of these edit operations. The minimum number of such
operations can be computed using dynamic programming in time equal to the
product of the string lengths. For example, the character-based edit distance
between the strings s = "company" and t = "corporation" is computed as
follows. There are several edit operation sequences that transform s into t,
but a Levenshtein distance of '7' between s and t implies that a minimum of
'7' operations is required to transform s into t, as unit cost is assigned to
each operation. The following illustrates the seven edit operations applied to
transform s into t.
1. Substitute "m" with "r": "company" → "corpany"
2. Insert "o": "corpany" → "corpoany"
3. Insert "r": "corpoany" → "corporany"
4. Insert "t": "corporany" → "corporatny"
5. Insert "i": "corporatny" → "corporatiny"
6. Insert "o": "corporatiny" → "corporationy"
7. Delete "y": "corporationy" → "corporation"
Thus, the Levenshtein distance function takes two strings, s of length 'n'
and t of length 'm', as inputs and computes the edit distance between the
strings.
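A minimal dynamic-programming implementation of this computation might look as follows; it reproduces the distance of 7 for the example above.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insert, delete and substitute,
    computed by dynamic programming in O(len(s) * len(t)) time."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                # delete all of s[:i]
    for j in range(m + 1):
        d[0][j] = j                                # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m]

print(levenshtein("company", "corporation"))  # prints 7
```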
2.3.2 Cosine Similarity - Text-Based Similarity
Cosine similarity is a similarity measure between two vectors that
measures the cosine of the angle between them. The cosine similarity between
the two vectors of attributes, A and B, can be expressed by means of a dot
product and magnitude, as discussed in Section 1.2.3.
2.3.3 MD5 - Fingerprint-Based Similarity
For fingerprint computation, a standard hashing approach such as
MD5 hashing, which efficiently generates a fingerprint for the input message,
is employed. MD5 (Message-Digest algorithm 5), which produces a 128-bit
(16-byte) hash value, is an extensively used cryptographic hash function.
2.4 AN APPROACH TO DUPLICATE RECORD DETECTION
USING ANFIS
Let D be a database that contains records composed of k different
fields. Given a database $D = \{R_1, R_2, \ldots, R_n\}$, where each record $R_i$
includes k fields such as Name, Address and Pin Code, the proposed approach
shown in Figure 2.2 has two phases, namely the training phase and the
duplicate detection phase.
Figure 2.2 The workflow of the proposed approach
At first, the proposed approach takes the training dataset and
directly computes the similarity values for each field using m different
similarity metrics. The similarity values are then combined to generate a
feature vector which gives the record-level similarity. Thus, the feature vector
is generated for all pairs of records using these similarity values and is then
given as input to the learning algorithm, ANFIS. Hence, ANFIS is trained on
both the duplicate and the non-duplicate records individually by generating
the feature vectors to discriminate between duplicate and non-duplicate
records during the testing phase.
Table 2.1 shows some sample records in the real dataset which is
used to test the proposed approach.
Table 2.1 Sample real dataset

S.No | Name | Address | Pin code | Contact No. | E-mail
1 | Mr.Rajendra Sharma | 403 Vandit Appartment, Bhaikaka Nagar, Thaltej, Ahmedabad | 380059 | 9829061356 | [email protected]
2 | Mr Bhagirath Patel, Dipali Patel | 6th G/F, Dk House, Mithakali, Ahmedabad | 380006 | - | [email protected]
3 | Mr.Hariharan | 101, Shivam Complex, Nr.Silicon Tower, Law Garden, Ahmedabad | 38000945 | 9376124916 / 9820217300 | hariharian@pacesetters.co.in
4 | - | 401, Neelkamal Complex, Nr Shreeji Baug Society, Navrangpura, Ahmedabad | - | 9924067888 | -
5 | Mr Raghevendra S | 36, Second Floor, 2nd Main, Devi Park Extension, Bangalore | 560003 | 9892531242 | [email protected]
6 | Mr.Soloman Davis | # 2, 14th Main, Oppo Gautham College, Near Shankarmutt, Mahalakshmipuram, Bangalore | 5600 | 9810522125 | [email protected]
7 | Mr Raghevendra S | 36, Second Floor, 2nd Main, Devi Park Extension, Bangalore | 560003 | 9892531242 | [email protected]
8 | Mr Idris Khan | - | - | 9886060482 | -
9 | Mr Pravin Sinha | K No. 25, 16th Cross, 23rd Main, J P Nagar, 5th Phase, Bengalore | 560078 | 9986222999 | exceed_credit@rediffmail
10 | Mr.Hariharan | 102, Shivam Complex, Nr.Silicon Tower, Law Garden, Ahmedabad | 380009 | 9376124916 / 9820217300 | -
2.4.1 Similarity Computation for All Pairs of Records
Similarity computation is carried out by applying the similarity
functions to each record field. Each function compares the corresponding
fields of the two records and assigns a similarity value for each field.
For better duplicate detection, accurate similarity functions are very important
to calculate the distance between the records. Levenshtein distance, Cosine
similarity and MD5 are the three similarity measures used in the proposed
approach. The three measures are computed for all the attributes of record
pairs because different similarity operations have varying significance in
different domains. Here, the similarity computation process is explained for
the two records $R_3$ and $R_5$ present in Table 2.1. The three distance
measures for the "Name" attribute of these two records are computed as follows.
Levenshtein distance
The chosen name fields of the records are "Hariharan" and
"Raghevendra". The Levenshtein distance is computed by calculating the
minimum number of operations required to transform one string into the
other; these operations are the replacement, insertion or deletion of a
character. The Levenshtein distance between the "Name" fields of the records
is 7, as seven edit operations are required to change the word "Hariharan"
into "Raghevendra".
Cosine similarity
The cosine similarity between the name fields "hariharan" and
"raghevendra" of the two records is calculated as follows. First, the dimension
of both strings is obtained by taking the union of the characters of the two
strings "hariharan" and "raghevendra" as (a, d, e, g, h, i, n, r, v), and the
frequency-of-occurrence vectors of the two strings are then calculated, i.e.
"hariharan" = (3, 0, 0, 0, 2, 1, 1, 2, 0) and "raghevendra" = (2, 1, 2, 1, 1, 0, 1,
2, 1). The dot product and the magnitudes of both strings are then found. In
this example, the dot product is 13 and the magnitudes of the two strings are
4.3589 and 4.1231 respectively. The product of the magnitudes of both
strings is then calculated as 17.9722, giving a cosine similarity of
13/17.9722 = 0.7233. This indicates that the similarity between the two
strings is 72%.
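The computation above can be checked with a short sketch that builds character-frequency vectors and applies the cosine formula; char_cosine is an illustrative helper name.

```python
from collections import Counter
from math import sqrt

def char_cosine(s, t):
    """Cosine similarity over character-frequency vectors,
    as in the worked example above."""
    cs, ct = Counter(s), Counter(t)
    dot = sum(cs[ch] * ct[ch] for ch in set(cs) | set(ct))
    return dot / (sqrt(sum(v * v for v in cs.values())) *
                  sqrt(sum(v * v for v in ct.values())))

print(round(char_cosine("hariharan", "raghevendra"), 4))  # 0.7233, i.e. 72%
```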
MD5
The MD5 hashing technique is used for generating the message digest
of all the fields of all the records. Here, it first computes the 32-character
hexadecimal message digest of the two record fields using hashing. It then
uses the edit distance operation to calculate the distance between the
generated message digest values of the two fields. For example, the message
digest values for the name fields "hariharan" and "raghevendra" of the two
records are as follows.
The message digest of "hariharan":

Hash('hariharan', 'MD5') = 0e23acc4a7892013e80ac38f0276a63f

The message digest of "raghevendra":

Hash('raghevendra', 'MD5') = b2d6252fda917e2771f6ef765391d397

The distance between the two name fields:

EditDist('0e23acc4a7892013e80ac38f0276a63f', 'b2d6252fda917e2771f6ef765391d397') = 29
Finally, the distance between the name fields of the two records
is 29. When comparing a pair of records, all three similarity measures are
computed for each field to obtain the similarity. Thus, the duplicate records
can be detected effectively by using these three measures per record.
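A corresponding sketch using Python's standard hashlib module is given below; it reuses the levenshtein function from the sketch in Section 2.3.1. Note that the exact digests depend on the casing and encoding of the input, so the printed values need not match the ones quoted above.

```python
import hashlib

def md5_hex(s):
    """32-character hexadecimal MD5 digest (128 bits) of a string."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()

d1 = md5_hex("hariharan")
d2 = md5_hex("raghevendra")
# Edit distance between the two digests serves as the fingerprint distance.
print(d1, d2, levenshtein(d1, d2))
```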
2.4.2 Combining Similarity Across Multiple Fields
As the computation of similarity between the fields can vary
significantly depending on the domain and the specific field under
consideration, the usual similarity functions may fail to capture the similarity
correctly. To attain accurate similarity computations, it is therefore necessary
to adapt the similarity measures for each field of the database with respect to
the particular data domain. Consequently, the similarity values obtained from
the different similarity measures are combined to compute the distance
between any two records. When considering a database D that contains
records composed of n different fields and a set of m distance metrics, the
similarity between any pair of records can be represented by an m-length
vector, as shown in Figure 2.3. Each component of the vector represents the
computed similarity value between the two records calculated using one of
the m distance metrics.
Figure 2.3 Computation of record similarity from individual field
similarities
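One plausible reading of this construction, assuming per-field comparison with each of the three measures of Section 2.3, is sketched below; the field names are hypothetical and the helpers are those defined in the earlier sketches.

```python
def record_feature_vector(rec_a, rec_b, fields, metrics):
    """Apply every similarity metric to every shared field of a record
    pair and concatenate the values into one feature vector."""
    return [metric(rec_a[f], rec_b[f]) for f in fields for metric in metrics]

# Hypothetical field names; the metrics are the three measures of Section 2.3.
fields = ["name", "address", "pincode"]
metrics = [
    char_cosine,                                        # cosine similarity
    levenshtein,                                        # edit distance
    lambda a, b: levenshtein(md5_hex(a), md5_hex(b)),   # MD5 fingerprint distance
]
```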
2.4.3 ANFIS Training
The ANFIS structure is trained in the proposed approach by
providing the set of feature vectors generated as described in Section 2.4.2.
The target output for a feature vector is a binary matrix that differentiates the
duplicate and non-duplicate record pairs based on the threshold value. For
efficient classification, fuzzy rules are automatically generated from the
neurons to classify the input data.
2.4.4 Clustering of Input Dataset
Clustering is the initial step for detecting the duplicated records.
The process groups the data records according to their similarity, so that the
records present in the input dataset are represented by a group of clusters.
This enables duplicate detection to be performed only on the data records in
the most relevant clusters, which automatically reduces the time complexity.
The clustering of data records is carried out using k-means clustering, a
widely accepted clustering method in the data mining community. The basic
steps employed in clustering are given below, followed by a brief sketch.
(1) Initialize k centroids, one for each cluster
(2) Compute the similarity of each of the k centroids with the data
records present in the dataset
(3) Allocate each data record to the cluster $C_i$ with which its
similarity is highest
(4) Update the k centroids
(5) Repeat Step 2 to Step 4 until there is no movement of the data
records between the clusters
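A minimal sketch of these steps over numeric feature vectors is given below, using Euclidean distance as a stand-in for the similarity measure.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means over numeric feature vectors, mirroring steps (1)-(5);
    nearest Euclidean distance stands in for highest similarity."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # step (1)
    for _ in range(max_iters):
        # Steps (2)-(3): assign every record to its nearest centroid.
        labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
        # Step (4): recompute each centroid as the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                            # step (5)
            break
        centroids = new
    return labels, centroids
```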
2.4.5 ANFIS in Duplicate Detection
In the duplicate detection phase, the comparison is done only for the
data records present within the same cluster, in order to reduce the number of
record comparisons. Pairs of records that fall under the same cluster are the
candidates for a full similarity computation. The record comparison is then
performed in the same way as in the training phase: the same similarity
metrics are used to calculate the distances between each pair of fields, thus
creating the distance feature vectors for ANFIS. The feature vectors are then
fed to ANFIS as input to obtain a binary matrix that distinguishes the
duplicate and the non-duplicate records based on the threshold value.
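Putting the pieces together, the detection phase might be sketched as follows; classify is a stand-in for the trained ANFIS, and the helpers come from the earlier sketches.

```python
from itertools import combinations

def detect_duplicates(records, labels, fields, metrics, classify, theta):
    """Duplicate detection phase: pairs are formed only inside each
    cluster, turned into feature vectors, and flagged by the trained
    classifier (a stand-in for the trained ANFIS)."""
    duplicates = []
    for c in set(labels):
        cluster = [r for r, lab in zip(records, labels) if lab == c]
        for ra, rb in combinations(cluster, 2):
            fv = record_feature_vector(ra, rb, fields, metrics)
            if classify(fv) >= theta:   # ANFIS output thresholded
                duplicates.append((ra, rb))
    return duplicates
```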
2.5 RESULTS AND DISCUSSION
The performance of the proposed approach is extensively analyzed
on two datasets, the real dataset and the Restaurant dataset, with the help of
the following evaluation metrics to ensure the efficiency of the approach.
2.5.1 Evaluation Metrics
A good quality duplicate detection process should have a high
precision and recall and also very low false positives and false negatives. The
performance is evaluated according to the following quality metrics
Precision (P): It is defined as the fraction of identified duplicate
pairs that are correct.

$P = \dfrac{\text{Number of true duplicate records identified}}{\text{Total number of duplicate records identified}}$

Recall (R): It is the fraction of actual duplicate pairs that are
identified correctly in the input dataset.

$R = \dfrac{\text{Number of true duplicate records identified}}{\text{Total number of duplicate records present in the dataset}}$
Recall would imply a measure of completeness, whereas precision
would imply a measure of exactness or fidelity.
F-measure (F): It gives equal weight to both precision and recall
and it is the harmonic mean of the two. The traditional F-measure or balanced
F-score is computed as,
$F = \dfrac{2 (P \cdot R)}{(P + R)}$    (2.8)
False Positives (FP): The percentage of incorrect pairs of records
detected as duplicates relative to the actual number of duplicates is called the
false positive percentage. The false positive percentage can be greater than
100 if the approach produces many incorrect pairs. Lower false positive
percentage results in higher confidence in the approach.
$FP = \dfrac{\text{Number of incorrect pairs of records wrongly identified as duplicates}}{\text{Total number of duplicate records present in the dataset}} \times 100$
False Negatives (FN): The false negative percentage is the
percentage of undetected duplicates in the input dataset relative to the number
of duplicates. Lower false negative percentage indicates good duplicate
detection.
$FN = \dfrac{\text{Number of undetected duplicate records}}{\text{Total number of duplicate records present in the dataset}} \times 100$
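Given the sets of actual and detected duplicate pairs, all five metrics can be computed directly; the sketch below assumes the dataset contains at least one true duplicate pair.

```python
def quality_metrics(true_pairs, found_pairs):
    """Precision, recall, F-measure (Equation 2.8) and the FP/FN
    percentages, from sets of actual and detected duplicate pairs.
    Assumes true_pairs is non-empty."""
    tp = len(true_pairs & found_pairs)
    p = tp / len(found_pairs) if found_pairs else 0.0
    r = tp / len(true_pairs)
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    fp = 100.0 * len(found_pairs - true_pairs) / len(true_pairs)
    fn = 100.0 * len(true_pairs - found_pairs) / len(true_pairs)
    return p, r, f, fp, fn
```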
2.5.2 Experimentation on Real Dataset
The input for the experimentation is the real dataset that is taken
to test the proposed approach. The input dataset contains attributes such as
"name", "address", "contact number" and "e-mail id" of clients of a certain
organization. The dataset has a total of 1300 records, of which 700 records
are used for training and the remaining 600 records are used for testing. At
first, the training dataset is converted into the feature vectors that are directly
given to the ANFIS for training. The structure of the trained ANFIS is shown
in Figure 2.4.
Figure 2.4 The structure of the trained ANFIS of the proposed
approach
The ANFIS automatically generates the fuzzy rules, which are shown
in Figure 2.5.
Figure 2.5 Fuzzy rules generated from the ANFIS
2.5.3 Performance Evaluation on Real Dataset
Experiment 2.1: F-measure
The performance of the proposed approach is evaluated using the
evaluation metrics discussed in Section 2.5.1.
Table 2.2 Results of quality metrics for the proposed approach

Approach | R (%) | P (%) | F (%) | FP (%) | FN (%)
Proposed work without clustering | 81 | 72 | 76 | 27 | 18
Proposed work after applying K-means clustering | 84 | 75 | 79 | 28 | 16
Figure 2.6 Evaluation metrics obtained for the proposed approach
From Table 2.2 and Figure 2.6, it can be observed that the
performance of the proposed technique with clustering is better in terms of
high recall and precision and also low false positives and false negatives.
The F-measure of the proposed approach with clustering is 3% higher than
that of the work without clustering.
Experiment 2.2: Computation Time
The time taken for duplicate record detection differs for different
input records and also varies with the number of records: a larger number of
records in the input dataset requires more time for comparison. The main
objective of using clustering is to reduce the time taken for the comparison.
From Table 2.3 and Figure 2.7, it is clear that clustering reduces the
computation time.
Table 2.3 Performance in terms of computation time

Records | Time of proposed approach with Clustering (Sec) | Time of proposed approach without Clustering (Sec)
200 | 42 | 86
300 | 92 | 126
400 | 136 | 183
500 | 234 | 288
Figure 2.7 Performance in terms of computation time
2.5.4 Performance Evaluation on Restaurant Dataset
The Restaurant dataset used in the proposed approach is taken from
the RIDDLE repository for analysis. Table 2.4 and Figure 2.8 show the
performance of the proposed approach with and without clustering,
evaluated on the basis of accuracy on the Restaurant dataset. The work
with K-means clustering has achieved an F-measure approximately 25%
higher than that of the approach without clustering.
Table 2.4 F-measure of the proposed approach in the Restaurant dataset

Iterations | F-measure of proposed approach with Clustering | F-measure of proposed approach without Clustering
50 | 0.8178 | 0.5455
100 | 0.8571 | 0.5488
150 | 0.7926 | 0.5432
Figure 2.8 F-measure of the proposed approach
2.5.5 Comparative Analysis
Table 2.5 shows the comparative analysis of the proposed
approach against the approach of N. E. Matsakis (2010). Table 2.5 and
Figure 2.9 show that the proposed method outperforms the existing method
in the Precision and F-measure values: the precision of the proposed
approach is 35% higher than that of the existing one, and the F-measure is
25% higher.
Table 2.5 Comparative performance of the proposed approach in the
Restaurant dataset

Approach | Precision | Recall | F-measure
N. E. Matsakis's approach | 0.262 | 0.975 | 0.4130
Proposed approach | 0.607 | 0.758 | 0.674
Figure 2.9 Comparative analysis on quality metrics
2.6 SUMMARY
This chapter has presented a domain-independent approach to detect
duplicate records in large databases. The approach makes use of ANFIS and
similarity functions in detecting the duplicate records. The K-means
clustering method has been used along with the approach to reduce the time
taken on each comparison and thereby improve duplicate detection.
Furthermore, to accurately identify the duplicate records, the similarity has
been computed with the help of the Levenshtein distance, Cosine similarity
and MD5. Finally, the experimentation has been carried out on the real
dataset and the Restaurant dataset, and the performance of the proposed
approach has been evaluated based on the evaluation metrics. The
experimental evaluation confirms that the proposed approach detects
duplicates efficiently while, at the same time, further reducing the time
incurred.