Building similarentityrecognizerv1
© 2012 WIPRO LTD | WWW.WIPRO.COM | INTERNAL
Building Similar Entity Recognizers
By Arthi Venkataraman
Agenda
Similar Entity Detection Scenarios
Challenges, Techniques and Algorithms
Semantic applicability
Big Data Challenges and Solution
Sample Results
Scenario 1 - Fraud Detection in Insurance Claims
P1: Tom Harold makes a claim on Policy 1
P2: Tom H makes a claim on Policy 2
P3: T Harold makes a claim on Policy 3
Are P1, P2 and P3 the same person?
Is there FRAUD?
Scenario 2 – Cross Sell Potential Detection in Insurance
Does Tom Harold hold a policy in any other system? What policies does he hold? Is there potential for cross-sell?
Tom Harold holds Policy 1 in System 1
He is a high net-worth customer
Example features for different people
Person 1
• First Name – Tom
• Middle Name -
• Last Name - Harold
• Date of Birth – 20/10/1987
• Address - 1, MG Road, Bangalore – 56
Person 2
• First Name – Tom
• Middle Name - Harry
• Last Name - Harold
• Date of Birth – 20/10/1987
• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056
Person 3
• First Name – Tom
• Last Name - Harold
• Date of Birth – 20/10/1988
• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056
Questions:
• Is Person 1 the same as Person 2?
• Is Person 2 the same as Person 3?
• Is Person 1 the same as Person 3?
Similar Entity Detection Challenges
A quick manual inspection of the Person 1 and Person 2 feature data is enough for a human to conclude that Person 1 is the same as Person 2
Not so trivial for a machine
Weightages must be arrived at for the different features
Code is needed to decide whether the values of a feature for Person 1 and Person 2 are similar or different
A simple string comparison is not sufficient - is MG Road the same as Mahatma Gandhi Rd?
Actual data will have spelling mistakes, missing data and wrongly entered data. For example, 20/10/1987 could be entered as 20/10/1988, or the field itself could be empty
Hence other techniques, such as machine learning and semantic techniques, are needed
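The per-feature comparison described above can be sketched in Python as follows. This is a minimal illustration, not the approach used in the talk: the abbreviation table, the function names and the choice of difflib for fuzzy matching are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {"mg": "mahatma gandhi", "rd": "road"}


def normalize_address(address):
    """Lower-case, replace punctuation with spaces and expand known abbreviations."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in address.lower())
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)


def string_similarity(a, b):
    """Return a similarity between 0 and 1, or None when either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a, b).ratio()


def address_similarity(a, b):
    """Compare two addresses after normalization; None when either is missing."""
    if not a or not b:
        return None
    return string_similarity(normalize_address(a), normalize_address(b))
```

With this table, "1, MG Road, Bangalore" and "1, Mahatma Gandhi Rd, Bangalore" normalize to the same string, while a raw comparison of "MG Road" and "Mahatma Gandhi Rd" does not match exactly. Returning None for an empty field keeps missing data distinct from a genuine mismatch.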
Similar Entity Detection Methodology
Step 1
• Identify relevant features
Step 2
• Extract values for the features
Step 3
• Create a model which can classify the two entities as same or different
Step 4
• Use the model to classify future customer pairs
Given two entities, how can we say whether they are the same?
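The four steps can be sketched end to end as below. The feature names, weights and threshold are hypothetical, and the hand-set weighted score only stands in for the model of Step 3; in practice that model would be learned from data.

```python
from difflib import SequenceMatcher

# Step 1: relevant features, with hypothetical weights (not taken from the talk).
FEATURE_WEIGHTS = {"first_name": 0.2, "last_name": 0.3, "dob": 0.3, "address": 0.2}


def extract_similarities(record_a, record_b):
    """Step 2: one 0..1 similarity per feature; missing values are skipped."""
    similarities = {}
    for feature in FEATURE_WEIGHTS:
        a, b = record_a.get(feature), record_b.get(feature)
        if a and b:
            similarities[feature] = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return similarities


def classify_pair(record_a, record_b, threshold=0.85):
    """Steps 3 and 4: a hand-set weighted score standing in for a trained model."""
    similarities = extract_similarities(record_a, record_b)
    total_weight = sum(FEATURE_WEIGHTS[f] for f in similarities)
    if total_weight == 0:
        # Nothing comparable: this sketch treats the pair as different.
        return "Different"
    score = sum(FEATURE_WEIGHTS[f] * s for f, s in similarities.items()) / total_weight
    return "Same" if score >= threshold else "Different"
```

Dividing by the weight of the features actually compared means an empty field neither helps nor hurts a pair. That is one possible design choice; a learned model could instead treat missing values as a signal of their own.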
Supervised Learning model
Labeled pre-identified customer pairs data as inputs
Values for different features for each of the customer pairs
Each customer pair is tagged as Same, Probably Same, Different
A supervised algorithm is chosen (the actual algorithm depends on the data characteristics)
The tagged data is fed as input
Output is the model
The model will classify a new customer pair into one of the identified categories
Model quality can be measured using Precision, Recall, Accuracy and F-Score
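As a rough sketch of how these measures are computed, the function below compares predicted tags with the true tags for one category. The function name and the default category are assumptions. With three tags (Same, Probably Same, Different), precision, recall and F-score would be computed once per tag, treating that tag as the positive class.

```python
def binary_metrics(actual, predicted, positive="Same"):
    """Precision, recall, F1 for one category, plus overall accuracy."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

Precision answers "of the pairs tagged Same, how many really were", while recall answers "of the truly Same pairs, how many were found". For fraud detection a missed match and a false match carry different costs, so the two are worth reporting separately.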
Supervised Learning model example
• Live example of how to classify a given set of customer records using Supervised Learning
Un-Supervised Learning model
In many cases there is no pre-labeled data
In this case we would need to choose an Un-supervised learning model
The model will automatically detect patterns in the data and cluster the data points into different clusters
Any newly added customer pair would be placed in the right cluster
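One way to picture this is k-means with two clusters over a single overall similarity score per customer pair. The sketch below is an illustration only: the talk does not name a clustering algorithm, and the function names and cluster labels are assumptions.

```python
def two_means(scores, iterations=20):
    """Cluster 1-D similarity scores into a low and a high group (k-means, k=2)."""
    low, high = min(scores), max(scores)
    for _ in range(iterations):
        low_group = [s for s in scores if abs(s - low) <= abs(s - high)]
        high_group = [s for s in scores if abs(s - low) > abs(s - high)]
        if not low_group or not high_group:
            break
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break  # centroids stopped moving
        low, high = new_low, new_high
    return low, high


def assign_cluster(score, centroids):
    """Place a new pair's score in the cluster with the nearest centroid."""
    low, high = centroids
    return "low-similarity" if abs(score - low) <= abs(score - high) else "high-similarity"
```

The algorithm only finds the groups. Reading the high-similarity cluster as "probably the same customer" is an interpretation added afterwards and would need checking against a sample of the data.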
Un-Supervised Learning model example
• Live example of how to classify a given set of customer records using Un-Supervised Learning
Continuous Learning
In many cases there will be a small set of labeled data and a very large set of un-labeled data
An initial model will be created using the small labeled data set
As more labeled data becomes available the model will evolve through continuous learning and become better at the classification
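A minimal sketch of this idea is an online learner that is first fitted on the small labeled set and then updated one example at a time as new labels arrive. The perceptron used here is an assumption chosen for brevity, not the algorithm from the talk. Inputs are per-feature similarity scores and the label is 1 for same, 0 for different.

```python
class OnlinePairClassifier:
    """Perceptron over per-feature similarity scores, updated one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, similarities):
        activation = self.bias + sum(w * x for w, x in zip(self.weights, similarities))
        return 1 if activation > 0 else 0

    def learn(self, similarities, label):
        """Adjust the weights only when the current prediction is wrong."""
        error = label - self.predict(similarities)
        if error:
            step = self.learning_rate * error
            self.bias += step
            self.weights = [w + step * x for w, x in zip(self.weights, similarities)]

    def fit(self, examples, epochs=50):
        """Build the initial model from the small labeled set."""
        for _ in range(epochs):
            for similarities, label in examples:
                self.learn(similarities, label)
```

Usage would be to call fit once on the initial labeled pairs and then call learn each time a reviewer confirms or rejects a match, so the model keeps adjusting without being retrained from scratch.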
Semantic Techniques applicability
Semantic similarity scoring for features
• Feature - List of Games played
• Person 1 plays – Racquet Sports
• Person 2 plays – Lawn Tennis
• Using semantic comparison we can see that there is a high similarity between person 1 and person 2 on the List of Games played feature
Extraction of features from different data sources
• Similar features named differently
Associating customers in different data sources as same or different
• Flexible and easy addition of new relationships
Ease of adding additional data sources
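The semantic similarity score for the games example can be illustrated with a small concept hierarchy. The taxonomy below is hypothetical, and the score is a Wu-Palmer style measure (twice the depth of the deepest shared ancestor, divided by the sum of the two concepts' depths), chosen here as one common option rather than the method used in the talk.

```python
# Hypothetical child -> parent taxonomy; a real system would use an ontology.
PARENT = {
    "lawn tennis": "racquet sports",
    "badminton": "racquet sports",
    "racquet sports": "sports",
    "sports": "games",
    "chess": "board games",
    "board games": "games",
}


def path_to_root(concept):
    """List the concept followed by its ancestors, ending at the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def semantic_similarity(a, b):
    """Wu-Palmer style score in 0..1; 0.0 when the concepts share no ancestor."""
    path_a, path_b = path_to_root(a.lower()), path_to_root(b.lower())
    common = next((c for c in path_a if c in path_b), None)
    if common is None:
        return 0.0
    depth_common = len(path_to_root(common))  # the root has depth 1
    return 2 * depth_common / (len(path_a) + len(path_b))
```

In this taxonomy Racquet Sports and Lawn Tennis score 6/7, while Lawn Tennis and Chess score 2/7, because they only meet at the root. A plain string comparison would rate both pairs as unrelated.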
Large Data handling challenges
Entity similarity is a pairwise operation
If there are n entities then there are n*(n-1)/2 pairwise comparisons to be done (n*(n-1) if each ordered pair is counted)
Also, within each comparison every feature pair has to be compared
These are highly time-consuming operations
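To make the growth concrete: counting each pair once, n entities give n*(n-1)/2 comparisons, and counting both orders gives n*(n-1). The short sketch below checks the closed form against explicit enumeration. The function names are illustrative.

```python
from itertools import combinations


def unordered_pair_count(n):
    """Number of distinct pairs among n entities: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def feature_comparison_count(n, features_per_entity):
    """Each pair needs one comparison per feature."""
    return unordered_pair_count(n) * features_per_entity


# Cross-check the closed form against explicit enumeration for a small n.
small_n = 5
enumerated = sum(1 for _ in combinations(range(small_n), 2))
```

For one million customers this is roughly 5 x 10^11 pairs before the per-feature work is even counted, which is why a single machine comparing every pair does not scale.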
Large Data handling ideas
Use of Apache Mahout
• Split the comparison across m different machines
• Each machine now handles n/m customers
• Nearly an m-times speed-up
Batch-time comparisons and tagging
• Reduces the run-time response needed to find similar entities
Incremental addition of new customer pairs
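The idea of splitting the work can be illustrated locally without any cluster framework. The sketch below does not use Apache Mahout or its API. It only shows, under the assumption that pairs can be compared independently, how the pairwise comparisons might be dealt out into m roughly equal work lists, one per machine.

```python
from itertools import combinations


def partition_pairs(entity_ids, m):
    """Deal the unordered pairs round-robin into m roughly equal work lists."""
    chunks = [[] for _ in range(m)]
    for index, pair in enumerate(combinations(entity_ids, 2)):
        chunks[index % m].append(pair)
    return chunks
```

Because the chunk sizes differ by at most one pair, each worker gets about 1/m of the comparisons, which is where the near m-times speed-up comes from. Coordination and data transfer costs, ignored here, are what keep it from being exactly m.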
Sample Metrics from our experiments
• Discussion on the sample metrics from our experiments
• Learnings from the same
– Which algorithm and method was more apt under different circumstances
Thank You