Building similarentityrecognizerv1

© 2012 WIPRO LTD | WWW.WIPRO.COM | INTERNAL

Building Similar Entity Recognizers

By Arthi Venkataraman


Agenda

Similar Entity Detection Scenarios

Challenges, Techniques and Algorithms

Semantic applicability

Big Data Challenges and Solution

Sample Results


Scenario 1 - Fraud Detection in Insurance Claims

P1: Tom Harold makes a claim on Policy 1

P2: Tom H makes a claim on Policy 2

P3: T Harold makes a claim on Policy 3

Are P1, P2 and P3 the same person?

Is there fraud?


Scenario 2 – Cross Sell Potential Detection in Insurance

Tom Harold holds Policy 1 in System 1. He is a high net-worth customer.

Does Tom Harold hold a policy in any other system? What are the policies he holds? Is there a potential for cross-sell?


Example features for different people

Person 1

• First Name – Tom

• Middle Name -

• Last Name - Harold

• Date of Birth – 20/10/1987

• Address - 1, MG Road, Bangalore – 56

Person 2


• Middle Name - Harry

• Last Name - Harold

• Date of Birth – 20/10/1987

• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056

Person 3


• Last Name - Harold

• Date of Birth – 20/10/1988

• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056

Questions:

• Is Person 1 the same as Person 2?

• Is Person 2 the same as Person 3?

• Is Person 1 the same as Person 3?


Similar Entity Detection Challenges

A quick manual inspection of the Person 1 and Person 2 feature data is enough to conclude that Person 1 is the same as Person 2

This is not so trivial for a machine

Weightages must be arrived at for the different features

Code is needed to identify whether the values of a feature for Person 1 and Person 2 are similar or different

A simple string comparison is not sufficient: is MG Road the same as Mahatma Gandhi Rd?

Actual data will have spelling mistakes, missing data and wrongly entered data. For example, 20/10/1987 could be entered as 20/10/1988, or the field itself could be empty

Hence other techniques such as machine learning and semantic techniques are needed
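The challenges above can be made concrete with a small sketch of per-feature comparison. This is an illustration only, not the method used in the talk: the abbreviation table, the feature weights, the date-scoring rule and all function names are assumptions made up for this example, and the string measure is the standard-library difflib ratio.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a curated one.
ABBREVIATIONS = {"mg": "mahatma gandhi", "rd": "road"}

def normalize_address(address):
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())

def string_similarity(a, b):
    """Return a 0..1 similarity, or None when either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def address_similarity(a, b):
    """Compare addresses after normalization, so 'MG Road' can match 'Mahatma Gandhi Rd'."""
    if not a or not b:
        return None
    return SequenceMatcher(None, normalize_address(a), normalize_address(b)).ratio()

def date_similarity(a, b, fmt="%d/%m/%Y"):
    """1.0 for equal dates, 0.5 when exactly one of day/month/year differs, else 0.0."""
    if not a or not b:
        return None
    da, db = datetime.strptime(a, fmt), datetime.strptime(b, fmt)
    matches = (da.day == db.day) + (da.month == db.month) + (da.year == db.year)
    return {3: 1.0, 2: 0.5}.get(matches, 0.0)

# Assumed weightages, chosen only for illustration.
WEIGHTS = {"last_name": 0.3, "dob": 0.3, "address": 0.4}
COMPARERS = {
    "last_name": string_similarity,
    "dob": date_similarity,
    "address": address_similarity,
}

def pair_score(p, q):
    """Weighted average over the features that are present in both records."""
    total = weight_sum = 0.0
    for feature, weight in WEIGHTS.items():
        sim = COMPARERS[feature](p.get(feature), q.get(feature))
        if sim is not None:  # skip missing data instead of penalizing it
            total += weight * sim
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Skipping missing features, rather than scoring them as zero, is one possible design choice; whether an empty field should count against a match depends on the data.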


Similar Entity Detection Methodology

Step 1

• Identify relevant features

Step 2

• Extract values for the features

Step 3

• Create a model which can classify the two entities as same or different

Step 4

• Use the model to classify future customer pairs

Given two entities, how can we say that the two entities are the same?


Supervised Learning model

Labeled pre-identified customer pairs data as inputs

Values for different features for each of the customer pairs

Each customer pair is tagged as Same, Probably Same, Different

A supervised algorithm is chosen (the actual algorithm depends on the data characteristics)

The tagged data is fed as input

Output is the model

Model will classify a new customer pair into one of the identified categories

Model quality can be measured using Precision, Recall, Accuracy and F-Score
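A minimal sketch of this flow, assuming each customer pair has already been reduced to a vector of per-feature similarities. The talk does not name an algorithm, so a nearest-centroid classifier is used here purely as a stand-in; the training vectors and labels are made up for illustration.

```python
from collections import defaultdict

def train_centroids(vectors, labels):
    """Average the feature vectors of each label (a nearest-centroid model)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to the pair's feature vector."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labeled pairs: (name_sim, dob_sim, address_sim) per customer pair.
train_x = [(1.0, 1.0, 0.9), (0.9, 1.0, 1.0), (0.8, 0.5, 0.9),
           (0.7, 0.5, 1.0), (0.2, 0.0, 0.1), (0.1, 0.0, 0.3)]
train_y = ["Same", "Same", "Probably Same",
           "Probably Same", "Different", "Different"]

model = train_centroids(train_x, train_y)
```

A new pair is then classified with predict(model, vector), and the metrics function can be run per category on a held-out set of tagged pairs.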


Supervised Learning model example

• Live example of how to classify a given set of customer records using Supervised Learning


Un-Supervised Learning model

In many cases there is no pre-labeled data

In this case we would need to choose an Un-supervised learning model

The model will automatically detect patterns in the data and cluster the data points into different clusters

Any newly added customer pair would be placed in the right cluster
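As a rough illustration of clustering unlabeled pairs, here is a small k-means sketch in plain Python. K-means is an assumption chosen for the example, since the talk does not say which unsupervised algorithm was used; the data points and the naive "first k points" initialization are also made up.

```python
def nearest(centroids, point):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

def kmeans(points, k, iterations=20):
    """Plain k-means; starts from the first k points for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every pair vector to its nearest centroid.
        assignment = [nearest(centroids, p) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Made-up unlabeled pair vectors: (name_sim, dob_sim, address_sim).
points = [(0.95, 1.0, 0.9), (0.1, 0.0, 0.2), (0.9, 1.0, 1.0),
          (0.2, 0.0, 0.1), (1.0, 0.5, 0.95), (0.15, 0.0, 0.3)]

centroids, assignment = kmeans(points, 2)

# A newly added customer pair is placed in the cluster with the closest centroid.
new_pair_cluster = nearest(centroids, (0.9, 1.0, 0.95))
```

The clusters carry no names of their own; deciding which cluster means "same" and which means "different" still needs a look at a few members of each.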


Un-Supervised Learning model example

• Live example of how to classify a given set of customer records using Un-Supervised Learning


Continuous Learning

In many cases there will be a small set of labeled data and a very large set of un-labeled data

An initial model will be created using the small labeled data set

As more labeled data becomes available, the model will evolve through continuous learning and become better at the classification
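One way to picture a model that evolves as labels arrive is an incrementally updated nearest-centroid classifier, sketched below. This is an assumed stand-in rather than the approach from the talk; the class name, method names and sample vectors are hypothetical.

```python
class IncrementalCentroidModel:
    """Nearest-centroid classifier whose centroids are updated one label at a time."""

    def __init__(self):
        self.centroids = {}
        self.counts = {}

    def learn(self, vec, label):
        """Fold one newly labeled pair into the running mean for its label."""
        n = self.counts.get(label, 0) + 1
        old = self.centroids.get(label, [0.0] * len(vec))
        self.centroids[label] = [o + (v - o) / n for o, v in zip(old, vec)]
        self.counts[label] = n

    def predict(self, vec):
        """Return the label whose current centroid is closest to the vector."""
        return min(
            self.centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(vec, self.centroids[label])
            ),
        )

# Initial model from a very small labeled set.
model = IncrementalCentroidModel()
model.learn((1.0, 1.0), "Same")
model.learn((0.0, 0.0), "Different")
before = model.predict((0.6, 0.6))

# A reviewer later tags a borderline pair; the model shifts accordingly.
model.learn((0.6, 0.6), "Different")
after = model.predict((0.6, 0.6))
```

Here the prediction for the borderline vector changes once a labeled example for it has been absorbed, which is the behaviour the slide describes.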


Semantic Techniques applicability

Semantic similarity scoring for features

• Feature - List of Games played

• Person 1 plays – Racquet Sports

• Person 2 plays – Lawn Tennis

• Using semantic comparison we can see that there is a high similarity between person 1 and person 2 on the List of Games played feature

Extraction of features from different data sources

• Similar features named differently

Associating customers in different data sources as same or different

• Flexible and easy addition of new relationships

Ease of adding additional data sources
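The games example can be sketched with a tiny concept hierarchy and a Wu-Palmer style score (twice the depth of the deepest shared ancestor, divided by the sum of the two concept depths). The taxonomy below is a hypothetical toy; a real system would rely on an ontology or a lexical resource, which the talk does not specify.

```python
# Hypothetical toy taxonomy: child concept -> parent concept.
PARENT = {
    "Sport": None,
    "Racquet Sports": "Sport",
    "Team Sports": "Sport",
    "Lawn Tennis": "Racquet Sports",
    "Badminton": "Racquet Sports",
    "Football": "Team Sports",
}

def path_to_root(concept):
    """Concepts from the given one up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def semantic_similarity(a, b):
    """Wu-Palmer style score in (0, 1]; the root has depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_a = set(path_a)
    # First concept on b's path that is also an ancestor of a (or a itself).
    common = next(c for c in path_b if c in ancestors_a)
    return 2 * len(path_to_root(common)) / (len(path_a) + len(path_b))
```

With this toy hierarchy, "Racquet Sports" and "Lawn Tennis" score much higher than "Lawn Tennis" and "Football", even though the strings share almost no characters.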


Large Data handling challenges

Entity similarity is a pairwise operation

If there are n entities, then n*(n-1)/2 distinct pairs have to be compared (n*(n-1) if each ordered pair is counted), so the work grows roughly as n squared

Also, within each comparison, every feature pair has to be compared

Highly time consuming operations
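To get a feel for the growth, the short sketch below counts and enumerates the pairs, assuming each unordered pair is compared once, which gives n*(n-1)/2 comparisons (counting ordered pairs would double that). The function names are made up for this example.

```python
from itertools import combinations
from math import comb

def pair_count(n):
    """Number of unordered entity pairs among n entities: n*(n-1)/2."""
    return comb(n, 2)

def feature_comparisons(n, features):
    """Total feature-level comparisons when every pair compares every feature."""
    return pair_count(n) * features

def all_pairs(entity_ids):
    """Enumerate each unordered pair exactly once."""
    return list(combinations(entity_ids, 2))
```

For example, one million customers give roughly 5 * 10^11 pairs before the per-feature work is even considered, which is why the comparison cannot be done naively on one machine.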


Large Data handling ideas

Use of Apache Mahout

• Split the comparison across m different machines

• Each machine now handles n/m customers

• Nearly an m-times speed-up

Batch-time comparisons and tagging

• Reduce run-time response to find similar entities

Incremental addition of new customer pairs
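The splitting and incremental ideas can be sketched in plain Python. This does not use Apache Mahout or any of its APIs; it only illustrates, under assumed function names, how pairs could be dealt out to m workers and how a newly added customer needs only n new comparisons rather than a full re-run.

```python
from itertools import combinations

def partition_pairs(entity_ids, m):
    """Deal the unordered pairs round-robin into m roughly equal work lists."""
    partitions = [[] for _ in range(m)]
    for index, pair in enumerate(combinations(entity_ids, 2)):
        partitions[index % m].append(pair)
    return partitions

def incremental_pairs(existing_ids, new_id):
    """Pairs to compare when one new customer joins an already-compared set."""
    return [(existing, new_id) for existing in existing_ids]
```

Each work list could then be handed to a separate machine or process, with results tagged in a batch run so that look-ups at run time only read the stored tags.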


Sample Metrics from our experiments

• Discussion on the sample metrics from our experiments

• Learnings from the same

– Which algorithm and method was more apt under different circumstances


Thank You
