Building similarentityrecognizerv1

© 2012 WIPRO LTD | WWW.WIPRO.COM | INTERNAL

Building Similar Entity Recognizers

By Arthi Venkataraman

Agenda

Similar Entity Detection Scenarios

Challenges, Techniques and Algorithms

Semantic applicability

Big Data Challenges and Solution

Sample Results

Scenario 1 - Fraud Detection in Insurance Claims

P1: Tom Harold makes a claim on Policy 1

P2: Tom H makes a claim on Policy 2

P3: T Harold makes a claim on Policy 3

Are P1, P2 and P3 the same?

Is there fraud?

Scenario 2 – Cross Sell Potential Detection in Insurance

Tom Harold holds Policy 1 in System 1. He is a high net-worth customer.

Does Tom Harold hold a policy in any other system? Which policies does he hold? Is there potential for cross-sell?

Example features for different people

Person 1

• First Name – Tom

• Middle Name -

• Last Name - Harold

• Date of Birth – 20/10/1987

• Address - 1, MG Road, Bangalore – 56

Person 2

• First Name – Tom

• Middle Name - Harry

• Last Name - Harold

• Date of Birth – 20/10/1987

• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056

Person 3

• First Name – Tom

• Last Name - Harold

• Date of Birth – 20/10/1988

• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056

Questions:

• Is Person 1 the same as Person 2?

• Is Person 2 the same as Person 3?

• Is Person 1 the same as Person 3?

Similar Entity Detection Challenges

A quick manual inspection of the Person 1 and Person 2 feature data is enough to conclude that Person 1 is the same as Person 2

Not so trivial for a machine

Weights must be determined for the different features

Code is needed to decide whether the values of a feature for Person 1 and Person 2 are similar or different

A simple string comparison is not sufficient: is MG Road the same as Mahatma Gandhi Rd?

Actual data will have spelling mistakes, missing data and wrongly entered data. For example, 20/10/1987 could be entered as 20/10/1988, or the field itself could be empty

Hence other techniques, such as machine learning and semantic techniques, are needed
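As a rough illustration of why plain string equality falls short, the sketch below normalizes a value, expands abbreviations and then scores the result with a fuzzy match. The abbreviation table and the function names are assumptions made for this example, and Python's standard `difflib` stands in for whatever similarity measure a real system would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {"mg": "mahatma gandhi", "rd": "road"}

def normalize(text):
    """Lower-case, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def feature_similarity(a, b):
    """Return a similarity in [0, 1], or None when either value is missing."""
    if not a or not b:
        return None  # missing data gives no evidence either way
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Addresses that differ only by abbreviation normalize to the same string.
print(feature_similarity("1, MG Road, Bangalore", "1, Mahatma Gandhi Rd, Bangalore"))
# A one-digit data-entry error still scores high, but below an exact match.
print(feature_similarity("20/10/1987", "20/10/1988"))
# An empty field is reported as missing rather than as a mismatch.
print(feature_similarity("", "Harry"))
```

Returning `None` for a missing value keeps "no data" separate from "different data", so a later step can decide how much weight an empty field deserves.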

Similar Entity Detection Methodology

Step 1

• Identify relevant features

Step 2

• Extract values for the features

Step 3

• Create a model which can classify the two entities as same or different

Step 4

• Use the model to classify future customer pairs

Given two entities, how can we say whether they are the same?
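The four steps can be sketched end to end on the example records. This is only an illustration: the feature list, the weights and the two thresholds are hypothetical values chosen for the example, and the weighted-average rule is a stand-in for the trained model of Step 3.

```python
from difflib import SequenceMatcher

# Step 1: the features considered relevant (an assumption for this example).
FEATURES = ["first_name", "last_name", "date_of_birth", "address"]
WEIGHTS = [1, 2, 3, 2]  # hypothetical weights, one per feature

def similarity(a, b):
    """String similarity in [0, 1]; 0.5 (neutral) when a value is missing."""
    if not a or not b:
        return 0.5
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(rec1, rec2):
    """Step 2: one similarity value per feature for a pair of records."""
    return [similarity(rec1.get(f), rec2.get(f)) for f in FEATURES]

def classify(vector, weights, same=0.9, probably=0.75):
    """Stand-in for Steps 3 and 4: weighted average compared with thresholds."""
    score = sum(w * v for w, v in zip(weights, vector)) / sum(weights)
    if score >= same:
        return "Same"
    if score >= probably:
        return "Probably Same"
    return "Different"

person2 = {"first_name": "Tom", "last_name": "Harold",
           "date_of_birth": "20/10/1987",
           "address": "1, Mahatma Gandhi Rd, Bangalore - 560056"}
person3 = dict(person2, date_of_birth="20/10/1988")

vector = pair_features(person2, person3)
print(vector, classify(vector, WEIGHTS))
```

In practice the weights and thresholds would be learned from labeled pairs rather than set by hand, which is what the supervised model does.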

Supervised Learning model

Labeled, pre-identified customer-pair data is the input

It holds the values of the different features for each customer pair

Each customer pair is tagged as Same, Probably Same or Different

A supervised algorithm is chosen (the actual algorithm depends on the data characteristics)

The tagged data is fed as input

The output is the model

The model will classify a new customer pair into one of the identified categories

Model quality can be measured using precision, recall, accuracy and F-score
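A minimal sketch of this train-then-evaluate loop follows. The similarity vectors and labels are invented for illustration, and a 1-nearest-neighbour rule is used only because it fits in a few lines; the slides do not say which algorithm was actually chosen.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict(train, vector):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(train, key=lambda example: distance(example[0], vector))
    return label

def scores(actual, predicted, label):
    """Precision, recall and F1 for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented per-feature similarity vectors for tagged customer pairs.
train = [
    ([1.0, 1.0, 0.9], "Same"), ([0.9, 1.0, 1.0], "Same"),
    ([0.8, 0.7, 0.9], "Probably Same"), ([0.7, 0.8, 0.8], "Probably Same"),
    ([0.2, 0.1, 0.3], "Different"), ([0.1, 0.3, 0.2], "Different"),
]
held_out = [
    ([0.95, 1.0, 0.95], "Same"), ([0.75, 0.75, 0.85], "Probably Same"),
    ([0.15, 0.2, 0.25], "Different"), ([0.85, 0.85, 0.9], "Same"),
]

actual = [label for _, label in held_out]
predicted = [predict(train, vector) for vector, _ in held_out]
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print("accuracy:", accuracy)
print("Same (precision, recall, F1):", scores(actual, predicted, "Same"))
```

The last held-out pair is deliberately borderline, so the metrics are not all perfect and the difference between precision and recall becomes visible.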

Supervised Learning model example

• Live example of how to classify a given set of customer records using Supervised Learning

Un-Supervised Learning model

In many cases there is no pre-labeled data

In this case we would need to choose an Un-supervised learning model

The model will automatically detect patterns in the data and cluster the data points into different clusters

Any newly added customer pair would be placed in the right cluster
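One way to picture this is a small k-means run over per-pair similarity vectors. The sketch below is an assumption-laden illustration: the data points are invented, k-means is just one possible clustering algorithm, and the initial centroids are fixed by hand so the run is repeatable.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def assign(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Invented (name similarity, address similarity) values for unlabeled pairs.
pairs = [(0.95, 0.9), (0.9, 1.0), (1.0, 0.95),
         (0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]

centroids = kmeans(pairs, centroids=[pairs[0], pairs[3]])
print(centroids)

# A newly added customer pair goes to the nearest existing cluster.
print(assign((0.85, 0.9), centroids))
```

The clusters carry no names of their own; someone still has to inspect them to decide which one corresponds to "same" and which to "different".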

Un-Supervised Learning model example

• Live example of how to classify a given set of customer records using Un-Supervised Learning

Continuous Learning

In many cases there will be a small set of labeled data and a very large set of unlabeled data

An initial model will be created using the small labeled data set

As more labeled data becomes available, the model will evolve through continuous learning and become better at the classification
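The idea of a model that improves as labels arrive can be sketched with an incrementally updated nearest-centroid classifier. The class name, the two-feature vectors and the labels below are all hypothetical; the slides do not specify which incremental algorithm was used.

```python
import math

class IncrementalCentroidModel:
    """Nearest-centroid classifier whose centroids are updated one example at a time."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of examples seen

    def learn(self, vector, label):
        """Fold one newly labeled example into the running sums."""
        sums = self.sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            sums[i] += value
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, vector):
        """Return the label whose centroid (mean vector) is closest."""
        def distance(label):
            n = self.counts[label]
            return math.sqrt(sum((v - s / n) ** 2
                                 for v, s in zip(vector, self.sums[label])))
        return min(self.counts, key=distance)

model = IncrementalCentroidModel()

# Initial model from a very small labeled set.
model.learn([1.0, 1.0], "Same")
model.learn([0.0, 0.0], "Different")
borderline = [0.45, 0.5]
print("before:", model.predict(borderline))

# Later, reviewers label more pairs; the model is updated without retraining from scratch.
model.learn([0.6, 0.7], "Same")
model.learn([0.5, 0.6], "Same")
print("after:", model.predict(borderline))
```

With only the two initial examples the borderline pair falls closer to the "Different" centroid; after two more labeled "Same" examples the centroid moves and the same pair is classified as "Same".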

Semantic Techniques applicability

Semantic similarity scoring for features

• Feature - List of Games played

• Person 1 plays – Racquet Sports

• Person 2 plays – Lawn Tennis

• Using semantic comparison we can see that there is a high similarity between Person 1 and Person 2 on the List of Games played feature

Extraction of features from different data sources

• Similar features named differently

Associating customers in different data sources as same or different

• Flexible and easy addition of new relationships

Ease of adding additional data sources
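The semantic similarity scoring for the List of Games played feature can be sketched with a small concept hierarchy. The taxonomy below is invented for this example, and the score is the Wu-Palmer measure (twice the depth of the lowest common ancestor, divided by the sum of the two concept depths), which is one common choice rather than necessarily the one used in the talk.

```python
# Hypothetical taxonomy: each concept maps to its parent (None marks the root).
PARENT = {
    "sports": None,
    "racquet sports": "sports",
    "team sports": "sports",
    "lawn tennis": "racquet sports",
    "badminton": "racquet sports",
    "football": "team sports",
}

def path_to_root(concept):
    """The concept followed by all of its ancestors, ending at the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def semantic_similarity(a, b):
    """Wu-Palmer similarity in (0, 1]; the root has depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_a = set(path_a)
    # Walking up from b, the first concept also above (or equal to) a
    # is the lowest common ancestor.
    common = next(c for c in path_b if c in ancestors_a)
    return 2 * len(path_to_root(common)) / (len(path_a) + len(path_b))

print(semantic_similarity("racquet sports", "lawn tennis"))
print(semantic_similarity("lawn tennis", "football"))
```

A plain string comparison of "Racquet Sports" and "Lawn Tennis" would find almost nothing in common, whereas the hierarchy places one directly under the other and so yields a high score.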

Large Data handling challenges

Entity similarity is a pairwise operation

If there are n entities, there are n*(n-1) ordered pairs, which is n*(n-1)/2 distinct comparisons when similarity is symmetric

Also, within each comparison, every feature pair has to be compared

These are highly time-consuming operations
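To get a feel for the growth, the short sketch below counts the comparisons, assuming similarity is symmetric so that each unordered pair is compared only once (half of the n*(n-1) ordered pairs). The customer identifiers and the feature count of 5 are placeholders.

```python
from itertools import combinations

def pair_count(n):
    """Number of distinct unordered pairs among n entities."""
    return n * (n - 1) // 2

customers = ["P1", "P2", "P3", "P4"]
pairs = list(combinations(customers, 2))
print(len(pairs), pair_count(len(customers)))

# Each pair also costs one comparison per feature.
features_per_entity = 5  # placeholder value
for n in (1_000, 1_000_000):
    print(n, pair_count(n), pair_count(n) * features_per_entity)
```

The count grows quadratically: a thousand-fold increase in entities means roughly a million-fold increase in comparisons.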

Large Data handling ideas

Use of Apache Mahout

• Split the comparison across m different machines

• Each machine now handles n/m customers

• Nearly an m-times speed-up

Batch-time comparisons and tagging

• Reduces the run-time response time for finding similar entities

Incremental addition of new customer pairs
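The splitting idea can be illustrated in plain Python, independent of any particular framework. This is not Mahout's API; the function name is hypothetical, and the sketch simply deals the comparison pairs out to m work lists so that each worker receives about 1/m of the total.

```python
from itertools import combinations

def partition_pairs(entity_ids, m):
    """Deal all unordered pairs round-robin into m work lists."""
    buckets = [[] for _ in range(m)]
    for k, pair in enumerate(combinations(entity_ids, 2)):
        buckets[k % m].append(pair)
    return buckets

buckets = partition_pairs(range(10), m=3)
for worker, work in enumerate(buckets):
    print(worker, len(work))
```

Because the buckets are disjoint and nearly equal in size, the comparison phase can approach an m-times speed-up, ignoring the cost of distributing the data and collecting the results. For very large n the pairs would be generated lazily on each machine rather than materialized in one list as they are here.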

Sample Metrics from our experiments

• Discussion of the sample metrics from our experiments

• Learnings from the same

– Which algorithm and method was more apt under different circumstances

Thank You