Developing and validating a document classifier: a real-life story - Marko Smiljanic

Posted on 06-Jan-2017



Marko Smiljanić, CEO, NIRI Intelligent Computing Ltd

Developing and validating a document classifier: a real-life story

Marko Smiljanić, CEO

www.niri-ic.com

About us.

NIRI: 10 years in Intelligent Computing. Text Mining, Knowledge Discovery and Management. All about Data Science.

NIŠ

About me: my role in the company.

The flow:
- Business Context
- The Challenge
- The Solution
- Effectiveness: laboratory measurements, impact estimation, reality
- Wrap up

Business context


Largest clients include Public Employment Services in the EU, USA, and Asia, and staffing companies in the EU and USA.

The ELISE Platform matches Vacancies with Job Seekers through a Job Taxonomy and a Skill Taxonomy.


Document Classification: assigning each Vacancy a class from the Job Taxonomy.

Occupation taxonomies: ISCO (International Standard Classification of Occupations), ESCO, O*NET, and many more. ISCO has 10 classes at level 1, 42 at level 2, 124 at level 3, and 400 at level 4; ESCO adds 5000 at level 5.

“Delivery service worker”
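The ISCO coding scheme is digit-prefix hierarchical: each extra digit refines the level above. A minimal sketch of reading the hierarchy out of a code (the example code "9621" is illustrative, not the talk's mapping for "Delivery service worker"):

```python
# ISCO codes are hierarchical: the first digit is the level-1 major group,
# the first two digits the level-2 sub-major group, and so on.
def isco_levels(code: str) -> dict[int, str]:
    """Return the ancestor code at each taxonomy level for an ISCO code."""
    return {level: code[:level] for level in range(1, len(code) + 1)}

print(isco_levels("9621"))  # {1: '9', 2: '96', 3: '962', 4: '9621'}
```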

Challenges (for humans): knowing the taxonomy, ambiguous taxonomy, hybrid positions, vague vacancies.

Client’s situation in 2014

Vacancy → Aggregator and Classifier → Correct code? If NO, repair the code; then publish.

Outcomes: OK 65%; wrong code 23% (of these, repaired to OK 9%, classifier offered no help 14%); no code 12%. Volume: 2000-4000 vacancies per day, into >2000 taxonomy classes. Overall published accuracy: unknown (%?).


The Solution: NIRI will build you a better classifier

Vacancy → Aggregator and Classifier → NIRI Classifier → Publish (2000-4000 per day).

Really? How accurate will it be? How will it fit our process?

Really. We will (try to): reduce manual effort, increase volume, and improve final accuracy. But you need to give us training data: more than 1M vacancies.

Training data composition: Verified 74%, Not verified 14%, No class 12% (accuracy of the labels themselves: unknown).

Long tail effect across the taxonomy classes.
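The long tail effect means a few taxonomy classes account for most vacancies, while most of the >2000 classes have very few training examples. A quick, illustrative way to quantify it (the labels below are made up):

```python
from collections import Counter

def classes_covering(labels: list[str], fraction: float = 0.8) -> int:
    """How many of the most frequent classes cover `fraction` of all examples."""
    counts = Counter(labels).most_common()
    target, covered, k = fraction * len(labels), 0, 0
    for _, count in counts:
        covered += count
        k += 1
        if covered >= target:
            break
    return k

# Toy corpus: 6 classes, but 3 of them already cover 80% of the examples.
labels = ["A"] * 8 + ["B"] * 6 + ["C"] * 3 + ["D", "E", "F"]
print(classes_covering(labels))  # 3
```

In a long-tailed corpus this number is tiny relative to the total class count, which is exactly what makes the rare classes hard to learn.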

Architecture of our solution

The Vacancy Classifier takes a vacancy as input and produces one or more [Class, Confidence] pairs. Internally, a Feature Extractor feeds an ensemble of classifiers (Classifier 1, Classifier 2, …, Classifier N) whose outputs are combined by a Negotiator; external services are consulted where needed.
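The ensemble-plus-negotiator design can be sketched as follows. The talk does not specify how the Negotiator combines votes; the confidence-summing rule here is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    cls: str           # taxonomy class code
    confidence: float  # in [0, 1]

def negotiate(predictions: list[Prediction]) -> Prediction:
    """Merge per-classifier [class, confidence] outputs by summing
    confidence per class and normalizing by the number of classifiers."""
    scores: dict[str, float] = {}
    for p in predictions:
        scores[p.cls] = scores.get(p.cls, 0.0) + p.confidence
    best = max(scores, key=scores.get)
    return Prediction(best, scores[best] / len(predictions))
```

For example, three classifiers voting ("5412", 0.9), ("5412", 0.6), and ("9621", 0.7) would negotiate to class "5412" with confidence 0.5 (class codes here are arbitrary placeholders).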

What to do with confidence?

Each classification is a (Vacancy, Code, Confidence) triple. In batch processing, results are sorted by confidence: high-confidence results (high accuracy) are bulk accepted, while low-confidence results (low accuracy) are sent to be checked manually.

Using confidence
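Confidence-based routing in batch processing might look like this; the 0.9 threshold is an assumed value, not a figure from the talk:

```python
def route(results, threshold=0.9):
    """Split (vacancy, code, confidence) triples into a bulk-accepted
    pile and a manually-checked pile based on a confidence threshold."""
    bulk_accept, manual_check = [], []
    for vacancy, code, confidence in results:
        if confidence >= threshold:
            bulk_accept.append((vacancy, code))
        else:
            manual_check.append((vacancy, code))
    return bulk_accept, manual_check
```

In practice the threshold would be tuned so that the accuracy of the bulk-accepted pile meets the client's quality target.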


Measuring accuracy in the laboratory

Data: Verified 74%, Not verified 14%, No class 12%.

Evaluation: 5-fold cross-validation on the verified data (80% train / 20% test, repeated 5 times). Each test classification is counted as Correct, Incorrect, or No class.
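The 80/20 × 5 evaluation is standard 5-fold cross-validation; a stdlib-only sketch of the splitting step:

```python
def five_fold_splits(n_items: int):
    """Yield (train_indices, test_indices) for 5 folds: each fold holds
    out ~20% of the items for testing and trains on the remaining ~80%."""
    indices = list(range(n_items))
    fold = n_items // 5
    for k in range(5):
        # Last fold absorbs any remainder so every item is tested once.
        test = indices[k * fold:(k + 1) * fold] if k < 4 else indices[4 * fold:]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test
```

In a real pipeline the items would be shuffled (and likely stratified by class) before splitting, which matters given the long tail of rare classes.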

One of many laboratory measurements of the Vacancy Classifier:

            Corpus   Classifier   Classifier 100   Classifier 1000
Correct       74%       78%            80%              85%
Incorrect     14%       13%            12%              10%
No class      12%        9%             8%               5%

Measuring accuracy in the laboratory

Does this make any sense?

Yes, but…

Measuring accuracy in the laboratory

Original classifier (the corpus labels): Verified 74%, Not verified 14%, No class 12%. NIRI Vacancy Classifier: Correct 78%, Incorrect 13%, No class 9%.

But this is not reality: the train/test set is biased, the accuracy of the test set itself is unknown, and we cannot test against the remaining 26% of the data.


Remember the process?

Vacancy → Aggregator and Classifier → Correct code? If NO (23%), repair; then publish. OK 65%; repaired to OK 9%, no help 14%; no code 12%.

This is what it actually looks like: a two-step Check → Repair process.

We will: reduce manual effort, increase volume, improve final accuracy. And we proposed this process: Bulk Accept → Check → Repair.

Best/worst case analysis, some manual validation, careful assumptions.

For the proposed Bulk Accept → Check → Repair process, impact estimation showed:
- Step 1 effort reduction: 60% (due to bulk acceptance)
- Step 2 effort reduction: 11% (due to bulk acceptance and top-5 offers)
- Significant published volume increase (almost to 100%)
- Slightly higher accuracy (+1%, to around 92%)

Does this make any sense?

Yes, but…


Recall the data: Verified 74%, Not verified 14%, No class 12%; production accuracy unknown (%?).

How can we measure production accuracy?

We cannot, unless…

Golden Test Set

How was it built? By check & repair under the four-eye principle: for each published vacancy, reviewers were shown the original code together with the top 5 Vacancy Classifier codes. Every single classification was marked as either Correct, Acceptable, or Wrong.

Results

Golden Test Set results ("HQ source" = restricted to the highest-quality training source):

                        Current   NIRI VC   Current (HQ source)   NIRI VC (HQ source)
Correct                 63.05%    73.91%    72.06%                74.38%
Correct + Acceptable    65.98%    77.56%    76.25%                78.69%
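Scoring against the golden test set reduces to counting verdicts; a minimal sketch (the verdict strings are assumed names for the talk's Correct/Acceptable/Wrong labels):

```python
from collections import Counter

def golden_scores(labels: list[str]) -> tuple[float, float]:
    """Given one 'correct'/'acceptable'/'wrong' verdict per classification,
    return (correct rate, correct-or-acceptable rate)."""
    counts = Counter(labels)
    n = len(labels)
    correct = counts["correct"] / n
    correct_or_acceptable = (counts["correct"] + counts["acceptable"]) / n
    return correct, correct_or_acceptable
```

The two rates correspond to the two rows of the results table: the strict "Correct" rate and the lenient "Correct + Acceptable" rate.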


Wrap up

Clean semantic data, in real life, can only be a myth. We are looking into data cleansing approaches. Measuring usefulness can be hard and expensive, but it can and must be monitored after the system is deployed: it changes over time. Continuous learning, where possible, is a great thing.

1) Implementing a state-of-the-art machine learning algorithm is one thing. 2) Making it useful is another. 3) Explaining that to the end user is a third.

NIRI is a very cool company to work with!

I hope you liked the story, and I thank you for your attention.

Developing and validating a document classifier: a real-life story

Marko Smiljanić, CEO

www.niri-ic.com