Effective Named Entity Recognition for Idiosyncratic Web Collections


Presentation at WWW 2014

Transcript of Effective Named Entity Recognition for Idiosyncratic Web Collections

1

Effective Named Entity Recognition for Idiosyncratic Web Collections

Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland

WWW 2014, April 10, 2014

2

Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate Named Entities Selection
• Dataset description
• Features description
• Experimental setup & Evaluation

3

Problem Definition

• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …

Entity type: scientific concept

4

Traditional NER

Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)

Properties:
• Require extensive training
• Usually domain-specific: different collections require training on their own domain
• Very good at detecting types such as Location, Person, Organization

5

Proposed Approach

We frame the problem as a classification task.

Two-step classification:
• Extract candidate named entities using a frequency filtration algorithm.
• Classify the candidate named entities with a supervised classifier.

Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.

6

Pipeline

7

Candidate Selection: Part I

Consider all bigrams with frequency > k (k=2):

candidate named: 5
entity are: 4
entity candidate: 3
entity in: 18
entity recognition: 12
named entity: 101
of named: 10
that named: 3
the named: 4

After the NLTK stop word filter (sketched in code below):

candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101
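A minimal sketch of this filtering step, assuming NLTK tokenization; the exact tokenizer and the function name candidate_bigrams are illustrative, not from the paper:

```python
from collections import Counter

from nltk import bigrams, word_tokenize
from nltk.corpus import stopwords  # one-time: nltk.download('stopwords'), nltk.download('punkt')

STOP_WORDS = set(stopwords.words('english'))

def candidate_bigrams(text, k=2):
    """Keep bigrams occurring more than k times, dropping any
    bigram that contains a stop word."""
    tokens = [t.lower() for t in word_tokenize(text)]
    counts = Counter(bigrams(tokens))
    return {ng: f for ng, f in counts.items()
            if f > k and not set(ng) & STOP_WORDS}
```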

8

Candidate Selection: Part II

Adjacent bigrams are merged into trigrams; trigram frequency is looked up from the n-gram index, and each trigram's frequency is subtracted from its constituent bigrams.

Input bigrams:
candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101

Merged trigrams (frequencies from the n-gram index):
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12

After subtraction:
named entity: 81
candidate named: 0
entity candidate: 0
entity recognition: 0
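A sketch of this merge-and-subtract step, where ngram_index is a hypothetical dict-like index mapping trigrams to their corpus frequencies:

```python
def merge_to_trigrams(bigram_counts, ngram_index):
    """Merge bigrams overlapping on one token into trigrams and
    subtract each trigram's frequency from its constituent bigrams."""
    remaining = dict(bigram_counts)
    trigrams = {}
    for (a, b) in bigram_counts:
        for (c, d) in bigram_counts:
            if b == c:  # e.g. 'candidate named' + 'named entity'
                freq = ngram_index.get((a, b, d), 0)  # hypothetical index lookup
                if freq > 0:
                    trigrams[(a, b, d)] = freq
                    remaining[(a, b)] -= freq
                    remaining[(c, d)] -= freq
    return trigrams, remaining
```

On the example above, "named entity" ends with 101 - 5 - 3 - 12 = 81, while the three bigrams fully absorbed into trigrams drop to 0.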

9

Candidate Selection: Discussion

This merging step makes it possible to extract longer n-grams (n > 2) even when their own frequency is ≤ k.

10

After Candidate Selection

TwiNER: Named Entity Recognition in Targeted Twitter Stream, SIGIR 2012

11

Classifier: Overview

Machine learning algorithm: Decision Trees from the scikit-learn package (a minimal sketch follows the list below).

Feature types:
• POS tags and their derivatives
• External knowledge bases (DBLP, DBPedia)
• DBPedia relation graphs
• Syntactic features
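A minimal sketch of the classification step with scikit-learn, using random stand-in data in place of the paper's real feature matrix:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: each row is a candidate n-gram, each column a binary
# feature; y marks whether the candidate is a valid entity.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 9)).astype(float)
y = rng.integers(0, 2, size=200)

clf = DecisionTreeClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=10, scoring='f1').mean())
```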

12

Datasets

Two collections:
• CS Collection (SIGIR 2012 Research Track): 100 papers
• Physics Collection: 100 papers randomly selected from the arXiv.org High Energy Physics category

                      CS Collection   Physics Collection
# Candidate n-grams   21 531          18 129
# Judged n-grams      15 057          11 421
# Valid entities      8 145           5 747
# Invalid n-grams     6 912           5 674

Available at: github.com/XI-lab/scientific_NER_dataset

13

Features: POS Tags, part I

100+ different tag patterns

14

Features: POS Tags, part II

Two feature schemes (see the sketch after this list):
• Raw POS tag patterns: each tag pattern is a binary feature
• Regex POS tag patterns:
  - First-tag match, e.g. JJ* covers JJ NNS, JJ NN NN, JJ NN, …
  - Last-tag match, e.g. *VB covers NN VB, NN NN VB, JJ NN VB, …
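A sketch of both schemes, assuming NLTK's default POS tagger; tagging an n-gram out of sentence context is a simplification, and the function names are illustrative:

```python
import nltk  # one-time: nltk.download('averaged_perceptron_tagger')

def pos_pattern(ngram):
    """Raw pattern, e.g. ('web', 'search', 'engine') -> 'NN NN NN';
    each distinct pattern becomes a binary feature."""
    return ' '.join(tag for _, tag in nltk.pos_tag(list(ngram)))

def regex_pos_features(pattern):
    """Collapse a raw pattern into first-/last-tag features:
    'JJ NN' -> {'JJ*': True, '*NN': True}."""
    tags = pattern.split()
    return {tags[0] + '*': True, '*' + tags[-1]: True}
```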

15

Features: External Knowledge Bases

Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned keywords for papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info

We perform exact string matching against these KBs (sketched below).
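A sketch of the resulting binary feature; KB_LABELS is a stand-in for the label sets extracted from DBLP and ScienceWISE:

```python
# Stand-in label set; in the paper these come from DBLP keywords
# and ScienceWISE concepts.
KB_LABELS = {'named entity recognition', 'information retrieval'}

def kb_feature(ngram):
    """Binary feature: does the n-gram exactly match a KB label?"""
    return ' '.join(ngram).lower() in KB_LABELS
```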

16

Features: DBPedia, part I

DBPedia pages essentially represent valid entities.

But problems arise when:
• the n-gram is not an entity
• the n-gram is not a scientific concept (e.g., "Tom Cruise" in an IR paper)

                         CS Collection         Physics Collection
                         Precision   Recall    Precision   Recall
Exact string matching    0.9045      0.2394    0.7063      0.0155
Matching with redirects  0.8457      0.4229    0.7768      0.5843
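A sketch of the two matching modes, where pages (a set of DBpedia page titles) and redirects (a redirect-to-target mapping) are hypothetical stand-ins for data extracted from the DBpedia dumps:

```python
def dbpedia_match(ngram, pages, redirects):
    """Return the matched DBpedia title, following a redirect
    if needed, or None if the n-gram matches nothing."""
    title = '_'.join(w.capitalize() for w in ngram)
    if title in pages:
        return title                # exact string match
    return redirects.get(title)     # match via a redirect, or None
```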

17

Features: DBPedia, part II

[Figure: DBPedia relation graphs of candidate entities, without redirects vs. with redirects]

18

Features: Syntactic

A set of common syntactic features (sketched below):
• N-gram length in words
• Whether the n-gram is uppercased
• The number of other n-grams a given n-gram is part of
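A sketch of these features; the participation_count name follows the feature importance table later in the talk, and all_ngrams is assumed to be the full candidate set:

```python
def syntactic_features(ngram, all_ngrams):
    text = ' '.join(ngram)
    return {
        'length': len(ngram),          # n-gram length in words
        'uppercased': text.isupper(),  # e.g. acronyms
        # number of other candidate n-grams this n-gram is part of
        'participation_count': sum(
            1 for other in all_ngrams
            if other != ngram and text in ' '.join(other)),
    }
```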

19

Experiments: Overview

1. Regex POS patterns vs. normal POS tags
2. Redirects vs. non-redirects
3. Feature importance scores
4. Comparison with Maximum Entropy

All results are averaged over 10-fold cross-validation.

20

Experiments: Comparison I

CS Collection                       Precision   Recall    F1 score   Accuracy   # features
Normal POS + Components             0.8794      0.8058*   0.8409*    0.8429*    54
Regex POS + Components              0.8475*     0.8524*   0.8499*    0.8448*    9
Normal POS + Components-Redirects   0.8678*     0.8305*   0.8487*    0.8473     50
Regex POS + Components-Redirects    0.8406*     0.8769    0.8584     0.8509     7

The symbol * indicates a statistically significant difference compared to the approach highlighted in bold on the original slide.

21

Experiments: Comparison II

Physics Collection                  Precision   Recall    F1 score   Accuracy   # features
Normal POS + Components             0.8253*     0.6567*   0.7311*    0.7567     53
Regex POS + Components              0.7941*     0.6781    0.7315*    0.7492*    4
Normal POS + Components-Redirects   0.8339      0.6674*   0.7412     0.7653     50
Regex POS + Components-Redirects    0.8375      0.6479*   0.7305*    0.7592*    6

The symbol * indicates a statistically significant difference compared to the approach highlighted in bold on the original slide.

22

Experiments: Feature Importance

CS Collection, 7 features:

Feature             Importance
NN STARTS           0.3091
DBLP                0.1442
Components + DBLP   0.1125
Components          0.0789
VB ENDS             0.0386
NN ENDS             0.0380
JJ STARTS           0.0364

Physics Collection, 6 features:

Feature                   Importance
ScienceWISE               0.2870
Component + ScienceWISE   0.1948
Wikipedia redirect        0.1104
Components                0.1093
Wikilinks                 0.0439
Participation count       0.0370
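Importance scores like these can be read off a fitted scikit-learn decision tree; a sketch with random stand-in data (the real scores come from the paper's feature matrix):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = ['NN STARTS', 'DBLP', 'Components + DBLP', 'Components',
                 'VB ENDS', 'NN ENDS', 'JJ STARTS']
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, len(feature_names))).astype(float)
y = rng.integers(0, 2, size=300)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda p: p[1], reverse=True):
    print(f'{name}: {score:.4f}')
```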

23

Experiments: MaxEntropy

                  Precision   Recall   F1 score
Maximum Entropy   0.6566      0.7196   0.6867
Decision Trees    0.8121      0.8742   0.8420

The MaxEnt classifier receives the full text as input (we used the classifier from the NLTK package).

Comparison setup: 80% of the CS Collection as training data, 20% as test data.
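A sketch of the NLTK MaxEnt baseline on an 80/20 split, with toy featuresets in NLTK's (feature_dict, label) format standing in for the full-text input used in the paper:

```python
from nltk.classify import MaxentClassifier, accuracy

# Toy featuresets; the feature names here are illustrative only.
data = [({'starts_NN': i % 2 == 0, 'in_dblp': i % 3 == 0},
         'entity' if i % 2 == 0 else 'other')
        for i in range(100)]
train, test = data[:80], data[80:]  # 80% train / 20% test

maxent = MaxentClassifier.train(train, max_iter=10)
print(accuracy(maxent, test))
```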

24

Lessons Learned

Classic NER approaches are not good enough for idiosyncratic Web collections.

Leveraging the graph of scientific concepts is a key feature.

Domain-specific KBs and POS patterns work well.

Experimental results show up to 85% accuracy over different scientific collections.

http://iner.exascale.info/

eXascale Infolab, http://exascale.info