SCooL : A System for Academic Institution Name Normalization Ferosh Jacob, Faizan Javed, Meng Zhao,...
1
sCooL: A System for Academic Institution Name Normalization
Ferosh Jacob, Faizan Javed, Meng Zhao, and Matt McNair
Classification R&D, CareerBuilder
2
Presentation overview
About sCooL
◦ What is entity normalization?
◦ Why is academic entity normalization important?
◦ What are the academic entity normalization challenges?
Inside sCooL
◦ A high-level overview of the core components
◦ Atlas, the mapping manager
Evaluating sCooL
◦ Comparing sCooL with the existing implementation
◦ Independent evaluation of sCooL
Concluding remarks
◦ Demo
◦ Questions?
3
About sCooL: Academic entity normalization facts
Facts
7,021 post-secondary Title IV institutions in 2010-11*
200 million unique visitors at CB U.S.
12 million unique academic institution entries in the CB resume database
*http://nces.ed.gov/fastfacts/display.asp?id=84
4
About sCooL: Academic entity normalization definition
No. | Name (surface form) | Frequency
1 | | 410
2 | | 139
3 | | 131
4 | | 6
5 | | 1
6 | | 1
7 | | 1
8 | | 1
9 | | 1
10 | | 1
Surface forms (rows 1 to 10) → Entity
5
About sCooL: Why academic entity normalization?
Improved Searching
Labor market dynamics insights
6
About sCooL: Academic entity normalization challenges
No. | Name (surface form) | Frequency
1 | Salford College | 410
2 | Salford College of Technology | 139
3 | Salford City College | 131
4 | Salford Uni | 6
5 | Salford University - | 1
6 | The University of Salford. | 1
7 | Salford University **+ | 1
8 | University of Salford 1982 | 1
9 | =- University OF SALFORD | 1
10 | University of Salford- | 1
Entity: Salford City College, Merchants Quay, Salford Quays, United Kingdom
Entity: University of Salford, Salford, Lancashire, United Kingdom
Entity: Salford College, 68 Grenfell Street, Adelaide, Australia
How do you identify the most accurate normalization for a given surface form?
7
About sCooL: Academic entity normalization challenges
String similarity algorithms
◦ Edit distance
Salford university -> Salford Unevarsity (edit distance 2)
(spelling error)
St. Loye’s College -> St. Luke’s College (edit distance 2)
(two different academic institutions)
How do you distinguish a spelling or typing error from a reference to a different institution?
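The two examples above can be checked with a small Levenshtein implementation. This is a minimal sketch, not the comparison code used in sCooL; it assumes case is ignored before comparing, which is what makes the first pair come out at distance 2.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def folded_distance(a: str, b: str) -> int:
    # Assumption: case differences are not counted as edits.
    return edit_distance(a.casefold(), b.casefold())


typo = folded_distance("Salford university", "Salford Unevarsity")
different = folded_distance("St. Loye's College", "St. Luke's College")
print(typo, different)
```

Both pairs score the same, so edit distance alone cannot tell a misspelling from a different institution.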
8
About sCooL: Academic entity normalization challenges
How do you create and maintain the surface form-entity mappings?
Legacy names (mergers)
◦ University of Central England in Birmingham is an old name of Birmingham City University
◦ In January 2009, Salford College merged with Eccles College and Pendleton College to form Salford City College
◦ In October 2004, Victoria University of Manchester merged with the University of Manchester Institute of Science and Technology to form The University of Manchester
Popular names and acronyms
◦ Ole Miss is a popular name for The University of Mississippi
◦ MIT is an acronym for Massachusetts Institute of Technology. However, GIT is not an acronym for Georgia Institute of Technology; Georgia Tech and Ga Tech are the popular names for that institution.
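One simple way to hold such mappings is a lookup table keyed on a normalized surface form. The sketch below is illustrative only: `ALIASES`, `normalize_key`, and `lookup` are hypothetical names, and the entries are just the examples from this slide, not the actual sCooL mappings.

```python
# Hypothetical mapping table built from the examples on this slide.
ALIASES = {
    "university of central england in birmingham": "Birmingham City University",
    "victoria university of manchester": "The University of Manchester",
    "ole miss": "The University of Mississippi",
    "mit": "Massachusetts Institute of Technology",
    "georgia tech": "Georgia Institute of Technology",
    "ga tech": "Georgia Institute of Technology",
}


def normalize_key(surface_form: str) -> str:
    # Fold case and collapse runs of whitespace so trivial variants match.
    return " ".join(surface_form.casefold().split())


def lookup(surface_form: str):
    """Return the mapped entity name, or None when no mapping is known."""
    return ALIASES.get(normalize_key(surface_form))


print(lookup("Ole  Miss"))  # mapped popular name
print(lookup("GIT"))        # not a known name, so no mapping
```

A table like this has to be curated: legacy names and popular names cannot be derived from string similarity.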
9
About sCooL: Academic entity normalization challenges
How can we remove K-12 schools and noise?
Top 10 most frequent universities in the UK dataset
No. | Name | Frequency
1 | N/A | 128976
2 | City & Guilds | 23992
3 | Not Specified | 18598
4 | City and Guilds | 17441
5 | Open University | 6886
6 | MIDDLESEX UNIVERSITY | 5490
7 | University of East London | 5266
8 | University of Greenwich | 5108
9 | CITY UNIVERSITY | 4863
10 | Kingston University | 4856
Institution type | Distribution
College | 23.32%
University | 16.57%
K-12 school | 34.22%
Not sure | 25.89%
10
About sCooL: Challenges summary
How do you identify the most accurate normalization for a given surface form?
How do you distinguish a spelling or typing error from a reference to a different institution?
How do you create and maintain the surface form-entity mappings?
How can we remove K-12 schools and noise?
11
Inside sCooL: A high-level overview of the system
Raw input query (surface form)
Remove K-12 schools
• Weka classifier
Search institutions using mappings DB
• Lucene index
Refine results
• String comparison algorithm
Normalized entity
• Update mappings DB
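The flow on this slide can be sketched end to end in a few lines. Everything here is a stand-in chosen for illustration: a keyword check replaces the Weka classifier, a token-overlap scan over a small dictionary replaces the Lucene index, `difflib.SequenceMatcher` replaces the string comparison algorithm, and the names `K12_HINTS`, `MAPPINGS`, and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical stand-ins for the classifier and the mappings DB.
K12_HINTS = ("high school", "primary school", "secondary school")
MAPPINGS = {
    "university of salford": "University of Salford",
    "salford university": "University of Salford",
    "salford city college": "Salford City College",
}


def looks_like_k12(query: str) -> bool:
    # Stand-in for the K-12 classifier step.
    return any(hint in query for hint in K12_HINTS)


def search_candidates(query: str) -> list:
    # Stand-in for the index search: keep keys sharing at least one token.
    tokens = set(query.split())
    return [key for key in MAPPINGS if tokens & set(key.split())]


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def normalize(raw: str, threshold: float = 0.85):
    """Return the normalized entity, or None (a null normalization)."""
    query = " ".join(raw.casefold().split())
    if not query or looks_like_k12(query):
        return None
    candidates = search_candidates(query)
    if not candidates:
        return None
    best = max(candidates, key=lambda key: similarity(query, key))
    if similarity(query, best) < threshold:
        return None
    entity = MAPPINGS[best]
    MAPPINGS.setdefault(query, entity)  # update the mappings with the new form
    return entity


print(normalize("University of Salford"))
print(normalize("Salford High School"))
print(normalize("N/A"))
```

Returning `None` when no candidate clears the threshold is what later shows up as a null normalization in the evaluation.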
12
Inside sCooL: Atlas, sCooL’s mapping manager
[Diagram components: CB mappings, Wiki mappings, MongoDB, Lucene, Atlas, sCooL]
13
Inside sCooL: Refining Lucene results
\[ \text{Accuracy} = \frac{\text{True normalizations}}{\text{Non-null normalizations (True + False)}} \]
\[ \text{Coverage} = \frac{\text{True normalizations}}{\text{All normalizations (True + False + Null)}} \]
[Figure: Coverage and Accuracy plotted against threshold similarity (0 to 2)]
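The two metrics defined on this slide reduce to simple ratios over three counts. The sketch below follows those definitions directly; the function names and the example counts are made up for illustration.

```python
def accuracy(true_count: int, false_count: int) -> float:
    """True normalizations over non-null normalizations (true + false)."""
    non_null = true_count + false_count
    return true_count / non_null if non_null else 0.0


def coverage(true_count: int, false_count: int, null_count: int) -> float:
    """True normalizations over all normalizations (true + false + null)."""
    total = true_count + false_count + null_count
    return true_count / total if total else 0.0


# Hypothetical counts: 90 correct, 10 wrong, 100 left unnormalized.
print(accuracy(90, 10))       # 0.9
print(coverage(90, 10, 100))  # 0.45
```

Null results lower coverage but leave accuracy untouched, which is why the two curves move in opposite directions as the threshold changes.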
14
Evaluation: Comparing sCooL with existing implementation
Targeted metrics: accuracy and coverage
Precision is more important than recall
Stratified sampling to estimate the ratios
Favor high-frequency queries in sampling
15
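Evaluation: Comparing sCooL with existing implementation
\[ \Pr\left( \left| \frac{\hat{P}_i - P_i}{P_i} \right| < h_i \right) = C \]
\[ n_0 = \frac{Z_{\alpha}^{2}\, P_i (1 - P_i)}{h_i^{2}}, \qquad n_i = \frac{n_0}{1 + (n_0 - 1)/N_i} \]
\[ \hat{P} = \sum_{i=1}^{3} \frac{N_i}{\sum_i N_i}\, \hat{P}_i \]
Sampling design
Frequency groups [1, 6], [7, 39], [40, max]: 91%, 7%, 2%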
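The sample-size and estimator formulas on this slide can be written out as follows. This is a sketch under stated assumptions: the inputs `z=1.96`, `p=0.5`, and `h=0.05` are illustrative values not given in the slides, and the group sizes and estimates in the last call are toy numbers.

```python
def initial_sample_size(p: float, h: float, z: float = 1.96) -> float:
    """n0 = Z^2 * P * (1 - P) / h^2, before any population correction."""
    return z ** 2 * p * (1 - p) / h ** 2


def corrected_sample_size(n0: float, population: int) -> float:
    """n_i = n0 / (1 + (n0 - 1) / N_i), the finite population correction."""
    return n0 / (1 + (n0 - 1) / population)


def stratified_estimate(group_sizes, group_estimates) -> float:
    """Weight each group's estimate by its share of the total population."""
    total = sum(group_sizes)
    return sum(n / total * p for n, p in zip(group_sizes, group_estimates))


n0 = initial_sample_size(p=0.5, h=0.05)       # assumed inputs
n_small_group = corrected_sample_size(n0, population=3896)
overall = stratified_estimate([100, 100], [0.5, 1.0])  # toy numbers
print(round(n0, 2), round(n_small_group), overall)
```

The correction matters most for small groups: the smaller `N_i` is, the further `n_i` drops below `n0`.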
16
Evaluation: Comparing sCooL with existing implementation
Groups | Group Size | Sample Size | Sampling Rate | sCooL Accuracy | Existing System Accuracy
[1, 6] | 145,126 | 780 | 1% | 92% | 75%
[7, 39] | 11,938 | 736 | 6% | 96% | 79%
[40, max] | 3,896 | 653 | 17% | 95% | 85%
Total | 160,960 | 2,169 | 1% | 95% | 80%
Dataset | Coverage (sCooL) | Coverage (Existing System) | Weighted Coverage (sCooL) | Weighted Coverage (Existing System)
UK CareerBuilder data | 40% | 1% | 73% | 46%
17
Evaluation: Independent evaluation of sCooL
Test 1: 4ICU university list
The 4ICU [22] website lists 145 popular universities and colleges in the U.K.
Test 2: Guardian university list
The Guardian [23] publishes a list of 135 universities in the U.K.
Dataset | Accuracy (sCooL) | Accuracy (Existing System) | Coverage (sCooL) | Coverage (Existing System)
Test 1 (145) | 93% | 91% | 95% | 79%
Test 2 (135) | 93% | 90% | 88% | 72%
18
sCooL: Demo
Atlas http://ec2-54-193-1-73.us-west-1.compute.amazonaws.com/Atlas/
19
sCooL: Questions
20
sCooL: Appendix
Lucene search results for “University of Milan”
Rank | Searchable field | Display name
1 | polytechnic university of milan | Polytechnic University of Milan
2 | university of milan | University of Milan
3 | catholic university of milan | Università Cattolica del Sacro Cuore
4 | iulm university of milan | IULM University of Milan
5 | university of milan bicocca | University of Milan Bicocca
6 | milan university | University of Milan
7 | politecnico of milan | Polytechnic University of Milan
8 | milan polytechnic | Polytechnic University of Milan
21
sCooL: Appendix
String similarity algorithms
Rank String similarity algorithms
1 Levenshtein
2 Lucene Levenshtein
3 N-gram
4 Jaccard Similarity
5 Jaro Winkler
6 Hamming
7 Equals
8 Ignore case Equals
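Three of the simpler measures in this list can be sketched in a few lines each. These are generic textbook versions, not the implementations used in sCooL; in particular, computing Jaccard similarity over character bigrams is an assumption made here for illustration.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """All contiguous character n-grams of the text, as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection over the size of the union."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams count as equal
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def equals_ignore_case(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()


print(jaccard_similarity("salford", "salford"))
print(hamming_distance("loye", "luke"))
print(equals_ignore_case("CITY UNIVERSITY", "City University"))
```

Hamming distance only applies to strings of equal length, which limits its use on free-text institution names.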
22
Evaluation: Comparing sCooL with existing implementation
Balancing between Accuracy and Coverage
[Figure: Correct, Wrong, and Null counts (total input queries, 0 to 7000) plotted against threshold similarity (0 to 2)]
[Figure: Coverage and Accuracy plotted against threshold similarity (0 to 2)]
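The trade-off shown in these plots can be reproduced on toy data by tallying outcomes at different thresholds. The sketch assumes that a higher score means a closer match and that results scoring below the threshold are returned as null; the slides do not state the scale of the threshold axis, and the scored results below are invented.

```python
def tally(results, threshold: float) -> dict:
    """Count correct, wrong, and null outcomes at a given threshold.

    Each result is a (similarity_score, is_correct) pair for the best
    candidate found for one input query.
    """
    counts = {"correct": 0, "wrong": 0, "null": 0}
    for score, is_correct in results:
        if score < threshold:
            counts["null"] += 1      # rejected: no normalization returned
        elif is_correct:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Hypothetical scored results for five queries.
results = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.40, False)]

low = tally(results, threshold=0.5)
high = tally(results, threshold=0.8)
print(low)
print(high)
```

On this toy data, raising the threshold removes the wrong answer but also discards a correct one, so accuracy rises while coverage falls.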
23
About sCooL: Related work
Cucerzan, S. (Microsoft Research) worked on large-scale disambiguation using Wikipedia data (2007)
Jijkoun, V. et al. (Univ. of Amsterdam) proposed named entity normalization (NEN) in user-generated content (2008)
Liu, X. et al. (Microsoft Research, China) performed joint inference on named entity recognition (NER) and NEN for tweets (2012)
Magdy, W. et al. (IBM, Egypt) developed NEN for Arabic names (2007)
Jonnalagadda, S. et al. (Lnx Research, CA) developed NEMO, a NER and NEN system for PubMed author affiliations (2011)
Cohen, A. (OHSU) studied gene/protein NEN using automatically generated libraries (2005)