Arash Joorabchi & Abdulhussain E. Mahdi
Department of Electronic and Computer Engineering
University of Limerick, Ireland
A New Unsupervised Approach to Automatic Topical
Indexing of Scientific Documents According to
Library Controlled Vocabularies
ALISE 2013
Work Supported by: [sponsor logos]
Subject (Topical) Metadata in Libraries
• Un-controlled
Unrestricted author and/or reader-assigned keywords and keyphrases,
such as:
– Index Term-Uncontrolled (MARC-653)
• Controlled
Restricted cataloguer-assigned classes and subject headings, such as:
– DDC (MARC-082)
– LCC (MARC-050)
– LCSH/FAST (MARC-650)
The Case of Scientific Digital Libraries & Repositories
Archived materials include: journal articles, conference papers, technical
reports, theses & dissertations, book chapters, etc.
• Un-controlled Subject Metadata:
– Commonly available when enforced by editors, e.g., in case of published
journal articles & conf. proceedings, but rare in unedited publications.
– Inconsistent
• Controlled Subject Metadata:
– Rare due to the sheer volume of new materials published and high cost of
cataloguing.
– High level of incompleteness and inaccuracy due to oversimplified classification
rules, e.g., IF published by the Dept. of Computer Science THEN DDC: 004,
LCSH: Computer science
Automatic Subject Metadata Generation in Scientific Digital Libraries
& Repositories
Aims to provide a fully/semi automated alternative to manual
classification.
1. Supervised (ML-based) Approach:
– utilizing generic machine learning algorithms for text classification (e.g., NB, SVM, DT).
– challenged by the large scale and complexity of library classification schemes, e.g., deep
hierarchy, skewed data distribution, data sparseness, and concept drift [Jun Wang ’09].
2. Unsupervised (String Matching-based) Approach:
– String-to-string matching between words in a term list extracted from library thesauri &
classification schemes, and words in the text to be classified.
– Inferior performance compared to supervised methods [Golub et al. ‘06].
A New Unsupervised Concept-to-Concept Matching Approach - An Overview
[Diagram] Paper/Article (Full Text) → Wikipedia Concepts → Ranking → Key Concepts → WorldCat Database → MARC records sharing a key concept(s) with the paper/article → Inference → Paper/Article (MARC Rec.)
Paper/Article (MARC Rec.):
653: {…} (Key Concepts)
082: {…} (DDC)
650: {…} (FAST)
Example, Paper/Article (MARC Rec.):
653: {Wikipedia: HP 9000}
650: {FAST: HP 9000 (Computer)}
Wikipedia as a Crowd-Sourced Controlled Vocabulary
Extensive topic/concept coverage (> 4m English articles)
Up-to-date (lags Twitter by ~3h on major events [Osborne et al.’12])
Rich knowledge source for NLP (semantic relatedness, word sense
disambiguation)
Detailed description of concepts
Alternative Label
Related Term
Wikification using WikipediaMiner – an open source toolkit for mining
Wikipedia [Milne, Witten ‘09]
Block Edit Models for Approximate String Matching
Abstract
In this paper we examine the concept of string block edit distance, where two strings A and B are compared by
extracting collections of substrings and placing them into correspondence. This model accounts for certain phenomena
encountered in important real-world applications, including pen computing and molecular biology. The basic problem
admits a family of variations depending on whether the strings must be matched in their entireties, and whether overlap
is permitted. We show that several variants are NP-complete, and give polynomial-time algorithms for solving….
Wikipedia Concepts – Detection In Text
Descriptor: String (computer science)
Non-descriptors:
– character string
– text string
– binary string
Other Wikipedia articles matching "string": String (theory), String (rope), String (music), …
Wikipedia Concepts – Ranking Features
1. Occurrence Frequency
2. First Occurrence
3. Last Occurrence
4. Occurrence Spread
5. Length
6. Lexical Diversity
7. Lexical Unity
8. Avg Link Probability
9. Max Link Probability
10. Generality
11. Speciality
12. Distinct Links Count
13. Links Out Ratio
14. Links In Ratio
15. Avg Disambiguation Confidence
16. Max Disambiguation Confidence
17. Link-Based Relatedness to Other Topics
18. Link-Based Relatedness to Context
19. Cat-Based Relatedness to Other Topics
20. Translations Count
Key Wikipedia Concepts – Rank & Filtering
Un-supervised
Pros:
– easy to implement & fast
– plug & play, i.e., no training needed
Cons (naïve assumptions):
– Assumes all features carry the same weight
– Assumes all features contribute to the importance probability of candidates linearly
\[ \mathrm{Score}(topic_j) = \sum_{i=1}^{|F|} f_{ij} \]
where F is the set of ranking features and f_{ij} is the value of feature i for candidate topic j.
Genetic algorithm (ECJ) settings
Species: Float
Population Size: 40
Genome Size: 40
Chunk Size: 2
Min Gene: 0.0
Max Gene: 2.0
Elites: 1
Crossover Type: two points
Selection Method: Tournament
Mutation Type: Reset
Mutation Probability: 0.05
Threads: 2
Supervised
1. Initial population - a set of ranking functions with random weight and degree parameter values within a preset range
2. Evaluate the fitness of each ranking function.
3. (selection, crossover, mutation) -> new generation
4. Repeat steps 2 & 3 until the threshold is passed
\[ \mathrm{Score}(topic_j) = \sum_{i=1}^{|F|} w_i \, f_{ij}^{\,d_i} \]
where w_i and d_i are the weight and degree parameters of feature i.
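The two ranking functions on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of plain Python lists for feature vectors, and the toy feature values are all assumptions.

```python
from typing import Dict, List, Sequence


def unsupervised_score(features: Sequence[float]) -> float:
    """Equal-weight, linear score: the plain sum of a candidate's feature values."""
    return sum(features)


def weighted_score(features: Sequence[float],
                   weights: Sequence[float],
                   degrees: Sequence[float]) -> float:
    """Score with a weight w_i and degree d_i per feature: sum of w_i * f_i ** d_i."""
    if not (len(features) == len(weights) == len(degrees)):
        raise ValueError("features, weights and degrees must have equal length")
    return sum(w * f ** d for f, w, d in zip(features, weights, degrees))


def top_k(candidates: Dict[str, List[float]], k: int) -> List[str]:
    """Rank candidate topics by unsupervised score and keep the k best."""
    ranked = sorted(candidates,
                    key=lambda c: unsupervised_score(candidates[c]),
                    reverse=True)
    return ranked[:k]


# Hypothetical candidates with two feature values each.
toy = {"A": [0.1, 0.2], "B": [0.9, 0.5], "C": [0.4, 0.4]}
best_two = top_k(toy, 2)
```

With all weights and degrees set to 1, weighted_score reduces to unsupervised_score, which is why the genetic algorithm only has the two parameters per feature to learn (genome size 40, chunk size 2, for 20 features).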
Key Wikipedia Concepts – Evaluation Dataset & Measure
Wiki-20 dataset [Medelyan, Witten ‘08]:
20 Computer Science related papers/articles.
Each annotated by 15 Human Annotator (HA) teams independently.
HAs assigned an average of 5.7 topics per Doc.
an Avg. of 35.5 unique topics assigned per Doc.
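Rolling’s inter-indexer consistency (= F1):
\[ \text{Inter-indexer consistency}(A, B) = \frac{2c}{a + b} \]
where a and b are the numbers of topics assigned by indexers A and B, and c is the number of topics they have in common.
[Figure labels: HA1, HA2, HA3, MA, VK]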
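Rolling's measure is simple enough to compute directly. The sketch below assumes each indexer's topics are given as a collection of strings and compared by exact match; returning 0.0 when both sets are empty is an added convention, not something stated on the slide.

```python
from typing import Iterable


def rolling_consistency(topics_a: Iterable[str], topics_b: Iterable[str]) -> float:
    """Rolling's inter-indexer consistency: 2c / (a + b)."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0  # assumed convention for two empty topic sets
    common = len(a & b)  # c: topics assigned by both indexers
    return 2 * common / (len(a) + len(b))


# Hypothetical topic sets for two indexers.
score = rolling_consistency({"x", "y", "z"}, {"y", "z", "w", "v"})
```

Here the two indexers share 2 topics out of 3 and 4 assigned, so the consistency is 2*2 / (3+4) = 4/7.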
Key Wikipedia Concepts – Evaluation Results
Performance comparison with human annotators and rival machine annotators
(nk = number of keyphrases assigned per document; Min./Avg./Max. = avg. inter consistency with human annotators, %)
Method | Learning Approach | nk | Min. | Avg. | Max.
TFIDF (baseline) | n/a - unsupervised | 5 | 5.7 | 8.3 | 14.7
KEA++ (KEA-5.0) | Naïve Bayes | 5 | 15.5 | 22.6 | 27.3
Grineva et al. | n/a - unsupervised | 5 | 18.2 | 27.3 | 33.0
Maui | Naïve Bayes (all 14 features) | 5 | 22.6 | 29.1 | 33.8
Maui | Bagging decision trees (all 14 features) | 5 | 25.4 | 30.1 | 38.0
Human annotators (gold standard) | n/a - senior CS students | Varied, with an average of 5.7 per document | 21.4 | 30.5 | 37.1
CKE | n/a - unsupervised | 5 | 22.7 | 30.6 | 38.3
Current work | n/a - unsupervised | 5 | 19.1 | 30.7 | 37.9
Maui | Bagging decision trees (13 best features) | 5 | 23.6 | 31.6 | 37.9
Current work (LOOCV) | GA, threshold=800, unique bests method | 5 | 12.3 | 32.8 | 58.1
Current work (LOOCV) | GA, threshold=200, unique bests method | 5 | 13.9 | 32.9 | 56.7
Current work (LOOCV) | GA, threshold=400, unique bests method | 5 | 14.0 | 33.5 | 58.1
– Joorabchi, A. and Mahdi, A. Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms. In Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012)
– Joorabchi, A. and Mahdi, A. Automatic Keyphrase Annotation of Scientific Documents Using Wikipedia and Genetic Algorithms. To appear in the Journal of Information Science
Querying WorldCat Database
Input: top 30 key concepts in the document
Query sent to the WorldCat database for each key concept:
http://worldcat.org/webservices/catalog/search/sru?query=
    srw.kw = Doc_Key_Concept_Descriptor
    AND srw.ln exact eng //Language
    AND srw.la all eng //Language Code (Primary)
    AND srw.mt all bks //Material Type
    AND srw.dt exact bks //Document Type (Primary)
    &servicelevel = full
    &maximumRecords = 100
    &sortKeys = relevance,,0 //Descending order
    &wskey = [wskey]
Output: ≤100 potentially related MARC records
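As an illustration, the query above could be assembled as a URL like this. It is a sketch only: the function name is hypothetical, wrapping the descriptor in double quotes is an assumption (the slide shows it unquoted), and the code builds the URL without sending any request or making any claim about the service's current behaviour.

```python
from urllib.parse import urlencode

BASE_URL = "http://worldcat.org/webservices/catalog/search/sru"


def build_sru_url(descriptor: str, wskey: str) -> str:
    """Build the search URL for one key concept descriptor (no request is sent)."""
    # CQL query with the index constraints listed on the slide.
    cql = (
        f'srw.kw = "{descriptor}" '  # quoting the phrase is an assumption
        "AND srw.ln exact eng "      # language
        "AND srw.la all eng "        # language code (primary)
        "AND srw.mt all bks "        # material type
        "AND srw.dt exact bks"       # document type (primary)
    )
    params = {
        "query": cql,
        "servicelevel": "full",
        "maximumRecords": 100,
        "sortKeys": "relevance,,0",  # descending order
        "wskey": wskey,
    }
    return BASE_URL + "?" + urlencode(params)


# "MY_KEY" is a placeholder for a real web service key.
url = build_sru_url("Linear logic", "MY_KEY")
```

urlencode percent-encodes the spaces, quotes and commas, so the resulting string can be passed to any HTTP client as is.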
Refining Key Concepts Based on WorldCat Search Results
Inputs: Doc_Key_Concepts = {doc_key_concepts_i}, i ≤ 30; Marc_Recs_i = {marc_recs_{i,j}}, j ≤ 100; total_matches_i
Output: Refined_Doc_Key_Concepts
[Rule garbled in extraction. Recoverable structure: for each doc_key_concept_i ∈ Doc_Key_Concepts, IF one of several OR-ed conditions holds THEN discard doc_key_concept_i, ELSE add it to Refined_Doc_Key_Concepts. The conditions involve total_matches_i, a log(·) of total_matches_i compared with InDoc_Score_i, a 0.8 × InDoc_Score term, and the constants 10 and 20.]
e.g., “Logical conjunction”
e.g., “Logic” (72,353): 13.7 > 10.3 vs. “Linear logic” (17): 2.83 < 8.6
MARC Records Parsing, Classification, Concept Detection
Inputs: Doc_Key_Concepts = {doc_key_concepts_i}, i ≤ 20; Marc_Recs_i = {marc_recs_{i,j}}, j ≤ 100; total_matches_i
MARC fields parsed:
001 Control Number
245($a) Title Statement (Title)
505($a, $t) Formatted Contents Note
520($a, $b) Summary, Etc.
650($a) Subject Added Entry-Topical Term
653($a) Index Term-Uncontrolled
Outputs per MARC record:
– OCLC Classify* → DDC_{i,j}, FAST_{i,j}
– Wikipedia-Miner → Marc_Concepts_{i,j}
*OCLC Classify finds the most popular DDC & FASTs for the work using the OCLC FRBR Work-Set algorithm.
Measuring Relatedness Between MARC Records and the Article/Paper
Inputs: Doc_Key_Concepts = {doc_key_concepts_i}, i ≤ 20; Marc_Recs_i = {marc_recs_{i,j}}, j ≤ 100; total_matches_i; DDC_{i,j}, FAST_{i,j}, Marc_Concepts_{i,j}
Output: Relatedness_{i,j}
[Equations garbled in extraction. Recoverable terms: Shared_Concepts (concepts x with x ∈ Marc_Concepts and x ∈ Doc_Key_Concepts), InDoc_Score, InMarc_Freq, Normalized_Freq, Inverse_Marc_Freq (defined over All_Unique_Marc_Recs), and log(·) terms of these quantities summed over the shared concepts to give Relatedness_{i,j}.]
Weighting DDC Candidates
[Equations garbled in extraction. Recoverable terms, defined for each unique_ddc_k ∈ Unique_DDCs over Doc_Key_Concepts and Marc_Recs: Freq, Normalized_Freq, Inverse_Concept_Freq, Average_Relatedness, Inverse_Average_Total_Matches, Highest_Valid_DDCs_Count_Per_Concept. Weight(unique_ddc_k) combines log(·) terms of Freq, Normalized_Freq, Inverse_Concept_Freq and Inverse_Average_Total_Matches with Average_Relatedness.]
Weighting FAST Candidates
[Equations garbled in extraction. Recoverable terms, defined for each unique_fast_k ∈ Unique_FASTs over Doc_Key_Concepts and Marc_Recs: Freq, Normalized_Freq, Inverse_Concept_Freq, Average_Relatedness, Inverse_Average_Total_Matches, Highest_Valid_FASTs_Count_Per_Concept. Weight(unique_fast_k) combines log(·) terms of Freq, Normalized_Freq, Inverse_Concept_Freq and Inverse_Average_Total_Matches with Average_Relatedness.]
DDCs Weight Aggregation & Outlier Detection
Sort Unique_DDCs set based on DDCs depth in descending order
For each DDCi ∈ Unique_DDCs Do :
    For each DDCj ∈ Unique_DDCs Do :
        IF subclass(DDCi, DDCj) THEN
            IF weight(DDCi) > highest_DDC_weight/10 THEN
                weight(DDCi) = weight(DDCi) + weight(DDCj)
                Discard DDCj
            ELSE Discard DDCi
Example:
006.312 : 10.991574176537037 + 006.31 : 19.614959248944054 = 30.60653342548109
+ 006.3 : 12.77908859025236 = 43.385622015733446
Outlier detection: DDCi s.t. weight(DDCi) > (upper inner fence = Q3 + 1.5*IQ)
[Figure: box plot of DDC weights; labels: DDCi, DDCi+1, Upper + 1, Outlier]
*BoxPlot Outliers - DDCs whose weights lie an abnormal distance from the others’, i.e., mild and extreme outliers
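The aggregation loop and the box-plot fence can be sketched in a few lines. Several details here are assumptions rather than facts from the slide: subclass(DDCi, DDCj) is approximated by a digit-prefix test on the notation (facets after "--" are ignored), highest_DDC_weight is taken once before aggregation starts, and quartiles use the "inclusive" method of Python's statistics module.

```python
import statistics
from typing import Dict


def ddc_digits(ddc: str) -> str:
    """Digits of a DDC notation without the dot or any '--' facet part."""
    return ddc.split("--")[0].replace(".", "")


def is_subclass(child: str, parent: str) -> bool:
    """Assumed test: child is deeper than parent and extends its digits."""
    c, p = ddc_digits(child), ddc_digits(parent)
    return len(c) > len(p) and c.startswith(p)


def aggregate_ddc_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Fold ancestor weights into sufficiently heavy subclasses, deepest first."""
    w = dict(weights)
    threshold = max(w.values()) / 10  # highest_DDC_weight / 10, fixed up front
    for ci in sorted(weights, key=lambda d: len(ddc_digits(d)), reverse=True):
        if ci not in w:  # already discarded as an ancestor or weak subclass
            continue
        for cj in list(w):
            if not is_subclass(ci, cj):
                continue
            if w[ci] > threshold:
                w[ci] += w.pop(cj)  # absorb the ancestor and discard it
            else:
                del w[ci]           # weak subclass: discard it instead
                break
    return w


def upper_fence_outliers(weights: Dict[str, float]) -> Dict[str, float]:
    """Entries above the upper inner fence Q3 + 1.5 * IQR (needs >= 2 values)."""
    q1, _, q3 = statistics.quantiles(weights.values(), n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return {k: v for k, v in weights.items() if v > fence}


aggregated = aggregate_ddc_weights({
    "006.312": 10.991574176537037,
    "006.31": 19.614959248944054,
    "006.3": 12.77908859025236,
})
```

On the slide's example the three classes collapse into 006.312 with a combined weight of about 43.39.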
FASTs Weight Aggregation & Outlier Detection
Unique_FASTs := {x ∈ Unique_FASTs : weight(x) > highest_FAST_weight/10}
For each FASTi ∈ Unique_FASTs Do :
    For each FASTj ∈ Unique_FASTs Do :
        IF related(FASTi , FASTj) AND WC_SubjectUsage(FASTi) < WC_SubjectUsage(FASTj)
        THEN weight(FASTi) = weight(FASTi) + weight(FASTj)
[Figure: box plot of FAST weights; labels: FASTi, FASTi+1, FASTi+2, Outlier1 + Outlier2 + 1]
Example:
Expert systems (Computer science) 4.224295291384108
-> seeAlsoHeading: Artificial intelligence
-> seeAlsoHeading: Computer systems
-> seeAlsoHeading: Soft computing
-> subjectUsage: 14685.0
+ Artificial intelligence (subjectUsage: 36145.0) weight : 5.214271611745798 = 9.438566903129907
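A small sketch of the FAST aggregation rule follows. It assumes that related(FASTi, FASTj) means FASTj is listed among FASTi's seeAlso headings (as the example suggests), and that the added weights are the headings' weights before any aggregation; the slide does not settle either point, and the function and parameter names are illustrative.

```python
from typing import Dict, Iterable


def aggregate_fast_weights(weights: Dict[str, float],
                           see_also: Dict[str, Iterable[str]],
                           usage: Dict[str, float]) -> Dict[str, float]:
    """Boost less-used headings with the weights of related, more-used ones."""
    top = max(weights.values())
    # Keep only headings heavier than a tenth of the highest weight.
    kept = {h: w for h, w in weights.items() if w > top / 10}
    result = dict(kept)
    for hi in kept:
        for hj in kept:
            related = hj in see_also.get(hi, ())  # assumed meaning of related()
            if hi != hj and related and usage[hi] < usage[hj]:
                result[hi] += kept[hj]  # uses the pre-aggregation weight of hj
    return result


fast = aggregate_fast_weights(
    weights={
        "Expert systems (Computer science)": 4.224295291384108,
        "Artificial intelligence": 5.214271611745798,
    },
    see_also={
        "Expert systems (Computer science)": [
            "Artificial intelligence", "Computer systems", "Soft computing",
        ],
    },
    usage={
        "Expert systems (Computer science)": 14685.0,
        "Artificial intelligence": 36145.0,
    },
)
```

The more specific heading (lower WorldCat subject usage) gains the weight of its broader neighbour, while the broader heading keeps its own weight unchanged.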
DDCs Binary Evaluation
Wiki-20 dataset [Medelyan, Witten ‘08] containing 20 Computer Science related papers/articles.
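\[ \text{Precision} = \frac{\text{Number of correctly assigned}}{\text{Total assigned}} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{\text{Number of correctly assigned}}{\text{Total possible correct}} = \frac{TP}{TP + FN} \]
\[ F_1 = \frac{2 \times Pr \times Re}{Pr + Re} \]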
*Automatic Classification Toolbox for Digital Libraries (ACT-DL) by Bielefeld University Library and deployed at Bielefeld Academic Search Engine (BASE)
Doc ID | Predicted DDC (by current method) | True DDC | Predicted DDC (by ACT-DL*)
287 | 519.542 Decision theory | ✓ | 004
287 | 006.35 Natural language processing | ✓ |
7183 | 006.333 Deduction, problem solving, reasoning | ✓ | 004
7502 | 005.131 Symbolic logic | 006.333 Deduction, problem solving, reasoning | 004
9307 | 005.757--0218 Object-oriented databases--Standards | 005.757 Object-oriented databases | 004
10894 | 621.3815--0287 Components and circuits--Testing and measurement | 005.14 Verification, testing, measurement, debugging | 004
12049 | 005.43 Systems programs | 005.453 Compilers | 004
13259 | 001.6443 (invalid in DDC22 & DDC23) | 001.4226 Presentation of statistical data | 000
16393 | 004.53 Internal storage (Main memory) | 005.435 Memory management programs | 004
18209 | 005.115 Logic programming | ✓ | 004
19970, 20287 (row split between the two documents lost in extraction) | 511.322 Set theory | ✓ | 004, 004
19970, 20287 | 005.275 Programming for multiprocessor computers | ✓ |
19970, 20287 | 004.35 Multiprocessing | ✓ |
19970, 20287 | 004.33 Real-time processing | ✓ |
23267 | 005.117 Object-oriented programming | ✓ | 004
23507 | 495.6--5 Japanese--Grammar | 006.35 Natural language processing | 400
23596 | 658.4036--028546 Group decision making--Computer communications | ✓ | 150
25473 | 515.2433 Fourier and harmonic analysis | ✓ | 004
25473 | below threshold | 006.37 Computer vision |
37632 | 005.14 Verification, testing, measurement, debugging | ✓ | 004
39172 | 006.4--015116 Computer pattern recognition--Combinatorics | ✓ | 510
39955 | 005.117 Object-oriented programming | ✓ | 150
40879 | 004 Computer science | 006.31 Machine learning | 004
43032 | 005.262 Programming in specific programming languages | 005.26 Programming for personal computers | 004
Overall (current method): TP= 14, FP= 9, FN= 10, Pr= 0.61, Re= 0.58, F1= 0.60
Overall F1=[0.05, 0.75]
Imbalanced Training Set: 004: 78k, 005: 100, 006: 403
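The reported binary scores follow directly from the TP/FP/FN counts. A minimal check, with a hypothetical helper name and an added zero-division guard that the slide does not discuss:

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts reported for the DDC binary evaluation.
pr, re_, f1 = precision_recall_f1(tp=14, fp=9, fn=10)
```

Rounded to two decimals this gives Pr = 0.61, Re = 0.58 and F1 = 0.60, matching the summary line.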
DDCs Hierarchical Evaluation
Current Work (Wiki-20 dataset)
   | L1 | L2 | L3 | L4 | L5 | L6 | L7 | Facet | Avg.
TP | 21 | 21 | 18 | 17 | 15 | 10 | 2 | 2 |
FP | 2 | 2 | 5 | 5 | 5 | 4 | 2 | 3 |
FN | 3 | 3 | 6 | 7 | 8 | 4 | 1 | 0 |
Pr | 0.91 | 0.91 | 0.78 | 0.77 | 0.75 | 0.71 | 0.50 | 0.40 | 0.72
Re | 0.88 | 0.88 | 0.75 | 0.71 | 0.65 | 0.71 | 0.67 | 1.00 | 0.78
F1 | 0.89 | 0.89 | 0.77 | 0.74 | 0.70 | 0.71 | 0.57 | 0.57 | 0.73

ACT-DL (Wiki-20 dataset)
   | L1 | L2 | L3 | Avg.
TP | 16 | 16 | 1 |
FP | 4 | 4 | 19 |
FN | 4 | 4 | 19 |
Pr | 0.80 | 0.80 | 0.05 | 0.55
Re | 0.80 | 0.80 | 0.05 | 0.55
F1 | 0.80 | 0.80 | 0.05 | 0.55

ACT-DL (BASE dataset)
   | L1 | L2 | L3 | Avg.
Pr | 0.90 | 0.78 | 0.77 | 0.82
Re | 0.75 | 0.56 | 0.55 | 0.62
F1 | 0.81 | 0.63 | 0.62 | 0.69
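One way to read the level-wise evaluation is that level Ln compares the first n digits of the predicted and true DDC notations. The sketch below works under that assumption, which the slides do not spell out; it also assumes facet parts after "--" are ignored and that a prediction shorter than n digits is simply not scored at level n. The sample pairs are the nine mismatched predicted/true DDCs from the binary evaluation.

```python
from typing import List, Optional, Sequence, Tuple


def ddc_digits(ddc: str) -> str:
    """Digits of a DDC notation without the dot or any '--' facet part."""
    return ddc.split("--")[0].replace(".", "")


def agrees_at_level(predicted: str, true: str, level: int) -> Optional[bool]:
    """Compare the first `level` digits; None if the prediction is too shallow."""
    p, t = ddc_digits(predicted), ddc_digits(true)
    if len(p) < level:
        return None
    return p[:level] == t[:level]


def false_positives_by_level(pairs: Sequence[Tuple[str, str]],
                             levels: Sequence[int]) -> List[int]:
    """Count predictions that disagree with the true DDC at each level."""
    return [
        sum(agrees_at_level(p, t, lvl) is False for p, t in pairs)
        for lvl in levels
    ]


MISMATCHED = [
    ("005.131", "006.333"), ("005.757--0218", "005.757"),
    ("621.3815--0287", "005.14"), ("005.43", "005.453"),
    ("001.6443", "001.4226"), ("004.53", "005.435"),
    ("495.6--5", "006.35"), ("004", "006.31"), ("005.262", "005.26"),
]
fp_counts = false_positives_by_level(MISMATCHED, range(1, 5))
```

Under these assumptions the counts for L1 to L4 come out as 2, 2, 5 and 5, which agrees with the FP row of the current-work table for those levels.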
FASTs Binary Evaluation
Doc IDs: 287, 7183, 7502, 9307, 10894, 12049, 13259, 16393, 18209, 19970, 20287, 23267, 23507, 23596, 25473, 37632, 39172, 39955, 40879, 43032
(Row-to-document alignment was lost in extraction except where a Doc ID is shown.)
Doc ID | Predicted FAST | True FAST
287 | Bayesian statistical decision theory | ✓
287 | Bayesian statistical decision theory--Industrial applications | Natural language processing (Computer science)
287 | Maximum entropy method | Information retrieval
287 | Econometric models | Machine learning
7183 | Model-based reasoning | ✓
7183 | Knowledge acquisition (Expert systems) | ✓
7183 | Expert systems (Computer science) | ✓
 | Semantics | Conceptual structures (Information theory)
 | Case-based reasoning | ✓
 | Object-oriented databases | ✓
 | UML (Computer science) | Computer software--Development
 | Booch method | Computer-aided software engineering
 | Software patterns | ✓
 | Object-oriented methods (Computer science) | ✓
 | Object-oriented databases--Standards | Object-oriented programming (Computer science)
 | Regression analysis | ✓
 | Struts framework | Computer software--Quality control
 | Application software--Testing | ✓
 | Yacc (Computer file) | ✓
 | Assembling (Electronic computers) | Compiling (Electronic computers)
 | Three-dimensional display systems | ✓
 | Interactive computer systems | ✓
 | Interactive multimedia | Information visualization
 | Distributed shared memory | ✓
 | Intel i860 (Microprocessor) | Memory management (Computer science)
 | Cache memory | ✓
 | Virtual storage (Computer science) | ✓
 | Predicate (Logic) | ✓
 | Modality (Logic) | ✓
 | Set theory | ✓
 | Sorting (Electronic computers) | ✓
 | Parallel algorithms | ✓
 | Data transmission systems | Real-time data processing
 | Virtual computer systems | ✓
 | Parallel computers | ✓
 | Modula-3 (Computer program language) | Object-oriented methods (Computer science)
 | ML (Computer program language) | Object-oriented programming (Computer science)
 | Object-oriented databases | Computer software--Reusability
 | Abstract data types (Computer science) | ✓
 | English language--Noun phrase | ✓
 | Grammar, Comparative and general--Noun phrase | ✓
 | Automatic speech recognition | Computational linguistics
23596 | Teams in the workplace--Data processing | ✓
 | Data compression (Telecommunication) | ✓
 | Image compression | ✓
 | Signal processing--Mathematics | ✓
 | Wavelets (Mathematics) | ✓
 | Video compression | ✓
 | Digital video | ✓
 | Data compression (Computer science) | ✓
 | Software visualization | ✓
 | Debugging in computer science | ✓
 | Matching theory | ✓
 | Text processing (Computer science) | ✓
 | Graphical user interfaces (Computer systems) | Combinatorial analysis
 | Smalltalk (Computer program language) | Object-oriented programming languages
 | Objective-C (Computer program language) | Object-oriented programming (Computer science)
 | Automatic speech recognition | Machine learning
 | Speech processing systems | Classification
 | Supervised learning (Machine learning) | ✓
 | HP-UX | Software localization
 | Hewlett-Packard computers--Programming | User interfaces (Computer systems)
 | HP 9000 (Computer) | Computer interfaces
 | C (Computer program language) | ✓
Overall: TP= 40, FP= 24, FN= 24, Pre= Re= F1= 0.625
Semi-Supervised Classification
287: Clustering Full Text Documents
1. Bayesian statistical decision theory > 252.41740965808467
2. Bayesian statistical decision theory--Industrial applications > 223.09281028013865
3. Maximum entropy method > 223.09281028013865
4. Econometric models > 189.47706031373122
5. Economics, Mathematical > 188.4336672427764
6. Natural language processing (Computer science) > 176.13905753628868
7. Econometrics > 156.6469274464959
8. Distribution (Probability theory) > 120.64195152106359
9. Parsing (Computer grammar) > 102.72834662505807
10. Lexicology--Data processing > 101.39771816337012
11. Machine translating > 99.39171867148306
12. Text processing (Computer science) > 96.65689215290195
13. Information retrieval > 79.01359045012737
14. Semantic Web > 73.12618493349078
15. Probabilities > 70.99695859769267
16. Computational linguistics > 65.00474591701948
17. Machine learning > 60.14168210721469
18. Decision making > 50.302190572189424
19. Inference > 49.142891911243986
20. Interactive computer systems > 49.04810095707191
...
41. Mathematical physics > 25.256694185393123

12049: Occam's Razor: The Cutting Edge for Parser Technology
1. 005.43 > 449.17978755450434 (Systems programs)
2. 005.453 > 429.04491205387495 (Compilers)
3. 005.12 > 144.3981891584036
4. 510.7808 > 138.0169127750601
5. 005.26 > 105.58801291194308
6. 415 > 79.72358747591275
7. 001.6425 > 39.024619737391866
8. 004 > 36.433436906359425
Future Work
• Detecting Wikipedia topics in documents is computationally expensive.
Eliminate the need for sending queries to WorldCat and repeating the topic
detection process on matching MARC records by performing topic detection on
a locally held FRBRized version of the WorldCat DB.
• Complement topics extracted from the MARC records of a work
catalogued in WorldCat with common terms and phrases from its
content (as extracted by Google Books).
• Probabilistic mapping of Wikipedia concepts/articles to their
corresponding DDCs and FASTs (already initiated by OCLC Research
via developing VIAFbot for mapping Wikipedia biography articles to VIAF.org).
This work is supported by:
OCLC/ALISE Library & Information Science Research Grant Program
Irish Research Council 'New Foundations' Scheme
Thank You!
Questions…
For more information, please contact: