A Comparison of Different Strategies for Automated Semantic Document Annotation
1Chifumi Nishioka [email protected], K-CAP 2015
Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp
A Comparison of Different Strategies for Automated Semantic Document Annotation
Motivation [1/2]
• Document annotation
  – Helps users and search engines find documents
  – Requires a huge amount of human effort
  – e.g., subject indexers at ZBW labeled 1.6 million scientific documents in economics
• Semantic document annotation
  – Documents are annotated with semantic entities
  – e.g., PubMed and MeSH, ACM DL and ACM CCS
Focus on semantic document annotation
Necessity of automated document annotation
Motivation [2/2]
• Only small-scale experiments so far
  – Comparing a small number of strategies
  – Datasets containing a few hundred documents
• Comparison of 43 strategies for document annotation within the developed experiment framework
  – The largest number of strategies
• Experiments with three datasets from different domains
  – Contain full-texts of roughly 100,000 documents annotated by subject indexers
  – The largest dataset of scientific publications
We conducted the largest-scale experiment
Experiment Framework
Strategies are composed of methods from concept extraction, concept activation, and annotation selection
1. Concept Extraction: detect concepts (candidate annotations) from each document
2. Concept Activation: compute a score for each concept of a document
3. Annotation Selection: select annotations from concepts for each document
4. Evaluation: measure performance of strategies with ground truth
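The first three steps can be read as a function composition. The following is a minimal sketch, not the authors' implementation: the names `make_strategy`, `toy_extract`, `toy_activate`, and `toy_select` are hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical type aliases for the three components of a strategy.
Extractor = Callable[[str], List[str]]               # text -> concept mentions
Activator = Callable[[List[str]], Dict[str, float]]  # mentions -> concept scores
Selector = Callable[[Dict[str, float]], List[str]]   # scores -> annotations


def make_strategy(extract: Extractor, activate: Activator, select: Selector):
    """Compose one method per step into a single annotation strategy."""
    def annotate(text: str) -> List[str]:
        return select(activate(extract(text)))
    return annotate


# Toy stand-ins: whitespace tokens as concepts, frequency scores, top-2 selection.
def toy_extract(text: str) -> List[str]:
    return text.lower().split()


def toy_activate(mentions: List[str]) -> Dict[str, float]:
    return {c: float(n) for c, n in Counter(mentions).items()}


def toy_select(scores: Dict[str, float]) -> List[str]:
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda c: (-scores[c], c))[:2]


strategy = make_strategy(toy_extract, toy_activate, toy_select)
print(strategy("bank tax bank crisis tax bank"))  # ['bank', 'tax']
```

Swapping any one of the three callables yields a different strategy, which is how the 43 combinations in the experiment can be enumerated.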
Research Question
• Research questions addressed with the experiment framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs best?
Concept Extraction [1/2]
• Entity
  – Extract entities from documents using a domain-specific knowledge base
  – Domain-specific knowledge base
    • Entities (subjects) in a specific domain (e.g., medicine)
    • One or more labels for each entity
    • Relationships between entities
  – Detect entities by string matching with entity labels
• Tri-gram
  – Extract contiguous sequences of one, two, and three words in a document
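Tri-gram extraction as described above can be sketched in a few lines. This is an illustration only; the function name `word_ngrams` and the whitespace tokenization are assumptions, not details taken from the slides.

```python
from typing import List


def word_ngrams(text: str, max_n: int = 3) -> List[str]:
    """Return all contiguous word sequences of length 1..max_n."""
    words = text.split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


grams = word_ngrams("central bank interest rate")
print(grams)
```

For a four-word input this yields four unigrams, three bigrams, and two trigrams.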
Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
  – Unsupervised method for extracting keywords
  – Incorporates cooccurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
  – Unsupervised topic modeling method for inferring latent topics in a document corpus
  – Topic model
    • Topic: A probability distribution over words
    • Document: A probability distribution over topics
  – Treat a topic as a concept
Concept Activation [1/6]
• Three types of concept activation methods
  – Statistical Methods
    • Baseline
    • Use only directly mentioned concepts
  – Hierarchy-based Methods
    • Reveal concepts that are not mentioned explicitly using a hierarchical knowledge base
  – Graph-based Methods
    • Use only directly mentioned concepts
    • Represent concept cooccurrences as a graph
[Figure: the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" drawn as a graph with the nodes Tax, Bank, Interest Rate, Financial Crisis, and Central Bank]
Concept Activation [2/6]
• Statistical Methods
  – Frequency
    $\mathrm{score}_{\mathrm{freq}}(c, d) = \mathrm{freq}(c, d)$
    • depends on Concept Extraction methods
      – The number of appearances (Entity and Tri-gram)
      – The score output by RAKE (RAKE)
      – The probability of a topic for a document (LDA)
  – CF-IDF [Goossen et al. 11]
    • An extension of TF-IDF replacing words with concepts
    • Lower scores for concepts that appear in many documents
    $\mathrm{score}_{\mathrm{cfidf}}(c, d) = \mathrm{cf}(c, d) \cdot \log \frac{|D|}{|\{d' \in D : c \in d'\}|}$
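CF-IDF can be sketched directly from its definition: the concept frequency in a document times the log of the corpus size over the number of documents containing the concept. The sketch below assumes the natural logarithm (the slides do not state the base) and represents each document as a list of concept mentions; `cf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def cf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Score each concept c of each document d by cf(c, d) * log(|D| / df(c))."""
    n_docs = len(docs)
    # df(c): number of documents in which concept c occurs at least once.
    df = Counter(c for doc in docs for c in set(doc))
    scores = []
    for doc in docs:
        cf = Counter(doc)
        scores.append({c: cf[c] * math.log(n_docs / df[c]) for c in cf})
    return scores


docs = [
    ["bank", "bank", "tax"],
    ["bank", "crisis"],
    ["bank", "tax", "tax"],
]
scores = cf_idf(docs)
print(scores[0])
```

In this toy corpus "bank" occurs in every document, so its score is zero everywhere, which matches the remark that concepts appearing in many documents receive lower scores.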
Concept Activation [3/6]
• Hierarchy-based Methods [1/2]
  – Base Activation
    $\mathrm{score}_{\mathrm{base}}(c, d) = \mathrm{freq}(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} \mathrm{score}_{\mathrm{base}}(c_i, d)$
    • $C_l(c)$: a set of child concepts of a concept $c$
    • $\lambda$: decay parameter
[Figure: example concept hierarchy with the nodes World Wide Web, Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis; three concepts are labelled $c_1$, $c_2$, $c_3$]
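The recursive definition of Base Activation can be sketched as follows. Several things here are assumptions made for illustration: the hierarchy shape (only the node names come from the slide), the frequencies, and the decay value 0.5, since the value used in the talk is not preserved in this transcript. The sketch also assumes the hierarchy is acyclic.

```python
from typing import Dict, List


def base_activation(concept: str,
                    freq: Dict[str, int],
                    children: Dict[str, List[str]],
                    decay: float) -> float:
    """freq(c, d) plus the decayed sum of the children's base activation scores."""
    total = float(freq.get(concept, 0))
    for child in children.get(concept, []):
        total += decay * base_activation(child, freq, children, decay)
    return total


# Hypothetical hierarchy and concept frequencies for one document.
children = {
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Mining": ["Site Wrapping", "Web Log Analysis"],
}
freq = {"Web Searching": 2, "Site Wrapping": 1, "Web Log Analysis": 3}

web_mining = base_activation("Web Mining", freq, children, decay=0.5)
www = base_activation("World Wide Web", freq, children, decay=0.5)
print(web_mining, www)  # 2.0 2.0
```

Neither "Web Mining" nor "World Wide Web" is mentioned in the toy document, yet both receive a non-zero score from their descendants, which is the purpose of hierarchy-based activation.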
Concept Activation [4/6]
• Hierarchy-based Methods [2/2]
  – Branch Activation
    $\mathrm{score}_{\mathrm{branch}}(c, d) = \mathrm{freq}(c, d) + \lambda \cdot BN \cdot \sum_{c_i \in C_l(c)} \mathrm{score}_{\mathrm{branch}}(c_i, d)$
    • $BN$: reciprocal of the number of concepts that are located one level above a concept $c$
  – OneHop Activation
    $\mathrm{score}_{\mathrm{onehop}}(c, d) = \begin{cases} \mathrm{freq}(c, d) & \text{if } |C_l(c) \cap C_d| \geq 2 \\ \mathrm{freq}(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} \mathrm{freq}(c_i, d) & \text{otherwise} \end{cases}$
    • $C_d$: set of concepts in a document $d$
    • Activates concepts in a maximum distance of one hop
Concept Activation [5/6]
• Graph-based Methods [1/2]
  – Degree [Zouaq et al. 12]
    $\mathrm{score}_{\mathrm{degree}}(c, d) = \mathrm{degree}(c, d)$
    • $\mathrm{degree}(c, d)$: the number of edges linked with a concept $c$
  – HITS [Kleinberg 99; Zouaq et al. 12]
    • Link analysis algorithm for search engines [Kleinberg 99]
    $\mathrm{score}_{\mathrm{hits}}(c, d) = \mathrm{hub}(c, d) + \mathrm{auth}(c, d)$
[Figure: example concept graph with the nodes Tax, Bank, Interest Rate, Financial Crisis, and Central Bank]
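Degree and HITS scores can be sketched on a small directed concept graph. The slides do not say how edges are derived from co-occurrences, so the edge list below is an assumption: it links each concept in the example sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" to the next one. The HITS loop is the standard power iteration with L2 normalization, written out by hand rather than taken from the authors' code.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source concept, target concept)


def degree_scores(edges: List[Edge]) -> Dict[str, int]:
    """Number of edges (incoming or outgoing) attached to each concept."""
    scores: Dict[str, int] = {}
    for src, dst in edges:
        scores[src] = scores.get(src, 0) + 1
        scores[dst] = scores.get(dst, 0) + 1
    return scores


def hits_scores(edges: List[Edge], iterations: int = 50) -> Dict[str, float]:
    """hub(c) + auth(c) after a fixed number of HITS iterations."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of nodes pointing to n.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of nodes that n points to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return {n: hub[n] + auth[n] for n in nodes}


# Assumed edges: consecutive concepts in the example sequence.
edges = [
    ("Bank", "Interest Rate"), ("Interest Rate", "Financial Crisis"),
    ("Financial Crisis", "Bank"), ("Bank", "Central Bank"),
    ("Central Bank", "Tax"), ("Tax", "Interest Rate"),
]
degree = degree_scores(edges)
hits = hits_scores(edges)
print(degree["Bank"], degree["Tax"])  # 3 2
```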
Concept Activation [6/6]
• Graph-based Methods [2/2]
  – PageRank [Page et al. 99; Mihalcea & Paul 04]
    • Link analysis algorithm for search engines
    • Based on the intuition that a node that is linked from many important nodes is more important
    $\mathrm{score}_{\mathrm{page}}(c, d) = (1 - \mu) + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{\mathrm{score}_{\mathrm{page}}(c_i, d)}{|C_{out}(c_i)|}$
    • $C_{in}(c)$: set of concepts connected to $c$ by its incoming edges
    • $C_{out}(c)$: set of concepts connected to $c$ by its outgoing edges
    • $\mu$: damping factor
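The PageRank score above can be computed by iterating the formula until it stabilizes. In this sketch the damping factor 0.85 is an assumption (a commonly used value; the value from the talk is not preserved in this transcript), and `pagerank_scores` with its edge-list input is illustrative rather than the authors' code.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source concept, target concept)


def pagerank_scores(edges: List[Edge],
                    damping: float = 0.85,
                    iterations: int = 50) -> Dict[str, float]:
    """Iterate score(c) = (1 - mu) + mu * sum(score(c_i) / |C_out(c_i)|)."""
    nodes = sorted({n for e in edges for n in e})
    out_degree = {n: sum(1 for s, _ in edges if s == n) for n in nodes}
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        score = {
            n: (1 - damping) + damping * sum(
                score[s] / out_degree[s] for s, t in edges if t == n
            )
            for n in nodes
        }
    return score


# In a simple cycle every node passes its whole score on, so all scores stay at 1.
cycle = [("a", "b"), ("b", "c"), ("c", "a")]
cycle_scores = pagerank_scores(cycle)

# A node with two incoming edges ends up with a higher score than its sources.
fan_in = [("a", "c"), ("b", "c"), ("c", "a")]
fan_in_scores = pagerank_scores(fan_in)
print(cycle_scores, fan_in_scores)
```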
Annotation Selection
• Top-5 and Top-10
  – Select concepts whose scores are ranked in top-k
• k Nearest Neighbor (kNN) [Huang et al. 11]
  – Based on the assumption that documents with similar concepts share similar annotations
  1. Compute similarity scores between a target document and all documents with annotations
  2. Select union of annotations of k nearest documents
Example
[Figure: a target document (??) and four annotated documents labelled {Central bank, Law, Financial crisis}, {Finance, China}, {Human resource, Leadership}, and {Marketing, Competition law}, with similarity scores 0.49, 0.45, 0.42, and 0.60]
– Selected annotations: Finance; China; Marketing; Competition law
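The two kNN steps can be sketched as below. The document ids are hypothetical. The similarities 0.60 and 0.49 for the two selected documents follow from the example, while assigning 0.45 and 0.42 to the remaining two documents is an assumption, because the transcript does not preserve which score belongs to which of them.

```python
from typing import Dict, List


def knn_annotations(similarities: Dict[str, float],
                    annotations: Dict[str, List[str]],
                    k: int) -> List[str]:
    """Union of the annotations of the k most similar annotated documents."""
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    selected: List[str] = []
    for doc_id in nearest:
        for annotation in annotations[doc_id]:
            if annotation not in selected:
                selected.append(annotation)
    return selected


# Hypothetical ids; similarity of each annotated document to the target document.
similarities = {"doc_a": 0.45, "doc_b": 0.49, "doc_c": 0.42, "doc_d": 0.60}
annotations = {
    "doc_a": ["Central bank", "Law", "Financial crisis"],
    "doc_b": ["Finance", "China"],
    "doc_c": ["Human resource", "Leadership"],
    "doc_d": ["Marketing", "Competition law"],
}

selected = knn_annotations(similarities, annotations, k=2)
print(selected)  # ['Marketing', 'Competition law', 'Finance', 'China']
```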
Configurations [1/5]
[Diagram: a strategy combines one method from each of the three stages]
• Concept Extraction: Entity, Tri-gram, RAKE, LDA
• Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
• Annotation Selection: Top-k (2 methods), kNN (1 method)
Configurations [2/5]
[Same diagram]
24 strategies
Configurations [3/5]
[Same diagram, Statistical Methods (2 methods)]
15 strategies
Configurations [4/5]
[Same diagram, Statistical Method (Frequency)]
3 strategies
Configurations [5/5]
[Same diagram, Statistical Method (Frequency)]
43 strategies in total
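The per-slide counts add up to the stated total under the following reading, which is inferred from the result tables in the appendix rather than stated on these slides: Entity is combined with all 8 activation methods, Tri-gram with the 2 statistical and 3 graph-based methods, and RAKE with Frequency only, each under all 3 selection methods; LDA is combined with Frequency and kNN only.

```python
# Number of activation and selection methods per concept extraction method
# (an assumed reading, inferred from the rows and columns of the result tables).
activation_methods = {"Entity": 8, "Tri-gram": 5, "RAKE": 1, "LDA": 1}
selection_methods = {"Entity": 3, "Tri-gram": 3, "RAKE": 3, "LDA": 1}

strategy_counts = {
    name: activation_methods[name] * selection_methods[name]
    for name in activation_methods
}
total = sum(strategy_counts.values())

print(strategy_counts)  # {'Entity': 24, 'Tri-gram': 15, 'RAKE': 3, 'LDA': 1}
print(total)            # 43
```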
Datasets and Metrics of Experiments
                   Economics        Political Science    Computer Science
publication        ZBW              FIV                  SemEval 2010
# of publications  62,924           28,324               244
# of annotations   5.26 (± 1.84)    12.00 (± 4.02)       5.05 (± 2.41)
knowledge base     STW              European Thesaurus   ACM CCS
# of entities      6,335            7,912                2,299
# of labels        11,679           8,421                9,086
• Computer Science: SemEval 2010 dataset [Kim et al. 10]
  – Publications are originally annotated with keywords
  – We converted keywords to entities by string matching
• All publications and labels of entities are in English
• We use full-texts of publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure
(I) Best Performing Strategies
• Economics and Political Science datasets
  – The best strategy: Entity × HITS × kNN
  – F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
  – The best strategy: Entity × Degree × kNN
  – F-measure: 0.33 (computer science)
• The graph-based methods perform similarly to one another
In general, the document annotation strategy Entity × Graph-based method × kNN performs best
(II) Influence of Concept Extraction
• Concept Extraction method: Entity
  – Use domain-specific knowledge bases
  – Knowledge bases: freely available and of high quality
  – 32 thesauri listed in W3C SKOS Datasets
For Concept Extraction methods, Entity consistently outperforms Tri-gram, RAKE, and LDA
(III) Influence of Concept Activation
• Poor performance of hierarchy-based methods
  – We use full-texts in the experiments
    • Full-texts contain so many different concepts (203.80 unique entities on average, SD: 24.50) that further concepts do not have to be activated
  – However, OneHop can work as well as graph-based methods
    • It activates only concepts within a distance of one hop
For Concept Activation methods, graph-based methods are better than statistical methods or hierarchy-based methods
(IV) Influence of Annotation Selection
• kNN
  – No learning process
  – Confirms the assumption that documents with similar concepts share similar annotations
For Annotation Selection methods, kNN can enhance the performance
Conclusion
• Large-scale experiment for automated semantic document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
  – Novel combination of methods
• Best concept extraction method: Entity
• Best concept activation method: Graph-based methods
  – OneHop can achieve similar performance and requires less computation cost
Thank you! Questions?
Appendix
LDA (Latent Dirichlet Allocation)
source: D. M. Blei. Probabilistic topic models, CACM, 2012.
Entity Extraction and Conversion
• Entity extraction
  – String matching with entity labels
  – Starting with longer entity labels
    • e.g., From a text “financial crisis is …”, only an entity “financial crisis” is detected (not “crisis”).
• Converting to entities
  – Words and keywords are extracted in Tri-gram and RAKE
  – They are converted to entities by string matching with entity labels before annotation selection
  – If no matched entity label is found, the word or keyword is discarded
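One way to read "starting with longer entity labels" is greedy matching that tries the longest label first at each position. The sketch below takes that reading as an assumption; the label dictionary, the lowercasing, and the whitespace tokenization are illustrative and not taken from the slides.

```python
from typing import Dict, List


def extract_entities(text: str, labels: Dict[str, str]) -> List[str]:
    """Greedy left-to-right matching that prefers the longest label at each position."""
    words = text.lower().split()
    max_len = max(len(label.split()) for label in labels)
    found: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in labels:
                found.append(labels[candidate])
                i += n  # skip the matched words so shorter labels inside are ignored
                break
        else:
            i += 1  # no label starts at this word
    return found


# Hypothetical knowledge base: lowercase label -> entity.
labels = {"financial crisis": "Financial crisis", "crisis": "Crisis", "bank": "Bank"}
entities = extract_entities("financial crisis is a risk for a bank", labels)
print(entities)  # ['Financial crisis', 'Bank']
```

As in the slide's example, "crisis" is not reported separately because it lies inside the longer match "financial crisis".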
kNN [1/2]
• Similarity measure
  – Each document is represented as a vector where each element is a score of a concept
  – Cosine similarity is used as a similarity measure
Example: cosine similarity between $d_1$ and $d_2$
Concept         d_1    d_2
GDP             0.3    0.6
Immigration     0.5    0.0
Population      0.8    0.4
Bank            0.1    0.8
Interest rate   0.0    0.4
Canada          0.5    0.2
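The cosine similarity of the two example vectors can be worked out as follows. The sketch assumes that the first vector listed on the slide is $d_1$ and the second is $d_2$, with components in the order GDP, Immigration, Population, Bank, Interest rate, Canada.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Concept scores in the order: GDP, Immigration, Population, Bank, Interest rate, Canada.
d1 = [0.3, 0.5, 0.8, 0.1, 0.0, 0.5]
d2 = [0.6, 0.0, 0.4, 0.8, 0.4, 0.2]

similarity = cosine_similarity(d1, d2)
print(round(similarity, 3))  # 0.524
```

Here the dot product is 0.68 and the squared norms are 1.24 and 1.36, giving 0.68 / sqrt(1.24 · 1.36) ≈ 0.524.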
kNN [2/2]
[Figure, shown for both cases: a target document (??) and four annotated documents labelled {Central bank, Law, Financial crisis}, {Finance, China}, {Human resource, Leadership}, and {Marketing, Competition law}, with similarity scores 0.49, 0.45, 0.42, and 0.60]
• k = 1
  – Selected annotations: Marketing, Competition law
• k = 2
  – Selected annotations: Marketing, Competition law, Finance, China
Evaluation Metrics
• Precision
  $\mathrm{precision} = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{retrieved annotations}\}|}$
• Recall
  $\mathrm{recall} = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{relevant annotations}\}|}$
• F-measure
  $F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
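For a single document the three metrics can be computed from two sets. This is a minimal sketch: the gold annotations in the example are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated on the slide.

```python
from typing import Set, Tuple


def precision_recall_f(relevant: Set[str], retrieved: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of retrieved annotations against the ground truth."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and system output for one document.
relevant = {"Finance", "China", "Monetary policy"}
retrieved = {"Finance", "China", "Marketing", "Competition law"}

precision, recall, f_measure = precision_recall_f(relevant, retrieved)
print(precision, recall, f_measure)
```

Two of the four retrieved annotations are correct and two of the three gold annotations are found, so precision is 1/2, recall is 2/3, and the F-measure is 4/7.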
Datasets
• Economics dataset
  – 11 GB
• Political science dataset
  – 3.8 GB
Experiments
• Preprocessing documents
  – lemmatization
  – stop word removal
• 10-fold cross validation
  – split a dataset into 10 equal-sized subsets
  – 8 subsets for training data
  – 1 subset for testing data
  – 1 subset for optimizing parameters
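The 8/1/1 split can be sketched as a generator over the ten folds. Which subset serves for parameter optimization in each fold is not stated on the slide; using the subset that follows the test subset is an assumption made here, as is the name `folds_8_1_1`.

```python
from typing import Iterator, List, Sequence, Tuple


def folds_8_1_1(items: Sequence[int],
                n_folds: int = 10) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, tune, test) splits: 8 subsets train, 1 tunes parameters, 1 tests."""
    subsets = [list(items[i::n_folds]) for i in range(n_folds)]
    for test_idx in range(n_folds):
        tune_idx = (test_idx + 1) % n_folds  # assumed rotation scheme
        train = [
            x
            for j, subset in enumerate(subsets)
            if j not in (test_idx, tune_idx)
            for x in subset
        ]
        yield train, subsets[tune_idx], subsets[test_idx]


splits = list(folds_8_1_1(list(range(100))))
train, tune, test = splits[0]
print(len(splits), len(train), len(tune), len(test))  # 10 80 10 10
```

Each item appears in exactly one test subset across the ten folds, so every document is evaluated once.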
Result Table: Entity [1/2]
Economics
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .14 (.17)  .14 (.15)  .13 (.15) | .22 (.20)  .11 (.10)  .14 (.12) | .08 (.21)  .08 (.21)  .08 (.21)
CF-IDF       | .19 (.19)  .18 (.17)  .18 (.16) | .24 (.21)  .12 (.10)  .15 (.12) | .29 (.32)  .30 (.32)  .29 (.31)
Base Act.    | .10 (.14)  .09 (.13)  .09 (.13) | .18 (.19)  .09 (.09)  .12 (.11) | .20 (.30)  .20 (.30)  .20 (.29)
Branch Act.  | .08 (.14)  .08 (.12)  .08 (.12) | .17 (.19)  .08 (.09)  .11 (.11) | .17 (.28)  .17 (.28)  .17 (.27)
OneHop       | .12 (.16)  .12 (.14)  .12 (.14) | .19 (.19)  .09 (.09)  .12 (.11) | .35 (.34)  .36 (.34)  .35 (.33)
Degree       | .15 (.17)  .14 (.15)  .14 (.15) | .23 (.20)  .11 (.09)  .14 (.12) | .39 (.33)  .40 (.33)  .38 (.32)
HITS         | .14 (.17)  .14 (.15)  .14 (.15) | .23 (.20)  .11 (.10)  .14 (.12) | .40 (.32)  .40 (.32)  .39 (.31)
PageRank     | .14 (.17)  .14 (.15)  .14 (.15) | .22 (.20)  .11 (.09)  .14 (.12) | .39 (.33)  .40 (.33)  .38 (.32)
Political Science
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .12 (.11)  .18 (.16)  .14 (.12) | .15 (.13)  .12 (.10)  .13 (.10) | .14 (.17)  .05 (.07)  .07 (.09)
CF-IDF       | .05 (.07)  .12 (.16)  .07 (.10) | .07 (.09)  .08 (.10)  .07 (.09) | .24 (.22)  .14 (.14)  .17 (.16)
Base Act.    | .05 (.08)  .10 (.13)  .07 (.09) | .10 (.10)  .10 (.09)  .09 (.09) | .14 (.19)  .07 (.10)  .09 (.12)
Branch Act.  | .04 (.07)  .08 (.12)  .05 (.08) | .09 (.09)  .09 (.09)  .08 (.09) | .12 (.17)  .06 (.10)  .08 (.11)
OneHop       | .10 (.09)  .21 (.17)  .13 (.11) | .13 (.11)  .14 (.11)  .13 (.10) | .27 (.21)  .26 (.21)  .25 (.19)
Degree       | .10 (.09)  .21 (.17)  .13 (.11) | .13 (.11)  .14 (.11)  .13 (.10) | .29 (.21)  .28 (.21)  .27 (.19)
HITS         | .10 (.09)  .21 (.17)  .13 (.11) | .13 (.11)  .14 (.11)  .13 (.10) | .30 (.22)  .29 (.21)  .28 (.20)
PageRank     | .10 (.09)  .20 (.17)  .13 (.11) | .13 (.10)  .14 (.11)  .13 (.10) | .29 (.22)  .29 (.21)  .27 (.20)
Result Table: Entity [2/2]
Computer Science
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .18 (.21)  .14 (.15)  .15 (.16) | .22 (.22)  .08 (.08)  .12 (.11) | .49 (.28)  .24 (.16)  .30 (.17)
CF-IDF       | .02 (.08)  .02 (.06)  .02 (.06) | .03 (.11)  .01 (.04)  .02 (.05) | .47 (.29)  .23 (.17)  .29 (.18)
Base Act.    | .17 (.20)  .13 (.14)  .14 (.15) | .22 (.22)  .08 (.08)  .12 (.11) | .49 (.28)  .22 (.15)  .29 (.17)
Branch Act.  | .17 (.20)  .12 (.14)  .14 (.15) | .21 (.22)  .08 (.08)  .11 (.11) | .50 (.28)  .22 (.15)  .29 (.17)
OneHop       | .17 (.20)  .13 (.14)  .14 (.15) | .21 (.22)  .08 (.08)  .11 (.11) | .42 (.30)  .25 (.21)  .29 (.20)
Degree       | .17 (.21)  .13 (.15)  .14 (.16) | .22 (.22)  .08 (.08)  .12 (.11) | .49 (.28)  .27 (.17)  .33 (.18)
HITS         | .18 (.21)  .14 (.15)  .15 (.16) | .21 (.22)  .08 (.08)  .11 (.11) | .48 (.31)  .27 (.18)  .32 (.20)
PageRank     | .17 (.21)  .13 (.15)  .14 (.16) | .22 (.22)  .08 (.08)  .12 (.11) | .50 (.29)  .25 (.15)  .31 (.18)
Result Table: Tri-gram
Economics
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .12 (.15)  .12 (.14)  .11 (.14) | .19 (.19)  .10 (.10)  .13 (.12) | .08 (.22)  .08 (.22)  .08 (.21)
CF-IDF       | .10 (.12)  .10 (.12)  .09 (.11) | .17 (.17)  .08 (.10)  .12 (.12) | .07 (.20)  .06 (.22)  .06 (.20)
Degree       | .03 (.09)  .03 (.08)  .03 (.08) | .03 (.09)  .03 (.08)  .03 (.08) | .07 (.21)  .07 (.21)  .07 (.20)
HITS         | .02 (.06)  .02 (.06)  .02 (.06) | .02 (.06)  .02 (.06)  .02 (.06) | .08 (.22)  .08 (.22)  .07 (.21)
PageRank     | .03 (.09)  .03 (.08)  .03 (.08) | .03 (.09)  .03 (.08)  .03 (.08) | .10 (.20)  .04 (.08)  .05 (.11)
Political Science
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .06 (.08)  .14 (.16)  .08 (.10) | .10 (.10)  .11 (.11)  .10 (.09) | .08 (.14)  .05 (.08)  .06 (.09)
CF-IDF       | .05 (.05)  .06 (.07)  .05 (.06) | .09 (.10)  .09 (.10)  .08 (.09) | .09 (.15)  .04 (.08)  .06 (.10)
Degree       | .01 (.03)  .03 (.07)  .01 (.04) | .01 (.03)  .03 (.07)  .01 (.04) | .11 (.14)  .03 (.05)  .05 (.07)
HITS         | .01 (.03)  .02 (.06)  .01 (.03) | .01 (.03)  .00 (.06)  .01 (.03) | .12 (.14)  .04 (.06)  .06 (.08)
PageRank     | .01 (.04)  .03 (.08)  .02 (.05) | .01 (.04)  .03 (.08)  .02 (.05) | .08 (.12)  .03 (.05)  .04 (.06)
Computer Science
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .26 (.24)  .20 (.18)  .22 (.19) | .54 (.30)  .20 (.13)  .29 (.17) | .44 (.28)  .25 (.18)  .30 (.19)
CF-IDF       | .23 (.24)  .18 (.18)  .19 (.19) | .54 (.29)  .22 (.14)  .30 (.17) | .48 (.28)  .20 (.14)  .26 (.15)
Degree       | .09 (.15)  .07 (.11)  .07 (.12) | .13 (.19)  .05 (.07)  .07 (.09) | .48 (.29)  .23 (.16)  .29 (.18)
HITS         | .05 (.14)  .04 (.09)  .04 (.10) | .11 (.18)  .04 (.06)  .06 (.09) | .39 (.29)  .26 (.21)  .28 (.19)
PageRank     | .02 (.06)  .02 (.05)  .02 (.06) | .03 (.08)  .01 (.03)  .02 (.05) | .46 (.29)  .25 (.18)  .30 (.18)
Result Table: RAKE
Economics
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .08 (.14)  .08 (.12)  .08 (.12) | .15 (.18)  .07 (.08)  .10 (.11) | .34 (.33)  .34 (.33)  .33 (.32)
Political Science
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .04 (.07)  .08 (.13)  .05 (.08) | .07 (.09)  .08 (.09)  .07 (.08) | .31 (.23)  .18 (.15)  .22 (.17)
Computer Science
             | top-5                           | top-10                          | kNN
Method       | Recall     Precision  F         | Recall     Precision  F         | Recall     Precision  F
Frequency    | .24 (.24)  .17 (.16)  .19 (.17) | .42 (.28)  .15 (.10)  .22 (.14) | .42 (.27)  .20 (.13)  .25 (.15)
Result Table: LDA
Economics
             | kNN
Method       | Recall     Precision  F
Frequency    | .19 (.30)  .19 (.30)  .19 (.30)
Political Science
             | kNN
Method       | Recall     Precision  F
Frequency    | .15 (.19)  .15 (.18)  .14 (.17)
Computer Science
             | kNN
Method       | Recall     Precision  F
Frequency    | .28 (.27)  .24 (.23)  .24 (.22)
Materials
• Code
  – https://github.com/ggb/ShortStories
• Datasets
  – economics and political science
    • not publicly available yet
    • contact us directly if you are interested
  – computer science
    • publicly available
Presentation
• K-CAP 2015
  – International Conference on Knowledge Capture
  – Scope
    • Knowledge Acquisition / Capture
    • Knowledge Extraction from Text
    • Semantic Web
    • Knowledge Engineering and Modelling
    • …
• Time slot
  – Presentation: 25 minutes
  – Q & A: 5 minutes
References [1/2]
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation, JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender, WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Grosse-Bolting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases, ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles, JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, International Workshop on Semantic Evaluation, 2010.
References [2/2]
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999.
• [Mihalcea & Paul 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts, EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, TR of Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents, Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept detection, ESWC, 2012.