Science 2.0 VU - KTIkti.tugraz.at/staff/elex/courses/science20/slides/...IE as Machine Learning Task...
www.tugraz.at
WISSEN · TECHNIK · LEIDENSCHAFT
Science 2.0 VU Processing Science 2.0 Data, Content Mining
WS 2015/16
Elisabeth Lex KTI, TU Graz
Agenda
• Repetition from last time: Open Science
• Processing academic resources
• Mining in academic resources (content perspective)
• Example:
  • ContentMine: Extraction of scientific facts
Repetition: Open Science
• Open Science
  • Ideas, concepts, benefits and pitfalls
  • E.g. enhancing collaboration and community-building, increasing efficiency of research vs. no reward system yet
• Open Data
  • Sharing your data influences how often you get cited (Piwowar et al., 2007 and Piwowar et al., 2013)
• Different models for Open Access
  • Green vs. Gold vs. Hybrid
Open Science – 5 schools of thought
Example: Open Government Data: Eurostat
“I’d like to compare the unemployment rate in Austria with other European ones”
Via Google Public Data Explorer, https://www.google.com/publicdata/directory
Open Access in Science: Open Access Journals
● Green ("self-archiving"): the author can self-archive at the time of submission of the publication, whether the publication is grey literature (usually internal and non-peer-reviewed), a peer-reviewed journal publication, a peer-reviewed conference proceedings paper or a monograph
● Gold ("author pays"): the author or the author's institution can pay a fee to the publisher at publication time; the publisher then makes the publication available 'free' at the point of access
● Further little-used "roads" and hybrid forms: for example platinum open access (does not charge author fees)...
● Green and gold are compatible and can co-exist
Source: Jeffery, K. Open Access: An Introduction, 2006. http://www.ercim.eu/publication/Ercim_News/enw64/jeffery.html
Processing Academic Resources
Motivation
• Aggregate scientific results
• Exploratory search in digital collections
• Find experts in domains
• Make science discoverable
• Improve access to scientific publications
• Extract facts for research
• Discover relationships
• Check for errors => improve science
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in Academic Resources
  • Information Extraction
  • Topic Modeling
  • Clustering/Classification
  • Linking publications
• Make available data and source code :-)
KDD Process
Datasets
• The European Library Open Dataset
  • Digital collection and 200 million bibliographic records
  • http://www.theeuropeanlibrary.org/tel4/access/data/opendata
• Datahub.io
  • E.g. DBLP Computer Science Bibliography http://datahub.io/dataset/dblp
  • Metadata of over 1.8 million publications by 1 million authors
Repositories and Aggregators
• ISI Web of Science
• Scopus
• Pubmed
• The European Library
• Library of Congress
• ArXiv
• Figshare
• Data Citation Index
• Mendeley
• Google Scholar
• CiteSeerX
• ...
APIs to Repositories ...
• APIs to access scientific publications and research data
  • rOpenSci: arXiv, PlosOne, Figshare
  • Mendeley: Developer API, http://dev.mendeley.com
    • Python package: pip install mendeley
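A minimal Python sketch of how such an API can be queried, using the public arXiv API as an example. The function names, the sample feed and its titles are made up for illustration; only the query endpoint and the Atom namespace belong to the arXiv API. No request is sent here: the sketch only builds the query URL and parses an Atom response.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds, which is the format the arXiv API returns.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_query_url(query, max_results=5):
    """Build a full-text search URL for the arXiv API."""
    params = urllib.parse.urlencode({
        "search_query": "all:" + query,
        "start": 0,
        "max_results": max_results,
    })
    return "http://export.arxiv.org/api/query?" + params


def parse_atom_titles(atom_xml):
    """Return the whitespace-normalized titles of all entries in an Atom feed."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(raw.split()))
    return titles


# Hypothetical response body (not real arXiv data), used instead of a live request.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>A made-up
    title</title></entry>
  <entry><title>Another made-up title</title></entry>
</feed>"""

url = build_arxiv_query_url("open science", max_results=3)
titles = parse_atom_titles(SAMPLE_FEED)
```

The URL can then be fetched, e.g. with urllib.request.urlopen, and the response body passed to parse_atom_titles.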
Example - rOpenSci
Information Extraction
• IE goal: extract structured information from unstructured content, e.g.
  • Method names, quantities, temporal expressions
  • Authors from scientific publications
  • Organizations in the acknowledgements section of papers
  • References
  • ...
IE Process
http://www.nltk.org/book/ch07.html
Input: raw text of a document
Output: list of (entity, relation, entity) tuples
Named entity types: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity)
Part-of-speech tagging: applying word classes to words within a sentence
IE Standard Approaches (1/2)
• Regular expressions / rule-based approaches
  • E.g. dates, email addresses, @user, RT @user
http://localhost:8888/notebooks/twitterprocessing.ipynb
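A small Python sketch of rule-based extraction with regular expressions. The patterns are deliberately simple assumptions (they will miss or over-match many real-world cases), and the sample tweet is invented.

```python
import re

# @user mentions that are not part of an email address
MENTION = re.compile(r"(?<![\w@])@(\w+)")
# Retweet markers of the form "RT @user" or "RT@user"
RETWEET = re.compile(r"\bRT\s*@(\w+)")
# Very rough email pattern: local part, "@", domain with at least one dot
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Dates such as 18/11/15, 18.11.2015 or 2015-11-18
DATE = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text):
    """Apply all patterns to a text and collect the matches per type."""
    return {
        "retweets": RETWEET.findall(text),
        "mentions": MENTION.findall(text),
        "emails": EMAIL.findall(text),
        "dates": DATE.findall(text),
    }


sample = "RT @alice: slides from 18/11/15 are online, mail bob@example.org or ping @carol"
entities = extract_entities(sample)
```

Note that the retweeted user also shows up among the mentions, since "RT @alice" contains a regular @user mention.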
IE as Machine Learning Task
• Supervised: train a model with annotated training data, use the trained model to classify unknown text
  • Choose a class label for a given input
  • Identify features of language data to classify it
  • Construct language models out of them
  • Learn about text/language from these models
• Methods:
  • Classifiers: Naive Bayes, Maxent models
  • Sequence models: Hidden Markov Models, CRFs
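To make the supervised setting concrete, here is a from-scratch multinomial Naive Bayes text classifier in Python. It is a sketch under simple assumptions: whitespace tokenization, add-one smoothing, and a tiny invented training set with the hypothetical labels "method" and "result". It is not the implementation of any particular library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods of the words
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented annotated training data
training_data = [
    ("we train a support vector machine on the corpus", "method"),
    ("a random forest classifier is used", "method"),
    ("the accuracy improved to 90 percent", "result"),
    ("precision and recall increased significantly", "result"),
]

classifier = NaiveBayesTextClassifier()
classifier.train(training_data)
label_a = classifier.classify("we use a naive bayes classifier")
label_b = classifier.classify("recall and accuracy increased")
```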
Libraries
• NLTK (http://www.nltk.org)
• http://localhost:8888/notebooks/science20-ie.ipynb
Mining academic documents
• Extraction of structural elements
  • Tables, figures, ...
• Extraction of facts from structural elements and the document
  • Named Entity Recognition (e.g. gene names, ...)
  • Relation extraction (e.g. system A impacts system B)
• Mostly: PDF format
  • Good for presentation, but problems with metadata quality and hard to analyse
• While PDF analysis tools exist, there is still room for improvement!
Approach
• Divide and conquer
  • Extract blocks from the PDF based on structure and layout information
  • Classify the extracted blocks
    • E.g. into title, body, references, abstract, ...
  • Classify the content of extracted blocks
    • E.g. tables
  • Extract relevant info from the content (named entities, nouns, dates, quantities, ...)
Approach
• Extracting blocks
  • Features: layout-specific, such as position, font, font size, ...
  • Apply machine learning approaches
    • Unsupervised (clustering)
    • Supervised (classification)
Unsupervised Approach
• Clustering: given a set of objects, find groupings of objects so that the similarity within a group is maximized and the similarity between groups is minimized
• Cluster = block
• Successive merge and split mechanism
Supervised Approach
• Classification: given a set of labeled examples, create a model and use it to predict the label of unknown examples
• Classify blocks: Maximum Entropy Models
  • Create training data by labeling blocks, i.e. assigning blocks to classes
  • Learn a model based on the training data and apply it to classify unknown blocks
  • Features: layout, formatting, word frequencies, ...
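A maximum entropy classifier with only two classes corresponds to logistic regression, so a from-scratch logistic regression can serve as a small stand-in for block classification. Everything specific here is invented for illustration: the two features (normalized font size, relative vertical position with 1.0 = top of page), the training blocks, and the function names. A real system would use many more features and classes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(examples, learning_rate=0.5, iterations=5000):
    """Batch gradient descent on the mean logistic loss.

    examples: list of (feature_vector, label) with label 1 (title) or 0 (body).
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            score = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(examples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(examples)
    return weights, bias


def predict_is_title(features, weights, bias):
    """True if the block is more likely a title than a body block."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score) >= 0.5


# Hypothetical labeled blocks: (normalized font size, relative vertical position)
labeled_blocks = [
    ((1.00, 0.90), 1),  # title
    ((0.90, 0.95), 1),  # title
    ((0.50, 0.50), 0),  # body
    ((0.50, 0.30), 0),  # body
    ((0.55, 0.10), 0),  # body
]

weights, bias = train_logistic_regression(labeled_blocks)
large_block_near_top = predict_is_title((0.95, 0.925), weights, bias)
small_block_mid_page = predict_is_title((0.50, 0.40), weights, bias)
```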
Fact Extraction from Publications
• Extract entities from within the identified blocks
  • E.g. author block: divide it further to extract all authors contained in the block
• Extract relations between entities
  • Open Information Extraction
    • Learns a model without needing training data
    • Can extract binary relations from sentences
Example: Measuring quality of Wikipedia
Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. 2012. Measuring the quality of web content using factual information. In Proceedings of WebQuality '12 at WWW '12
Figure 1: Histograms of Wikipedia corpora for (a) unbalanced dataset and (b) balanced dataset.
is the word count of t, and t is a Wikipedia article. The same holds for "Factual-density/sentence-count".
The word count measure outperforms the factual density measure normalized to sentence count as well as the word count on the unbalanced corpus. Apparently, word count is a strong feature on the unbalanced corpus.
We then evaluated the factual density measure on the balanced corpus where both featured/good and non-featured articles are more similar in respect to document length. The results for this experiment are shown in Figure 2(b) as precision-recall curves. On the balanced corpus, factual density normalized to sentence count as well as word count performs much better than on the unbalanced corpus, while word count, as expected, performs worse. There is not much difference between the normalization to word or sentence count since here, the number of words per document has a smaller influence on the result.
We also analyzed the distributions of featured/good and non-featured articles if factual density is used as measure, as depicted in Figure 3. We found that the distribution of the featured/good articles is clearly separated from the distribution of the non-featured articles, with peaks at two different factual density values (0.06 and 0.03 respectively). This finding is in contrast to the fact that the distributions of featured/good articles and non-featured articles have a high degree of overlap if word count is used, as shown in Figure 1(b). Consequently, on the balanced corpus, factual density clearly outperforms our baseline word count.
In a related experiment, we investigated the relational information contained in the binary relationships ReVerb extracts from sentences. We used the relations, i.e. only the predicates from the extracted triples as a vocabulary to represent the documents. We then tested the discriminative power of these features by training a classifier to solve the binary classification problem of distinguishing featured/good from non-featured articles. The results reported in Table 1 were obtained using the WEKA (http://www.cs.waikato.ac.nz/~ml/weka/) implementation of a Naive Bayes Classifier in combination with feature selection based on Information Gain (IG). From 40 000 relations, we selected the 10% best features in terms of IG. We achieved similar results for both corpora.
Figure 3: Distribution of articles by factual density.
Table 1: Classification results using relational features on both corpora.
Measure      Unbalanced [%]   Balanced [%]
Accuracy     84.01            87.14
F-Measure    84               86.7
Precision    84               89.2
Recall       84               87.1
Apparently, relational features are more robust when the document length varies. However, we need to investigate this in more detail.
Extract Topics from Publications
• Topic models: algorithms that uncover thematic structure in document collections
  • Facilitate searching, browsing, summarizing
• Latent Dirichlet Allocation (LDA)
  • Hierarchical probabilistic model
18/11/15
LDA
• Probabilistic model that helps find latent topics for documents
• Probabilistic model: treat data as observations that stem from a generative probabilistic process which involves hidden variables
  • Documents: the thematic structure forms the hidden variables
  • Each topic is described by words in the documents
LDA
• Infer hidden structure using posterior inference
  • "What are the topics that describe the documents?"
• Classify unknown data using the topic model
  • "How does unknown data fit into the estimated topic structure?"
• Number of topics Z has to be chosen in advance
  • Defines the level of specificity of topics
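$$P(t_i \mid d) = \sum_{j=1}^{Z} P(t_i \mid z_i = j)\, P(z_i = j \mid d)$$
• $P(t_i \mid d)$: probability of the i-th word for doc d
• $P(t_i \mid z_i = j)$: probability of $t_i$ within topic $z_i = j$
• $P(z_i = j \mid d)$: probability of using a word from topic $z_i = j$ in the doc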
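The three probabilities above combine as a mixture: the probability of a word in a document is the sum, over all topics, of the probability of the word within the topic times the probability of that topic in the document. The Python sketch below only evaluates this sum for given distributions; it does not perform LDA inference. The topic names and all probability values are invented.

```python
def word_probability(word, topic_word_probs, doc_topic_probs):
    """P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(
        topic_word_probs[topic].get(word, 0.0) * p_topic
        for topic, p_topic in doc_topic_probs.items()
    )


# Hypothetical word distributions of two topics (each sums to 1)
topic_word_probs = {
    "genetics": {"gene": 0.5, "dna": 0.4, "data": 0.1},
    "computing": {"data": 0.6, "model": 0.3, "gene": 0.1},
}
# Hypothetical topic mixture of one document (sums to 1)
doc_topic_probs = {"genetics": 0.75, "computing": 0.25}

p_gene = word_probability("gene", topic_word_probs, doc_topic_probs)
p_data = word_probability("data", topic_word_probs, doc_topic_probs)
p_all = sum(
    word_probability(w, topic_word_probs, doc_topic_probs)
    for w in ["gene", "dna", "data", "model"]
)
```

Since both the topic mixture and the per-topic word distributions sum to 1, the resulting word probabilities for the document sum to 1 as well.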
Example: Model evolution of topics over time in Science journal
• Dataset: pages of Science from 1880-2002 from the JSTOR archive
https://www.cs.princeton.edu/~blei/topicmodeling.html
Validation of extracted information
• Crowdsourcing as a way to evaluate mining quality
  • Share the extracted information via e.g. a Web-based platform
  • Enable users to give feedback
    • Accept, reject, suggest new concepts/facts
HowTo: Text Mining using rOpenSci
• Library that facilitates text mining on publications
  • Search for articles
  • Fetch articles
  • Get links for full text articles (xml, pdf)
  • Extract text from articles / convert formats
  • Collect bits of articles that you actually need
  • Download supplementary materials from papers
https://ropensci.org/tutorials/fulltext_tutorial.html
Chamberlain, Scott (2015). fulltext: Full Text of Scholarly Articles Across Many Data Sources. R package version 0.1.0. https://github.com/ropensci/fulltext
Example: Text Mining using rOpenSci
# include the library
library("fulltext")

# ft_search() - get metadata on a search query
(res1 <- ft_search(query = 'open science', from = 'arxiv'))
(out <- ft_get(res1))
res1$arxiv

# ft_get() - get full or partial text of articles
res <- ft_get('cs/9301113v1', from = 'arxiv')

# extract the full text
res2 <- ft_extract(res)
res2$arxiv$data

# extract interesting parts from the full text
out %>% chunks("doi")
Example: Text Mining using rOpenSci
• fulltext can extract parts of a paper via chunks():
  • "all", "front", "body", "back", "title", "doi", "categories", "authors", "keywords", "abstract", "executive_summary", "refs", "refs_dois", "publisher", "journal_meta", "article_meta", "acknowledgments", "permissions", "history"
• Can do PDF extraction
  • E.g. via GhostScript: (res_gs <- ft_extract(pdf, "gs"))
  • ..
https://ropensci.org/tutorials/fulltext_tutorial.html
Clustering of Academic Resources
• Detect groupings of papers based on content similarity
  • E.g. along topics
• Transform content (e.g. abstract of a paper) into a machine-readable representation
  • Bag-of-words approach: document treated as a bag of words/terms, represented as a vector
  • Document-term matrix: term frequencies across all documents
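A short Python sketch of the bag-of-words representation: each document becomes a row of term frequencies over a shared vocabulary. Tokenization by whitespace and the two sample abstracts are simplifying assumptions.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the frequency
    of vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Invented mini "abstracts"
abstracts = ["open science needs open data", "topic models for science"]
vocabulary, matrix = document_term_matrix(abstracts)
```

Word order is discarded on purpose: only the counts per term survive, which is what makes the documents comparable as vectors.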
Vector Space Model
• Documents are vectors in term-document space
• Elements of a vector are weights w_ij corresponding to doc i and term j
• Weights: frequencies of terms in docs
  • TF-IDF
• Proximity of documents (similarity) is calculated as the cosine of the angle between document vectors
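The weighting and the similarity measure can be sketched in a few lines of Python. This version assumes one common TF-IDF variant (raw term frequency times log(N / document frequency)) and whitespace tokenization; other variants exist. The three sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term occurs in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "topic models for text mining",
    "text mining of scientific publications",
    "clustering of gene expression data",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # share "text", "mining"
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no terms
sim_00 = cosine_similarity(vectors[0], vectors[0])  # identical documents
```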
Example: Facilitate exploratory search
• By topic of interest (cluster = topic of interest)
• Setting: social bookmarking dataset, URLs described by tags
• Research questions:
  • What clusters (aka groups of interests) exist?
  • Are they somehow related?
  • How do they evolve over time?
Clustering Algorithms
• KDD lectures!
• Here, briefly: K-means algorithm
  1. Select k points as initial centroids
  2. Repeat
  3.    Form k clusters by assigning all points to the closest centroid
  4.    Recompute the centroid of each cluster
  5. Until centroids don't change
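The steps above can be sketched in plain Python as follows. To keep the run deterministic, the initial centroids are passed in explicitly instead of being picked at random; the 2D points are invented, and the max_iterations guard is an added safety assumption.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    centroids = [tuple(c) for c in initial_centroids]  # step 1
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):                    # step 2: repeat
        # Step 3: assign every point to its closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute centroids (keep the old one for an empty cluster)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 5: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, initial_centroids=[points[0], points[3]])
```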
Example
Classification of Scientific Publications
• Categorize into an established subject-based taxonomy
  • E.g. Library of Congress
  • UNESCO thesaurus
  • DOAJ subject classification
  • Library of Congress Subject Headings
Linking Scientific Publications
• Citations (explicitly defined)
• Similarity
  • Statistical similarity: cosine
  • Semantic similarity: more complex, e.g. via topics
• Usage
• Argument support
• Contradiction
• ...
Linking via Citations
How?
• Aggregate and manage data: repositories, aggregators, datasets, ...
• Mining in Academic Resources
  • Information Extraction
  • Clustering / Classification
  • Linking publications
  • Search
• Make available data and source code :-)
Sharing code
• GitHub
• Bitbucket
• IPython Notebooks
• ...
Example: ContentMine
http://contentmine.org
Idea:
• Facts cannot be copyrighted
• Billions of facts in copyright-protected research articles
→ Make them publicly accessible!
Possible questions for ContentMine
• Find references to papers by a given author. This is metadata and therefore factual. It is usually trivial to extract references and authors. More difficult, of course, to disambiguate.
• Find who sponsors research. Extract acknowledgements and perform Named Entity Recognition to detect companies. Link the companies to the papers where they are listed in the acknowledgements.
Machine Extraction of scientific facts
1. Crawl scientific literature
2. Scrape each scientific article
3. Extract facts
4. Index
5. Republish (WikiData)
https://github.com/ContentMine
Example: retrieve metadata for specific article
Content Mining Problems
• Secondary publishers create walled gardens
  • E.g. ResearchGate portal
• Publishers' contracts ban content mining
• Publishers may cut off universities that mine
• Publishers lobby governments to require "licences for content mining"
UK → "the right to read is the right to mine"
http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/
Summary
• Aggregators/repos for scientific publications
• Mining content/data in publications
  • Information / fact extraction
  • Topic modeling
  • Clustering
    • E.g. exploratory analysis of large datasets
    • Find groups of interest expressed by user-generated tags and their relations
• ContentMine as example
Questions?
See you next week!