Word Embeddings - Semantics: What is in my...
Transcript of Word Embeddings - Semantics: What is in my...
![Page 1: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/1.jpg)
Word Embeddings - Semantics: What is in my Documents?
Laura DietzUniversity of New [email protected]
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
![Page 2: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/2.jpg)
The Problem
![Page 3: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/3.jpg)
The Solution
![Page 4: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/4.jpg)
Collections of Text
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Mr. Smithlives in theU.S.A. and reads crime novels.
words entities
topics
Smith
U.S.A.
crime novels
![Page 5: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/5.jpg)
Outline
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 6: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/6.jpg)
Why am I qualified to give this Talk?
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Laura Dietz - Computer Scientist
2000: Software Developer2004: Semantic Web2006: Machine Learning / Topic Models2011: Natural Language Processing / Entity Linking2013: Information Retrieval with KGs2016: Assistant Professor
![Page 7: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/7.jpg)
Outline: Topic Models
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 8: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/8.jpg)
Topic Models
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
- Same words are likely about the same topic.- Words in the same document are likely aboutthe same topic.
novels novels
crime
crime
news
![Page 9: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/9.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Topic Models
high if word is important for topic.high if topic important for doc.
novels novels
crime
crime
news
novels
newscrime
crime
[Blei et al 09]
![Page 10: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/10.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Topic Model Exercise - Apply rules to find topics!
A A
B
B
H
B
H B
A
doc 2: B A D I
doc 3 : B A F H
doc 4: E A F H
doc 5: E A G C
doc 1: B A G C
Rules:0. Assign each word a random topic.1. Assign two same words to the same topic.2. Assign two words in the same document the same topic.
![Page 11: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/11.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Topic Model Exercise - Solution
A A
B
B
H
B
H B
A
A: readB: politiciansC: novelsD: legalE: people
F: theG: crimeH: newsI: texts
doc 2: B A D Ipoliticians read legal texts
doc 3 : B A F H politicians read the news
doc 4: E A F Hpeople read the news
doc 5: E A G Cpeople read crime novels
doc 1: B A G Cpoliticians read crime novels
![Page 12: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/12.jpg)
Topic Model Toolkits
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
- LDA-c- Mallet- Topic Model Toolbox- Stanford Topic Modeling Toolbox- Tomoto
Extensions for:Authors [Rosen-Zvi 04], Time [Wang 06],Citation networks [Dietz 07], Ideal point [Gerrish 10] Friend-networks [Dietz 12], Taxonomies [Bakalov 12],and so many more....
![Page 13: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/13.jpg)
Topic Model Caveats
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Some topics are great (aka "spot on")others are merged/split or don't make sense.
It is impossible to know which topics are correct.
There are many correct solutions:
![Page 14: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/14.jpg)
Please Evaluate Tools!
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
When your research relies on a toolmake sure it works in *your* domain and for *your* task!
...otherwise you may draw wrong conclusions!
![Page 15: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/15.jpg)
Issues of Topic Models
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Topic models are based on assumptions that intuitively hold for topical words....but also for many "misleading" words.
Many politicians in the U.S.A. like to read crime novels.
We don't know which are the topical wordswe are looking for.
Which are the topical words?
![Page 16: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/16.jpg)
Topic Model Issues
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
politicians read crime novels politicians read legal texts
politicians read the news
people read the news
people read crime novels
polit
icia
ns
crime novels
read
"read" bridges two topics
![Page 17: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/17.jpg)
Outline: Word Embeddings
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 18: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/18.jpg)
Word Embeddings
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
read crime novels
politicians read
All words that fit hereare similar!
[Levy & Goldberg et al 14]
![Page 19: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/19.jpg)
Word Embeddings
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
crime novel news legal
politicians read
30.62.6-7
30.52.5-7
20.52.6-6
10.21.5-1
![Page 20: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/20.jpg)
Multiple Meanings with Word Embeddings
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
"Read" can have the same distance to both words,without these words being similar to each other.
polit
icia
ns
crime novels
read
0.70.7
10 0
1
cosine similarity = 0.7 0.7
0
![Page 21: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/21.jpg)
Word Embedding Issues
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Learns similarity according to types.(e.g., crime novels, legal texts & news are similar).
But often does not learn topical similarity.(e.g., a novel, its author, and its subject are different).
![Page 22: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/22.jpg)
Word Embedding Toolkits
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Word2vecGloVEGensim
Download pre-trained embeddings(avoid using embeddings from different domains)or train embeddings yourself (all you need is text)
(you may also like SeqToSeq)
![Page 23: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/23.jpg)
An Apology...
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
We computer scientists don't have a fool proof method for extracting topics from text.
(but next, a few things that work...)
![Page 24: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/24.jpg)
Outline: Text Classification
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 25: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/25.jpg)
Text Classification
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Caveat: Needs labeled training datafor *your* domain and *your* task!
Crime novels Politics
politicians read crime novels
![Page 26: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/26.jpg)
Text Classification (Naive Bayes)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Crime novels Politics
politicians read crime novels
peopleread crimenovels
shelikescrime she
wrotenovels
politread legal polit
read reports
polit.read news
crime: 2likes: 1novels: 2people: 1read: 3
legal: 1news: 1read: 3 reports: 1politicians: 3
![Page 27: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/27.jpg)
Text Classification Toolkits
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Support Vector Machines (SVM)Random ForestsWekaScikit.learn
![Page 28: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/28.jpg)
Often only a portion of text is on topic. (see Multi-label classification.)
Text Classification Issues
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Requires a lot of manual training data.(labor-intensive, not feasible for fine-grained topics).
Politics
Crime Novels
![Page 29: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/29.jpg)
Outline: Entity Linking
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 30: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/30.jpg)
Entity Linking
Smith
USAUSA
Smith
United States
Peter Smith
Smith
Smith
Smith
similarnames +context
many links between entitiesLaura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
popularity
[Ji & Grishman et al 11]
![Page 31: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/31.jpg)
Entity Linking
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
A black box that takes text and...
..spots mentions of (Wikipedia) entities in textand disambiguates among similarly named entities.
Mr. Smith Smith
USA
USA Smith
lives in the
USA.
Smith
Smith
![Page 32: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/32.jpg)
Wikipedia Entities
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
1 Wikipedia page = 1 entity
not just people, organization, and placesalso: Brexit, Economy, Immigration, Chocolate
Category: Foodsweetbrowndark
Chocolate
Link to TheobromineChocolate
USA
![Page 33: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/33.jpg)
Entities from Ontologies / Knowledge Graphs
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Other resources define semi-structured entities
Food
Chocolate
Theobromine
Chocolate
USA
has-compound:
of-category:
has-name:
description: sweet, dark brown food
![Page 34: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/34.jpg)
Entity Linking to Inspect your Collection
Smith
USA
Smith
USA
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Smith
USA
Smith
IBM
IBM
![Page 35: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/35.jpg)
Entity Linking Toolkits
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
- TagMe!- Smaph- DBpedia Spotlight- AIDA- ....
Idea: You can set up your own Wiki serverdefine concepts important to your researchand generate some training links.
![Page 36: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/36.jpg)
Entity Linking Issues
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Entity links are not saying much about topics.
One could use Wikipedia's categories.But these are often very surprisingly incomplete, inconsistent, and too fine-grained.
![Page 37: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/37.jpg)
Outline: Entity Aspects
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 38: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/38.jpg)
Entity Aspect Linking
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Harvested from sections of the entity's Wiki article.
Peter Smith
Peter Smith
Politics
Family
Smith
USA
Aspect "politics"
Aspect "family"
Refine entity links with aspects that match context.
text text ...
text text ...
[Nanni et al 18]
![Page 39: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/39.jpg)
Entity Aspect Example Application
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Twitter classification into different aspects of Brexit.
Law
Brexit
Economy
Immigration
Brexit willkill UK'sprosperity!
I don't like foreigners- #Brexit!
After brexitstreets willbe safe.
Economy Immigration
![Page 40: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/40.jpg)
Search Index
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Create a search index with documents.
Brexit willkill UK'sprosperity!
After brexitstreets willbe safe.I don't like
foreigners- #Brexit!
Create different descriptions of your topic.Use description = query to retrieve top 10.
![Page 41: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/41.jpg)
Outline: Search Index and Retrieval
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
![Page 42: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/42.jpg)
Search Index with Entities
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Documents can have fields:
Brexit willkill UK'sprosperity!
After brexitstreets willbe safe.
UKI don't like foreigners- #Brexit!
brexit
brexitUK
Query = "text:safe entity:brexit"
After brexitstreets willbe safe.
brexit
![Page 43: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/43.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Relevant Entities
Query EU UK relations dark chocolatehealth benefits
Queryentities
Latententities
EU
Brexit
Theresa May
chocolate
health
Theobromine
circulatorysystem
heart
Named Entities Concepts
UK
![Page 44: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/44.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
chocolate
health
Theobromine
circulatorysystem
heart
dark chocolatehealth benefits
Matching Entities in Documents
Document relevant?
Q:
![Page 45: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/45.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
chocolate
health
Theobromine
circulatorysystem
heart
dark chocolatehealth benefits
... dark chocolate ...
... health ... ...health... ... Theobromine ... circulatory system
Matching Entities in Documents by Name
Document relevant?
Q:
![Page 46: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/46.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
... health ... ...health... ... Theobromine ... circulatory system
chocolate
health
Theobromine
circulatorysystem
heart
dark chocolatehealth benefits
Matching Entities in Documents by Name
Document relevant?
Q:
![Page 47: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/47.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
chocolate
health
Theobromine
circulatorysystem
heart
dark chocolatehealth benefits
Entity Link
... dark chocolate ...
Matching Entities in Documents by Entity Links
... health ... ...health... ... Theobromine ... circulatory system
Document relevant?
Q:
![Page 48: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/48.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
... health ... ...health... ... Theobromine ... circulatory system
Matching Entities in Documents by Entity Links
chocolate
health
Theobromine
circulatorysystem
heart
dark chocolatehealth benefits
Document relevant?
Q:
![Page 49: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/49.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
chocolate
health
Theobromine
circulatorysystem
heart
... dark chocolate ...
dark chocolatehealth benefits
Matching Entities in Documents by Article Terms
... health ... ...health... ... Theobromine ... circulatory system
Document relevant?
Q:
![Page 50: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/50.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
chocolate
health
Theobromine
circulatorysystem
heart
dark chocolatehealth benefits
Entity Link
... dark chocolate ...
Combine All Names, Links, Terms
... dark chocolate ...
... health ....health... .. Theobromine ... circulatory system
... dark chocolate ...
... health ... ...health... ... Theobromine ... circulatory system
Document relevant?
Q:
![Page 51: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/51.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Using Entities as a Vocabulary of Concepts
use your favoriteretrieval model here!
Entity Link ...name ........ query term .... article term ...name ........
![Page 52: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/52.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Using Entities as a Vocabulary of Concepts
use your favoriteretrieval model here!
Entity Link ...name ........ query term .... article term ...name ........
How to fi
nd
rele
vant e
ntitie
s?
![Page 53: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/53.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Query: dark chocolate health benefits
Category: Foodsweetbrowndark
Chocolate
Theobromine
Query Entities through Entity Linking
![Page 54: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/54.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
1. Retrieve documents with a query2. Entity link documents3. Derive distribution over (bag of entities)(see pseudo relevance feedback / RM3)
1st 3rd2nd
Latent Entities through Pseudo-Relev. Feedback
[Dalton et al 14, Liu & Fang 15]
![Page 55: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/55.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Index Wikipedia pages or attribute sets of entities
Q
1st 3rd2nd
Cocoa_bean
Theobromine
Latent Entities through Retrieval
Retrieve entities from knowledge baseto obtain ranking of entities (with score) [Pound et al 10, Niklaev et al 16, Balog 18]
![Page 56: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/56.jpg)
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Document Retrieval with Entities
Query DocumentsEntities
Entities known ->to be relevant
Docs we ->want to rank
[Dalton et al 14]
![Page 57: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/57.jpg)
Search Index with Entities
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Brexit willkill UK'sprosperity!
After brexitstreets willbe safe.
UKI don't like foreigners- #Brexit!
brexit
brexitUK
Query = "text:safe entity:brexit
After brexitstreets willbe safe.
UKI don't like foreigners- #Brexit!
brexit
brexit
entity: entity1 entity2 entity3 ...
![Page 58: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/58.jpg)
Search Index (and Information Retrieval) Toolkits
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
IR:- Lucene (Java) / PyLuceneCombining different retrievals: Learning 2 Rank- Ranking SVM, RankLibEntity Retrieval: NordLys
Utilizing Knowledge Graphs for Information Retrieval:- my tutorial: github.com/laura-dietz/tutorial-utilizing-kg
- "KG4IR" Workshop at SIGIR Conference- Upcoming Special Issue
![Page 59: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/59.jpg)
Search Index with Entities
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Brexit willkill UK'sprosperity!
After brexitstreets willbe safe.
UKI don't like foreigners- #Brexit!
brexit
brexitUK
Query = "text:safe entity:brexit entity: entity1 entity2 entity3 ...
After brexitstreets willbe safe.
UKI don't like foreigners- #Brexit!
brexit
brexit
text: name1 name2 name3 ... text: word1 word2 word3 ..."
![Page 60: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/60.jpg)
Search Index (and Information Retrieval) Issues
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Issue 1:You need to guess a topic to look for.Issue 2:You still need to refine the results.
![Page 61: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/61.jpg)
Citations
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Topic Models: Blei & Lafferty. "Topic models." Text Mining. Chapman and Hall, 2009.Word Embeddings: Levy & Goldberg. "Dependency-based word embeddings." ACL 2014.Entity Linking: Ji & Grishman, "Knowledge base population: Successful approaches and challenges." NAACL-HLT, 2011.Aspects: Nanni, Ponzetto, Dietz, "Entity-aspect linking: providing fine-grained semantics of entities in context", JCDL 2018.Information Retrieval: Learning to Rank: Liu et al, "Learning to rank for information retrieval", FnTIR, 2009.Entity Retrieval:Pound, Mika, Zaragoza. "Ad-hoc object retrieval in the web of data", WWW, 2010.Nikolaev, Kotov, Zhiltsov. "Parameterized fielded term dependence models for ad-hoc entity retrieval from knowledge graph", SIGIR 2016.Balog. "Entity-Oriented Search (The Information Retrieval Series)", 2018, Springer.IR with Entities: Dietz, Kotov, Meij, "Utilizing Knowledge Graphs for Text-Centric Information Retrieval.", SIGIR 2018.Dalton, Dietz, Allan, "Entity query feature expansion using knowledge base links", SIGIR 2014.Liu & Fang, "Latent entity space: a novel retrieval approach for entity-bearing queries.", IRJ, 2015.Xiong, Callan, Liu, "Word-Entity Duet Representations for Document Ranking", SIGIR, 2017.
![Page 62: Word Embeddings - Semantics: What is in my Documents?cs.unh.edu/~Dietz/Talks/Csss18-what-is-in-my-collection.pdfWord Embeddings - Semantics: What is in my Documents? Laura Dietz University](https://reader036.fdocuments.in/reader036/viewer/2022062916/5eb762ab67f1d5547e6fd5eb/html5/thumbnails/62.jpg)
Conclusions
Laura Dietz [email protected] - Summer School Series on Methods for Computational Social Science 2018
Different techniques to inspect your documents.- topic models- word embeddings- text classification- entity linking- entity aspects- search index and retrieval (with entities)
There is no fool-proof method.Make sure the tools are doing what you need!