Hacking Human Language (PyCon Sweden 2015)
Hacking Human Language
Hendrik Heuer
PyCon
Stockholm, Sweden
Hacking?!
– Hacker Ethics
“Access to computers, and anything which might teach you something about the way the world works, should be unlimited and total. Always yield to the Hands-On Imperative!”
Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.
Agenda
• Computational Social Science
• Natural Language Processing
• Word Vector Representations
• Visualising and comparing my Google searches
D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI:10.1145/2184319.2184336
Computational Social Science
Digital Humanities
• combines computer science & social sciences
• makes new research possible, e.g. the analysis of massive social networks and content of millions of books
immersion.media.mit.edu
Massive-scale automated analysis of news-content
• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)
• automatically annotated into 15 topic areas
• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances
I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928
Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet
“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”
Male-to-Female Ratio
Named Entity Recognition
“Gender bias in sports coverage (...) females only account for between only 7 and 25 per cent of coverage”
• Machine Learning, Text Processing, Topic Modeling: scikit-learn, gensim, Natural Language Toolkit, spaCy, word2vec
• Visualization: d3.js, Google Chart API, Highcharts
Introduction to Natural Language Processing
nltk.org/book
Word Tokenization
Splitting a sentence into single words
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("All your base are belong to us")
['All', 'your', 'base', 'are', 'belong', 'to', 'us']
Sentence Tokenization
Splitting a text into sentences
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
>>> import nltk
>>> # Load the pre-trained Punkt model for Swedish (needs the NLTK 'punkt' data)
>>> swedish_tokenizer = nltk.data.load("tokenizers/punkt/swedish.pickle")
>>> # The loaded object is a tokenizer; its tokenize method splits a text into sentences
>>> sent_tokenize = swedish_tokenizer.tokenize
Stemming
Finding the word stem or root form
>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
['investigation', 'woman']
>>> [porter.stem(w) for w in ['investigation', 'women']]
['investig', 'women']
>>> [lancaster.stem(w) for w in ['investigation', 'women']]
['investig', 'wom']
Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…
>>> import nltk
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
 ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
 ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
 ('Norway', 'NNP'), ('.', '.')]
Tag   Meaning
NN*   Noun
VB*   Verb
JJ*   Adjective
RB*   Adverb
DT    Determiner
IN    Preposition
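The news study mentioned earlier used adjectives found by part-of-speech tagging as one signal of linguistic subjectivity. The following is a minimal sketch of that idea only, not the study's actual measure: `adjective_ratio` is a hypothetical helper that takes tagged tokens in the format returned by `nltk.pos_tag` and reports the share of adjectives. The tagged list is copied from the example above so the block runs without NLTK.

```python
# Tagged tokens in the format returned by nltk.pos_tag (copied from the example above).
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]


def adjective_ratio(tagged_tokens):
    """Share of adjectives (tags starting with JJ) among word tokens.

    Hypothetical helper for illustration; punctuation tags such as '.'
    are skipped because they do not start with a letter.
    """
    words = [(tok, tag) for tok, tag in tagged_tokens if tag[0].isalpha()]
    if not words:
        return 0.0
    adjectives = [tok for tok, tag in words if tag.startswith("JJ")]
    return len(adjectives) / len(words)


ratio = adjective_ratio(tagged)
print(f"{ratio:.3f}")  # one adjective ('same') out of 13 word tokens
```

A fuller measure would also weigh each adjective, for example with SentiWordNet scores, which this sketch leaves out.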
Named Entity Recognition
Identifying people, organizations, locations…
>>> import nltk
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]),
 ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'),
 ('in', 'IN'), ('the', 'DT'),
 Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])
Type          Examples
ORGANIZATION  Georgia-Pacific Corp., WHO
PERSON        Eddy Bonte, President Obama
LOCATION      Murray River, Mount Everest
DATE          June, 2008-06-29
TIME          two fifty a m, 1:30 p.m.
MONEY         GBP 10.40
PERCENT       twenty pct, 18.75 %
FACILITY      Washington Monument, Stonehenge
GPE           South East Asia, Midlothian (GPE = geo-political entity)
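To use the recognized entities, the chunk tree has to be flattened into (type, text) pairs. The sketch below shows one way to do that. `extract_entities` is a hypothetical helper, and `Chunk` is a small stand-in for `nltk.Tree` so the block runs without NLTK; the assumption is that a real `nltk.Tree` returned by `ne_chunk` can be passed in the same way, since it is also a list of children with a `label()` method.

```python
class Chunk(list):
    """Minimal stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def extract_entities(tree):
    """Return (entity type, text) pairs for the labelled chunks under the root."""
    entities = []
    for node in tree:
        # Chunks carry a label; plain (word, tag) leaves are tuples and do not.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((node.label(), text))
    return entities


# Same structure as the ne_chunk output shown above.
tree = Chunk('S', [
    Chunk('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]),
    ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'),
    ('in', 'IN'), ('the', 'DT'),
    Chunk('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
    ('.', '.'),
])

entities = extract_entities(tree)
print(entities)
# [('GPE', 'New York City'), ('GPE', 'United States')]
```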
Sentiment Analysis
Telling whether a sentence is positive or negative
Stanford Core NLP Tools
Vector Representations
–J. R. Firth 1957
“You shall know a word by the company it keeps”
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.
Quoted after Socher
Vectors are directions in space
Quoted after Socher
word2vec
Representing a word with a vector
T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781
Vectors can encode relationships
[Figure: the word pairs MAN/WOMAN, UNCLE/AUNT and KING/QUEEN drawn as vectors]
man is to woman as king is to ?
[Figure: the word pairs KING/KINGS and QUEEN/QUEENS drawn as vectors]
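The analogy question can be answered with vector arithmetic: compute woman - man + king and look for the closest remaining word. The sketch below illustrates this with hand-made three-dimensional toy vectors. The numbers are invented for illustration and are not trained word2vec vectors, and `analogy` is a hypothetical helper rather than a library function.

```python
import math

# Hand-made toy vectors (NOT trained); axes are roughly [royalty, male, female].
vectors = {
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "uncle": [0.2, 1.0, 0.0],
    "aunt":  [0.2, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = analogy("man", "woman", "king")
print(answer)  # queen
```

With trained vectors the match is only approximate, which is why the nearest neighbour by cosine similarity is used instead of an exact lookup.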
Sweden: most similar words
Harvard: most similar words
Link: https://radimrehurek.com/gensim/models/word2vec.html
Link: https://honnibal.github.io/spaCy/
spaCy
Dependency-based word representations by Levy and Goldberg
Gensim
word2vec by Mikolov et al.
5-word context window
2-word context window
Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
Applications
Machine Translation
T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available: http://arxiv.org/abs/1309.4168
Compare my Google searches
Link: https://support.google.com/websearch/answer/6068625?hl=en
{
  "event": [
    {"query": {"id": [{"timestamp_usec": "1317002730153183"}], "query_text": "google hangout"}},
    {"query": {"id": [{"timestamp_usec": "1316577601549660"}], "query_text": "eurokrise"}},
    {"query": {"id": [{"timestamp_usec": "1315592145720230"}], "query_text": "hoverboard"}},
    ...
  ]
}
parsed_json['event'][42]['query']['query_text']
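As a sketch of how the exported search history could be read, the block below parses a three-event excerpt in the shape shown above and collects each query with its time. It assumes that `timestamp_usec` holds microseconds since the Unix epoch, which matches the field name and the magnitude of the values but is not confirmed by the slides. The variable names `raw` and `searches` are illustrative.

```python
import json
from datetime import datetime, timezone

# Excerpt in the shape shown above; a real export contains many more events.
raw = """
{"event": [
  {"query": {"id": [{"timestamp_usec": "1317002730153183"}],
             "query_text": "google hangout"}},
  {"query": {"id": [{"timestamp_usec": "1316577601549660"}],
             "query_text": "eurokrise"}},
  {"query": {"id": [{"timestamp_usec": "1315592145720230"}],
             "query_text": "hoverboard"}}
]}
"""

parsed_json = json.loads(raw)

searches = []
for event in parsed_json["event"]:
    query = event["query"]
    # Assumption: the string holds microseconds since the Unix epoch.
    usec = int(query["id"][0]["timestamp_usec"])
    when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
    searches.append((when, query["query_text"]))

for when, text in searches:
    print(when.date(), text)
```

Keeping the timestamp next to each query makes it possible to group searches by period later, for example by quarter.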
1. Find word representations (word2vec)
2. Dimensionality reduction (t-SNE)
3. Output JSON
Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/
Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Link: https://github.com/mbostock/d3/wiki/Gallery
My Google Searches
Legend: Oct – Dec 2014 · Jul – Sep 2011 · Both, 2011 & 2014
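The three legend groups and the "Output JSON" step can be sketched as follows. The search terms and the 2-D coordinates are invented for illustration; in the real pipeline the coordinates would come from the t-SNE step. `period_label` and the record layout (`word`, `x`, `y`, `period`) are assumptions, not the format used in the talk.

```python
import json

# Hypothetical inputs: search terms per period and 2-D coordinates per term.
terms_2011 = {"hoverboard", "eurokrise", "python"}
terms_2014 = {"word2vec", "stockholm", "python"}
coords = {
    "hoverboard": (0.1, 0.9),
    "eurokrise": (0.4, 0.2),
    "python": (0.5, 0.5),
    "word2vec": (0.6, 0.55),
    "stockholm": (0.9, 0.1),
}


def period_label(term):
    """Assign a term to one of the three legend groups."""
    if term in terms_2011 and term in terms_2014:
        return "both"
    return "2011" if term in terms_2011 else "2014"


# One record per term, sorted so the output is stable between runs.
points = [
    {"word": term,
     "x": coords[term][0],
     "y": coords[term][1],
     "period": period_label(term)}
    for term in sorted(terms_2011 | terms_2014)
]

output = json.dumps(points, indent=2)
print(output)
```

A flat list of records like this is straightforward to load in a browser and bind to a scatter plot, with `period` selecting the colour.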
[email protected]
http://hen-drik.de
@hen_drik
Thanks to Andrii, Jussi & Roelof
Slides: https://tinyurl.com/pycon-word2vec
predict the current word (continuous bag-of-words, CBOW)
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i
predict the surrounding words (skip-gram)
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
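The two prediction tasks above differ only in which side is the input. The following sketch builds the training examples for both from a token list with a window of two words on each side. `training_examples` is a hypothetical helper; actual word2vec training adds details such as subsampling and randomly shortened windows that are left out here.

```python
def training_examples(tokens, window=2):
    """Build training examples for both prediction tasks.

    Returns (cbow, skipgram):
      cbow     - (context words, current word) pairs
      skipgram - (current word, one context word) pairs
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # Predict the current word from its surrounding words.
        cbow.append((context, target))
        # Predict each surrounding word from the current word.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


tokens = "all your base are belong to us".split()
cbow, skipgram = training_examples(tokens, window=2)

print(cbow[3])
# (['your', 'base', 'belong', 'to'], 'are')
print(skipgram[:2])
# [('all', 'your'), ('all', 'base')]
```

At the edges of the sentence the context is simply shorter, so the first and last words get fewer context words than the ones in the middle.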