Hacking Human Language (PyCon Sweden 2015)

Hacking Human Language Hendrik Heuer PyCon Stockholm Sweden

Transcript of Hacking Human Language (PyCon Sweden 2015)

Page 1: Hacking Human Language (PyCon Sweden 2015)

Hacking Human Language
Hendrik Heuer
PyCon, Stockholm, Sweden

Page 2: Hacking Human Language (PyCon Sweden 2015)

Hacking?!

Page 3: Hacking Human Language (PyCon Sweden 2015)

– Hacker Ethics

“Access to computers — and anything which might teach you something about the way the world works — should be unlimited and total. Always yield to the Hands-On Imperative!”

Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.

Page 4: Hacking Human Language (PyCon Sweden 2015)

Agenda

• Computational Social Science

• Natural Language Processing

• Word Vector Representations

• Visualising and comparing my Google searches

Page 5: Hacking Human Language (PyCon Sweden 2015)

D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI:10.1145/2184319.2184336

Page 6: Hacking Human Language (PyCon Sweden 2015)

Computational Social Science Digital Humanities

• combines computer science & social sciences

• makes new research possible, e.g. the analysis of massive social networks and content of millions of books

immersion.media.mit.edu

Page 7: Hacking Human Language (PyCon Sweden 2015)

Massive-scale automated analysis of news-content

• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)

• automatically annotated into 15 topic areas

• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances

I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928

Page 8: Hacking Human Language (PyCon Sweden 2015)

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

Page 9: Hacking Human Language (PyCon Sweden 2015)

“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

Page 10: Hacking Human Language (PyCon Sweden 2015)

Male-to-Female Ratio
Named Entity Recognition

Page 11: Hacking Human Language (PyCon Sweden 2015)

Male-to-Female Ratio
Named Entity Recognition

“Gender bias in sports coverage (...) females only account for between only 7 and 25 per cent of coverage”

Page 12: Hacking Human Language (PyCon Sweden 2015)

Machine Learning: scikit-learn

Text Processing, Topic Modeling: gensim, Natural Language Toolkit, spaCy, word2vec

Visualization: d3.js, Google Chart API, Highcharts

Page 13: Hacking Human Language (PyCon Sweden 2015)

Introduction to Natural Language Processing

Page 14: Hacking Human Language (PyCon Sweden 2015)

nltk.org/book

Page 15: Hacking Human Language (PyCon Sweden 2015)

Word Tokenization
Splitting a sentence into single words

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("All your base are belong to us")
['All', 'your', 'base', 'are', 'belong', 'to', 'us']

Page 16: Hacking Human Language (PyCon Sweden 2015)

Sentence Tokenization
Splitting a text into sentences

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
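To see why a trained sentence tokenizer is useful here, compare it with a naive rule. The sketch below is not part of the talk; it assumes a hypothetical helper, naive_sent_split, that simply splits after every '.', '!' or '?' followed by whitespace, and it stumbles over the abbreviation "Mr." that sent_tokenize handles correctly.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sentences = naive_sent_split("Hello, Mr. Anderson. We missed you!")
print(sentences)
# The abbreviation "Mr." is wrongly treated as the end of a sentence:
# ['Hello, Mr.', 'Anderson.', 'We missed you!']
```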

Page 17: Hacking Human Language (PyCon Sweden 2015)

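Sentence Tokenization
Splitting a text into sentences

>>> import nltk
>>> # Load the pre-trained Punkt model for Swedish; its tokenize method splits a text into sentences
>>> swedish_tokenizer = nltk.data.load("tokenizers/punkt/swedish.pickle")
>>> sent_tokenize = swedish_tokenizer.tokenize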

Page 18: Hacking Human Language (PyCon Sweden 2015)

Stemming
Finding the word stem or root form

>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
['investigation', 'woman']
>>> [porter.stem(w) for w in ['investigation', 'women']]
['investig', 'women']
>>> [lancaster.stem(w) for w in ['investigation', 'women']]
['investig', 'wom']

Page 19: Hacking Human Language (PyCon Sweden 2015)

Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…

>>> import nltk
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
 ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
 ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
 ('Norway', 'NNP'), ('.', '.')]

NN*  Noun
VB*  Verb
JJ*  Adjective
RB*  Adverb
DT   Determiner
IN   Preposition
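The tag prefixes in the legend can be used to group tokens into coarse word classes, for example to measure how large the share of adjectives in a text is. The following is a minimal sketch, not the method of the news-content study: the tagged list is copied from the pos_tag output above, while PREFIXES, coarse_category and adjective_share are names made up for this illustration.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output shown on the slide.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]

# Hypothetical mapping from two-letter tag prefixes to coarse word classes.
PREFIXES = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective', 'RB': 'adverb'}


def coarse_category(tag):
    """Map a Penn Treebank tag such as 'NNP' to a coarse class via its prefix."""
    return PREFIXES.get(tag[:2], 'other')


counts = Counter(coarse_category(tag) for _, tag in tagged)
adjective_share = counts['adjective'] / len(tagged)

print(counts)
print(round(adjective_share, 3))
```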

Page 20: Hacking Human Language (PyCon Sweden 2015)

Named Entity Recognition
Identifying people, organizations, locations…

>>> import nltk
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]),
           ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'),
           ('in', 'IN'), ('the', 'DT'),
           Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])

ORGANIZATION                 Georgia-Pacific Corp., WHO
PERSON                       Eddy Bonte, President Obama
LOCATION                     Murray River, Mount Everest
DATE                         June, 2008-06-29
TIME                         two fifty a m, 1:30 p.m.
MONEY                        GBP 10.40
PERCENT                      twenty pct, 18.75 %
FACILITY                     Washington Monument, Stonehenge
GPE (geo-political entity)   South East Asia, Midlothian

Page 21: Hacking Human Language (PyCon Sweden 2015)

Sentiment Analysis
Tell if a sentence is positive or negative

Page 22: Hacking Human Language (PyCon Sweden 2015)

Stanford Core NLP Tools

Page 23: Hacking Human Language (PyCon Sweden 2015)

Vector Representations

Page 24: Hacking Human Language (PyCon Sweden 2015)

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.

Page 25: Hacking Human Language (PyCon Sweden 2015)

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Quoted after Socher
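Firth's idea can be made concrete by counting which words appear near each other. The sketch below is only an illustration under stated assumptions: the sentence is made up, and cooccurrence_counts is a hypothetical helper that counts neighbours within a fixed window. Words used in similar contexts (here "king" and "queen") end up with similar neighbour counts.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts


# Made-up toy corpus, not taken from the talk.
tokens = "the king rules the land and the queen rules the land".split()
company = cooccurrence_counts(tokens, window=2)

print(dict(company["king"]))
print(dict(company["queen"]))
```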

Page 26: Hacking Human Language (PyCon Sweden 2015)
Page 27: Hacking Human Language (PyCon Sweden 2015)

Vectors are directions in space

Page 28: Hacking Human Language (PyCon Sweden 2015)

Vectors are directions in space

Quoted after Socher

word2vec
Representing a word with a vector

Page 29: Hacking Human Language (PyCon Sweden 2015)

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

Vectors can encode relationships

MAN, WOMAN

UNCLE, AUNT

KING, QUEEN

word2vec
Representing a word with a vector

Page 30: Hacking Human Language (PyCon Sweden 2015)

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

man is to woman as king is to ?

KINGS

KING

QUEEN

QUEENS
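The analogy question can be answered with vector arithmetic: take the vector for "woman", subtract "man", add "king", and look for the closest remaining word. The sketch below uses hand-made two-dimensional toy vectors (an assumption for illustration, not trained word2vec vectors); cosine and analogy are hypothetical helper names.

```python
import math

# Hand-made toy vectors: axis 0 loosely stands for "royalty", axis 1 for "gender".
# These numbers are invented for illustration and are not trained embeddings.
vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "uncle": [0.2, 1.0],
    "aunt":  [0.2, -1.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("man", "woman", "king")
print(answer)  # queen
```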

word2vec
Representing a word with a vector

Page 31: Hacking Human Language (PyCon Sweden 2015)

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

word2vec
Representing a word with a vector

Page 32: Hacking Human Language (PyCon Sweden 2015)

Sweden
Most similar words

Page 33: Hacking Human Language (PyCon Sweden 2015)

Sweden
Most similar words

Page 34: Hacking Human Language (PyCon Sweden 2015)

Sweden
Most similar words

Page 35: Hacking Human Language (PyCon Sweden 2015)

Sweden
Most similar words

Page 36: Hacking Human Language (PyCon Sweden 2015)

Harvard
Most similar words

Page 37: Hacking Human Language (PyCon Sweden 2015)
Page 38: Hacking Human Language (PyCon Sweden 2015)

Link: https://radimrehurek.com/gensim/models/word2vec.html

Page 39: Hacking Human Language (PyCon Sweden 2015)

Link: https://radimrehurek.com/gensim/models/word2vec.html

Page 40: Hacking Human Language (PyCon Sweden 2015)

Link: https://honnibal.github.io/spaCy/

Page 41: Hacking Human Language (PyCon Sweden 2015)

Link: https://honnibal.github.io/spaCy/

Page 42: Hacking Human Language (PyCon Sweden 2015)

spaCy: Dependency-Based Word Representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 43: Hacking Human Language (PyCon Sweden 2015)

spaCy: Dependency-Based Word Representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

2 words context window

Page 44: Hacking Human Language (PyCon Sweden 2015)

spaCy: Dependency-Based Word Representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

5 words context window

2 words context window

Page 45: Hacking Human Language (PyCon Sweden 2015)

spaCy: Dependency-Based Word Representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 46: Hacking Human Language (PyCon Sweden 2015)

spaCy: Dependency-Based Word Representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 47: Hacking Human Language (PyCon Sweden 2015)

spaCy: Dependency-Based Word Representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 48: Hacking Human Language (PyCon Sweden 2015)

Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Page 49: Hacking Human Language (PyCon Sweden 2015)

Applications

Page 50: Hacking Human Language (PyCon Sweden 2015)

Machine Translation

T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available:

http://arxiv.org/abs/1309.4168

Page 51: Hacking Human Language (PyCon Sweden 2015)

Compare my Google searches

Page 52: Hacking Human Language (PyCon Sweden 2015)

Link: https://support.google.com/websearch/answer/6068625?hl=en

Page 53: Hacking Human Language (PyCon Sweden 2015)

{ "event": [
    {"query": {"id": [{"timestamp_usec": "1317002730153183"}], "query_text": "google hangout"}},
    {"query": {"id": [{"timestamp_usec": "1316577601549660"}], "query_text": "eurokrise"}},
    {"query": {"id": [{"timestamp_usec": "1315592145720230"}], "query_text": "hoverboard"}}

parsed_json['event'][42]['query']['query_text']
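A minimal sketch of reading such an export with the standard library follows. It assumes the file has the structure of the excerpt above; the closing brackets were added here so that the sample parses, and iter_queries is a hypothetical helper name. The timestamp_usec field is interpreted as microseconds since the Unix epoch.

```python
import json
from datetime import datetime, timezone

# Copy of the excerpt from the slide, with closing brackets added so it parses.
raw = """
{"event": [
  {"query": {"id": [{"timestamp_usec": "1317002730153183"}], "query_text": "google hangout"}},
  {"query": {"id": [{"timestamp_usec": "1316577601549660"}], "query_text": "eurokrise"}},
  {"query": {"id": [{"timestamp_usec": "1315592145720230"}], "query_text": "hoverboard"}}
]}
"""

parsed_json = json.loads(raw)


def iter_queries(parsed):
    """Yield (datetime, query_text) pairs from a parsed search-history export."""
    for event in parsed["event"]:
        query = event["query"]
        usec = int(query["id"][0]["timestamp_usec"])
        when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
        yield when, query["query_text"]


queries = list(iter_queries(parsed_json))
for when, text in queries:
    print(when.date(), text)
```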

Page 54: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Page 55: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Page 56: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/

Page 57: Hacking Human Language (PyCon Sweden 2015)
Page 58: Hacking Human Language (PyCon Sweden 2015)
Page 59: Hacking Human Language (PyCon Sweden 2015)
Page 60: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Page 61: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

linguistics

Page 62: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Page 63: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Page 64: Hacking Human Language (PyCon Sweden 2015)

1. Find Word Representations: word2vec

2. Dimensionality Reduction: t-SNE

3. Output: JSON

Link: https://github.com/mbostock/d3/wiki/Gallery
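The last step of the pipeline, writing the reduced coordinates as JSON for a d3.js visualisation, could look roughly like this. It is a sketch under assumptions: the coordinates are made-up stand-ins for t-SNE output, and the field names "word", "x" and "y" as well as the helper to_d3_records are hypothetical, not taken from the talk.

```python
import json


def to_d3_records(words, coords):
    """Pair each word with its 2-D position as a list of plain dicts."""
    return [{"word": word, "x": float(x), "y": float(y)}
            for word, (x, y) in zip(words, coords)]


# Made-up coordinates standing in for the output of a dimensionality reduction.
words = ["hoverboard", "eurokrise", "linguistics"]
coords = [(0.5, -1.25), (2.0, 3.5), (-4.0, 0.75)]

payload = json.dumps(to_d3_records(words, coords), indent=2)
print(payload)
```

A d3.js script could then load this list and draw one labelled point per record.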

Page 65: Hacking Human Language (PyCon Sweden 2015)

My Google Searches
Oct – Dec 2014 / Jul – Sep 2011 / Both, 2011 & 2014

Page 66: Hacking Human Language (PyCon Sweden 2015)

My Google Searches
Oct – Dec 2014 / Jul – Sep 2011 / Both, 2011 & 2014

Page 67: Hacking Human Language (PyCon Sweden 2015)

My Google Searches
Oct – Dec 2014 / Jul – Sep 2011 / Both, 2011 & 2014

Page 68: Hacking Human Language (PyCon Sweden 2015)

My Google Searches
Oct – Dec 2014 / Jul – Sep 2011 / Both, 2011 & 2014

Page 69: Hacking Human Language (PyCon Sweden 2015)

My Google Searches
Oct – Dec 2014 / Jul – Sep 2011 / Both, 2011 & 2014

Page 70: Hacking Human Language (PyCon Sweden 2015)

My Google Searches
Oct – Dec 2014 / Jul – Sep 2011 / Both, 2011 & 2014
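The three groups in the plots (one period only, the other period only, both) can be derived with plain set operations on the words of the queries. The sketch below is an assumption about how such a grouping might be computed, not the code used for the talk; the 2011 queries come from the JSON excerpt, the 2014 queries are invented, and compare_periods is a hypothetical helper.

```python
def compare_periods(queries_a, queries_b):
    """Split the words of two query lists into 'only_a', 'only_b' and 'both'."""
    words_a = {word for query in queries_a for word in query.lower().split()}
    words_b = {word for query in queries_b for word in query.lower().split()}
    return {
        "only_a": words_a - words_b,
        "only_b": words_b - words_a,
        "both": words_a & words_b,
    }


searches_2011 = ["google hangout", "eurokrise", "hoverboard"]  # from the excerpt
searches_2014 = ["word2vec python", "google scholar"]          # made up

groups = compare_periods(searches_2011, searches_2014)
print(sorted(groups["both"]))
```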

Page 71: Hacking Human Language (PyCon Sweden 2015)

Hacking Human Language
Hendrik Heuer
PyCon, Stockholm, Sweden

[email protected]
http://hen-drik.de
@hen_drik

Thanks to Andrii, Jussi & Roelof

Slides: https://tinyurl.com/pycon-word2vec

Page 72: Hacking Human Language (PyCon Sweden 2015)

predict the current word
input: w[i-2], w[i-1], w[i+1], w[i+2]
output: w[i]

Page 73: Hacking Human Language (PyCon Sweden 2015)

predict the current word
input: w[i-2], w[i-1], w[i+1], w[i+2]
output: w[i]

predict the surrounding words
input: w[i]
output: w[i-2], w[i-1], w[i+1], w[i+2]
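These two setups are known from the word2vec paper as continuous bag-of-words (predict the current word from its context) and skip-gram (predict the surrounding words from the current word). The sketch below only shows how training examples could be generated for each with a window of two words; it is not the word2vec implementation, and context, cbow_examples and skipgram_examples are hypothetical helper names.

```python
def context(tokens, i, window=2):
    """Return up to `window` words on each side of position i."""
    lo = max(0, i - window)
    return tokens[lo:i] + tokens[i + 1:i + 1 + window]


def cbow_examples(tokens, window=2):
    """Continuous bag-of-words: (context words, current word) pairs."""
    return [(context(tokens, i, window), word)
            for i, word in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: one (current word, context word) pair per neighbour."""
    return [(word, neighbour)
            for i, word in enumerate(tokens)
            for neighbour in context(tokens, i, window)]


tokens = "all your base are belong to us".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)

print(cbow[2])
print([pair for pair in skipgram if pair[0] == "base"])
```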

Page 74: Hacking Human Language (PyCon Sweden 2015)
Page 75: Hacking Human Language (PyCon Sweden 2015)