Hacking Human Language (PyCon Sweden 2015)

Posted on 28-Jul-2015

Hacking Human Language
Hendrik Heuer
PyCon, Stockholm, Sweden

Hacking?!

– Hacker Ethics

“Access to computers, and anything which might teach you something about the way the world works, should be unlimited and total. Always yield to the Hands-On Imperative!”

Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.

Agenda

• Computational Social Science

• Natural Language Processing

• Word Vector Representations

• Visualising and comparing my Google searches

D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI:10.1145/2184319.2184336

Computational Social Science / Digital Humanities

• combines computer science & social sciences

• makes new research possible, e.g. the analysis of massive social networks and of the content of millions of books

immersion.media.mit.edu

Massive-scale automated analysis of news-content

• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)

• automatically annotated into 15 topic areas

• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances

I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”

Male-to-Female Ratio
Named Entity Recognition

“Gender bias in sports coverage (...) females only account for between only 7 and 25 per cent of coverage”

Machine Learning, Text Processing, Topic Modeling

• scikit-learn

• gensim

• Natural Language Toolkit

• spaCy

• word2vec

Visualization

• d3.js

• Google Chart API

• Highcharts

Introduction to Natural Language Processing

nltk.org/book

Word Tokenization
Splitting a sentence into single words

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("All your base are belong to us")
['All', 'your', 'base', 'are', 'belong', 'to', 'us']

Sentence Tokenization
Splitting a text into sentences

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']

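Sentence Tokenization
Splitting a text into sentences (Swedish model)

>>> import nltk
>>> # load the pre-trained Swedish Punkt model and use its tokenize method
>>> swedish_tokenizer = nltk.data.load("tokenizers/punkt/swedish.pickle")
>>> sent_tokenize = swedish_tokenizer.tokenize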

Stemming
Finding the word stem or root form

>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
['investigation', 'woman']
>>> [porter.stem(w) for w in ['investigation', 'women']]
['investig', 'women']
>>> [lancaster.stem(w) for w in ['investigation', 'women']]
['investig', 'wom']

Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…

>>> import nltk
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway', 'NNP'), ('.', '.')]

NN*  Noun
VB*  Verb
JJ*  Adjective
RB*  Adverb
DT   Determiner
IN   Preposition
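The tag prefixes above are enough for a rough adjective count, which is the kind of signal the linguistic-subjectivity measure from earlier in the talk builds on. The following is only a minimal sketch: the helper name `adjective_share` is hypothetical, the tagged tokens are copied from the output above, and the SentiWordNet part of the study is not reproduced.

```python
# Tagged tokens copied from the nltk.pos_tag output shown above.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]


def adjective_share(tagged_tokens):
    """Return the fraction of word tokens whose tag starts with 'JJ'.

    Hypothetical helper: tokens whose tag does not start with a letter
    (punctuation such as '.') are left out of the denominator.
    """
    words = [(tok, tag) for tok, tag in tagged_tokens if tag[:1].isalpha()]
    if not words:
        return 0.0
    adjectives = [tok for tok, tag in words if tag.startswith('JJ')]
    return len(adjectives) / len(words)


print(adjective_share(tagged))
```

For the example sentence this counts one adjective ('same') among thirteen word tokens.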

Named Entity Recognition
Identifying people, organizations, locations…

>>> import nltk
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the', 'DT'), Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])

ORGANIZATION  Georgia-Pacific Corp., WHO
PERSON        Eddy Bonte, President Obama
LOCATION      Murray River, Mount Everest
DATE          June, 2008-06-29
TIME          two fifty a m, 1:30 p.m.
MONEY         GBP 10.40
PERCENT       twenty pct, 18.75 %
FACILITY      Washington Monument, Stonehenge
GPE           South East Asia, Midlothian (geo-political entity)

Sentiment Analysis
Tell if a sentence is positive or negative

Stanford Core NLP Tools

Vector Representations

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.

Quoted after Socher

Vectors are directions in space

Quoted after Socher

word2vec
Representing a word with a vector

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

Vectors can encode relationships

[Figure: the words MAN, WOMAN, UNCLE, AUNT, KING and QUEEN shown as vectors]

man is to woman as king is to ?

[Figure: the words KING, KINGS, QUEEN and QUEENS shown as vectors]
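The analogy question can be answered with plain vector arithmetic: take the offset from "man" to "woman", add it to "king", and look for the nearest remaining word. The sketch below uses hand-picked two-dimensional toy vectors, which are purely hypothetical; real word2vec vectors have hundreds of dimensions and are learned from text. The function names `cosine` and `analogy` are illustrative and not part of any library.

```python
import math

# Hypothetical 2-D toy vectors, chosen by hand so that the first axis
# separates the word pairs and the second axis carries the gender offset.
vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "uncle": [2.0, 0.0],
    "aunt":  [2.0, 1.0],
    "king":  [3.0, 0.1],
    "queen": [3.0, 1.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king", vectors))
```

With these toy vectors the nearest candidate to king - man + woman is "queen".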

Sweden
Most similar words

Harvard
Most similar words

Link: https://radimrehurek.com/gensim/models/word2vec.html

Link: https://honnibal.github.io/spaCy/

spaCy
Dependency-based word representations by Levy and Goldberg

Gensim
word2vec by Mikolov et al.

2-word context window

5-word context window

Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Applications

Machine Translation

T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available: http://arxiv.org/abs/1309.4168

Compare my Google searches

Link: https://support.google.com/websearch/answer/6068625?hl=en

{"event": [
  {"query": {"id": [{"timestamp_usec": "1317002730153183"}], "query_text": "google hangout"}},
  {"query": {"id": [{"timestamp_usec": "1316577601549660"}], "query_text": "eurokrise"}},
  {"query": {"id": [{"timestamp_usec": "1315592145720230"}], "query_text": "hoverboard"}},
  ...
]}

parsed_json['event'][42]['query']['query_text']
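The lookup above can be extended into a small loader that walks every event and keeps the query text together with its time. This is a sketch under two assumptions: the export has exactly the structure of the excerpt above, and `timestamp_usec` holds microseconds since the Unix epoch (the field name suggests this, but the slides do not say so). The function name `load_searches` is hypothetical, and the embedded sample is the three-event excerpt with its closing brackets added so that it parses.

```python
import json
from datetime import datetime, timezone

# The three events from the excerpt above, closed off so the sample parses.
raw = """
{"event": [
  {"query": {"id": [{"timestamp_usec": "1317002730153183"}], "query_text": "google hangout"}},
  {"query": {"id": [{"timestamp_usec": "1316577601549660"}], "query_text": "eurokrise"}},
  {"query": {"id": [{"timestamp_usec": "1315592145720230"}], "query_text": "hoverboard"}}
]}
"""


def load_searches(text):
    """Return a list of (UTC datetime, query text) pairs from an export string."""
    parsed_json = json.loads(text)
    searches = []
    for event in parsed_json["event"]:
        query = event["query"]
        # Assumption: timestamp_usec is a string of microseconds since the epoch.
        usec = int(query["id"][0]["timestamp_usec"])
        when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
        searches.append((when, query["query_text"]))
    return searches


searches = load_searches(raw)
for when, text in searches:
    print(when.date(), text)
```

Once each query carries a datetime, the searches can be filtered by period before any word vectors are looked up.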

1. Find Word Representations (word2vec)

2. Dimensionality Reduction (t-SNE)

3. Output JSON

Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/

Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Link: https://github.com/mbostock/d3/wiki/Gallery

My Google Searches

[Figure legend: Oct – Dec 2014 · Jul – Sep 2011 · Both, 2011 & 2014]
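The three legend entries suggest splitting the vocabulary into words searched in only one of the two periods and words searched in both. The sketch below shows that split together with the "Output JSON" step. Everything concrete in it is an assumption: the 2014 queries are invented, the coordinates are hypothetical stand-ins for the t-SNE output, queries are tokenised on whitespace only, and the record layout (`word`, `x`, `y`, `group`) is illustrative rather than the format used in the talk.

```python
import json

# Queries per period; the 2014 list is invented for illustration.
queries_2011 = ["google hangout", "eurokrise", "hoverboard"]
queries_2014 = ["python word2vec", "google scholar"]

# Hypothetical 2-D positions, standing in for the t-SNE output.
coords = {
    "google": (0.0, 0.0), "hangout": (0.4, 0.1), "eurokrise": (-1.2, 0.8),
    "hoverboard": (0.9, -0.7), "python": (2.0, 1.5), "word2vec": (2.1, 1.4),
    "scholar": (0.3, 0.5),
}


def vocabulary(queries):
    """Lower-cased, whitespace-separated words of all queries."""
    return {word for query in queries for word in query.lower().split()}


def build_records(queries_a, queries_b, coords, label_a="2011", label_b="2014"):
    """Label each word by the period(s) it occurs in and attach its position."""
    words_a, words_b = vocabulary(queries_a), vocabulary(queries_b)
    records = []
    for word in sorted(words_a | words_b):
        if word not in coords:
            continue  # no position available for this word
        if word in words_a and word in words_b:
            group = "both"
        elif word in words_a:
            group = label_a
        else:
            group = label_b
        x, y = coords[word]
        records.append({"word": word, "x": x, "y": y, "group": group})
    return records


records = build_records(queries_2011, queries_2014, coords)
print(json.dumps(records, indent=2))
```

A front end such as d3.js could then colour each point by its `group` value.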

hendrikheuer@gmail.com
http://hen-drik.de
@hen_drik

Thanks to Andrii, Jussi & Roelof

Slides: https://tinyurl.com/pycon-word2vec

predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i

predict the surrounding words
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
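These two set-ups correspond to the continuous bag-of-words (CBOW) and skip-gram architectures described in the cited paper by Mikolov et al. The sketch below only shows how the training examples for each could be generated with a window of two words on either side; it does not train a model. The function names are hypothetical, and real implementations add details such as subsampling and negative sampling that are left out here.

```python
def context_window(tokens, i, window=2):
    """Return the tokens up to `window` positions left and right of index i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """Yield (context words, current word): predict the current word."""
    for i, word in enumerate(tokens):
        yield context_window(tokens, i, window), word


def skipgram_examples(tokens, window=2):
    """Yield (current word, one surrounding word): predict the surrounding words."""
    for i, word in enumerate(tokens):
        for neighbour in context_window(tokens, i, window):
            yield word, neighbour


tokens = ["all", "your", "base", "are", "belong", "to", "us"]
print(list(cbow_examples(tokens))[2])
print(list(skipgram_examples(tokens))[:2])
```

At the edges of the sentence the window is simply truncated, so the first word has only right-hand neighbours.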