DevTalks Cluj - Open-Source Technologies for Analyzing Text

An open source tech stack to Analyze all reviews on the Internet Steffen Wenz, CTO TrustYou [email protected]

Transcript of DevTalks Cluj - Open-Source Technologies for Analyzing Text

Page 1: DevTalks Cluj - Open-Source Technologies for Analyzing Text

An open source tech stack to analyze all reviews on the Internet

Steffen Wenz, CTO [email protected]

Page 2: DevTalks Cluj - Open-Source Technologies for Analyzing Text

✓ Very good hotel!*

✓ Near city centre “Close to the city center”
✓ Clean rooms « Chambre impeccable »
✓ Popular with solo travelers “Remote doesnt work”

*) Ramada Cluj (Full summary)

Page 3: DevTalks Cluj - Open-Source Technologies for Analyzing Text
Page 4: DevTalks Cluj - Open-Source Technologies for Analyzing Text

DB, Crawling, Semantic Analysis

TrustYou Analytics

API

Google, Hotels.com …

TrustYou Architecture

200 million reqs/month

❤ Python

Page 5: DevTalks Cluj - Open-Source Technologies for Analyzing Text

Scrapy

● Build your own web crawlers
● Extract data via CSS selectors, XPath, regexes …
● Handles “tag soup”, queuing, request parallelism, cookies, throttling …
● Code sample on GitHub
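The slides do not include the Scrapy code itself. As a rough, standard-library-only illustration of what extracting data from “tag soup” involves, here is a minimal sketch that uses Python's html.parser rather than Scrapy. The class name ReviewTextExtractor, the "review" CSS class and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class ReviewTextExtractor(HTMLParser):
    """Collect the text inside every <p class="review"> element."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly ends an unclosed previous one.
            self._flush()
            classes = (dict(attrs).get("class") or "").split()
            self._in_review = "review" in classes

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_review:
            self._buffer.append(data)

    def _flush(self):
        # Normalize whitespace and store the paragraph if it was a review.
        text = " ".join("".join(self._buffer).split())
        if self._in_review and text:
            self.reviews.append(text)
        self._in_review = False
        self._buffer = []

    def close(self):
        super().close()
        self._flush()


def extract_reviews(html):
    parser = ReviewTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Hypothetical "tag soup": the <p> and <b> tags are never closed.
SAMPLE = (
    '<div><p class="review">Very good <b>hotel!'
    '<p class="review">Clean rooms</div><p>footer'
)
print(extract_reviews(SAMPLE))
```

A crawler framework adds what this sketch leaves out: fetching pages, queuing, parallel requests, cookies and throttling.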

Page 6: DevTalks Cluj - Open-Source Technologies for Analyzing Text

NLP in Python

● NLTK
  ○ Word/sentence tokenization
  ○ POS tagging, parsing
● Great support for scientific computation: NumPy, SciPy, Pandas
● Scikit-learn
● TensorFlow!
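To make “word/sentence tokenization” concrete, here is a deliberately naive sketch built on regular expressions from the standard library. It is not NLTK's implementation, whose tokenizers handle abbreviations and many other edge cases. The function names are chosen for this example only.

```python
import re


def sentence_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    # A token is a word (optionally with internal apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


review = "Very good hotel! Close to the city center. Remote doesn't work"
for sentence in sentence_tokenize(review):
    print(word_tokenize(sentence))
```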

Page 7: DevTalks Cluj - Open-Source Technologies for Analyzing Text

Gensim: Fun with Word2Vec

>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> # (in gensim 4 and later, call the query methods on m.wv instead of m)
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
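The scores in the session above are cosine similarities between word vectors. The following sketch shows that ranking in plain Python, without gensim. The three-dimensional TOY_VECTORS are made up for illustration and are not taken from the trained model, and this most_similar is a simplified stand-in, not gensim's method.

```python
import math


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(vectors, word, topn=3):
    # Rank every other word by its cosine similarity to `word`.
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings; a real model has hundreds of dimensions.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.2, 0.1],
    "berlin": [0.0, 0.9, 0.4],
    "paris": [0.1, 0.8, 0.5],
}

print(most_similar(TOY_VECTORS, "python", topn=1))
```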

Page 8: DevTalks Cluj - Open-Source Technologies for Analyzing Text

Big Data & Open Source

2004 MapReduce, GFS

BigTable, Spanner, F1 …

Apache Beam …

Page 9: DevTalks Cluj - Open-Source Technologies for Analyzing Text

Spark

● User writes driver program which transparently schedules execution in a cluster

● Faster and more expressive than MapReduce

● Spark SQL: Interactive query of large datasets
● Spark Streaming: Spark is “batch first”, but fast enough to implement stream processing with “mini batches”
● Spark MLlib: Machine learning
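The “mini batches” idea can be sketched without Spark: cut the incoming stream into small batches and run an ordinary batch job on each one, carrying state between batches. This is plain Python, not the Spark API. Batching by item count is a simplifying assumption (Spark Streaming cuts batches by time interval), and all function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def mini_batches(stream, batch_size):
    # Cut a (possibly endless) stream into lists of at most batch_size items.
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    # The "batch job": an ordinary word count over one small batch.
    return Counter(word for line in batch for word in line.lower().split())


def running_word_counts(stream, batch_size=2):
    # Apply the batch job to each mini batch and keep a running total.
    totals = Counter()
    for batch in mini_batches(stream, batch_size):
        totals.update(count_words(batch))
        yield dict(totals)


for snapshot in running_word_counts(["clean room", "great room", "clean"]):
    print(snapshot)
```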

Page 10: DevTalks Cluj - Open-Source Technologies for Analyzing Text

Luigi

● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
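Dependency resolution and resuming failed jobs can be illustrated with a few lines of plain Python. This is a sketch of the idea only, not Luigi's API. The `done` set stands in for Luigi's check of whether a task's output already exists, and the task names are hypothetical.

```python
def run_pipeline(requires, run, done):
    """Run every task after its dependencies; skip tasks already in `done`."""
    executed = []

    def build(task, stack=()):
        if task in done:
            return  # finished in an earlier run, nothing to do
        if task in stack:
            raise ValueError("dependency cycle at " + task)
        for dep in requires.get(task, []):
            build(dep, stack + (task,))
        run(task)
        done.add(task)
        executed.append(task)

    for task in requires:
        build(task)
    return executed


# Hypothetical pipeline: each task lists the tasks it depends on.
REQUIRES = {
    "train_word2vec": ["extract_text"],
    "extract_text": ["crawl"],
    "crawl": [],
}

# First run: everything is executed in dependency order.
print(run_pipeline(REQUIRES, print, set()))

# Resumed run: "crawl" already completed, so only the rest is executed.
print(run_pipeline(REQUIRES, print, {"crawl"}))
```

Because completed tasks are recorded as they finish, a rerun after a failure picks up at the first task that has not completed yet.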

Page 11: DevTalks Cluj - Open-Source Technologies for Analyzing Text

Try it out!

GitHub repo showcasing:
● Luigi
● Scrapy
● Word2Vec model training with gensim
@ https://github.com/trustyou/meetups