DevTalks Cluj - Open-Source Technologies for Analyzing Text

An open source tech stack to analyze all reviews on the Internet

Steffen Wenz, CTO


✓ Very good hotel!*

✓ Near city centre: “Close to the city center”
✓ Clean rooms: « Chambre impeccable » (French: “Spotless room”)
✓ Popular with solo travelers
“Remote doesnt work”

*) Ramada Cluj (Full summary)

http://www.trustyou.com/meta-review-search/?ty_id=92e85b9f-6257-49aa-bf8c-cb62fa6958c4

TrustYou Architecture

[Architecture diagram; labels: Crawling, Semantic Analysis, DB, TrustYou Analytics, API, Google, Hotels.com …]

200 million reqs/month

❤ Python

Scrapy

● Build your own web crawlers
● Extract data via CSS selectors, XPath, regexes …
● Handles “tag soup”, queuing, request parallelism, cookies, throttling …
● Code sample on GitHub

http://scrapy.org/

https://github.com/trustyou/meetups/tree/master/pydata
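To make the extraction step concrete without installing anything, here is a minimal, dependency-free sketch. It deliberately does not use Scrapy: it swaps in Python's standard `html.parser` plus a regex to show what "extract data from tag soup" means. The `review` CSS class, the `N/5` score format and all function names are assumptions made up for this illustration; in Scrapy the same logic would live in a spider's parse callback and use CSS or XPath selectors.

```python
import re
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collect the text of every <p class="review"> element.

    HTMLParser is event-based and does not require well-formed markup,
    so unclosed tags ("tag soup") do not raise errors.
    """

    def __init__(self):
        super().__init__()
        self.reviews = []     # finished review texts
        self._buffer = None   # text pieces of the review currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next <p> starts
            classes = (dict(attrs).get("class") or "").split()
            if "review" in classes:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit a review that was never closed

    def _flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.reviews.append(text)
            self._buffer = None


# Hypothetical score format such as "4/5" embedded in the review text.
SCORE = re.compile(r"(\d+)\s*/\s*5")


def extract_reviews(html):
    """Return one dict per review with its text and optional score."""
    parser = ReviewExtractor()
    parser.feed(html)
    parser.close()
    items = []
    for text in parser.reviews:
        match = SCORE.search(text)
        items.append({"text": text,
                      "score": int(match.group(1)) if match else None})
    return items
```

Calling `extract_reviews` on markup with unclosed `<p>` and `<b>` tags still yields one item per review. A crawler framework adds the parts this sketch leaves out: queuing, parallel requests, cookies and throttling.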

NLP in Python

● NLTK
  ○ Word/sentence tokenization
  ○ POS tagging, parsing
● Great support for scientific computation: NumPy, SciPy, Pandas
● Scikit-learn
● TensorFlow!

http://www.nltk.org/

http://scikit-learn.org/stable/

https://www.tensorflow.org/

Gensim: Fun with Word2Vec

>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> # (gensim 4+ exposes the query methods below on m.wv, e.g. m.wv.most_similar)
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
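For readers wondering what these queries compute, here is a rough pure-Python sketch of the idea: `most_similar` ranks words by cosine similarity to the query vector, and `doesnt_match` picks the word least similar to the mean of the group. The 3-dimensional vectors are invented toy values, not output of the model above, and the functions only approximate what gensim does internally.

```python
import math

# Invented toy vectors for illustration only; a trained Word2Vec model
# learns vectors with hundreds of dimensions from real text.
VECTORS = {
    "python":     [0.9, 0.8, 0.1],
    "javascript": [0.8, 0.9, 0.2],
    "php":        [0.7, 0.9, 0.3],
    "berlin":     [0.1, 0.2, 0.9],
    "paris":      [0.2, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    target = VECTORS[word]
    scored = [(other, cosine(target, vec))
              for other, vec in VECTORS.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(words):
    """Return the word whose vector is least similar to the group mean."""
    unit = {w: normalize(VECTORS[w]) for w in words}
    mean = [sum(column) / len(words) for column in zip(*unit.values())]
    return min(words, key=lambda w: cosine(unit[w], mean))
```

With these toy vectors, `most_similar("python")` lists the other programming languages first, and `doesnt_match(["python", "javascript", "berlin"])` singles out the city.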

Big Data & Open Source

2004 MapReduce, GFS

BigTable, Spanner, F1 …

Apache Beam …

Spark

● User writes driver program which transparently schedules execution in a cluster

● Faster and more expressive than MapReduce

● Spark SQL: Interactive query of large datasets
● Spark Streaming: Spark is “batch first”, but fast enough to implement stream processing with “mini batches”
● Spark MLlib: Machine learning
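The “mini batches” idea can be sketched in plain Python, without Spark: cut the incoming stream into small finite batches and run an ordinary batch job on each one. This is only an illustration of the concept, not the Spark API; all function names are made up, and the sketch batches by record count, whereas Spark Streaming groups records by a time interval.

```python
from collections import Counter
from itertools import islice


def count_words(lines):
    """The 'batch job': counts words in a finite collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def mini_batches(stream, batch_size):
    """Cut a (possibly endless) iterator into small finite batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size=2):
    """Run the unchanged batch job on each mini batch, keeping running totals."""
    totals = Counter()
    for batch in mini_batches(stream, batch_size):
        totals += count_words(batch)
        yield dict(totals)
```

The point of the design is that `count_words` does not know it is part of a stream. The same batch logic is reused, and latency is bounded by how quickly one small batch can be processed.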

Luigi

● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie

https://github.com/spotify/luigi
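Dependency resolution and resuming can be illustrated with a toy runner in plain Python. This is not Luigi's implementation or API; it only mimics the shape of a Luigi task (`requires`, `output`, `run`, and a task counting as complete once its output exists). The `Crawl` and `CountReviews` tasks, the file names and the `build` helper are hypothetical.

```python
import os


class Task:
    """Hypothetical base class that mimics the shape of a Luigi task."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # tasks that must be complete before this one runs

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done once its output file exists.
        return os.path.exists(self.output())


class Crawl(Task):
    def output(self):
        return os.path.join(self.workdir, "reviews.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("Clean rooms\nClose to the city center\n")


class CountReviews(Task):
    def requires(self):
        return [Crawl(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "count.txt")

    def run(self):
        with open(self.requires()[0].output()) as f:
            number_of_reviews = sum(1 for _ in f)
        with open(self.output(), "w") as f:
            f.write(str(number_of_reviews))


def build(task, log):
    """Run dependencies first; skip every task whose output already exists."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)  # record execution order
```

Building `CountReviews` in an empty directory runs `Crawl` first. If `count.txt` is then deleted and the build repeated, only `CountReviews` runs again, which is the "resume failed jobs" behaviour in miniature.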

Try it out!

GitHub repo showcasing:
● Luigi
● Scrapy
● Word2Vec model training with gensim
@ https://github.com/trustyou/meetups
