DevTalks Cluj - Open-Source Technologies for Analyzing Text
An open source tech stack to analyze all reviews on the Internet
Steffen Wenz, CTO
✓ Very good hotel!*
✓ Near city centre
  “Close to the city center”
✓ Clean rooms
  « Chambre impeccable »
✓ Popular with solo travelers
  “Remote doesnt work”
*) Ramada Cluj (Full summary)
TrustYou Architecture
[Diagram components: Crawling, Semantic Analysis, DB, TrustYou Analytics, API, Google, Hotels.com …]
200 million reqs/month
❤ Python
Scrapy
● Build your own web crawlers
● Extract data via CSS selectors, XPath, regexes …
● Handles “tag soup”, queuing, request parallelism, cookies, throttling …
● Code sample on GitHub
NLP in Python
● NLTK
  ○ Word/sentence tokenization
  ○ POS tagging, parsing
● Great support for scientific computation: NumPy, SciPy, Pandas
● Scikit-learn
● TensorFlow!
Gensim: Fun with Word2Vec
>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> # (gensim 4+ exposes the query methods below on m.wv rather than on m)
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django',
0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland',
0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
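The similarity scores in the session above are cosine similarities between word vectors. The following plain-Python sketch shows roughly what queries like `most_similar` and `doesnt_match` compute. It is not gensim's code, and the three-dimensional vectors are invented for illustration; a trained model learns vectors with many more dimensions.

```python
import math

# Hypothetical toy vectors, made up for this example.
VECTORS = {
    "python": [0.9, 0.8, 0.1],
    "javascript": [0.8, 0.9, 0.2],
    "c++": [0.1, 0.3, 0.9],
    "berlin": [-0.7, 0.2, 0.1],
    "paris": [-0.8, 0.3, 0.0],
}


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity: dot product of the unit-length vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scores = [
        (other, cosine(VECTORS[word], vec))
        for other, vec in VECTORS.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(words):
    """Return the word least similar to the mean of the group."""
    unit = {w: normalize(VECTORS[w]) for w in words}
    mean = [sum(column) / len(words) for column in zip(*unit.values())]
    return min(words, key=lambda w: sum(x * y for x, y in zip(unit[w], mean)))
```

With these toy vectors, `most_similar("python")` ranks `"javascript"` first and `doesnt_match(["python", "c++", "javascript"])` picks `"c++"`, mirroring the shape of the results in the session.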
Big Data & Open Source
2004: MapReduce, GFS
BigTable, Spanner, F1 …
Apache Beam …
Spark
● User writes driver program which transparently schedules execution in a cluster
● Faster and more expressive than MapReduce
● Spark SQL: Interactive query of large datasets
● Spark Streaming: Spark is “batch first”, but fast enough to implement stream processing with “mini batches”
● Spark MLlib: Machine learning
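The "mini batches" idea can be illustrated without a cluster. The sketch below is plain Python, not Spark's API: it cuts an incoming stream into small batches and runs an ordinary batch job on each one while keeping a running total. Batching by item count is a simplification made for this example (Spark Streaming forms batches per time interval), and all function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def mini_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator into lists of up to batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """An ordinary batch job: word counts for a list of text lines."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def streaming_word_count(stream, batch_size=2):
    """Apply the batch job to each mini batch and yield the running totals."""
    totals = Counter()
    for batch in mini_batches(stream, batch_size):
        totals += count_words(batch)
        yield dict(totals)
```

Calling `list(streaming_word_count(["clean rooms", "clean hotel", "near city centre"]))` produces one snapshot per mini batch, so results arrive with a delay of at most one batch.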
Luigi
● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
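Dependency resolution and resuming failed jobs both rest on one convention: a task counts as done when its output exists. The sketch below illustrates that idea in plain Python. It is not Luigi's API; the `Task` class, `build` function, and file names are hypothetical stand-ins, and unlike Luigi it does not write outputs atomically or run tasks in parallel.

```python
from pathlib import Path


class Task:
    """Minimal stand-in for a pipeline task (not Luigi's classes)."""

    def __init__(self, output, requires=(), action=None):
        self.output = Path(output)      # file this task produces
        self.requires = list(requires)  # tasks that must run first
        self.action = action            # callable(task) that writes output

    def complete(self):
        # A task is considered done as soon as its output file exists.
        return self.output.exists()


def build(task, log):
    """Run dependencies depth-first, then the task; skip finished work."""
    if task.complete():
        return
    for dependency in task.requires:
        build(dependency, log)
    task.action(task)
    log.append(task.output.name)


def crawl(task):
    # Stand-in for a crawl job: writes two fake reviews.
    task.output.write_text("clean rooms\nnear city centre\n")


def count_lines(task):
    # Reads the upstream output and stores the number of reviews.
    lines = task.requires[0].output.read_text().splitlines()
    task.output.write_text(str(len(lines)))


def make_pipeline(workdir):
    """Two-step pipeline: crawl reviews, then count them."""
    crawl_task = Task(Path(workdir) / "reviews.txt", action=crawl)
    return Task(Path(workdir) / "count.txt", requires=[crawl_task], action=count_lines)
```

Running `build(make_pipeline(some_dir), log)` twice shows the resume behaviour: if only `count.txt` is missing on the second run, the crawl step is skipped because `reviews.txt` already exists.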
Try it out!
GitHub repo showcasing:
● Luigi
● Scrapy
● Word2Vec model training with gensim
@ https://github.com/trustyou/meetups