Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics
-
Upload
semanticsconference -
Category
Technology
-
view
697 -
download
0
Transcript of Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics
![Page 1: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/1.jpg)
WWW.LEDS-PROJEKT.DE
STREAMING-BASED TEXT MINING USING DEEP LEARNING AND SEMANTICS
MARTIN VOIGT / ONTOS
![Page 2: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/2.jpg)
AGENDA
Use CasesLessons Learned
WILDHORNMINERMarket
Situation
What is required from project partners and customers?
What is good, what not and what are the next steps?
What are the others doing?
How looks Ontos’ approach of a flexible text mining by using Deep Learning?
How to analyze texts from various sources and interlink with existing knowledge graphs?
12. September 20162
![Page 3: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/3.jpg)
USE CASES
What is required from project partners and customers?
12. September 20163
![Page 4: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/4.jpg)
USE CASES FROM LEDS
• Content Augmentation in E-Commerce• Product descriptions imported from various sources as usually unstructured text
• Much manual work is resource consuming and expensive
http://www.walmart.com/ip/The-Revenant-Blu-ray-Digital-HD/5012927712. September 20164
![Page 5: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/5.jpg)
USE CASES FROM CUSTOMERS
• Brand or competitor monitoring• What are the people or journalists writing about my brand?
• What are my competitors doing? How is the market changing?
Monitoring of
web sites
news feeds
social media channels
…
Link with
reports
CRM
Open data
…
12. September 20166
![Page 6: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/6.jpg)
USE CASES FROM CUSTOMERS
• Assist criminal investigations in the (dark) web• More and more organized crime in the web
• Detect entities with their relation and additional facts
But Neumann and McKelvey eventually sold the business to their landlord Joshua Guttman.
Neumann McKelvey
?
Joshua Guttman
Business
sold_what
sold_to
a
12. September 20167
![Page 7: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/7.jpg)
MAIN REQUIREMENTS
• Detection and classification of entities, relations and facts with good F1
• Long, high quality texts vs. short texts with bad / missing grammar in multiple languages
• Training not by linguists but domain experts
• Flexible adaption to new domains and contexts
• Many different data sources
12. September 20168
![Page 8: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/8.jpg)
MARKET SITUATION
What are the others doing?
12. September 20169
![Page 9: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/9.jpg)
MARKET
• MarketsAndMarkets: NLP (all) for 2020• Market: $ 13,4 bn
• CAGR: ~ 18,4%
• ResearchMOZ: Text Analytics 2015-2019• CAGR: ~ 16,1%
• Transparency Market Research: Text Analytics 2016–2024• CAGR: ~ 17,6%
• Driving element: NLP-as-a-Service with CAGR of ~20,2%
http://www.marketsandmarkets.com/PressReleases/natural-language-processing-nlp.asp
http://www.researchmoz.us/global-natural-language-processing-market-2015-2019-report.html
http://www.transparencymarketresearch.com/pressrelease/global-text-analytics-market.htm
12. September 201610
![Page 10: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/10.jpg)
MARKET
• ResearchAndMarkets: 2016 Top 10 Information and Communication Technologies• 7. Natural Language Processing (NLP)
• Increasing Demand for Automation Drives Growth• Funding on the Rise - Start-ups Driving Key Innovations for NLP Applications• High Acceptance of Technology enables the US to Remain on the Top
• Gartner‘s Top 10 Strategic Technology Trends für 2016• 4 - Information of Everything:
• Access to heterogeneous data sources• (Semantic) Linking of data items
• 5 - Advanced Machine Learning:• Deep Learning and Neural Network for NLP
http://www.researchandmarkets.com/research/vd8fr9/2016_top_10
12. September 201611
![Page 11: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/11.jpg)
GETTING AN OVERVIEW
• Understood domain by reviewing 30+ tools and frameworks• Checked scientific prototypes as well as open source tools
• Investigated what other enterprise, companies, and startups are doing
• Defined a set of 15+ criteria to compare and understand
12. September 201613
![Page 12: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/12.jpg)
GETTING AN OVERVIEW
• Understood domain by reviewing 30+ tools and frameworks
• Findings• Startups and huge enterprises usually stick to Deep Learning (DL) high flexibility
• Some existing companies loss market share, e.g., Attensity
• Market saturation in U.S., only less companies in Europe (5)
• More and more NLP-as-a-Service
• Only a few use RDF data, e.g., for output or disambiguation (8)
• Only 8 tools could extract relations and facts
• Only 9 open source tools with available benchmarks
12. September 201614
![Page 13: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/13.jpg)
GETTING AN OVERVIEW
• Rough comparison of the main concepts
• Conclusions for Ontos• Use DL as foundation for text mining tasks
• Combine it with semantic technologies, e.g. for disambiguation
• External view: NLP-as-a-Service
• Internal view: NLP pipelines in order to combine different tools / technologies
based on http://www.deeplearningbook.org/contents/intro.html Fig 1.512. September 201616
![Page 14: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/14.jpg)
MINER
How looks Ontos’ approach of a flexible text mining by using Deep Learning?
12. September 201617
![Page 15: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/15.jpg)
MINER - OVERVIEW
Language model generated from Corpora
Model as input for supervised step
Supervised Model
Domain or task specific model for sequence labeling
Trained on specific texts by domain experts
Large collection of texts in a given language or “dialect”, e.g. Social Media / Twitter
Large text corpora Unsupervised Model
12. September 201618
![Page 16: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/16.jpg)
MINER – CORPUS CREATION
• 1 corpus per language / dialect required
• the larger and more heterogeneous the better get and model more contexts for words
• Sources: Common Crawl, Wikipedia, news feeds, domain specific texts, Twitter, …
12. September 201619
![Page 17: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/17.jpg)
MINER – UNSUPERVISED MODEL
• language model that could be reused in multiple domains / contexts
• reuse the concept of word2vec to create word embeddings
Language Dictionary
Word2Vec ModelWord2Vec Dictionary
Numeric Corpus
12. September 201620
![Page 18: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/18.jpg)
MINER – SUPERVISED MODEL
• Sequence labeling task• use a bi-directional LSTM
(BLSTM)
• Gated Recurrent Unit (GRU) as specialized LSTM
• Output layer• join and normalize
• classification
LEDS is a research project
Word Embeddings
GRU GRU GRU GRU GRU
GRU GRU GRU GRU GRU
Forward GRU
Backward GRU
OUT OUT OUT OUT OUTOutput Layers
0
0
Output
Textual Input
12. September 201621
![Page 19: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/19.jpg)
MINER – IMPLEMENTATION
• Torch framework• matured, efficient GPU support, good packages for neural networks, …
• Scripting with Lua language
• Service implementation with Go because of great integration with Lua
• Integrated in Ontos Eiger workbench
12. September 201622
![Page 20: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/20.jpg)
• Create new “projects” ad-hoc
• Reuse pre-labeled corpora
• Free definition of entity types as required for domain
MINER – IMPLEMENTATION
12. September 201623
![Page 21: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/21.jpg)
• List and manage documents of project corpora
MINER – IMPLEMENTATION
12. September 201624
![Page 22: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/22.jpg)
• Double-check and annotate new labels
MINER – IMPLEMENTATION
12. September 201625
![Page 23: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/23.jpg)
MINER – IMPLEMENTATION
• English News• ~ 400 English texts from CoNLL 2003 and news feeds
• 5 entity types defined and manually annotated by 2 experts
• 1st value: correct start of entity
• 2nd value: correct end of entity
Class: Person Organization Product Location EventPer class F1:....... 0.976 | 0.990 0.932 | 0.962 0.873 | 0.942 0.958 | 0.976 0.891 | 0.928Per class Recall:... 0.974 | 0.988 0.936 | 0.953 0.845 | 0.926 0.965 | 0.980 0.870 | 0.913Per class Precision: 0.978 | 0.992 0.928 | 0.971 0.902 | 0.959 0.951 | 0.973 0.914 | 0.944
http://www.cnts.ua.ac.be/conll2003/ner/
12. September 201626
![Page 24: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/24.jpg)
MINER – IMPLEMENTATION
• English Twitter• ~ 2400 tweets from Twitter NLP tools / A. Ritter 2011
• 5 entity types defined, some are combined from original source
Class Person Organisation Place Product ThingPer class F1:....... 0.797 | 0.883 0.548 | 0.141 0.725 | 0.669 0.195 | 0.207 0.409 | 0.551Per class Recall:... 0.769 | 0.852 0.487 | 0.085 0.696 | 0.647 0.120 | 0.129 0.337 | 0.529Per class Precision: 0.828 | 0.918 0.627 | 0.414 0.756 | 0.693 0.516 | 0.514 0.520 | 0.574
https://github.com/aritter/twitter_nlp
12. September 201627
![Page 25: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/25.jpg)
WILDHORN
How to analyze texts from various sources and interlink with existing knowledge graphs?
12. September 201628
![Page 26: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/26.jpg)
NLP PIPELINES
• Problem: How to efficiently connect various tools in the data stream?
• Sources• RSS feed reader, crawler, Twitter, FTP servers, …
• Analytics• 1 MINER instance per language
• Disambiguate and link to Knowledge graphsor taxonomies
• Sinks• RDF stores, Apache Solr, HDFS, Apache Cassandra, …
http://www.computernewsme.com/wp-content/uploads/2011/06/Cable-clutter.jpg12. September 201629
![Page 27: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/27.jpg)
NLP PIPELINES
• Best practices – by LinkedIn• http://www.confluent.io/blog/stream-data-platform-1/
• Apache Kafka!
12. September 201630
![Page 28: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/28.jpg)
WILDHORN NLP PIPELINE
Message Broker
Apache Spark Apache Kafka / Confluent
Schema Registry
REST APITechnical Foundation:
Twitter Extractor
RSS Feed CrawlerREST Proxy
Log File Extractor
Web / Darknet Crawler
Taxo Extractor
MINER
AGDISTIS
Named Entity Extraction &
Disambiguation
ENS
RDF Serializer QUAD
NoSQL Serializer Apache Cassandra
Solr Serializer Apache Solr
12. September 201632
Document Classification
Sentiment Analysis
![Page 29: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/29.jpg)
WILDHORN NLP PIPELINE
• Implementation• Apache Spark 1.6 & Confluent 3.0
• Use of Spark Jobserver to manageApache Spark applications
• Ontos Eiger backend ready
• First tests• Simple Mesos cluster with 3 servers
• ~ 1000 message / sec with 1 broker without MINER analytics: no data problem
• Problem: 1 MINER instance currently scales up to 100 message / sec
Kafka Pipeline
Par
t
Producers
Consumers
Topic
Schema
Par
t
Producers
Consumers
Topic
Schema
...https://github.com/spark-jobserver/spark-jobserver
12. September 201633
![Page 30: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/30.jpg)
WILDHORN NLP PIPELINE
https://github.com/spark-jobserver/spark-jobserver
12. September 201634
![Page 31: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/31.jpg)
LESSONS LEARNED
What is good, what not and what are the next steps?
12. September 201635
![Page 32: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/32.jpg)
CURRENT STATUS
• Detection and classification of entities, relations and facts with good F1
• Long, high quality texts vs. short texts with bad / missing grammar in multiple languages
• Training not by linguists but domain experts
• Flexible adaption to new domains and contexts
• Many different data sources
12. September 201636
![Page 33: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/33.jpg)
LESSONS LEARNED
• Neural networks / deep learning provide great concepts & frameworks for flexible, high quality NLP tasks
• Apache Kafka / Confluent Platform in combination with Apache Spark good foundation for data streaming and processing
• Disambiguation and linking of entities to taxonomies and Knowledge Graphs via Semantic Web technologies is a core contribution for data integration
• Hard to find employees with Deep Learning skills
12. September 201637
![Page 34: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/34.jpg)
NEXT STEPS
• Try letter-trigram word hashing to overcome out of dictionary problem of word2vec algorithm
• Relation and fact extraction in MINER
Adam Neumann is the CEO of super-hot office rental company WeWork, the most valuable startup in New York City.
PersonFirst Name
Sentiment
Entity Type
Adam super-hot
Entity
Adam Neumann
Legend Relation Relation Type
firstName1
firstName WhoHow
positive1
positiveSentiment
WeWork
Organization
ThingWhatHow
a has
A1 A2 A3 A4
Annotation
https://arxiv.org/abs/1608.06757
12. September 201638
![Page 35: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/35.jpg)
NEXT STEPS
• Define NLP pipelines in frontend
• Make MINER scalable
• Usable search interface
• Benchmark with GERBIL
12. September 201639
![Page 36: Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics](https://reader030.fdocuments.in/reader030/viewer/2022020119/586f89851a28ab54768b6021/html5/thumbnails/36.jpg)
Q & A
Dr. Martin VoigtManaging Director
Ontos GmbHD-04319 Leipzig
M: +49 178 40 222 58E: [email protected]: m_a_r_t_i_n
12. September 201640