NoLimit Research Stack
Uploaded by ananta-pandu-wicaksana
![Page 1: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/1.jpg)
NoLimit Research Stack
Tech Talk - March 11th 2016
![Page 2: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/2.jpg)
Contents
I. Overview
II. APIs
  A. Entity Extractor
  B. Summarizer
  C. Category Classifier
  D. Topic
  E. Next Project(s)
III. Supporting Tools
  A. Gerbang-API
  B. Demo Master
![Page 3: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/3.jpg)
Overview - Introduction
The NoLimit Research Team (Ananta Pandu & Anggrahita Bayu) is responsible for developing internal NLP APIs for Bahasa Indonesia.
Currently, the APIs are:
A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic
![Page 4: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/4.jpg)
Overview - The Architecture
[Architecture diagram] APIs: Entity Extractor, Summarizer, Category Classifier, Topic Classifier, each behind its own web service. Supporting Tools: Gerbang-API, Demo Master.
![Page 5: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/5.jpg)
Overview - The Architecture (2)
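[Architecture diagram, by technology] APIs: Entity Extractor (Scala), Summarizer (NodeJS), Category Classifier (NodeJS), Topic Classifier (Python), each behind a NodeJS web service. Supporting Tools: Gerbang-API (NodeJS), Demo Master (Reactjs).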
![Page 6: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/6.jpg)
The APIs
A. Entity Extractor
B. Summarizer
C. Category Classifier
D. Topic Classifier
Link: http://demo.api.nlp.nolimitid.com/
![Page 7: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/7.jpg)
Entity Extractor
Get entities from an online news text
http://demo.api.nlp.nolimitid.com/page/3
Text: String
Entities: Object (City, Country, Company, Event, JobTitle, Organization, Person, Product)
![Page 8: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/8.jpg)
Entity Extractor (2)
![Page 9: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/9.jpg)
Entity Extractor (3)
- Built using Scala.
- Get entities from text using HMM (https://en.wikipedia.org/wiki/Hidden_Markov_model).
- Dataset (news + entities) provided by NoLimit. Total: 3000 articles.
- Previously built using Java + Weka. We switched from simple classification to HMM because HMM is known to recognize patterns better than simple classification.
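To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding, the standard way to find the most likely label sequence under an HMM. The slides do not say which decoding method the extractor uses, so treat this as an illustration only. The label set, the probability tables, and the `unk` smoothing value are all hypothetical; in practice they would be estimated from the labelled news dataset.

```python
import math
from typing import Dict, List

def viterbi(tokens: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            unk: float = 1e-6) -> List[str]:
    """Return the most likely label sequence for tokens (log-space Viterbi)."""
    # Scores for the first token: start probability times emission probability.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], unk))
               for s in states}]
    backpointers = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best previous label to transition from.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s].get(tok, unk)))
            ptr[s] = prev
        scores.append(row)
        backpointers.append(ptr)
    # Walk back from the best final label.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical toy model: "O" means "not an entity".
STATES = ["O", "Person", "City"]
START = {"O": 0.6, "Person": 0.3, "City": 0.1}
TRANS = {
    "O": {"O": 0.7, "Person": 0.15, "City": 0.15},
    "Person": {"O": 0.45, "Person": 0.5, "City": 0.05},
    "City": {"O": 0.8, "Person": 0.05, "City": 0.15},
}
EMIT = {
    "O": {"visited": 0.3, "in": 0.3, "the": 0.3},
    "Person": {"joko": 0.4, "widodo": 0.4},
    "City": {"bandung": 0.5, "jakarta": 0.5},
}

labels = viterbi(["joko", "widodo", "visited", "bandung"], STATES, START, TRANS, EMIT)
print(labels)
```

Because the transition table rewards staying in `Person` after a `Person` token, the decoder labels multi-token names as a unit, which is the sequence information a per-token classifier does not see.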
![Page 10: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/10.jpg)
Entity Extractor - V1.0
V1.0 built using Java + Weka
- Tag each token with simple classification using the Weka API.
- Good, because:
  - Weka makes experiments easier
- Bad, because:
  - Java 1.7 syntax for higher-order functions is worse than NodeJS
  - Simple classification < HMM
  - Confusing Weka API
  - Weka's models have a large file size
![Page 11: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/11.jpg)
Entity Extractor - V2.0
V2.0 built using NodeJS
- Implement HMM in NodeJS.
- Good, because:
  - JSON as literal object & array
  - Libraries (lodash, bluebird, nalapa https://github.com/anpandu/nalapa)
  - Easy to deploy (npm install + pm2 start)
- Bad, because:
  - NodeJS apps run in a single thread
  - Had to create multiple instances as micro-services, which is too complicated for a single endpoint
![Page 12: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/12.jpg)
Entity Extractor - V3.0 (current)
V3.0 built using Scala
- Implement HMM in Scala; actually just a port of the NodeJS code.
- Good, because:
  - Multithreading is simpler to implement than in NodeJS (parallel map, Akka, etc.)
  - Safer because of native immutability (val vs var)
  - Statically typed languages are generally faster than dynamically typed ones
- Bad, because:
  - Fewer libraries
  - Longer test time
![Page 13: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/13.jpg)
Entity Extractor (4)
Next:
- Updateable dataset
- Scheduled re-training (realtime?)
![Page 14: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/14.jpg)
Entity Extractor - Conclusion
- HMM is better than simple classification because predicting the label of each token in a text is a pattern-recognition problem. An HMM captures the sequence of tokens and their labels.
- Developing the entity extractor in Scala is the best solution so far because it makes parallel computation easier to implement than in NodeJS. It also performs better.
![Page 15: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/15.jpg)
Summarizer
Get a shorter version of an online news text
http://demo.api.nlp.nolimitid.com/page/2
Text: String
Summary: String
![Page 16: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/16.jpg)
Summarizer (2)
![Page 17: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/17.jpg)
Summarizer (3)
- Built using NodeJS (core: https://github.com/anpandu/nodejs-text-summarizer).
- Get a summary from a text:
  - Split the text into sentences.
  - Score each sentence using Word Form Similarity, Word Order Similarity, and Word Semantic Similarity (P. y. Zhang 2009).
  - Take the n best sentences to form a summary.
- No dataset. The scoring formula is coded straight from the paper (P. y. Zhang 2009).
![Page 18: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/18.jpg)
Summarizer - Preprocess
- Split the text into sentences
- Split each sentence into tokens
- Remove stopwords, replace non-ASCII characters, etc.
![Page 19: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/19.jpg)
Summarizer - Scoring
- The best sentence for a summary is the one most similar to the other sentences
- Score each sentence with:
  - Word Form Similarity (WFS)
  - Word Order Similarity (WOS)
  - Word Semantic Similarity (WS)
- The WFS, WOS, and WS formulas are in (P. y. Zhang 2009)
- Sum the scores (SUM = a*WFS + b*WOS + c*WS; a+b+c=1)
- Take the best n sentences
- Sort them by their original order and join
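The scoring and selection steps can be sketched in a few lines of Python. This is only an outline of the pipeline shape: the actual WFS, WOS, and WS formulas from (P. y. Zhang 2009) are not reproduced here, and the `jaccard` function below is a stand-in similarity chosen for illustration. The function names and the default weights are assumptions as well.

```python
import math
from typing import Callable, List, Sequence

Similarity = Callable[[Sequence[str], Sequence[str]], float]

def jaccard(s1: Sequence[str], s2: Sequence[str]) -> float:
    """Stand-in similarity (NOT the paper's formula): token-set overlap."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def summarize(sentences: List[str], n: int,
              sims: Sequence[Similarity] = (jaccard, jaccard, jaccard),
              weights: Sequence[float] = (0.5, 0.3, 0.2)) -> str:
    """Pick the n sentences most similar to the rest, kept in original order."""
    # The weights a, b, c must sum to 1, as on the slide.
    assert math.isclose(sum(weights), 1.0)
    tokens = [s.lower().split() for s in sentences]
    scores = []
    for i, ti in enumerate(tokens):
        total = 0.0
        for j, tj in enumerate(tokens):
            if i != j:
                # SUM = a*WFS + b*WOS + c*WS against every other sentence.
                total += sum(w * sim(ti, tj) for w, sim in zip(weights, sims))
        scores.append(total)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    # Restore the original sentence order before joining.
    return " ".join(sentences[i] for i in sorted(best))

example = ["the cat sat", "the cat ran", "dogs bark loudly", "the cat sat down"]
summary = summarize(example, n=2)
print(summary)
```

Swapping in real implementations of the three similarities only requires passing them through `sims`; the selection and re-ordering logic stays the same.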
![Page 20: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/20.jpg)
Category Classifier
Determine the category of an online news text
http://demo.api.nlp.nolimitid.com/page/1
Text: String
Category: Object
![Page 21: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/21.jpg)
Category Classifier (2)
![Page 22: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/22.jpg)
Category Classifier (3)
- Built using NodeJS (core: https://github.com/anpandu/indonesian-news-category-classifier).
- Tag an article with a category (12 categories in total):
  - Split the article into tokens.
  - Remove stopwords, deduplicate tokens, etc.
  - Use tf-idf scores as features to train the model (12 scores, 12 features).
  - Train using SVM (https://github.com/nicolaspanel/node-svm), because tf-idf scores are floating-point numbers, which work well as vector components.
- Tested on the training set: F-measure of 90%.
- Dataset provided by NoLimit: (article, category) tuples, 48,000 articles in total.
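One plausible reading of "12 scores, 12 features" is that each article gets one tf-idf score per category, and those scores form the feature vector passed to the SVM. The sketch below follows that reading; it is an assumption, not the confirmed design of the NodeJS classifier. It treats each category's training text as one document when computing idf, and every function name here is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def build_category_tfidf(corpus: Dict[str, List[List[str]]]) -> Dict[str, Dict[str, float]]:
    """corpus maps category -> list of tokenized articles; returns tf-idf per category."""
    # Term counts per category (each category treated as one big document).
    tf = {cat: Counter(tok for doc in docs for tok in doc) for cat, docs in corpus.items()}
    n_cats = len(tf)
    # Number of categories in which each token appears.
    df = Counter(tok for counts in tf.values() for tok in counts)
    weights = {}
    for cat, counts in tf.items():
        total = sum(counts.values())
        weights[cat] = {tok: (c / total) * math.log(n_cats / df[tok])
                        for tok, c in counts.items()}
    return weights

def featurize(tokens: List[str], weights: Dict[str, Dict[str, float]],
              categories: List[str]) -> List[float]:
    """One score per category: the sum of tf-idf weights of the article's unique tokens."""
    unique_tokens = set(tokens)
    return [sum(weights[cat].get(tok, 0.0) for tok in unique_tokens) for cat in categories]

# Tiny made-up corpus with two categories instead of twelve.
corpus = {
    "sport": [["goal", "match", "team"], ["match", "score"]],
    "economy": [["market", "stock", "bank"], ["bank", "rate", "market"]],
}
categories = sorted(corpus)  # fixed feature order: ["economy", "sport"]
weights = build_category_tfidf(corpus)
features = featurize(["bank", "market", "rally"], weights, categories)
print(features)
```

With twelve categories the same code yields a 12-dimensional float vector per article, which is the kind of input an SVM expects.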
![Page 23: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/23.jpg)
Topic (1)
Use cases:
1. Identify a predetermined number of topics from a collection of documents (news articles)
2. Classify a single article into a certain topic
![Page 24: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/24.jpg)
Topic (2)
- Built using Python 3.x and its awesome supporting libraries:
  - NumPy (array processing): http://www.numpy.org/
  - TextBlob (NLTK-based text processor): https://textblob.readthedocs.org/en/dev/
  - scikit-learn (machine learning: classifier, clustering, quality analysis): http://scikit-learn.org/
  - JSON (JSON handler), Pickle (Python object-to-file)
  - LDA
![Page 25: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/25.jpg)
Topic (3): Latent Dirichlet Allocation (LDA)
Here be the Leviathan
Indeed, any hope of overcoming him is false; Shall one not be overwhelmed at the sight of him?
-- Job 41:9 NKJV
![Page 26: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/26.jpg)
Latent Dirichlet Allocation (2)
![Page 27: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/27.jpg)
Latent Dirichlet Allocation (3)
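Notation:
- $W_{d,n}$ = n-th word in document d
- $Z_{d,n}$ = topic id of the n-th word in document d
- $\theta_d$ = distribution of topics in document d
- $\beta_k$ = probability of each word occurring in topic k
- $\alpha$ = prior weight of a topic in a document
- $\eta$ = prior weight of a word in a topic
- N = number of words, D = number of documents, K = number of topics, V = vocabulary size

Generative model:
$\beta_k \sim \mathrm{Dirichlet}_V(\eta)$
$\theta_d \sim \mathrm{Dirichlet}_K(\alpha)$
$Z_{d,n} \sim \mathrm{Categorical}_K(\theta_d)$
$W_{d,n} \sim \mathrm{Categorical}_V(\beta_k)$, with $k = Z_{d,n}$

[Plate diagram: α → θd → Zd,n → Wd,n ← βk ← η, with plates N (words), D (documents), K (topics)]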
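The generative story in the notation above can be simulated directly, which may help in reading the model: draw a word distribution per topic, then for each document draw a topic mixture and sample each word through a topic. This is a sketch using only the standard library. The function names, the toy vocabulary, and the parameter values are assumptions made for illustration, and the Dirichlet draw uses the usual normalized-gamma construction.

```python
import random
from typing import List

def dirichlet(rng: random.Random, concentration: float, size: int) -> List[float]:
    """Sample from a symmetric Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]

def generate_corpus(n_docs: int, n_words: int, n_topics: int, vocab: List[str],
                    alpha: float, eta: float, seed: int = 0) -> List[List[str]]:
    rng = random.Random(seed)
    # beta_k ~ Dirichlet_V(eta): one word distribution per topic.
    beta = [dirichlet(rng, eta, len(vocab)) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta_d ~ Dirichlet_K(alpha): topic mixture of this document.
        theta = dirichlet(rng, alpha, n_topics)
        doc = []
        for _ in range(n_words):
            # Z_{d,n} ~ Categorical_K(theta_d)
            z = rng.choices(range(n_topics), weights=theta)[0]
            # W_{d,n} ~ Categorical_V(beta_k) with k = Z_{d,n}
            doc.append(rng.choices(vocab, weights=beta[z])[0])
        corpus.append(doc)
    return corpus

toy_vocab = ["bola", "gol", "saham", "bank"]
toy_corpus = generate_corpus(n_docs=3, n_words=5, n_topics=2,
                             vocab=toy_vocab, alpha=0.5, eta=0.5)
for doc in toy_corpus:
    print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the hidden topic mixtures and word distributions.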
![Page 28: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/28.jpg)
Latent Dirichlet Allocation (4): in Python
# X = matrix of word counts over the vocabulary for all documents
# num_topics = predetermined number of topics
import lda
model = lda.LDA(n_topics=num_topics)
model.fit(X)  # the leviathan roars after this function's invocation
# array, shape = [n_topics, n_features]: point estimate of the topic-word distributions
topic_word = model.topic_word_
# array, shape = [n_samples, n_topics]: point estimate of the document-topic distributions
doc_topic = model.doc_topic_
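Given the two arrays returned by the fitted model, both use cases from the Topic slide reduce to simple lookups: the highest-weighted words describe each topic, and the highest-weighted topic classifies each article. The helpers below are a sketch with hypothetical names; `vocab` is assumed to be the list of words in the same column order as the word-count matrix. They accept plain nested lists as well as NumPy arrays.

```python
from typing import List, Sequence

def top_words(topic_word: Sequence[Sequence[float]], vocab: List[str],
              n: int = 3) -> List[List[str]]:
    """For each topic row, return the n words with the highest probability."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result

def most_likely_topic(doc_topic_row: Sequence[float]) -> int:
    """Return the index of the topic with the highest weight for one document."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])

# Made-up distributions standing in for model.topic_word_ and model.doc_topic_.
example_vocab = ["bola", "gol", "saham", "bank"]
example_topic_word = [[0.5, 0.3, 0.1, 0.1],
                      [0.05, 0.05, 0.6, 0.3]]
example_doc_topic = [[0.9, 0.1], [0.2, 0.8]]

print(top_words(example_topic_word, example_vocab, n=2))
print([most_likely_topic(row) for row in example_doc_topic])
```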
![Page 29: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/29.jpg)
Next Project(s)
- Entity extractor + sentiment
- Quotation extraction
- Opinion mining
- Credit card scoring by social media
![Page 30: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/30.jpg)
Challenges
- How to update the dataset/model automatically
- Balancing accuracy vs speed
![Page 31: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/31.jpg)
Supporting Tools
A. Gerbang-API
B. Demo Master
![Page 32: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/32.jpg)
Gerbang-API
- Built with NodeJS (sailsjs).
- Unifies all APIs into one.
- Actually just a simple app to re-route endpoints.
- Next:
  - Move to a simpler framework (restify?)
  - User authorization + token
  - Rate limiting
  - Logger
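The re-routing idea behind Gerbang-API can be illustrated with a small route table that maps a public path prefix to an internal service. This Python sketch is not the sailsjs implementation; the prefixes, ports, and the `resolve` function are all hypothetical and only show the lookup logic a gateway of this kind needs.

```python
from typing import Dict, Optional

# Hypothetical route table: public path prefix -> internal service base URL.
ROUTES: Dict[str, str] = {
    "/entity": "http://localhost:9001",
    "/summary": "http://localhost:9002",
    "/category": "http://localhost:9003",
    "/topic": "http://localhost:9004",
}

def resolve(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Map an incoming request path to the upstream URL, or None if unrouted."""
    for prefix, upstream in routes.items():
        # Match the prefix exactly or as a full leading path segment.
        if path == prefix or path.startswith(prefix + "/"):
            return upstream + path[len(prefix):]
    return None

print(resolve("/summary/v1/run"))
print(resolve("/unknown"))
```

Planned features such as token checks, rate limiting, and logging would sit in front of this lookup, before a request is forwarded upstream.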
![Page 33: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/33.jpg)
Demo Master
- Small web app that hosts documentation + demos.
- Link: http://demo.enrichment.nolimitid.com/
- Next:
  - Upgrade to a 3rd-party framework (probably https://github.com/tripit/slate)
![Page 34: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/34.jpg)
Demo Master (2)
Old Demo Master
![Page 35: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/35.jpg)
Demo Master (3)
Next Demo Master (with slate)
![Page 36: NoLimit Research Stack](https://reader033.fdocuments.in/reader033/viewer/2022051710/58f0560c1a28ab5a618b463b/html5/thumbnails/36.jpg)
Questions?