Post on 29-Jun-2015
Tag Extraction
George McBay, Naoki Nakatani
San Jose State University
CS185C Spring 2014
Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark
Problem
Given a question with a title and a body, can we automatically generate tags for it?
TITLE: Where can I find the LaTeX3 manual?
BODY: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
Tags
● Documentation
● latex3
● expl3
Dataset
File :
● Train.csv
● Test.csv
Fields :
● id, title, body, tags (Train)
● id, title, body (Test)
Characteristics :
● Quoted csv
● Body contains \n
● Tags separated by space
● Entry delimited by \0
[Figure: record layout. Each entry consists of four quoted, comma-separated fields, the body field may span several lines, and each entry ends with \0]
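The file characteristics above can be handled with a few lines of code. The following is a minimal Python sketch, not the authors' code: the function name `parse_records` and the sample string are hypothetical, and it assumes the fields appear in the order listed above.

```python
import csv
import io

def parse_records(raw, has_tags=True):
    """Parse NUL-delimited, quoted-CSV entries into dictionaries."""
    names = ["id", "title", "body"] + (["tags"] if has_tags else [])
    records = []
    for chunk in raw.split("\0"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the last delimiter
        # csv.reader keeps newlines that sit inside a quoted field
        row = next(csv.reader(io.StringIO(chunk, newline="")))
        record = dict(zip(names, row))
        if has_tags:
            record["tags"] = record["tags"].split()  # tags are space separated
        records.append(record)
    return records

# Hypothetical sample with two entries; the first body contains a newline.
sample = ('"1","How to sort?","<p>line one\nline two</p>","java sorting"\0\n'
          '"2","Echo","<p>hi</p>","php"\0')
for rec in parse_records(sample):
    print(rec["id"], rec["tags"])
```

Splitting on \0 first means the newline inside a body never gets mistaken for the end of an entry.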
Working Environment
● Mac OS 10.9.1
● Apache Hadoop 1.2.1
● Apache Mahout 0.8
● Apache Spark 0.9
ETL
Extract : Assume data is extracted from the website
Transform : Use OpenCSV
1. Remove whitespace (‘ ’, ‘\n’, ‘\t’)
2. Combine fields with ‘\t’
3. Write to tsv file
Load : Upload to HDFS
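The transform step can be sketched as follows. The original used OpenCSV (a Java library); this Python stand-in only illustrates the idea. It assumes that "remove whitespace" means collapsing runs of spaces, newlines and tabs inside a field to a single space, so that tab and newline are free to act as delimiters. The function names are hypothetical.

```python
import re

def clean_field(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def to_tsv_line(fields):
    """Join the cleaned fields with tabs so each entry fits on one line."""
    return "\t".join(clean_field(f) for f in fields)

line = to_tsv_line(["1", "How  to sort?", "<p>line one\nline\ttwo</p>", "java sorting"])
print(line)
```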
Data Analysis
Tag Occurrence Count
TSV File
Map-Reduce
• Input : <index, question>
• Mapper output : <tag, 1> for each tag
• Reducer output : <tag, count> for each tag
Result (count, tag):
7785 c#
6788 java
6575 php
6135 javascript
5317 android
4949 jquery
3278 c++
3082 python
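The tag count job can be imitated in plain Python to show what the mapper and reducer compute. This is a sketch under assumptions: the real job ran on Hadoop, the function names are hypothetical, and the tags are assumed to be the last tab-separated column of each record.

```python
from collections import defaultdict

def mapper(index, question):
    """Emit (tag, 1) for every tag of one TSV record."""
    tags = question.split("\t")[-1].split()
    return [(tag, 1) for tag in tags]

def reducer(pairs):
    """Sum the ones emitted for each tag."""
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)

# Hypothetical records in the layout id, title, body, tags.
questions = [
    "1\tt1\tb1\tjava android",
    "2\tt2\tb2\tphp",
    "3\tt3\tb3\tjava",
]
pairs = [p for i, q in enumerate(questions) for p in mapper(i, q)]
tag_counts = reducer(pairs)
print(sorted(tag_counts.items(), key=lambda kv: -kv[1]))
```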
Question Filtering for ML
TSV File
Map-Reduce
• Input : <index, question>
• Mapper output : <index, question> if the question contains a top5 tag
• Reducer output : <index, question>
TSV file with the questions that have one of the top5 tags
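The filtering rule amounts to a membership test on the tag column. A minimal sketch, assuming the top5 tags are the five most frequent ones from the occurrence count (c#, java, php, javascript, android) and that tags are the last tab-separated column; `has_top5_tag` is a hypothetical name.

```python
TOP5 = {"c#", "java", "php", "javascript", "android"}

def has_top5_tag(question, top_tags=TOP5):
    """True if at least one tag of the TSV record is a top5 tag."""
    tags = question.split("\t")[-1].split()
    return any(tag in top_tags for tag in tags)

# Hypothetical records in the layout id, title, body, tags.
questions = [
    "1\tt1\tb1\tjava spring",
    "2\tt2\tb2\tlatex3 expl3",
    "3\tt3\tb3\tphp",
]
kept = [q for q in questions if has_top5_tag(q)]
print(len(kept), "of", len(questions), "questions kept")
```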
Machine Learning
● Problem
○ Can we classify questions into one of 5 categories (tags)?
Classification
● Naive Bayes Classifier
● Details in the Mahout Classification presentation
Machine Learning
Correctly Classified Instances : 10209 81.8816%
Incorrectly Classified Instances : 2259 18.1184%
Total Classified Instances : 12468
Title vs Body
Intuitively…
Title is a short summary describing the body of the question
⇒ Title must be more important than body!
How to put more emphasis on title?
● Build separate models for title & body + more weight for title model?
● Prepend title several times and feed into regular model?
Two models approach
Title model not accurate…
● Too short for model to distinguish labels
● Longer text wins!
Repeated title approach
Slight improvement!
● Testing against train-set: ~ 93% ⇒ ~ 95%
● Testing against test-set: ~ 80% ⇒ ~ 82%
Repeating the title means
● more stop words ⇒ no effect
● more keywords (if the title has any)
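The repeated title approach only changes how the training text is assembled. A minimal sketch: the function name `build_text` and the default of 3 repeats are assumptions, since the slides only say "several times".

```python
def build_text(title, body, title_repeats=3):
    """Prepend the title several times so its words count more."""
    return " ".join([title] * title_repeats + [body])

text = build_text("Upgrade iPhone 3GS", "I have a iPhone 3GS at 4.2.1", title_repeats=2)
print(text)
```

Because the classifier works on word counts, repeating the title raises the counts of its keywords without needing a second model.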
Diving into model
● Top 10 words from each category
● Popular (redundant) words showing up in all categories (I, it, code, etc)
BUT
● Some words specific to each category (activity for android, jquery for javascript, echo for php)
Which words to drop?
Word count against TrainSmall.tsv?
● Total count : 19276034
Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328
Problem
● Key words have high count too
○ 39th - http - 51412wc
○ 63rd - java - 35076wc
○ 91st - php - 25135wc
Can’t even throw away first 100 words...
Which words to drop?
Word count against ordinary English text?
● 20 books from gutenberg.org
● Total count : 1041565
● A lot less technical! (only 4wc for java, probably the island in Indonesia?)
● Safe to throw away 1959 words (> 50wc)
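Building a stop-word list from a reference corpus can be sketched like this. The tokenizer, the function names and the tiny sample corpus are assumptions; only the rule "drop words with a count above a threshold" (50 in the slides) comes from the text above.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count lowercase word occurrences over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9#+]+", text.lower()))
    return counts

def stop_words(reference_counts, threshold=50):
    """Words seen more than `threshold` times in the reference corpus."""
    return {w for w, c in reference_counts.items() if c > threshold}

# Hypothetical stand-in for the 20 books; threshold lowered to fit the sample.
books = ["the cat and the dog", "the island of java"]
print(stop_words(word_counts(books), threshold=2))
```

Counting against ordinary text rather than the training set keeps technical keywords such as java or php out of the stop-word list, since they are rare there.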
BUT
Not much improvement...
● Due to tf-idf measurement
○ Less weight for words appearing in many documents
○ More weight for words appearing only in specific documents
Any room for improvement?
What is the source of error?
● android ⇔ java ==> both java
● javascript ⇔ php ===> both web-related
● java classified as c# ===> many questions have both tags
Any room for improvement?
No problem if we can give multiple labels to one question!
Multi-label classification
● Modification from previous classification task
○ Top5 tags ⇒ Top1000 tags
○ 1 tag for 1 question ⇒ 5 tags for 1 question (pick the 5 most probable tags)
○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times
[Figure: a question with tags tag1 and tag2 becomes two training rows, "tag1 body" and "tag2 body", which are fed to the model]
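The two changes to the training and prediction steps can be sketched as below. Both function names are hypothetical, and the scores passed to `top_k_tags` stand in for whatever per-tag probabilities the classifier produces.

```python
def expand_by_tag(question):
    """Turn one (tags, body) question into one training row per tag."""
    tags, body = question
    return [(tag, body) for tag in tags]

def top_k_tags(scores, k=5):
    """Return the k tags with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

rows = expand_by_tag((["tag1", "tag2"], "body"))
print(rows)
print(top_k_tags({"ios": 0.4, "php": 0.1, "iphone": 0.5}, k=2))
```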
Good outcome (Example 1)
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● iphone
● ios
● osx
● objective-c
● php
GREAT outcome (Example 2)
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● javascript
● jquery ===> Never appears in text!
● html
● c#
● php
Stats
Row : # actual tags assigned to one question
Col : # predicted tags which are also in actual tag set
[Ex] Out of total 32798 questions which have 2 tags:
● For 14541 questions, model suggested both actual tags.
● For 13922 questions, model suggested 1 of 2 actual tags.
● For 4335 questions, model couldn’t suggest any of the correct tags.
How to evaluate
Generous evaluator
If model gets at least 1 correct, approve it!
Total accuracy = 83.55% (B)
How to evaluate
Strict evaluator
Never approve unless model gets all correct!
Total accuracy = 43.04% (F)
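The two evaluators differ only in the test applied to each question. A minimal sketch with hypothetical function names; it assumes "all correct" means every actual tag appears among the predicted tags. The sample data are the two example questions shown earlier in the slides.

```python
def generous_correct(actual, predicted):
    """Approve if at least one actual tag was predicted."""
    return len(set(actual) & set(predicted)) >= 1

def strict_correct(actual, predicted):
    """Approve only if every actual tag was predicted."""
    return set(actual) <= set(predicted)

def accuracy(examples, is_correct):
    """Fraction of (actual, predicted) pairs the evaluator approves."""
    return sum(is_correct(a, p) for a, p in examples) / len(examples)

examples = [
    (["iphone", "ios", "upgrade"], ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"], ["javascript", "jquery", "html", "c#", "php"]),
]
print("generous:", accuracy(examples, generous_correct))
print("strict:", accuracy(examples, strict_correct))
```

Both examples pass the generous test and fail the strict one, which shows how far apart the two scores can sit for the same predictions.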
Conclusion for performance
● Overall, good!
○ Predicted tag set is relatively close to the actual tag set (Apple-related, Web-related)
● but, not there yet...
○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api, facebook-like ⇒ facebook)
○ Still showing unrelated tags (php python everywhere!)
Spark
Advantages:
- Easy to get started with
- Interactive shell
- Less code to write
Spark
Disadvantages:
- Not many references for MLlib
- Still new
Spark
● Used PySpark, the Python interface to Spark
● Implemented the ML model from the ground up using Python dictionaries and map-reduce procedures
How It Works
5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap
How It Works
key_val = line.flatMap(~).map(~)
LINE ⇒ (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)
key_val = key_val.reduceByKey(~)
(a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1) ⇒ (a, 2) (b, 1) (c, 1) (d, 2)
How It Works
dict = key_val.collectAsMap()
(a, 2) (b, 1) (c, 1) (d, 2) ⇒ {a : 2, b : 1, c : 1, d : 2}
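To make the data flow concrete, the same steps can be imitated in plain Python without a Spark cluster. This is an emulation, not the authors' PySpark code: the two input lines are an assumption chosen to reproduce the pairs shown above, and `reduce_by_key` is a hypothetical helper that mimics what `reduceByKey` does.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

lines = ["a b c d", "a d"]  # assumed input that yields the pairs shown above

# flatMap: one line -> many words, flattened into a single sequence
words = list(chain.from_iterable(line.split() for line in lines))
# map: word -> (word, 1)
key_val = [(w, 1) for w in words]

def reduce_by_key(pairs, func):
    """Combine the values of equal keys with a binary function."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

key_val = reduce_by_key(key_val, lambda x, y: x + y)
# collectAsMap: gather the pairs into an ordinary dictionary
counts = dict(key_val)
print(counts)  # {'a': 2, 'b': 1, 'c': 1, 'd': 2}
```

In PySpark the placeholders would typically be lambdas of the same shape, for example `flatMap(lambda l: l.split())`, `map(lambda w: (w, 1))` and `reduceByKey(lambda x, y: x + y)`; whether the authors used exactly these is not stated in the slides.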
How It Works
Model:
- statistical model
- matrix of weights
- uses tf-idf
How It Works
[Figure: matrix of relevance weights, with tags on one axis and words from the document on the other]
How It Works
Implemented as → { tag : { word : weight } }
How It Works
● The most relevant tag is the one with the largest sum of weights over the words contained in the document
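The selection rule can be sketched on the nested dictionary layout { tag : { word : weight } }. The function name `score_tags` and the toy weights are invented for illustration, and the sketch assumes every occurrence of a word in the document contributes its weight.

```python
def score_tags(model, words):
    """For each tag, sum the weights of the words found in the document."""
    return {tag: sum(weights.get(w, 0.0) for w in words)
            for tag, weights in model.items()}

# Toy model with made-up weights.
model = {
    "python": {"def": 0.6, "import": 0.4},
    "php": {"echo": 0.7, "import": 0.1},
}
scores = score_tags(model, ["import", "def", "foo"])
best = max(scores, key=scores.get)
print(best, scores)
```

Words missing from a tag's dictionary simply contribute zero, so the nested dictionary behaves like a sparse matrix.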
How It Works
Now, how are the weights calculated?
● First calculate idf (inverse document frequency) for each word
● Next calculate tf (term frequency) associated with each tag
● Multiply each entry by idf, then normalize
How It Works
idf for a word is defined by:
idf(word) = log(D / F(word))
where
D = total # of docs in the training set
F(word) = # of docs which contain the word
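The idf definition translates directly into code. A minimal sketch: the function name is hypothetical, documents are assumed to be lists of words, and the natural logarithm is assumed because the slides do not state a base.

```python
import math

def idf(documents):
    """documents: list of word lists. Returns {word: log(D / F(word))}."""
    total = len(documents)  # D
    doc_freq = {}           # F(word)
    for words in documents:
        for w in set(words):  # count each document once per word
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(total / f) for w, f in doc_freq.items()}

docs = [["a", "b"], ["a"], ["a", "c", "c"]]
values = idf(docs)
print(values)
```

A word that appears in every document, like "a" here, gets idf 0 and therefore no influence on the weights.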
How It Works
Two ways to calculate tf:
1) number of times you see the term associated with a tag
2) number of documents in which you see the term associated with a tag (in other words, count only once per doc)
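Both tf variants, followed by the multiply-by-idf and normalize step, can be sketched as below. The function names are hypothetical, and normalizing each tag's weights to sum to 1 is an assumption, since the slides do not say which normalization was used.

```python
from collections import defaultdict

def term_frequency(examples, once_per_doc=False):
    """examples: list of (tags, words). Returns {tag: {word: count}}."""
    tf = defaultdict(lambda: defaultdict(int))
    for tags, words in examples:
        # variant 2 counts a word at most once per document
        counted = set(words) if once_per_doc else words
        for tag in tags:
            for w in counted:
                tf[tag][w] += 1
    return {tag: dict(counts) for tag, counts in tf.items()}

def weights(tf, idf):
    """Multiply each tf entry by idf, then scale each tag's row to sum to 1."""
    model = {}
    for tag, counts in tf.items():
        row = {w: c * idf.get(w, 0.0) for w, c in counts.items()}
        total = sum(row.values())
        model[tag] = {w: v / total for w, v in row.items()} if total else row
    return model

# Toy data: two documents and made-up idf values.
examples = [(["php"], ["echo", "echo", "x"]), (["php", "web"], ["echo"])]
tf_all = term_frequency(examples)
tf_doc = term_frequency(examples, once_per_doc=True)
model = weights(tf_all, {"echo": 1.0, "x": 1.0})
print(tf_all, tf_doc, model)
```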
Results
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade
Results
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● html
● img
● alignment
● get
● web
Results
[Figure: predicted tags (top) compared with actual tags (below)]
Results
● Not perfect
● But very close
● Relevant words for tags look right
Results
most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]
most relevant words for tag “math”:[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]
Adjusting
What can be adjusted?
● Pretty much anything!
● I tried playing with: tf, idf, tag_frequency, normalization, cleaning text, etc.
Conclusion
● Adjusting the metrics to get the right model can be time-consuming (many things can be adjusted)!
● But still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and text classification in general), because of how tf-idf is defined.