Tag Extraction Final Presentation - CS185CSpring2014

Tag Extraction George McBay, Naoki Nakatani San Jose State University CS185C Spring 2014

Description

These slides were presented in class on May 7th, 2014.

Task allocation
● George: ETL, Data Analysis, Machine Learning, Multi-label classification with Apache Spark
● Naoki: ETL, Data Analysis, Machine Learning, Feature Engineering, Multi-label classification with Apache Mahout

Transcript of Tag Extraction Final Presentation - CS185CSpring2014

Page 1: Tag Extraction Final Presentation - CS185CSpring2014

Tag Extraction
George McBay, Naoki Nakatani

San Jose State University
CS185C Spring 2014

Page 2: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 3: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 4: Tag Extraction Final Presentation - CS185CSpring2014

Problem

Given a question with a title and a body, can we automatically generate tags for it?

Example question

Title: Where can I find the LaTeX3 manual?

Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?

Tags: Documentation, latex3, expl3

Page 5: Tag Extraction Final Presentation - CS185CSpring2014

Dataset

Files:
● Train.csv
● Test.csv

Fields:
● id, title, body, tags (Train)
● id, title, body (Test)

Characteristics:
● Quoted csv
● Body contains \n
● Tags separated by space
● Entry delimited by \0

(Diagram: layout of the records in the file. Each record consists of four quoted, comma-separated fields; the body field may span several lines; records are delimited by \0.)

Page 6: Tag Extraction Final Presentation - CS185CSpring2014

Working Environment

● Mac OS 10.9.1
● Apache Hadoop 1.2.1
● Apache Mahout 0.8
● Apache Spark 0.9

Page 7: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 8: Tag Extraction Final Presentation - CS185CSpring2014

ETL

Extract: Assume the data has already been extracted from the website

Transform: Use OpenCSV
1. Remove whitespaces (‘ ’, ‘\n’, ‘\t’)
2. Combine fields with ‘\t’
3. Write to a tsv file

Load: Upload to HDFS
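The Transform step above can be sketched in a few lines. This is only an illustration in Python using the standard `csv` module, not the OpenCSV (Java) code used in the project. The function name `csv_to_tsv` and the sample record are hypothetical, the \0 record delimiter is not handled, and the sketch assumes "remove whitespaces" means collapsing each run of whitespace into a single space so that words stay separated.

```python
import csv
import io
import re

def csv_to_tsv(csv_text):
    """Turn quoted CSV (fields may contain newlines) into one TSV line per record."""
    rows = csv.reader(io.StringIO(csv_text, newline=""))
    lines = []
    for fields in rows:
        # Assumption: collapse runs of spaces, tabs and newlines into one space
        cleaned = [re.sub(r"\s+", " ", field).strip() for field in fields]
        # Combine the cleaned fields with a tab
        lines.append("\t".join(cleaned))
    return "\n".join(lines)

# Hypothetical record in the Train.csv layout: id, title, body, tags
sample = '"1","How to parse?","<p>line one\nline two</p>","python csv"\n'
tsv = csv_to_tsv(sample)
print(tsv)
```

Because the newline inside the body is collapsed, each record ends up on exactly one line, which is what a line-oriented Map-Reduce job needs.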

Page 9: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 10: Tag Extraction Final Presentation - CS185CSpring2014

Data Analysis

Tag Occurrence Count

TSV File → Map-Reduce
• Input: <index, question>
• Mapper output: <tag, 1> for each tag
• Reducer output: <tag, count> for each tag

Most frequent tags (count, tag):
7785 c#
6788 java
6575 php
6135 javascript
5317 android
4949 jquery
3278 c++
3082 python
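The tag-count job can be simulated locally to see what the mapper and reducer do. This is a minimal sketch in plain Python, not the Hadoop job itself; the helper `run_job` and the three sample lines are hypothetical, and it assumes the tags are the fourth tab-separated field of each TSV line.

```python
from collections import defaultdict

def mapper(index, question):
    """Emit <tag, 1> for each tag of one TSV line (id, title, body, tags)."""
    tags = question.split("\t")[3].split()
    for tag in tags:
        yield tag, 1

def reducer(tag, counts):
    """Emit <tag, count> by summing the ones collected for a tag."""
    return tag, sum(counts)

def run_job(lines):
    """Hypothetical local driver: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for index, line in enumerate(lines):
        for tag, one in mapper(index, line):
            grouped[tag].append(one)
    return dict(reducer(tag, counts) for tag, counts in grouped.items())

# Made-up sample data
lines = ["1\tt1\tb1\tjava android", "2\tt2\tb2\tjava", "3\tt3\tb3\tphp"]
tag_counts = run_job(lines)
print(tag_counts)
```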

Page 11: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 12: Tag Extraction Final Presentation - CS185CSpring2014

Question Filtering for ML

TSV File → Map-Reduce → TSV File

Map-Reduce
• Input: <index, question>
• Mapper output: <index, question> if the question contains a top-5 tag
• Reducer output: <index, question>

Output: TSV file with the questions that have one of the top 5 tags
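The filtering step amounts to a membership test on each question's tags. A minimal sketch in plain Python follows, assuming the same TSV layout (tags in the fourth field) and taking the top 5 tags from the occurrence counts shown earlier; the function names and sample lines are hypothetical.

```python
# Top 5 tags according to the tag occurrence count
TOP5 = {"c#", "java", "php", "javascript", "android"}

def has_top_tag(question, top_tags=TOP5):
    """True if at least one tag of the TSV line is a top tag."""
    tags = question.split("\t")[3].split()
    return any(tag in top_tags for tag in tags)

def filter_questions(lines, top_tags=TOP5):
    """Keep only the questions that carry one of the top tags."""
    return [line for line in lines if has_top_tag(line, top_tags)]

# Made-up sample data
lines = ["1\tt1\tb1\tjava spring", "2\tt2\tb2\tlatex3 expl3", "3\tt3\tb3\tphp"]
kept = filter_questions(lines)
print(kept)
```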

Page 13: Tag Extraction Final Presentation - CS185CSpring2014

Machine Learning

● Problem
○ Can we classify questions into one of 5 categories (tags)?

Classification
● Naive Bayes Classifier
● Details in the Mahout Classification Presentation

Page 14: Tag Extraction Final Presentation - CS185CSpring2014

Machine Learning

Correctly Classified Instances: 10209 (81.8816%)
Incorrectly Classified Instances: 2259 (18.1184%)
Total Classified Instances: 12468

Page 15: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 16: Tag Extraction Final Presentation - CS185CSpring2014

Title vs Body

Intuitively…
The title is a short summary describing the body of the question
⇒ Title must be more important than body!

How to put more emphasis on the title?
● Build separate models for title & body + more weight for the title model?
● Prepend the title several times and feed into the regular model?
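The second option, prepending the title several times, is a one-line text transformation. A minimal sketch follows; the function name `emphasize_title` and the default of 3 repetitions are assumptions, since the slides do not say how many times the title was repeated.

```python
def emphasize_title(title, body, repeats=3):
    """Build the classifier input with the title repeated before the body.

    `repeats` is a hypothetical tuning knob, not a value from the slides.
    """
    return " ".join([title] * repeats + [body])

text = emphasize_title("Upgrade iPhone 3GS", "I have a iPhone 3GS at 4.2.1", repeats=2)
print(text)
```

Repeating the title raises the term counts of its words, so a bag-of-words model sees them as more frequent without any change to the model itself.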

Page 17: Tag Extraction Final Presentation - CS185CSpring2014

Two models approach

Title model not accurate…
● Too short for the model to distinguish labels
● Longer text wins!

Page 18: Tag Extraction Final Presentation - CS185CSpring2014

Repeated title approach

Slight improvement!
● Testing against train-set: ~ 93% ⇒ ~ 95%
● Testing against test-set: ~ 80% ⇒ ~ 82%

What repeating the title adds
● More stop words ⇒ no effect
● More keywords (if the title contains any)

Page 19: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 20: Tag Extraction Final Presentation - CS185CSpring2014

Diving into the model
● Top 10 words from each category
● Popular (redundant) words show up in all categories (I, it, code, etc.)

BUT
● Some words are specific to each category (activity for android, jquery for javascript, echo for php)

Page 21: Tag Extraction Final Presentation - CS185CSpring2014

Which words to drop?

Word count against TrainSmall.tsv?
● Total count: 19276034

Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328

Problem
● Key words have a high count too
○ 39th - http - 51412wc
○ 63rd - java - 35076wc
○ 91st - php - 25135wc

Can’t even throw away the first 100 words...

Page 22: Tag Extraction Final Presentation - CS185CSpring2014

Which words to drop?

Word count against ordinary English text?
● 20 books from gutenberg.org
● Total count: 1041565
● A lot less technical! (only 4wc for java, probably the Indonesian island?)
● Safe to throw away 1959 words (> 50wc)
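The idea of deriving stop words from ordinary English text can be sketched as follows. This is an illustrative plain-Python version, not the project's code: the tokenizer regex, the function names and the tiny synthetic reference text are assumptions, while the threshold of more than 50 occurrences comes from the slide.

```python
from collections import Counter
import re

def word_counts(text):
    """Count lowercase tokens; the token pattern is an assumption."""
    return Counter(re.findall(r"[a-z0-9#+]+", text.lower()))

def stop_words(reference_text, min_count=50):
    """Words seen more than `min_count` times in the reference corpus."""
    counts = word_counts(reference_text)
    return {word for word, count in counts.items() if count > min_count}

def drop_stop_words(text, stops):
    """Remove stop words from a question before training."""
    return " ".join(w for w in text.lower().split() if w not in stops)

# Synthetic stand-in for the 20 Gutenberg books
reference = "the " * 60 + "java " * 4
stops = stop_words(reference)
print(stops)
print(drop_stop_words("The java code", stops))
```

Since technical keywords such as java are rare in ordinary prose, they stay below the threshold and survive the filter.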

Page 23: Tag Extraction Final Presentation - CS185CSpring2014

BUT

Page 24: Tag Extraction Final Presentation - CS185CSpring2014

Not much improvement...

● Due to the tf-idf measurement
○ Less weight for words appearing in many documents
○ More weight for words appearing only in specific documents

Page 25: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 26: Tag Extraction Final Presentation - CS185CSpring2014

Any room for improvement?

What is the source of error?
● android ⇔ java ⇒ both java
● javascript ⇔ php ⇒ both web-related
● java classified as c# ⇒ many questions have both tags

Page 27: Tag Extraction Final Presentation - CS185CSpring2014

Any room for improvement?

No problem if we can give multiple labels to one question!

Page 28: Tag Extraction Final Presentation - CS185CSpring2014

Multi-label classification
● Modifications from the previous classification task
○ Top5 tags ⇒ Top1000 tags
○ 1 tag for 1 question ⇒ 5 tags for 1 question (pick the 5 most probable tags)
○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times

(Diagram: a question with tags tag1 and tag2 is fed to the model as two training examples, "tag1 body" and "tag2 body".)
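The two changes to the training and prediction steps can be illustrated in plain Python. This is a sketch under assumptions, not the Mahout pipeline: `expand_examples`, `top_k_tags` and the score values are hypothetical, and the per-tag scores stand in for whatever probabilities the classifier produces.

```python
def expand_examples(questions):
    """One (tag, body) training example per tag of each question."""
    return [(tag, body) for tags, body in questions for tag in tags]

def top_k_tags(scores, k=5):
    """Pick the k most probable tags from a {tag: score} dictionary."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A question with two tags is learned twice
examples = expand_examples([(["iphone", "ios"], "body text")])
print(examples)

# Made-up scores for illustration
best = top_k_tags({"php": 0.1, "ios": 0.5, "iphone": 0.3}, k=2)
print(best)
```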

Page 29: Tag Extraction Final Presentation - CS185CSpring2014

Good outcome (Example 1)

TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x

BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>

Actual tags
● iphone
● ios
● upgrade

Predicted tags
● iphone
● ios
● osx
● objective-c
● php

Page 30: Tag Extraction Final Presentation - CS185CSpring2014

GREAT outcome (Example 2)

TITLE: Is it possible to display an image in text field in html?

BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p>

Actual tags
● javascript
● jquery
● html
● css
● web

Predicted tags
● javascript
● jquery (never appears in the text!)
● html
● c#
● php

Page 31: Tag Extraction Final Presentation - CS185CSpring2014

Stats

Row: number of actual tags assigned to one question
Col: number of predicted tags which are also in the actual tag set

[Ex] Out of a total of 32798 questions which have 2 tags:
● For 14541 questions, the model suggested both actual tags.
● For 13922 questions, the model suggested 1 of the 2 actual tags.
● For 4335 questions, the model couldn’t suggest either of the correct tags.

Page 32: Tag Extraction Final Presentation - CS185CSpring2014

How to evaluate: Generous evaluator

If the model gets at least 1 tag correct, approve it!

Total accuracy = 83.55% (B)

Page 33: Tag Extraction Final Presentation - CS185CSpring2014

How to evaluate: Strict evaluator

Never approve unless the model gets all tags correct!

Total accuracy = 43.04% (F)
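The two evaluators differ only in the set test they apply. A minimal sketch in plain Python follows, using the two example questions from the earlier slides as data; the function names are hypothetical, and "all correct" is read here as "every actual tag appears among the predicted tags".

```python
def generous(actual, predicted):
    """Approve if at least one predicted tag is an actual tag."""
    return len(set(actual) & set(predicted)) >= 1

def strict(actual, predicted):
    """Approve only if every actual tag was predicted."""
    return set(actual) <= set(predicted)

def accuracy(pairs, evaluator):
    """Fraction of (actual, predicted) pairs the evaluator approves."""
    return sum(evaluator(a, p) for a, p in pairs) / len(pairs)

# Example 1 and Example 2 from the slides
pairs = [
    (["iphone", "ios", "upgrade"],
     ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"],
     ["javascript", "jquery", "html", "c#", "php"]),
]
print(accuracy(pairs, generous))
print(accuracy(pairs, strict))
```

Both examples pass the generous test and fail the strict one, which mirrors the gap between the two overall accuracy figures.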

Page 34: Tag Extraction Final Presentation - CS185CSpring2014

Conclusion for performance

● Overall, good!
○ Predicted tag set is relatively close to the actual tag set (Apple-related, Web-related)
● But not there yet...
○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#) or sub-tags (facebook-graph-api, facebook-like ⇒ facebook)
○ Still showing unrelated tags (php, python everywhere!)

Page 35: Tag Extraction Final Presentation - CS185CSpring2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark

Page 36: Tag Extraction Final Presentation - CS185CSpring2014

Spark

Advantages:
- Easy to get started with
- Interactive shell
- Less code to write

Page 37: Tag Extraction Final Presentation - CS185CSpring2014

Spark

Disadvantages:
- Not many references for MLlib
- Still new

Page 38: Tag Extraction Final Presentation - CS185CSpring2014

Spark

● Used PySpark, the Python interface to Spark

● Implemented the ML model from the ground up using Python dictionaries and map-reduce procedures

Page 39: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap

Page 40: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

key_val = line.flatMap(~).map(~)

LINE → (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)

key_val = key_val.reduceByKey(~)

(a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1) → (a, 2) (b, 1) (c, 1) (d, 2)

Page 41: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

dict = key_val.collectAsMap()

(a, 2) (b, 1) (c, 1) (d, 2) → {a : 2, b : 1, c : 1, d : 2}
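To make the data flow concrete without a Spark installation, the same pipeline can be emulated in plain Python. The helpers `flat_map` and `reduce_by_key` below are hypothetical stand-ins that only mimic the behaviour of the RDD methods on small lists; the two input lines are made up to reproduce the pairs shown on the slide. The comment gives an assumed PySpark equivalent, where `sc` would be an existing SparkContext.

```python
def flat_map(f, items):
    """Apply f to each item and flatten the results, like RDD.flatMap."""
    return [y for x in items for y in f(x)]

def reduce_by_key(f, pairs):
    """Merge the values of each key with f, like RDD.reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = f(merged[key], value) if key in merged else value
    return list(merged.items())

lines = ["a b c d", "a d"]

words = flat_map(str.split, lines)                    # a b c d a d
key_val = [(word, 1) for word in words]               # map: (a, 1) (b, 1) ...
key_val = reduce_by_key(lambda x, y: x + y, key_val)  # (a, 2) (b, 1) (c, 1) (d, 2)
counts = dict(key_val)                                # plays the role of collectAsMap()
print(counts)

# Assumed PySpark equivalent (not run here):
# sc.parallelize(lines).flatMap(lambda l: l.split()) \
#   .map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y).collectAsMap()
```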

Page 42: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

Model:
- statistical model
- matrix of weights
- uses tf-idf

Page 43: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

Tags

Page 44: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

Tags

Words from document

Page 45: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

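(Diagram: matrix of weights with tags on one axis, words from the document on the other, and the relevance of each word to each tag in the cells.)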

Page 46: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

Implemented as → { tag : { word : weight } }

Page 47: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

● The most relevant tag is chosen by the sum of the weights associated with the words contained in the document
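With the model stored as a nested dictionary, scoring a document is a sum of lookups. The following is a minimal sketch, not the presenters' code: the weights are made up, the whitespace tokenizer is an assumption, and each word occurrence is counted (counting each distinct word once would be an equally plausible reading of the slide).

```python
def score_tags(model, words):
    """Sum, per tag, the weights of the words found in the document."""
    return {tag: sum(weights.get(word, 0.0) for word in words)
            for tag, weights in model.items()}

def predict(model, text, k=5):
    """Return the k tags with the highest summed weight."""
    words = text.lower().split()  # assumption: simple whitespace tokenizer
    scores = score_tags(model, words)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical weights in the { tag : { word : weight } } layout
model = {
    "python": {"def": 0.6, "import": 0.4},
    "math": {"vector": 0.7, "calculate": 0.3},
}
print(predict(model, "import numpy then def main", k=1))
```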

Page 48: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

Now, how are the weights calculated?
● First, calculate the idf (inverse document frequency) for each word
● Next, calculate the tf (term frequency) associated with each tag
● Multiply each entry by its idf, then normalize

Page 49: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

idf for a word is defined by:

idf(word) = log(D / F(word))

where,
D = total number of documents in the training set
F(word) = number of documents which contain the word

Page 50: Tag Extraction Final Presentation - CS185CSpring2014

How It Works

Two ways to calculate tf:
1) the number of times you see the term associated with a tag
2) the number of documents in which you see the term associated with a tag (in other words, count only once per document)
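The weight calculation described on the last few slides can be put together as follows. This is a sketch under stated assumptions rather than the presenters' implementation: the function names and the two training documents are hypothetical, and because the slides do not say how the weights are normalized, the sketch simply scales each tag's weights so they sum to 1. The `per_document` flag switches between the two ways of calculating tf.

```python
import math
from collections import Counter, defaultdict

def idf(docs):
    """idf(word) = log(D / F(word)) over a list of tokenized documents."""
    total_docs = len(docs)
    doc_freq = Counter(word for words in docs for word in set(words))
    return {word: math.log(total_docs / freq) for word, freq in doc_freq.items()}

def tf_by_tag(tagged_docs, per_document=False):
    """Term frequency per tag; per_document=True counts a word once per doc."""
    tf = defaultdict(Counter)
    for tags, words in tagged_docs:
        counted = set(words) if per_document else words
        for tag in tags:
            tf[tag].update(counted)
    return tf

def build_model(tagged_docs, per_document=False):
    """Return { tag : { word : weight } } with weight = tf * idf, normalized."""
    idf_w = idf([words for _, words in tagged_docs])
    model = {}
    for tag, counts in tf_by_tag(tagged_docs, per_document).items():
        weights = {word: count * idf_w[word] for word, count in counts.items()}
        total = sum(weights.values())
        # Assumption: normalize so that each tag's weights sum to 1
        model[tag] = {w: v / total for w, v in weights.items()} if total else weights
    return model

# Made-up training data: (tags, tokenized document)
tagged = [(["python"], ["def", "def", "code"]), (["php"], ["echo", "code"])]
model = build_model(tagged)
print(model)
```

In this toy data the word code appears in every document, so its idf is log(1) = 0 and it contributes nothing to either tag, which is the effect the stop-word slides attribute to tf-idf.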

Page 51: Tag Extraction Final Presentation - CS185CSpring2014

Results

TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x

BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>

Actual tags
● iphone
● ios
● upgrade

Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade

Page 52: Tag Extraction Final Presentation - CS185CSpring2014

Results

TITLE: Is it possible to display an image in text field in html?

BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p>

Actual tags
● javascript
● jquery
● html
● css
● web

Predicted tags
● html
● img
● alignment
● get
● web

Page 53: Tag Extraction Final Presentation - CS185CSpring2014

Results

(Figure: predicted tags on top, actual tags below.)

Page 54: Tag Extraction Final Presentation - CS185CSpring2014

Results

● Not perfect
● But very close
● Relevant words for tags look right

Page 55: Tag Extraction Final Presentation - CS185CSpring2014

Results

most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]

most relevant words for tag “math”:[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]

Page 56: Tag Extraction Final Presentation - CS185CSpring2014

Adjusting

What can be adjusted?
● Pretty much anything!
● I tried playing with: tf, idf, tag_frequency, normalization, cleaning text, etc.

Page 57: Tag Extraction Final Presentation - CS185CSpring2014

Conclusion

● Adjusting the metrics to get the right model can be time-consuming (many things can be adjusted)!

● But still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and to text classification in general), because of how tf-idf is defined.

Page 58: Tag Extraction Final Presentation - CS185CSpring2014