Tag Extraction Final Presentation - CS185CSpring2014

Tag Extraction
George McBay, Naoki Nakatani

San Jose State University
CS185C Spring 2014

Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
○ Feature Engineering: Title vs Body, Stop Words
○ Multi-label Classification
○ Apache Spark


Problem

Given a question with a title and a body, can we automatically generate tags for it?

Example question:

TITLE: Where can I find the LaTeX3 manual?
BODY: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?

Tags:
● Documentation
● latex3
● expl3

Dataset

File :
● Train.csv
● Test.csv

Fields :
● id, title, body, tags (Train)
● id, title, body (Test)

Characteristics :
● Quoted csv
● Body contains \n
● Tags separated by space
● Entry delimited by \0

[Figure: layout of the raw file. Each entry is a row of quoted, comma-separated fields; the body field may span several lines; entries are delimited by \0.]

Working Environment

● Mac OS 10.9.1
● Apache Hadoop 1.2.1
● Apache Mahout 0.8
● Apache Spark 0.9





ETL

Extract : Assume data is extracted from website

Transform : Use OpenCSV
1. Remove whitespaces (‘ ’, ‘\n’, ‘\t’)
2. Combine fields with ‘\t’
3. Write to tsv file

Load : Upload to HDFS
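
A minimal sketch of the Transform step, written in Python rather than the OpenCSV (Java) code used in the project. It assumes that "remove whitespaces" means collapsing runs of spaces, newlines, and tabs into a single space so that each record fits on one TSV line; the function name records_to_tsv is hypothetical.

```python
import csv
import io
import re


def records_to_tsv(raw_text):
    """Turn \\0-delimited, quoted-CSV entries into one TSV line per entry."""
    lines = []
    for entry in raw_text.split("\0"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the last delimiter
        # csv.reader copes with quoted fields that contain commas or newlines.
        for fields in csv.reader(io.StringIO(entry, newline="")):
            # Assumption: collapse any whitespace run to a single space.
            cleaned = [re.sub(r"\s+", " ", field).strip() for field in fields]
            lines.append("\t".join(cleaned))
    return lines


raw = '"1","How to  parse?","line one\nline\ttwo","python csv"\0"2","T","B","java"\0'
tsv_lines = records_to_tsv(raw)
```

Writing tsv_lines to a file and uploading it to HDFS would complete the Load step.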





Data Analysis

Tag Occurrence Count

TSV File

Map-Reduce
• Input : <index, question>
• Mapper output : <tag, 1> for each tag
• Reducer output : <tag, count> for each tag

Top tags (count, tag):
7785 c#
6788 java
6575 php
6135 javascript
5317 android
4949 jquery
3278 c++
3082 python
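
The tag-count job can be sketched in plain Python as a mapper and a reducer. This is an illustration only, not the Hadoop code from the project; it assumes each question is a TSV line whose fourth field holds the space-separated tags, and the function names are hypothetical.

```python
from collections import defaultdict


def mapper(index, question):
    """Emit <tag, 1> for every tag of one TSV question line."""
    tags = question.split("\t")[3].split()
    return [(tag, 1) for tag in tags]


def reducer(pairs):
    """Sum the 1s per tag to get <tag, count>."""
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)


questions = ["1\ttitle\tbody\tjava android", "2\ttitle\tbody\tjava"]
pairs = [pair for i, q in enumerate(questions) for pair in mapper(i, q)]
tag_counts = reducer(pairs)
```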





Question Filtering for ML

TSV File

Map-Reduce
• Input : <index, question>
• Mapper output : <index, question> if question contains a top-5 tag
• Reducer output : <index, question>

TSV file with questions that have at least one of the top-5 tags
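
A sketch of the filtering mapper in plain Python, using the five most frequent tags from the tag occurrence count. The TSV field order (tags in the fourth field) and the name filter_mapper are assumptions for illustration.

```python
# The five most frequent tags from the tag occurrence count.
TOP5 = {"c#", "java", "php", "javascript", "android"}


def filter_mapper(index, question):
    """Pass the question through only if it carries at least one top-5 tag."""
    tags = set(question.split("\t")[3].split())
    return [(index, question)] if tags & TOP5 else []


kept = filter_mapper(0, "1\ttitle\tbody\tjava spring")
dropped = filter_mapper(1, "2\ttitle\tbody\truby rails")
```

The reducer is an identity step here, since the mapper already decides which questions survive.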

Machine Learning

● Problem
○ Can we classify questions into one of 5 categories (tags)?

Classification
● Naive Bayes Classifier
● Details in the Mahout Classification presentation

Machine Learning
Correctly Classified Instances : 10209 (81.8816%)
Incorrectly Classified Instances : 2259 (18.1184%)
Total Classified Instances : 12468





Title vs Body

Intuitively…
Title is a short summary describing the body of the question
⇒ Title must be more important than body!

How to put more emphasis on title?
● Build separate models for title & body + more weight for title model?
● Prepend title several times and feed into regular model?

Two models approach

Title model not accurate…
● Too short for model to distinguish labels
● Longer text wins!

Repeated title approach

Slight improvement!
● Testing against train-set: ~ 93% ⇒ ~ 95%
● Testing against test-set: ~ 80% ⇒ ~ 82%

Multiple title
● more stop words ⇒ No effect
● more keywords (if the title has any)
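
The repeated-title approach amounts to a small preprocessing step before the text reaches the classifier. A minimal sketch follows; the function name and the default of 3 repeats are assumptions, since the slides do not say how many times the title was prepended.

```python
def emphasize_title(title, body, repeats=3):
    """Prepend the title `repeats` times so its words count more in the model."""
    return " ".join([title] * repeats + [body])


text = emphasize_title("LaTeX3 manual", "where is the pdf", repeats=2)
```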

Diving into model
● Top 10 words from each category
● Popular (redundant) words showing up in all categories (I, it, code, etc)

BUT
● Some words specific to each category (activity for android, jquery for javascript, echo for php)

Which words to drop?

Word count against TrainSmall.tsv?
● Total count : 19276034

Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328

Problem
● Key words have high count too
○ 39th - http - 51412 wc
○ 63rd - java - 35076 wc
○ 91st - php - 25135 wc

Can’t even throw away first 100 words...

Which words to drop?

Word count against ordinary english text?
● 20 books from gutenberg.org
● Total count : 1041565
● A lot less technical! (only 4 wc for java, probably the island in Indonesia?)
● Safe to throw away 1959 words (> 50 wc)
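
A sketch of how such a stop-word list could be derived from a non-technical reference corpus. The 50-count threshold comes from the slide; lower-casing and whitespace tokenization are assumptions, and build_stop_words is a hypothetical name.

```python
from collections import Counter


def build_stop_words(reference_text, min_count=50):
    """Return words that occur more than min_count times in ordinary English text."""
    counts = Counter(reference_text.lower().split())
    return {word for word, count in counts.items() if count > min_count}


# Toy corpus: "the" is frequent, "java" is rare, so only "the" is dropped.
reference = "the " * 51 + "java " * 4
stop_words = build_stop_words(reference)
```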

Not much improvement...

● Due to tf-idf weighting
○ Less weight for words appearing in many documents
○ More weight for words appearing only in specific documents

Any room for improvement?

What is the source of error?
● android ⇔ java ⇒ both java
● javascript ⇔ php ⇒ both web-related
● java classified as c# ⇒ many questions have both tags

Any room for improvement?

No problem if we can give multiple labels to one question!

Multi-label classification
● Modification from previous classification task
○ Top5 tags ⇒ Top1000 tags
○ 1 tag for 1 question ⇒ 5 tags for 1 question (Pick 5 most probable tags)
○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times

[Figure: the same question body is paired with each of its tags (tag1 + body, tag2 + body) and fed to the model]
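
The two changes for multi-label classification can be sketched as follows: a question with several tags becomes several training instances, and prediction keeps the 5 highest-scoring tags. This is an illustration under assumptions; the per-tag scores would come from the trained model, and both function names are hypothetical.

```python
def expand_training(body, tags):
    """One question with several tags becomes one (tag, body) instance per tag."""
    return [(tag, body) for tag in tags]


def top_k_tags(scores, k=5):
    """Pick the k most probable tags from a {tag: score} mapping."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


instances = expand_training("body", ["tag1", "tag2"])
best = top_k_tags({"a": 0.1, "b": 0.5, "c": 0.3}, k=2)
```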

Good outcome (Example 1)
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. Does anyone have some experience in doing this or does apple provide tutorials for developers in this? A lot of thanks.

Actual tags
● iphone
● ios
● upgrade

Predicted tags
● iphone
● ios
● osx
● objective-c
● php

GREAT outcome (Example 2)
TITLE: Is it possible to display an image in text field in html?
BODY: Can we display image inside a text field in <code>html</code>? Edit What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)

Actual tags
● javascript
● jquery
● html
● css
● web

Predicted tags
● javascript
● jquery ⇒ Never appears in text!
● html
● c#
● php

Stats

Row : # actual tags assigned to one question
Col : # predicted tags which are also in actual tag set

[Ex] Out of total 32798 questions which have 2 tags:
● For 14541 questions, model suggested both actual tags.
● For 13922 questions, model suggested 1 of the 2 actual tags.
● For 4335 questions, model couldn’t suggest either actual tag.

How to evaluate
Generous evaluator

If model gets at least 1 correct, approve it!

Total accuracy = 83.55% (B)

How to evaluate
Strict evaluator

Never approve unless model gets all correct!

Total accuracy = 43.04% (F)
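
The two evaluators can be written as set operations. This sketch reads "all correct" as "every actual tag appears among the predicted tags", which is an interpretation of the slide; the sample data is taken from Examples 1 and 2.

```python
def generous(actual, predicted):
    """Approve if at least one predicted tag is an actual tag."""
    return bool(set(actual) & set(predicted))


def strict(actual, predicted):
    """Approve only if every actual tag was predicted."""
    return set(actual) <= set(predicted)


def accuracy(examples, evaluator):
    """Fraction of (actual, predicted) pairs the evaluator approves."""
    approved = sum(1 for actual, predicted in examples if evaluator(actual, predicted))
    return approved / len(examples)


examples = [
    (["iphone", "ios", "upgrade"], ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"], ["javascript", "jquery", "html", "c#", "php"]),
]
generous_acc = accuracy(examples, generous)
strict_acc = accuracy(examples, strict)
```

Both examples pass the generous evaluator and fail the strict one, which matches the gap between the two reported accuracies.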

Conclusion for performance

● Overall, good!
○ Predicted tag set is relatively close to the actual tag set (Apple-related, Web-related)
● but, not there yet...
○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api, facebook-like ⇒ facebook)
○ Still showing unrelated tags (php python everywhere!)

Spark

Advantages:
- Easy to get started with
- Interactive shell
- Less code to write

Spark

Disadvantages:
- Not many references for MLlib
- Still new

Spark

● Used PySpark, the Python interface to Spark

● Implemented the ML model from the ground up using Python dictionaries and map-reduce operations

How It Works

5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap

How It Works

LINE

key_val = line.flatMap(~).map(~)
⇒ (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)

key_val = key_val.reduceByKey(~)
⇒ (a, 2) (b, 1) (c, 1) (d, 2)

How It Works

dict = key_val.collectAsMap()

(a, 2) (b, 1) (c, 1) (d, 2) ⇒ {a : 2, b : 1, c : 1, d : 2}
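
To make the data flow concrete, here is a plain-Python emulation of the same pipeline that runs without a Spark installation. The comments name the PySpark RDD method each step stands in for; the helper reduce_by_key and the two input lines are assumptions chosen to reproduce the pairs shown on the slide.

```python
from itertools import chain

lines = ["a b c d", "a d"]

# flatMap: split each line into words and flatten the result.
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair every word with a count of 1.
key_val = [(word, 1) for word in words]


def reduce_by_key(pairs, func):
    """Stand-in for reduceByKey: combine the values of equal keys with func."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())


# reduceByKey: add up the counts per word.
key_val = reduce_by_key(key_val, lambda x, y: x + y)

# collectAsMap: bring the pairs back as an ordinary dictionary.
word_counts = dict(key_val)
```

In PySpark the same chain would be applied to an RDD of lines, and only collectAsMap would return data to the driver as a Python dictionary.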

How It Works

Model:
- statistical model
- matrix of weights
- uses tf-idf

How It Works

[Figure: weight matrix diagram labeled "Tags", "Words from document", and "Relevance"]

How It Works

Implemented as → { tag : { word : weight } }

How It Works

● The most relevant tag is the one with the largest sum of weights over the words contained in the document
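
A sketch of the scoring step over the { tag : { word : weight } } dictionary. The toy weights are invented for illustration, and summing once per word occurrence (rather than once per distinct word) is an assumption the slides do not settle.

```python
def score_tags(model, words):
    """Sum, for every tag, the weights of the document's words (unknown words add 0)."""
    return {
        tag: sum(weights.get(word, 0.0) for word in words)
        for tag, weights in model.items()
    }


def most_relevant_tag(model, words):
    """Return the tag with the largest summed weight."""
    scores = score_tags(model, words)
    return max(scores, key=scores.get)


# Toy model with made-up weights.
model = {
    "python": {"def": 0.6, "import": 0.4},
    "php": {"echo": 0.9, "import": 0.1},
}
document = "def import print".split()
best_tag = most_relevant_tag(model, document)
```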

How It Works

Now, how are the weights calculated?
● First calculate idf (inverse document frequency) for each word
● Next calculate tf (term frequency) associated with each tag
● Multiply each tf entry by its idf, then normalize
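
The last step can be sketched as below, assuming tf and idf have already been computed as dictionaries. The slides do not say which normalization was used; dividing each tag's weights by their sum is an assumption made here only to keep the example concrete, and build_weights is a hypothetical name.

```python
def build_weights(tf, idf):
    """Multiply each tf entry by the word's idf, then normalize per tag."""
    model = {}
    for tag, counts in tf.items():
        raw = {word: count * idf.get(word, 0.0) for word, count in counts.items()}
        total = sum(raw.values())
        # Assumed normalization: weights of one tag sum to 1.
        model[tag] = {word: (value / total if total else 0.0) for word, value in raw.items()}
    return model


tf = {"php": {"echo": 3, "the": 10}}
idf = {"echo": 2.0, "the": 0.0}
weights = build_weights(tf, idf)
```

A word such as "the" that appears in every document has idf 0, so it ends up with zero weight no matter how often it occurs.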

How It Works

idf for a word is defined by:

idf(word) = log(D / F(word))

where,
D = total # of docs in the training set
F(word) = # of docs which contain word
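
The definition translates directly into code. In this sketch each document is a list of words, and the natural logarithm is an assumption because the slides do not state the base.

```python
import math


def compute_idf(docs):
    """idf(word) = log(D / F(word)) over a list of tokenized documents."""
    total_docs = len(docs)  # D
    doc_freq = {}  # F(word)
    for doc in docs:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total_docs / freq) for word, freq in doc_freq.items()}


docs = [["a", "b", "a"], ["a"]]
idf = compute_idf(docs)
```

Here "a" appears in both documents, so its idf is log(2/2) = 0, while "b" appears in one and gets log(2).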

How It Works

Two ways to calculate tf:
1) number of times you see the term associated with a tag
2) number of documents in which you see the term associated with a tag (in other words, only count one time per doc)
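
The two tf variants differ only in whether repeated words within one document are counted. A sketch, assuming each training document is a (tags, words) pair; the function names are hypothetical.

```python
from collections import defaultdict


def tf_by_occurrence(docs):
    """Variant 1: count every occurrence of a word in documents carrying the tag."""
    tf = defaultdict(lambda: defaultdict(int))
    for tags, words in docs:
        for tag in tags:
            for word in words:
                tf[tag][word] += 1
    return tf


def tf_by_document(docs):
    """Variant 2: count a word at most once per document carrying the tag."""
    tf = defaultdict(lambda: defaultdict(int))
    for tags, words in docs:
        for tag in tags:
            for word in set(words):
                tf[tag][word] += 1
    return tf


docs = [(["php"], ["echo", "echo", "x"]), (["php"], ["echo"])]
by_occurrence = tf_by_occurrence(docs)["php"]["echo"]
by_document = tf_by_document(docs)["php"]["echo"]
```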

Results
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. Does anyone have some experience in doing this or does apple provide tutorials for developers in this? A lot of thanks.

Actual tags
● iphone
● ios
● upgrade

Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade

Results
TITLE: Is it possible to display an image in text field in html?
BODY: Can we display image inside a text field in <code>html</code>? Edit What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)

Actual tags
● javascript
● jquery
● html
● css
● web

Predicted tags
● html
● img
● alignment
● get
● web

Results

[Figure: predicted tags (top) compared with actual tags (below)]

Results

● Not perfect
● But very close
● Relevant words for tags look right

Results

most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]

most relevant words for tag “math”:[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]

Adjusting

What can be adjusted?
● Pretty much anything!
● I tried playing with: tf, idf, tag_frequency, normalization, cleaning text, etc.

Conclusion

● Adjusting the metrics to get the right model can be time-consuming (many things can be adjusted)!

● But still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and text classification in general), because of how tf-idf is defined.
