Classifying Tech News with Sparkling Water
Uploaded by: srisatish-ambati
Category: Technology
Transcript of Classifying Tech News with Sparkling Water
H2O.ai Machine Intelligence
BUILDING MACHINE LEARNING APPLICATIONS WITH SPARKLING WATER
AVNI WADHWA & VINOD IYENGAR
Sparkling Water
Provides the following:
• Seamless integration of H2O with the Spark ecosystem
• Transparent use of H2O data structures and algorithms with the Spark API
• Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Sparkling Water Requirements
• Spark Version 1.4
• Sparkling Water 1.4.3 (download at h2o.ai/download)
Tech News Use Case
• The goal is to predict the tag based on the short summary of the article
Tech News Use Case— Crawler
We used import.io to create a crawler that went through numerous pages of techcrunch.com news and collected, for each article, the title, the author, a 2-3 sentence opening, and the associated tags.
Tech News Use Case
The first manipulation of the text removes words that occur frequently but do not add value to the classification process.
Sample Scala code:
Tech News Use Case
We now eliminate tokens that do not add value to the classification process:
• i.e., punctuation, stopwords, and words that occur infrequently
Sample Scala code:
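A minimal sketch of this filtering step in plain Scala collections; it is not the code from the original slide. The object name `BlurbCleaner`, the short stopword list, and the `minCount` threshold are all assumptions made for illustration. In a Sparkling Water application the same functions would typically be applied to each record of a Spark RDD.

```scala
object BlurbCleaner {
  // Illustrative stopword list (an assumption; the slides do not show the list they used).
  val stopWords: Set[String] =
    Set("a", "an", "and", "the", "to", "of", "in", "is", "has", "been", "with", "at")

  // Lowercase the blurb, replace punctuation with spaces, split on whitespace,
  // and drop empty tokens and stopwords.
  def tokenize(blurb: String): Seq[String] =
    blurb.toLowerCase
      .replaceAll("[^a-z0-9\\s]", " ")
      .split("\\s+")
      .toList
      .filter(w => w.nonEmpty && !stopWords.contains(w))

  // Drop words that appear fewer than minCount times across all blurbs.
  def dropRareWords(docs: Seq[Seq[String]], minCount: Int): Seq[Seq[String]] = {
    val counts: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }
    docs.map(_.filter(w => counts(w) >= minCount))
  }
}
```

For example, `BlurbCleaner.tokenize` could be mapped over every blurb first, and `BlurbCleaner.dropRareWords` applied to the combined result so that word counts cover the whole data set.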
Tech News Use Case — Word2Vec
Word2Vec is a mathematical way to represent a word as a vector of numbers. These vector ‘representations’ encode information about the given word: words used in similar contexts end up with similar vectors, so the vector captures something of the meaning of the word.
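One way to see what "similar vectors" means is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not output from a trained Word2Vec model, real vectors usually have far more dimensions, and the names `WordVectors` and `toy` are hypothetical.

```scala
object WordVectors {
  // Dot product of two vectors of equal length.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine similarity: close to 1.0 when two vectors point in the same direction.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Made-up toy vectors (an assumption for illustration, not trained values).
  val toy: Map[String, Array[Double]] = Map(
    "iphone"  -> Array(0.9, 0.1, 0.0),
    "android" -> Array(0.8, 0.2, 0.1),
    "funding" -> Array(0.0, 0.1, 0.9)
  )
}
```

With these toy values, `cosine(toy("iphone"), toy("android"))` is larger than `cosine(toy("iphone"), toy("funding"))`, which is the kind of relationship a trained model is expected to learn from text.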
Tech News Work Flow
[Figure: workflow diagram with two paths. Train a model: Article Blurb, Word2Vec, GBM Model. Categorize the text: Text blurb (“Apple has been tinkering with ways to make the iPhone better at managing battery life…”), Word2Vec Model, GBM Model, result (“This article is related to gadgets”).]
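The slides do not say how the word vectors of a blurb are combined before they reach the GBM model. One common approach, assumed here and not confirmed by the source, is to average the vectors of the words in the blurb so that every article becomes a fixed-length numeric row. The names `BlurbVector` and `average` are hypothetical.

```scala
object BlurbVector {
  // Average the vectors of the words that have an entry in the lookup table.
  // Words without a vector are skipped; a blurb with no known words maps to zeros.
  def average(words: Seq[String],
              vectors: Map[String, Array[Double]],
              dim: Int): Array[Double] = {
    val found = words.flatMap(w => vectors.get(w))
    if (found.isEmpty) Array.fill(dim)(0.0)
    else {
      // Element-wise sum of all found vectors, then divide by their count.
      val sums = found.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (a, b) => a + b }
      }
      sums.map(_ / found.size)
    }
  }
}
```

The resulting array can then serve as the numeric feature columns for a classifier, alongside any categorical columns such as the author.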
Category Information
The original data set yielded about 55 categories. To streamline the classification process, we kept the 14 most frequently appearing tags in our dataset and grouped the rest into a catch-all category titled “Other.” The figure to the right shows the distribution of data in each category.
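The grouping described above can be sketched as follows. This is an illustration rather than the authors' code: `TagBucketing` and `bucketTags` are hypothetical names, and breaking ties alphabetically is an assumption added so that the result is deterministic.

```scala
object TagBucketing {
  // Keep the `keep` most frequent tags and relabel every other tag as `other`.
  def bucketTags(tags: Seq[String], keep: Int, other: String = "Other"): Seq[String] = {
    val top: Set[String] = tags
      .groupBy(identity)
      .map { case (tag, occurrences) => tag -> occurrences.size }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) } // most frequent first, ties alphabetical
      .take(keep)
      .map(_._1)
      .toSet
    tags.map(t => if (top.contains(t)) t else other)
  }
}
```

For the data set described here, the call would be `TagBucketing.bucketTags(tags, keep = 14)`, where `tags` holds one tag per article.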
Category Information
The variable importance chart to the right shows that the author is by far the most important variable. In other words, the classification used very little information from the text samples and relied mostly on authors who frequently write under the same article tag. Let’s see how this changes when we try to classify the articles using only the text samples.
Analysis
The validation confusion matrix below is for the model that used both the authors and text blurbs to categorize articles. We know that this model placed heavy variable importance on authors. In the confusion matrix below, we see how this affects the error rate of various tags. For tags with smaller sets of data, it is common that a few authors write the majority of the articles associated with those tags, which makes the author a strong signal for them. For the “Enterprise” tag, for example, the data set is relatively small, yet the error rate is relatively low (40%).
Analysis
The validation confusion matrix below is for the model that uses text blurbs exclusively to categorize articles. For the “Enterprise” tag, the error rate is 75%, significantly higher than the 40% we saw when authors were included in the data. This shows how heavily the first model relied on the author variable.
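The per-tag error rate read off a confusion matrix is the share of a tag's articles that were predicted as some other tag. A minimal sketch, assuming rows are actual tags and columns are predicted tags in the same order (the name `ConfusionMatrixErrors` is hypothetical, and no counts from the slides are reproduced here):

```scala
object ConfusionMatrixErrors {
  // matrix(i)(j) = number of articles whose actual tag is i and predicted tag is j.
  // Returns, for each actual tag, the fraction of its articles that were misclassified.
  def errorRates(matrix: Array[Array[Int]]): Array[Double] =
    matrix.zipWithIndex.map { case (row, i) =>
      val total = row.sum
      if (total == 0) 0.0 else (total - row(i)).toDouble / total
    }
}
```

A tag with 4 validation articles of which only 1 is predicted correctly has an error rate of 3/4, i.e. 75%.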
Example Classification
With the Scala code below, we specify the author of an article and a snippet of the article, and try to classify what the article is about.
Hit Ratios
[Charts: hit ratios for the model With Authors and the model Without Authors]
Hit ratios illustrate the chances of your model correctly categorizing a text blurb within its 1st, 2nd, 3rd, etc. try. The above charts show that both models, with and without authors, have approximately a 70% chance of correctly categorizing a text blurb by the second try.
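A hit ratio at k can be computed by checking whether the actual tag appears among the model's k most probable tags. The sketch below assumes the predictions are already available as ranked lists of tags; `HitRatios` and `hitRatio` are hypothetical names, not part of the H2O API.

```scala
object HitRatios {
  // actual(i) = true tag of article i
  // ranked(i) = the model's tags for article i, ordered from most to least probable
  // Returns the fraction of articles whose true tag is among the first k guesses.
  def hitRatio(actual: Seq[String], ranked: Seq[Seq[String]], k: Int): Double = {
    require(actual.nonEmpty && actual.size == ranked.size, "inputs must be non-empty and aligned")
    val hits = actual.zip(ranked).count { case (tag, guesses) => guesses.take(k).contains(tag) }
    hits.toDouble / actual.size
  }
}
```

Calling `hitRatio` for k = 1, 2, 3, and so on gives the cumulative curve that a hit ratio chart plots; the value can only stay the same or grow as k increases.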
Possible Use
A possible use for such classification capabilities would be on blog posting sites. The user would enter their text into the field, and the classification model would automatically suggest tags for the post.
Customers • Community • Evangelists
November 9, 10, 11, Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity