Classifying Tech News with Sparkling Water

H2O.ai Machine Intelligence

BUILDING MACHINE LEARNING APPLICATIONS WITH SPARKLING WATER

AVNI WADHWA & VINOD IYENGAR


Sparkling Water

Provides the following:

• Seamless integration of H2O with the Spark ecosystem

• Transparent use of H2O data structures and algorithms with the Spark API

• Excels in existing Spark workflows requiring advanced Machine Learning algorithms


Sparkling Water Requirements

• Spark Version 1.4

• Sparkling Water 1.4.3 (download at h2o.ai/download)


Tech News Use Case

• The goal is to predict the tag based on the short summary of the article


Tech News Use Case — Crawler

We used import.io to create a crawler that went through numerous pages of techcrunch.com news and collected, for each article, the title, the author, a 2-3 sentence opening, and the associated tags.

http://techcrunch.com


Tech News Use Case

The first text-processing step eliminates words that occur frequently but do not add value to the classification process.

Sample Scala code:
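
The code shown on the original slide is not preserved in this transcript. As a stand-in, here is a minimal sketch of stopword removal in plain Scala. The object name, the function name, and the short stopword list are assumptions made for illustration, not the list the authors used.

```scala
object StopWordFilter {
  // Small illustrative stopword list (an assumption; the deck does not show its list).
  val stopWords: Set[String] =
    Set("a", "an", "and", "the", "to", "of", "in", "is", "has", "been", "with", "at")

  // Lower-case the blurb, split on whitespace, and drop stopwords.
  def removeStopWords(blurb: String): Seq[String] =
    blurb.toLowerCase
      .split("\\s+")
      .toSeq
      .filter(w => w.nonEmpty && !stopWords.contains(w))
}
```

For example, StopWordFilter.removeStopWords("Apple has been tinkering with the iPhone") keeps only "apple", "tinkering", and "iphone".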


Tech News Use Case

We now eliminate words that do not add value to the classification process

• i.e., punctuation, stopwords, and words that occur infrequently

Sample Scala code:
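
The slide's code is not preserved here either. The sketch below shows one way the punctuation and rare-word filtering could look in plain Scala. The names and the minCount threshold are hypothetical, and the ASCII-only character class is a simplification chosen for this example.

```scala
object RareWordFilter {
  // Replace anything that is not an ASCII letter, digit, or whitespace with a space,
  // then split into lower-case tokens. Note this also splits "Apple's" into "apple" and "s".
  def tokenize(blurb: String): Seq[String] =
    blurb.toLowerCase
      .replaceAll("[^a-z0-9\\s]", " ")
      .split("\\s+")
      .toSeq
      .filter(_.nonEmpty)

  // Keep only tokens that appear at least minCount times across the whole corpus.
  def dropRareWords(docs: Seq[Seq[String]], minCount: Int): Seq[Seq[String]] = {
    val counts: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
    docs.map(_.filter(word => counts.getOrElse(word, 0) >= minCount))
  }
}
```

Counting over the whole corpus rather than per blurb matters here: with 2-3 sentence blurbs, almost every word is rare within a single document.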


Tech News Use Case — Word2Vec

A mathematical way to represent a word as a vector of numbers. These vector ‘representations’ encode information about the given word. In other words, the vector captures the meaning of the word.
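
To make the idea concrete, here is a small sketch in plain Scala. The three-dimensional vectors are made up for illustration (a trained Word2Vec model would supply real ones), and averaging word vectors into a single blurb vector is an assumption about how a blurb could be fed to a classifier, not something the slides confirm.

```scala
object BlurbVector {
  // Toy word vectors, invented for this example.
  val wordVectors: Map[String, Array[Double]] = Map(
    "iphone"  -> Array(0.9, 0.1, 0.0),
    "battery" -> Array(0.7, 0.3, 0.0),
    "funding" -> Array(0.0, 0.2, 0.9)
  )

  // Cosine similarity: close to 1.0 for vectors pointing the same way, near 0.0 for unrelated ones.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(y => y * y).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Average the vectors of the known words in a blurb; words without a vector are skipped.
  def average(tokens: Seq[String]): Option[Array[Double]] = {
    val known = tokens.flatMap(wordVectors.get)
    if (known.isEmpty) None
    else {
      val dim = known.head.length
      Some(Array.tabulate(dim)(i => known.map(_(i)).sum / known.size))
    }
  }
}
```

With these toy values, "iphone" is far closer to "battery" than to "funding", which is the sense in which a vector can capture meaning.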


Tech News Work Flow

• Training: text blurb → Word2Vec → Word2Vec model → train a model → GBM model

• Classification: article blurb (“Apple has been tinkering with ways to make the iPhone better at managing battery life…”) → categorize the text → “This article is related to gadgets”


Category Information

The original data set yielded about 55 categories. In order to streamline the classification process, we chose the 14 most frequently appearing tags in our dataset and labeled the rest into a catch-all category titled “Other.” The figure to the right shows the distribution of data in each category.
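
The consolidation step described above can be sketched in plain Scala as follows. The object and function names are hypothetical, and the alphabetical tie-break is an assumption added so the result is deterministic.

```scala
object TagConsolidation {
  // Keep the n most frequent tags and relabel every other tag as "Other".
  def consolidate(tags: Seq[String], n: Int): Seq[String] = {
    val counts = tags.groupBy(identity).map { case (tag, occurrences) => tag -> occurrences.size }
    // Sort by descending count; ties are broken alphabetically.
    val keep = counts.toSeq
      .sortBy { case (tag, count) => (-count, tag) }
      .take(n)
      .map(_._1)
      .toSet
    tags.map(tag => if (keep.contains(tag)) tag else "Other")
  }
}
```

Calling TagConsolidation.consolidate(tags, 14) on the full tag column would yield the 14 most frequent tags plus “Other”.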

Category Information

The variable importance chart to the right shows that the author accounts for the overwhelming majority of variable importance. In other words, the classification drew very little on the text samples provided and relied mostly on authors who frequently write under the same article tag. Let’s see how this changes when we try to classify the articles using only the text samples.


Analysis

The validation confusion matrix below is for the model that used both the authors and text blurbs to categorize articles. We know that this model placed heavy variable importance on authors. In the confusion matrix below, we see how this affects the error rate of various tags. For tags with smaller sets of data, it is common that a few authors write the majority of articles associated with those tags. For the “Enterprise” tag, for example, the data set is relatively small, and the error rate is relatively low (40%).


Analysis

The validation confusion matrix below is for the model that uses text blurbs exclusively to categorize articles. For the “Enterprise” tag, the error rate is 75%, significantly higher than the 40% we saw when authors were included in the data. This shows how strongly the first model depends on the author variable.
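
For reference, a per-tag error rate like those quoted above is the share of a tag's articles that were predicted as some other tag. Below is a minimal sketch in plain Scala, assuming rows hold the actual tag and columns the predicted tag. The names are hypothetical, and any counts passed in are the caller's own, not the deck's results.

```scala
object ConfusionMatrixErrors {
  // matrix(i)(j) = number of articles whose actual tag is labels(i) and predicted tag is labels(j).
  // The error rate for a tag is (row total - diagonal entry) / row total.
  def errorRates(labels: Seq[String], matrix: Seq[Seq[Int]]): Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) =>
      val total   = matrix(i).sum
      val correct = matrix(i)(i)
      label -> (if (total == 0) 0.0 else (total - correct).toDouble / total)
    }.toMap
}
```

Because the denominator is the row total, a tag with few articles can swing sharply in error rate when only a handful of predictions change.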


Example Classification

With the Scala code below, we supply the author of an article and a snippet of its text, and try to classify what the article is about.


Hit Ratios

[Charts: hit ratios with authors and without authors]

Hit ratios illustrate the chances of your model correctly categorizing a text blurb on the 1st, 2nd, 3rd, etc. try. The charts above show that both models, with and without authors, have an approximately 70% chance of correctly predicting the tag of a text blurb by the second try.
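
A hit ratio at k can be computed as the fraction of articles whose actual tag appears among the model's k most probable tags. The sketch below is plain Scala with hypothetical names; it assumes each article's candidate tags are already sorted from most to least probable.

```scala
object HitRatio {
  // ranked(i) holds the model's tags for article i, ordered from most to least probable.
  // Returns the fraction of articles whose actual tag is among the first k guesses.
  def hitRatio(actual: Seq[String], ranked: Seq[Seq[String]], k: Int): Double = {
    require(actual.size == ranked.size, "one ranked list per article")
    if (actual.isEmpty) 0.0
    else {
      val hits = actual.zip(ranked).count { case (tag, guesses) => guesses.take(k).contains(tag) }
      hits.toDouble / actual.size
    }
  }
}
```

The ratio can only grow as k increases, since every hit within the first k guesses is also a hit within the first k + 1.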


Possible Use

A possible use for such classification capabilities would be for blog posting sites. The user would enter their text into the field, and the classification model would automatically choose tags for the post.

H2O.ai Machine Intelligence

Customers • Community • Evangelists

November 9, 10, 11 • Computer History Museum

H2OWORLD.H2O.AI

20% off registration using code: h2ocommunity
