
Applications of Machine Learning Algorithms

Brian Zhao

[email protected]

Mentor: Dr. Manna

Computer Science Department

California State Polytechnic University, Pomona

Abstract

The purpose of machine learning is to model functions that take input data and create useful output, generally as a prediction about the input.[1] This paper presents two applications of different machine learning algorithms on both textual and nontextual data. We implemented a set of supervised learning algorithms to classify Reddit posts to corresponding subreddits with an average accuracy exceeding 80%. In addition, we implemented the K-Means clustering algorithm to cluster smartphones based on their attributes in order to aid the efficient retrieval of similar products.

1 Introduction

Machine learning can be defined as a set of algorithms which attempt to learn from patterns in data.[1] These algorithms can be applied to problems in which a mathematical relation exists between certain entities, the relation is “hard to pin down mathematically”, and a large set of training data exists.[2] There are two main types of machine learning algorithms: supervised and unsupervised. The difference between the two is that the training data given in supervised learning has associated outputs or results given along with it, whereas resultant outputs are not known in unsupervised learning algorithms. Thus, supervised

learning is an attempt to generate a function that maps between known inputs and outputs, whereas unsupervised learning endeavors to illuminate underlying “structure in its input.”[1]

One practical application of supervised learning is text classification. Reddit is an online forum where people post content in specific subforums called “subreddits.” Each post has an associated title that hints at the content of the post. The post itself, however, may consist of text, a link to another website, a picture, or video. We wanted to see if, given the title of any post, we could accurately classify which subreddit it came from. Our solution was to implement Naïve Bayes and Rocchio text classifiers that map Reddit post titles to subreddits. Our training data consisted of 90% of the top 1000 posts from a fixed number of randomly selected subreddits. We would then classify the remaining 10% of posts as test data.

Another problem we tackled this quarter involved the design of a smartphone recommendation engine that would quickly suggest smartphones similar to the one a user is currently viewing. Our solution was to represent smartphones as vectors of their attributes (such as screen size, operating system, and screen resolution) and cluster the smartphones so that phones with similar attributes would be grouped together.


We can then look for the n most similar phones from within the cluster of the user’s target phone and offer these as suggestions to the user.

2 Related Works

2.1 Classification

In Introduction to Information Retrieval, authors Manning et al. define the classification problem as: “given a set of classes, we seek to determine which classes a given object belongs to.” [3] Specifically, our project deals with text classification, which is the problem of assigning a text document to a topical category. The authors formally define the parameters: we are given a training set D of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ X × C, X is the “document space”, and C is the “set of classes”. [3] The text classification algorithm must generate a function γ : X → C, which chooses the best class that each input document d ∈ X should belong to. Although Manning et al. warn that certain classes may be subsets of other classes, creating an “any-of problem” [3], this situation does not apply to our implementation because subreddits are mutually exclusive. The specifics of the Naïve Bayes and Rocchio text classification algorithms are explained in the “Methodology” section.

2.2 Clustering

The goal of clustering is to “create clusters that are coherent internally, but clearly different from each other”. [3] Clustering is similar to text classification in that both algorithms attempt to assign documents to certain sets. However, these sets are predefined and known in a classification problem, whereas a clustering algorithm dynamically assigns documents to sets based on their similarity to the documents currently assigned to the set. Manning et al. define the cluster hypothesis as the property that “documents in the same cluster behave similarly with respect to relevance to information needs.” This is the fundamental assumption for our use of clustering to provide similar smartphones to a user. Mathematically, Manning et al. define hard flat clustering as: “Given (i) a set of documents D = {d1, …, dN}, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment γ : D → {1, …, K} that minimizes the objective function.” [3] Essentially, the goal is to assign documents to clusters in a manner that minimizes “dissimilarity” between the intra-cluster documents. This “dissimilarity” can be defined differently, yielding different clustering algorithms. For example, one possible measure of dissimilarity between documents could be the cosine of the angle between document vectors, whereas another is the Euclidean distance between document vectors. The specifics of the K-Means clustering algorithm are explained in the “Methodology” section.

3 Methodology

3.1a Reddit Classification Problem

Our first task was to obtain the documents to classify. We used CSV databases of the top 1000 posts (as of 2013) of 1000 subreddits¹. Using a parser, we extracted the post titles of all posts (as a list of strings) and the subreddit that each post title came from (as another string). We could then pass each data pair of (string document, string class) as training data for the classifier.

¹ Obtained from https://github.com/umbrae/reddit-top-2.5-million/tree/master/data


For example, one post in our data set has the title “We have an employee whose last name is Null.”, and comes from the subreddit “programming”.
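For illustration, the parsing step might look like the following Java sketch. The title column index is a placeholder, since the exact column layout depends on the downloaded .csv files, and a production parser would also need to handle quoted fields that contain commas:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class RedditCsvParser {
        /** One (title, subreddit) training pair. */
        public record LabeledTitle(String title, String subreddit) {}

        // Placeholder index; the real value depends on the dump's column layout.
        private static final int TITLE_COLUMN = 4;

        /** Reads one subreddit's CSV dump into (title, subreddit) pairs. */
        public static List<LabeledTitle> parse(Path csvFile, String subreddit) throws IOException {
            List<LabeledTitle> pairs = new ArrayList<>();
            for (String line : Files.readAllLines(csvFile)) {
                // Naive split; a real parser must handle commas inside quoted titles.
                String[] fields = line.split(",");
                if (fields.length > TITLE_COLUMN) {
                    pairs.add(new LabeledTitle(fields[TITLE_COLUMN], subreddit));
                }
            }
            return pairs;
        }
    }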

Next, we had to create a function γ : X → C which takes as input a Reddit post title d ∈ X, and assigns to it the subreddit c ∈ C that it came from.

To solve this problem, we implemented two classification algorithms: Naïve Bayes and Rocchio.

3.2a Naïve Bayes Formalized

The goal of the Naïve Bayes classification function is to output the class c (from the set of possible classes C) which has the highest probability of occurring, given input document d (from the set of input documents D). Mathematically, it chooses the class c that maximizes P(c|d).[4] By Bayes’ rule, this is equivalent to choosing the class c that maximizes P(d|c)P(c) / P(d), and since the denominator P(d) does not change with respect to the class c, it is the same as maximizing the numerator, P(d|c)P(c).[4] Under the simple “bag of words” assumption, we can view each document d as the set of words w1, w2, …, wn that it contains. And, under the conditional independence assumption (which is actually not true), we assume that the probability of each word wi (from document d) occurring, given class c, is independent of any other word from document d.[4] Thus, we can calculate P(d|c) as the product of the probabilities P(w1|c) * P(w2|c) * … * P(wn|c), where w1, w2, …, wn are the set of words that are within d.

In other words, we are finding the class

    c_map = argmax_{c ∈ C} P(c) * ∏_{i=1..n} P(wi|c)

[3]. P(c) is the “prior probability” of a class, which can be calculated by dividing the number of documents within class c by the total number of documents. [3] P(w|c) can be found by taking the total frequency of word w across all documents in class c, and dividing it by the total number of words in all documents from class c. This is:

    P(w|c) = T(c, w) / Σ_{w′ ∈ V} T(c, w′)

where T(c, w) is the total frequency of word w in all documents of class c, and V is the vocabulary of all words.

However, this expression results in a probability of zero assigned to class c if any word in document d does not occur in any document within the class c. To account for this, we use Laplace smoothing to add one to each count, resulting in the following formula.[3]

    P(w|c) = (T(c, w) + 1) / Σ_{w′ ∈ V} (T(c, w′) + 1) = (T(c, w) + 1) / (Σ_{w′ ∈ V} T(c, w′) + |V|)

Finally, since most of the probabilities multiplied in the expression ∏ P(wi|c) are small, it is likely that a floating point underflow error could occur.[3] Thus, we will instead add the logarithms of each probability instead of multiplying the actual probabilities. The answer will still be the same because the logarithm function is monotonic [3] (meaning the higher probabilities will still have higher logarithmic values).
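To make the computation concrete, consider an illustrative example with made-up numbers, not drawn from our data. Suppose class c contains 30 of 100 training documents, the word “null” occurs T(c, null) = 9 times across the documents of c, those documents contain 200 words in total, and |V| = 300. Then P(c) = 30/100 = 0.3, the smoothed estimate is P(null|c) = (9 + 1) / (200 + 300) = 0.02, and a document containing “null” contributes log(0.3) + log(0.02) ≈ (−1.20) + (−3.91) = −5.11 to the accumulated log score of c (using natural logarithms).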

3.2b Accuracy Optimizations

In addition, we chose to alter the “bag of words” representation of each document so that each document contained only the most representative forms of its words. First, we split each input Reddit title document on whitespace to obtain a list of all the words that it contained. We then converted each word from uppercase to lowercase so that the classifier would not distinguish between words like “hi” and “Hi”. In addition, we removed any punctuation, including apostrophes, commas, and quotes. We also removed any “stopwords”,


words that are commonly used in documents (like “a”, “the”, “and”, “an”, “but”, “is”, “has”, “was”, etc.) that do not add to the content of the document.² Finally, we applied the open-source Porter Stemmer³ to reduce each word to its most basic root. For example, the stemmer would convert the word “runs” to “run”, and “learning” to “learn”. Thus, words that have similar meanings but appear in different tenses or in singular and plural forms would be stemmed to the same word. This is ideal, as we would like similar ideas to be represented through the pattern of higher frequencies of one root word, and not scattered across the frequencies of several forms of the same word. In this way, the “words” that we use to represent each document are as concise and representative of its content as possible. Here is an example of this process (a code sketch of the full pipeline follows the example):

Title of post: An email went out this morning declaring “free cookies in the lounge.” This is what was there when I arrived.

1. Split on whitespace: "An" "email" "went" "out" "this" "morning" "declaring" "“free" "cookies" "in" "the" "lounge.”" "This" "is" "what" "was" "there" "when" "I" "arrived."

2. Convert to lowercase: "an" "email" "went" "out" "this" "morning" "declaring" "“free" "cookies" "in" "the" "lounge.”" "this" "is" "what" "was" "there" "when" "i" "arrived."

3. Remove punctuation: "an" "email" "went" "out" "this" "morning" "declaring" "free" "cookies" "in" "the" "lounge" "this" "is" "what" "was" "there" "when" "i" "arrived"

4. Remove stopwords: "email" "went" "morning" "declaring" "free" "cookies" "lounge" "arrived"

5. Apply stemming: "email" "went" "morn" "declar" "free" "cooki" "loung" "arriv"

² Stopwords obtained from http://www.ranks.nl/stopwords
³ Obtained from http://tartarus.org/martin/PorterStemmer/java.txt
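To make the pipeline concrete, here is a minimal Java sketch of the normalization steps. The stopword set shown is a small sample standing in for the full ranks.nl list, and the stemming step is a pass-through stub standing in for the open-source Porter stemmer cited above:

    import java.util.*;

    public class TitleNormalizer {
        // Small sample of stopwords; the project used the full list from ranks.nl (footnote 2).
        private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
                "a", "an", "and", "the", "but", "is", "has", "was",
                "this", "in", "out", "what", "there", "when", "i"));

        /** Applies steps 1-5: split, lowercase, strip punctuation, drop stopwords, stem. */
        public static List<String> normalize(String title) {
            List<String> result = new ArrayList<>();
            for (String token : title.split("\\s+")) {          // 1. split on whitespace
                token = token.toLowerCase();                    // 2. convert to lowercase
                token = token.replaceAll("[^a-z0-9]", "");      // 3. remove punctuation
                if (token.isEmpty() || STOPWORDS.contains(token)) continue; // 4. remove stopwords
                result.add(stem(token));                        // 5. apply stemming
            }
            return result;
        }

        // Pass-through stub; the real project delegates to the Porter stemmer (footnote 3),
        // which would reduce e.g. "declaring" to "declar" and "cookies" to "cooki".
        private static String stem(String word) {
            return word;
        }
    }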

3.2c Naïve Bayes Classifier Pseudocode

We have represented the Naïve Bayes Classifier as a programmable class with the following internal components: an integer total document count, and a mapping from each category name to a “Category” object, which in turn holds an integer count of the number of documents in the category and a mapping of the frequencies of all words in the category.

Thus, the flow of logic for adding a training document d and its respective category c to the Naïve Bayes Classifier is as follows:

1. Increment the total document count
2. Increment the total number of documents in category c
3. Split the document string on whitespace to obtain an array of d’s words
4. For each word
   a. Convert to lowercase
   b. Remove punctuation
   c. Eliminate stopwords
   d. Apply the Porter stemmer
   e. Increment the associated frequency of the word in category c

Similarly, the logic to classify a test document d into an appropriate class would be as follows (a Java sketch of both procedures follows this list):

1. Split the document string on whitespace to obtain an array of d’s words
2. For each word
   a. Convert to lowercase
   b. Remove punctuation
   c. Eliminate stopwords
   d. Apply the Porter stemmer
3. Create a list L to hold the probabilities of each class
4. For each class c,
   a. Set an accumulator to 0
   b. Calculate the prior probability of class c as (# of documents in c) / (# of documents total)
   c. Add the logarithm of the prior probability to the accumulator
   d. For each word w in document d,
      i. Find the frequency T(c, w) of occurrences of w in c
      ii. Find the total number of words across all documents in c, Σ_{w′ ∈ V} T(c, w′)
      iii. Find the total number of unique words present in the vocabulary V of all documents
      iv. Calculate (T(c, w) + 1) / (Σ_{w′ ∈ V} T(c, w′) + |V|)
      v. Add the logarithm of the previous value to the accumulator
   e. Store the accumulator for this class in list L
5. Find the highest probability in list L, and return the class that this probability is associated with.
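Below is a minimal Java sketch of this classifier. The class and method names are our own, and the tokenizer here is reduced to lowercasing and punctuation stripping for brevity; the actual implementation also removes stopwords and applies stemming, as described in section 3.2b:

    import java.util.*;

    public class NaiveBayesClassifier {
        private int totalDocCount = 0;
        private final Map<String, Integer> docCountPerCategory = new HashMap<>();
        private final Map<String, Map<String, Integer>> wordFreqPerCategory = new HashMap<>();
        private final Set<String> vocabulary = new HashSet<>();

        // Reduced tokenizer: lowercase and strip punctuation (stopwords/stemming omitted here).
        private static List<String> tokenize(String doc) {
            List<String> words = new ArrayList<>();
            for (String w : doc.toLowerCase().split("\\s+")) {
                w = w.replaceAll("[^a-z0-9]", "");
                if (!w.isEmpty()) words.add(w);
            }
            return words;
        }

        public void addTrainingDocument(String doc, String category) {
            totalDocCount++;
            docCountPerCategory.merge(category, 1, Integer::sum);
            Map<String, Integer> freqs =
                    wordFreqPerCategory.computeIfAbsent(category, k -> new HashMap<>());
            for (String w : tokenize(doc)) {
                freqs.merge(w, 1, Integer::sum);
                vocabulary.add(w);
            }
        }

        /** Returns the category with the highest accumulated log probability. */
        public String classify(String doc) {
            List<String> words = tokenize(doc);
            String best = null;
            double bestLogProb = Double.NEGATIVE_INFINITY;
            for (String category : docCountPerCategory.keySet()) {
                Map<String, Integer> freqs = wordFreqPerCategory.get(category);
                int totalWordsInCategory = 0;
                for (int f : freqs.values()) totalWordsInCategory += f;
                // Log prior: log(#docs in category / #docs total).
                double logProb = Math.log((double) docCountPerCategory.get(category) / totalDocCount);
                for (String w : words) {
                    // Laplace-smoothed likelihood, accumulated in log space to avoid underflow.
                    int tf = freqs.getOrDefault(w, 0);
                    logProb += Math.log((tf + 1.0) / (totalWordsInCategory + vocabulary.size()));
                }
                if (logProb > bestLogProb) {
                    bestLogProb = logProb;
                    best = category;
                }
            }
            return best;
        }
    }

In our runs, training consists of calling addTrainingDocument for 90% of each subreddit’s posts; classify is then evaluated on the remaining 10%.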

3.2d Naïve Bayes Classifier Data and Analysis

Our program randomly selects a predefined number of subreddits n, uses 90% of the top 1000 posts from each of the n subreddits as training data, and compares the Naïve Bayes Classifier’s output category for the remaining 10% of posts with the actual subreddit they originated from. The accuracy of the classifier is then calculated as the number of successful classifications divided by the total number of classifications. This process was repeated 100 times for each value of n between 1 and 10, and we recorded the average, minimum, and maximum classification accuracy in the following table.

Naïve Bayes Classifier

Number of Subreddits | Average Accuracy (100 Iterations) | Minimum Accuracy | Maximum Accuracy
 1 | 100%  | 100%  | 100%
 2 | 93.8% | 76.0% | 100%
 3 | 91.6% | 70.8% | 98.7%
 4 | 88.0% | 75.8% | 97.0%
 5 | 86.0% | 74.7% | 94.3%
 6 | 85.3% | 72.4% | 95.6%
 7 | 83.3% | 69.9% | 92.9%
 8 | 81.3% | 65.0% | 91.7%
 9 | 80.9% | 70.8% | 90.2%
10 | 79.6% | 65.8% | 88.7%

As expected, the classifier gives 100% accuracy when training and test data posts only come from one subreddit. However, the accuracy of the classifier decreases by 2-3% for each additional subreddit that the training and test data come from. The reason for this is that certain words may be used frequently in several subreddits. For example, the words “how”, “what”, or “why” are frequently used in the subreddits “AskEngineers”, “AskHistorians”, “AskElectronics”, “AskReddit”, and all the other “ask” subreddits. As we choose a greater number of subreddits to classify to, we are


more likely to pick subreddits that share such words in common, and the classifier is thus more prone to errors, as it cannot easily distinguish which category to classify a post to. However, the range between minimum and maximum accuracy tends to stay between 20% and 30% regardless of subreddit count.

Overall, the accuracy of our Naïve Bayes Classifier was reasonably good; it correctly classified titles of posts to their appropriate subreddits over 80% of the time when given fewer than 10 subreddits.

3.3a Rocchio Formalized

The goal of the Rocchio classification function is to output the class c (from the set of possible classes C) which has the closest centroid vector to the input document d (from the set of input documents D).

The underlying assumption of Rocchio classification is that text documents can be represented as vectors in ℝ^|V|, [3] where ℝ is the set of real numbers and |V| is the cardinality of the set of words in the vocabulary V, i.e. the number of unique words used in all documents known in the training set of our classifier. In other words, the vectors are of the form ⟨x1, x2, …, x|V|⟩, with each xi being a real number that corresponds to a particular word in the vocabulary V.

These real numbers are the “tf-idf” weights associated with each unique word in the vocabulary. Each weight is the product of two factors, tf and idf, and ideally gives a numerical value of the representational power of its word in the document. This value should increase if a particular term occurs more often in the document, or if the term is rare, meaning it occurs only in certain documents.

TF is the term frequency, a measure of the number of times that the word appears in the document. If we simply count the frequency as the number of occurrences of the word, the weight for that component of the vector is directly proportional to the number of occurrences. For example, if a term appears 2 times in a document, its corresponding weight component doubles. If it appears 100 times in a document, the weight component will be 100 times more powerful, although the document may not actually be 100 times more relevant. To account for this, we use “sublinear tf scaling” and take the logarithm of the term frequency instead [3]:

    tf′ = 1 + log(tf)   if tf > 0
    tf′ = 0             otherwise

This weighted term frequency tf′ is then multiplied by the “idf”, or inverse document frequency, which is a measure of the uniqueness of the term t. This can be calculated as the logarithm of the total number of documents N divided by the number of documents that the term appears in, df_t: [3]

    idf_t = log(N / df_t)

If the term appears in fewer documents, the ratio (and thus the weight) increases, and if the term appears in more documents, the ratio decreases.
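As a small sketch, the weight for one vector component can be computed directly from these two definitions (the class and method names are ours):

    public class TfIdf {
        /** Sublinear tf scaling: 1 + log(tf) for tf > 0, else 0. */
        static double sublinearTf(int tf) {
            return tf > 0 ? 1.0 + Math.log(tf) : 0.0;
        }

        /** Inverse document frequency: log(N / df). */
        static double idf(int totalDocs, int docFreq) {
            return Math.log((double) totalDocs / docFreq);
        }

        /** tf-idf weight for one component of a document vector. */
        static double weight(int tf, int totalDocs, int docFreq) {
            return sublinearTf(tf) * idf(totalDocs, docFreq);
        }
    }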

By representing documents as vectors of weights of the words they contain relative to a vocabulary of words, we can measure similarity between documents by calculating the Euclidean distance between vectors, or taking the cosine of the angles between them. The difference between the two measures is that cosine similarity inherently length-normalizes the vectors that it is being calculated on; this means that all document vectors’ weights are divided by a constant factor so that the magnitude of each document vector becomes 1, effectively putting longer and shorter documents on even footing.

For our classification task, we chose not to length-normalize because all documents were of similarly short length. Reddit post titles are


typically 1-2 lines of text since the core contents are described within the post itself.

Ultimately, the Rocchio classifier will, after taking in all training data documents, convert them to vectors and find the centroid vector for each class. The centroid vector μ(c) of class c is calculated component-wise as the average of each component of the class’ constituent document vectors v(d): [3]

    μ(c) = (1 / |D_c|) * Σ_{d ∈ D_c} v(d)

where D_c is the set of training documents belonging to class c. The classifier then converts the test data document into vector form and returns the class with the most “similar” centroid vector.

3.3b Rocchio Classifier Pseudocode

We have represented the Rocchio Classifier as a programmable class with the following internal components: an integer total document count; a mapping from each word in the vocabulary to its document frequency; and a mapping from each category string to a category object, which contains the category’s centroid vector and the set of vector representations of each document within the category.

The flow of logic for adding a training document d and its respective category c to the Rocchio Classifier is as follows:

1. Increment the total document count
2. If category c does not exist, initialize it
3. Split the document string on whitespace to obtain an array of d’s words
4. Declare a mapping M of words to term frequencies for the particular document
5. For each word
   a. Convert to lowercase
   b. Remove punctuation
   c. Eliminate stopwords
   d. Apply the Porter stemmer
   e. Increment the document frequency of the word if appropriate
   f. Increment the term frequency of the word in the mapping M

Once all the training data have been entered, the classifier can then create the vector representation of each training document as well as the centroid vector for each category.

1. For each category c,
   a. For each document d in c,
      i. Create a vector of size |V| (the number of terms in the vocabulary, which is also the cardinality of the key set of the document frequency mapping)
      ii. Calculate the tf-idf weight for each component of the vector
2. For each category c,
   a. Create a new centroid vector of size |V|, called μ(c)
   b. Set each component of μ(c) to the average of the respective components of all document vectors of c

Finally, when given a test document d, we try to find the centroid vector most similar to d’s vector, and return the corresponding class c.


1. Split the test document string on whitespace to obtain an array of d’s words
2. Declare a mapping M of words to term frequencies for the particular document
3. For each word
   a. Convert to lowercase
   b. Remove punctuation
   c. Eliminate stopwords
   d. Apply the Porter stemmer
   e. Increment the term frequency of the word in mapping M
4. Generate the corresponding vector v(d) of document d from mapping M
5. Set minDistance = infinity
6. Set answerCategory = null
7. For each category c,
   a. Calculate the distance between v(d) and the centroid μ(c) of c
   b. If distance < minDistance,
      i. minDistance = distance
      ii. answerCategory = c
8. Return answerCategory
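A condensed Java sketch of this classification step follows, assuming the document vector and the per-category centroids have already been built as described above (all names are illustrative):

    import java.util.*;

    public class RocchioClassifier {
        // Precomputed centroid vector per category; construction omitted for brevity.
        private final Map<String, double[]> centroids = new HashMap<>();

        public void setCentroid(String category, double[] centroid) {
            centroids.put(category, centroid);
        }

        /** Euclidean distance between two equal-length vectors. */
        private static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        /** Returns the category whose centroid is closest to the document vector. */
        public String classify(double[] docVector) {
            double minDistance = Double.POSITIVE_INFINITY;
            String answerCategory = null;
            for (Map.Entry<String, double[]> e : centroids.entrySet()) {
                double d = distance(docVector, e.getValue());
                if (d < minDistance) {
                    minDistance = d;
                    answerCategory = e.getKey();
                }
            }
            return answerCategory;
        }
    }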

3.3c Rocchio Classifier Data and Analysis

Following the same procedure as for the Naïve Bayes classification, we randomly selected n subreddits, used 90% of their top 1000 posts as training data, and classified the remaining 10%. We obtained the following average classification accuracies over 100 iterations for each value of n from 1 to 10.

Overall, the classification accuracies follow the same general pattern as that of the Naïve Bayes classifier. Accuracy decreases with increasing number of subreddits to classify from. However, the Rocchio classifier was less accurate than the Naïve Bayes Classifier by a margin of up to 10%, and had higher variability.

A possible reason for this is the fact that Rocchio classification only accounts for closeness to the centroid vector, and thus assigns test data in a spherical manner. Test data closest to the “sphere of influence” of a particular centroid will be classified to that centroid. However, the data itself may not necessarily be distributed spherically, causing non-ideal assignments.

Rocchio Classifier

Number of Subreddits | Average Accuracy (100 Iterations) | Minimum Accuracy | Maximum Accuracy
 1 | 100%  | 100%  | 100%
 2 | 92.2% | 79.0% | 100%
 3 | 86.4% | 67.3% | 97.3%
 4 | 83.3% | 68.1% | 94.5%
 5 | 81.4% | 66.4% | 94.8%
 6 | 78.7% | 64.9% | 91.9%
 7 | 76.6% | 63.8% | 88.1%
 8 | 75.7% | 61.0% | 88.2%
 9 | 73.5% | 58.6% | 84.3%
10 | 72.0% | 58.5% | 82.3%

3.4a Smartphone Similarity

In the Smartphone Similarity project, our goal was to cluster smartphones based on their attributes. There were two main components to our project: extracting the attributes of thousands of smartphones from the internet into a database file, and clustering the actual attributes.

3.4b Smartphone Data Retrieval

For the first component, we decided to crawl our data from GSMArena⁴, a website which displays various attributes for thousands of phones. Our plan of attack was as follows:

⁴ http://www.gsmarena.com/

1) Manually form a list of the top 10-15 smartphone manufacturers’ URLs on the GSMArena website, called “Manufacturers.txt”

E.g.: http://www.gsmarena.com/samsung-phones-9.php

http://www.gsmarena.com/apple-phones-48.php

2) Visit each of these URLs, and append all GSMArena smartphone URL links associated with the particular manufacturer to a new file, called “Smartphones.txt”

E.g.: Each picture is a hyperlink to a smartphone: http://www.gsmarena.com/samsung_galaxy_s6-6849.php

http://www.gsmarena.com/samsung_galaxy_j1_4g-7034.php

…etc

3) Visit all the URLs in “Smartphones.txt”, and extract all relevant attributes from the HTML body of the particular smartphone’s page

E.g.: for the Samsung Galaxy S6, we would form a vector of attributes as follows:

Samsung Galaxy S6,2015.0,68745.96,138.0,5.1,3686400.0,2.0,4.0,1.5,3.0,16.0,2550.0,

4) Output attributes as lines of comma separated values in a new file, “Smartphones.csv”
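As a sketch, steps 2 and 3 might be implemented with an HTML parser such as jsoup. The use of jsoup here is illustrative, and the CSS selectors are assumptions rather than the exact ones our crawler used:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class GsmArenaCrawler {
        /** Step 2: collect phone page links from a manufacturer listing page. */
        public static void printPhoneLinks(String manufacturerUrl) throws Exception {
            Document page = Jsoup.connect(manufacturerUrl).get();
            // Phone pages on GSMArena end in "-<id>.php"; this selector is an assumption.
            for (Element link : page.select("a[href$=.php]")) {
                System.out.println(link.absUrl("href"));
            }
        }

        /** Step 3: pull one attribute (e.g. weight) out of a phone's spec page. */
        public static String extractSpec(String phoneUrl, String specName) throws Exception {
            Document page = Jsoup.connect(phoneUrl).get();
            // GSMArena renders specs in tables; the data-spec attribute is an assumption.
            Element cell = page.selectFirst("td[data-spec=" + specName + "]");
            return cell == null ? null : cell.text();
        }
    }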

The format of the vector is as follows: name, release year, volume (mm³), weight (g), screen size (in), total resolution, OS type, number of cores, core clock rate (GHz), RAM (GB), primary camera (MP), battery (mAh). The .csv is available online⁵.

⁵ https://github.com/bmzhao/SmartphoneSimilarity/blob/master/Smartphones.csv

Thus, the first component of the program will have visited, retrieved, and stored all relevant data for the smartphones found among the 3000-plus phone URLs crawled from the GSMArena manufacturer pages. However, to use this data, we must perform a clustering analysis on it.

3.4c K Means Clustering Formalized

Similar to the Rocchio Classifier, the K-Means clustering algorithm acts upon its input documents as vectors of real-numbered values. The objective function of the algorithm is to minimize the “residual sum of squares” (RSS) of all vectors from their cluster centroids. [3] The RSS is computed by taking the Euclidean distance between each vector and its cluster’s centroid, squaring it, and summing these values over all vectors.

The algorithm is as follows [3]:

1. Select K random data vectors as the initial cluster centroids

2. Assign all other data vectors to the cluster with the closest centroid

3. Recalculate the centroid for each cluster as the component-wise average values of each data vector in the cluster

4. Re-assign all data vectors to the cluster with the closest centroid

5. Repeat steps 3 and 4 until the RSS converges to a satisfactory minimum

In practice, the stopping condition in step 5 can be a fixed number of iterations of steps 3 and 4, a threshold value that the RSS must reach, or the point at which cluster assignments no longer change. [3] For our project, we used a fixed number of iterations.
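A compact Java sketch of this procedure is given below, assuming the phone vectors have already been loaded into memory. The class and method names are our own, and, as described above, the loop runs for a fixed number of iterations rather than testing RSS convergence:

    import java.util.*;

    public class KMeans {
        /** Runs K-Means for a fixed number of iterations; returns a cluster index per vector. */
        public static int[] cluster(double[][] data, int k, int iterations, Random rng) {
            // 1. Select K random data vectors as the initial centroids.
            List<double[]> shuffled = new ArrayList<>(Arrays.asList(data));
            Collections.shuffle(shuffled, rng);
            double[][] centroids = new double[k][];
            for (int i = 0; i < k; i++) centroids[i] = shuffled.get(i).clone();

            int[] assignment = new int[data.length];
            for (int iter = 0; iter < iterations; iter++) {
                // 2/4. Assign each vector to the cluster with the closest centroid.
                for (int i = 0; i < data.length; i++) {
                    double best = Double.POSITIVE_INFINITY;
                    for (int c = 0; c < k; c++) {
                        double d = squaredDistance(data[i], centroids[c]);
                        if (d < best) { best = d; assignment[i] = c; }
                    }
                }
                // 3. Recalculate each centroid as the component-wise average of its members.
                double[][] sums = new double[k][data[0].length];
                int[] counts = new int[k];
                for (int i = 0; i < data.length; i++) {
                    counts[assignment[i]]++;
                    for (int j = 0; j < data[i].length; j++) sums[assignment[i]][j] += data[i][j];
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] == 0) continue;   // keep the old centroid for empty clusters
                    for (int j = 0; j < sums[c].length; j++) centroids[c][j] = sums[c][j] / counts[c];
                }
            }
            return assignment;
        }

        static double squaredDistance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }
    }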

Thus, when the algorithm terminates, it has assigned each document vector to one of the K clusters. The number K can be predefined, or it can be selected after repeated trials of various K values.

3.4d K Means Clustering Results

When we actually performed the clustering algorithm, the name string for each phone was not considered; all other fields of the vector are real-valued numbers. In addition, the OSType field of the vector uses the following arbitrary mapping: Windows → 1, Android → 2, iOS → 3. The clustering results are available online⁶.

⁶ https://github.com/bmzhao/SmartphoneSimilarity/blob/master/Clustered%20Results%20K25.txt
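As an illustration, a loader that turns each line of “Smartphones.csv” into a feature vector for the clustering step might look like the following sketch. The class and method names are ours, and the parsing is simplified; it assumes the comma-separated layout described in section 3.4b, with the OS type field already mapped to a number:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class PhoneVectorLoader {
        /** Parses each CSV line into (name, numeric feature vector), skipping the name field. */
        public static Map<String, double[]> load(Path csv) throws IOException {
            Map<String, double[]> vectors = new LinkedHashMap<>();
            for (String line : Files.readAllLines(csv)) {
                String[] fields = line.split(",");
                if (fields.length < 12) continue;     // skip malformed or header rows
                double[] v = new double[11];          // 11 numeric attributes after the name
                for (int i = 0; i < 11; i++) {
                    v[i] = Double.parseDouble(fields[i + 1].trim());
                }
                vectors.put(fields[0], v);
            }
            return vectors;
        }
    }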

4 Conclusion

Ultimately, in this quarter, we have successfully applied machine learning algorithms to accurately classify short text documents that appear in real-life social media. Classifying Reddit post titles to their corresponding subreddits is a natural application of the classification problem, since the category a post belongs to is well defined and singular. Other possible text classification problems, such as classifying a news article to a news type, are less transparent, since an article may belong to multiple categories. In the future, we can further adapt the project by classifying newly submitted posts on Reddit itself, instead of downloading a CSV database of previous posts. In addition, we could implement other classification algorithms, such as k-Nearest-Neighbor or Support Vector Machines, and compare their relative effectiveness.

In addition, we have successfully crawled and clustered data regarding over a thousand smartphones. However, we would like to expand upon this further by crawling for several smartphone attributes that were not immediately available from GSMArena, such as price and user ratings. In addition, we plan to compare the effectiveness of flat K-Means clustering with other types of hierarchical clustering, as well as analyze their performance through purity calculations.

References

[1] "Machine Learning." Wikipedia. Wikimedia Foundation. Web. 15 Mar. 2015. <https://en.wikipedia.org/wiki/Machine_learning>.

[2] "Lecture 01 - The Learning Problem." YouTube. Web. 14 Mar. 2015.

[3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. New York: Cambridge UP, 2008. Print.

[4] "6 - 3 - Formalizing the Naive Bayes Classifier - Stanford NLP - Dan Jurafsky & Chris Manning." YouTube. Web. 14 Mar. 2015.