Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization
Core Concepts in Data Mining
Georgia Tech – CSE6242 – March 2015
Josh Patterson
Presenter: Josh Patterson
• Email: [email protected]
• Twitter: @jpatanooga
• GitHub: https://github.com/jpatanooga
Past:
• Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
• Grad work in meta-heuristics and ant algorithms
• Tennessee Valley Authority (TVA): Hadoop and the smart grid
• Cloudera: Principal Solution Architect
Today: Patterson Consulting
Topic Index
• Why Vectorization?
• Vector Space Model
• Text Vectorization
• General Vectorization
WHY VECTORIZATION?
“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”
--- Peter Norvig, “Artificial Intelligence: A Modern Approach”
Classic Scenario:
“Classify some tweets for positive vs. negative sentiment”
What Needs to Happen?
• Each tweet must become a structure that can be fed to a learning algorithm
– to represent the knowledge of a “negative” vs. “positive” tweet
• How does that happen?
– We take the raw text and convert it into what is called a “vector”
• Vectors relate to the fundamentals of linear algebra
– “solving sets of linear equations”
Wait. What’s a Vector Again?
• An array of floating-point numbers
• Represents data
– Text
– Audio
– Images
• Example:
– [ 1.0, 0.0, 1.0, 0.5 ]
VECTOR SPACE MODEL
“I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”
--- HAL 9000, “2001: A Space Odyssey”
Vector Space Model
• A common way of vectorizing text
– every possible word is mapped to a specific integer
• If we have a large enough array, every word fits into a unique slot in the array
– the value at that index is the number of times the word occurs
• Most often, our array size is smaller than our corpus vocabulary
– so we need a “vectorization strategy” to account for this
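The word-to-index mapping described above can be sketched in a few lines of plain Python. The two-document corpus here is invented purely for illustration:

```python
# Build a vocabulary (word -> unique integer index) from a tiny corpus,
# then count word occurrences into a fixed-size array per document.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
]

vocab = {}
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab)   # each new word gets the next slot

def vectorize(doc):
    vec = [0.0] * len(vocab)
    for word in doc.split():
        vec[vocab[word]] += 1.0        # value at the index = occurrence count
    return vec

v = vectorize("the cat sat on the mat")
```

For this corpus the vocabulary has six slots, and the first document vectorizes with a 2.0 in the slot for “the” since it occurs twice.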
Text Vectorization Can Include Several Stages
• Sentence segmentation
– can skip straight to tokenization, depending on the use case
• Tokenization
– find individual words
• Lemmatization
– find the base or stem of each word
• Stop-word removal
– “the”, “and”, etc.
• Vectorization
– take the output of the process and make an array of floating-point values
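The stages above can be sketched with the standard library alone. Note the “lemmatization” step here is a deliberately toy suffix-stripping rule, not a real lemmatizer, and the stop-word list is a tiny illustrative sample:

```python
import re

STOP_WORDS = {"the", "and", "a", "is"}  # tiny illustrative list

def preprocess(text):
    # 1. sentence segmentation (naive split on end punctuation)
    sentences = [s for s in re.split(r"[.!?]\s*", text.strip()) if s]
    # 2. tokenization: find individual words
    tokens = [w.lower() for s in sentences for w in re.findall(r"[a-zA-Z]+", s)]
    # 3. toy stemming rule standing in for lemmatization
    stems = [t[:-1] if t.endswith("s") else t for t in tokens]
    # 4. stop-word removal
    return [t for t in stems if t not in STOP_WORDS]

tokens = preprocess("The cats sat. The dog runs!")
```

The output of a pipeline like this is what then gets handed to a vectorization strategy.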
TEXT VECTORIZATION STRATEGIES
“A man who carries a cat by the tail learns something he can learn in no other way.”
--- Mark Twain
Bag of Words
• A group of words or a document is represented as a bag, or “multiset,” of its words
• Bag of words is a list of words and their counts
– the simplest vector model
– but it can end up using a lot of columns due to the number of words involved
• Grammar and word order are ignored
– but we still track how many times each word occurs in the document
• Has been used most frequently in the document classification and information retrieval domains
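Bag of words amounts to counting, which Python’s `collections.Counter` does directly. The example sentence is invented for illustration:

```python
from collections import Counter

# A document as a bag (multiset) of its words: grammar and word
# order are discarded, only per-word counts remain.
doc = "to be or not to be"
bag = Counter(doc.split())
```

Note that “to be or not to be” and “be to be not or to” produce the same bag, which is exactly the information this model throws away.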
Term Frequency–Inverse Document Frequency (TF-IDF)
• Fixes some issues with bag of words
• Leverages the information about how often a word occurs in a document (TF)
– while considering the frequency of the word in the corpus, to control for the fact that some words are more common than others (IDF)
• More accurate than the basic bag-of-words model
– but computationally more expensive
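A minimal TF-IDF sketch, using the common TF = count/length and IDF = log(N / document frequency) formulation (there are several variants; the toy corpus is invented for the example):

```python
import math

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "the", "the"],
]

def tf(term, doc):
    # term frequency within one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # penalize terms that appear in many documents
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tfidf("the", corpus[0], corpus)  # appears in every doc
score_cat = tfidf("cat", corpus[0], corpus)  # appears in one doc
```

“the” occurs in every document, so its IDF (and hence TF-IDF) is zero, while the rarer “cat” gets a positive weight — exactly the corrective the slide describes.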
Kernel Hashing
• Used when we want to vectorize the data in a single pass
– making it a “just in time” vectorizer
• Can be used when we want to vectorize text right before we feed it to our learning algorithm
• We choose a fixed-size vector, typically smaller than the total number of words we could index
– then we use a hash function to create an index into the vector
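The hashing trick can be sketched as follows: a hash of each token, modulo the fixed vector size, picks the slot to increment. The vector size of 8 is arbitrarily small to make collisions visible; real uses pick much larger sizes:

```python
import hashlib

VECTOR_SIZE = 8  # fixed size, typically far smaller than the vocabulary

def hash_vectorize(tokens, size=VECTOR_SIZE):
    # Map each token to a bucket via a hash function. No vocabulary is
    # needed, so this works in a single pass; collisions (two words
    # sharing a slot) are the price of the fixed size.
    vec = [0.0] * size
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % size] += 1.0
    return vec

v = hash_vectorize("the cat sat on the mat".split())
```

Because no word-to-index table is stored, unseen words at prediction time hash to a slot just like training words did.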
GENERAL VECTORIZATION STRATEGIES
“Everybody good? Plenty of slaves for my robot colony?”
--- TARS, Interstellar
Four Major Attribute Types
• Nominal
– e.g., “sunny”, “overcast”, and “rainy”
• Ordinal
– like nominal, but with an order
• Interval
– e.g., “year”, expressed in fixed and equal lengths
• Ratio
– the scheme defines a zero point and then a distance from this fixed zero point
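Two of these attribute types commonly need an explicit encoding before they fit in a vector. A small sketch (the attribute names and value sets are invented for illustration):

```python
# Nominal: no order among values -> one-hot encode into separate slots
outlook_values = ["sunny", "overcast", "rainy"]

def one_hot(value, values):
    return [1.0 if v == value else 0.0 for v in values]

# Ordinal: order matters -> map to ordered integers
size_rank = {"small": 0, "medium": 1, "large": 2}

nominal_vec = one_hot("overcast", outlook_values)
ordinal_val = size_rank["medium"]
```

Interval and ratio attributes are already numeric, so they can often go into the vector directly (possibly after scaling).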
Techniques of Feature Engineering
• Taking the values directly from the attribute, unchanged
– if the value is something we can use out of the box
• Feature scaling
– standardizing or normalizing an attribute
• Binarization of features
– 0 or 1
• Dimensionality reduction
– use only the most interesting features
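The scaling and binarization techniques above can be sketched by hand on a toy column of values (the numbers and the binarization threshold are invented for the example):

```python
values = [2.0, 4.0, 6.0, 8.0]

# Standardization: shift to zero mean, scale to unit variance
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)
std = var ** 0.5
standardized = [(v - mean) / std for v in values]

# Min-max normalization: rescale into [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Binarization: collapse each value to 0 or 1 around a threshold
threshold = 5.0
binarized = [1.0 if v > threshold else 0.0 for v in values]
```

Which transform to use depends on the learning algorithm: distance-based methods tend to care about scaling, while binarization deliberately discards magnitude.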
Canova
• Command-line based
– we don’t want to write custom code for every dataset
• Examples of usage
– convert the MNIST dataset from raw binary files to the svmLight text format
– convert raw text into TF-IDF-based vectors in a text vector format (svmLight, ARFF)
• Scales out on multiple runtimes
– local, Hadoop
• Open source, ASF 2.0 licensed
– https://github.com/deeplearning4j/Canova