E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector...
Transcript of E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector...
![Page 1: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/1.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Storage and Processing 1
E6893 Big Data Analytics Lecture 3:
Big Data Analytics Algorithms — I
Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
September 20, 2019
![Page 2: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/2.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Storage and Processing 2
Analytics 1 — Recommendation
![Page 3: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/3.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 3
Key Components of Mahout in Hadoop
![Page 4: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/4.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 4
Key Components of Spark MLlib
![Page 5: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/5.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 5
Spark ML Basic Statistics
p Correlation: Calculating the correlation between two series of data is a common
operation in Statistics
■ Pearson’s Correlation
■ Spearman’s Correlation
![Page 6: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/6.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 6
Example of Popular Similarity Measurements
Pearson Correlation Similarity Euclidean Distance Similarity Cosine Measure Similarity Spearman Correlation Similarity Tanimoto Coefficient Similarity (Jaccard coefficient) Log-Likelihood Similarity
![Page 7: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/7.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 7
Pearson Correlation Similarity
Data:
missing data
![Page 8: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/8.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 8
On Pearson Similarity
Three problems with the Pearson Similarity:
1. Not take into account of the number of items in which two users’ preferences overlap. (e.g., 2 overlap items ==> 1, more items may not be better.)
2. If two users overlap on only one item, no correlation can be computed.
3. The correlation is undefined if either series of preference values are identical.
Adding Weighting.WEIGHTED as 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0, or -1.0, depending on how many points are used.
![Page 9: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/9.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 9
Spearman Correlation Similarity Example for ties
Pearson value on the relative ranks
![Page 10: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/10.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 10
Basic Spark Data format
// number of parameters, location of non-zero indices, and non-zero values
Data: 1.0, 0.0, 3.0
// straightforward
// number of parameters, Sequence of non-value values (index, value)
![Page 11: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/11.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 11
Correlation Example in Spark
1.0, 0.0, 0.0, -2.0 4.0, 5.0, 0.0, 3.0 6.0, 7.0, 0.0, 8.0 9.0, 0.0, 0.0, 1.0
![Page 12: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/12.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 12
Euclidean Distance Similarity
Similarity = 1 / ( 1 + d )
![Page 13: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/13.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 13
Cosine Similarity
Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0).
![Page 14: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/14.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 14
Caching User Similarity
Spearman Correlation Similarity is time consuming. Need to use Caching ==> remember s user-user similarity which was previously computed.
![Page 15: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/15.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 15
Tanimoto (Jaccard) Coefficient Similarity
Discard preference values
Tanimoto similarity is the same as Jaccard similarity. But, Tanimoto distance is not the same as Jaccard distance.
![Page 16: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/16.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 16
Log-Likelihood Similarity
Asses how unlikely it is that the overlap between the two users is just due to chance.
![Page 17: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/17.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 17
Performance measurements
Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset. Spearnman: 0.8 Tanimoto: 0.82 Log-Likelihood: 0.73 Euclidean: 0.75 Pearson (weighted): 0.77 Pearson: 0.89
![Page 18: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/18.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 18
Spark ML Basic Statistics
p Hypothesis testing: Hypothesis testing is a powerful tool in statistics to determine whether
a result is statistically significant. Spark ML
currently supports Pearson’s Chi-squared (χ2)
tests for independence.
p ChiSquareTest conducts Pearson’s independence
test for every feature against the label.
![Page 19: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/19.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 19
Chi-Square Tests
http://math.hws.edu/javamath/ryan/ChiSquare.html
![Page 20: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/20.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 20
Chi-Square Tests
http://math.hws.edu/javamath/ryan/ChiSquare.html
![Page 21: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/21.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 21
Chi-Square Tests
http://math.hws.edu/javamath/ryan/ChiSquare.html
We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.
![Page 22: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/22.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 22
Chi-Square Tests in Spark
![Page 23: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/23.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 23
Spark ML Clustering
![Page 24: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/24.jpg)
© 2015 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 24
Example: clustering
Feature Space
![Page 25: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/25.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 25
Clustering
![Page 26: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/26.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 26
Clustering — on feature plane
![Page 27: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/27.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 27
Clustering example
![Page 28: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/28.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 28
Steps on clustering
![Page 29: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/29.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 29
Making initial cluster centers
![Page 30: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/30.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 30
K-mean clustering
![Page 31: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/31.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 31
HelloWorld clustering scenario result
![Page 32: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/32.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 32
Testing difference distance measures
![Page 33: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/33.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 33
Manhattan and Cosine distances
![Page 34: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/34.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 34
Tanimoto distance and weighted distance
![Page 35: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/35.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 35
Results comparison
![Page 36: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/36.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 36
Sample code of K-mean clustering in Spark
![Page 37: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/37.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 37
vectorization example
0: weight 1: color 2: size
![Page 38: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/38.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 38
Canopy clustering to estimate the number of clustersTell what size clusters to look for. The algorithm will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This method prevents all points close to an already existing canopy from being the center of a new canopy.
![Page 39: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/39.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 39
Other clustering algorithms
Hierarchical clustering
![Page 40: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/40.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 40
Different clustering approaches
![Page 41: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/41.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 41
Spark ML Classification and Regression
![Page 42: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/42.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 42
Spark ML Classification and Regression
![Page 43: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/43.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 43
Classification — definition
![Page 44: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/44.jpg)
© 2015 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 44
Classification example: using SVM to recognize a Toyota Camry
ML — Support Vector MachineNon-MLFeature Space Positive SVs
Negative SVs
Rule 1.Symbol has something like bull’s headRule 2.Big black portion in front of car.Rule 3. …..????
![Page 45: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/45.jpg)
© 2015 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 45
Classification example: using SVM to recognize a Toyota Camry
ML — Support Vector Machine
Feature Space
Positive SVs
Negative SVs
PCamry > 0.95
![Page 46: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/46.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 46
When to use Big Data System for classification?
![Page 47: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/47.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 47
The advantage of using Big Data System for classification
![Page 48: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/48.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 48
How does a classification system work?
![Page 49: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/49.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 49
Key terminology for classification
![Page 50: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/50.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 50
Input and Output of a classification model
![Page 51: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/51.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 51
Four types of values for predictor variables
![Page 52: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/52.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 52
Sample data that illustrates all four value types
![Page 53: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/53.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 53
Supervised vs. Unsupervised Learning
![Page 54: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/54.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 54
Work flow in a typical classification project
![Page 55: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/55.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 55
Classification Example 1 — Color-Fill
Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. Target variable is “color-fill” label.
![Page 56: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/56.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 56
Classification Example 2 — Color-Fill (another feature)
![Page 57: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/57.jpg)
© 2016 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 57
Fundamental classification algorithms
Example of fundamental classification algorithms:
• Naive Bayesian • Complementary Naive Bayesian • Stochastic Gradient Descent (SDG) • Random Forest • Support Vector Machines
![Page 58: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/58.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 58
Choose algorithm
![Page 59: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/59.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 59
Stochastic Gradient Descent (SGD)
![Page 60: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/60.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 60
Characteristic of SGD
![Page 61: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/61.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 61
Support Vector Machine (SVM)
nonlinear kernels
maximize boundary distances; remembering “support vectors”
![Page 62: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/62.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 62
Example SVM code in Spark
![Page 63: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/63.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 63
Naive BayesTraining set:
Classifier using Gaussian distribution assumptions:
Test Set:
==> female
![Page 64: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/64.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 64
Random Forest
Random forest uses a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.
![Page 65: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/65.jpg)
!65
Adaboost Example
- Start with a uniform distribution (“weights”) over training examples (The weights tell the weak learning algorithm which examples are important)
- Obtain a weak classifier from the weak learning algorithm, hjt:X→{-1,1}
- Increase the weights on the training examples that were misclassified
- (Repeat)
• Adaboost [Freund and Schapire 1996] ■ Constructing a “strong” learner as
a linear combination of weak learners
![Page 66: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/66.jpg)
!66
Example — User Modeling using Time-Sensitive Adaboost• Obtain simple classifier on each feature, e.g., setting threshold
on parameters, or binary inference on input parameters.
• The system classify whether a new document is interested by a person via Adaptive Boosting (Adaboost):
■ The final classifier is a linear weighted combination of single-feature classifiers.
■ Given the single-feature simple classifiers, assigning weights on the training samples based on whether a sample is correctly or mistakenly classified. <== Boosting.
■ Classifiers are considered sequentially. The selected weights in previous considered classifiers will affect the weights to be selected in the remaining classifiers. !"" Adaptive.
■ According to the summed errors of each simple classifier, assign a weight to it. The final classifier is then the weighted linear combination of these simple classifiers.
• Our new Time-Sensitive Adaboost algorithm: ■ In the AdaBoost algorithm, all samples are regarded
equally important at the beginning of the learning process
■ We propose a time-adaptive AdaBoost algorithm that assigns larger weights to the latest training samples
People select apples according to their shapes, sizes, other people’s interest, etc.
Each attribute is a simple classifier used in Adaboost.
![Page 67: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/67.jpg)
!67
Time-Sensitive Adaboost [Song, Lin, et al. 2005]
![Page 68: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/68.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 68
Evaluate the model
AUC (0 ~ 1): 1 — perfect 0 — perfectly wrong 0.5 — random
confusion matrix
![Page 69: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/69.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Big Data Analytics Algorithms 69
Homework #1 (Due 10/4/2019, 5pm)
See TA’s instruction.
![Page 70: E6893 Big Data Analytics Lecture 3: Big Data …cylin/course/bigdata/EECS...ML — Support Vector Machine Non-ML Feature Space Positive SVs Negative SVs Rule 1.Symbol has something](https://reader030.fdocuments.in/reader030/viewer/2022040122/5f0c260e7e708231d433fa6d/html5/thumbnails/70.jpg)
© 2018 CY Lin, Columbia UniversityE6893 Big Data Analytics – Lecture 3: Storage and Processing 70
Questions?