final_ppt

T.Karthikeyan

Transcript of final_ppt

Page 1: final_ppt

T.Karthikeyan

Page 2: final_ppt

What in the World is Apache Mahout?

Math: Vectors/Matrices/SVD

Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic

Utilities: Lucene/Vectorizer

Collections (primitives)

Apache Hadoop

Applications

Examples

Page 3: final_ppt

Mahout Clustering

Algorithms:
• K-Means
• Fuzzy K-Means
• Mean shift
• Canopy
• Dirichlet
• Spectral clustering based on eigenvalues
• MinHash clustering
• LDA-based clustering

Notion of similarity (distance measure):
• Euclidean
• Cosine
• Tanimoto
• Manhattan
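The four distance measures named above can be sketched in a few lines. This is a minimal illustration of the textbook definitions on dense vectors, not Mahout's own implementation; the function names are made up for this example.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 minus the Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot).
    # Assumes the vectors are not both all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

if __name__ == "__main__":
    p, q = [1.0, 0.0], [0.0, 1.0]
    print(euclidean(p, q), manhattan(p, q),
          cosine_distance(p, q), tanimoto_distance(p, q))
```

Cosine and Tanimoto ignore vector length to different degrees, which is why they are the usual choice for text vectors, where document length should not dominate similarity.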

Page 4: final_ppt

Clustering our own data

Dataset

Hadoop Sequence File format

./mahout seqdirectory <options>

Sparse vector Format

./mahout seq2sparse <options>

Clustering Driver class

./mahout <kmeans/…> <options>

Dump cluster output

./mahout clusterdump <options>
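To make the "sparse vector" step concrete, here is a small sketch of the idea: each document becomes a map from dictionary index to weight, and only the terms that occur are stored. It assumes whitespace tokenization and raw term counts as weights; Mahout's seq2sparse uses a Lucene analyzer and can produce TF-IDF weights, so treat the helper names and the weighting here as illustrative only.

```python
from collections import Counter

def build_dictionary(docs):
    # Assign each distinct term a stable integer index (sorted for determinism).
    terms = sorted({term for doc in docs for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}

def to_sparse_vector(doc, dictionary):
    # Keep only non-zero entries: {term index: term count}.
    counts = Counter(doc.lower().split())
    return {dictionary[t]: float(n) for t, n in counts.items() if t in dictionary}

if __name__ == "__main__":
    corpus = ["oil prices rise", "oil exports fall"]
    dictionary = build_dictionary(corpus)
    for text in corpus:
        print(to_sparse_vector(text, dictionary))
```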

Page 5: final_ppt

Clustering Examples

Using Reuters Dataset (SGML file):

$ bin/mahout seqdirectory -i reuters-ip -o reuters-seqdir \
    -c UTF-8 -chunk 1

$ bin/mahout seq2sparse -i reuters-seqdir -o reuters-sparse

$ bin/mahout kmeans -i reuters-sparse/tfidf-vectors/ -c reuters-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20 -ow

$ bin/mahout clusterdump -d reuters-sparse/dictionary.file-0 \
    -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10

Page 6: final_ppt

Mahout Classification

Algorithms Implemented:
• Naïve Bayes
• Complementary Naïve Bayes
• Random Forest
• Logistic Regression (sequential algorithm)
• Hidden Markov Models

Upcoming Algorithms:
• Support vector machines
• Classification based on perceptron and Winnow

Page 7: final_ppt

Bayes, CBayes Classifier

Preprocessing Raw data into classifiable data

Page 8: final_ppt

Bayes, CBayes Classifier Example

Using Newsgroup Dataset:

$ ./mahout prepare20newsgroups -p 20news-bydate-train -o 20news-train \
    -a org.apache.lucene.analysis.standard.StandardAnalyzer \
    -c UTF-8

$ ./mahout trainclassifier -i 20news-train -o 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs

$ ./mahout testclassifier -d 20news-test -m 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs

Output: Confusion matrix

Page 9: final_ppt

Logistic Regression

"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"

0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1

./mahout trainlogistic --input input.csv --output ./model \
    --target color --categories 2

./mahout runlogistic --input test.csv --model ./model \
    --auc --confusion

AUC = 0.97

CONFUSION MATRIX (O/P)

       A      B
A  [[24.0,  2.0],
B   [ 3.0, 11.0]]
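As a worked reading of the confusion matrix above, the sketch below computes overall accuracy (the diagonal sum divided by the total) for the values on the slide. The slide does not say whether rows are actual or predicted classes, so the per-row figure is only described as "fraction of each row on the diagonal"; it equals per-class recall only under the assumption that rows are the actual classes.

```python
def accuracy(matrix):
    # Correct predictions sit on the diagonal, whatever the row/column convention.
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def diagonal_fraction_per_row(matrix):
    # Share of each row that falls on the diagonal.
    # This is per-class recall only if rows are the actual classes (an assumption).
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

if __name__ == "__main__":
    confusion = [[24.0, 2.0], [3.0, 11.0]]  # values from the slide
    print(accuracy(confusion))               # 35 correct out of 40 -> 0.875
    print(diagonal_fraction_per_row(confusion))
```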

Page 10: final_ppt

Random Forest

Input: arff or csv

Generate a file descriptor for the dataset:

$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/core/target/mahout-core-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.tools.Describe -p KDDTrain.arff -f Train.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Run the example:

$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.BuildForest <options>

Using the Decision Forest to classify new data:

$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.TestForest -i Test.arff -ds Train.info <options>

Output: Confusion matrix

Page 11: final_ppt

Dimension Reduction

Algorithms Implemented:
• Singular Value Decomposition
• Stochastic Singular Value Decomposition

Upcoming Algorithms:
• Principal Components Analysis
• Independent Component Analysis
• Gaussian Discriminative Analysis

Input: Real-valued matrix

0.12 0.8 0.123
0.89 2.33 1.445
4.12 2.123 3.12

./mahout <svd/ssvd> <options>

Output: Eigenvectors
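To show what "eigenvectors" means for a real-valued matrix like the 3x3 example, the sketch below finds the leading singular value and its singular vectors by power iteration on A^T A. Power iteration is a simpler technique chosen here for illustration; it is not what Mahout's svd or ssvd jobs run, and all function names are made up for this example.

```python
import math

def mat_vec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_singular_triplet(a, iterations=200):
    # Power iteration on A^T A converges to the dominant right singular vector v,
    # provided the start vector is not orthogonal to it.
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        w = mat_vec(at, mat_vec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = mat_vec(a, v)
    sigma = math.sqrt(sum(x * x for x in av))  # leading singular value
    u = [x / sigma for x in av]                # matching left singular vector
    return u, sigma, v

if __name__ == "__main__":
    matrix = [[0.12, 0.8, 0.123],
              [0.89, 2.33, 1.445],
              [4.12, 2.123, 3.12]]  # values from the slide
    u, sigma, v = top_singular_triplet(matrix)
    print(sigma, v)
```

The vector v is an eigenvector of A^T A with eigenvalue sigma squared, which is the relationship dimension reduction relies on when it keeps only the strongest directions.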

Page 12: final_ppt

Frequent Pattern mining

Algorithm: Parallel FP growth Algorithm

Input : dat or csv

Running Parallel FPGrowth:

$ ./mahout fpg -i retail.dat -o patterns -k 50 -method mapreduce -regex '[\ ]' -s 2

Viewing the results:

$ ./mahout seqdumper -s patterns/part-?-00000 -n 4
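The idea of a minimum support threshold (the -s 2 option above) can be illustrated with a brute-force count of item combinations. This is deliberately not FP-Growth, which avoids enumerating candidates by building a prefix tree; the sketch only shows what "frequent" means. It assumes one transaction per line with space-separated items, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def parse_transactions(text):
    # One transaction per non-empty line, items separated by whitespace.
    return [line.split() for line in text.splitlines() if line.strip()]

def frequent_itemsets(transactions, min_support, max_size=2):
    # Count every itemset up to max_size and keep those seen in at least
    # min_support transactions. Exponential in max_size: fine for a demo only.
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

if __name__ == "__main__":
    sample = "1 2 3\n1 2\n2 3\n4\n"  # made-up data, not from retail.dat
    for itemset, support in sorted(frequent_itemsets(parse_transactions(sample), 2).items()):
        print(itemset, support)
```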

Page 13: final_ppt

Recommenders / Collaborative Filtering

Algorithms:
• Non-distributed recommenders ("Taste")
• Distributed Item-Based Collaborative Filtering
• Collaborative Filtering using a parallel matrix factorization

Input is a text file: user, item, preference
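The user, item, preference input can be pictured as one comma-separated triple per line. The sketch below loads such text into a nested mapping of user to item to preference; the sample values are made up, IDs are kept as strings, and the loader is an illustration rather than Mahout's own file data model.

```python
import csv
import io
from collections import defaultdict

def load_preferences(text):
    # Parse "user,item,preference" lines into {user: {item: preference}}.
    prefs = defaultdict(dict)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        user, item, pref = (field.strip() for field in row[:3])
        prefs[user][item] = float(pref)
    return dict(prefs)

if __name__ == "__main__":
    sample = "1,101,5.0\n1,102,3.0\n2,101,2.0\n"  # hypothetical ratings
    print(load_preferences(sample))
```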


Page 14: final_ppt

Collaborative Filtering using a parallel matrix factorization

• To run distributed ALS-WR to factorize the rating matrix defined by the training set:

$MAHOUT parallelALS --input TrainingSet --output out \
    --tempDir tmp --numFeatures 20 --numIterations 10 --lambda 0.065

• Compute predictions against the probe set and measure the error:

$MAHOUT evaluateFactorization --input TrainingSet --output op \
    --tempDir tmp1

• Compute recommendations:

$MAHOUT recommendfactorized --input userRatings --output recommendations \
    --numRecommendations 6 --maxRating 5

Input: Rating matrix or CSV

Page 15: final_ppt

SUMMARY

ALGORITHMS                                             INPUT
All clustering algorithms, Bayes, CBayes classifier    Sparse vector
Logistic regression, Random forest, FP Growth          CSV
Taste, Collaborative Filtering                         User, Item, Preference
SVD, SSVD                                              Matrix

Page 16: final_ppt