final_ppt
T.Karthikeyan
What in the world is Apache Mahout?
Math: Vectors/Matrices/SVD
Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic
Utilities: Lucene/Vectorizer
Collections (primitives)
Apache Hadoop
Applications
Examples
Mahout Clustering
Algorithms:
• K-Means
• Fuzzy K-Means
• Mean Shift
• Canopy
• Dirichlet
• Spectral Clustering based on Eigen values
• Minhash clustering
• LDA-based clustering
Notion of similarity: distance measure
• Euclidean
• Cosine
• Tanimoto
• Manhattan
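The four distance measures named above can be sketched in a few lines of plain Python. This is an illustrative sketch using the textbook definitions on dense lists, not Mahout's own implementation; the function names and sample vectors are made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle); ignores vector length, so it suits documents of different sizes.
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    # 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

a = [1.0, 0.0, 2.0]   # hypothetical sample vectors
b = [2.0, 0.0, 4.0]   # same direction as a, twice the length
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b), tanimoto_distance(a, b))
```

Because `b` points in the same direction as `a`, the cosine distance is (numerically) zero while the other three measures still report a gap, which is why the choice of measure changes the clusters.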
Clustering our own data
Dataset
Hadoop Sequence File format
./mahout seqdirectory <options>
Sparse vector format
./mahout seq2sparse <options>
Clustering driver class
./mahout <kmeans/…> <options>
Dump cluster output
./mahout clusterdump <options>
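To make the "sparse vector" step concrete, here is a rough Python sketch of turning raw text into a term dictionary plus sparse TF-IDF vectors. It assumes whitespace tokenization and the textbook weight tf * log(N/df); the real seq2sparse job uses its own analyzer and weighting, so treat the function name and formula as illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (dictionary, vectors); each vector is a sparse {term_id: weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    dictionary = {term: i for i, term in enumerate(vocab)}
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms present in every document get weight 0 and are left out (sparse).
        vectors.append({dictionary[t]: c * math.log(n / df[t])
                        for t, c in tf.items() if df[t] < n})
    return dictionary, vectors

docs = ["oil prices rise", "oil exports fall", "wheat prices fall"]  # made-up sample
dictionary, vectors = tfidf_vectors(docs)
print(dictionary)
print(vectors[0])
```

Only non-zero entries are stored, which is the point of the sparse format: a document touches a tiny fraction of the full dictionary.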
Clustering Examples
Using Reuters Dataset (SGML File):
$ bin/mahout seqdirectory -i reuters-ip -o reuters-seqdir \
    -c UTF-8 -chunk 1
$ bin/mahout seq2sparse -i reuters-seqdir -o reuters-sparse
$ bin/mahout kmeans -i reuters-sparse/tfidf-vectors/ -c reuters-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20 -ow
$ bin/mahout clusterdump -d reuters-sparse/dictionary.file-0 \
    -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10
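What the kmeans step does with its options can be sketched in plain Python. The parameters `k`, `max_iter` and `delta` loosely mirror `-k`, `-x` and `-cd` above; seeding with the first k points is a simplification made for this example, and all vectors are assumed to be non-zero. This is not Mahout's MapReduce implementation.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def kmeans(points, k, max_iter=10, delta=0.1):
    centroids = [list(p) for p in points[:k]]  # simplification: first k points as seeds
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: cosine_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[i])
        shift = max(cosine_distance(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= delta:  # converged: no centroid moved more than delta
            break
    return centroids, clusters

points = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.8]]  # made-up sample
centroids, clusters = kmeans(points, k=2)
print(clusters)
```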
Mahout Classification
Algorithms implemented:
• Naïve Bayes
• Complementary Naïve Bayes
• Random Forest
• Logistic Regression (sequential algorithm)
• Hidden Markov Models
Upcoming algorithms:
• Support Vector Machines
• Classification based on Perceptron and Winnow
Bayes, CBayes Classifier
Preprocessing raw data into classifiable data
Bayes, CBayes Classifier Example
Using Newsgroup Dataset:
$ ./mahout prepare20newsgroups -p 20news-bydate-train -o 20news-train \
    -a org.apache.lucene.analysis.standard.StandardAnalyzer \
    -c UTF-8
$ ./mahout trainclassifier -i 20news-train -o 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs
$ ./mahout testclassifier -d 20news-test -m 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs
Output: Confusion matrix
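The train/test/confusion-matrix flow can be illustrated with a tiny multinomial Naïve Bayes in Python. It is a sketch under simple assumptions (whitespace tokens, add-one smoothing, made-up labels and documents) and does not reproduce Mahout's Bayes or CBayes weighting.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, text). Returns document counts, word counts, vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify(model, text):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for w in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def confusion_matrix(model, test_docs, labels):
    # matrix[actual][predicted] = number of test documents
    matrix = {actual: {pred: 0 for pred in labels} for actual in labels}
    for actual, text in test_docs:
        matrix[actual][classify(model, text)] += 1
    return matrix

model = train([("sci", "space rocket orbit"), ("sci", "rocket launch orbit"),
               ("rec", "hockey game team"), ("rec", "baseball game score")])
matrix = confusion_matrix(model, [("sci", "rocket orbit"), ("rec", "hockey score")],
                          ["sci", "rec"])
print(matrix)
```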
Logistic Regression
"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
./mahout trainlogistic --input input.csv --output ./model \
--target color --categories 2
./mahout runlogistic --input test.csv --model ./model \
--auc --confusion
AUC = 0.97
CONFUSION MATRIX (O/P)
       A      B
A   24.0    2.0
B    3.0   11.0
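As a quick reading aid for the confusion matrix above, the snippet below derives overall accuracy from the four counts. It only assumes that the diagonal cells are the correctly classified rows. Whether the per-row ratios are recall or precision depends on whether rows are actual or predicted classes, which the slide does not state.

```python
matrix = [[24.0, 2.0],
          [3.0, 11.0]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Fraction of each row that lies on the diagonal
# (recall if rows are actual classes, precision if rows are predicted classes).
row_ratios = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]

print(accuracy)    # 0.875
print(row_ratios)
```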
Random Forest
Input: arff or csv
Generate a file descriptor for the dataset:
$ericsson> $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/core/target/mahout-core-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.tools.Describe -p KDDTrain.arff -f Train.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Run the example:
$ericsson> $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.BuildForest <options>
Using the Decision Forest to classify new data:
$HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.TestForest -i Test.arff -ds Train.info <options>
Output: Confusion matrix
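The `-d` descriptor string is compact, so a small sketch of how it can be read may help. The snippet assumes that a number repeats the attribute type that follows it, with N = numerical, C = categorical and L = label; the helper name `expand_descriptor` is made up for this illustration.

```python
def expand_descriptor(descriptor):
    """Expand e.g. '2 N C L' into ['N', 'N', 'C', 'L']."""
    attributes = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)          # multiplier for the next attribute type
        else:
            attributes.extend([token] * repeat)
            repeat = 1
    return attributes

attrs = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(attrs))                                            # 42
print(attrs.count("N"), attrs.count("C"), attrs.count("L"))  # 34 7 1
```

Under that reading the descriptor covers 42 columns: 41 attributes plus the label.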
Dimension reduction
Algorithms implemented:
• Singular Value Decomposition
• Stochastic Singular Value Decomposition
Upcoming algorithms:
• Principal Components Analysis
• Independent Component Analysis
• Gaussian Discriminative Analysis
Input: Real-valued matrix
0.12 0.8 0.123
0.89 2.33 1.445
4.12 2.123 3.12
./mahout <svd/ssvd> <options>
Output: Eigenvectors
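To show what "matrix in, eigenvectors out" means, the sketch below finds the largest singular value and its right singular vector of the 3x3 sample matrix with plain power iteration on AᵀA. Power iteration is a stand-in chosen for brevity; it is not the solver behind the svd or ssvd jobs, and the helper names are made up.

```python
import math

A = [[0.12, 0.8, 0.123],
     [0.89, 2.33, 1.445],
     [4.12, 2.123, 3.12]]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_singular(a, iterations=200):
    """Power iteration on A^T A: returns (largest singular value, right singular vector)."""
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))              # w = A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                 # renormalize to unit length
    sigma = math.sqrt(sum(x * x for x in matvec(a, v)))  # |A v| for unit v
    return sigma, v

sigma, v = top_singular(A)
print(sigma, v)
```

The returned `v` is the dominant eigenvector of AᵀA, with eigenvalue `sigma ** 2`.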
Frequent Pattern Mining
Algorithm: Parallel FP-Growth
Input: dat or csv
Running Parallel FPGrowth:
$ ./mahout fpg -i retail.dat -o patterns -k 50 -method mapreduce -regex '[\ ]' -s 2
Viewing the results:
$ ./mahout seqdumper -s patterns/part-?-00000 -n 4
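The idea behind the run above (split each line on spaces, keep itemsets seen at least `-s` times) can be shown with a brute-force counter in Python. This is deliberately not FP-Growth: it enumerates small combinations directly, which only works for toy data. The transactions and the `max_size` cap are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    counts = Counter()
    for line in transactions:
        items = sorted(set(line.split(" ")))   # one transaction per line, space-separated
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    # Keep only itemsets that occur in at least min_support transactions.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

transactions = ["milk bread", "milk bread eggs", "bread eggs", "milk eggs"]
result = frequent_itemsets(transactions, min_support=2)
for itemset, n in sorted(result.items()):
    print(itemset, n)
```

FP-Growth reaches the same kind of answer without enumerating every combination, by compressing the transactions into a prefix tree first.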
Recommenders / Collaborative Filtering
Algorithms:
• Non-distributed recommenders ("Taste")
• Distributed item-based collaborative filtering
• Collaborative filtering using a parallel matrix factorization
Input is a text file: user,item,preference
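A short Python sketch shows how user,item,preference lines feed an item-based recommender: load the triples, then compare items by the users who rated them. Cosine similarity and the sample IDs are assumptions made for illustration; Taste offers several similarity measures and this code does not use its API.

```python
import math
from collections import defaultdict

def load_preferences(lines):
    prefs = defaultdict(dict)  # item -> {user: preference}
    for line in lines:
        user, item, pref = line.split(",")
        prefs[item.strip()][user.strip()] = float(pref)
    return prefs

def cosine_similarity(a, b):
    common = set(a) & set(b)   # users who rated both items
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def most_similar(prefs, item):
    others = [(cosine_similarity(prefs[item], prefs[o]), o) for o in prefs if o != item]
    return max(others)[1]

lines = ["1,101,5.0", "1,102,5.0", "2,101,4.0", "2,102,4.0", "2,103,1.0", "3,103,5.0"]
prefs = load_preferences(lines)
print(most_similar(prefs, "101"))
```

A user who liked item 101 would be offered the item most similar to it that they have not rated yet.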
Collaborative Filtering using a parallel matrix factorization
• Run distributed ALS-WR to factorize the rating matrix defined by the training set
$MAHOUT parallelALS --input TrainingSet --output out \
    --tempDir tmp --numFeatures 20 --numIterations 10 --lambda 0.065
• Compute predictions against the probe set, measure the error
$MAHOUT evaluateFactorization --input TrainingSet --output op \
    --tempDir tmp1
• Compute recommendations
$MAHOUT recommendfactorized --input userRatings --output recommendations \
    --numRecommendations 6 --maxRating 5
Input: Rating matrix or csv
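The "compute predictions, measure the error" step boils down to a dot product and an RMSE. The sketch below assumes the factorization produced one feature vector per user and per item; the tiny matrices `U` and `M` and the probe ratings are invented for the example.

```python
import math

def predict(user_features, item_features):
    # Predicted rating = dot product of the user and item feature vectors.
    return sum(u * m for u, m in zip(user_features, item_features))

def rmse(ratings, U, M):
    squared = [(r - predict(U[user], M[item])) ** 2 for user, item, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

U = {"u1": [1.0, 2.0]}                       # hypothetical user features
M = {"i1": [2.0, 1.0], "i2": [0.0, 1.0]}     # hypothetical item features
probe = [("u1", "i1", 5.0), ("u1", "i2", 2.0)]
print(predict(U["u1"], M["i1"]), rmse(probe, U, M))
```

Recommendations then come from ranking a user's unrated items by `predict` and keeping the top few.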
SUMMARY
ALGORITHMS | INPUT
All clustering algorithms, Bayes, CBayes classifier | Sparse vector
Logistic regression, Random forest, FP Growth | CSV
Taste, Collaborative Filtering | User, Item, Preference
SVD, SSVD | Matrix