Redpoll: A machine learning library based on Hadoop




Jeremy Chow
CS Dept., Jinan University, Guangzhou


Introduction

- What is Redpoll?
- Who will use Redpoll?
- Motivation
  - Challenges from large-scale datasets
  - More practical when mining textual corpora
  - Close to the needs of Chinese users
  - Apache licensed


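Basic Principles

Assume that we have a set of m data points, each of length n:

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
$$

Decomposition: the matrix is split into its rows, one per data point

$$
x_1 = (x_{11}, x_{12}, \ldots, x_{1n}), \quad
x_2 = (x_{21}, x_{22}, \ldots, x_{2n}), \quad \ldots, \quad
x_m = (x_{m1}, x_{m2}, \ldots, x_{mn})
$$

Mappers

Reducer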
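The decomposition into mappers and a reducer can be illustrated with a small in-memory sketch. This is not Redpoll or Hadoop code: the function names (`decompose`, `mapper`, `reducer`, `column_means`) and the choice of computing column means are assumptions made purely for illustration.

```python
from functools import reduce


def decompose(matrix, num_splits):
    """Split the m rows of the matrix into chunks, one chunk per mapper."""
    return [matrix[i::num_splits] for i in range(num_splits)]


def mapper(rows):
    """Emit a partial (sum vector, row count) for one chunk of rows."""
    total = [0.0] * len(rows[0])
    for row in rows:
        for j, value in enumerate(row):
            total[j] += value
    return total, len(rows)


def reducer(left, right):
    """Combine two partial results into one."""
    sums = [a + b for a, b in zip(left[0], right[0])]
    return sums, left[1] + right[1]


def column_means(matrix, num_splits=2):
    """Mean of each of the n columns, computed from per-chunk partial sums."""
    chunks = [c for c in decompose(matrix, num_splits) if c]
    total, count = reduce(reducer, [mapper(c) for c in chunks])
    return [s / count for s in total]
```

Each mapper only sees its own rows, and the reducer only sees small partial sums, so no single step needs the whole m x n matrix at once.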


Performance Bottlenecks

- Network bandwidth
- I/O speed
- Algorithm implementations
- Hadoop


Current Work

- Vector Writable utils
- Distance Measure utils
- Naive Bayes
- Canopy
- K-means
- An infrastructure for textual data mining (DM)
- An example of mining Sogou news


An example: Canopy

- Clustering of large, high-dimensional datasets
- Two different distance measures
- Two stages
- Computation saving
- Applicable in many domains
  - EM, GAC, K-means
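The first stage of canopy clustering can be sketched as follows. This is a minimal illustration, not Redpoll's implementation: the thresholds `t1` and `t2`, the Manhattan distance used as the inexpensive measure, and the function names are all assumptions. The second stage (for example K-means) would then apply the expensive distance only to points that share a canopy, which is where the computation saving comes from.

```python
def cheap_distance(a, b):
    """Manhattan distance, standing in for an inexpensive approximate measure."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_canopies(points, t1, t2, distance=cheap_distance):
    """Group points into possibly overlapping canopies (requires t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                # Loosely close: the point joins this canopy.
                members.append(point)
            if d >= t2:
                # Not tightly close: the point may still join or start other canopies.
                still_remaining.append(point)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

A point that lies between `t2` and `t1` from a center stays in the pool, so canopies may overlap; a point closer than `t2` is removed and can no longer become a center itself.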


An example: Canopy (cont'd)

- CanopyDriver
- CanopyMapper
  - input <label, vector>
  - output <"canopy", center>
- CanopyReducer
  - output <"canopy", center>
- ClusterDriver & ClusterMapper
  - assign each point to canopies
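The data flow between these classes can be mimicked in plain Python. The sketch below is hypothetical and does not use Hadoop or reproduce Redpoll's classes: the function names, the Euclidean distance, and the thresholds `t1` and `t2` are assumptions, and only the key/value shapes (<label, vector> in, <"canopy", center> out) come from the slide.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def pick_centers(vectors, t2):
    """Keep a vector as a center unless it lies within t2 of an earlier center."""
    centers = []
    for v in vectors:
        if all(euclidean(v, c) >= t2 for c in centers):
            centers.append(v)
    return centers


def canopy_mapper(records, t2):
    """Input <label, vector> pairs of one split; output <"canopy", center> pairs."""
    vectors = [vector for _label, vector in records]
    return [("canopy", c) for c in pick_centers(vectors, t2)]


def canopy_reducer(pairs, t2):
    """Merge centers from all mappers; output the final <"canopy", center> pairs."""
    centers = [center for _key, center in pairs]
    return [("canopy", c) for c in pick_centers(centers, t2)]


def cluster_mapper(records, centers, t1):
    """Assign each labelled point to every canopy whose center is within t1."""
    return [
        (label, [i for i, c in enumerate(centers) if euclidean(vector, c) < t1])
        for label, vector in records
    ]
```

Because every mapper emits the same key, all local centers reach a single reducer, which thins them out again with the same rule. In this sketch a point can end up in no canopy if `t1` is chosen smaller than twice `t2`, so the thresholds need some care.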


What's Next?

- SVM (Support Vector Machine)
  - Fast in training and prediction
  - Optimal hyperplane
  - Kernels
  - Duality
  - Decomposition
  - Parallelized approach


Planned Algorithms

- EM (Expectation Maximization)
- LSI (Latent Semantic Indexing)
- SVD (Singular Value Decomposition)
- PCA (Principal Component Analysis)
- PageRank
- KNN (k Nearest Neighbors)
- Linear Regression
- and so on ...


Welcome to join us!

- Development
- Documentation
- Source code management
- Suggestions
- Anything else that can help us


http://code.google.com/p/redpoll

Check it out!