Big-data Analytics Beyond Hadoop (public, 20 June 2013)


Description

Big-data analytics beyond Hadoop – Big data is not equal to Hadoop, especially for iterative algorithms! Many alternatives have emerged; Spark and GraphLab are the most interesting next-generation platforms for analytics.

Transcript of Big-data Analytics Beyond Hadoop (public, 20 June 2013)

Page 1: Big-data Analytics Beyond Hadoop (public, 20 June 2013)

We Implement Big Data

Introduction to Machine Learning on Big-data: Big-data Analytics Beyond Hadoop

Dr. Vijay Srinivas Agneeswaran

Director, Technology and

Head, Big-Data R&D, I-Labs

Page 2

Agenda

• Introduction

• Introduction to Machine Learning (ML)

• Three generations of realizations of machine learning (ML) algorithms

• Hadoop Suitability – iterative and real-time applications

• Third generation ML realizations

• Spark

• Comparison of graph processing paradigms

• Real-time Analytics – our R&D efforts.

• ML algorithm in Storm-Kafka – manufacturing use case

• ML algorithm in Spark – traffic analysis use case

Page 3

• What is it?

• Learn patterns in data

• Improve accuracy by learning

• Examples

• Speech recognition systems

• Recommender systems

• Medical decision aids

• Robot navigation systems

Introduction to Machine Learning

Page 4

• Attributes and their values:

• Outlook: Sunny, Overcast, Rain

• Humidity: High, Normal

• Wind: Strong, Weak

• Temperature: Hot, Mild, Cool

• Target prediction - Play Tennis: Yes, No

Introduction to Machine Learning

This example is from the following source: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf

Page 5

Introduction to Machine Learning

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Page 6

Introduction to Machine Learning: Decision Trees

Outlook
• Sunny -> Humidity
  • High -> No
  • Normal -> Yes
• Overcast -> Yes
• Rain -> Wind
  • Strong -> No
  • Weak -> Yes
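The root split above can be derived numerically: ID3 places the attribute with the highest information gain at the root. A minimal plain-Scala sketch, using the Outlook column and labels from the Play-Tennis table (the function names are our own, not from the slides):

```scala
// (Outlook, Play Tennis) pairs for days D1..D14 of the table.
val outlookAndLabel: Seq[(String, String)] = Seq(
  ("Sunny","No"), ("Sunny","No"), ("Overcast","Yes"), ("Rain","Yes"),
  ("Rain","Yes"), ("Rain","No"), ("Overcast","Yes"), ("Sunny","No"),
  ("Sunny","Yes"), ("Rain","Yes"), ("Sunny","Yes"), ("Overcast","Yes"),
  ("Overcast","Yes"), ("Rain","No"))

// Shannon entropy (in bits) of a sequence of class labels.
def entropy(labels: Seq[String]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values.map { g =>
    val p = g.size / n
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain = entropy before the split - weighted entropy after it.
def infoGain(rows: Seq[(String, String)]): Double = {
  val n = rows.size.toDouble
  val before = entropy(rows.map(_._2))
  val after = rows.groupBy(_._1).values.map { g =>
    (g.size / n) * entropy(g.map(_._2))
  }.sum
  before - after
}

println(f"Gain(S, Outlook) = ${infoGain(outlookAndLabel)}%.3f")  // ≈ 0.247 bits
```

Outlook wins because its Overcast branch is pure (entropy 0), which is exactly why it sits at the root of the tree above.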

Page 7

• Decision trees

• Pros

• Handling of mixed data, robustness to outliers, computational scalability

• Cons

• Low prediction accuracy, high variance, size vs. goodness-of-fit trade-off

• Can we have an ensemble of trees? – random forests

• Final prediction is the mean (regression) or the class with max votes (categorization)

• Does not need tree pruning for generalization

• Greater accuracy across domains.

Decision Trees to Random Forests
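The aggregation step described in the bullets can be sketched in a few lines of plain Scala; this is a hypothetical illustration of only the final-prediction rule (tree training is omitted, and all names are our own):

```scala
// Classification: the forest predicts the class with the most votes
// among the individual trees' predictions.
def majorityVote(preds: Seq[String]): String =
  preds.groupBy(identity).maxBy(_._2.size)._1   // tie-break is arbitrary

// Regression: the forest predicts the mean of the trees' predictions.
def meanPrediction(preds: Seq[Double]): Double =
  preds.sum / preds.size

println(majorityVote(Seq("Yes", "No", "Yes")))  // Yes
println(meanPrediction(Seq(1.0, 2.0, 3.0)))     // 2.0
```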

Page 8

• Machine learning tasks

• Learning associations – market basket analysis

• Supervised learning (Classification/regression) –random forests, support vector machines (SVMs), logistic regression (LR), Naïve Bayes

• Prediction – random forests, SVMs, LR

• Unsupervised learning (clustering) - k-means, sentiment analysis

• Data Mining

• Application of machine learning to large data

• Knowledge Discovery in Databases (KDD)

• Credit scoring, fraud detection, market basket analysis, medical diagnosis, manufacturing optimization

Introduction to Machine Learning
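As a concrete instance of the clustering task listed above, here is a minimal plain-Scala sketch of k-means (Lloyd's algorithm) on 1-D points. The data, the choice of k = 2, and the iteration count are made up for illustration:

```scala
// One recursive pass of Lloyd's algorithm: assign each point to its
// nearest centroid, then move each centroid to the mean of its cluster.
def kmeans(points: Seq[Double], centroids: Seq[Double], iters: Int): Seq[Double] =
  if (iters == 0) centroids
  else {
    // Assignment step: group points by their nearest centroid.
    val clusters = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
    // Update step: each centroid moves to its cluster's mean
    // (or stays put if no point was assigned to it).
    val updated = centroids.map(c =>
      clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, updated, iters - 1)
  }

val pts = Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)
println(kmeans(pts, Seq(0.0, 5.0), 10))  // converges to the two cluster means
```

With these points the algorithm converges in a couple of iterations to centroids 2.0 and 11.0, the means of the two obvious clusters.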

Page 9

• First Generation

• SAS, SPSS, Informatica etc.

• Deep analytics – large collection of serial ML algorithms

• Logistic regression, kernel Support Vector Machines (SVMs), Principal Component Analysis (PCA), Conjugate Gradient (CG) Descent etc.

• Vertical scaling – increase computation/memory power of node.

Three Generations of Realizations – ML Algos.

Page 10

• Second Generation

• Mahout, RapidMiner, Pentaho

• Works over Hadoop Map-Reduce (MR) – can potentially scale to large data sets

• Shallow analytics may only be possible on Big-data

• Only a small number of algorithms available – linear regression, linear SVMs, stochastic gradient descent, collaborative filtering, k-means clustering etc.

Three Generations of Realizations – ML Algos.

Page 11

• Third Generation

• Spark, HaLoop, Twister MR, Apache Hama, Apache Giraph, GraphLab, Pregel, Piccolo

• Deep analytics on Big-data?

• Possible – many more algorithms can be realized in parallel

• Kernel SVMs, Logistic Regression, CG etc.

• Motivated by iterative processing + social networks (power law graphs)

• Realize ML algorithms in real-time – use Kafka-Storm integrated environment or Spark Streaming

Three Generations of Realizations – ML Algos.

Page 12

• Hindrances/hurdles to Hadoop adoption

• Single cluster – Hadoop YARN is the way forward.

• Lack of ODBC connectivity

• Hadoop suitability

• Mainly for embarrassingly parallel problems

• Not suitable if data splits need to communicate or data splits are inter-related.

• Map-Reduce for iterative computations

• Hadoop not currently well suited – no long-lived MR jobs and no in-memory data structures for persisting/caching data across MR iterations.

Introduction: Hadoop Adoption Future

Page 13

Suitability of Map-Reduce for Machine Learning

Origin in functional programming languages (Lisp and ML)

Built for embarrassingly parallel computations

Map: (k1, v1) -> list(k2, v2)

Reduce: (k2, list(v2)) -> list(v2)

Algorithms which can be expressed in the Statistical Query Model in summation form are highly suitable for Hadoop MR [CC06].

Different categorization of ML algorithms

Algorithms which need a single execution of an MR model

Algorithms which need sequential execution of fixed no. of MR models.

Algorithms where 1 iteration is a single execution of MR model.

Algorithms where 1 iteration itself needs multiple MR model executions.

[CC06] Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. Map-reduce for machine learning on multicore. In NIPS (2006), pp. 281--288.
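The summation form of [CC06] can be made concrete with a mean computation: each map task emits partial statistics over its data split, and the reduce step simply adds them. A hypothetical plain-Scala sketch, with Scala collections standing in for the MR framework and the split contents made up:

```scala
// Each inner Seq stands in for the data held by one mapper's input split.
val splits: Seq[Seq[Double]] = Seq(
  Seq(1.0, 2.0, 3.0),   // mapper 1
  Seq(4.0, 5.0),        // mapper 2
  Seq(6.0))             // mapper 3

// Map phase: each mapper emits one (partial sum, partial count) pair.
val partials = splits.map(s => (s.sum, s.size))

// Reduce phase: add the partial sums and counts, then finish the mean.
val (total, n) = partials.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
println(total / n)  // 3.5
```

The key property is that the per-split statistics are small and additive, which is what makes a single MR execution sufficient for this class of algorithms.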

Page 14

What about Iterative Algorithms?

What are iterative algorithms?

Those that need communication among the computing entities

Examples – neural networks, PageRank algorithms, network traffic analysis

Conjugate gradient descent

Commonly used to solve systems of linear equations

[CB09] tried implementing CG on dense matrices

DAXPY – Multiplies vector x by constant a and adds y.

DDOT – Dot product of 2 vectors

MatVec – Multiply matrix by vector, produce a vector.

1 MR job per primitive – 6 MR jobs per CG iteration, hundreds of MR jobs per CG computation, leading to tens of GBs of communication even for small matrices.

Other iterative algorithms – fast Fourier transform, block tridiagonal systems

[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
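The three primitives from [CB09] are each a one-liner in plain Scala, which underlines the point: the arithmetic per CG iteration is tiny, so MR job-setup overhead dominates. A hypothetical sketch (names follow the slide, not any library):

```scala
// DAXPY: multiply vector x by constant a and add y.
def daxpy(a: Double, x: Array[Double], y: Array[Double]): Array[Double] =
  x.zip(y).map { case (xi, yi) => a * xi + yi }

// DDOT: dot product of two vectors.
def ddot(x: Array[Double], y: Array[Double]): Double =
  x.zip(y).map { case (xi, yi) => xi * yi }.sum

// MatVec: multiply matrix by vector, producing a vector.
def matVec(m: Array[Array[Double]], x: Array[Double]): Array[Double] =
  m.map(row => ddot(row, x))

println(daxpy(2.0, Array(1.0, 2.0), Array(3.0, 4.0)).mkString(","))  // 5.0,8.0
println(ddot(Array(1.0, 2.0), Array(3.0, 4.0)))                      // 11.0
```

In the MR realization each of these calls becomes a full map-reduce job with its own setup and HDFS I/O, hence 6 jobs per CG iteration.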

Page 15

Further exploration: Iterative Algorithms

[SN12] explores CG-like iterative algorithms on MR

Compare Hadoop MR with Twister MR (http://iterativemapreduce.org)

On a 16-node cluster, solving a system with 24 unknowns took 220 seconds, while 8,000 unknowns took almost 2 hours.

New MR tasks are launched for each iteration – the computation per iteration is too little, while the overhead of MR task setup and communication is too high.

Data is reloaded from HDFS for each MR iteration.

It is surprising that Hadoop has no support for long-running MR tasks.

Other alternative MR frameworks?

HaLoop [YB10] – extends MR with loop aware task scheduling and loop invariant caching.

Spark [MZ10] – introduces resilient distributed datasets (RDD) – RDD can be cached in memory and reused across iterations.

Beyond MR – Apache Hama (http://hama.apache.org) – BSP paradigm

[SN12] Satish Narayana Srirama, Pelle Jakovits, and Eero Vainikko. 2012. Adapting scientific computing problems to clouds using MapReduce. Future Generation Computer Systems 28, 1 (January 2012), 184-192, Elsevier Publications.

[YB10] Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010.

[MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA.

Page 16

Data processing: Alternatives to Map-Reduce

R language

Good for statistical algorithms

Does not scale well – single threaded, single node execution.

Inherently good for iterative computations – shared array architecture.

Way forward

R-Hadoop integration – or R-Hive integration

R extensions to support distributed execution.

[SV12] is an effort to provide R runtime for scalable execution on cluster.

Revolution Analytics is an interesting startup in this area.

Apache HAMA (http://hama.apache.org) is another alternative

Based on the Bulk Synchronous Parallel (BSP) model – inherently good for iterative algorithms – can do conjugate gradient, non-linear SVMs – hard in Hadoop MR.

[SV12] Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber. 2012. Using R for iterative and incremental processing. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing (HotCloud'12). USENIX Association, Berkeley, CA, USA.

Page 17

Comparison of Graph Processing Paradigms

GraphLab1
• Computation: asynchronous
• Determinism and effects: deterministic – serializable computation
• Efficiency vs. asynchrony: uses inefficient locking protocols
• Power-law graphs: inefficient – locking protocol is unfair to high-degree vertices

Pregel
• Computation: synchronous – BSP-based
• Determinism and effects: deterministic – serial computation
• Efficiency vs. asynchrony: NA
• Power-law graphs: inefficient – curse of slow jobs

Piccolo
• Computation: asynchronous
• Determinism and effects: non-deterministic – non-serializable computation
• Efficiency vs. asynchrony: efficient, but may lead to incorrectness
• Power-law graphs: may be efficient

GraphLab2 (PowerGraph)
• Computation: asynchronous
• Determinism and effects: deterministic – serializable computation
• Efficiency vs. asynchrony: uses parallel locking for efficient, serializable and asynchronous computations
• Power-law graphs: efficient – parallel locking and other optimizations for processing natural graphs
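The synchronous (Pregel/BSP) row can be made concrete with PageRank: in each superstep every vertex sends rank/out-degree along its edges, and all updates apply together at the superstep barrier. A hypothetical plain-Scala sketch – the 3-vertex graph and damping factor 0.85 are illustrative only, not from the slides:

```scala
// Adjacency list: vertex -> out-neighbours. Every vertex has out-edges here.
val edges = Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1))
val d = 0.85                 // damping factor (assumed)
val n = edges.size

// One BSP superstep: message generation, then a simultaneous update.
def superstep(rank: Map[Int, Double]): Map[Int, Double] = {
  // Each vertex v sends rank(v)/outdegree(v) to each of its neighbours.
  val msgs = edges.toSeq.flatMap { case (v, outs) =>
    outs.map(u => (u, rank(v) / outs.size))
  }
  val incoming = msgs.groupBy(_._1).map { case (u, ms) => (u, ms.map(_._2).sum) }
  // Barrier: all vertices update in lockstep from the received messages.
  rank.map { case (v, _) => (v, (1 - d) / n + d * incoming.getOrElse(v, 0.0)) }
}

// Run 30 supersteps from a uniform initial rank.
val ranks = (1 to 30).foldLeft(edges.map { case (v, _) => (v, 1.0 / n) }) {
  (r, _) => superstep(r)
}
println(ranks.toSeq.sortBy(_._1))
```

The global barrier after every superstep is exactly the "curse of slow jobs" in the table: the whole computation waits for the slowest vertex partition.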

Page 18

Spark: Third Generation ML Tool

Two parallel programming abstractions [MZ10]

Resilient distributed data sets (RDDs)

Read-only collection of objects partitioned across a cluster

Can be rebuilt if partition is lost.

Parallel operation on RDDs

Users can pass functions – first-class entities in Scala

foreach, reduce, collect

Programmer can build RDDs from

1. A file in HDFS

2. Parallelizing a Scala collection – divide into slices

3. Transforming an existing RDD – specify operations such as map, filter, flatMap

4. Changing persistence of an RDD – cache, or a save action (saves to HDFS)
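The transformations in the list above deliberately mirror Scala's own collection operators, which is what makes RDDs natural to Scala programmers. A hypothetical sketch using plain Scala collections as a stand-in for RDDs (no cluster needed; on a real cluster these would be sc.textFile, sc.parallelize, rdd.map/filter and rdd.cache, and the sample data here is made up):

```scala
// Stands in for lines read from a file in HDFS (way 1).
val lines = Seq("to be or", "not to be")

// Stands in for parallelizing a Scala collection into slices (way 2).
val numbers = (1 to 10).toSeq

// Transformations building new datasets from existing ones (way 3).
val words  = lines.flatMap(_.split(" "))   // flatMap
val evens  = numbers.filter(_ % 2 == 0)    // filter
val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

println(counts.toSeq.sortBy(_._1))  // word counts from the two "files"
```

The difference on a cluster is persistence (way 4): an RDD is recomputed from its lineage on each use unless it is cached in memory or saved to HDFS.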

Shared variables

Broadcast variables, accumulators

[MZ10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10

Page 19

Some Spark(ling) examples

Scala code (serial)

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle – roughly π/4 of them. Since the ratio of the areas, square to circle, is 4/π, the ratio of points sampled (PS) to points in the circle (PC) approximates 4/π, so π ≈ 4 * (PC/PS).

Page 20

Some Spark(ling) examples

Spark code (parallel)

val spark = new SparkContext(<Mesos master>)

var count = spark.accumulator(0)

for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}

println("Pi is roughly " + 4 * count / 100000.0)

Notable points:

1. Spark context created – talks to Mesos1 master.

2. Count becomes shared variable – accumulator.

3. The for loop iterates over an RDD – the Scala range object (1 to 100000) is broken into 12 slices.

4. The parallelize method returns an RDD; the for loop invokes its foreach method.

1 Mesos is an Apache incubated clustering system – http://mesosproject.org

Page 21

Real-time Analytics: my R&D

Page 22

Architecture of Fast Classification System [Patent Pending: 1143/CHE/2013 with Indian PTO]

Page 23

Thank You!

[email protected]

• LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran

• Blogs: blogs.impetus.com

• Twitter: @a_vijaysrinivas