Transcript of Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT Austin at MLconf ATL - 9/18/15

Page 1:

Machine Learning (ML) and TACC Supercomputers

Page 2:

A little about me

• Data Scientist at Texas Advanced Computing Center (TACC)
• My contact: [email protected]
• TACC - Independent research center at UT Austin
• TACC - One of the largest HIPAA-compliant supercomputing centers
• ~250 faculty, researchers, students, and staff
• We provide support for large-scale computing problems

Page 3:

Some Basic Observations

There are fundamental differences in data access patterns between Data Intensive Computing and High Performance Computing (HPC)

Today, most ML researchers want or need to work with big data, vectorization, code optimization, etc.

Page 4:

Data Intensive Computing

Specialized in dealing effectively with vast quantities of data in distributed environments

Generates high demand for computational resources, e.g., storage capacity, processing power, etc.

Page 5:

Data Intensive Computing & Big Data

Big data plays the key role in the popularity and growth of data-intensive computing:

• Increased volume of data improves the accuracy of existing algorithms and helps create better predictive models
• Increased volume of data also increases the complexity of the analysis

Page 6:

What's the challenge with big data analysis?

Page 7:

Big Data Analysis requires even more computational resources

Storage requirements are roughly triple the raw data size

Algorithms use large numbers of data points and are memory-intensive

Big data analysis also takes much longer:
• A typical hard drive read speed is about 150 MB/sec, so reading 1 TB takes roughly 2 hours
• Analysis can require processing time proportional to the size of the data: at 1 GB/sec, analyzing 1 PB would take about 11.6 days
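A quick back-of-the-envelope check of those estimates:

```python
# Back-of-the-envelope arithmetic behind the slide's estimates
TB = 1e12   # bytes
PB = 1e15   # bytes

print(TB / 150e6 / 3600)   # ~1.9 hours to read 1 TB at 150 MB/sec
print(PB / 1e9 / 86400)    # ~11.6 days to analyze 1 PB at 1 GB/sec
```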

Page 8:

High Performance Computing (HPC)

Hardware with more computational power per compute node

Computation can be distributed across multiple nodes

Provides highly efficient numeric processing in distributed environments

HPC has seen recent growth in shared-memory architectures

Page 9:

Sample TACC Computing Cluster

Page 10:

Combine HPC & Data intensive computing

The intersection of these two domains is mainly driven by the use of machine learning (ML)

ML methodologies help extract knowledge from big data

These hybrid environments:
• take advantage of data locality
• keep data exchanges over the network at a manageable level
• offer high performance through distributed libraries

Page 11:

TACC Ecosystem

Stampede – Traditional cluster HPC system

Stockyard and Corral – 25 petabytes of combined disk storage for all data needs

Ranch – 160 petabytes of tape archive storage

Maverick/Rustler/Rodeo – “Niche” systems with GPU clusters, great for data analytics and visualization

Wrangler – A New Generation of Data-Intensive Supercomputer

Page 12:

TACC Ecosystem Goals

Goal: address the data problem in multiple dimensions
• Supports data at large and small scales
• Supports data reliability
• Supports data security
• Supports multiple data types: structured and unstructured
• Supports sequential access; fast for large files

Goal: support a wide range of applications and interfaces
• Hadoop (and Mahout) & Spark (and MLlib)
• Traditional R, GIS, databases, and other HPC-style workflows

Goal: support the full data lifecycle
• Metadata and collection management support

Page 13:

Why use TACC Supercomputers?

• Need to analyze large datasets quickly
• Need a more on-demand, interactive analysis environment
• Need to work with databases at high transaction rates
• Have a Hadoop or Spark workflow that needs a large HDFS datastore
• Have a dataset that many users will compute with or analyze
• Need a system with data management capabilities
• Have a job that is currently I/O-bound

Page 14:

TACC Success Stories

Page 15:

(image-only slide)

Page 16:

(image-only slide)

Page 17:

Available ML tools/libraries in TACC Supercomputers

Scikit-learn

Caffe

Theano

CUDA/cuDNN

Hadoop

PyHadoop

RHadoop

Mahout

Spark

PySpark

SparkR

MLlib

Page 18:

Two Sample ML workflows in TACC Supercomputers

GPU Powered Deep Learning on MRI images with NVIDIA DIGITS in Maverick Supercomputer

Pubmed Recommender System in Wrangler Supercomputer

Page 19:

Deep Learning on Images

Deep Neural Networks are computationally quite demanding

The input is large even at small image resolutions: a 256 x 256 RGB image implies 196,608 input neurons (256 x 256 x 3)

Many of the involved floating point matrix operations can be addressed by GPUs
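A one-line check of that input-size arithmetic:

```python
import numpy as np

image = np.zeros((256, 256, 3))  # a 256 x 256 RGB image
print(image.size)                # 196608 input values (256 * 256 * 3)
```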

Page 20:

Deep Learning on MRI using TACC Supercomputers

Maverick has large GPU clusters

Three major Deep Learning frameworks can utilize GPUs – Theano, Torch, and Caffe

We use NVIDIA DIGITS (based on Caffe), a web server providing a convenient web interface for training and testing Deep Neural Networks

For classification of MRI images we use a convolutional DNN to learn the features

We use CUDA 7, cuDNN, Caffe, and DIGITS on Maverick to classify our MRI images

Over the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
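The talk does not show code, but a minimal pycaffe sketch of the classification step might look like this, assuming a network trained and exported through DIGITS; the file names and input image are hypothetical placeholders:

```python
# Minimal sketch: classify one MRI slice with a DIGITS-trained Caffe model.
# File names below are hypothetical placeholders.
import caffe

caffe.set_mode_gpu()  # use one of Maverick's GPUs

net = caffe.Classifier(
    'deploy.prototxt',                 # network definition exported by DIGITS
    'snapshot_iter_30000.caffemodel',  # trained weights (placeholder name)
    image_dims=(256, 256),
    raw_scale=255,                     # map [0,1] image data back to [0,255]
    channel_swap=(2, 1, 0))            # RGB -> BGR, Caffe's expected order

image = caffe.io.load_image('mri_slice.png')  # placeholder input
probs = net.predict([image])[0]               # softmax probabilities
print('predicted class:', probs.argmax())
```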

Page 21:

Pubmed Recommender System in Wrangler

Page 22:

What is a Recommendation System?

A Recommender System helps match users with items

It uses implicit or explicit user feedback to drive item suggestions

Our recommendation system: we build a model that recommends Pubmed documents to users, based on each user's search profile

Page 23:

Types of Recommender System

Knowledge-based (i.e., search)
  Pros: deterministic recommendations, assured quality, no cold start
  Cons: knowledge-engineering effort to bootstrap; basically static

Content-based
  Pros: no community required; comparison between items possible
  Cons: content descriptions necessary; cold start for new users

Collaborative
  Pros: no knowledge-engineering effort; serendipity of results
  Cons: requires some form of rating feedback; cold start for new users and new items

Page 24:

Using Vector Space Model (VSM) for Pubmed

Given: a set of Pubmed documents, and N features (unique terms) describing the documents in the set

VSM builds an N-dimensional Vector Space

Each item/document is represented as a point in the Vector Space

Information Retrieval based on a search query – the query is also a point in the Vector Space:
• We apply TF-IDF to the tokenized documents to weight them and convert them to vectors
• We compute cosine similarity between the tokenized documents and the query term
• We select the top 3 documents matching our query
• We weight the query term in the sparse matrix and rank the documents
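As a minimal sketch of this retrieval step with scikit-learn (illustrative toy documents and query, not the actual Pubmed corpus):

```python
# TF-IDF weighting + cosine similarity for top-3 retrieval (toy data)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "gene expression profiling in breast cancer",
    "deep learning for MRI image segmentation",
    "statins and cardiovascular risk reduction",
    "convolutional networks for tumor detection in MRI",
]
query = "MRI tumor detection"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # documents as sparse vectors
query_vector = vectorizer.transform([query])   # query as a point in the same space

scores = cosine_similarity(query_vector, doc_vectors)[0]
for idx in scores.argsort()[::-1][:3]:         # top 3 matching documents
    print(round(scores[idx], 3), docs[idx])
```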

Page 25:

MPI or Hadoop or Spark?

Which is really more suitable for this ML problem on an HPC system?

Page 26:

Message Passing in HPC

Message Passing Interface (MPI) was one of the key factors that supported the initial growth of cluster computing

MPI helped shape what the HPC world has become today

MPI supported a substantial majority of all supercomputing work

Scientists and engineers have relied upon MPI for the past decades

MPI works great for data intensive computing in a GPU cluster

Page 27:

Why MPI is not the best tool for ML

A researcher/developer working with MPI needs to manually decompose the common data structures across processors

Every update of the data structure needs to be recast into a flurry of messages, syncs, and data exchange

Programming at the transport layer is an awkward fit for numerical application developers

This led to the advent of other techniques
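To make this concrete, here is an illustrative mpi4py sketch (my own example, not code from the talk): even computing a global feature mean requires hand-decomposing the data and issuing an explicit Allreduce, and every further update of a shared structure needs the same treatment.

```python
# mpi4py sketch: each rank holds its own data shard, and combining even a
# simple statistic requires an explicit message exchange (Allreduce).
# Run with e.g.: mpirun -np 4 python mean_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_data = np.random.rand(1000, 10)   # this rank's manually assigned shard

local_sum = local_data.sum(axis=0)      # local partial result
global_sum = np.empty_like(local_sum)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)  # explicit data exchange

global_mean = global_sum / (size * local_data.shape[0])
if rank == 0:
    print("global feature means:", global_mean)
```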

Page 28:

Choosing Hadoop over MPI

Hadoop is an open-source implementation of the MapReduce programming model in Java

It has interfaces to other programming languages such as R, Python, etc.

Hadoop includes:
• HDFS: a distributed file system based on the Google File System (GFS)
• YARN: a resource manager that assigns resources to computational tasks
• MapReduce: a library that enables efficient distributed data processing
• Mahout: a scalable machine learning and data mining library
• Hadoop Streaming: a generic API that allows writing Mappers and Reducers in any language (see the sketch below)

Hadoop is a good fit for large single-pass data processing, but has its own limitations
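For illustration (not from the talk), a Hadoop Streaming word-count pair in Python; Streaming pipes each input line to the mapper on stdin and feeds the key-sorted mapper output to the reducer:

```python
# mapper.py -- emit (term, 1) for every token on stdin
import sys

for line in sys.stdin:
    for term in line.strip().split():
        print('%s\t1' % term)
```

```python
# reducer.py -- sum counts per term; Hadoop delivers input sorted by key
import sys

current, count = None, 0
for line in sys.stdin:
    term, value = line.rstrip('\n').split('\t')
    if term == current:
        count += int(value)
    else:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = term, int(value)
if current is not None:
    print('%s\t%d' % (current, count))
```

A pair like this would be submitted with the hadoop-streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py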

Page 29:

Limitations of Hadoop in HPC

Hadoop comes with mandatory MapReduce logging of output to disk after every Map/Reduce stage

In HPC, logging output to disk can be sped up with caching or SSDs

In general, this rendered Hadoop unusable for many ML approaches that require iteration or interactive use

The other real issue with Hadoop is its HDFS file system, which is intimately tied to Hadoop cluster scheduling

The large-scale ML community sought in-memory approaches to avoid this problem

Page 30:

Spark

For large-scale technical computing, one very promising in-memory approach is Spark

Spark is free of MapReduce-style requirements: intermediate results can stay in memory between stages

Spark can run standalone, without a scheduler like YARN

It has interfaces to other programming languages such as R, Python, etc.

Spark supports HDFS through YARN

MLlib: Scalable machine learning and data mining library

Spark streaming: Enables stream processing of live data streams
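An illustrative PySpark fragment (my example, not the talk's): caching keeps an RDD in memory, so iterative passes avoid the per-stage disk writes that hamper Hadoop. The HDFS path is a placeholder.

```python
# Iterative reuse of an in-memory RDD (contrast with Hadoop's
# write-to-disk between every Map/Reduce stage)
from pyspark import SparkContext

sc = SparkContext(appName="IterativeDemo")
data = sc.textFile("hdfs:///path/to/dataset").cache()  # placeholder path

for i in range(10):
    # each pass reads the cached partitions instead of re-reading disk
    matches = data.filter(lambda line: str(i) in line).count()
    print(i, matches)
```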

Page 31:

Our Recommendation Model

We apply collaborative filtering on the weighted/ranked documents

We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) for recommending Pubmed documents:
MatrixFactorizationModel.recommendProducts(user_id, num_products)

We use collaborative filtering in Scikit-learn and Hadoop as baselines:
• the python-recsys library along with Python Scikit-learn: svd.recommend(product_id)
• Mahout's Alternating Least Squares on Hadoop

A comparative study of our model shows improved performance with Spark
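A minimal PySpark MLlib ALS sketch of this step, using toy (user, document, weight) triples rather than the real Pubmed data:

```python
# Train an ALS matrix-factorization model and recommend documents
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="PubmedALS")

ratings = sc.parallelize([          # toy (user, document, weight) triples
    Rating(1, 101, 4.0),
    Rating(1, 102, 2.5),
    Rating(2, 101, 3.5),
    Rating(2, 103, 5.0),
])

model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

for rec in model.recommendProducts(1, 3):   # top 3 documents for user 1
    print(rec.user, rec.product, rec.rating)
```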

Page 32:

Performance Evaluation of Pubmed Recommendation Model

We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib in Wrangler

Recommendation models are evaluated using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE)

The lower the errors, the more accurate the model; the lower the time taken to train/test the model, the better the performance

Algorithm         Dataset                     ML library            Evaluation                     Training time   Test time
Recommendation    Weighted Pubmed documents   Python Scikit-learn   RMSE = 17.96%, MAE = 16.53%    42 secs         19 secs
Recommendation    Weighted Pubmed documents   Hadoop Mahout         RMSE = 16.02%, MAE = 14.98%    38 secs         14 secs
Recommendation    Weighted Pubmed documents   PySpark MLlib         RMSE = 15.88%, MAE = 14.23%    34 secs         11 secs

Page 33:

THANK YOU!

Questions?
