Big Data Analysis in Hydrogen Station using Spark and Azure ML

Page 1: Big Data Analysis in Hydrogen Station using Spark and Azure ML

High Performance Information Computing Center
Jongwook Woo
CSULA

Hydrogen Gas Power Plant Data Analysis and Prediction Using Spark

Manvi Chandra, [email protected]
Jongwook Woo, PhD, [email protected]

High-Performance Information Computing Center (HiPIC)
California State University Los Angeles

Page 2: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Contents

Myself
Introduction To Big Data
Machine Learning
Spark Cores
RDD
Spark SQL, Streaming, ML
Hydrogen Gas Power Plant Prediction Model

Page 3: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Myself

Name: Manvi Chandra
Experience:
– 2012-2014: Programmer Analyst at Cognizant Technology Solutions
– 2015-2016 (present): Master's in Information Systems
  • Exposed to Big Data analytics
  • Pursuing research in Big Data analytics and machine learning
– 2007-2011: Bachelor of Technology in Electronics and Communication Engineering

Page 5: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Introduction To Big Data

Page 6: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Data Issues

Large-scale data
– Terabytes (10^12), petabytes (10^15)
  • Because of the web
  • Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
– Too big
– Un-/semi-structured data
– Too expensive
Need new systems
– Inexpensive

Page 7: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Two Cores in Big Data

How to store Big Data
How to compute Big Data
Google
– How to store Big Data
  • GFS
  • On inexpensive commodity computers
– How to compute Big Data
  • MapReduce
  • Parallel computing with multiple inexpensive computers
    (its own supercomputers)
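
The MapReduce idea on this slide can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop or Google code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a thread pool stands in for the cluster of inexpensive machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines, workers=2):
    # Each worker thread plays the role of one cluster node running the map step.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_phase, lines)
        pairs = [pair for chunk in mapped for pair in chunk]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data big", "data store"])
print(counts)
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data over the network; here everything stays in one process so the three phases are easy to follow.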

Page 8: Big Data Analysis in Hydrogen Station using Spark and Azure ML

What is Hadoop?

Hadoop founder: Doug Cutting
– Chief Architect at Cloudera

Page 9: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Definition: Big Data

Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– You can build and run your applications

Page 10: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Alternative to Hadoop MapReduce

Limitations of MapReduce
– Hard to program in Java
– Batch processing
  • Not interactive
– Disk storage for intermediate data
  • Performance issue

Spark by UC Berkeley AMPLab
– In-memory storage for intermediate data
– 10~100x faster than network and disk

Page 12: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Machine Learning

Subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.

Explores pattern recognition during data analysis through computer science and statistics.

Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.

Page 13: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Machine Learning Studio

Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions on your data.

Page 15: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Spark

In-memory data computing
– Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystem
– HDFS
– HBase, Hive, sequence files
New programming model with faster data sharing
– Good for complex multi-stage applications
  • Iterative graph algorithms, machine learning
– Interactive queries

Page 16: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Spark

[Figure: the Spark stack. Spark Core (RDDs, transformations, and actions) sits underneath Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib (machine learning; RDD-based matrices), GraphX (graph; RDD-based matrices), and SparkR (RDD-based matrices).]

Page 17: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Spark Drivers and Workers

Drivers
– Client
  • With SparkContext
  • Create RDDs
Workers
– Spark Executor
– Run on cluster nodes
  • Production
– Run in local threads
  • Development

Page 19: Big Data Analysis in Hydrogen Station using Spark and Azure ML

RDD

Resilient Distributed Dataset (RDD)
– Distributed collections of objects
  • That can be cached in memory
– RDD, DStream, SchemaRDD, PairRDD
– Immutable
– Lineage
  • History of the objects
  • Automatically and efficiently recompute lost data

Page 20: Big Data Analysis in Hydrogen Station using Spark and Azure ML

RDD Operations

Transformations
– Define new RDDs from the current one
  • Lazy: not computed immediately
– map(), filter(), join()
Actions
– Return values
– count(), collect(), take(), save()
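
The difference between lazy transformations and actions can be illustrated without a Spark cluster. The sketch below is a toy stand-in written in plain Python, not the Spark API: the class name LazyDataset and its internals are hypothetical. Transformations only record a step in the lineage; an action replays the lineage from the source, which is also how lost data could be recomputed.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument function returning the base data
        self._steps = tuple(steps)   # lineage: the ordered list of transformations

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _compute(self):
        # Replay the lineage from the source each time an action is called.
        data = iter(self._source())
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(islice(self._compute(), n))


calls = []


def square(x):
    calls.append(x)  # record every real invocation to make laziness visible
    return x * x


numbers = LazyDataset(lambda: range(1, 6))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # still 0: no transformation has run yet
result = even_squares.collect()    # the action runs the whole lineage
print(calls_before_action, result)
```

Because each transformation returns a new object and never modifies the old one, `numbers` can still be reused after `even_squares` is defined, which mirrors the immutability of RDDs.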

Page 21: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Programming in Spark

Scala
– Functional programming
  • Functions are the fundamental building block of a program
  • Inputs and outputs can be functions
– No side effects
  • No state
Python
– Legacy, large libraries
Java
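
The functional style described for Scala can also be shown in Python, which the slide lists as another Spark language. This is a small sketch only; compose, add_one, and double are names invented for the example.

```python
from functools import reduce


def compose(*functions):
    """Return a function that applies `functions` from left to right."""
    return lambda value: reduce(lambda acc, fn: fn(acc), functions, value)


def add_one(x):
    return x + 1   # pure: the result depends only on the argument


def double(x):
    return x * 2   # pure: no state is read or modified


# Functions are used as input (arguments to compose) and as output (pipeline).
pipeline = compose(add_one, double)

readings = (1, 2, 3)
transformed = tuple(map(pipeline, readings))  # builds a new tuple; readings is untouched
print(transformed)
```

Code written this way is easy to run in parallel, because no function depends on shared state that another worker might change.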

Page 23: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Spark

Spark SQL
– Turning an RDD into a relation
– Querying using SQL
Spark Streaming
– DStream
  • RDD in streaming
  • Windows: to select a DStream from streaming data
MLlib
– Sparse vector support, decision trees, linear/logistic regression, PCA
– SVD and PCA
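
The windowing idea for streams can be sketched in plain Python. This is an illustration only, not Spark Streaming code: the function windowed and its parameters window_length and slide are hypothetical names, and each inner list stands for one micro-batch (one RDD of a DStream).

```python
from collections import deque


def windowed(batches, window_length, slide=1):
    """Yield the items of the last `window_length` micro-batches, every `slide` batches."""
    recent = deque(maxlen=window_length)  # older batches fall out automatically
    for position, batch in enumerate(batches, start=1):
        recent.append(list(batch))
        if position % slide == 0:
            yield [item for kept in recent for item in kept]


stream = [[1, 2], [3], [4, 5], [6]]  # four micro-batches arriving in order
every_batch = list(windowed(stream, window_length=2))
every_other = list(windowed(stream, window_length=2, slide=2))
print(every_batch)
print(every_other)
```

The window length controls how much recent data each result covers, and the slide controls how often a result is produced.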

Page 24: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Spark

Hydrogen gas power plant Spark model
o Separating the labeled column.
o Creation of RDD.
o Splitting the data into training and test sets.
o Training the dataset using the decision forest regression algorithm.
o Evaluation of the result.
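
The data-preparation and evaluation steps listed above can be sketched in plain Python. This is not the authors' Spark code: the column names fill_time_s and ambient_temp_c and all values are synthetic and chosen only for illustration, the helper names are hypothetical, and a simple mean predictor stands in for the decision forest so that the example stays short.

```python
import math
import random


def separate_label(records, label_key):
    """Split each record into a feature list and the labeled (target) value."""
    features = [[v for k, v in sorted(r.items()) if k != label_key] for r in records]
    labels = [r[label_key] for r in records]
    return features, labels


def train_test_split(features, labels, test_fraction=0.3, seed=42):
    """Shuffle the row indices and hold out a fraction of rows for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [features[i] for i in train_idx], [labels[i] for i in train_idx],
        [features[i] for i in test_idx], [labels[i] for i in test_idx],
    )


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic records, NOT station data; vehicle_pressure is the labeled column.
records = [
    {"fill_time_s": float(t), "ambient_temp_c": 20.0, "vehicle_pressure": 5.0 + 0.5 * t}
    for t in range(10)
]

features, labels = separate_label(records, "vehicle_pressure")
x_train, y_train, x_test, y_test = train_test_split(features, labels)

# Placeholder model: always predict the mean of the training labels.
baseline = sum(y_train) / len(y_train)
error = rmse(y_test, [baseline] * len(y_test))
print(len(x_train), len(x_test), error)
```

A trained regression model would replace the baseline line; the split and the error metric stay the same, which makes it possible to compare models on identical held-out data.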

Page 25: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Spark

Hydrogen gas power plant spark model

Page 27: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) was formally opened on May 7, 2014.

Page 28: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

The station is capable of producing hydrogen onsite from renewable energy sources, using the process known as electrolysis.

Cal State L.A. Hydrogen Research and Fueling Facility became the first station in the nation to sell hydrogen fuel by the kilogram to the public. 

Page 29: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

Workflow

Page 30: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

Model

Page 31: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

Results and observations

Page 32: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

Results and observations

According to our research, we are able to predict Vehicle Pressure (the pressure of hydrogen gas within the vehicle Hydrogen Storage System) using our model.

The algorithm used is decision forest regression. Decision forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
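
The "mean prediction of the individual trees" can be made concrete with a very small decision forest for regression. The sketch below is plain Python written for illustration; it is not the Spark MLlib or Azure ML implementation used in the study, the function names are hypothetical, and the training data is synthetic.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(rows, targets, max_depth=3):
    """Fit a small regression tree; a leaf is a number, a split is a dict."""
    if max_depth == 0 or len(set(targets)) == 1:
        return mean(targets)
    best = None  # (error, feature index, threshold)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring feature values
            left = [t for r, t in zip(rows, targets) if r[f] < thr]
            right = [t for r, t in zip(rows, targets) if r[f] >= thr]
            if not left or not right:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, thr)
    if best is None:  # no usable split: all rows have identical features
        return mean(targets)
    _, f, thr = best
    left = [(r, t) for r, t in zip(rows, targets) if r[f] < thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] >= thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": fit_tree([r for r, _ in left], [t for _, t in left], max_depth - 1),
        "right": fit_tree([r for r, _ in right], [t for _, t in right], max_depth - 1),
    }


def predict_tree(node, row):
    while isinstance(node, dict):
        side = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[side]
    return node


def fit_forest(rows, targets, n_trees=5, max_depth=3, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_tree([rows[i] for i in idx], [targets[i] for i in idx], max_depth))
    return forest


def predict_forest(forest, row):
    """Regression output: the mean of the individual tree predictions."""
    return mean([predict_tree(tree, row) for tree in forest])


# Synthetic step-shaped data, NOT station measurements.
rows = [[float(x)] for x in range(10)]
targets = [10.0 if x < 5 else 20.0 for x in range(10)]
forest = fit_forest(rows, targets, n_trees=5, max_depth=2, seed=1)
low, high = predict_forest(forest, [1.0]), predict_forest(forest, [9.0])
print(low, high)
```

Averaging over trees trained on different bootstrap samples reduces the variance of a single tree, which is the main reason a forest is preferred over one deep tree.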

Page 33: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Hydrogen Gas Power Plant Prediction Model

Results and observations

State of charge (SOC)
– Ratio of hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density:
  SOC (%) = (hydrogen density in the vehicle storage system / full-fill density) × 100

Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
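
The SOC definition above, a density ratio expressed as a percentage, can be turned into a small calculation. This is a sketch under stated assumptions, not the station's code: the function names are hypothetical, the density is estimated from pressure and temperature with the gas law P = Z ρ R T / M, and the compressibility factor z must be supplied by the caller (z = 1.0 is the ideal-gas case, which is inaccurate for hydrogen at fueling pressures). The full-fill density of 40.2 kg/m^3 is an assumed example value for a 70 MPa system; substitute the value from the applicable fueling standard.

```python
R = 8.314462618     # universal gas constant, J/(mol K)
M_H2 = 2.01588e-3   # molar mass of hydrogen (H2), kg/mol


def hydrogen_density(pressure_pa, temperature_k, z=1.0):
    """Gas density in kg/m^3 from P = z * rho * R * T / M.

    z is the compressibility factor; z=1.0 treats hydrogen as an ideal gas.
    """
    if temperature_k <= 0 or z <= 0:
        raise ValueError("temperature_k and z must be positive")
    return pressure_pa * M_H2 / (z * R * temperature_k)


def state_of_charge(density, full_fill_density):
    """SOC in percent: current gas density divided by the full-fill density."""
    if full_fill_density <= 0:
        raise ValueError("full_fill_density must be positive")
    return 100.0 * density / full_fill_density


FULL_FILL_DENSITY = 40.2  # kg/m^3, assumed example value (see note above)

# A predicted vehicle pressure (Pa) and a gas temperature (K), both made up here.
predicted_pressure = 35.0e6
gas_temperature = 300.0
rho = hydrogen_density(predicted_pressure, gas_temperature, z=1.0)
print(state_of_charge(rho, FULL_FILL_DENSITY))
```

This shows why a pressure prediction is useful: together with the gas temperature it gives a density estimate, and the density gives the state of charge.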

Page 34: Big Data Analysis in Hydrogen Station using Spark and Azure ML

Questions?
