Big Data Analysis in Hydrogen Station using Spark and Azure ML
![Page 1: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/1.jpg)
High Performance Information Computing Center
Jongwook Woo
CSULA
Hydrogen Gas Power Plant Data Analysis and Prediction Using Spark
Manvi Chandra, [email protected]
Jongwook Woo, PhD, [email protected]
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
![Page 2: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/2.jpg)
Contents
  Myself
  Introduction To Big Data
  Machine Learning
  Spark Cores
  RDD
  Spark SQL, Streaming, ML
  Hydrogen Gas Power Plant Prediction Model
![Page 3: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/3.jpg)
Myself
Name: Manvi Chandra
Experience:
  2012-2014: Programmer Analyst at Cognizant Technology Solutions
  2015-2016 - Present: Master's in Information Systems
    – Exposed to Big Data analytics
    – Pursuing research in Big Data analytics and machine learning
  2007-2011: Bachelor of Technology in Electronics and Communication Engineering
![Page 4: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/4.jpg)
![Page 5: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/5.jpg)
Introduction To Big Data
![Page 6: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/6.jpg)
Data Issues
Large-scale data
  Terabyte (10^12), Petabyte (10^15)
    – Because of the web
    – Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
  Too big
  Un-/semi-structured data
  Too expensive
Need new systems
  Inexpensive
![Page 7: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/7.jpg)
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
  How to store Big Data
    – GFS
    – On inexpensive commodity computers
  How to compute Big Data
    – MapReduce
    – Parallel computing with multiple inexpensive computers
      • Own supercomputers
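The MapReduce model named above can be sketched in plain Python as a word count. This is only an illustration of the map, shuffle, and reduce phases running in one process; the function names and the input splits are hypothetical, and a real cluster would run each phase on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Each document stands in for an input split handled by a separate machine.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input splits
splits = ["big data needs big storage", "big data needs parallel compute"]
counts = word_count(splits)
```

Because the map step treats each split independently and the reduce step treats each key independently, both can be spread over many inexpensive machines.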
![Page 8: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/8.jpg)
What is Hadoop?
Hadoop founder: Doug Cutting
  Chief Architect at Cloudera
![Page 9: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/9.jpg)
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
  Hadoop
    – Inexpensive supercomputer
    – You can build and run your applications
![Page 10: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/10.jpg)
Alternative to Hadoop MapReduce
Limitations of MapReduce
  Hard to program in Java
  Batch processing
    – Not interactive
  Disk storage for intermediate data
    – Performance issue
Spark by UC Berkeley AMPLab
  In-memory storage for intermediate data
  10x to 100x faster than network and disk
![Page 11: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/11.jpg)
![Page 12: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/12.jpg)
Machine Learning
Subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.
Explores pattern recognition during data analysis through computer science and statistics.
Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.
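As a small illustration of an algorithm that iteratively learns from data, the sketch below fits a straight line by batch gradient descent. The data, the learning rate, and the epoch count are made-up values chosen for the example, not anything taken from the hydrogen station study.

```python
def fit_line(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Each pass nudges the parameters toward a lower error
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Synthetic data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

Nothing in the code states that the slope is 2 and the intercept is 1; the loop recovers both from the data alone.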
![Page 13: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/13.jpg)
Machine Learning Studio
Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions on your data.
![Page 14: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/14.jpg)
![Page 15: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/15.jpg)
Spark
In-memory data computing
  Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystem
  HDFS
  HBase, Hive, Sequence files
New programming with faster data sharing
  Good in complex multi-stage applications
    – Iterative graph algorithms, machine learning
  Interactive query
![Page 16: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/16.jpg)
Spark
[Diagram: Spark stack]
  Spark Core: RDDs, Transformations, and Actions
  Spark Streaming (real-time): DStreams, streams of RDDs
  Spark SQL: SchemaRDDs, DataFrames
  MLlib (machine learning): RDD-based matrices
  GraphX (graph): RDD-based matrices
  SparkR: RDD-based matrices
![Page 17: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/17.jpg)
Spark Drivers and Workers
Drivers
  Client
    – with SparkContext
      • Create RDDs
Workers
  Spark Executor
  Run on cluster nodes
    – Production
  Run in local threads
    – Development
![Page 18: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/18.jpg)
![Page 19: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/19.jpg)
RDD
Resilient Distributed Dataset (RDD)
  Distributed collections of objects
    – that can be cached in memory
  RDD, DStream, SchemaRDD, PairRDD
  Immutable
  Lineage
    – History of the objects
    – Automatically and efficiently recompute lost data
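The immutability and lineage ideas above can be sketched with a toy class in plain Python. This is not Spark's implementation: `LineageDataset` and its methods are hypothetical names, and the "loss" of data is simulated by clearing an in-memory cache.

```python
class LineageDataset:
    """Toy stand-in for an RDD: immutable, and it remembers how it was derived."""

    def __init__(self, source=None, parent=None, func=None):
        self._source = tuple(source) if source is not None else None
        self._parent = parent   # lineage: the dataset this one came from
        self._func = func       # lineage: the function applied to the parent
        self._cache = None      # in-memory copy, which may be lost

    def map(self, func):
        # Never mutates self; returns a new dataset that records its lineage.
        return LineageDataset(parent=self, func=func)

    def compute(self):
        if self._cache is None:
            if self._parent is None:
                self._cache = self._source
            else:
                # Rebuild from the parent by replaying the recorded function.
                self._cache = tuple(self._func(x) for x in self._parent.compute())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory data, e.g. after a worker failure."""
        self._cache = None

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append("source" if node._parent is None else node._func.__name__)
            node = node._parent
        return list(reversed(chain))


def double(x):
    return x * 2


def add_one(x):
    return x + 1


base = LineageDataset(source=[1, 2, 3])
derived = base.map(double).map(add_one)
first = derived.compute()
derived.lose_cache()            # the cached result disappears
recomputed = derived.compute()  # and is rebuilt from the lineage
```

Because every dataset is immutable and knows its parent and function, a lost result can be rebuilt without keeping replicated copies of the data.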
![Page 20: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/20.jpg)
RDD Operations
Transformations
  Define new RDDs from the current
    – Lazy: not computed immediately
  map(), filter(), join()
Actions
  Return values
  count(), collect(), take(), saveAsTextFile()
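The difference between lazy transformations and actions can be illustrated without a cluster. The sketch below is a plain-Python imitation, not the Spark API: `LazyCollection` is a hypothetical class whose `map()` and `filter()` only record work, while `collect()`, `count()`, and `take()` trigger it.

```python
import itertools


class LazyCollection:
    """Toy collection whose transformations are deferred until an action runs."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # zero-argument function returning an iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: build a new collection, compute nothing yet
    def map(self, func):
        return LazyCollection(lambda: (func(x) for x in self._make_iter()))

    def filter(self, predicate):
        return LazyCollection(lambda: (x for x in self._make_iter() if predicate(x)))

    # Actions: run the recorded pipeline and return values
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())

    def take(self, n):
        return list(itertools.islice(self._make_iter(), n))


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


numbers = LazyCollection.from_list([1, 2, 3, 4, 5, 6])
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # still zero: only the plan exists
result = even_squares.collect()   # the action triggers the computation
```

Deferring the work lets an engine see the whole chain of transformations before running it, which is what allows it to plan the computation as a unit.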
![Page 21: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/21.jpg)
Programming in Spark
Scala
  Functional programming
    – Fundamental of programming is function
      • Input/Output is function
  No side effects
    – No states
Python
  Legacy, large libraries
Java
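The functional style described for Scala, with functions as inputs and outputs and no side effects, can also be written in Python. The helper names below (`compose`, `scale_by`) are hypothetical and exist only for this illustration.

```python
from functools import reduce


def compose(*funcs):
    """Take functions as input and return a new function applying them left to right."""
    return lambda value: reduce(lambda acc, f: f(acc), funcs, value)


def scale_by(factor):
    """Return a function as output; it keeps no state beyond its argument."""
    return lambda x: x * factor


readings = (1.0, 2.0, 3.0)                   # immutable input
to_percent = compose(scale_by(100), round)   # built from other functions
percentages = tuple(map(to_percent, readings))
```

The input tuple is never modified; each step produces a new value, which is the property that makes such code easy to run in parallel.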
![Page 22: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/22.jpg)
![Page 23: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/23.jpg)
Spark
Spark SQL
  Turning an RDD into a relation
  Querying using SQL
Spark Streaming
  DStream
    – RDD in streaming
    – Windows
      • To select DStream from streaming data
MLlib
  Sparse vector support, decision trees, linear/logistic regression, PCA
  SVD and PCA
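The windowing idea listed under Spark Streaming can be sketched in plain Python by treating a stream as a list of micro-batches. The function name and the batch values are hypothetical, and the sketch only emits full windows, which is a simplification rather than a description of Spark's exact behavior.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Yield the records of the last `window_length` batches every `slide_interval` batches."""
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        # Flatten the selected micro-batches into one list of records
        yield [record for batch in window for record in batch]


# Hypothetical stream: each inner list is one micro-batch of sensor readings
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
windowed_sums = [
    sum(window)
    for window in sliding_windows(batches, window_length=3, slide_interval=2)
]
```

Here each window covers three micro-batches and moves forward two batches at a time, so consecutive windows overlap by one batch.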
![Page 24: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/24.jpg)
Spark
Hydrogen gas power plant Spark model
o Separating the label column.
o Creation of RDD.
o Splitting the data into training and test sets.
o Training the model on the training set using the decision forest regression algorithm.
o Evaluation of the result.
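The steps listed above can be walked through in plain Python on synthetic data. Everything here is an assumption made for illustration: the column names, the generated rows, the 30% test fraction, and above all the model, which is a trivial mean-value baseline standing in for the decision forest regression used in the study.

```python
import math
import random


def separate_label(rows, label_index):
    """Split each row into its feature values and its label."""
    features = [row[:label_index] + row[label_index + 1:] for row in rows]
    labels = [row[label_index] for row in rows]
    return features, labels


def train_test_split(features, labels, test_fraction=0.3, seed=42):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(features, train_idx), pick(labels, train_idx),
            pick(features, test_idx), pick(labels, test_idx))


def train_mean_baseline(train_labels):
    """Stand-in model: always predicts the mean training label."""
    mean_label = sum(train_labels) / len(train_labels)
    return lambda feature_row: mean_label


def rmse(model, features, labels):
    """Root mean squared error of the model on the given rows."""
    squared = [(model(x) - y) ** 2 for x, y in zip(features, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic rows: [ambient_temp, fill_time, vehicle_pressure]; all values hypothetical
rows = [[20.0 + i % 5, 100.0 + 10 * i, 300.0 + 2.0 * i] for i in range(20)]

features, labels = separate_label(rows, label_index=2)
x_train, y_train, x_test, y_test = train_test_split(features, labels)
model = train_mean_baseline(y_train)
test_rmse = rmse(model, x_test, y_test)
```

Evaluating on rows the model never saw during training is what makes the error a fair estimate; a real regression model would be judged against a baseline of this kind.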
![Page 25: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/25.jpg)
Spark
Hydrogen gas power plant Spark model
![Page 26: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/26.jpg)
![Page 27: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/27.jpg)
Hydrogen Gas Power Plant Prediction Model
The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) was formally opened on May 7, 2014.
![Page 28: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/28.jpg)
Hydrogen Gas Power Plant Prediction Model
The station is capable of producing hydrogen onsite from renewable energy sources, using the process known as electrolysis.
Cal State L.A. Hydrogen Research and Fueling Facility became the first station in the nation to sell hydrogen fuel by the kilogram to the public.
![Page 29: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/29.jpg)
Hydrogen Gas Power Plant Prediction Model
Workflow
![Page 30: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/30.jpg)
Hydrogen Gas Power Plant Prediction Model
Model
![Page 31: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/31.jpg)
Hydrogen Gas Power Plant Prediction Model
Results and observations
![Page 32: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/32.jpg)
Hydrogen Gas Power Plant Prediction Model
Results and observations
According to our research, we are able to predict Vehicle Pressure (the pressure of hydrogen gas within the vehicle Hydrogen Storage System) using our model.
The algorithm used is decision forest regression.
Decision forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
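The way a forest combines its trees can be shown with a deliberately small sketch. It assumes one-split trees (stumps) on a single feature, bootstrap sampling, and made-up data; the Azure ML and Spark implementations are far more elaborate, and none of the names below come from them.

```python
import random
from collections import Counter
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every x in the sample is identical: fall back to a constant prediction
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return trees


def predict_regression(trees, x):
    """Regression: the forest outputs the mean of the tree predictions."""
    return mean(tree(x) for tree in trees)


def predict_classification(votes):
    """Classification: the forest outputs the mode of the tree votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical data: low feature values map to 10.0, high values to 20.0
xs = [float(i) for i in range(1, 11)]
ys = [10.0] * 5 + [20.0] * 5
trees = fit_forest(xs, ys)
low_prediction = predict_regression(trees, 2.0)
high_prediction = predict_regression(trees, 9.0)
majority = predict_classification(["full", "partial", "full"])
```

Averaging many trees trained on different samples smooths out the errors of any single tree, which is the motivation for using an ensemble.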
![Page 33: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/33.jpg)
Hydrogen Gas Power Plant Prediction Model
Results and observations
STATE OF CHARGE (SOC):
  – Ratio of the hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density:
    SOC (%) = (hydrogen density in the vehicle storage system / full-fill density) × 100
Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
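One possible route from a predicted pressure to a state of charge is sketched below. It rests on assumptions that are not in the slides: density is estimated with the gas law P·M / (Z·R·T), where the compressibility factor Z is supplied by the caller (Z = 1 is the ideal-gas case, which is a crude approximation for hydrogen at fueling pressures), and the pressure, temperature, Z, and full-fill density used in the example are placeholders.

```python
GAS_CONSTANT = 8.314462618   # J/(mol*K)
H2_MOLAR_MASS = 2.016e-3     # kg/mol


def gas_density(pressure_pa, temperature_k, compressibility=1.0):
    """Density in kg/m^3 from P*M / (Z*R*T); Z = 1.0 is the ideal-gas assumption."""
    return pressure_pa * H2_MOLAR_MASS / (compressibility * GAS_CONSTANT * temperature_k)


def state_of_charge(density, full_fill_density):
    """SOC (%) = density in the storage system / full-fill density * 100."""
    if full_fill_density <= 0:
        raise ValueError("full_fill_density must be positive")
    return density / full_fill_density * 100.0


# All inputs below are hypothetical placeholders, not values from the study
predicted_pressure_pa = 30e6   # model output, 30 MPa
tank_temperature_k = 300.0
full_fill_density = 24.0       # kg/m^3

density = gas_density(predicted_pressure_pa, tank_temperature_k, compressibility=1.2)
soc = state_of_charge(density, full_fill_density)
```

The sketch shows why pressure alone is not enough: the same pressure at a different tank temperature gives a different density, and therefore a different state of charge.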
![Page 34: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/34.jpg)
Questions?
![Page 35: Big Data Analysis in Hydrogen Station using Spark and Azure ML](https://reader036.fdocuments.in/reader036/viewer/2022070512/589bc58b1a28ab082b8b6109/html5/thumbnails/35.jpg)