Data munging and analysis
![Page 1: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/1.jpg)
Data Munging and Analysis for Scientific Applications
Raminder Singh
Science Gateways Group
Indiana University, [email protected]
![Page 2: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/2.jpg)
Overview
• Evaluate the Apache Big Data tools
• Understand the execution patterns of analysis applications
• Solutions using Airavata
• Build a gateway solution with HPC and Big Data requirements
![Page 3: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/3.jpg)
Hadoop 2 Ecosystem
Source: http://hortonworks.com/hadoop/yarn/
![Page 4: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/4.jpg)
Motivation to explore
• Heterogeneous data
• Data munging (parsing, scraping, formatting data)
• Visualization or analysis
• Preservation of data
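A minimal sketch of what munging heterogeneous data can look like: two inputs in different formats and units are parsed and formatted into one uniform record list. The field names (`station`, `temp_c`, `id`, `temp_f`) and the sample values are hypothetical and chosen only for illustration.

```python
import csv
import io
import json


def munge(raw_csv: str, raw_json_lines: str) -> list[dict]:
    """Normalize CSV rows and JSON lines into one list of uniform records."""
    records = []
    # Parse CSV rows: strip stray whitespace and convert the reading to float.
    for row in csv.DictReader(io.StringIO(raw_csv)):
        records.append(
            {"station": row["station"].strip(), "temp_c": float(row["temp_c"])}
        )
    # Parse JSON lines that report Fahrenheit and convert them to Celsius.
    for line in raw_json_lines.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        celsius = round((obj["temp_f"] - 32) * 5 / 9, 2)
        records.append({"station": obj["id"], "temp_c": celsius})
    return records


# Hypothetical sample inputs in two different formats.
csv_text = "station,temp_c\n A1 ,21.5\nB2,19.0\n"
json_text = '{"id": "C3", "temp_f": 68.0}\n'
records = munge(csv_text, json_text)
```

After this step every record has the same keys and units, so it can be passed to a visualization or analysis tool without format-specific handling.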
![Page 5: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/5.jpg)
Analysis Applications
• Behavior Tracking - medical
• Situational Awareness - weather
• Time Series Data - patient monitoring, weather data to help farmers
• Resource consumption Monitoring - Smart grid
• Process optimization
![Page 6: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/6.jpg)
What is a Science Gateway?
• Community portal or desktop tools
• Common science theme
• Collaborative environment
![Page 7: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/7.jpg)
The Ultrascan science gateway supports high performance computing analysis of biophysics experiments using XSEDE, Juelich, and campus clusters.
Desktop analysis tools
Launch analysis and monitor through a browser
We help build gateways for labs or facilities.
![Page 8: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/8.jpg)
Airavata
![Page 9: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/9.jpg)
Value of using Airavata
• Enable a collection of resources
• Application-centric, not compute-centric
• Meta-workflow to enable a set of applications
![Page 10: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/10.jpg)
Use-case for Data Analysis
• TextRWeb: Large Scale Text Analytics with R on the web
Collaborator: Hui Zhang, Data Scientist at Indiana University
![Page 11: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/11.jpg)
![Page 12: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/12.jpg)
Goals for R on the web project
• Run large-scale text analysis using parallel R
• Hide computational complexity with user interfaces
• Support interactive text analysis
• Support iterative text mining
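The first goal, running text analysis in parallel, follows a split-count-merge pattern. The sketch below illustrates that pattern in Python rather than parallel R, and it is not the TextRWeb implementation. The function names and the sample corpus are hypothetical. It uses a thread pool for portability; CPU-bound work at scale would more likely use separate processes or cluster workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import re


def count_terms(document: str) -> Counter:
    """Tokenize one document into lowercase words and count them."""
    return Counter(re.findall(r"[a-z']+", document.lower()))


def parallel_term_counts(documents: list[str], workers: int = 4) -> Counter:
    """Count terms per document concurrently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each document is counted independently; results are merged here.
        for partial in pool.map(count_terms, documents):
            total.update(partial)
    return total


# Hypothetical three-document corpus.
corpus = [
    "Data munging and analysis",
    "Text analysis with R",
    "Large scale text analytics",
]
counts = parallel_term_counts(corpus)
```

Because each document is counted on its own, the same structure supports iterative use: a user can change the tokenizer or the corpus and rerun only the affected step.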
![Page 13: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/13.jpg)
TextR Solution Diagram
![Page 14: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/14.jpg)
Future Work
• Integrate TextRWeb with Apache Spark
• Explore SparkR [1]
• Develop Apache Thrift interfaces for TextRWeb server
• Integrate with Apache Airavata for HPC jobs
• Explore workflow DAGs for Text Analysis
• Keep up to date with product offerings such as Stratosphere
[1] https://github.com/amplab-extras/SparkR-pkg
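To make the workflow-DAG item concrete, here is a minimal sketch of a text-analysis pipeline expressed as a dependency graph and ordered with Python's standard-library `graphlib` (Python 3.9+). The step names are hypothetical and do not describe an existing TextRWeb or Airavata workflow.

```python
from graphlib import TopologicalSorter

# Hypothetical text-analysis steps: each key depends on the steps in its set.
dag = {
    "tokenize": {"load_corpus"},
    "remove_stopwords": {"tokenize"},
    "term_matrix": {"remove_stopwords"},
    "topic_model": {"term_matrix"},
    "word_cloud": {"term_matrix"},
    "report": {"topic_model", "word_cloud"},
}

# A valid execution order: every step appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

Steps with no dependency between them, such as `topic_model` and `word_cloud`, could be scheduled concurrently by a workflow engine.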
![Page 15: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/15.jpg)
![Page 16: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/16.jpg)
Conclusion
• Value added for the scientific communities
• Value for Apache Big Data Suite
![Page 17: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/17.jpg)
airavata.apache.org
Subscribe: [email protected]: [email protected]
Subscribe: [email protected]
Thank You!
Q & A
![Page 18: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/18.jpg)
Apache Spark
• In-memory computations
• Machine learning library (MLlib)
• Graph engine (GraphX)
• Streaming analytics engine (Spark Streaming)
• Fast interactive query tool (Shark)
• Use lineage data for fault tolerance
• Tracking the data path
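The lineage idea can be illustrated with a toy model: each dataset remembers its parent and the transformation that produced it, so a lost in-memory result can be recomputed by replaying that path. This is a pure-Python sketch under simplifying assumptions (single machine, no partitions). `ToyDataset` is a hypothetical class, not the Spark API.

```python
class ToyDataset:
    """A toy stand-in for a lineage-tracked dataset (not the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # root data, set only on the first dataset
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the transformation applied to the parent
        self.cache = None     # materialized in-memory result, may be lost

    def map(self, fn):
        return ToyDataset(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from lineage whenever the cached result is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.fn(self.parent.collect())
        return self.cache


base = ToyDataset(source=[1, 2, 3, 4])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = evens_squared.collect()
evens_squared.cache = None           # simulate losing the in-memory result
recovered = evens_squared.collect()  # rebuilt by replaying the lineage
```

No copy of the result was stored elsewhere; only the recorded data path was needed to rebuild it.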
![Page 19: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/19.jpg)
Current Hadoop Integration
![Page 20: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/20.jpg)
Scientific Application Data Types
Observational Data – uncontrolled events happen and we record data about them.
Examples include astronomy, earth observation, geophysics, medicine, commerce, social data, the internet of things.
Experimental Data – we design controlled events for the purpose of recording data about them.
Examples include particle physics, photon sources, neutron sources, bioinformatics, product development.
Simulation Data – we create a model, simulate something, and record the resulting data.
Examples include weather & climate, nuclear & fusion energy, high-energy physics, materials, chemistry, biology, fluid dynamics.
![Page 21: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/21.jpg)
![Page 22: Data munging and analysis](https://reader036.fdocuments.in/reader036/viewer/2022081414/54c69b384a79593b258b4653/html5/thumbnails/22.jpg)
BioVLAB