Co-existence: Can Big Data and Big Computation Co-exist on the Same Systems? Dr. William Kramer National Center for Supercomputing Applications, University of Illinois




Where these views come from

•  Large scale simulation
   •  CFD, MD, Fusion, Materials, Chemistry, Climate, Weather, Seismic, Structures, …
•  Large scale experimental systems
   •  High Energy and Nuclear Physics – LHC (CMS, STAR), SNO, …
   •  Astronomy – SNF/SNAP, CMB, DES, LSST
   •  Genomics – Genomics automation, analysis and workflow
   •  Self Organizing Networks
•  Networks and CyberInfrastructures
   •  NASnet, ESnet, Open Science Grid, XSEDE
•  Cyber Protection and Security
   •  Intrusion Detection, other
•  Resiliency
   •  Large System State and Response
•  Design and Implementation of HSMs
   •  NAStore, HPSS, …
•  National Aerospace System
   •  Free Flight, AATT

2 Big Data and Extreme Scale Computation Workshop - April 30 2013 - Charleston SC


Types of Big Data

•  Computer Processed Semi-Structured Data
   •  Been doing this a long time and reasonably well
•  Structured Observational Data
   •  Been doing this a long time and reasonably well
•  Unstructured Observational Data
   •  Capabilities now allow us to consider doing this at unprecedented scale


Computer Processed Semi-Structured Data

•  Examples
   •  Simulation
   •  Coordinated data assimilation with analysis
   •  Traditional business processing
•  Characteristics
   •  Structured file based I/O
      •  Many to many, many to few, few to many
      •  Claim to be about a few big files – in reality there are many small files
      •  Format is application specific, investigator specific and sometimes domain specific
   •  Parallel file systems used
      •  Performance, Reliability, Management
   •  Uses levels of storage devices
      •  On-line disk, tape, …
   •  Significant amounts of the data are published via copy and post methods
      •  PCMDI, QCD Lattices, Protein databases, …


Structured Observational Data

•  Examples
   •  HEP/NP – LHC, CMS, SNO, …
   •  Astronomy – DES, LSST, SKA, Supernova, CMB
   •  EOS
•  Characteristics
   •  Structured file based I/O
      •  Domain specific meta-data structure
         •  Custom, Data base, …
   •  Parallel and non-parallel file systems used – sometimes not
   •  Uses levels of storage devices
      •  On-line disk, tape, …
   •  Globally accessible
      •  Much of the data is shared in a distributed hierarchy
         •  Tier 0, 1, 2, 3…
      •  Mechanisms for automatic discovery and retrieval


Unstructured Observational Data

•  Examples
   •  Textual – tweets, email, documents, genomic sequence segments, log files
   •  Images – YouTube, surveillance videos, images, …
   •  Combined – medical records
   •  Other – manufacturing and vehicle control systems
•  Characteristics
   •  Often minimal metadata initially
      •  Significant background processing to improve organization and retrieval
   •  Hadoop or other custom filesystems
   •  Asynchronous creation
   •  Small atomic units – mostly randomly accessed
   •  Storage System
      •  Typically only on-line but that may change
      •  Mostly local storage on nodes – have to schedule work on nodes with data or move the data
      •  Coordination via reading and writing files
   •  Much of the data is served after simple searches via portals and browsers
      •  Mechanisms for automatic discovery and retrieval
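
The storage trade-off noted above (run the task on a node that already holds the data, or pay to move the data) can be sketched as a small placement function. This is an illustrative sketch only: the node names, the block map, and the `place_task` helper are hypothetical and are not taken from Hadoop or any real scheduler.

```python
from typing import Dict, Set, Tuple


def place_task(block: str,
               block_locations: Dict[str, Set[str]],
               free_nodes: Set[str]) -> Tuple[str, bool]:
    """Pick a node for a task that reads `block`.

    Returns (node, data_is_local). A free node that already holds the
    block is preferred; otherwise any free node is used, which implies
    the data has to be moved over the network.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    local = block_locations.get(block, set()) & free_nodes
    if local:
        return min(local), True       # min() only makes the choice deterministic
    return min(free_nodes), False


# Hypothetical cluster state: which nodes hold which blocks.
locations = {"blk-1": {"n1", "n3"}, "blk-2": {"n2"}}

print(place_task("blk-1", locations, {"n3", "n4"}))  # ('n3', True): computation moves to the data
print(place_task("blk-2", locations, {"n3", "n4"}))  # ('n3', False): the data must move
```

A production scheduler would also weigh queue wait against transfer time; the sketch only shows the locality preference itself.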


Example Cost Comparison

•  From the Xyratex white paper “Map/Reduce on Lustre - Hadoop Performance in HPC Environments” by Nathan Rutman
   •  Can HPC file systems do better than HDFS?
   •  Clusters sized for similar performance on MapReduce tests – TestDFSIO, Hadoop Sort, read/write, etc.
   •  Based partially on the observation that storing 3 copies is more expensive than RAID storage + controller/OSTs
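
The storage-overhead part of that argument is simple arithmetic. The sketch below compares the raw capacity needed for the same usable capacity under 3-way replication and under a parity RAID layout. The 8+2 geometry and the 1000 TB figure are assumptions chosen for illustration, not numbers from the white paper, and controller/OST cost is deliberately left out.

```python
def raw_capacity_replicated(usable_tb: float, copies: int = 3) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies


def raw_capacity_raid(usable_tb: float,
                      data_disks: int = 8,
                      parity_disks: int = 2) -> float:
    """Raw capacity needed for a parity RAID group (assumed 8 data + 2 parity)."""
    return usable_tb * (data_disks + parity_disks) / data_disks


usable = 1000.0  # TB of usable capacity, illustrative only
print(raw_capacity_replicated(usable))  # 3000.0
print(raw_capacity_raid(usable))        # 1250.0
```

Under these assumptions replication needs 2.4 times the raw disk of the RAID layout; whether that translates into a cost advantage depends on the controller and server hardware each approach requires.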


Resource Management

•  Resource scheduling functions on large scale systems have the features needed to schedule work as people expect for Big Data
•  Scheduling decisions are based on a “culture” that is made up of users, providers, stakeholders, etc.
•  Queuing theory determines the tradeoffs of utilization vs response time
•  In “Big Data”, “batch” background work is done for what we perceive as interactive query response
   •  E.g. crawling and indexing web pages, image preparation, weather products, …
•  Changes required
   •  Need to schedule bandwidth – not processors
   •  How many units need to be scheduled as a unit
   •  For Big Data – need to move computation to the data or move the data
   •  Can do at least space sharing within a system
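
The utilization versus response-time trade-off can be made concrete with the simplest queuing model, M/M/1, in which the mean time in the system is W = S / (1 - ρ) for mean service time S and utilization ρ. A real scheduler is not a single M/M/1 queue; the model is an assumption used here only to show the shape of the curve.

```python
def mm1_response_time(service_time: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Response time expressed as a multiple of the service time.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    w = mm1_response_time(1.0, rho)
    print(f"utilization {rho:.2f}: mean response is {w:.1f} x service time")
```

Response time grows without bound as utilization approaches 1, so a system tuned for very high utilization cannot also promise short waits, whichever resource (processors or bandwidth) is being scheduled.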


Architecture

•  What architectural changes are needed for extreme computing storage systems to make them better suited for BD?
   •  Better small scale atomic I/O – Solid State Storage?
   •  A new storage repository – non POSIX?
   •  Seamless storage hierarchies
•  What operational changes are needed to support new storage architectures?
   •  Yes – critical resource is bandwidth not CPU
•  Looking at future technologies, what future architectures are possible?
   •  Interconnect is the most essential. Processor technology can be whatever it is.
   •  Energy efficient memory


What do we need to investigate

•  Software layers that interface the map reduce programming framework to HPC file systems AND
•  Software layers that run Parallel POSIX I/O on HDFS implementations
•  What lessons from HPC parallelism can be applied to Big Data Applications
•  Replace workflow communication by files with communication by memory
•  Create robust time to solution performance evaluation suites that can be used to explore claims of price performance of architectures and implementations
   •  Representing all three types of data use, and know which use models require them
•  Need to manage non-traditional resources
   •  e.g. bandwidth rather than processors
   •  Need to manage time to solution rather than time to start
•  Change mind share
   •  Traditional HPC = maximizing CPU use
   •  Big Data = CPU is not an important resource
•  Understand the role(s) of virtualization and Virtual Machines
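
As one illustration of replacing workflow communication by files with communication by memory, the sketch below runs the same two-stage workflow both ways. It uses only the Python standard library; the `produce` and `consume` stages are hypothetical, and the in-process queue merely stands in for whatever memory-based transport (for example MPI or shared memory) a real HPC workflow would use.

```python
import json
import os
import queue
import tempfile
import threading
from typing import List


def produce() -> List[int]:
    """Stage 1 of a toy workflow: generate intermediate results."""
    return [i * i for i in range(5)]


def consume(values: List[int]) -> int:
    """Stage 2 of a toy workflow: reduce the intermediate results."""
    return sum(values)


def workflow_via_file() -> int:
    # Stage 1 writes its output to a file; stage 2 reads the file back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        with open(path, "w") as f:
            json.dump(produce(), f)
        with open(path) as f:
            return consume(json.load(f))


def workflow_via_memory() -> int:
    # Stage 1 hands its output to stage 2 through an in-memory queue,
    # so no file system traffic is involved.
    channel: "queue.Queue[List[int]]" = queue.Queue(maxsize=1)
    producer = threading.Thread(target=lambda: channel.put(produce()))
    producer.start()
    result = consume(channel.get())
    producer.join()
    return result


print(workflow_via_file(), workflow_via_memory())  # 30 30
```

Both paths give the same answer; the difference is that the second never touches storage, which matters when bandwidth rather than CPU is the scarce resource.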


Acknowledgements

This work is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, Cray, and the Great Lakes Consortium for Petascale Computation.

The work described is achievable through the efforts of many others on different teams.
