Research in In-Situ Data Analytics
Gagan Agrawal, The Ohio State University (joint work with Yi Wang, Yu Su, and others)

In-Situ Scientific Analytics

What is In Situ?
– Co-locating simulation and analytics programs
– Moving computation instead of data

Constraints of In Situ
– Minimize the impact on the simulation
– Memory constraint
– Time constraint

[Diagram: traditional pipeline, Simulation -> Persistent Storage -> Analytics, versus in-situ pipeline, Simulation co-located with Analytics]

In-Situ Analysis – What and Why

Process of transforming data at run time
– Analysis
– Classification
– Reduction
– Visualization

In-situ analysis has the promise of
– Saving more information-dense data
– Saving I/O or network transfer time
– Saving disk space
– Saving time in analysis

Key Questions

How do we decide what data to save?
– This analysis cannot take too much time/memory
– Simulations already consume most available memory
– Scientists cannot accept much slowdown for analytics

How can insights be obtained in situ?
– Must be memory- and time-efficient

What representation should be used for data stored on disk?
– Effective analysis/visualization
– Disk/network efficient

A Vertical View

In-situ algorithms (algorithm/application level)
– No disk I/O
– Indexing, compression, visualization, statistical analysis, etc.

In-situ resource scheduling systems (platform/system level)
– Enhance resource utilization
– Simplify the management of analytics code
– GoldRush, GLEAN, DataSpaces, FlexIO, etc.

Are the two levels seamlessly connected?

Rethink These Two Levels

In-situ algorithms
– Implemented with low-level APIs like OpenMP/MPI
– Manually handle all the parallelization details

In-situ resource scheduling systems
– Play the role of coordinator
– Focus on scheduling issues like cycle stealing and asynchronous I/O
– No high-level parallel programming API

Motivation
– Can applications be mapped more easily to the platforms for in-situ analytics?
– Can the offline and in-situ analytics code be (almost) identical?

Outline
– Background
– Bitmap-based summarization and processing: key ideas, algorithms, evaluation
– Smart middleware system: motivation, design, evaluation
– Conclusions

Key Questions (revisited)
– How do we decide what data to save?
– How can insights be obtained in situ?
– What representation should be used for data stored on disk?

Quick Answers
– How do we decide what data to save? Use bitmaps!
– How can insights be obtained in situ? Use bitmaps!!
– What representation should be used for data stored on disk? Bitmaps!!!

Specific Issues

Bitmaps as data summarization
– Utilize extra compute power for data reduction
– Save memory usage, disk I/O, and network transfer time

In-situ data reduction
– Generate bitmaps in situ
– Bitmap generation is time-consuming
– Bitmaps before compression have a large memory cost

In-situ data analysis
– Time step selection: can bitmaps support it, and how efficient is time step selection using bitmaps?

Offline analysis
– Keep only bitmaps instead of the full data
– What types of analysis can bitmaps support?

Background: Bitmaps
– Widely used in scientific data management
– Suitable for floating-point values by binning small ranges
– Run-length compression (WAH, BBC)
– A bitmap index can be treated as a small profile of the data
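As a concrete illustration of the binning idea, here is a minimal C++ sketch that builds a bitmap index by mapping each floating-point value to a bin and setting the corresponding bit in that bin's bitvector. This is an illustrative sketch, not the implementation evaluated in this work: the uniform-width binning rule and the function name are assumptions, and the WAH/BBC-style run-length compression pass over the resulting bitvectors is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a bitmap index: one bitvector per bin, where bit i of bin b is
// set iff data[i] falls into bin b.
std::vector<std::vector<uint64_t>> build_bitmap_index(
    const std::vector<double>& data, double lo, double hi, int num_bins) {
  const size_t words = (data.size() + 63) / 64;
  std::vector<std::vector<uint64_t>> bins(
      num_bins, std::vector<uint64_t>(words, 0));
  const double width = (hi - lo) / num_bins;
  for (size_t i = 0; i < data.size(); ++i) {
    int b = static_cast<int>((data[i] - lo) / width);
    if (b < 0) b = 0;                     // clamp out-of-range values
    if (b >= num_bins) b = num_bins - 1;
    bins[b][i / 64] |= 1ULL << (i % 64);  // set bit i in bin b's bitvector
  }
  return bins;
}
```

Because neighboring points often fall into the same bins, each bitvector compresses well under run-length schemes such as WAH or BBC, which is what makes the index a small profile of the data.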
In-Situ Bitmap Generation

Parallel index generation
– Saves the data loading cost
– Multi-core based index generation

Core allocation strategies
– Shared cores: allocate all cores to the simulation and to bitmap generation, executed in sequence
– Separate cores: allocate different core sets to the simulation and to bitmap generation, executed in parallel, with a data queue shared between the two

In-place bitvector compression
– Scan the data by segments
– Merge each segment into the compressed bitvectors

Time Step Selection

Correlation metrics
– Earth Mover's Distance: the distance between two probability distributions over a region; the cost of changing the value distribution of the data
– Shannon's entropy: a measure of the variability of the dataset; high entropy means more randomly distributed data
– Mutual information: a measure of the dependence between two variables; low mutual information means the two variables are relatively independent
– Conditional entropy: the information a variable carries on its own, as opposed to the information it shares with others

Calculating Earth Mover's Distance Using Bitmaps
– Divide time steps T_i and T_j into bins over value subsets
– Generate a CFP based on the value differences between the bins of T_i and T_j
– Accumulate the results
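When each time step is summarized by the per-bin 1-bit counts of its bitmaps, the two distributions being compared are one-dimensional histograms, and EMD reduces to accumulating the running difference between the two distributions. The sketch below is a simplification under that assumption (the CFP formulation above is more general); bin_fraction shows how a histogram entry is obtained from a bin's bitvector by counting set bits.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fraction of points falling in one bin: the 1-bit count of the bin's
// bitvector divided by the total number of points.
double bin_fraction(const std::vector<uint64_t>& bv, size_t n_points) {
  size_t ones = 0;
  for (uint64_t w : bv) ones += __builtin_popcountll(w);  // GCC/Clang builtin
  return static_cast<double>(ones) / n_points;
}

// 1-D EMD between two normalized histograms of equal length: accumulate
// the mass that must be carried past each bin boundary.
double emd_1d(const std::vector<double>& p, const std::vector<double>& q) {
  double flow = 0.0;  // mass carried past the current bin boundary
  double emd = 0.0;
  for (size_t b = 0; b < p.size(); ++b) {
    flow += p[b] - q[b];
    emd += std::fabs(flow);  // unit cost to move |flow| to the next bin
  }
  return emd;
}
```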
Correlation Mining Using Bitmaps

Correlation mining
– Automatically suggest data subsets with high correlations
– Contrast with correlation analysis, where the user keeps submitting queries

Traditional method
– Exhaustive calculation over data subsets (spatial and value)
– Huge time and memory cost

Correlation mining using bitmaps
– Mutual information is calculated from probability distributions over value subsets
– A top-down method for value subsets: multi-level bitmap indexing; descend to a lower-level index only if the higher level shows high mutual information
– A bottom-up method for spatial subsets: divide bitvectors (with high correlations) into basic strides and perform 1-bit count operations over the strides
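The core computation, mutual information over binned value distributions, maps directly onto bitvector operations: marginal probabilities come from the 1-bit count of each bin's bitvector, and joint probabilities from the 1-bit count of a bitwise AND of two bitvectors. A minimal single-level sketch of this idea follows; the multi-level pruning and spatial strides described above are omitted, and the function names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitvector = std::vector<uint64_t>;

// 1-bit count of a whole bitvector (marginal count for one bin).
static size_t count_ones(const Bitvector& bv) {
  size_t ones = 0;
  for (uint64_t w : bv) ones += __builtin_popcountll(w);  // GCC/Clang builtin
  return ones;
}

// Joint count: points in bin x of variable X AND bin y of variable Y,
// computed by a bitwise AND instead of re-scanning the raw data.
static size_t count_joint(const Bitvector& a, const Bitvector& b) {
  size_t ones = 0;
  for (size_t i = 0; i < a.size(); ++i)
    ones += __builtin_popcountll(a[i] & b[i]);
  return ones;
}

// Mutual information I(X;Y) over the binned value distributions of two
// variables indexed on the same n points.
double mutual_information(const std::vector<Bitvector>& X,
                          const std::vector<Bitvector>& Y, size_t n) {
  double mi = 0.0;
  for (const Bitvector& bx : X) {
    const double px = static_cast<double>(count_ones(bx)) / n;
    for (const Bitvector& by : Y) {
      const double py = static_cast<double>(count_ones(by)) / n;
      const double pxy = static_cast<double>(count_joint(bx, by)) / n;
      if (pxy > 0.0) mi += pxy * std::log(pxy / (px * py));
    }
  }
  return mi;  // in nats
}
```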
Experiments

Goals
– Efficiency and storage improvement using bitmaps
– Scalability in a parallel in-situ environment
– Efficiency improvement for correlation mining
– Efficiency and accuracy comparison with sampling

Setup
– Simulations: Heat3D, Lulesh
– Dataset: Parallel Ocean Program (POP)
– CPU environment: 32 Intel Xeon X5650 cores and 1 TB memory
– MIC environment: Intel Xeon Phi coprocessors (60 cores, 8 GB memory)
– OSC Oakley cluster: 32 nodes, each with 12 Intel Xeon X5650 cores and 48 GB memory

Efficiency Comparison for In-Situ Analysis - CPU
Setup: Heat3D on CPU; select 25 out of 100 time steps; 6.4 GB per time step (800 × 1000 × 1000); metric: conditional entropy
– Full data (original): the simulation scales badly, time step selection is expensive, and data writing is expensive and scales badly
– Bitmaps: the simulation utilizes extra computing power for bitmap generation; bitmap generation adds time but scales well
– Time step selection using bitmaps: 1.38x to 1.5x speedup
– Bitmap writing: 6.78x speedup
– Overall: 0.79x to 2.38x; the more cores, the better the speedup

Efficiency Comparison for In-Situ Analysis - MIC
Setup: Heat3D on MIC; select 25 out of 100 time steps; 1.6 GB per time step (200 × 1000 × 1000); metric: conditional entropy
– MIC has more cores but lower bandwidth
– Full data (original): huge data writing time
– Bitmaps: good scalability of both bitmap generation and time step selection using bitmaps, and much smaller data writing time
– Overall: 0.81x to 3.28x

Memory Cost of In-Situ Analysis
Setup: Heat3D and Lulesh on CPU and MIC; keep 10 time steps in memory
– Heat3D, no indexing: 12 time steps (pre, temp, cur)
– Heat3D, bitmap indexing: 2 time steps (pre, temp), 1 set of previously selected indices, and 10 current indices
– Lulesh, no indexing: 11 time steps (pre, cur), plus huge extra memory for edges
– Lulesh, bitmap indexing: 1 time step (pre), 1 set of previously selected indices, and 10 current indices, plus huge extra memory for edges
– 2.0x to 3.59x smaller memory footprint; the advantage grows as bigger data is simulated and more time steps are held

Scalability in Parallel Environment
Setup: Heat3D; select 25 time steps out of 100; TEMP variable, 6.4 GB per time step; 1 to 32 nodes, 8 cores each
– Full data, local: each node writes its data sub-block to its own disk
– Bitmaps, local: fast time step selection and local writing; 1.24x to 1.29x speedup
– Full data, remote: each node sends its data sub-block to a master node
– Bitmaps, remote: greatly alleviates the data transfer burden on the master node; 1.24x to 3.79x speedup

Speedup for Correlation Mining
Setup: POP; variables TEMP and SALT; 1.4 GB to 11.2 GB per variable; 1 core
– Full data: big data loading cost, and exhaustive calculations over data subsets, each of which is time-consuming
– Bitmaps: smaller data loading, multi-level bitmaps to improve the mining process, and bitwise AND plus 1-bit count operations to improve calculation efficiency
– 3.81x to 4.92x speedup

In-Situ Sampling vs. Bitmaps
Setup: Heat3D, 100 time steps (6.4 GB each), 32 cores
– Bitmap generation (binning, compression) costs more time than down-sampling
– Sampling can effectively reduce the time step selection cost
– Bitmap generation can still achieve better efficiency if the index is smaller than the sample
– Bitmaps: with the same binning scale, there is no information loss
– Sampling: information loss is unavoidable no matter what the sampling rate is
[Chart: information loss at 30%, 15%, and 5% sampling rates]

Outline (revisited)
– Background
– Bitmap-based summarization and processing
– Smart middleware system: motivation, design, evaluation
– Conclusions

The Big Picture

In-situ algorithms (algorithm/application level)
– No disk I/O
– Indexing, compression, visualization, statistical analysis, etc.

In-situ resource scheduling systems (platform/system level)
– Enhance resource utilization
– Simplify the management of analytics code
– GoldRush, GLEAN, DataSpaces, FlexIO, etc.

Are the two levels seamlessly connected?

Opportunity
Explore the programming model level in the in-situ environment
– Sits between the application level and the system level
– Hides all the parallelization complexities behind a simplified API
– A prominent example: MapReduce

Challenges
It is hard to adapt MapReduce to the in-situ environment; MR was not designed for in-situ analytics. There are four mismatches:
– Data loading mismatch
– Programming view mismatch
– Memory constraint mismatch
– Programming language mismatch

Data Loading Mismatch
In situ requires taking input from memory. Ways existing MapReduce implementations load data:
– From distributed file systems: Hadoop and many variants (on HDFS), Google MR (on GFS), and Disco (on DDFS)
– From shared/local file systems: MARIANE and CGL-MapReduce
– MPI-based: MapReduce-MPI and MRO-MPI
– From memory: Phoenix (shared memory)
– From data streams: HOP, M3, and iMR

Data Loading Mismatch (cont'd)
Few MR options fit
– Most MRs load data from file systems
– Loading data from memory is mostly restricted to shared-memory environments
– Wrap the simulation output as a data stream? The stream spikes periodically, and only a one-time scan is allowed
– An exception: Spark can load data from file systems, memory, or data streams

Programming View Mismatch
Scientific simulation: parallel programming view
– Explicit parallelism: partitioning, message passing, and synchronization
MapReduce: sequential programming view
– Partitions are transparent
Need a hybrid programming view that
– Exposes partitions during data loading
– Hides parallelism after data loading

Memory Constraint Mismatch
MapReduce is often memory/disk intensive
– The map phase creates intermediate data
– Sorting, shuffling, and grouping do not reduce intermediate data at all
– A local combiner cannot reduce the peak memory consumption (in the map phase)
Need an alternate MR API that
– Avoids key-value pair emission in the map phase
– Eliminates intermediate data in the shuffling phase

Programming Language Mismatch
– Simulation code is in Fortran or C/C++; rewriting it in other languages is impractical
– Mainstream MRs are in Java/Scala: Hadoop in Java; Spark in Scala/Java/Python
– Other MRs in C/C++ are not widely adopted

Bridging the Gap
Smart addresses all the mismatches:
– Loads data from (distributed) memory, even without an extra memcpy in time sharing mode
– Presents a hybrid programming view
– Achieves high memory efficiency with an alternate API
– Implemented in C++11, with OpenMP + MPI

System Overview
[Diagram: relationship among the shared-memory, distributed, and in-situ systems; the in-situ system builds on the shared-memory system and adds the partitioning and combination mechanisms of the distributed system]

Two In-Situ Modes
– Time sharing mode: minimizes memory consumption
– Space sharing mode: enhances resource utilization when the simulation reaches its scalability bottleneck

Launching Smart in Time Sharing Mode
[Figure]

Launching Smart in Space Sharing Mode
[Figure]

Ease of Use

Launching Smart
– No extra libraries or configuration
– Minimal changes to the simulation code
– Analytics code remains the same in different modes

Application development (see the sketch below)
– Define a reduction object
– Derive a Smart scheduler class:
  – gen_key(s): generates key(s) for a data chunk
  – accumulate: accumulates data on a reduction object
  – merge: merges two reduction objects
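To give a feel for this API, here is a hypothetical sketch of a Smart-style scheduler that computes per-block averages. Only the reduction-object concept and the names gen_key, accumulate, and merge come from the description above; the class names, signatures, and blockwise key mapping are illustrative assumptions, and Smart's actual interfaces may differ.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical reduction object for a running average.
struct SumCount {
  double sum = 0.0;
  size_t count = 0;
};

// Hypothetical Smart-style scheduler: "derive a Smart scheduler class".
class AverageScheduler {
 public:
  // gen_key: map a data element to its key (here, a spatial block id).
  size_t gen_key(size_t index) const { return index / block_size_; }

  // accumulate: fold one element into the reduction object for its key.
  void accumulate(size_t index, double value) {
    SumCount& obj = objects_[gen_key(index)];
    obj.sum += value;
    obj.count += 1;
  }

  // merge: combine reduction objects produced by other threads/nodes.
  void merge(const AverageScheduler& other) {
    for (const auto& kv : other.objects_) {
      SumCount& obj = objects_[kv.first];
      obj.sum += kv.second.sum;
      obj.count += kv.second.count;
    }
  }

 private:
  size_t block_size_ = 4096;                      // illustrative block size
  std::unordered_map<size_t, SumCount> objects_;  // one object per key
};
```

Because all reduction happens in place on the reduction objects, no intermediate key-value pairs are ever emitted, which is the source of the memory efficiency discussed earlier.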
Optimization: Early Emission of Reduction Objects
Motivation
– Mainly targets window-based analytics, e.g., moving averages
– Maintaining a large number of reduction objects leads to high memory consumption
Key insight
– Most reduction objects can be finalized in the reduction phase
– Set a customizable trigger that outputs these reduction objects (locally) as early as possible

Smart vs. Spark
To make a fair comparison:
– Bypass the programming view mismatch: run on one 8-core node, multi-threaded but not distributed
– Bypass the memory constraint mismatch: use a simulation emulator that consumes little memory
– Bypass the programming language mismatch: rewrite the simulation in Java and compare only computation time
Workload: 40 GB input, 0.5 GB per time step
[Chart: Smart outperforms Spark by 62x on k-means and 92x on histogram]

Smart vs. Spark (cont'd)
Faster execution
– Spark 1) emits intermediate data, 2) makes immutable RDDs, and 3) serializes RDDs and sends them through the network even in local mode
– Smart 1) avoids intermediate data, 2) performs data reduction in place, and 3) takes advantage of the shared-memory environment (of each node)
Better (thread) scalability
– Spark launches extra threads for other tasks, e.g., communication and the driver UI; Smart launches no extra threads
Higher memory efficiency
– Spark uses over 90% of 12 GB of memory; Smart uses around 16 MB besides the 0.5 GB time step

Smart vs. Low-Level Implementations
Setup: Smart in time sharing mode versus hand-written OpenMP + MPI; apps: k-means and logistic regression; 1 TB input on 8 to 64 nodes
Programmability
– 55% and 69% of the parallel code is either eliminated or converted into sequential code
Performance
– Up to 9% extra overhead for k-means
– Nearly unnoticeable overhead for logistic regression

Node Scalability
Setup: 1 TB of data output by Heat3D; time sharing mode; 8 cores per node; 4 to 32 nodes

Thread Scalability
Setup: 1 TB of data output by Lulesh; time sharing mode; 64 nodes; 1 to 8 threads per node

Memory Efficiency of Time Sharing
Setup: logistic regression on Heat3D using 4 nodes (left); mutual information on Lulesh using 64 nodes (right)

Efficiency of Space Sharing Mode
Setup: 1 TB of data output by Lulesh; 8 Xeon Phi nodes with 60 threads per node; apps: k-means (left) and moving median (right)
[Charts: space sharing outperforms time sharing by 48% for k-means and by 10% for moving median]

Conclusions
– In-situ analytics needs to be carefully architected: memory constraints and programmability are real issues, and many-core processors are changing the game
– Bitmaps can be generated sufficiently fast: they are an effective summarization structure, memory efficient, and lose no accuracy in most cases
– The Smart middleware beats conventional wisdom: commercial "Big Data" ideas can be applied, but doing so requires careful design of the middleware