Project Report ECS 199 Summer Session II
8/12/2019 Project Report ECS 199 Summer Session II
Name: Sugeerth Murugesan
Major: Computer Science
Course: ECS 199 summer session-II 2013, 5 units
Instructor: Bernd Hamann
VISUALIZING WORK ASSIGNMENT DATA IN AN ADAPTIVE MESH REFINEMENT LIBRARY-
FURTHER EXTENSIONS OF SYSTEM
Objective:
Enormous amounts of data are generated daily. The lack of effective tools to analyze the
collected data results in a reduced ability of the data scientists to gain insights [7]. The only
reasonable approach to analyze such huge amounts of data is through effective and efficient
ways of visualization. The ability to support real-time visualization is an essential aspect of any
visualization tool. The goal of this project is to support visualizations of high-resolution datasets by designing, implementing and testing methods that allow the data scientists using the
tool to perceive visualizations in real-time. This project is an extension of my work done at UC
Davis during Summer Session I.
We add further extensions to the existing system that we developed earlier in spring
as well as Summer Session I. The module we developed was the Patch Module in the
performance visualization tool Boxfish. The operators we implement in this project
include context visualization [5], parallelizing and optimizing the data processing parts of the
prototype, and projecting regions of interest through different levels of the AMR hierarchy.
AMR (Adaptive Mesh Refinement) [4] is a process by which the cells of a physical space are
refined only in the areas of complex activity. At a given instant, the space
is subdivided into smaller grid cells. In a High Performance Computing (HPC) scenario, the
simulated domain (application domain) is the physical space and the hardware domain is the
physical hardware of the supercomputers. Parts of the application domain are mapped to the
hardware domain. We visualize, and optimize the visualizations of, the physical domain.
We optimize our prototype by:
1) Clearly identifying the parts of the non-parallel implementation that consume the relatively largest amounts of resources (computing, storage and number of function calls).
2) Understanding the behavior of the application by running the non-parallel implementation with 3 different resolutions of the same type:
a. Initial resolution = 1024 core run dataset (Figure 2(a))
b. Twice the initial resolution = 2 x 1024 core run dataset (Figure 2(b))
c. Four times the initial resolution = 4 x 1024 core run dataset (Figure 2(c))
3) Applying our optimization and parallelization techniques to those parts that seem most expensive in terms of computation, storage and number of function calls.
The methods used to realize the overall software re-design by optimization and parallelization
are:
1) Code profiling: profiling the current execution pattern of the prototype and finding the costly, data-intensive operations.
2) Code optimizing: optimizing the data-intensive operations found during code profiling.
3) Code parallelizing: parallelizing the expensive operations identified in the code optimizing phase.
1. Code Profiling:
The python libraries used for the analysis are:
cProfile: reports the amount of time in CPU seconds and the number of function calls each
part (function) of the program took.
Pympler: a Python library that reports the current size of the data structures in the code.
For example, the size of the "Patch dict" in the 4 x 1024 core run dataset is 1.44 MB.
PyMetrics: determines the number of basic independent paths of a code region. This gives
a basic idea of which parts of the code need the most attention (in terms of
parallelizing and exploiting more resources from the CPU).
PyCuda: lets the programmer access Nvidia's CUDA parallel computation API from
Python [10]. It utilizes the power of CUDA programming through the driver API.
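Of these, cProfile ships with the Python standard library. A minimal sketch of how a single routine can be profiled for CPU time and call counts (the routine below is a hypothetical stand-in, not the prototype's actual Neighbour_Change()):

```python
import cProfile
import io
import pstats

def neighbour_change(patches, point):
    # Hypothetical stand-in for the prototype's neighbour search.
    return [p for p in patches
            if abs(p[0] - point[0]) + abs(p[1] - point[1]) < 2]

profiler = cProfile.Profile()
profiler.enable()
neighbour_change([(0, 0), (1, 1), (5, 5)], (0, 0))
profiler.disable()

# Report CPU seconds and call counts per function, most expensive first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
```

Sorting by cumulative time surfaces the most expensive functions first, which is the basis for deciding where optimization effort should go.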
1.2 Finding the McCabe Cyclomatic Complexity:
The McCabe cyclomatic complexity [7] analysis identifies the important parts of the
code that need attention: those that contain the largest number of independent
execution paths. This forms a basis for finding which parts of the code consume the largest amount
of resources (CPU time taken to execute, number of function calls made). The parts
of the code (applicable to our prototype) which will be analyzed in the following sections are
defined below:
SetPatchSize():(Function responsible for implosion operators)
It sets the size of the patches and announces the change to the patch module. As soon as the patch sizes
are updated the display lists are updated to reflect the change in rendering of patches.
ChangeValue_for_transparency(): (Function to change the degree of opacity of patches)
Sets the chosen value when the opacity slider is updated.
ChangeValue_for_Level_slider():(Function to change the level of interest)
Visualizes the desired level of interest. The function is connected to a Level slider where the user can
restrict the visualization to certain levels of refinement.
UpdateHighlights(): (Function that propagates any change in the resolution)
Given a list of the patch ids to be highlighted, it displays a dialogue box that quantifies
the data attributes of the patches of interest.
Neighbour_Change():(Function that finds the neighbors of the patch of interest)
It returns the neighbors of the patch of interest in a list. Does not return the patches which are already
sliced-off using the Slice and Dice operation.
Process_Planes(): (Function that initializes values for slice planes)
Initializes the planes data structure (hash map) to render during the slice and dice mode.
Magnify(): (Function that returns patches in the vicinity of the patch of interest.)
The function finds the distance between the patch of interest and its neighbors. The function plays a
crucial role in creating context visualization.
Range(): Calculates the distance between the patch of interest and the entire simulated domain. It also
calculates where exactly the slice planes are to be placed in the simulated domain.
DoPick(): A function responsible for the selection of patches and slice planes.
Plane_highlight(): Function responsible for the selection of a particular plane in the simulated domain.
Highlight_Drawing():The function that is responsible for visualization on multiple patches of interest,
level of interest, magnification visualization and highlighting the patch neighbors.
The graph in Figure 1 plots the different parts of the prototype against their
number of independent paths (McCabe complexity). Figure 2 is a graph
that represents the relation between the number of function calls made and the amount of CPU
seconds taken to respond for each part of the prototype.
2. Inference Drawn and Optimization:
With this brief analysis of the code, we applied several techniques for optimizing and parallelizing
our non-parallel implementation. The goal is to minimize the cost of expensive operations in the
prototype to create a real-time visualization. We implemented the following operations to improve the
overall performance of the software:
1. Replacement of the Euclidean distance formula with the Manhattan distance (city block distance). Average improvement:
a. Neighbour_Change(): 37.34%.
2. The operational numpy data structures are converted from float32 to int32. Average
improvement in performance:
a. Neighbor Change (): 25.73%.
b. DrawCubes (): 2.74%.
Please note that the improvement in performance is calculated by
Percent improvement = ((T_before - T_after) / T_before) x 100
3. Reducing on-the-fly operations to improve the overall rendering process. This is
accomplished by storing the calculated color values in a numpy data structure. Average change
in performance:
a. Draw_Cubes(): ~25x faster
b. Highlight_Drawing(): ~2x faster
c. Range(): ~5x slower
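The percent-improvement computation referenced above can be sketched as follows, under the standard assumption that improvement is measured relative to the original time (the timings here are hypothetical, not the measured ones):

```python
def percent_improvement(t_before, t_after):
    """Percent reduction in time; positive = faster, negative = slower."""
    return (t_before - t_after) / t_before * 100.0

# Hypothetical example: an operation that drops from 1.25 s to 0.90 s.
speedup = percent_improvement(1.25, 0.90)  # 28.0

# A slowdown (as observed for Range() above) comes out negative.
slowdown = percent_improvement(1.0, 2.0)  # -100.0
```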
2.1 Manhattan Distance:
To reduce the computational time, the Manhattan distance is introduced to
replace the Euclidean distance formula. The Manhattan distance is the number of city blocks between the
two points of interest. According to the graph, we see that, on average, there is a 37.34% decrease in
the amount of time taken by the CPU to respond. The purpose of finding the distance in the Neighbor_Change()
function is to highlight the patches in the region of interest. Restructuring the code
to accommodate the Manhattan distance also improved the overall computational time.
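The substitution can be sketched as follows; the coordinates are hypothetical patch centers, and the Manhattan variant needs no multiplications or square root:

```python
import math

def euclidean_distance(a, b):
    # Original formula: square root of summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Replacement: sum of absolute coordinate differences (city block distance).
    return sum(abs(x - y) for x, y in zip(a, b))

center_of_interest = (1.0, 2.0, 3.0)   # hypothetical patch center
neighbour_center = (4.0, 6.0, 3.0)     # hypothetical neighbour center

d_euclid = euclidean_distance(center_of_interest, neighbour_center)  # 5.0
d_city = manhattan_distance(center_of_interest, neighbour_center)    # 7.0
```

The two metrics rank nearby patches similarly for highlighting purposes, which is what makes the cheaper formula an acceptable substitute here.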
2.2 Integer Conversion:
The conversion of floating point data structures to integer type data structures resulted in a
a. Neighbor_Change(): 25.73%
b. DrawCubes(): 2.74%
decrease in the computational time taken by the CPU. The conversion amounted to a down-sampling
process in which the data was rounded off to the nearest integer, limiting the precision of
the float values. The data structures that were rounded off to integers are:
self.Distance = distance between the regions of interest
self.maxval, self.minval = maximum and minimum values of the color map
self.Slice = slice plane values along the x, y and z axes
The conversion of float values to integer values in the DrawCubes() function resulted in a
2.74% decrease in computational time, whereas in Neighbor_Change() it resulted in a 25.73%
decrease.
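The conversion can be sketched with numpy's rounding and dtype narrowing; the values below are hypothetical stand-ins for the fields listed above:

```python
import numpy as np

# Hypothetical stand-ins for the rounded-off data structures named above.
distance = np.array([1.7, 2.2, 3.5], dtype=np.float32)
slice_planes = np.array([0.4, 10.6, 20.1], dtype=np.float32)

# np.rint rounds to the nearest integer (ties to even) before the dtype
# is narrowed, trading float precision for cheaper integer arithmetic.
distance_int = np.rint(distance).astype(np.int32)
slice_int = np.rint(slice_planes).astype(np.int32)
```

Note that this is a lossy down-sampling step: once narrowed, the fractional part of each value is gone for all later computations.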
2.3 Reducing On-the-Fly Computations:
The purpose of reducing the on-the-fly computations is to decrease the initial time
taken to render the patches. Earlier, patches were retrieved from the Python dict and using the
exploit the benefits of GPU computing. The purpose of using such an operation is to reduce the overall
computation time and utilize the GPU for the computation of expensive parts of the code:
Computing and storing the individual RGBA color scheme for every patch.
Implementation details:
The data structure self.values is an n-dimensional numpy array in the Patch module. It
contains the values of the data attributes that are dropped onto the
rendering scene, e.g., max-hops extents, owner extents, MPI ranks. The data structure is divided into a
number of threads and blocks to support the computation by PyCuda. A number of simultaneous
threads computing the color values of the patches is launched; the number of threads is analogous to
the number of patches. Thus, each thread computes the color value of one patch and stores it in a
memory location. To compute the number of blocks required to calculate the color attributes of the patches,
the following equation is used:
Number of blocks = ceil(Number of patches / Threads per block)
After a study of the existing hardware and the configuration of threads and blocks in the GPU,
the configuration of 1024 threads per block, with the number of blocks given by the equation above,
was one of the optimal configurations of threads and blocks.
The configurations of threads per block that were taken into consideration were:
1) 1024x1x1
2) 64x4x4
3) 32x32x1
4) 512x2x1
Inferring from the analyses, the configuration of 32x32x1 threads per block yielded an
approximately 36.6% improvement in computation time over the average computational time of the
other configurations. The time measured here is the computation time and does not include the
communication time. The communication time in this context refers to the amount of time
involved in transferring the data to and from the GPU and the amount of time involved in converting
lists to numpy arrays.
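The launch-configuration arithmetic behind these measurements can be sketched in plain Python (the patch counts below are hypothetical; an actual launch additionally requires PyCuda and a CUDA-capable GPU):

```python
import math

def launch_config(num_patches, block_shape=(32, 32, 1)):
    """One thread per patch: pad the grid up so that
    blocks x threads_per_block covers every patch."""
    threads_per_block = block_shape[0] * block_shape[1] * block_shape[2]
    num_blocks = math.ceil(num_patches / threads_per_block)
    return threads_per_block, num_blocks

# All four candidate shapes above contain 1024 threads per block,
# so they need the same number of blocks for a given patch count.
threads, blocks = launch_config(5000)                 # (1024, 5)
threads2, blocks2 = launch_config(5000, (64, 4, 4))   # (1024, 5)
```

Since every candidate shape holds 1024 threads, the 36.6% difference between them comes from how the shape maps onto the GPU's warps and memory accesses, not from the block count.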
The graph in Figure 10 represents a comparison of the average computational time on the CPU (Python looping
time) vs. the GPU (time required to compute on the GPU). The scalar values are:
1) CPU = 0.119815 s
2) GPU = 0.00078925 s
Figure 10.
Amount of time taken by the GPU and CPU to compute color.
If we take into account the amount of time required to transfer data to and from the GPU, the
overhead involved in transferring data, computing values, and storing and retrieving the computed values
to and from a data structure, then it amounts to a surprising 33.41% increase in the initial data
processing time.
Calculating the Manhattan distance from a particular point of interest:
The Manhattan distance, or city block distance, is computed when the highlight or Magnify
operations are initiated. It is calculated between the center of interest and the centers of all
the patches in the vicinity. The procedure followed for the color values is used again to find the
optimal configuration of threads and blocks for maximum performance. As the CUDA
kernel in the Python implementation only operates on numpy arrays, all the centers of the patches are
stored in a numpy array and sent to the kernel for computation. The computed values are
stored in another numpy array.
The distances from the patch of interest to the other patches are stored at their respective
indices. This computation is done in parallel: the operation performs the same
function on multiple threads.
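A CPU-side sketch of the same all-centers computation, using vectorized numpy operations (the centers below are hypothetical); the CUDA kernel instead assigns one thread to each index:

```python
import numpy as np

# Hypothetical patch centers, one (x, y, z) row per patch.
centers = np.array([[0.0, 0.0, 0.0],
                    [1.0, 2.0, 2.0],
                    [3.0, 3.0, 3.0]])
point_of_interest = np.array([1.0, 1.0, 1.0])

# Manhattan distance from the point of interest to every center;
# index i of the result holds the distance for patch i.
distances = np.abs(centers - point_of_interest).sum(axis=1)  # [3., 2., 6.]
```

Keeping the result in the same index order as the input array is what lets each computed distance be mapped straight back to its patch.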
Parallel implementation vs Serial implementation:
The graph plots the computational time (communication time + computation time)
for the 3 datasets. It is interesting to see that, as the datasets become more
complex, the computational time increases in the case of the serial implementation, but in the case of the
parallel implementation the computational time stays more or less constant. The serial implementation
requires more Python loops and iterations for a given dataset, whereas the
parallel implementation sends simultaneous threads to the GPU to compute.
Figure 11.
Parallel vs Serial behaviour with initial render time.
In Figure 11 we find that there is a kink in the measurement of the serial implementation of
the prototype. This is probably because the resolutions of the datasets taken into consideration, i.e., the
1024 core-run, 2 x 1024 core-run and 4 x 1024 core-run datasets, do not increase linearly. As a result of this
non-linear increase in resolution, there is a "kink" in the graph.
Other Features implemented:
Context-Visualization:
We develop a context [5] in the existing visualization to place more emphasis on the region of
interest. Certain data subsets are viewed in more detail than other regions, which
are shown only for context [5]. We develop an intuitive feature that uses different graphic
resources such as opacity, color and space to succinctly create visualizations of a focused region of
interest. The reason for introducing such a visualization is that, for very large datasets, every detail cannot
be explicitly shown to the user. We apply this in our software by rendering
patches of least interest as a wire framework. Figure 8 represents a visualization in which
there are multiple regions of interest. The patch colored blue is the focus patch.
The visualization also highlights the neighbors of the patch of interest in red. The patch colored yellow
was once a region of interest chosen by the user. To indicate that it is a previous region of interest, the
original color of the patch is preserved. In Figure 8, we also see that the simulated domain is
characterized by horizontal slice planes. These slice planes are responsible for cutting away
unwanted parts of the simulated domain. Figure 9 represents a visualization in which the region of
interest is exhibited across the refinement levels of a 3-D AMR dataset. We define the context of the
visualization using the wire framework to focus on the regions of interest across all levels.
Ability to take screenshots:
Working: With the press of the key "z" on the keyboard, the user can take a cropped screenshot of
the current visualization. Using the QFileDialog option in Qt, the software pops up
a dialogue box that enables saving the image currently seen by the user. The crop function
crops the desired pixels from the image and returns the result. It finds the minimum_x,
minimum_y, maximum_x and maximum_y values to initiate the cropping function.
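The min/max bookkeeping behind the crop rectangle can be sketched as follows (crop_bounds and the pixel coordinates are hypothetical illustrations, not the tool's actual Qt code):

```python
def crop_bounds(pixels):
    """Return (minimum_x, minimum_y, maximum_x, maximum_y) for an
    iterable of (x, y) coordinates belonging to the visible drawing."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)

# The resulting rectangle is then handed to the image library's crop routine.
bounds = crop_bounds([(12, 40), (3, 55), (70, 8)])  # (3, 8, 70, 55)
```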
CONCLUSION AND FUTURE WORK
In this project, we have applied various concepts of parallel computing and information
visualization to create visualizations that scale to real-time. The optimization and parallelization of the
current prototype have resulted in an average ~25x improvement in the initial rendering time. We have
also developed intuitive context visualization techniques that have increased the clarity of the
visualization of the region of interest. This project has also provided me with the opportunity to exploit the
computational power of the GPU. The results of the performance evaluation are very interesting and
promising.
Many other aspects of the tool could be improved in the near future. It would be important for
the user to be able to view multiple localized visualizations of the region of interest while keeping the
global reference of the simulated domain in mind. Although the current toolset supports real-time rendering of
patches up to 8 times the initial resolution of the 1024 core run dataset, the rendering is platform
dependent and relative to every machine. Implementing platform-independent software would be a
challenging direction.
References
[1] Bhatele, A., Gamblin, G.T., Isaacs, K.E., Gunney, B.T.N., Schulz, M.W.J., Bremer, P.-T. and Hamann, B.
(2012), Novel views of performance data to analyze large-scale adaptive applications, in:
Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.
[2] Isaacs, K.E., Landge, A.G., Gamblin, G.T., Bremer, P.-T., Pascucci, V. and Hamann, B. (2012), Exploring
performance data with Boxfish, electronic poster presentation, in: Hollingsworth, J.K., ed.,
Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 13 pages.
[3] Bhatele, A., Gamblin, G.T., Langer, S.H., Bremer, P.-T., Draeger, E.W., Hamann, B., Isaacs, K.E., Landge,
A.G., Levine, J.A., Pascucci, V., Schulz, M.W.J. and Still, C.H. (2012), Mapping applications with collectives
over sub-communicators on torus networks, in: Hollingsworth, J.K., ed., Proceedings of
Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.
[4] Marsha Berger and Phillip Colella, Local adaptive mesh refinement for shock hydrodynamics, Journal
of Computational Physics, 82:64-84, May 1989. Lawrence Livermore National Laboratory, Technical
Report No. UCRL-97196.
[5] Helwig Hauser, Generalizing Focus+Context Visualization, VRVis Research Center, Vienna,
Austria.
[6] E-mails and conversations involving Prof. Bernd Hamann, and Katherine. E. Isaacs, Department of
Computer Science, University of California, Davis.
[7] Thomas McCabe, A Complexity Measure, IEEE Transactions on Software Engineering, Vol. SE-2, No. 4,
December 1976.
[8] Andreas Klöckner, Computer Science, University of Illinois at Urbana-Champaign,
http://mathema.tician.de/software/pycuda.
[9] Chapter 3 Viewing, Chapter 4 Color, Chapter 5 Lighting http://www.glprogramming.com/red/
[10] Boxfish Documentation, User Guide and Developer Guide, https://scalability.llnl.gov/performance-
analysis-through-visualization/software/boxfish/docs/index.html
[11] http://docs.python.org/2/library/profile.html.
[12] R. D. Hornung and S. R. Kohn, Managing application complexity in the SAMRAI object-oriented
framework, Concurrency and Computation: Practice and Experience, vol. 14, no. 5, pp. 347-368, 2002.
[13] B. T. Gunney, A. M. Wissink, and D. A. Hysom, Parallel clustering algorithms for structured AMR,
Journal of Parallel and Distributed Computing, vol. 66, no. 11, pp. 1419-1430, 2006.
[14] PyCuda examples, http://wiki.tiker.net/PyCuda/Examples.
[15] David Luebke, John Owens, Mike Roberts, Cheng-Han Lee,Introduction to Parallel Programming,
https://www.udacity.com/course/cs344.
[16] Andreas Klöckner, GPU Metaprogramming using PyCUDA: Methods & Applications, Division of
Applied Mathematics, Brown University, Nvidia GTC, October 2, 2009.
[17] Helmut Doleisch, Martin Gasser, Helwig Hauser, Interactive Feature Specification for Focus+Context
Visualization of Complex Simulation Data, VRVis Research Center, Vienna, Austria.
Appendix A: Figures
Figure.1
The different parts of the prototype are plotted against the individual percentage that each part contributes
to the overall McCabe complexity.
Figure.2 (a)
Dataset 1: 1024 core run dataset:
Figure.2 (b)
Dataset 2: 2 x 1024 core run dataset:
Figure.2 (c)
Dataset 3: 4 x 1024 core run dataset:
Figure.3 (a)
Represents the total contribution of a particular part of the prototype to the overall percentage. The two
parameters taken into consideration are the number of function calls and the CPU seconds taken to
respond, for dataset 1.
Figure.3 (b)
Percent contribution in dataset 2.
Figure.3 (c)
Percent contribution in dataset 3.
Figure.4
The graph plots the improvement in rendering time when the optimization technique of converting
the Euclidean distance to the Manhattan distance is applied.
Figure.5 (a)
The graph plots the improvement in response time when the optimization technique of converting
floating point values to integer values is applied to DrawCubes().
Figure.5 (b)
The graph plots the improvement in response time when the optimization technique of converting
floating point values to integer values is applied to Neighbor_change().
Figure 9.
Visualizations that demonstrate the patch linking between every refinement level of the AMR dataset. The wire
framework represents context visualization.