Project Report ECS 199 Summer Session II


    Name: Sugeerth Murugesan

    Major: Computer Science

Course: ECS 199, Summer Session II 2013, 5 units

    Instructor: Bernd Hamann

VISUALIZING WORK ASSIGNMENT DATA IN AN ADAPTIVE MESH REFINEMENT LIBRARY:

FURTHER EXTENSIONS OF THE SYSTEM

    Objective:

Enormous amounts of data are generated daily. The lack of effective tools to analyze the collected data reduces the ability of data scientists to gain insights [7]. The only reasonable approach to analyzing such huge amounts of data is through effective and efficient visualization. The ability to support real-time visualization is an essential aspect of any visualization tool. The goal of this project is to support visualization of high-resolution datasets by designing, implementing, and testing methods that allow the data scientists using the tool to explore the visualizations in real time. This project is an extension of my work done at UC Davis during Summer Session I.

We add further extensions to the existing system that we developed earlier in spring and in Summer Session I. The module that we developed is the Patch Module in the performance visualization tool Boxfish. The operators that we implement in this project include context visualization [5], parallelizing and optimizing the data processing parts of the prototype, and projecting regions of interest through different levels of the AMR hierarchy. AMR (Adaptive Mesh Refinement) [4] is a process by which the cells in a physical space are refined only in the areas where there is complex activity. At any particular instant, the space is subdivided into smaller grid cells. In a High Performance Computing (HPC) scenario, the simulated domain (application domain) is the physical space, and the hardware domain is the physical hardware of the supercomputer. Parts of the application domain are mapped to the hardware domain. We visualize the physical domain and optimize those visualizations.

We optimize our prototype by:

1) Clearly identifying the parts of the non-parallel implementation that consume the relatively largest amounts of resources (computing, storage, and number of function calls).

2) Understanding the behavior of the application by running the non-parallel implementation on 3 different resolutions of the same type:

a. Initial resolution = 1024 core run dataset (Figure 2(a))

b. Twice the initial resolution = 2 x 1024 core run dataset (Figure 2(b))

c. Four times the initial resolution = 4 x 1024 core run dataset (Figure 2(c))


3) Applying our optimization and parallelization techniques to those parts that appear most expensive in terms of computation, storage, and number of function calls.

The methods used to realize the overall software redesign through optimization and parallelization are:

1) Code profiling: profiling the current execution pattern of the prototype and finding the most data-intensive operations.

2) Code optimizing: optimizing the data-intensive operations that were found during code profiling.

3) Code parallelizing: parallelizing the expensive operations identified in the code optimizing phase.

    1. Code Profiling:

The Python libraries used for the analysis are:

cProfile: reports the amount of time in CPU seconds and the number of function calls that each part (function) of the program took.

Pympler: a Python library that reports the current size of the data structures in the code. For example, the size of the "Patch dict" in the 4 x 1024 core run dataset is 1.44 MB.

PyMetrics: determines the number of basic independent paths through a code region. This gives a basic idea of which parts of the code need the most attention (in terms of parallelizing and exploiting more resources from the CPU).

PyCuda: lets the programmer access Nvidia's CUDA parallel computation API from Python [10]. It utilizes the power of CUDA programming through the driver API.

1.2 Finding McCabe's Cyclomatic Complexity:

McCabe's cyclomatic complexity [7] analysis identifies the important parts of the code that need to be given attention: those that contain the largest number of independent execution paths. This forms a basis for finding which parts of the code consume the largest amount of resources (CPU seconds taken to execute and number of function calls made).

The parts of the code (applicable to our prototype) that are analyzed in the following sections are defined below:

SetPatchSize(): (Function responsible for the implosion operators.) It sets the size of the patches and announces the change to the Patch Module. As soon as the patch sizes are updated, the display lists are updated to reflect the change in the rendering of patches.

ChangeValue_for_transparency(): (Function to change the degree of opacity of patches.) The function sets the value chosen when the opacity slider is updated.


ChangeValue_for_Level_slider(): (Function to change the level of interest.) Visualizes the desired level of interest. The function is connected to a level slider with which the user can restrict the visualization to certain levels of refinement.

UpdateHighlights(): (Function that propagates any change in the resolution.) Given a list of the patch ids to be highlighted, it displays a dialogue box that quantifies the data attributes of the patches of interest.

Neighbour_Change(): (Function that finds the neighbors of the patch of interest.) It returns the neighbors of the patch of interest in a list. It does not return patches that have already been sliced off using the Slice and Dice operation.

Process_Planes(): (Function that initializes values for the slice planes.) Initializes the planes data structure (hash map) to render during the slice and dice mode.

Magnify(): (Function that returns patches in the vicinity of the patch of interest.) The function finds the distance between the patch of interest and its neighbors. The function plays a crucial role in creating the context visualization.

Range(): Calculates the distance between the patch of interest and the entire simulated domain. It also calculates where exactly the slice planes are to be placed in the simulated domain.

DoPick(): A function responsible for the selection of patches and slice planes.

Plane_highlight(): Function responsible for the selection of a particular plane in the simulated domain.

Highlight_Drawing(): The function responsible for the visualization of multiple patches of interest, the level of interest, the magnification visualization, and the highlighting of patch neighbors.

The graph in Figure 1 plots the different parts of the prototype against the number of independent paths (McCabe's complexity) in the designed prototype. Figure 3 is a graph that represents the relation between the number of function calls made and the amount of CPU seconds taken to respond for each part of the prototype.

2. Inferences Drawn and Optimization:

With this brief analysis of the code, we applied several techniques for optimizing and parallelizing our non-parallel implementation. The goal is to minimize the cost of expensive operations in the prototype to create a real-time visualization. We implemented the following operations to improve the overall performance of the software:

1. Replacement of the Euclidean distance formula with the Manhattan distance, or city block distance. The average improvement is:

a. Neighbour_Change(): 37.34%.

2. Conversion of the operational numpy data structures from float32 to int32. The average improvement in performance is:


a. Neighbour_Change(): 25.73%.

b. DrawCubes(): 2.74%.

Please note that the improvement in performance is calculated as:

Percent improvement = ((time before - time after) / time before) x 100

3. Reducing on-the-fly operations to improve the overall rendering process. This is accomplished by storing the calculated color values in a numpy data structure. The average change in performance is:

a. Draw_Cubes(): ~25x faster

b. Highlight_Drawing(): ~2x faster

c. Range(): ~5x slower

2.1 Manhattan Distance:

To reduce the computational time, the Manhattan distance is introduced to replace the Euclidean distance formula. The Manhattan distance is the number of city blocks between the two points of interest. According to the graph, we see that, on average, there is a 37.34% decrease in the amount of time taken by the CPU to respond. The purpose of finding the distance in the Neighbour_Change() function is to highlight the patches that are in the region of interest. Restructuring the code to accommodate the Manhattan distance also improved the overall computational time.

2.2 Integer Conversion:

The conversion of floating point data structures to integer data structures resulted in a decrease in the computational time taken by the CPU of:

a. Neighbour_Change(): 25.73%.

b. DrawCubes(): 2.74%.

The conversion is a down-sampling process in which the data is rounded off to the nearest integer, limiting the precision of the float value. The data structures that were rounded off to integers are:

self.Distance = distance between the regions of interest.

self.maxval, self.minval = maximum and minimum values of the color map.

self.Slice = slice plane values along the x, y, and z axes.

The conversion of float values to integer values in the DrawCubes() function resulted in a 2.74% decrease in computational time, whereas in Neighbour_Change() it resulted in a 25.73% decrease.

2.3 Reducing On-the-Fly Computations:

The purpose of reducing the on-the-fly computations is to decrease the initial time taken to render the patches. Earlier, the patches were retrieved from the Python dict and their color values were computed on the fly during each render pass.


We parallelize this computation to exploit the benefits of GPU computing. The purpose of such an operation is to reduce the overall computation time by utilizing the GPU for the expensive parts of the code.

    Computing and storing the individual RGBA color scheme for every patch.

    Implementation details:

The data structure self.values is an n-dimensional numpy array in the Patch Module. It contains the values of the data attributes that are dropped onto the rendering scene, e.g., max-hops extents, owner extents, and MPI ranks. The data structure is divided into a number of threads and blocks to support the computation by PyCuda. A number of simultaneous threads, each computing the color value of a patch, is launched; the number of threads is analogous to the number of patches. Thus, each thread computes the color value of one patch and stores it in a memory location. To compute the number of blocks required to calculate the color attributes of the patches, the following equation is used:

Number of blocks = ceil(number of patches / threads per block)

From a study of the existing hardware and the configuration of threads and blocks in the GPU, the configuration of 1024 threads per block, with the number of blocks given by the equation above, was one of the optimal configurations of threads and blocks.

The configurations of threads per block that were taken into consideration were:

1) 1024 x 1 x 1

2) 64 x 4 x 4

3) 32 x 32 x 1

4) 512 x 2 x 1

Inferring from the analyses, the configuration of 32 x 32 x 1 threads per block yielded approximately a 36.6% improvement in computation time over the average computational time of the other configurations. The time measured here is the computation time and does not include the communication time. The communication time in this context refers to the amount of time involved in transferring the data to and from the GPU and the amount of time involved in converting lists to numpy arrays.

The graph in Figure 10 compares the average computational time on the CPU (Python looping time) vs. the GPU (time required to compute on the GPU). The scalar values are:

1) CPU = 0.119815 s

2) GPU = 0.00078925 s


    Figure 10.

    Amount of time taken by the GPU and CPU to compute color.

If we take into account the amount of time required to transfer data to and from the GPU, the overhead involved in transferring data, computing values, and storing and retrieving the computed values to and from a data structure amounts to a surprising increase in the initial data processing time of 33.41%.

Calculating the Manhattan distance from a particular point of interest:

The Manhattan distance, or city block distance, is computed when the highlight or Magnify operations are initiated. It is the distance between the center of interest and the centers of all the patches in the vicinity. The procedure followed for the color values is repeated to find the number of threads and blocks giving maximum performance. As the CUDA kernel in the Python implementation only computes on numpy arrays, all the centers of the patches are stored in a numpy array and sent to the kernel for computation. The computed values are stored in another numpy array.

The distances from the patch of interest to the other patches are stored at their respective indexes. This computation is done in parallel: the same function is performed simultaneously on multiple threads.

Parallel implementation vs. serial implementation:

The graph plots the total time (communication time + computation time) against the 3 datasets. It is interesting to see that as the datasets become more complex, the computational time increases in the case of the serial implementation, whereas in the case of the parallel implementation the computational time stays more or less constant. The serial implementation requires more Python loops and iterations for a larger dataset; the parallel implementation, on the other hand, sends simultaneous threads to the GPU to compute.


Figure 11.

Parallel vs. serial behaviour of the initial render time.

In Figure 11 we find that there is a kink in the measurement of the serial implementation of the prototype. This is probably because the resolutions of the datasets taken into consideration do not increase linearly, i.e., the 1024 core run, 2 x 1024 core run, and 4 x 1024 core run datasets. As a result of this non-linear increase in resolution, there is a "kink" in the graph.

Other Features Implemented:

Context Visualization:

We develop a context [5] in the existing visualization to place more emphasis on the region of interest. Certain data subsets are viewed in more detail, while other regions are shown only for context [5]. We develop an intuitive feature that uses different graphic resources such as opacity, color, and space to succinctly create visualizations of a focused region of interest. The reason for introducing such a visualization is that, for very large datasets, every detail cannot be explicitly shown to the user. We apply this in our software by rendering the patches of least interest as wireframes. Figure 8 represents a visualization with multiple regions of interest. The patch colored blue is the focus patch. The visualization also highlights the neighbors of the patch of interest in red. The patch colored yellow was once a region of interest chosen by the user. To indicate that it is a previous region of interest, the


original color of the patch is preserved. In Figure 8 we also see that the simulated domain is characterized by horizontal slice planes. These slice planes are responsible for cutting away unwanted parts of the simulated domain. Figure 9 represents a visualization where the region of interest is exhibited across the refinement levels of the 3-D AMR dataset. We define the context of the visualization using the wireframe rendering to focus on the regions of interest across all levels.

Ability to Take Screenshots:

Working: with the press of the "z" key on the keyboard, the user can take a cropped screenshot of the current visualization. Using the QFileDialog option in Qt, the software pops up a dialogue box that enables saving the image currently seen by the user. The crop function crops the desired pixels from the image and returns the result. It finds the minimum_x, minimum_y, maximum_x, and maximum_y values to initiate the cropping.


    CONCLUSION AND FUTURE WORK

In this project, we have applied various concepts of parallel computing and information visualization to create visualizations that scale to real time. The optimization and parallelization of the current prototype have resulted in an average ~25x improvement in the initial rendering time. We have also developed intuitive context visualization techniques that have increased the clarity of the visualization of the region of interest. This project has also provided me the opportunity to exploit the computational power of the GPU. The results of the performance evaluation are very interesting and promising.

Many other aspects of the tool could be improved in the near future. It would be valuable if the user could view multiple localized visualizations of the region of interest while keeping the global reference of the simulated domain in view. Although the current toolset supports real-time rendering of patches up to 8 times the initial resolution of the 1024 core run dataset, the rendering is platform dependent and relative to every machine. Implementing platform-independent software would be a challenging direction.


    References

[1] Bhatele, A., Gamblin, G.T., Isaacs, K.E., Gunney, B.T.N., Schulz, M.W.J., Bremer, P.-T. and Hamann, B. (2012), Novel views of performance data to analyze large-scale adaptive applications, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.

[2] Isaacs, K.E., Landge, A.G., Gamblin, G.T., Bremer, P.-T., Pascucci, V. and Hamann, B. (2012), Exploring performance data with Boxfish, electronic poster presentation, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 13 pages.

[3] Bhatele, A., Gamblin, G.T., Langer, S.H., Bremer, P.-T., Draeger, E.W., Hamann, B., Isaacs, K.E., Landge, A.G., Levine, J.A., Pascucci, V., Schulz, M.W.J. and Still, C.H. (2012), Mapping applications with collectives over sub-communicators on torus networks, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.

[4] Marsha Berger and Phillip Colella, Local adaptive mesh refinement for shock hydrodynamics, Journal of Computational Physics, 82:64-84, May 1989. Lawrence Livermore National Laboratory, Technical Report No. UCRL-97196.

[5] Helwig Hauser, Generalizing Focus+Context Visualization, VRVis Research Center, Vienna, Austria.

[6] E-mails and conversations involving Prof. Bernd Hamann and Katherine E. Isaacs, Department of Computer Science, University of California, Davis.

[7] Thomas McCabe, A Complexity Measure, IEEE Transactions on Software Engineering, Vol. SE-2, No. 4, December 1976.

[8] Andreas Klöckner, Computer Science, University of Illinois at Urbana-Champaign, http://mathema.tician.de/software/pycuda.

[9] Chapter 3 (Viewing), Chapter 4 (Color), and Chapter 5 (Lighting), http://www.glprogramming.com/red/.

[10] Boxfish Documentation, User Guide and Developer Guide, https://scalability.llnl.gov/performance-analysis-through-visualization/software/boxfish/docs/index.html.

[11] http://docs.python.org/2/library/profile.html.

[12] R. D. Hornung and S. R. Kohn, Managing application complexity in the SAMRAI object-oriented framework, Concurrency and Computation: Practice and Experience, vol. 14, no. 5, pp. 347-368, 2002.

[13] B. T. Gunney, A. M. Wissink, and D. A. Hysom, Parallel clustering algorithms for structured AMR, Journal of Parallel and Distributed Computing, vol. 66, no. 11, pp. 1419-1430, 2006.

[14] PyCuda examples, http://wiki.tiker.net/PyCuda/Examples.

[15] David Luebke, John Owens, Mike Roberts, Cheng-Han Lee, Introduction to Parallel Programming, https://www.udacity.com/course/cs344.


[16] Andreas Klöckner, GPU Metaprogramming using PyCUDA: Methods & Applications, Division of Applied Mathematics, Brown University, Nvidia GTC, October 2, 2009.

[17] Helmut Doleisch, Martin Gasser, Helwig Hauser, Interactive Feature Specification for Focus+Context Visualization of Complex Simulation Data, VRVis Research Center, Vienna, Austria.


    Appendix A: Figures

Figure 1.

The different parts of the prototype plotted against the individual percentage that each part contributes to the overall McCabe's complexity.

Figure 2(a).

Dataset 1: the 1024 core run dataset.


Figure 2(b).

Dataset 2: the 2 x 1024 core run dataset.

Figure 2(c).

Dataset 3: the 4 x 1024 core run dataset.


Figure 3(a).

The total contribution of each part of the prototype to the overall percentage. The two parameters taken into consideration are the number of function calls and the CPU seconds taken to respond, for dataset 1.

Figure 3(b).


    Percent contribution in dataset 2.

Figure 3(c).

Percent contribution in dataset 3.

Figure 4.

The graph plots the improvement in rendering time when the optimization of replacing the Euclidean distance with the Manhattan distance is applied.


Figure 5(a).

The graph plots the improvement in response time when the optimization of converting floating point values to integer values is applied to DrawCubes().

Figure 5(b).

The graph plots the improvement in response time when the optimization of converting floating point values to integer values is applied to Neighbour_Change().


Figure 9.

Visualizations demonstrating the patch linking between the refinement levels of the AMR dataset. The wireframe rendering represents the context visualization.