Project Report ECS 199 Summer Session II


    Name: Sugeerth Murugesan

    Major: Computer Science

Course: ECS 199, Summer Session II 2013, 5 units

    Instructor: Bernd Hamann

VISUALIZING WORK ASSIGNMENT DATA IN AN ADAPTIVE MESH REFINEMENT LIBRARY:

FURTHER EXTENSIONS OF THE SYSTEM

    Objective:

Enormous amounts of data are generated daily. The lack of effective tools to analyze the collected data reduces the ability of data scientists to gain insights [7]. The only reasonable approach to analyzing such huge amounts of data is through effective and efficient visualization. The ability to support real-time visualization is an essential aspect of any visualization tool. The goal of this project is to support visualization of high-resolution datasets by designing, implementing, and testing methods that allow the data scientists using the tool to explore the visualizations in real time. This project is an extension of my work done at UC Davis during Summer Session I.

We add further extensions to the existing system that we developed earlier in spring and in Summer Session I. The module that we developed is the Patch Module in the performance visualization tool Boxfish. The operators that we implement in this project include context visualization [5], parallelizing and optimizing the data processing parts of the prototype, and projecting regions of interest through different levels of the AMR hierarchy. AMR (Adaptive Mesh Refinement) [4] is a process by which the cells in a physical space are refined only in the areas where there is complex activity. At any particular instant, the space is subdivided into smaller grid cells. In a High Performance Computing (HPC) scenario, the simulated domain (application domain) is the physical space, and the hardware domain is the physical hardware of the supercomputer. Parts of the application domain are mapped to the hardware domain. We visualize the physical domain and optimize those visualizations.

We optimize our prototype by:

1) Clearly identifying the parts of the non-parallel implementation that consume the relatively largest amounts of resources (computing, storage, and number of function calls).

2) Understanding the behavior of the application by running the non-parallel implementation on 3 different resolutions of the same type:

a. Initial resolution = 1024 core run dataset (Figure 2(a))

b. Twice the initial resolution = 2 x 1024 core run dataset (Figure 2(b))

c. Four times the initial resolution = 4 x 1024 core run dataset (Figure 2(c))


3) Applying our optimization and parallelization techniques to those parts that appear most expensive in terms of computation, storage, and number of function calls.

The methods used to realize the overall software redesign through optimization and parallelization are:

1) Code profiling: profiling the current execution pattern of the prototype and finding the most data-intensive operations.

2) Code optimizing: optimizing the data-intensive operations that were found during code profiling.

3) Code parallelizing: parallelizing the expensive operations identified in the code optimizing phase.

    1. Code Profiling:

The Python libraries used for the analysis are:

cProfile: reports the amount of time in CPU seconds and the number of function calls that each part (function) of the program took.

Pympler: a Python library that reports the current size of the data structures in the code. For example, the size of the "Patch dict" in the 4 x 1024 core run dataset is 1.44 MB.

PyMetrics: determines the number of basic independent paths through a code region. This gives a basic idea of which parts of the code need the most attention (in terms of parallelizing and exploiting more resources from the CPU).

PyCuda: lets the programmer access Nvidia's CUDA parallel computation API from Python [10]. It utilizes the power of CUDA programming through the driver API.

1.2 Finding McCabe's Cyclomatic Complexity:

McCabe's cyclomatic complexity [7] analysis identifies the important parts of the code that need to be given attention: those that contain the largest number of independent execution paths. This forms a basis for finding which parts of the code consume the largest amount of resources (CPU seconds taken to execute and number of function calls made).

The parts of the code (applicable to our prototype) that are analyzed in the following sections are defined below:

SetPatchSize(): (Function responsible for the implosion operators.) It sets the size of the patches and announces the change to the Patch Module. As soon as the patch sizes are updated, the display lists are updated to reflect the change in the rendering of patches.

ChangeValue_for_transparency(): (Function to change the degree of opacity of patches.) The function sets the value chosen when the opacity slider is updated.


ChangeValue_for_Level_slider(): (Function to change the level of interest.) Visualizes the desired level of interest. The function is connected to a level slider with which the user can restrict the visualization to certain levels of refinement.

UpdateHighlights(): (Function that propagates any change in the resolution.) Given a list of the patch ids to be highlighted, it displays a dialogue box that quantifies the data attributes of the patches of interest.

Neighbour_Change(): (Function that finds the neighbors of the patch of interest.) It returns the neighbors of the patch of interest in a list. It does not return patches that have already been sliced off using the Slice and Dice operation.

Process_Planes(): (Function that initializes values for the slice planes.) Initializes the planes data structure (hash map) to render during the slice and dice mode.

Magnify(): (Function that returns patches in the vicinity of the patch of interest.) The function finds the distance between the patch of interest and its neighbors. The function plays a crucial role in creating the context visualization.

Range(): Calculates the distance between the patch of interest and the entire simulated domain. It also calculates where exactly the slice planes are to be placed in the simulated domain.

DoPick(): A function responsible for the selection of patches and slice planes.

Plane_highlight(): Function responsible for the selection of a particular plane in the simulated domain.

Highlight_Drawing(): The function responsible for the visualization of multiple patches of interest, the level of interest, the magnification visualization, and the highlighting of patch neighbors.

The graph in Figure 1 plots the different parts of the prototype against the number of independent paths (McCabe's complexity) in the designed prototype. Figure 3 is a graph that represents the relation between the number of function calls made and the amount of CPU seconds taken to respond for each part of the prototype.

2. Inferences Drawn and Optimization:

With this brief analysis of the code, we applied several techniques for optimizing and parallelizing our non-parallel implementation. The goal is to minimize the cost of expensive operations in the prototype to create a real-time visualization. We implemented the following operations to improve the overall performance of the software:

1. Replacement of the Euclidean distance formula with the Manhattan distance, or city block distance. The average improvement is:

a. Neighbour_Change(): 37.34%.

2. Conversion of the operational numpy data structures from float32 to int32. The average improvement in performance is:


a. Neighbour_Change(): 25.73%.

b. DrawCubes(): 2.74%.

Please note that the improvement in performance is calculated as:

Percent improvement = ((time before - time after) / time before) x 100

3. Reducing on-the-fly operations to improve the overall rendering process. This is accomplished by storing the calculated color values in a numpy data structure. The average change in performance is:

a. Draw_Cubes(): ~25x faster

b. Highlight_Drawing(): ~2x faster

c. Range(): ~5x slower

2.1 Manhattan Distance:

To reduce the computational time, the Manhattan distance is introduced to replace the Euclidean distance formula. The Manhattan distance is the number of city blocks between the two points of interest. According to the graph, we see that, on average, there is a 37.34% decrease in the amount of time taken by the CPU to respond. The purpose of finding the distance in the Neighbour_Change() function is to highlight the patches that are in the region of interest. Restructuring the code to accommodate the Manhattan distance also improved the overall computational time.

2.2 Integer Conversion:

The conversion of floating point data structures to integer data structures resulted in a decrease in the computational time taken by the CPU of:

a. Neighbour_Change(): 25.73%.

b. DrawCubes(): 2.74%.

The conversion is a down-sampling process in which the data is rounded off to the nearest integer, limiting the precision of the float value. The data structures that were rounded off to integers are:

self.Distance = distance between the regions of interest.

self.maxval, self.minval = maximum and minimum values of the color map.

self.Slice = slice plane values along the x, y, and z axes.

The conversion of float values to integer values in the DrawCubes() function resulted in a 2.74% decrease in computational time, whereas in Neighbour_Change() it resulted in a 25.73% decrease.

2.3 Reducing On-the-Fly Computations:

The purpose of reducing the on-the-fly computations is to decrease the initial time taken to render the patches. Earlier, the patches were retrieved from the Python dict and their color values were computed on the fly during each render pass.


We parallelize this computation to exploit the benefits of GPU computing. The purpose of such an operation is to reduce the overall computation time by utilizing the GPU for the expensive parts of the code.

    Computing and storing the individual RGBA color scheme for every patch.

    Implementation details:

The data structure self.values is an n-dimensional numpy array in the Patch Module. It contains the values of the data attributes that are dropped onto the rendering scene, e.g., max-hops extents, owner extents, and MPI ranks. The data structure is divided into a number of threads and blocks to support the computation by PyCuda. A number of simultaneous threads, each computing the color value of a patch, is launched; the number of threads is analogous to the number of patches. Thus, each thread computes the color value of one patch and stores it in a memory location. To compute the number of blocks required to calculate the color attributes of the patches, the following equation is used:

Number of blocks = ceil(number of patches / threads per block)

From a study of the existing hardware and the configuration of threads and blocks in the GPU, the configuration of 1024 threads per block, with the number of blocks given by the equation above, was one of the optimal configurations of threads and blocks.

The configurations of threads per block that were taken into consideration were:

1) 1024 x 1 x 1

2) 64 x 4 x 4

3) 32 x 32 x 1

4) 512 x 2 x 1

Inferring from the analyses, the configuration of 32 x 32 x 1 threads per block yielded approximately a 36.6% improvement in computation time over the average computational time of the other configurations. The time measured here is the computation time and does not include the communication time. The communication time in this context refers to the amount of time involved in transferring the data to and from the GPU and the amount of time involved in converting lists to numpy arrays.

The graph in Figure 10 compares the average computational time on the CPU (Python looping time) vs. the GPU (time required to compute on the GPU). The scalar values are:

1) CPU = 0.119815 s

2) GPU = 0.00078925 s


    Figure 10.

    Amount of time taken by the GPU and CPU to compute color.

If we take into account the amount of time required to transfer data to and from the GPU, the overhead involved in transferring data, computing values, and storing and retrieving the computed values to and from a data structure amounts to a surprising increase in the initial data processing time of 33.41%.

Calculating the Manhattan distance from a particular point of interest:

The Manhattan distance, or city block distance, is computed when the highlight or Magnify operations are initiated. It is the distance between the center of interest and the centers of all the patches in the vicinity. The procedure followed for the color values is repeated to find the number of threads and blocks giving maximum performance. As the CUDA kernel in the Python implementation only computes on numpy arrays, all the centers of the patches are stored in a numpy array and sent to the kernel for computation. The computed values are stored in another numpy array.

The distances from the patch of interest to the other patches are stored at their respective indexes. This computation is done in parallel: the same function is performed simultaneously on multiple threads.

Parallel implementation vs. serial implementation:

The graph plots the total time (communication time + computation time) against the 3 datasets. It is interesting to see that as the datasets become more complex, the computational time increases in the case of the serial implementation, whereas in the case of the parallel implementation the computational time stays more or less constant. The serial implementation requires more Python loops and iterations for a larger dataset; the parallel implementation, on the other hand, sends simultaneous threads to the GPU to compute.


Figure 11.

Parallel vs. serial behaviour of the initial render time.

In Figure 11 we find that there is a kink in the measurement of the serial implementation of the prototype. This is probably because the resolutions of the datasets taken into consideration do not increase linearly, i.e., the 1024 core run, 2 x 1024 core run, and 4 x 1024 core run datasets. As a result of this non-linear increase in resolution, there is a "kink" in the graph.

Other Features Implemented:

Context Visualization:

We develop a context [5] in the existing visualization to place more emphasis on the region of interest. Certain data subsets are viewed in more detail, while other regions are shown only for context [5]. We develop an intuitive feature that uses different graphic resources such as opacity, color, and space to succinctly create visualizations of a focused region of interest. The reason for introducing such a visualization is that, for very large datasets, every detail cannot be explicitly shown to the user. We apply this in our software by rendering the patches of least interest as wireframes. Figure 8 represents a visualization with multiple regions of interest. The patch colored blue is the focus patch. The visualization also highlights the neighbors of the patch of interest in red. The patch colored yellow was once a region of interest chosen by the user. To indicate that it is a previous region of interest, the


original color of the patch is preserved. In Figure 8 we also see that the simulated domain is characterized by horizontal slice planes. These slice planes are responsible for cutting away unwanted parts of the simulated domain. Figure 9 represents a visualization where the region of interest is exhibited across the refinement levels of the 3-D AMR dataset. We define the context of the visualization using the wireframe rendering to focus on the regions of interest across all levels.

Ability to Take Screenshots:

Working: with the press of the "z" key on the keyboard, the user can take a cropped screenshot of the current visualization. Using the QFileDialog option in Qt, the software pops up a dialogue box that enables saving the image currently seen by the user. The crop function crops the desired pixels from the image and returns the result. It finds the minimum_x, minimum_y, maximum_x, and maximum_y values to initiate the cropping.


    CONCLUSION AND FUTURE WORK

In this project, we have applied various concepts of parallel computing and information visualization to create visualizations that scale to real time. The optimization and parallelization of the current prototype have resulted in an average ~25x improvement in the initial rendering time. We have also developed intuitive context visualization techniques that have increased the clarity of the visualization of the region of interest. This project has also provided me the opportunity to exploit the computational power of the GPU. The results of the performance evaluation are very interesting and promising.

Many other aspects of the tool could be improved in the near future. It would be valuable if the user could view multiple localized visualizations of the region of interest while keeping the global reference of the simulated domain in view. Although the current toolset supports real-time rendering of patches up to 8 times the initial resolution of the 1024 core run dataset, the rendering is platform dependent and relative to every machine. Implementing platform-independent software would be a challenging direction.


    References

[1] Bhatele, A., Gamblin, G.T., Isaacs, K.E., Gunney, B.T.N., Schulz, M.W.J., Bremer, P.-T. and Hamann, B. (2012), Novel views of performance data to analyze large-scale adaptive applications, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.

[2] Isaacs, K.E., Landge, A.G., Gamblin, G.T., Bremer, P.-T., Pascucci, V. and Hamann, B. (2012), Exploring performance data with Boxfish, electronic poster presentation, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 13 pages.

[3] Bhatele, A., Gamblin, G.T., Langer, S.H., Bremer, P.-T., Draeger, E.W., Hamann, B., Isaacs, K.E., Landge, A.G., Levine, J.A., Pascucci, V., Schulz, M.W.J. and Still, C.H. (2012), Mapping applications with collectives over sub-communicators on torus networks, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.

[4] Marsha Berger and Phillip Colella, Local adaptive mesh refinement for shock hydrodynamics, Journal of Computational Physics, 82:64-84, May 1989. Lawrence Livermore National Laboratory, Technical Report No. UCRL-97196.

[5] Helwig Hauser, Generalizing Focus+Context Visualization, VRVis Research Center, Vienna, Austria.

[6] E-mails and conversations involving Prof. Bernd Hamann and Katherine E. Isaacs, Department of Computer Science, University of California, Davis.

[7] Thomas McCabe, A Complexity Measure, IEEE Transactions on Software Engineering, Vol. SE-2, No. 4, December 1976.

[8] Andreas Klöckner, Computer Science, University of Illinois at Urbana-Champaign, http://mathema.tician.de/software/pycuda.

[9] Chapter 3 (Viewing), Chapter 4 (Color), and Chapter 5 (Lighting), http://www.glprogramming.com/red/.

[10] Boxfish Documentation, User Guide and Developer Guide, https://scalability.llnl.gov/performance-analysis-through-visualization/software/boxfish/docs/index.html.

[11] http://docs.python.org/2/library/profile.html.

[12] R. D. Hornung and S. R. Kohn, Managing application complexity in the SAMRAI object-oriented framework, Concurrency and Computation: Practice and Experience, vol. 14, no. 5, pp. 347-368, 2002.

[13] B. T. Gunney, A. M. Wissink, and D. A. Hysom, Parallel clustering algorithms for structured AMR, Journal of Parallel and Distributed Computing, vol. 66, no. 11, pp. 1419-1430, 2006.

[14] PyCuda examples, http://wiki.tiker.net/PyCuda/Examples.

[15] David Luebke, John Owens, Mike Roberts, Cheng-Han Lee, Introduction to Parallel Programming, https://www.udacity.com/course/cs344.


[16] Andreas Klöckner, GPU Metaprogramming using PyCUDA: Methods & Applications, Division of Applied Mathematics, Brown University, Nvidia GTC, October 2, 2009.

[17] Helmut Doleisch, Martin Gasser, Helwig Hauser, Interactive Feature Specification for Focus+Context Visualization of Complex Simulation Data, VRVis Research Center, Vienna, Austria.


    Appendix A: Figures

Figure 1.

The different parts of the prototype plotted against the individual percentage that each part contributes to the overall McCabe's complexity.

Figure 2(a).

Dataset 1: the 1024 core run dataset.


Figure 2(b).

Dataset 2: the 2 x 1024 core run dataset.

Figure 2(c).

Dataset 3: the 4 x 1024 core run dataset.


Figure 3(a).

The total contribution of each part of the prototype to the overall percentage. The two parameters taken into consideration are the number of function calls and the CPU seconds taken to respond, for dataset 1.

Figure 3(b).


    Percent contribution in dataset 2.

Figure 3(c).

Percent contribution in dataset 3.

Figure 4.

The graph plots the improvement in rendering time when the optimization of replacing the Euclidean distance with the Manhattan distance is applied.


Figure 5(a).

The graph plots the improvement in response time when the optimization of converting floating point values to integer values is applied to DrawCubes().

Figure 5(b).

The graph plots the improvement in response time when the optimization of converting floating point values to integer values is applied to Neighbour_Change().


Figure 9.

Visualizations demonstrating the patch linking between the refinement levels of the AMR dataset. The wireframe rendering represents the context visualization.