The ARCS Data Analysis Software Michael Aivazis California Institute of Technology.

Date posted: 21-Dec-2015

The ARCS Data Analysis Software

Michael Aivazis

California Institute of Technology

Fractals in software

• “Drip programming”
  – may generate aesthetically interesting flow charts
  – but it is not a desirable practice
• Advanced technology may actually complicate matters
  – complex data structures
  – objects
  – user interfaces
  – multiple platforms
  – distributed computing
  – high performance computing
  – security
  – …
  – the Grid

Pollock’s “Autumn Rhythm”

… or Michael’s framework?

Software Roadmap

• Account for incident flux
• Remove background
• Convert from time to energy
• Correct for detector efficiency
• Bin into rings of constant scattering angle
• Convert from angle to momentum
• Subtract multi-phonon and multiple scattering
• Correct for absorption

Data reductions

[Figure: data-flow diagram. Layer labels: C++, Python. Module boxes: ReadHDFfile, Subtractbackground, Rebin, Sq. rt errs, WriteHDFfile. Port labels: filename, raw counts, Spect. Info, times, data, errors2, num_e, e_min, e_max, e_i, t_min, t_max, energies, counts in energy, errors.]

From TOF to energy

Data flow for TOF to Energy conversion
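The conversion and rebinning stages named in this data flow can be sketched in Python as below. This is only an illustration under stated assumptions, not the ARCS implementation: the function names, the units (microseconds, metres, meV) and the equal-width binning are all hypothetical choices. The only fixed ingredient is the kinetic-energy relation E = ½ m v² for a neutron that covers a known distance in a measured time.

```python
import math

# Neutron kinetic energy in meV for a speed in m/s: E = 0.5 * m_n * v**2.
NEUTRON_MASS_KG = 1.67492749804e-27
JOULES_PER_MEV = 1.602176634e-22
MEV_PER_SPEED_SQUARED = 0.5 * NEUTRON_MASS_KG / JOULES_PER_MEV  # about 5.227e-6


def tof_to_energy(tof_us, distance_m):
    """Return the neutron energy (meV) for a flight time in microseconds
    over a flight path of distance_m metres (hypothetical parameter names)."""
    speed = distance_m / (tof_us * 1.0e-6)  # m/s
    return MEV_PER_SPEED_SQUARED * speed ** 2


def rebin_counts(energies, counts, errors2, e_min, e_max, num_e):
    """Accumulate counts into num_e equal-width bins on [e_min, e_max).

    Squared errors are summed alongside the counts, which is valid when
    the input channels are statistically independent.
    """
    width = (e_max - e_min) / num_e
    binned = [0.0] * num_e
    binned_errors2 = [0.0] * num_e
    for energy, count, err2 in zip(energies, counts, errors2):
        if e_min <= energy < e_max:
            # min() guards against floating-point rounding at the top edge.
            index = min(int((energy - e_min) / width), num_e - 1)
            binned[index] += count
            binned_errors2[index] += err2
    return binned, binned_errors2


def sqrt_errors(errors2):
    """Convert squared errors to errors, the last step before writing out."""
    return [math.sqrt(value) for value in errors2]
```

For a direct-geometry measurement, the energy transfer would then be the incident energy e_i minus the final energy returned by `tof_to_energy`; that subtraction is left out here to keep the sketch short.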

Design directions

• Integrate analysis modules using scripting
  – Python
• Data flow paradigm
  – Well understood
  – Easy to implement and document
• Meta-data in XML
  – fully reproducible description of the data analysis pipeline
  – tag and archive data
  – record the version number of each module used in the analysis
• Enable distributed computing
  – XMLRPC, SOAP, …
• File formats: NeXus + XML meta-data
  – Reuse, reuse, reuse
  – Augment, contribute
  – HDF5!
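One way to picture the XML meta-data idea is a small record that lists each module applied to a data file together with its version and parameters. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names (`pipeline`, `module`, `parameter`), the file name and the version strings are invented for illustration and do not come from NeXus or from the ARCS software.

```python
import xml.etree.ElementTree as ET


def describe_pipeline(input_file, modules):
    """Build an XML record of an analysis run.

    modules is a list of (name, version, parameters) tuples. Every element
    and attribute name used here is illustrative, not a NeXus or ARCS schema.
    """
    root = ET.Element("pipeline", {"input": input_file})
    for name, version, parameters in modules:
        module = ET.SubElement(root, "module", {"name": name, "version": version})
        for key, value in sorted(parameters.items()):
            ET.SubElement(module, "parameter", {"name": key, "value": str(value)})
    return root


# Hypothetical run: two modules with made-up versions and parameters.
record = describe_pipeline(
    "run.h5",
    [
        ("SubtractBackground", "1.0", {}),
        ("Rebin", "1.2", {"e_min": -50.0, "e_max": 50.0, "num_e": 100}),
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Archiving `xml_text` next to the reduced data would let a later reader see which module versions and parameters produced it.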

Flexibility through the use of scripting

• Scripting enables us to
  – Organize the large number of parameters
  – Allow the analysis environment to discover new capabilities without the need for recompilation or relinking
• The python interpreter
  – The interpreter
    • modern object oriented language
    • robust, portable, mature, well supported, well documented
    • easily extensible
    • rapid application development
  – Support for parallel programming
    • trivial embedding of the interpreter in an MPI compliant manner
    • a python interpreter on each compute node
    • MPI is fully integrated: bindings + OO layer
  – No measurable impact on either performance or scalability
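The point about discovering capabilities without recompilation can be illustrated with Python's standard `importlib`: a module is looked up by name at run time, so adding a new module on disk is enough to make it available. This is a generic sketch, not the mechanism Pyre uses; `load_capability` is a hypothetical helper and the standard-library module `math` stands in for an analysis module.

```python
import importlib


def load_capability(module_name, function_name):
    """Import a module by name at run time and return one of its callables."""
    module = importlib.import_module(module_name)
    capability = getattr(module, function_name)
    if not callable(capability):
        raise TypeError(f"{module_name}.{function_name} is not callable")
    return capability


# The names could come from a configuration file or from user input;
# "math" and "sqrt" are placeholders for an analysis module and routine.
sqrt = load_capability("math", "sqrt")
value = sqrt(16.0)
```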

Writing python bindings

• Given a “low level” routine, such as

    double arcs::add(double a, double b);

• and a wrapper

    PyObject * arcs_add(PyObject *, PyObject * args)
    {
        double a, b;
        // unpack two Python numbers into C doubles
        int ok = PyArg_ParseTuple(args, "dd", &a, &b);
        if (!ok) {
            return 0;  // the Python exception has already been set
        }

        double result = arcs::add(a, b);

        // convert the C++ double back into a Python float
        return Py_BuildValue("d", result);
    }

• one can place the result of the routine in a python variable

    c = arcs.add(2, 2)

• The general case is not much more complicated than this

Pyre Architecture

[Figure: Pyre architecture diagram. Labels: component, bindings, engine, library, infrastructure, service, framework, abstract class, specialization, package, python, FORTRAN/C/C++.]

• The integration framework is a set of co-operating abstract services

Pyre services

• journal
  – flexible control over the generation and delivery of simulation diagnostics from the compute nodes to the workstation
• monitor
  – a distributed service for low bandwidth, on the fly visualizations
  – currently used mostly for status monitoring and debugging
• timer
• weaver
  – a general source code generation facility
  – support for many languages
    • FORTRAN, C, C++, python, HTML, XML
    • from makefiles to optimized C++ sources
  – automatic web page creation for cgi scripts
  – supports user authentication
    • passwords, soon user SSL certificates
• blade
  – a toolkit independent UI generator

Distributed services

[Figure: distributed services diagram. Column labels: Workstation, Services, Compute nodes. Box labels: analysis, journal, monitor, component1, component2.]

IRIS Explorer

• Data flow paradigm appears natural
  – usability problems are focused on knowledge of what is possible
  – used by many commercial and open source tools
• Improvements
  – decouple UI from diagram logic
  – interface
    • use OpenGL!
    • collaborative
    • interesting and relevant research
  – diagram logic
    • thin, reusable component
    • scripting
    • multi-layered control
  – development can use existing solutions as a guide of what not to do
  – many modules already available in pyre
  – enable distributed programming
• Target for prototype: early 2004

Visual Programming Environment

[Figure labels: Client, Remote Server, Database Server, Beowulf Cluster]

• An open standard for remote procedure calls
• Allows us to perform the computation
  – where the data lives
  – independently of the local computing capacity
• Security is an issue

XMLRPC: Enabling distributed computing
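A remote procedure call of this kind can be sketched with Python's standard `xmlrpc` modules. The example below runs the server and the client in one process on the local machine purely for demonstration; the `add` routine and the `demo` helper are stand-ins, not part of the ARCS software. Note that `SimpleXMLRPCServer` performs no authentication, which is one reason security needs separate attention.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def add(a, b):
    """Stand-in for a remote analysis routine."""
    return a + b


def demo():
    """Start a local XML-RPC server, call it once, and shut it down."""
    # Port 0 asks the operating system for any free port on localhost.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(add, "add")
    host, port = server.server_address
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The client sees the remote routine as an ordinary method call.
        proxy = xmlrpc.client.ServerProxy(f"http://{host}:{port}/")
        return proxy.add(2, 2)
    finally:
        server.shutdown()
        server.server_close()
```

In a real deployment the server would run where the data lives and the client would use that machine's address in place of the local one.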

• Application capabilities
  – depend on the remote server
  – exported to the client
• Boxes represent
  – data sources
  – computational modules
• Wires represent
  – data flows
  – control
• Boxes have input and output ports where wires can be attached

Prototype User Interface

Data Analysis Execution

• User hits “Run”

• Applet interprets wiring diagram as XMLRPC commands

• Server receives the commands, arranges a Python script, and data processing commences.
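How a wiring diagram might be turned into an ordered sequence of calls can be sketched as follows. The slides do not show the actual applet or server code, so everything here is an assumption: boxes are plain Python callables, wires are pairs of box names, and the execution order comes from the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_diagram(boxes, wires, sources):
    """Run every box after the boxes wired into its inputs.

    boxes   maps a box name to a callable that takes the upstream outputs
            in the order the wires are listed.
    wires   is a list of (upstream box, downstream box) pairs.
    sources maps a source box name to the data it provides.
    """
    # For each box, collect the names of the boxes feeding its input ports.
    inputs = {name: [] for name in list(boxes) + list(sources)}
    for upstream, downstream in wires:
        inputs[downstream].append(upstream)

    results = dict(sources)
    # static_order() yields each box only after all of its predecessors.
    for name in TopologicalSorter(inputs).static_order():
        if name not in results:
            results[name] = boxes[name](*(results[up] for up in inputs[name]))
    return results


# Hypothetical diagram: two data sources feed a subtraction, then a sum.
results = run_diagram(
    boxes={
        "subtract": lambda data, background: [d - b for d, b in zip(data, background)],
        "total": lambda values: sum(values),
    },
    wires=[("data", "subtract"), ("background", "subtract"), ("subtract", "total")],
    sources={"data": [5, 7, 9], "background": [1, 1, 1]},
)
```

A cycle in the wiring makes `static_order()` raise `graphlib.CycleError`, which gives a natural place to reject an invalid diagram before any processing starts.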

User interface prototypes - I

User interface prototypes - II

User interface prototypes - III

MATLAB

• If you must…
• Fully accessible from Python
• Support involves converting result of data analysis into MATLAB native arrays

Software engineering practices

• Version control
  – Provides a record of the evolution of the software
  – CVS: well supported, open source
• Configuration management
  – Uniform, portable build procedure
  – Automatic, regular builds of the entire software base
  – config: a system based on make
  – merlin: a python-based replacement under development
• Regression testing
  – Test cases that
    • Exercise expected behavior
    • Exercise fixes for known bugs
• Bug tracking
  – Organize the “to do” list, the feature requests … and the known defects
  – Gnats: well supported, open source
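The two kinds of regression test listed above, one for expected behavior and one for fixes to known bugs, might look like this with Python's standard `unittest`. The routine `rebin_sum` and the defects mentioned in the comments are hypothetical, chosen only to make the pattern concrete.

```python
import unittest


def rebin_sum(values, factor):
    """Sum consecutive groups of `factor` values (hypothetical routine under test)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]


class RebinRegressionTest(unittest.TestCase):
    def test_expected_behavior(self):
        self.assertEqual(rebin_sum([1, 2, 3, 4], 2), [3, 7])

    def test_known_bug_partial_last_group(self):
        # Guards a hypothetical past defect: a trailing partial group was dropped.
        self.assertEqual(rebin_sum([1, 2, 3], 2), [3, 3])

    def test_known_bug_zero_factor(self):
        # Guards a hypothetical past defect: a zero factor was not rejected.
        with self.assertRaises(ValueError):
            rebin_sum([1, 2], 0)


def run_tests():
    """Run the suite programmatically, as an automatic regular build might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RebinRegressionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```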

Design directions

• Integrate analysis modules using scripting
  – Python
• Data flow paradigm
  – Well understood
  – Easy to implement and document
• Meta-data in XML
  – fully reproducible description of the data analysis pipeline
  – tag and archive data
  – record the version number of each module used in the analysis
• Enable distributed computing
  – XMLRPC, SOAP, …
• File formats: NeXus + XML meta-data
  – Reuse, reuse, reuse
  – Augment, contribute
  – HDF5!