The ARCS Data Analysis Software
Michael Aivazis
California Institute of Technology
Date posted: 21-Dec-2015
Fractals in software
• “Drip programming”
– may generate aesthetically interesting flow charts
– but it is not a desirable practice
• Advanced technology may actually complicate matters
– complex data structures
– objects
– user interfaces
– multiple platforms
– distributed computing
– high performance computing
– security
– …
– the Grid
[Figure: Pollock’s “Autumn Rhythm” … or Michael’s framework?]
Data reductions
• Account for incident flux
• Remove background
• Convert from time to energy
• Correct for detector efficiency
• Bin into rings of constant scattering angle
• Convert from angle to momentum
• Subtract multi-phonon and multiple scattering
• Correct for absorption
From TOF to energy
[Figure: data flow diagram spanning a Python layer and a C++ layer. Modules: ReadHDFfile, Subtract background, Rebin, Sq. rt errs, WriteHDFfile. Data on the wires: filename, raw counts, times, Spect. Info, data, errors2, errors, energies, counts in energy. Parameters: num_e, e_min, e_max, e_i, t_min, t_max.]
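Two of the reduction steps, background subtraction and the conversion from time of flight to energy, can be illustrated with a minimal sketch. It is not the ARCS implementation: the function names, the 2.2 m flight path, and the sample counts are assumptions made for illustration, and only the elementary relation E = ½·m·(L/t)² is used, without the incident-energy and rebinning logic suggested by the diagram.

```python
import math

NEUTRON_MASS_KG = 1.67492749804e-27   # neutron mass
JOULES_PER_MEV = 1.602176634e-22      # one milli-electronvolt in joules


def subtract_background(counts, errors2, background, background_errors2):
    """Subtract the background bin by bin; the squared errors add."""
    data = [c - b for c, b in zip(counts, background)]
    data_errors2 = [e + be for e, be in zip(errors2, background_errors2)]
    return data, data_errors2


def tof_to_energy_mev(time_s, distance_m):
    """Kinetic energy (meV) of a neutron covering distance_m in time_s."""
    velocity = distance_m / time_s
    return 0.5 * NEUTRON_MASS_KG * velocity ** 2 / JOULES_PER_MEV


def sqrt_errors(errors2):
    """Turn squared errors into errors."""
    return [math.sqrt(e) for e in errors2]


# Hypothetical input: three time bins with Poisson errors (errors2 == counts).
times = [1.0e-3, 2.0e-3, 4.0e-3]      # seconds
raw_counts = [120.0, 80.0, 30.0]
background = [20.0, 20.0, 20.0]

data, errors2 = subtract_background(raw_counts, raw_counts, background, background)
energies = [tof_to_energy_mev(t, distance_m=2.2) for t in times]
errors = sqrt_errors(errors2)
```

Carrying squared errors through the pipeline and taking the square root only at the end matches the errors2 and Sq. rt errs labels in the diagram.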
Design directions
• Integrate analysis modules using scripting
– Python
• Data flow paradigm
– Well understood
– Easy to implement and document
• Meta-data in XML
– fully reproducible description of the data analysis pipeline
– tag and archive data
– record the version number of each module used in the analysis
• Enable distributed computing
– XMLRPC, SOAP, …
• File formats: NeXus + XML meta-data
– Reuse, reuse, reuse
– Augment, contribute
– HDF5!
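The idea of XML meta-data that records every module and its version can be sketched with the Python standard library. The element names (pipeline, module, parameter), the version strings, and the file name below are hypothetical; the slides do not specify a schema.

```python
import xml.etree.ElementTree as ET


def describe_pipeline(steps):
    """Serialize (module name, version, parameters) tuples to an XML string."""
    root = ET.Element("pipeline")
    for name, version, parameters in steps:
        module = ET.SubElement(root, "module", name=name, version=version)
        for key, value in parameters.items():
            ET.SubElement(module, "parameter", name=key, value=str(value))
    return ET.tostring(root, encoding="unicode")


# Hypothetical record of an analysis run.
steps = [
    ("ReadHDFfile", "0.1", {"filename": "run.h5"}),
    ("Rebin", "0.3", {"num_e": 100}),
]
xml_text = describe_pipeline(steps)

# Reading the record back recovers which version of each module was used.
restored = ET.fromstring(xml_text)
versions = {m.get("name"): m.get("version") for m in restored.iter("module")}
```

Because the description is plain text, it can be archived next to the data and replayed later to reproduce the analysis.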
Flexibility through the use of scripting
• Scripting enables us to
– Organize the large number of parameters
– Allow the analysis environment to discover new capabilities without the need for recompilation or relinking
• The python interpreter
– The interpreter
  • modern object oriented language
  • robust, portable, mature, well supported, well documented
  • easily extensible
  • rapid application development
– Support for parallel programming
  • trivial embedding of the interpreter in an MPI compliant manner
  • a python interpreter on each compute node
  • MPI is fully integrated: bindings + OO layer
– No measurable impact on either performance or scalability
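Discovering capabilities at run time, with no recompilation or relinking, can be sketched with importlib from the standard library. The step descriptions are hypothetical and the math module stands in for an analysis module; the slides do not show how the framework actually performs the lookup.

```python
import importlib


def load_capability(module_name, function_name):
    """Import a module by name at run time and fetch one of its functions."""
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# Parameters kept as plain data; in practice they might come from a script
# or a configuration file (an assumption, not something the slides state).
analysis_steps = [
    {"module": "math", "function": "sqrt", "argument": 16.0},
    {"module": "math", "function": "floor", "argument": 2.7},
]

results = [
    load_capability(step["module"], step["function"])(step["argument"])
    for step in analysis_steps
]
```

Adding a new capability then amounts to installing a new module and naming it in the step list.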
Writing python bindings
• Given a “low level” routine, such as
    double arcs::add(double a, double b);
• and a wrapper
    PyObject * arcs_add(PyObject *, PyObject * args)
    {
        double a, b;
        int ok = PyArg_ParseTuple(args, "dd", &a, &b);
        if (!ok) { return 0; }
        double result = arcs::add(a, b);
        return Py_BuildValue("d", result);
    }
• one can place the result of the routine in a python variable
    c = arcs.add(2, 2)
• The general case is not much more complicated than this
Pyre Architecture
[Figure: layered architecture diagram. Labels: framework, service, component, bindings, engine, library, infrastructure. Legend: abstract class, specialization, package. Layers: python, FORTRAN/C/C++.]
• The integration framework is a set of co-operating abstract services
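The split between abstract classes and their specializations can be illustrated with a small sketch. The class names Component, Scale, Offset, and Framework are hypothetical and are not Pyre's actual classes; the point is only that the framework talks to the abstract interface while concrete components supply the behavior.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Abstract class: the only interface the framework relies on."""

    @abstractmethod
    def run(self, data):
        """Transform a list of numbers and return the result."""


class Scale(Component):
    """Specialization: multiply every value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def run(self, data):
        return [self.factor * x for x in data]


class Offset(Component):
    """Specialization: add a constant to every value."""

    def __init__(self, shift):
        self.shift = shift

    def run(self, data):
        return [x + self.shift for x in data]


class Framework:
    """Runs registered components in order, knowing only the abstract type."""

    def __init__(self):
        self.components = []

    def register(self, component):
        if not isinstance(component, Component):
            raise TypeError("expected a Component")
        self.components.append(component)

    def run(self, data):
        for component in self.components:
            data = component.run(data)
        return data


framework = Framework()
framework.register(Scale(2.0))
framework.register(Offset(1.0))
output = framework.run([1.0, 2.0])
```

In the architecture shown on the slide, a specialization would typically delegate to a compiled engine through bindings rather than compute in Python.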
Pyre services
• journal
– flexible control over the generation and delivery of simulation diagnostics from the compute nodes to the workstation
• monitor
– a distributed service for low bandwidth, on the fly visualizations
– currently used mostly for status monitoring and debugging
• timer
• weaver
– a general source code generation facility
– support for many languages
  • FORTRAN, C, C++, python, HTML, XML
  • from makefiles to optimized C++ sources
– automatic web page creation for cgi scripts
– supports user authentication
  • passwords, soon user SSL certificates
• blade
– a toolkit independent UI generator
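The kind of control the journal service offers, deciding per channel whether diagnostics are generated and where they are delivered, can be approximated with the standard logging module. This is a stand-in under stated assumptions, not the journal API: the channel names and the CollectingHandler class are hypothetical, and delivery to the workstation is simulated by appending to a list.

```python
import logging


class CollectingHandler(logging.Handler):
    """Stand-in for delivery to the workstation: keeps messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.name}: {record.getMessage()}")


def channel(name, handler, active=True):
    """Create a named diagnostic channel that can be switched on or off."""
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.propagate = False          # deliver only through our handler
    logger.setLevel(logging.DEBUG if active else logging.CRITICAL + 1)
    return logger


workstation = CollectingHandler()
rebin_debug = channel("journal.debug.rebin", workstation, active=False)
rebin_info = channel("journal.info.rebin", workstation, active=True)

rebin_debug.debug("bin edges computed")   # suppressed: channel is inactive
rebin_info.info("rebinned 3 spectra")     # delivered
```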
Distributed services
[Figure labels: Workstation, Services, Compute nodes; analysis, journal, monitor, component1, component2]
Visual Programming Environment
• Data flow paradigm appears natural
– usability problems are focused on knowledge of what is possible
– used by many commercial and open source tools
• Improvements
– decouple UI from diagram logic
– interface
  • use OpenGL!
  • collaborative
  • interesting and relevant research
– diagram logic
  • thin, reusable component
  • scripting
  • multi-layered control
– development can use existing solutions as a guide of what not to do
– many modules already available in pyre
– enable distributed programming
• Target for prototype: early 2004
XMLRPC: Enabling distributed computing
• An open standard for remote procedure calls
• Allows us to perform the computation
– where the data lives
– independently of the local computing capacity
• Security is an issue
[Figure labels: Client, Remote Server, Database Server, Beowulf Cluster]
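A remote procedure call over XML-RPC can be sketched with the standard library modules xmlrpc.server and xmlrpc.client. The rebin_sum routine is a hypothetical stand-in for an analysis module, and both ends run in one process on the loopback interface purely for illustration; a real deployment would put the server next to the data and would need the authentication that this sketch omits.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def rebin_sum(counts, width):
    """Hypothetical analysis routine: sum adjacent groups of `width` bins."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


# Server side: export the routine; port 0 lets the OS pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(rebin_sum, "rebin_sum")
host, port = server.server_address
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# Client side: the call looks local, but the computation runs on the server.
with ServerProxy(f"http://{host}:{port}/") as client:
    rebinned = client.rebin_sum([1, 2, 3, 4, 5, 6], 2)

server.shutdown()
server.server_close()
```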
Prototype User Interface
• Application capabilities
– depend on the remote server
– exported to the client
• Boxes represent
– data sources
– computational modules
• Wires represent
– data flows
– control
• Boxes have input and output ports where wires can be attached
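The boxes-and-wires model can be sketched as a small graph in which each box pulls values from the boxes wired to its input ports. The Box class and evaluate function are hypothetical names, not part of the prototype, and each box is reduced to a single output port to keep the sketch short.

```python
class Box:
    """A data source or computational module with wired input ports."""

    def __init__(self, name, function, inputs=()):
        self.name = name
        self.function = function
        self.inputs = list(inputs)    # wires: upstream boxes, one per port


def evaluate(box, cache=None):
    """Run the diagram by pulling data through the wires, each box once."""
    cache = {} if cache is None else cache
    if box.name not in cache:
        arguments = [evaluate(upstream, cache) for upstream in box.inputs]
        cache[box.name] = box.function(*arguments)
    return cache[box.name]


# Hypothetical diagram: two sources feed a subtraction, which feeds a sum.
source = Box("source", lambda: [4.0, 9.0, 16.0])
background = Box("background", lambda: [1.0, 1.0, 1.0])
subtract = Box(
    "subtract",
    lambda data, bkg: [d - b for d, b in zip(data, bkg)],
    [source, background],
)
total = Box("total", sum, [subtract])

result = evaluate(total)
```

Keeping the diagram logic in a thin layer like this, separate from any drawing code, is one way to decouple the UI from the diagram logic.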
Data Analysis Execution
• User hits “Run”
• Applet interprets wiring diagram as XMLRPC commands
• Server receives the commands, arranges them into a Python script, and data processing commences
MATLAB
• If you must…
• Fully accessible from Python
• Support involves converting result of data analysis into MATLAB native arrays
Software engineering practices
• Version control
– Provides a record of the evolution of the software
– CVS: well supported, open source
• Configuration management
– Uniform, portable build procedure
– Automatic, regular builds of the entire software base
– config: a system based on make
– merlin: a python-based replacement under development
• Regression testing
– Test cases that
  • Exercise expected behavior
  • Exercise fixes for known bugs
• Bug tracking
– Organize the “to do” list, the feature requests … and the known defects
– Gnats: well supported, open source
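The two kinds of regression test named above, one exercising expected behavior and one pinning down the fix for a known bug, can be sketched with the standard unittest module. The rebin function and the bug it once had (dropping a trailing partial group) are hypothetical examples, not taken from the ARCS code base.

```python
import unittest


def rebin(counts, width):
    """Sum adjacent groups of `width` bins, keeping a trailing partial group."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


class RebinRegression(unittest.TestCase):
    def test_expected_behavior(self):
        self.assertEqual(rebin([1, 2, 3, 4], 2), [3, 7])

    def test_fix_for_known_bug(self):
        # Hypothetical past defect: the last, incomplete group was dropped.
        self.assertEqual(rebin([1, 2, 3], 2), [3, 3])


# Run the suite programmatically, as an automatic regular build might.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RebinRegression)
outcome = unittest.TestResult()
suite.run(outcome)
```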