DataFinder: A Python Application for Scientific Data Management
-
Upload
andreas-schreiber -
Category
Technology
-
view
5.490 -
download
5
description
Transcript of DataFinder: A Python Application for Scientific Data Management
Folie 1EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
DataFinder: A Python Application for Scientific Data Management
EuroPython 2008 (July 9th 2008, Vilnius)
Andreas Schreiber <[email protected]>
German Aerospace Center (DLR), Cologne
http://www.dlr.de/sc
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 2
The DLRGerman Aerospace Research Center Space Agency of the Federal Republic of Germany
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 3
5,600 employees working in 28 research institutes and facilities
at 13 sites.
Offices in Brussels, Paris and Washington. Köln
Lampoldshausen
Stuttgart
Oberpfaffenhofen
Braunschweig
Göttingen
Berlin-
Bonn
Trauen
Hamburg
Neustrelitz
Weilheim
Bremen-
Sites and employees
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 4
Short Overview
DataFinder is a software for efficient management of scientific and technical data
Focus on huge data sets
Development of the DataFinder by DLR
Primary functionality
Structuring of data through assignment of meta information and self-defined data models
Flexible usage of heterogeneous storage resources
Integration in the working environment
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 5
Introduction
DataFinder founded by DLR
National Grid project AeroGrid
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 6
IntroductionBackground
Large-scale simulations
aerodynamics
material science
climate
…
Tons of measured data
wind-tunnel experiments
earth observations
traffic data
…
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 7
IntroductionData Management Problem
Typical organizational situations
No central data management policy
Every employee organizes his/her data individually
Researchers spend about 30% of their time searching for data
Problem with data left behind by temporary staff
Increase of data size and regulations
Rapidly growing volume of simulation and experimental data
Legal requirements for long-term availability of data (up to 50 years!)
Situation similar at many organizations
All ~30 DLR institutes
Other research labs and agencies
Industry
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 8
DataFinder HistorySearch for solution for scientific data management
Definition of “standard problem” (helicopter simulation)
Test case for evaluation of software
Evaluation of commercial product data management (PDM) systems
PDM systems could manage data but with huge amount of costs
PDM systems have many unneeded functionalities
PDM systems have self-defined or unreadable scripting languages for extension and customization (Tcl etc.)
Development of DataFinder
Lightweight data management client and existing server solution
Just enough functionality for our problems (no paid but unused features!)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 9
DataFinder DevelopmentFrom Java Prototype to Python Product…
Development of prototype in Java
Data could be manages with prototype successfully
Drawbacks: Java problems on important platforms (e.g., SGI IRIX)
Embedded Jython interpreter great feature for users
User: “The Java GUI is like shit, but the Python scripting is great. We want a pure Python solution!”
Development of DataFinder product from scratch in Python
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 10
Python for Scientists and EngineersReasons for Python in Research and Industry
Observations:
Scientists and engineers don’t want to write software but just solve their problems
If they have to write code, it must be as easy as possible
Why Python is perfect?
Very easy to learn and easy to use ( = steep learning curve)
Allows rapid development ( = short development time)
Inherent great maintainability
“Python has the cleanest, most-scientist- or engineer friendly syntax and semantics.”
(Paul F. Dubois. Ten good practices in scientific programming. Comp. In Sci. Eng., Jan/Feb 1999, pp.7-11)
“Python has the cleanest, most-scientist- or engineer friendly syntax and semantics.”
(Paul F. Dubois. Ten good practices in scientific programming. Comp. In Sci. Eng., Jan/Feb 1999, pp.7-11)
“I want to design planes,
not software!”
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 11
DataFinder OverviewBasic Concept
Client-Server solution
Based on open and stable standards, such as XML and WebDAV
Extensive use of standard software components (open source / commercial), limited own development at client side
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 12
WebDAVWeb-based Distributed Authoring & Versioning
Extension of HTTP
Allows to manage files on remote servers collaboratively
WebDAV supports
Resources (“files”)
Collections (“directories”)
Properties (“meta data”, in XML format)
Locking
WebDAV extensions
Versioning (DeltaV)
Access control (ACP)
Search (DASL)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 13
DataFinder OverviewClient and Server
Client
User client
Administrator client
Implementation: Python with Qt
Server
WebDAV server for meta data and data structure
Data Store concept
Abstracts access to managed data
Flexible usage of heterogeneous storage resources
Implementation: Various existing server solutions (third-party)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 14
DataFinder ClientGraphical User Interfaces
User Client Administrator Client
Implementation in Python with Qt/PyQt
Implementation in Python with Qt/PyQt
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 15
DataFinder ServerSupported WebDAV servers
Commercial Server Solution
Tamino XML database (Software AG)
Open Source Server Solutions
Apache HTTP Web server and module mod_dav
Default storage: file system (mod_dav_fs)
Module Catacomb (mod_dav_repos) + Relational database
(http://catacomb.tigris.org)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 19External Medias
(CD, DVD,…)
Mass Data StorageData Stores
Meta Data Server
Department
Employee
Simulation
Geometry
Grid Generation
Flow Solution
Visualisation
Data Access
WebDAV Server
FTP/GridFTP Server
Tivoli StorageManager
Storage Resource Broker
File System
Amazon S3
Logical View User ClientStorage
Locations
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 20
DataFinder Technical Aspects
Access privilege management
Authentication using WebDAV and LDAP
Authorization for users and groups based on WebDAV (ACP)
Client available on many platforms
Linux, Windows, …
Restricted by availability of Python 2.5 and Qt 3 + PyQt
Extensible through Python scripts
Python application programming interface (API)
Accessing data and meta data
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 21
Python API User Client Extension with GUI
import threadingfrom datafinder.application import search_supportfrom datafinder.gui.user import facade
def searchAndDisplayResult(): """Searches and displays the result in the search result logging window. """ query = "displayname contains ‘test’ OR displayname == ‘ab’" result = search_support.performSearch(query) resultLogger = facade.getSearchResultLogger() for path in result.keys(): resultLogger.info("Found item %s." % path)
thread = threading.Thread(target=searchAndDisplayResult)thread.start()
import threadingfrom datafinder.application import search_supportfrom datafinder.gui.user import facade
def searchAndDisplayResult(): """Searches and displays the result in the search result logging window. """ query = "displayname contains ‘test’ OR displayname == ‘ab’" result = search_support.performSearch(query) resultLogger = facade.getSearchResultLogger() for path in result.keys(): resultLogger.info("Found item %s." % path)
thread = threading.Thread(target=searchAndDisplayResult)thread.start()
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 22
Python API Command Line Example (without GUI)
# Get APIfrom datafinder.application import ExternalFacade
externalFacade = ExternalFacade.getInstance()
# Connect to a repositoryexternalFacade.performBasicDatafinderSetup(username, password, startUrl)
# Download the whole contentrootItem = externalFacade.getRootWebdavServerItem()items = externalFacade.getCollectionContents(rootItem)for item in items: externalFacade.downloadFile(item, baseDirectory)
# Get APIfrom datafinder.application import ExternalFacade
externalFacade = ExternalFacade.getInstance()
# Connect to a repositoryexternalFacade.performBasicDatafinderSetup(username, password, startUrl)
# Download the whole contentrootItem = externalFacade.getRootWebdavServerItem()items = externalFacade.getCollectionContents(rootItem)for item in items: externalFacade.downloadFile(item, baseDirectory)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 23
Additional “Batteries”…Used Libraries beyond the Python Standard Library (1)
PyQt (http://www.riverbankcomputing.co.uk/software/pyqt)Interface to the Qt GUI framework (currently Qt 3)Used for DataFinder UI layer
Pyparsing (http://pyparsing.wikispaces.com/)Creating and executing simple grammarsUsed for highlighting search expressions
python-ldap (http://python-ldap.sourceforge.net/)Object-oriented API to access LDAP serversAuthentication against LDAP / ActiveDirectory server
paramiko (http://www.lag.net/paramiko)SSH2 protocol implementation
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 24
Additional “Batteries”…Used Libraries beyond the Python Standard Library (2)
PyGlobus (http://www-itg.lbl.gov/gtg/projects/pyGlobus)
Interface to The Globus Toolkit
Used for GridFTP Data Store
Boto (http://code.google.com/p/boto)
Interfaces to Amazon Web Services
Used for S3 (Simple Storage Service) Data Store
davlib (http://www.webdav.org/mod_dav/davlib.py)
WebDAV client library
Used for core WebDAV functions
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 25
WebDAV Client LibrarySupport for DAV Extensions
Provides an object-oriented interface for accessing WebDAV server
Extracted from DataFinder source
WebDAV client-side library supports
Core WebDAV specification
Access Control Protocol
Basic Versioning (experimental)
DAV Searching and Locating
Secure HTTP connections
Implementation based on davlib and standard httplib
Apache License Version 2
Project Site: http://sourceforge.net/projects/pythonwebdavlib
Folie 26EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Working with DataFinder…
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 27
Configuration and CustomizationPreparing DataFinder for certain “use cases”
Requirements Analysis
Analyze data, working environment, and users workflows
Configuration
Define and configure data model
Configure distributed storage resources (Data Stores)
Customization
Write functional extensions with Python scripts
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 28
DataFinder ConfigurationData Model and Data Stores
Logical view to data
Definition of data structuring and meta data(“data model”)
Separated storage of data structure / meta data and actual data files
Flexible use of (distributed) storage resources
File system, WebDAV, FTP, GridFTP
Amazon S3 (Simple Storage Service)
Tivoli Storage Manager (TSM)
Storage Resource Broker (SRB)
Complex search mechanism to find data
Department
Employee
Simulation
Geometry
Grid Generation
Flow Solution
Visualisation
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 29
Data StructureMapping of Organizational Data Structures
User
Project A
Project B
Project C
File 1
File 2
Simulation I
Experiment
Simulation II
Project MegaCode UltraUser EddieKey Value
Object(collection)
Object(file)
Relation Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Project MegaCode UltraUser EddieKey Value
Attributes(meta data)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 30
Meta Data
Describe and annotate data (“files”) and collections (“directories”)
Different levels of meta data
Required attributes defined by administrator
User is free to choose additional ones
Different types of meta data
String
Numbers (float, double, …)
Lists
Pictures
Links
Stored in XML format
User can search in meta data
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 31
Impact for Users
“Damn! I’m a great scientist!I want freedom to have
my own directory layout…”
DataFinder restricts the rights of users!
Enforcement of “good behavior”
User must comply to organizational standards
Data is stored in defined (directory) hierarchy on data server
Required meta data must be set prior upload
User have certain access rights within hierarchy
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 32
Customization Python-Scripting for Extension and Automation
Integration of DataFinder with environment
User, infrastructure, software, …
Extension of DataFinder by Python scripts
Actions for resources (i.e., files, directories)
User interface extensions
Typical automations and customizations
Data migration and data import
Start of external application (with downloaded data files)
Extraction of meta data from result files
Automation of recurring tasks (“workflows”)
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 33
DataFinder Scripting Downloading File and Starting Application# Download the selected file and try to execute it.from datafinder.application import ExternalFacadefrom guitools.easygui import *import osfrom tempfile import *from win32api import ShellExecute # Get instance of ExternalFacade to access DataFinder APIfacade = ExternalFacade.getInstance() # Get currently selected collection in DataFinder Server-View resource = facade.getSelectedResource()
if resource != None: tmpFile = mktemp(ressource.name) facade.downloadFile(resource, tmpFile)
if os.path.exists(tmpFile): ShellExecute(0, None, tmpFile, "", "", 1)else: msgbox("No file selected to execute.")
# Download the selected file and try to execute it.from datafinder.application import ExternalFacadefrom guitools.easygui import *import osfrom tempfile import *from win32api import ShellExecute # Get instance of ExternalFacade to access DataFinder APIfacade = ExternalFacade.getInstance() # Get currently selected collection in DataFinder Server-View resource = facade.getSelectedResource()
if resource != None: tmpFile = mktemp(ressource.name) facade.downloadFile(resource, tmpFile)
if os.path.exists(tmpFile): ShellExecute(0, None, tmpFile, "", "", 1)else: msgbox("No file selected to execute.")
Folie 34EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Examples…
Folie 35EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Example 1:Example 1:Turbine SimulationTurbine Simulation
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 36
Example 1: Fluid Dynamics SimulationTurbine Simulation
Design of new turbine engines
High-resolution simulation of flow
Computational Fluid Dynamics (CFD)
Use of high-performance computing resources (Cluster / Grid)
Huge amounts of data (>100 GByte)
DataFinder used for
Management of results
Automation of simulation runs
Starting pre-/post processing
Used for CFD-code TRACE (DLR)
See http://www.aero-grid.de
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 37
Simulation steps (example):
1. splitCGNSPreparing data for TRACE
2. TRACE (CFD solver)Main computation
3. fillCGNSConflating results
4. Post ProcessingData reduction and visualization
Automation with customized DataFinder
Turbine SimulationData Model
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 38
Turbine Simulation: Graphical User Interface
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 39
Turbine Simulation: Customized GUI Extensions11
22
3
4
55
1.1. Create new simulationCreate new simulation
2.2. Start a simulation Start a simulation
3.3. Query statusQuery status
4.4. Cancel simulationCancel simulation
5.5. Project overviewProject overview
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 40
Turbine Simulation Starting External Applications
1. CGNS Infos / ADFview / CGNS Plot
2. TRACE GUI
3. Gnuplot
1
2
3
Folie 41EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Example 2:Example 2:Automobile SupplierAutomobile Supplier
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 42
Example 2: Automobile SupplierDataFinder for Simulation and Data Management
Tasks
• Automation and management of simulation of customers
• Mapping of specific work sequence
• High flexibility regarding customers requirements
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 43
Automobile SupplierData Model
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 44
Automobile SupplierConfiguration of Customers Parameters
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 45
Automobile SupplierManagement of Simulations
Status overview
Create, change, and deletedata sets
Manage versions of datafiles
Parameter overview
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 46
Automobile SupplierUpload, Download, and Versioning of Files
Upload/download of results
Versioning of results
Script store results in DataFinder data structures
Folie 47EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Example 3:Example 3:Air Traffic Air Traffic ManagementManagement
Example 3:Example 3:Air Traffic Air Traffic ManagementManagement
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 48
Example 3: Air Traffic Monitoring Database for Air Traffic Monitoring
Air traffic monitoring is important for researchPredictions of air trafficNew traffic management approaches
Usage of DataFinderDatabase for traffic data and reportsProject oriented view
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 49
Database for Air Traffic MonitoringData Model and Data Migration
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 50
Database for Air Traffic MonitoringData Import Wizard
Import of all data sources (PDF/Word/text files, Excel, Access, …)
Classification into multiple categories
Prevention of duplicated data and consistent naming
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 51
Database for Air Traffic MonitoringSearch Results
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 52
Current Work and Future Plans
Current work
Migration to Qt 4
Improved usage (e.g., search dialogs)
Integration with Shibboleth
Future
Web interfaces
Jython
Embedding in Java/Eclipseapplications
Reuse of custom GUI dialogs
Migration to Py3k
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 54
Links
DataFinder Web site
http://www.dlr.de/datafinder
Python WebDAV library
http://sourceforge.net/projects/pythonwebdavlib
Catacomb
http://catacomb.tigris.org
AeroGrid Project
http://www.aero-grid.de
EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008
Folie 56
Questions?Questions?