DataFinder: A Python Application for Scientific Data Management

51
Folie 1 EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008 DataFinder: A Python Application for Scientific Data Management EuroPython 2008 (July 9th 2008, Vilnius) Andreas Schreiber <[email protected]> German Aerospace Center (DLR), Cologne http://www.dlr.de/sc

description

EuroPython 2008 (09.07.2008, Vilnius)

Transcript of DataFinder: A Python Application for Scientific Data Management

Page 1: DataFinder: A Python Application for Scientific Data Management

Folie 1EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 (July 9th 2008, Vilnius)

Andreas Schreiber <[email protected]>

German Aerospace Center (DLR), Cologne

http://www.dlr.de/sc

Page 2: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 2

The DLRGerman Aerospace Research Center Space Agency of the Federal Republic of Germany

Page 3: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 3

5,600 employees working in 28 research institutes and facilities

at 13 sites.

Offices in Brussels, Paris and Washington. Köln

Lampoldshausen

Stuttgart

Oberpfaffenhofen

Braunschweig

Göttingen

Berlin-

Bonn

Trauen

Hamburg

Neustrelitz

Weilheim

Bremen-

Sites and employees

Page 4: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 4

Short Overview

DataFinder is a software for efficient management of scientific and technical data

Focus on huge data sets

Development of the DataFinder by DLR

Primary functionality

Structuring of data through assignment of meta information and self-defined data models

Flexible usage of heterogeneous storage resources

Integration in the working environment

Page 5: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 5

Introduction

DataFinder founded by DLR

National Grid project AeroGrid

Page 6: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 6

IntroductionBackground

Large-scale simulations

aerodynamics

material science

climate

Tons of measured data

wind-tunnel experiments

earth observations

traffic data

Page 7: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 7

IntroductionData Management Problem

Typical organizational situations

No central data management policy

Every employee organizes his/her data individually

Researchers spend about 30% of their time searching for data

Problem with data left behind by temporary staff

Increase of data size and regulations

Rapidly growing volume of simulation and experimental data

Legal requirements for long-term availability of data (up to 50 years!)

Situation similar at many organizations

All ~30 DLR institutes

Other research labs and agencies

Industry

Page 8: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 8

DataFinder HistorySearch for solution for scientific data management

Definition of “standard problem” (helicopter simulation)

Test case for evaluation of software

Evaluation of commercial product data management (PDM) systems

PDM systems could manage data but with huge amount of costs

PDM systems have many unneeded functionalities

PDM systems have self-defined or unreadable scripting languages for extension and customization (Tcl etc.)

Development of DataFinder

Lightweight data management client and existing server solution

Just enough functionality for our problems (no paid but unused features!)

Page 9: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 9

DataFinder DevelopmentFrom Java Prototype to Python Product…

Development of prototype in Java

Data could be manages with prototype successfully

Drawbacks: Java problems on important platforms (e.g., SGI IRIX)

Embedded Jython interpreter great feature for users

User: “The Java GUI is like shit, but the Python scripting is great. We want a pure Python solution!”

Development of DataFinder product from scratch in Python

Page 10: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 10

Python for Scientists and EngineersReasons for Python in Research and Industry

Observations:

Scientists and engineers don’t want to write software but just solve their problems

If they have to write code, it must be as easy as possible

Why Python is perfect?

Very easy to learn and easy to use ( = steep learning curve)

Allows rapid development ( = short development time)

Inherent great maintainability

“Python has the cleanest, most-scientist- or engineer friendly syntax and semantics.”

(Paul F. Dubois. Ten good practices in scientific programming. Comp. In Sci. Eng., Jan/Feb 1999, pp.7-11)

“Python has the cleanest, most-scientist- or engineer friendly syntax and semantics.”

(Paul F. Dubois. Ten good practices in scientific programming. Comp. In Sci. Eng., Jan/Feb 1999, pp.7-11)

“I want to design planes,

not software!”

Page 11: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 11

DataFinder OverviewBasic Concept

Client-Server solution

Based on open and stable standards, such as XML and WebDAV

Extensive use of standard software components (open source / commercial), limited own development at client side

Page 12: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 12

WebDAVWeb-based Distributed Authoring & Versioning

Extension of HTTP

Allows to manage files on remote servers collaboratively

WebDAV supports

Resources (“files”)

Collections (“directories”)

Properties (“meta data”, in XML format)

Locking

WebDAV extensions

Versioning (DeltaV)

Access control (ACP)

Search (DASL)

Page 13: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 13

DataFinder OverviewClient and Server

Client

User client

Administrator client

Implementation: Python with Qt

Server

WebDAV server for meta data and data structure

Data Store concept

Abstracts access to managed data

Flexible usage of heterogeneous storage resources

Implementation: Various existing server solutions (third-party)

Page 14: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 14

DataFinder ClientGraphical User Interfaces

User Client Administrator Client

Implementation in Python with Qt/PyQt

Implementation in Python with Qt/PyQt

Page 15: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 15

DataFinder ServerSupported WebDAV servers

Commercial Server Solution

Tamino XML database (Software AG)

Open Source Server Solutions

Apache HTTP Web server and module mod_dav

Default storage: file system (mod_dav_fs)

Module Catacomb (mod_dav_repos) + Relational database

(http://catacomb.tigris.org)

Page 16: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 19External Medias

(CD, DVD,…)

Mass Data StorageData Stores

Meta Data Server

Department

Employee

Simulation

Geometry

Grid Generation

Flow Solution

Visualisation

Data Access

WebDAV Server

FTP/GridFTP Server

Tivoli StorageManager

Storage Resource Broker

File System

Amazon S3

Logical View User ClientStorage

Locations

Page 17: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 20

DataFinder Technical Aspects

Access privilege management

Authentication using WebDAV and LDAP

Authorization for users and groups based on WebDAV (ACP)

Client available on many platforms

Linux, Windows, …

Restricted by availability of Python 2.5 and Qt 3 + PyQt

Extensible through Python scripts

Python application programming interface (API)

Accessing data and meta data

Page 18: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 21

Python API User Client Extension with GUI

import threadingfrom datafinder.application import search_supportfrom datafinder.gui.user import facade

def searchAndDisplayResult(): """Searches and displays the result in the search result logging window. """ query = "displayname contains ‘test’ OR displayname == ‘ab’" result = search_support.performSearch(query) resultLogger = facade.getSearchResultLogger() for path in result.keys(): resultLogger.info("Found item %s." % path)

thread = threading.Thread(target=searchAndDisplayResult)thread.start()

import threadingfrom datafinder.application import search_supportfrom datafinder.gui.user import facade

def searchAndDisplayResult(): """Searches and displays the result in the search result logging window. """ query = "displayname contains ‘test’ OR displayname == ‘ab’" result = search_support.performSearch(query) resultLogger = facade.getSearchResultLogger() for path in result.keys(): resultLogger.info("Found item %s." % path)

thread = threading.Thread(target=searchAndDisplayResult)thread.start()

Page 19: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 22

Python API Command Line Example (without GUI)

# Get APIfrom datafinder.application import ExternalFacade

externalFacade = ExternalFacade.getInstance()

# Connect to a repositoryexternalFacade.performBasicDatafinderSetup(username, password, startUrl)

# Download the whole contentrootItem = externalFacade.getRootWebdavServerItem()items = externalFacade.getCollectionContents(rootItem)for item in items: externalFacade.downloadFile(item, baseDirectory)

# Get APIfrom datafinder.application import ExternalFacade

externalFacade = ExternalFacade.getInstance()

# Connect to a repositoryexternalFacade.performBasicDatafinderSetup(username, password, startUrl)

# Download the whole contentrootItem = externalFacade.getRootWebdavServerItem()items = externalFacade.getCollectionContents(rootItem)for item in items: externalFacade.downloadFile(item, baseDirectory)

Page 20: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 23

Additional “Batteries”…Used Libraries beyond the Python Standard Library (1)

PyQt (http://www.riverbankcomputing.co.uk/software/pyqt)Interface to the Qt GUI framework (currently Qt 3)Used for DataFinder UI layer

Pyparsing (http://pyparsing.wikispaces.com/)Creating and executing simple grammarsUsed for highlighting search expressions

python-ldap (http://python-ldap.sourceforge.net/)Object-oriented API to access LDAP serversAuthentication against LDAP / ActiveDirectory server

paramiko (http://www.lag.net/paramiko)SSH2 protocol implementation

Page 21: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 24

Additional “Batteries”…Used Libraries beyond the Python Standard Library (2)

PyGlobus (http://www-itg.lbl.gov/gtg/projects/pyGlobus)

Interface to The Globus Toolkit

Used for GridFTP Data Store

Boto (http://code.google.com/p/boto)

Interfaces to Amazon Web Services

Used for S3 (Simple Storage Service) Data Store

davlib (http://www.webdav.org/mod_dav/davlib.py)

WebDAV client library

Used for core WebDAV functions

Page 22: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 25

WebDAV Client LibrarySupport for DAV Extensions

Provides an object-oriented interface for accessing WebDAV server

Extracted from DataFinder source

WebDAV client-side library supports

Core WebDAV specification

Access Control Protocol

Basic Versioning (experimental)

DAV Searching and Locating

Secure HTTP connections

Implementation based on davlib and standard httplib

Apache License Version 2

Project Site: http://sourceforge.net/projects/pythonwebdavlib

Page 23: DataFinder: A Python Application for Scientific Data Management

Folie 26EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Working with DataFinder…

Page 24: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 27

Configuration and CustomizationPreparing DataFinder for certain “use cases”

Requirements Analysis

Analyze data, working environment, and users workflows

Configuration

Define and configure data model

Configure distributed storage resources (Data Stores)

Customization

Write functional extensions with Python scripts

Page 25: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 28

DataFinder ConfigurationData Model and Data Stores

Logical view to data

Definition of data structuring and meta data(“data model”)

Separated storage of data structure / meta data and actual data files

Flexible use of (distributed) storage resources

File system, WebDAV, FTP, GridFTP

Amazon S3 (Simple Storage Service)

Tivoli Storage Manager (TSM)

Storage Resource Broker (SRB)

Complex search mechanism to find data

Department

Employee

Simulation

Geometry

Grid Generation

Flow Solution

Visualisation

Page 26: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 29

Data StructureMapping of Organizational Data Structures

User

Project A

Project B

Project C

File 1

File 2

Simulation I

Experiment

Simulation II

Project MegaCode UltraUser EddieKey Value

Object(collection)

Object(file)

Relation Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Project MegaCode UltraUser EddieKey Value

Attributes(meta data)

Page 27: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 30

Meta Data

Describe and annotate data (“files”) and collections (“directories”)

Different levels of meta data

Required attributes defined by administrator

User is free to choose additional ones

Different types of meta data

String

Numbers (float, double, …)

Lists

Pictures

Links

Stored in XML format

User can search in meta data

Page 28: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 31

Impact for Users

“Damn! I’m a great scientist!I want freedom to have

my own directory layout…”

DataFinder restricts the rights of users!

Enforcement of “good behavior”

User must comply to organizational standards

Data is stored in defined (directory) hierarchy on data server

Required meta data must be set prior upload

User have certain access rights within hierarchy

Page 29: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 32

Customization Python-Scripting for Extension and Automation

Integration of DataFinder with environment

User, infrastructure, software, …

Extension of DataFinder by Python scripts

Actions for resources (i.e., files, directories)

User interface extensions

Typical automations and customizations

Data migration and data import

Start of external application (with downloaded data files)

Extraction of meta data from result files

Automation of recurring tasks (“workflows”)

Page 30: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 33

DataFinder Scripting Downloading File and Starting Application# Download the selected file and try to execute it.from datafinder.application import ExternalFacadefrom guitools.easygui import *import osfrom tempfile import *from win32api import ShellExecute # Get instance of ExternalFacade to access DataFinder APIfacade = ExternalFacade.getInstance() # Get currently selected collection in DataFinder Server-View resource = facade.getSelectedResource()

if resource != None: tmpFile = mktemp(ressource.name) facade.downloadFile(resource, tmpFile)

if os.path.exists(tmpFile): ShellExecute(0, None, tmpFile, "", "", 1)else: msgbox("No file selected to execute.")

# Download the selected file and try to execute it.from datafinder.application import ExternalFacadefrom guitools.easygui import *import osfrom tempfile import *from win32api import ShellExecute # Get instance of ExternalFacade to access DataFinder APIfacade = ExternalFacade.getInstance() # Get currently selected collection in DataFinder Server-View resource = facade.getSelectedResource()

if resource != None: tmpFile = mktemp(ressource.name) facade.downloadFile(resource, tmpFile)

if os.path.exists(tmpFile): ShellExecute(0, None, tmpFile, "", "", 1)else: msgbox("No file selected to execute.")

Page 31: DataFinder: A Python Application for Scientific Data Management

Folie 34EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Examples…

Page 32: DataFinder: A Python Application for Scientific Data Management

Folie 35EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Example 1:Example 1:Turbine SimulationTurbine Simulation

Page 33: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 36

Example 1: Fluid Dynamics SimulationTurbine Simulation

Design of new turbine engines

High-resolution simulation of flow

Computational Fluid Dynamics (CFD)

Use of high-performance computing resources (Cluster / Grid)

Huge amounts of data (>100 GByte)

DataFinder used for

Management of results

Automation of simulation runs

Starting pre-/post processing

Used for CFD-code TRACE (DLR)

See http://www.aero-grid.de

Page 34: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 37

Simulation steps (example):

1. splitCGNSPreparing data for TRACE

2. TRACE (CFD solver)Main computation

3. fillCGNSConflating results

4. Post ProcessingData reduction and visualization

Automation with customized DataFinder

Turbine SimulationData Model

Page 35: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 38

Turbine Simulation: Graphical User Interface

Page 36: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 39

Turbine Simulation: Customized GUI Extensions11

22

3

4

55

1.1. Create new simulationCreate new simulation

2.2. Start a simulation Start a simulation

3.3. Query statusQuery status

4.4. Cancel simulationCancel simulation

5.5. Project overviewProject overview

Page 37: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 40

Turbine Simulation Starting External Applications

1. CGNS Infos / ADFview / CGNS Plot

2. TRACE GUI

3. Gnuplot

1

2

3

Page 38: DataFinder: A Python Application for Scientific Data Management

Folie 41EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Example 2:Example 2:Automobile SupplierAutomobile Supplier

Page 39: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 42

Example 2: Automobile SupplierDataFinder for Simulation and Data Management

Tasks

• Automation and management of simulation of customers

• Mapping of specific work sequence

• High flexibility regarding customers requirements

Page 40: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 43

Automobile SupplierData Model

Page 41: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 44

Automobile SupplierConfiguration of Customers Parameters

Page 42: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 45

Automobile SupplierManagement of Simulations

Status overview

Create, change, and deletedata sets

Manage versions of datafiles

Parameter overview

Page 43: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 46

Automobile SupplierUpload, Download, and Versioning of Files

Upload/download of results

Versioning of results

Script store results in DataFinder data structures

Page 44: DataFinder: A Python Application for Scientific Data Management

Folie 47EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Example 3:Example 3:Air Traffic Air Traffic ManagementManagement

Example 3:Example 3:Air Traffic Air Traffic ManagementManagement

Page 45: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 48

Example 3: Air Traffic Monitoring Database for Air Traffic Monitoring

Air traffic monitoring is important for researchPredictions of air trafficNew traffic management approaches

Usage of DataFinderDatabase for traffic data and reportsProject oriented view

Page 46: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 49

Database for Air Traffic MonitoringData Model and Data Migration

Page 47: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 50

Database for Air Traffic MonitoringData Import Wizard

Import of all data sources (PDF/Word/text files, Excel, Access, …)

Classification into multiple categories

Prevention of duplicated data and consistent naming

Page 48: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 51

Database for Air Traffic MonitoringSearch Results

Page 49: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 52

Current Work and Future Plans

Current work

Migration to Qt 4

Improved usage (e.g., search dialogs)

Integration with Shibboleth

Future

Web interfaces

Jython

Embedding in Java/Eclipseapplications

Reuse of custom GUI dialogs

Migration to Py3k

Page 50: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 54

Links

DataFinder Web site

http://www.dlr.de/datafinder

Python WebDAV library

http://sourceforge.net/projects/pythonwebdavlib

Catacomb

http://catacomb.tigris.org

AeroGrid Project

http://www.aero-grid.de

Page 51: DataFinder: A Python Application for Scientific Data Management

EuroPython 2008 > Andreas Schreiber > DataFinder > 09.07.2008

Folie 56

Questions?Questions?