Analysis of Turbulence Datasets using a Database Cluster


Transcript of Analysis of Turbulence Datasets using a Database Cluster

Page 1: Analysis of Turbulence Datasets using a Database Cluster

Analysis of Turbulence Datasets using a Database Cluster: Requirements, Design, and Sample Applications

ITR group @ JHU

Charles Meneveau (1), Yi Li (1)*, Eric Perlman (2), Minping Wan (1), Yunke Yang (1), Randal Burns (2), Shiyi Chen (1), Gregory Eyink (3), Alex Szalay (4)

(1) Mechanical Engineering, (2) Computer Science, (3) Applied Mathematics & Statistics, (4) Physics and Astronomy.

(* Now at Sheffield Univ. UK)

Funding: National Science Foundation ITR and MRI

APS-DFD meeting, November 2007

Page 2: Analysis of Turbulence Datasets using a Database Cluster

Report on Cyber Fluid Dynamics

Page 3: Analysis of Turbulence Datasets using a Database Cluster


Report on Cyber Fluid Dynamics

Among recommendations:

Page 4: Analysis of Turbulence Datasets using a Database Cluster

Motivation:

• DNS of turbulence generating huge multi-Tbyte datasets
• Analyzed on the fly during simulation by practitioner
• Must redo runs for each new question that was not foreseen during simulation,
• Or: store time-history
• If time-history is stored, one must retrieve data from large storage to large CPU for analysis, e.g. with local accounts on HPC machine.

Must know how to use and run programs there, privileges for access to large chunks of memory; casual initial analysis not possible without investing significant effort to set up such access…

Page 5: Analysis of Turbulence Datasets using a Database Cluster

Motivation:

• This prevailing approach has hampered wide accessibility of large CFD datasets.

e.g. a mathematician wishing to test whether material loops in turbulence develop fractal wrinkling will have a hard time exploring this question on (e.g.) a 1,024³ DNS simulation without asking the simulator to do it for him/her.

• The fact is: "very large simulations remain out of reach of most". The problem will not automatically get better: even if "ftp" gets faster, the size of "top-ranked simulations" is growing even faster, i.e. without changing the current approach, top-ranked simulations will be accessible only to a shrinking subset of the scientific community.

Page 6: Analysis of Turbulence Datasets using a Database Cluster

Motivation:

• Develop database techniques for analysis of DNS
• What does that mean?

Page 7: Analysis of Turbulence Datasets using a Database Cluster

Look to other fields where database technologyis more developed (e.g. astronomy)

Page 8: Analysis of Turbulence Datasets using a Database Cluster

Example of web-accessible astronomy database (co-developed at JHU):

Sloan Digital Sky Survey (SDSS):

• When completed: images covering more than a quarter of the sky, and a 3-D map of about a million galaxies and quasars.

• SDSS ("the Cosmic Genome Project") is the most cited astronomical resource in the world,

• Led to more than 1,300 refereed publications so far.

• Its 2.5 Terabyte archive has been accessed more than 200 million times over the last five years.

• Some breakthroughs this astronomy resource has enabled:
- first precision measurements of the fundamental cosmological parameters
- discovery of the most distant quasars.

Page 9: Analysis of Turbulence Datasets using a Database Cluster

Example of web-accessible astronomy database (co-developed at JHU):

Page 10: Analysis of Turbulence Datasets using a Database Cluster
Page 11: Analysis of Turbulence Datasets using a Database Cluster

Can same technology be used to store and access a full DNS database?

• Turbulence has concentrated coherent structures (like galaxies?)
• Can data be stored in a way that facilitates particular queries?
• Can/should these be classified in some way?
• e.g. SQL search for "most intense vortices in region around high mean dissipation"??
• Too cumbersome. We'll show that a slightly different approach is needed
• First, describe the dataset:

Page 12: Analysis of Turbulence Datasets using a Database Cluster

Dataset: a 1,024⁴ space-time history of isotropic turbulence

• 1,024³ DNS of forced isotropic turbulence, in (2π)³ periodic box
• pseudo-spectral, de-aliased (phase-shifting, Rogallo)
• rms velocity: u′ = 0.679
• Taylor-microscale Reynolds number Rλ = 425
• Kolmogorov time scale: τK = 0.044
• Kolmogorov length scale: η = 2.86E-03
• Large-eddy turnover time: TL ≈ 2.2
• Simulation time-step: δt = 0.0002
• Every 10 steps stored, at Δt = 0.002
• 1,024 times to be stored, spanning TL (~1 integral time-scale)

[Figure: energy spectra E(k) vs. wavenumber k at t = −5.0, −3.0, −1.0, 0.0, with a k^(−5/3) reference slope]

Page 13: Analysis of Turbulence Datasets using a Database Cluster

Database cluster architecture:

• Each cluster node:
– Runs Windows 2003 Server and SQL Server 2005 (64-bit)
– Data tables striped across 12-15 sets of mirrored disks (4-11 TB/node)
– 4 or 8 GB memory per node
• Code written in C# runs as a user-defined procedure inside the SQL Server process using the CLR environment

Page 14: Analysis of Turbulence Datasets using a Database Cluster

Data generation and ingest:

• The DNS is run on a Linux cluster at JHU
• Output is stored as standard C/Fortran arrays
• Files are copied onto the data cluster for ingest
• Data are reshaped into smaller cubes and inserted into the database using "BULK INSERT"
– We read 72x1024x1024 slices into memory at a time

Will be 27 Tbytes

Page 15: Analysis of Turbulence Datasets using a Database Cluster

Unified addressing and partitioning

• Oct-trees and Z-curves (Morton) are used for:

– Indexing space
– Indexing workload
– Data distribution
– Data partitioning
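A Z-curve (Morton) key is obtained by interleaving the bits of the three grid indices, so cells that are close in 3-D space tend to be close in key order. A minimal illustrative sketch of such a key (not the database's actual implementation):

```c
#include <stdint.h>

/* Spread the low 21 bits of v so that each bit lands in every 3rd position. */
static uint64_t spread3(uint64_t v) {
    v &= 0x1fffffULL;
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v << 8)  & 0x100f00f00f00f00fULL;
    v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
    v = (v | v << 2)  & 0x1249249249249249ULL;
    return v;
}

/* Morton (Z-curve) key for grid indices (x, y, z): the bits of the three
   indices are interleaved (... z1 y1 x1 z0 y0 x0), so nearby cells get
   nearby keys -- the property exploited for indexing, data distribution,
   and partitioning. */
uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}
```

For example, morton3(1,0,0) = 1, morton3(0,1,0) = 2, morton3(0,0,1) = 4: sorting data blocks by such a key keeps spatial neighbors near each other on disk.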

Page 16: Analysis of Turbulence Datasets using a Database Cluster

Kernel of Computation & Edge Replication

• Most routines need "kernel of computation"
– Spatial interpolation: Lagrange polynomials, 6th and 8th order

For 6th-order interpolation, the six grid points x_{n−2}, …, x_{n+3} surrounding x are used:

u(x) = Σ_{i=1}^{6} l_{n−3+i}(x) u(x_{n−3+i})

l_j(x) = Π_{i=n−2, i≠j}^{n+3} (x − x_i) / Π_{i=n−2, i≠j}^{n+3} (x_j − x_i)
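The formula above can be sketched in C for one dimension; `lagrange6` is a hypothetical helper name, not the database's code (which evaluates the tensor-product form in 3-D inside SQL Server):

```c
/* 6-point (6th-order) Lagrange interpolation in 1-D.
   xs[0..5] are the grid points x_{n-2}..x_{n+3} bracketing x,
   f[0..5] the field values u(x_j) at those points. */
double lagrange6(const double xs[6], const double f[6], double x) {
    double u = 0.0;
    for (int j = 0; j < 6; j++) {
        double lj = 1.0;                       /* basis polynomial l_j(x) */
        for (int i = 0; i < 6; i++)
            if (i != j)
                lj *= (x - xs[i]) / (xs[j] - xs[i]);
        u += lj * f[j];                        /* sum_j l_j(x) u(x_j) */
    }
    return u;
}
```

The degree-5 interpolant reproduces polynomials up to degree 5 exactly; e.g. sampling f(x) = x³ at nodes 0…5 and evaluating at x = 2.5 returns 15.625 = 2.5³.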

Page 17: Analysis of Turbulence Datasets using a Database Cluster

Kernel of Computation & Edge Replication

• Most routines need "kernel of computation"
– Spatial interpolation: Lagrange polynomials, 6th and 8th order
– Velocity gradients: based on same Lagrange polynomials
– Spatial averages: not yet implemented

• We replicate the edge of each region
– Data needs met with a single read
– 48% overhead (64³ to 72³)
– Data "molecule" is a 72³ subcube

• Time interpolation: piecewise cubic Hermite interpolation; needs 4 t-neighbors

Page 18: Analysis of Turbulence Datasets using a Database Cluster

Kernel of Computation & Edge Replication

• Most routines need "kernel of computation"
– Spatial interpolation: Lagrange polynomials, 6th and 8th order
– Velocity gradients: based on same Lagrange polynomials
– Spatial averages: not yet implemented

• Time interpolation: piecewise cubic Hermite interpolation; needs 4 t-neighbors (t_{n−1}, t_n, t_{n+1}, t_{n+2}):

u(t) = a + b (t − t_n) + c (t − t_n)² + d (t − t_n)² (t − t_{n+1})

a = u(t_n)
b = u̇(t_n) = [u(t_{n+1}) − u(t_{n−1})] / (2Δt)
c = (1/Δt) [ (u(t_{n+1}) − u(t_n))/Δt − u̇(t_n) ]
d = (2/Δt²) [ (u̇(t_n) + u̇(t_{n+1}))/2 − (u(t_{n+1}) − u(t_n))/Δt ]
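The coefficients above translate directly into code. An illustrative sketch (`hermite_t` is a hypothetical name), using the four stored times t_{n−1}…t_{n+2} around the query time:

```c
/* Piecewise cubic Hermite interpolation in time on a uniform grid.
   um1, u0, u1, u2 are u at t_{n-1}, t_n, t_{n+1}, t_{n+2};
   dt is the storage interval; s = t - t_n with 0 <= s <= dt. */
double hermite_t(double um1, double u0, double u1, double u2,
                 double dt, double s) {
    double du0 = (u1 - um1) / (2.0 * dt);  /* udot(t_n), central difference     */
    double du1 = (u2 - u0)  / (2.0 * dt);  /* udot(t_{n+1}), central difference */
    double a = u0;
    double b = du0;
    double c = ((u1 - u0) / dt - du0) / dt;
    double d = (2.0 / (dt * dt)) * (0.5 * (du0 + du1) - (u1 - u0) / dt);
    return a + b * s + c * s * s + d * s * s * (s - dt);
}
```

By construction the cubic matches both u and u̇ at t_n and t_{n+1}, so consecutive storage intervals join smoothly.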

Page 19: Analysis of Turbulence Datasets using a Database Cluster

All access to database is done through Web Service

Accessing the database: http://turbulence.pha.jhu.edu

Page 20: Analysis of Turbulence Datasets using a Database Cluster

Enter position

Get velocity andpressure output

Enter time

Demo: web accessible queries:

Page 21: Analysis of Turbulence Datasets using a Database Cluster

Enter position

Get velocity andpressure output

Enter time

Demo: web accessible queries:

OK for casual checks, but not useful for serious turbulence/fluids research

Page 22: Analysis of Turbulence Datasets using a Database Cluster

Web-service calls can be made from C or Fortran on a client computer using the gSOAP libraries…

New paradigm: the client computer runs the analysis using Fortran, C, or Matlab codes, which fetch data from the database through a web service

Page 23: Analysis of Turbulence Datasets using a Database Cluster

GetVelocity()

Web service request for GetVelocity:

dx_p(t)/dt = u(x_p(t), t)

x_p^{n+1} = x_p^n + u(x_p^n, t_n) δt,   p = 1, 2, 3, 4

Page 24: Analysis of Turbulence Datasets using a Database Cluster

GetVelocity()

Mediator divides workload spatially

Page 25: Analysis of Turbulence Datasets using a Database Cluster

GetVelocity()

Request sent to each database node

Heavy local computations done close to the data:

• Lagrange-polynomial interpolation in space

• P-cubic Hermite interpolation in time

• Other functions: velocity gradient tensor, pressure, Hessian, Laplacian, pressure gradient…

Page 26: Analysis of Turbulence Datasets using a Database Cluster

Velocity at each location is returned

Heavy local computations done close to the data:

• Lagrange-polynomial interpolation in space

• P-cubic Hermite interpolation in time

• Other functions: velocity gradient tensor, pressure, Hessian, Laplacian, pressure gradient…

Page 27: Analysis of Turbulence Datasets using a Database Cluster

Velocity values are collated by the mediator

Page 28: Analysis of Turbulence Datasets using a Database Cluster

… and returned to the user

Page 29: Analysis of Turbulence Datasets using a Database Cluster

… and repeat with a new time t…

Client uses velocities to determine new positions:

x_p^{n+1} = x_p^n + u(x_p^n, t_n) δt
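The client-side loop is simple: fetch u at the current particle positions through GetVelocity, then apply the Euler update above. A sketch of the update step in C (the web-service call itself is omitted; the velocities are assumed already fetched):

```c
#include <stddef.h>

/* Forward-Euler update x_p^{n+1} = x_p^n + u(x_p^n, t_n) * dt for n particles.
   u[p] holds the velocity returned by the GetVelocity web-service call at
   position x[p]; a negative dt advects backwards in time. */
void euler_step(size_t n, double (*x)[3], double (*u)[3], double dt) {
    for (size_t p = 0; p < n; p++)
        for (int k = 0; k < 3; k++)
            x[p][k] += u[p][k] * dt;
}
```

Each step the client repeats this with a fresh GetVelocity call at the new positions and the advanced time t + δt.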

Page 30: Analysis of Turbulence Datasets using a Database Cluster

Sample code (gfortran 90) running on this Mac (unix)

Pick 1,000 particle positions around a square loop

Page 31: Analysis of Turbulence Datasets using a Database Cluster

Sample code (gfortran 90) running on this Mac (unix)

advect and print out results at subsequent times

4.3 sec. wall-time/time-step for 1,000 particles

Page 32: Analysis of Turbulence Datasets using a Database Cluster

advect and print out results at subsequent times

4.3 sec. wall-time/time-step for 1,000 particles

Sample code (gfortran 90) running on this Mac (unix)

Page 33: Analysis of Turbulence Datasets using a Database Cluster

advect backwards in time! (use −δt, i.e. minus)

Not possible during DNS

Sample code (gfortran 90) running on this Mac (unix)

Page 34: Analysis of Turbulence Datasets using a Database Cluster

Demo (gfortran 90) running on this Mac (unix)

advect 10 particles with and without inertia (Euler 1st order)

Fluid (tracer) particles:

dx_f(t)/dt = u(x_f(t), t),   x_f^{n+1} = x_f^n + u(x_f^n, t_n) δt

Inertial particles:

dv_p(t)/dt = (1/τ_p) [ u(x_p(t), t) − v_p(t) ],   x_p^{n+1} = x_p^n + v_p^n δt
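A corresponding sketch of the inertial-particle step with response time τ_p; the velocity update below is the natural 1st-order Euler discretization of the ODE (the slide gives only the ODE and the position update), and u is the fluid velocity fetched at x_p:

```c
/* One 1st-order Euler step for an inertial particle with response time tau_p:
   the position advances with the particle velocity v^n, and v relaxes toward
   the local fluid velocity u(x_p, t_n) on the time scale tau_p. */
void inertial_step(double x[3], double v[3], const double u[3],
                   double tau_p, double dt) {
    for (int k = 0; k < 3; k++) {
        x[k] += v[k] * dt;                   /* x^{n+1} = x^n + v^n dt  */
        v[k] += (u[k] - v[k]) / tau_p * dt;  /* dv/dt = (u - v)/tau_p   */
    }
}
```

As τ_p → 0 the particle velocity relaxes to the fluid velocity and the tracer case is recovered.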

Page 35: Analysis of Turbulence Datasets using a Database Cluster

Sample code (gfortran 90) running on this Mac (unix)

Get velocity gradients on a plane and evaluate dissipation

Page 36: Analysis of Turbulence Datasets using a Database Cluster

Sample code (gfortran 90) running on this Mac (unix)

Get velocity gradients on a plane and evaluate dissipation

280 secs for 62,500 positions
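Given the velocity-gradient tensor A_ij = ∂u_i/∂x_j returned by the service, the local dissipation rate follows the standard formula ε = 2ν S_ij S_ij with strain rate S_ij = (A_ij + A_ji)/2. A sketch (`dissipation` is a hypothetical helper; the viscosity ν is a stored simulation parameter):

```c
/* Local dissipation rate eps = 2*nu*S_ij*S_ij from the velocity-gradient
   tensor A[i][j] = du_i/dx_j, with strain rate S_ij = (A_ij + A_ji)/2. */
double dissipation(double A[3][3], double nu) {
    double s2 = 0.0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double sij = 0.5 * (A[i][j] + A[j][i]);
            s2 += sij * sij;   /* accumulate S_ij S_ij */
        }
    return 2.0 * nu * s2;
}
```

The client calls this once per grid point on the plane, averaging the results if the mean dissipation is wanted.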

Page 37: Analysis of Turbulence Datasets using a Database Cluster

Conclusions:

• Built an environment well suited for exploration of turbulent flows

• Computationally expensive DNS are performed once, then ingested and archived for possible future use by the entire community

• Generally applicable to analysis of time-dependent, "dense", 3-D vector and tensor fields defined on a grid

• Brings the generic, commonplace computations (differentiation, interpolation) as close to the data as possible

• Maintains flexibility by letting users implement their own computations

• Avoids having to give users accounts on the server or HPC location, by instead providing a web-service interface for Fortran, C, and Matlab (using gSOAP)

• Cost-benefit: analysis is slower than when done "in core during DNS", but much more open and easily accessible.

Page 38: Analysis of Turbulence Datasets using a Database Cluster

Outlook:

• Complete the time-history: 27 Terabytes (as of now: "840 more time-steps to ingest")

• Further optimizations to accelerate service

• Develop additional functions, e.g. spatial filtering - (but note: all-space transforms are a challenge in this approach)

• Couple with visualization tools

• Include other datasets (other flows, etc.)

• STAY TUNED AND VISIT US AT http://turbulence.pha.jhu.edu