Cosmological simulations in a relational database: modelling and storing merger trees Gerard Lemson,...

1
Cosmological simulations in a relational database: modelling and storing merger trees Gerard Lemson, GAVO, Max-Planck-Institut für extraterrestrische Physik, Garching, Germany Volker Springel, Max-Planck-Institut für Astrophysik, Garching, Germany Abstract:We present a method for storing tree-like data structures in a relational database that allows for fast querying of children and parents of any node and from and down to any level. We have used this method in storing halo merger trees derived from a large cosmological N-body simulation and the merger trees of model galaxy catalogues derived from the halo catalogues using semi-analytical methods. We give SQL queries corresponding to typical science questions that can be asked from such a database and present an online query interface available through the web portal of the German Astrophysical Virtual Observatory. References: [1] Springel V., White S.D.M., et al, 2005, Nature, 435, 629 [2] http://www.mpa-garching.mpg.de/galfor m/virgo/millennium/index.shtml [3] Croton D.J., Springel V., et al, 2005, MNRAS, submitted (astro-ph/0508046) [4] de Lucia G., Kauffmann G. & White S.D.M, MNRAS. 349 (2004) 1101 [5] Gray J., Szalay A., et al, 2002. http://arxiv.org/abs/cs.DB/0202014 [6] Gao L., Springel V., White S.D.M, 2005, MNRAS, in press (astro-ph/0506510) [7] Lemson G., Kauffmann G, 1999, MNRAS 302, 111 Background and goals: This work was done in the context of the German Astrophysical Virtual Observatory (GAVO). GAVO pays special attention to the introduction of theory data (simulations) into the Virtual Observatory (VO). To test our ideas we have created various prototype implementations. Our main goal for the project presented here was to investigate the use of relational database technology in the analysis of results of large scale structure simulations, as well as in their online publication. The former may lead to direct scientific benefits to the owners of the data, the latter leads to benefits to the larger community that gets access to the data in a well defined and standardized manner. Database Implementation: The design of the information system started with the construction of an analysis model, shown in Fig. 3a. It contains the important concepts and their relationships of the domain under investigation (see [8] and references therein). Important for this project are simulator (the code), simulation (the running of the simulator with particular input parameters) and its snapshots. The actual data stored in the database are the results of post-processing: cluster extraction and galaxy formation. In our model all of these specialize a common pattern that in [8] is identified with the basic concepts: protocol, experiment and result. They are especially important for describing the provenance of the data. The physical database model is restricted to the data part of the conceptual model. It is more constrained in that it must fit the data in a relational model, that it must enable translation of the science questions into (relatively easy) SQL and moreover that it do so efficiently. The science questions deal with relations between different types of objects, between object and environment and, especially, with the formation history of objects. The history is embodied in the merger trees of both halos and galaxies. One can store trees using a single link from progenitor to descendant, but this requires recursion to retrieve a complete progenitor tree. This is not a standard feature of all relational databases and a more efficient solution is desirable even where it is supported. Fig. 2 illustrates our solution. Each object gets an identifier corresponding to its order in a depth first sort of the trees rooted in objects at the final snapshot. Each object furthermore gets a pointer (foreign key) to the last progenitor in the ordering of the sub-tree rooted in that object. The complete progenitor tree rooted in a given object (at any snapshot !) is now precisely the set of objects whose identifier has value between the root object’s id and the id of the last progenitor. In SQL the relevant query is as follows: select prog.* from halo des, halo prog where des.haloId = 5000063000000 -- example value and prog.haloId between des.haloId and des.lastProgId This is the query corresponding to science question 1 above. In the database the tables are clustered (ordered) according to the id columns, which ensures that merger trees are sequentially stored on the disks, speeding up the retrieval. One other feature of the data model is the spatial indexing based on the Peano-Hilbert space filling curve. The Millennium simulation’s files are organized around this index (see [9]), which is a higher dimensional equivalent to the recursive HTM [10] or HEALPix [11] indexes on the sky. In the database it will likewise allow efficient spatial searches, though for now it is used to link the objects and the density field. Simulation and science questions: The simulation that was used in this prototype is a relatively small, dark matter, cosmological N-body simulation, that was created as preparation for the Millennium simulation [1,2]. For this project we were interested in post- processing products of this simulation: density fields, halo catalogues including halo merger trees and mock galaxy catalogues. The latter were produced using semi- analytical galaxy formation (SAGF) routines that use the merger trees as input (see [3,4] for descriptions of the SAGF algorithms). The database was designed to answer a number of science questions, similar to the Approach in [5]. We polled astrophysicists associated to the simulation project, which resulted in the following list which is a subset of these questions: 1. Return the complete halo merger tree for a halo identified at z=0 2. Find positions and velocities for all galaxies at redshift zero with B-luminosity, colour and bulge-to-disk ratio within given intervals. 3. Return B-band luminosity function of galaxies residing in halos of mass between 10^13 and 10^14 solar masses. 4. Return the formation time of halos, defined as the maximum time at which it still has a progenitor of greater than half its mass, as function of the matter density in its environment, defined by the matter density smoothed on scale of 10Mpc (inspired by [6,7]). Webportal and example queries: The database is accessible online from a special purpose web application accessible through the GAVO portal (http://www.g-vo.org), which follows design ideas from the SkyServer [12] and GalICS [13] web applications. The user can type in free-form SQL queries and retrieve the result in a variety of formats: HTML, CSV, VOTable (Fig 4b). A particular feature is the ability to visualise the results directly via a VOPlot [14] applet (see Fig 4c). A number of example queries are available. In Fig 5. we show the queries corresponding to the other three science questions above. DEMO This GAVO web application is being demonstrated at this conference. Fig. 1: Slice through the density field of the Millennium simulation at redshift z=0. The slice is 15 Mpc/h thick. Fig 2: Illustration of the merger tree structure of objects (halos/galaxies) in the simulation. The black lines indicate the traditional, descendant pointers. The red lines indicate the pointer structure used in the database model. [8] Lemson, G., Dowler, P., Banday, A.J. 2003, in ASP Conf. Ser., Vol. 314 Astronomical Data Analysis Software and Systems XIII, eds. F. Ochsenbein, M. Allen, & D. Egret (San Francisco: ASP), 472 [9] Springel, V., 2005, MNRAS, submitted (astro-ph/0505010) [10] http://skyserver.org/HTM/ [11] http://healpix.jpl.nasa.gov/ [12] http://cas.sdss.org/dr4/en/tools/search/sql.asp [13] http://galics.cosmologie.fr/main_frames.php?dir=database [14] http://vo.iucaa.ernet.in/~voi/voplot.htm Fig. 3: Formal datamodels used in the design of the Millennium database. (a) shows an analysis model (UML), detailing the important domain concepts and their interrelationships. (b) shows a schematic relational model (ER) for the tables in the database and their foreign key relations. a b Fig 4: Snapshots of the GAVO portal web pages providing access to the simulation database. (a) shows the query page, with demo queries and links to the schema and documentation. (b) shows the result of the query in (a) in VOTable format. (c) show the same result plotted with VOPlot. The query implements science question number 1 and the plot shows the evolution of the merger tree below a given halo at redshift 0 by plotting the X-position vs the snapshot number. This gives a very nice illustration of the orbits of the halos and their merging behaviour. a c b 2. select x,y,z, velX, velY, velZ from MMGalaxy where mag_b between –23 and –18 and bulgeMass >= .1*stellarMass 3. select .2*round(5*g.mag_b) as magB, count(*) as num from MMGalaxy g, MMHalo h where g.haloId = h.haloId and h.mTopHat between 1000 and 10000 and h.redshift=0 group by magB 4. select zForm, avg(g10) as g10 from MMField f, ( select des.haloId, des.phkey, max(PROG.redshift) as zForm from MMHalo PROG, MMHalo DES where DES.redshift = 0 and PROG.haloId between DES.haloId and DES.lastProgenitorId and prog.np >= des.np/2 and des.np between 100 and 200 group by des.haloId, des.phkey ) t where t.phkey = f.phkey and f.snapnum=63 group by zForm Fig 5: SQL implementations of science questions 2-4. The database dialect is Postgres.

Transcript of Cosmological simulations in a relational database: modelling and storing merger trees Gerard Lemson,...

Page 1: Cosmological simulations in a relational database: modelling and storing merger trees Gerard Lemson, GAVO, Max-Planck-Institut für extraterrestrische Physik,

Cosmological simulations in a relational database: modelling and storing merger trees

Gerard Lemson, GAVO, Max-Planck-Institut für extraterrestrische Physik, Garching, Germany

Volker Springel, Max-Planck-Institut für Astrophysik, Garching, Germany

Abstract:We present a method for storing tree-like data structures in a relational database that allows for fast querying of children and parents of any node and from and down to any level. We have used this method in storing halo merger trees derived from a large cosmological N-body simulation and the merger trees of model galaxy catalogues derived from the halo catalogues using semi-analytical methods. We give SQL queries corresponding to typical science questions that can be asked from such a database and present an online query interface available through the web portal of the German Astrophysical Virtual Observatory.

References:[1] Springel V., White S.D.M., et al, 2005, Nature, 435, 629[2] http://www.mpa-garching.mpg.de/galform/virgo/millennium/index.shtml [3] Croton D.J., Springel V., et al, 2005, MNRAS, submitted (astro-ph/0508046)[4] de Lucia G., Kauffmann G. & White S.D.M, MNRAS. 349 (2004) 1101 [5] Gray J., Szalay A., et al, 2002. http://arxiv.org/abs/cs.DB/0202014[6] Gao L., Springel V., White S.D.M, 2005, MNRAS, in press (astro-ph/0506510)[7] Lemson G., Kauffmann G, 1999, MNRAS 302, 111

Background and goals: This work was done in the context of the German Astrophysical Virtual Observatory (GAVO). GAVO pays special attention to the introduction of theory data (simulations) into the Virtual Observatory (VO). To test our ideas we have created various prototype implementations. Our main goal for the project presented here was to investigate the use of relational database technology in the analysis of results of large scale structure simulations, as well as in their online publication. The former may lead to direct scientific benefits to the owners of the data, the latter leads to benefits to the larger community that gets access to the data in a well defined and standardized manner.

Database Implementation:The design of the information system started with the construction of an analysis model, shown in Fig. 3a. It contains the important concepts and their relationships of the domain under investigation (see [8] and references therein). Important for this project are simulator (the code), simulation (the running of the simulator with particular input parameters) and its snapshots. The actual data stored in the database are the results of post-processing: cluster extraction and galaxy formation. In our model all of these specialize a common pattern that in [8] is identified with the basic concepts: protocol, experiment and result. They are especially important for describing the provenance of the data.

The physical database model is restricted to the data part of the conceptual model. It is more constrained in that it must fit the data in a relational model, that it must enable translation of the science questions into (relatively easy) SQL and moreover that it do so efficiently. The science questions deal with relations between different types of objects, between object and environment and, especially, with the formation history of objects. The history is embodied in the merger trees of both halos and galaxies. One can store trees using a single link from progenitor to descendant, but this requires recursion to retrieve a complete progenitor tree. This is not a standard feature of all relational databases and a more efficient solution is desirable even where it is supported.

Fig. 2 illustrates our solution. Each object gets an identifier corresponding to its order in a depth first sort of the trees rooted in objects at the final snapshot. Each object furthermore gets a pointer (foreign key) to the last progenitor in the ordering of the sub-tree rooted in that object. The complete progenitor tree rooted in a given object (at any snapshot !) is now precisely the set of objects whose identifier has value between the root object’s id and the id of the last progenitor. In SQL the relevant query is as follows:

select prog.* from halo des, halo prog

where des.haloId = 5000063000000 -- example value and prog.haloId between des.haloId and des.lastProgId

This is the query corresponding to science question 1 above. In the database the tables are clustered (ordered) according to the id columns, which ensures that merger trees are sequentially stored on the disks, speeding up the retrieval.

One other feature of the data model is the spatial indexing based on the Peano-Hilbert space filling curve. The Millennium simulation’s files are organized around this index (see [9]), which is a higher dimensional equivalent to the recursive HTM [10] or HEALPix [11] indexes on the sky. In the database it will likewise allow efficient spatial searches, though for now it is used to link the objects and the density field.

Simulation and science questions: The simulation that was used in this prototype is a relatively small, dark matter,cosmological N-body simulation, that was created as preparation for the Millennium simulation [1,2]. For this project we were interested in post-processing productsof this simulation: density fields, halo catalogues including halo merger trees and mock galaxy catalogues. The latter were produced using semi-analytical galaxy formation (SAGF) routines that use the merger trees as input (see [3,4] for descriptions of the SAGF algorithms). The database was designed to answer a number of science questions, similar to the Approach in [5]. We polled astrophysicists associated to the simulation project, which resulted in the following list which is a subset of these questions:1. Return the complete halo merger tree for a halo identified at z=0 2. Find positions and velocities for all galaxies at redshift zero with B-luminosity, colour

and bulge-to-disk ratio within given intervals. 3. Return B-band luminosity function of galaxies residing in halos of mass between 10^13

and 10^14 solar masses. 4. Return the formation time of halos, defined as the maximum time at which it still has a

progenitor of greater than half its mass, as function of the matter density in its environment, defined by the matter density smoothed on scale of 10Mpc (inspired by [6,7]).

Webportal and example queries:The database is accessible online from a special purpose web application accessible through the GAVO portal (http://www.g-vo.org), which follows design ideas from the SkyServer [12] and GalICS [13] web applications. The user can type in free-form SQL queries and retrieve the result in a variety of formats: HTML, CSV, VOTable (Fig 4b). A particular feature is the ability to visualise the results directly via a VOPlot [14] applet (see Fig 4c). A number of example queries are available. In Fig 5. we show the queries corresponding to the other three science questions above. DEMO This GAVO web application is being demonstrated at this conference.

Fig. 1: Slice through the density field of the Millennium simulation at redshift z=0. The slice is 15 Mpc/h thick.

Fig 2: Illustration of the merger tree structure of objects (halos/galaxies) in the simulation. The black lines indicate the traditional, descendant pointers. The red lines indicate the pointer structure used in the database model.

[8] Lemson, G., Dowler, P., Banday, A.J. 2003, in ASP Conf. Ser., Vol. 314 Astronomical Data Analysis Software and Systems XIII, eds. F. Ochsenbein, M. Allen, & D. Egret (San Francisco: ASP), 472

[9] Springel, V., 2005, MNRAS, submitted (astro-ph/0505010)[10] http://skyserver.org/HTM/[11] http://healpix.jpl.nasa.gov/[12] http://cas.sdss.org/dr4/en/tools/search/sql.asp[13] http://galics.cosmologie.fr/main_frames.php?dir=database[14] http://vo.iucaa.ernet.in/~voi/voplot.htm

Fig. 3: Formal datamodels used in the design of the Millennium database. (a) shows an analysis model (UML), detailing the important domain concepts and their interrelationships. (b) shows a schematic relational model (ER) for the tables in the database and their foreign key relations.

a

b

Fig 4: Snapshots of the GAVO portal web pages providing access to the simulation database. (a) shows the query page, with demo queries and links to the schema and documentation. (b) shows the result of the query in (a) in VOTable format. (c) show the same result plotted with VOPlot. The query implements science question number 1 and the plot shows the evolution of the merger tree below a given halo at redshift 0 by plotting the X-position vs the snapshot number. This gives a very nice illustration of the orbits of the halos and their merging behaviour.

a cb

2.select x,y,z, velX, velY, velZ from MMGalaxy where mag_b between –23 and –18 and bulgeMass >= .1*stellarMass

3. select .2*round(5*g.mag_b) as magB, count(*) as num from MMGalaxy g, MMHalo h where g.haloId = h.haloId and h.mTopHat between 1000 and 10000 and h.redshift=0group by magB

4.select zForm, avg(g10) as g10from MMField f, ( select des.haloId, des.phkey, max(PROG.redshift) as zForm from MMHalo PROG, MMHalo DES where DES.redshift = 0 and PROG.haloId between DES.haloId and DES.lastProgenitorId and prog.np >= des.np/2 and des.np between 100 and 200 group by des.haloId, des.phkey ) t where t.phkey = f.phkey and f.snapnum=63group by zForm

Fig 5: SQL implementations of science questions 2-4. The database dialect is Postgres.