Applications of UDFs in Astronomical Databases and Research
Manuchehr Taghizadeh-Popp
Johns Hopkins University
User Defined Functions (UDFs)
Motivation:
-Scientists need to execute their own code/functions where the data is stored (databases).
-Need fast code/algorithms no more complex than O(N log N), parallelizable if possible in 10^4+ threads.
For astronomers:
-Basic astronomical UDFs bring a 3-dimensional and temporal view of the universe.
-Created a cosmological functions library (CfunBASE) written in C# (.NET Framework). The library is uploaded into SQL Server and the code is executed through CLR integration.
-Used in CasJobs/SkyServer service hosting SDSS data archive.
-Execute Functions/Stored procedures in simple SQL commands.
Functions for SQL Server
-Cosmological functions:
- Volume, distances and times as a function of redshift “z” (F = F(z)).
- Inverse functions z = F^-1(F(z)) also implemented.
-Basic data exploratory and statistical functions also included:
- Cumulative distribution and quantile functions (both scalar and aggregate)
- Binning and grids (1-D streaming table-valued function, linear/log-scaled), for aggregation, table creation, etc.
- N-Dimensional weighted histogram.
-Numerical Methods:
Integration, root finding, interpolation. Customizable for speed/precision.
-Many functions in astronomy contain integrals/sums:
many problems parallelizable with CUDA/GPU (to be done…)
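As an illustration of a distance function F(z) and its inverse, here is a minimal Python sketch. It is not the CfunBASE code. It assumes a flat ΛCDM cosmology with illustrative parameters (H0 = 70 km/s/Mpc, Ωm = 0.3), and all function names are hypothetical. It combines two of the numerical methods listed above: Simpson integration for F(z) and bisection root finding for the inverse, with `steps` and `tol` trading speed for precision.

```python
import math

# Assumed flat Lambda-CDM parameters (illustrative values, not from the slides).
H0 = 70.0            # Hubble constant in km/s/Mpc
OMEGA_M = 0.3        # matter density parameter
OMEGA_L = 1.0 - OMEGA_M
C_KM_S = 299792.458  # speed of light in km/s


def inv_e(z):
    """1/E(z) for flat Lambda-CDM, the integrand of the comoving distance."""
    return 1.0 / math.sqrt(OMEGA_M * (1.0 + z) ** 3 + OMEGA_L)


def comoving_distance(z, steps=200):
    """Comoving distance in Mpc via composite Simpson integration (steps must be even)."""
    if z == 0.0:
        return 0.0
    h = z / steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    return (C_KM_S / H0) * total * h / 3.0


def redshift_from_distance(d_mpc, z_max=20.0, tol=1e-8):
    """Inverse function z = F^-1(D) by bisection; works because D(z) increases with z."""
    lo, hi = 0.0, z_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if comoving_distance(mid) < d_mpc:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `redshift_from_distance(comoving_distance(0.5))` should recover a redshift close to 0.5, which is the round trip z = F^-1(F(z)) described above.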
Advanced Astronomical Examples
-Galaxy clusters from Friends-of-Friends algorithm: 3D view of the Large Scale Structure.
-Luminosity Function (1-D weighted histogram)
SELECT dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1), sum(1/v.Vmax)/0.1, sqrt(sum(1/(v.Vmax*v.Vmax)))/0.1, count(*)
FROM (SELECT dbo.fCosmfAbsMag(m_r, z) AS AbsMag_r, Vmax FROM DR7) AS v
GROUP BY dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1)
ORDER BY dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1)
-Color-Magnitude Diagram (2-D weighted histogram)
EXECUTE spMathHistogramNDim 'SELECT dbo.fCosmfAbsMag(m_r,z), Color_u_r, 1.0/Vmax FROM DR7', 2, '-25,0', '-15,5', '50,50', 1
-Use a query parsing function to prevent SQL injection when functions run a user's query.
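The weighted histograms used in the two examples above can be sketched in a few lines of Python. This is a hypothetical stand-in, not the code behind `fMathBin` or `spMathHistogramNDim`, whose exact argument semantics are not given in the slides. It assumes linear bins described by per-dimension minima, maxima and bin counts, and it processes rows as a stream. The galaxy values are made up for illustration.

```python
from collections import defaultdict


def bin_index(x, lo, hi, nbins):
    """Index of the linear bin containing x, or None if x is outside [lo, hi)."""
    if not (lo <= x < hi):
        return None
    return int((x - lo) / (hi - lo) * nbins)


def weighted_histogram(rows, mins, maxs, nbins):
    """N-dimensional weighted histogram.

    rows: iterable of (x_1, ..., x_N, weight) tuples, consumed as a stream.
    Returns {bin index tuple: [sum of weights, sum of squared weights, count]}.
    """
    hist = defaultdict(lambda: [0.0, 0.0, 0])
    for *coords, w in rows:
        idx = tuple(bin_index(x, lo, hi, n)
                    for x, lo, hi, n in zip(coords, mins, maxs, nbins))
        if None in idx:
            continue  # skip rows falling outside the grid
        cell = hist[idx]
        cell[0] += w
        cell[1] += w * w
        cell[2] += 1
    return dict(hist)


# Toy 1-D luminosity function: (absolute magnitude, 1/Vmax) pairs, made-up values.
galaxies = [(-21.03, 1 / 4.0e6), (-21.07, 1 / 5.0e6), (-19.52, 1 / 8.0e5)]
lum_hist = weighted_histogram(galaxies, mins=[-25.0], maxs=[-15.0], nbins=[100])
bin_width = (-15.0 - -25.0) / 100
for (i,), (sw, sw2, count) in sorted(lum_hist.items()):
    density = sw / bin_width        # analogous to sum(1/Vmax)/0.1 in the query
    error = sw2 ** 0.5 / bin_width  # analogous to sqrt(sum(1/Vmax^2))/0.1
    print(i, density, error, count)
```

The same function covers the 2-D color-magnitude case by passing two coordinates per row, for example `mins=[-25, 0]`, `maxs=[-15, 5]`, `nbins=[50, 50]`.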
Extreme Value Statistics (EVS) as a tool
-Used widely in calculations of risk and the study of tails of distributions.
-EVS predicts the biggest/smallest value we will ever observe.
- The distribution φ(x) of the extremes of n i.i.d. random variables (with parent distribution P(x)) is known in the limit n → ∞:
$$\varphi(x \mid \mu, \sigma, \xi) = \frac{1}{\sigma}\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi - 1} \exp\left\{-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\}$$
- ξ defines 3 universal distributions depending on the tail of the parent distribution P(x):
(1) $P(x) \sim x^{-1/\xi - 1}$ (power law tail) ξ > 0 [φ(x) called Fréchet distribution]
(2) $P(x) \sim \exp(-x)$ (exponential tail) ξ = 0 [φ(x) called Gumbel distribution]
(3) $P(x) \sim (x_0 - x)^{-1/\xi - 1}$, $x_0 > x$ (finite cutoff tail) ξ < 0 [φ(x) called Weibull distribution]
With large data sets, questions to answer:
-Are maximal galaxy luminosities really Gumbel distributed [P(L) ~ exp(-L)]?
-Having lots of galaxies, can we observe the finite size correction of φ(x) due to having finite n?
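For reference, the generalized extreme value density with location μ, scale σ and tail index ξ can be evaluated as in the sketch below. The function name and default arguments are hypothetical. The ξ = 0 case is handled as the Gumbel limit, and the density is set to zero outside the support, which for ξ < 0 means beyond the finite cutoff.

```python
import math


def gev_pdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """Generalized extreme value density phi(x | mu, sigma, xi)."""
    s = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Gumbel limit xi -> 0
        return math.exp(-s - math.exp(-s)) / sigma
    t = 1.0 + xi * s
    if t <= 0.0:
        return 0.0  # outside the support (beyond the cutoff when xi < 0)
    return t ** (-1.0 / xi - 1.0) * math.exp(-t ** (-1.0 / xi)) / sigma
```

With ξ > 0 this gives the Fréchet class, with ξ < 0 the Weibull class, and a very small |ξ| reproduces the Gumbel curve.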
Sampling luminosities from HEALPix cells
-HEALPix tessellation library uploaded into the database.
-Can be used for spatial indexing (use tree schema and bitshift on HEALPix ID).
-Equal area cells.
Applications for EVS:
-Build HEALPix SDSS footprint on the sky. Use HTM spatial indexing library.
-Each cell has 1 “realization” of the random variable (luminosity).
-Sample the highest luminosity in each of the n cells.
-3 different spatial resolutions: Nside = (16, 32, 64), n ~ (296, 1450, 6642)
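The per-cell sampling and the bitshift on the cell ID can be sketched as follows. This is an illustration only: it assumes cell IDs in the HEALPix NESTED numbering, where each cell has four children and the parent ID is obtained by shifting right by two bits. The function names and the input rows are hypothetical.

```python
def cell_maxima(rows):
    """Highest luminosity per cell from a stream of (cell_id, luminosity) rows."""
    best = {}
    for cell_id, lum in rows:
        if cell_id not in best or lum > best[cell_id]:
            best[cell_id] = lum
    return best


def parent_cell(cell_id, levels=1):
    """Parent cell ID after halving Nside `levels` times (NESTED scheme assumed)."""
    return cell_id >> (2 * levels)


def n_cells_full_sky(nside):
    """Number of equal-area HEALPix cells covering the full sky."""
    return 12 * nside * nside


# Made-up (cell_id, luminosity) rows at some fine resolution.
rows = [(20, 1.5), (21, 2.5), (23, 0.7), (40, 3.1)]
fine = cell_maxima(rows)
# Maxima at the next coarser resolution, obtained by bit-shifting the IDs.
coarse = cell_maxima((parent_cell(c), lum) for c, lum in rows)
```

The full-sky cell count grows as 12 Nside^2, so the n values quoted above correspond to the fraction of cells inside the SDSS footprint.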
RESULTS: tail classes and finite size correction
-Tail index ξ from the DEdH (Dekkers-Einmahl-de Haan) estimator
η = normalized order statistics
Test 4 different galaxy samples:
Generally close to ξ = 0 [P(L) ~ exp(-L^β)]
-First observation of the finite size correction
- x = Standardized maximal luminosities
- Finite size correction Δ due to finite n:
Δ = P(x) – StandardGumbel
- Slow theoretical convergence:
Δ(n) ~ 1/log n
RESULT: Correction appears when n > 6000 (tradeoff between noise/convergence)
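A tail index estimate of this kind can be sketched in Python, assuming that DEdH refers to the Dekkers-Einmahl-de Haan moment estimator. The sketch below is not the code used for the results above. The function name, the choice of k and the synthetic samples are illustrative assumptions, and the estimator requires the k+1 largest values to be positive.

```python
import math
import random


def moment_estimator(sample, k):
    """Moment estimate of the tail index xi from the k largest values."""
    xs = sorted(sample)
    threshold = xs[-k - 1]  # the (k+1)-th largest value, must be > 0
    logs = [math.log(x / threshold) for x in xs[-k:]]
    m1 = sum(logs) / k                    # first moment of the log excesses
    m2 = sum(v * v for v in logs) / k     # second moment of the log excesses
    return m1 + 1.0 - 0.5 / (1.0 - m1 * m1 / m2)


random.seed(1)
# Synthetic parents: a power law tail (xi = 0.5) and an exponential tail (xi = 0).
pareto = [random.paretovariate(2.0) for _ in range(20000)]
expo = [random.expovariate(1.0) for _ in range(20000)]
xi_pareto = moment_estimator(pareto, k=1000)
xi_expo = moment_estimator(expo, k=1000)
```

The choice of k reflects the same tradeoff as above: a larger k lowers the noise, while a smaller k keeps the estimate closer to the true tail.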
Mining the space of Galaxy Properties
How to classify galaxies in the n-dimensional cloud of Photometric/Spectral properties?
- Use Principal Components Analysis (PCA) on properties and consider important eigenvectors.
- Build PRINCIPAL CURVE: smooth fit/projection to the cloud’s spine. Complexity of ~O(N^2).
- Explore diverse statistics as a function of arc length.
- Scalability for big N:
Streaming PCA (T. Budavari) and randomized sampling for principal curve
(Principal curve not yet implemented in SQLCLR)
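To make the PCA step concrete, the sketch below finds the leading principal component of a point cloud by power iteration on the covariance matrix. It is a plain batch version in pure Python, not the streaming PCA referenced above and not a principal curve fit. The function name and the synthetic 2-D data are illustrative assumptions.

```python
import math
import random


def leading_component(points, iterations=100):
    """Leading principal component of d-dimensional points via power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Random start, so the vector is unlikely to be orthogonal to the answer.
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v (Rayleigh quotient of the unit vector v).
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue


random.seed(2)
# Synthetic cloud stretched along the (1, 1) direction with small scatter.
cloud = []
for _ in range(500):
    t = random.uniform(-1.0, 1.0)
    cloud.append((t + random.gauss(0.0, 0.05), t + random.gauss(0.0, 0.05)))
direction, variance = leading_component(cloud)
```

The sign of the returned vector is arbitrary, so comparisons should use the absolute value of its projection onto a reference direction.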
Final remarks
- Algorithms useful if randomized, ~O(N log N), streaming capable and parallelizable
- For analysis, an astronomer would like
- A programming layer on the database (with the functionality of e.g. R)
- implementing matrix algebra, calculus, statistics, etc.
- Including data visualization.