Post on 20-Dec-2015
Cosmic Microwave Background Data Analysis :
From Time-Ordered Data To Power Spectra
Julian Borrill
Computational Research Division, Berkeley Lab
& Space Sciences Laboratory, UC Berkeley
CMB Science - I
The CMB is a snapshot of the Universe when it first became neutral 400,000 years after the Big Bang.
Cosmic - filling all of space.
Microwave - redshifted by the expansion of the Universe from 3000K to 3K.
Background - primordial photons coming from “behind” all astrophysical sources.
CMB Science - II
The CMB is a unique probe of the very early Universe. Its tiny (1:10^5 - 10^8) fluctuations carry information about
- the fundamental parameters of cosmology
- ultra-high-energy physics beyond the Standard Model
CMB Science - III
The new frontier in CMB research is polarization:
• consistency check of temperature results
• re-ionization history of the Universe
• gravity wave production during inflation
But polarization fluctuations are up to 3 orders of magnitude fainter than temperature (we think), requiring:
• many more detectors
• much longer observation times
• very careful analysis of very large datasets
The Planck Satellite
• A joint ESA/NASA mission due to launch in fall 2007.
• An 18+ month all-sky survey at 9 microwave frequencies from 30 to 857 GHz.
• O(10^12) observations, O(10^8) sky pixels, O(10^4) spectral multipoles.
Overview (I)
CMB analysis moves from the time domain (observations), to the pixel domain (maps), to the multipole domain (power spectra), calculating the compressed data and their reduced error bars at each step.
Overview (II)
CMB data analysis typically proceeds in 4 steps:
- Pre-processing (deglitching, pointing, calibrating).
- Estimating the time-domain noise statistics.
- Estimating the map (and its errors).
- Estimating the power spectra (and their errors).
iterating & looping as we learn more about the data.
Then we can ask about the likelihoods of the parameters of any particular class of cosmologies.
Goals
To cover the
(a) basic mathematical formalism
(b) algorithms & their scaling behaviours
(c) example implementation issues
for map-making and power spectrum estimation.
To consider how to extract the maximum amount of information from the data, subject to practical computational constraints.
To illustrate some of the computational issues faced when analyzing very large datasets (eg. Planck).
Data Compression
CMB data analysis is an exercise in data compression:
1. Time-ordered data: #samples = #detectors x sampling rate x duration ~ 70 x 200Hz x 18 months for Planck
2. (HEALPixelized) Maps: #pixels = #components x sky fraction x 12 nside^2 ~ (3 - 6) x 1 x 12 x 4096^2 for Planck
3. Power Spectra: #bins = #spectra x #multipoles / bin resolution ~ 6 x (3 x 10^3) / 1 for Planck
Data Parameters

Symbol   Description               Planck
N_t      Number of samples         5 x 10^11
λ        Noise bandwidth           O(10^4)
N_p      Number of pixels          6 x 10^8
N_s      Number of spectra         6
l_max    Maximum multipole         3 x 10^3
N_b      Number of spectral bins   2 x 10^4
N_i      Number of iterations      -
N_r      Number of realizations    -
Computational Constraints
• A 1 GHz processor running at 100% efficiency for 1 day performs O(10^14) operations.
• 1 Gbyte of memory can hold an O(10^8) element vector, or an O(10^4 x 10^4) matrix, in 64-bit precision.
• Parallel (multiprocessor) computing increases the operation count and memory limits.
• Challenges to computational efficiency & scaling:
- load balancing (work & memory)
- data-delivery, including communication & I/O
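These back-of-envelope numbers are worth automating when planning an analysis. A minimal Python sketch, using the slide's own figures; the constants and function names here are illustrative, not from any CMB package:

```python
# Feasibility checks using the slide's numbers:
# ~1e14 operations per processor-day, 8 bytes per 64-bit element.

OPS_PER_PROC_DAY = 1e14        # 1 GHz processor at 100% efficiency for 1 day

def processor_days(op_count, efficiency=1.0):
    """Processor-days needed for op_count operations at a sustained efficiency."""
    return op_count / (OPS_PER_PROC_DAY * efficiency)

def matrix_gbytes(n):
    """Memory for a dense n x n matrix in 64-bit precision."""
    return 8.0 * n * n / 1e9

# Naive Planck map-making (~2e32 ops) is hopeless even at 100% efficiency;
# the FFT-based method (~2e14 ops) is about two processor-days.
naive_days = processor_days(2e32)
fft_days = processor_days(2e14)
```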
Map-Making - Formalism (I)
Consider data consisting of noise and sky-signal

d_t = n_t + A_tp s_p

where the pointing matrix A encodes the weight of each pixel p in each sample t - for a total power temperature observation

A_tp = 1 if pixel p is observed at time t, and 0 otherwise

and the sky-signal s_p is both beam & pixel smoothed.
Map-Making - Formalism (II)
Assume Gaussian noise with probability distribution

P(n) ∝ exp( - n^T N^-1 n / 2 ) / √(det N)

and a time-time noise correlation matrix

N_tt' = < n_t n_t' >

whose inverse is (piecewise) stationary & band-limited:

N^-1_tt' = N^-1(|t - t'|),  with  N^-1(|t - t'| > λ) ≈ 0
Aside : Noise Estimation
To make the map we need the inverse time-time noise correlations. Approximate these from the data themselves, which requires the pure noise timestream:
i) Assume n_t = d_t.
ii) Solve for the map: d_p ~ s_p.
iii) Subtract the map from the data: n_t = d_t - A_tp d_p.
iv) Iterate.
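The loop above can be sketched in a few lines of numpy. This toy assumes white noise, so the map solve in step (ii) reduces to a per-pixel average; all sizes and noise levels are invented for illustration:

```python
import numpy as np

# Sketch of the noise-estimation loop (i)-(iv), assuming white noise.
rng = np.random.default_rng(0)

n_samp, n_pix = 1000, 10
pix = rng.integers(0, n_pix, size=n_samp)        # pointing: pixel hit per sample
signal = rng.standard_normal(n_pix)              # "sky" we pretend not to know
d = signal[pix] + 0.1 * rng.standard_normal(n_samp)

noise = d.copy()                                  # (i) first guess: n_t = d_t
for _ in range(3):                                # (iv) iterate
    # (ii) solve for the map -- per-pixel average under white noise
    hits = np.bincount(pix, minlength=n_pix)
    m = np.bincount(pix, weights=d, minlength=n_pix) / hits
    # (iii) subtract the map from the data to get a pure-noise timestream
    noise = d - m[pix]
sigma = noise.std()                               # estimated noise level (~0.1)
```

With white noise this converges after one pass; with correlated noise, each pass would instead feed the updated N^-1 estimate back into the full map solve.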
Map-Making - Formalism (III)
Writing the noise in terms of the data and signal & maximizing its likelihood over all possible signals gives the minimum variance map

d_p = (A^T N^-1 A)^-1 A^T N^-1 d_t

with pixel-pixel noise correlations

N_pp'^-1 = (A^T N^-1 A)_pp'

Taken together, these are a complete description of the data.
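On a dataset small enough to hold dense matrices, the minimum-variance map m = (A^T N^-1 A)^-1 A^T N^-1 d and its inverse pixel-pixel covariance can be formed directly. A numpy toy (the sizes and noise level are invented; real codes never build A or N^-1 densely):

```python
import numpy as np

# Dense illustration of the maximum-likelihood (GLS) map.
rng = np.random.default_rng(1)

n_samp, n_pix = 200, 5
A = np.zeros((n_samp, n_pix))
A[np.arange(n_samp), rng.integers(0, n_pix, n_samp)] = 1.0   # one hit per row

s = rng.standard_normal(n_pix)                # true sky
Ninv = np.diag(np.full(n_samp, 4.0))          # white noise, sigma = 0.5
d = A @ s + 0.5 * rng.standard_normal(n_samp)

Npp_inv = A.T @ Ninv @ A                      # pixel-pixel inverse covariance
m = np.linalg.solve(Npp_inv, A.T @ Ninv @ d)  # minimum-variance map
```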
Map-Making - Algorithms (I)
We want to solve the system:

d_p = (A^T N^-1 A)^-1 A^T N^-1 d_t

Naive operation count: O(N_t^2 N_p).
Eg. (5 x 10^11)^2 x (6 x 10^8) ~ 2 x 10^32 for Planck !
Map-Making - Algorithms (II)
a) Exploit the structure of the matrices
- Pointing matrix is sparse
- Inverse noise correlation matrix is band-Toeplitz
Associated matrix-matrix & -vector multiplication scalings reduced from O(N_t^2 N_p) to O(N_t λ).
Eg. (5 x 10^11) x 10^4 ~ 5 x 10^15 for Planck.
Map-Making - Algorithms (III)
b) Replace explicit matrix inversion with an iterative solver (eg. preconditioned conjugate gradient) using repeated matrix-vector multiplications, reducing the scaling from O(N_p^3) to O(N_i N_p^2).
N_i depends on the required solution accuracy and the quality of the preconditioner (white noise works well).
Eg. 30 x (6 x 10^8)^2 ~ 10^19 for Planck.
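A hand-rolled preconditioned conjugate gradient for the map system, with the diagonal (white-noise) preconditioner, might look like this numpy sketch. The sizes and the band-limited N^-1 are invented; a production code applies the matrices implicitly rather than forming M:

```python
import numpy as np

# PCG solve of (A^T N^-1 A) x = A^T N^-1 d with a diagonal preconditioner.
rng = np.random.default_rng(2)
n_samp, n_pix = 300, 8
A = np.zeros((n_samp, n_pix))
A[np.arange(n_samp), rng.integers(0, n_pix, n_samp)] = 1.0
# Band-limited Toeplitz stand-in for N^-1 (symmetric positive definite)
Ninv = np.eye(n_samp) - 0.4 * np.eye(n_samp, k=1) - 0.4 * np.eye(n_samp, k=-1)
d = A @ rng.standard_normal(n_pix) + 0.3 * rng.standard_normal(n_samp)

M = A.T @ Ninv @ A                          # formed densely only for this toy
b = A.T @ Ninv @ d
precond = 1.0 / np.diag(M)                  # white-noise (diagonal) preconditioner

x = np.zeros(n_pix)
r = b - M @ x
z = precond * r
p = z.copy()
for _ in range(50):
    Mp = M @ p
    alpha = (r @ z) / (p @ Mp)
    x += alpha * p
    r_new = r - alpha * Mp
    if np.linalg.norm(r_new) < 1e-10:
        break
    z_new = precond * r_new
    beta = (r_new @ z_new) / (r @ z)
    p = z_new + beta * p
    r, z = r_new, z_new
```

Each iteration costs one matrix-vector multiply, which is where the O(N_i N_p^2) scaling comes from.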
Map-Making - Algorithms (IV)
c) Leave the inverse pixel-pixel noise matrix in implicit form, solving

(A^T N^-1 A) d_p = A^T N^-1 d_t

iteratively. Now each multiplication takes O(N_t λ) operations in pixel space, or O(N_t log2 λ) operations in Fourier space.
Eg. 30 x (5 x 10^11) x log2 10^4 ~ 2 x 10^14 for Planck.
But this gives no information about the pixel-pixel noise correlations (ie. errors).
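The Fourier-space trick rests on stationary (Toeplitz, here approximated as circulant) correlations diagonalizing under the FFT. A small numpy demonstration, with an invented symmetric band-limited kernel standing in for a row of N^-1:

```python
import numpy as np

# O(n^2) dense circulant multiply vs O(n log n) via FFT.
rng = np.random.default_rng(3)
n = 256
kernel = np.zeros(n)
kernel[0], kernel[1], kernel[-1] = 1.0, -0.4, -0.4   # symmetric band-limited row

# Dense circulant matrix multiply, O(n^2)
C = np.array([np.roll(kernel, i) for i in range(n)])
v = rng.standard_normal(n)
dense = C @ v

# Same product via FFT, O(n log n): circulants diagonalize in Fourier space
# (kernel symmetry makes circular correlation equal circular convolution)
fft = np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(v)).real
```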
Map-Making - Extensions (I)
For polarization data the signal can be written in terms of the i, q & u Stokes parameters and the angle φ of the polarizer relative to some chosen coordinate frame

s_t = i_p + q_p cos 2φ_t + u_p sin 2φ_t

where A_tp now has 3 non-zero entries per row. We need at least 3 observation-angles to separate i, q & u.
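The 3-angle requirement is easy to verify: three distinct polarizer angles give an invertible 3x3 system for i, q & u. A numpy toy with invented Stokes values and angles:

```python
import numpy as np

# Recover i, q, u from observations s_t = i + q cos(2 phi_t) + u sin(2 phi_t).
i_true, q_true, u_true = 1.0, 0.3, -0.2
phi = np.array([0.0, np.pi / 3, 2 * np.pi / 3])       # three observation angles
A = np.column_stack([np.ones(3), np.cos(2 * phi), np.sin(2 * phi)])
s = A @ np.array([i_true, q_true, u_true])            # noiseless observations

iqu = np.linalg.solve(A, s)                           # 3 angles -> unique i, q, u
```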
Map-Making - Extensions (II)
If the data includes a sky-asynchronous contaminating signal (eg. MAXIMA's chopper-synchronous signal), we can solve simultaneously for the map and the contaminant's amplitudes, projecting it out of the data. This can be extended to any number of contaminants, including relative offsets between sub-maps.
Map-Making - Extensions (III)
If the data includes a sky-synchronous contaminating signal then more information is needed. Given multi-frequency data with known foreground frequency-scalings, we can solve simultaneously for the CMB and the foreground components. With detectors at d different frequencies this can be extended to (d-1) different foregrounds.
Map-Making - Implementation (I)
We want to be able to calculate the inverse pixel-pixel noise correlation matrix

N_pp'^-1 = A^T N^-1 A

because it
- encodes the error information
- is sparse, so can be saved even for huge datasets
- provides a good (block) diagonal preconditioner & i/q/u pixel degeneracy test with white noise.
Map-Making - Implementation (II)
Both the time- & pixel-domain data must be distributed.
We want to balance simultaneously both the work- & the memory-load per processor.
Balanced distribution in one basis is unbalanced in the other:
(i) uniform work per observation but varied work per pixel;
(ii) unequal numbers of pixels observed per interval.
No perfect solution - have to accept the least limiting imbalance (pixel memory).
Map-Making - Implementation (III)
Pseudo-code:

Initialize N_pp'^-1 = 0
For each time t
  For each pixel p observed at time t with weight w
    For each time t' within λ of t
      For each pixel p' observed at time t' with weight w'
        Accumulate N_pp'^-1 += w N^-1_tt' w'

Doesn't work well - pixel-memory access too random.
Map-Making - Implementation (IV)
Alternative pseudo-code:

For each pixel p find the times at which it is observed.
For each pixel p
  Initialize N_pp'^-1 = 0 for all p'
  For each time t at which p is observed with weight w
    For each time t' within λ of t
      For each pixel p' observed at time t' with weight w'
        Accumulate N_pp'^-1 += w N^-1_tt' w'
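The pixel-ordered accumulation can be prototyped directly in Python, with the sparse matrix held as a dictionary keyed by pixel pairs. All sizes, weights and the N^-1 band values below are invented; the dense construction at the end is only a correctness cross-check:

```python
import numpy as np

# Pixel-ordered accumulation of the sparse N_pp'^-1 = A^T N^-1 A for a
# band-limited N^-1 of half-bandwidth lam, with unit pointing weights.
rng = np.random.default_rng(5)
n_samp, n_pix, lam = 120, 6, 3
pix = rng.integers(0, n_pix, size=n_samp)             # pointing
ninv = {0: 1.0, 1: -0.3, 2: -0.1, 3: -0.05}           # N^-1(|t - t'|), |t-t'|<=lam

# For each pixel, the times at which it is observed
times_of = {p: np.flatnonzero(pix == p) for p in range(n_pix)}

npp_inv = {}
for p in range(n_pix):
    for t in times_of[p]:
        for dt in range(-lam, lam + 1):               # t' within lam of t
            t2 = t + dt
            if 0 <= t2 < n_samp:
                key = (p, pix[t2])
                npp_inv[key] = npp_inv.get(key, 0.0) + ninv[abs(dt)]

# Cross-check against the dense A^T N^-1 A
A = np.zeros((n_samp, n_pix))
A[np.arange(n_samp), pix] = 1.0
Ninv = sum(ninv[k] * (np.eye(n_samp, k=k) + (np.eye(n_samp, k=-k) if k else 0))
           for k in ninv)
dense = A.T @ Ninv @ A
```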
Map-Making - Implementation (V)
Distribute time-ordered data equally over processors.
Each processor accumulates part of Npp’-1 in sparse form
- only over pixels observed in its time interval
- only over some of the observations of these pixels
Then all of these partial matrices, each stored in its own sparse form:

(p, z, (p1, n1), … , (pz, nz)), … , (p', z', (p'1, n'1), … , (p'z', n'z'))

have to be combined.
Map-Making - Conclusions
• The maximum-likelihood map can only be calculated in O(N_i N_t log2 λ) operations - O(10^14) for Planck.
• The sparse inverse pixel-pixel correlation matrix can be calculated in O(N_t λ) operations - O(10^15) for Planck.
• A single column of the pixel-pixel correlation matrix can be calculated (just like a map) in O(N_i N_t log2 λ) operations - O(10^14) for Planck.
• Computational efficiency is only ~5% for serial/small data, ~0.5% for parallel/large data.
Power Spectrum Estimation
From the map-making we have:
(i) the pixelized data-vector d_p containing the CMB signal(s) plus any other separated components.
(ii) some representation of the pixel-pixel noise correlation matrix N_pp' - explicit, explicit inverse, or implicit inverse.
Truncation of N_pp' (only) is equivalent to marginalization.
Power Spectrum - Formalism (I)
Given a map of noise + Gaussian CMB signal

d_p = n_p + s_p

with correlations

D_pp' = < d_p d_p' > = N_pp' + S_pp'

and - assuming a uniform prior - probability distribution

P(d|C) ∝ exp( - d^T D^-1 d / 2 ) / √(det D)
Power Spectrum - Formalism (II)
Assuming azimuthal symmetry of the CMB, its pixel-pixel correlations depend only on the angle between the pair.
3 components (T, E, B) give 6 spectra (3 auto: TT, EE, BB + 3 cross: TE, TB, EB).
Theory predicts C_l^TB = C_l^EB = 0, but this should still be calculated/confirmed (test of systematics).
Power Spectrum - Formalism (III)
Eg: temperature signal correlations:

S_pp' = Σ_l [(2l+1)/4π] B_l^2 C_l P_l(cos θ_pp')

with beam+pixel window B_l.
For binned multipoles, C_l = C_b for all l in bin b.
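The signal correlation S_pp' = Σ_l [(2l+1)/4π] B_l^2 C_l P_l(cos θ) is directly computable with numpy's Legendre series. This sketch uses an invented toy spectrum and Gaussian beam rather than real CMB values:

```python
import numpy as np
from numpy.polynomial import legendre

# Temperature signal correlation as a Legendre series in cos(theta).
lmax = 64
ell = np.arange(lmax + 1)
C_l = np.zeros(lmax + 1)
C_l[2:] = 1.0 / (ell[2:] * (ell[2:] + 1))            # toy CMB-like spectrum
B_l = np.exp(-0.5 * ell * (ell + 1) * 0.05 ** 2)     # Gaussian beam+pixel window

coeffs = (2 * ell + 1) / (4 * np.pi) * B_l ** 2 * C_l

# Pixel-pair separations (radians) -> correlations via P_l(cos theta)
theta = np.array([0.0, 0.1, 0.5, 1.0, np.pi])
S = legendre.legval(np.cos(theta), coeffs)           # correlation peaks at theta=0
```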
Power Spectrum - Formalism (IV)
No analytic likelihood maximization - use iterative search.
Newton-Raphson requires two derivatives wrt bin powers:

dL/dC_b = [ d^T D^-1 (dS/dC_b) D^-1 d - Tr(D^-1 dS/dC_b) ] / 2

d2L/dC_b dC_b' ~ - F_bb' = - Tr[ D^-1 (dS/dC_b) D^-1 (dS/dC_b') ] / 2
Power Spectrum - Algorithms (I)
For each NR iteration we want to solve the system:

δC_b = F_bb'^-1 dL/dC_b'

Naive operation count: O(N_b N_p^3), with 8 N_p^2 bytes per matrix.
Eg. (2 x 10^4) x (6 x 10^8)^3 ~ 4 x 10^30 operations/iteration & 8 x (6 x 10^8)^2 ~ 3 x 10^18 bytes/matrix for Planck !
Power Spectrum - Algorithms (II)
a) Exploiting the signal derivative matrices' structures
- reduces the operation count by up to a factor of 10.
- increases the memory needed from 2 to 3 matrices.
Power Spectrum - Algorithms (III)
b) Using PCG for D^-1 & a stochastic trace approximant for the dL/dC & d2L/dC2 terms.
Eg. 5 x 10^4 x (2 x 10^4) x 10^3 x (5 x 10^11) x 10 ~ 5 x 10^24 for Planck.
Power Spectrum - Algorithms (IV)
c) Abandoning maximum likelihood altogether !
Eg. Take the SHT of the noisy, cut-sky map and build a pseudo-spectrum

C~_l = Σ_m |a~_lm|^2 / (2l+1)

Assume spectral independence & an invertible linear transfer function

< C~_l > = Σ_l' T_ll' C_l'^s + < C~_l^n >
Power Spectrum - Algorithms (V)
Now use Monte-Carlo simulations of the experiment's observations of signal+noise & noise-only to determine the transfer matrix T_ll' & the noise spectrum C_l^n, and hence the signal spectrum C_l^s.
Scales as the number of realizations times the per-realization simulation & mapping cost.
Eg. 10^4 x 30 x 5 x 10^11 x 10 ~ 10^18 for Planck.
And provided we simulate time-ordered data, any other abuses (filtering etc) can be included in the Monte Carlo.
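The Monte-Carlo logic can be illustrated in one dimension, with the FFT standing in for the spherical transform and a binary mask for the sky cut. Everything here - spectrum, mask, noise level - is an invented toy, and only the diagonal of T_ll' is estimated:

```python
import numpy as np

# 1D periodic toy of the MASTER-style Monte-Carlo pipeline.
rng = np.random.default_rng(6)
n, n_mc = 128, 100
mask = np.ones(n)
mask[: n // 4] = 0.0                                 # "cut sky"
true_cl = 1.0 / (1.0 + np.arange(n // 2 + 1))        # toy signal spectrum

def draw_signal():
    # white stream filtered to the toy spectrum
    return np.fft.irfft(np.fft.rfft(rng.standard_normal(n)) * np.sqrt(true_cl), n)

def pseudo_cl(x):
    return np.abs(np.fft.rfft(x * mask)) ** 2 / n    # masked power spectrum

# Monte Carlo: mean pseudo-spectra of signal-only and noise-only simulations
sig = np.mean([pseudo_cl(draw_signal()) for _ in range(n_mc)], axis=0)
noise_bias = np.mean([pseudo_cl(0.2 * rng.standard_normal(n))
                      for _ in range(n_mc)], axis=0)

T = sig / true_cl                                    # diagonal transfer estimate
# Debias an "observed" signal+noise stream
obs = pseudo_cl(draw_signal() + 0.2 * rng.standard_normal(n))
cl_hat = (obs - noise_bias) / T
```

In the same spirit, any filtering or other time-domain processing applied to the simulations is automatically folded into T and the noise bias.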
Power Spectrum - Implementation (I)
Even with supercomputers, full maximum likelihood analysis is unfeasible for more than O(10^5) pixels.
However, it is still useful for
(a) low-resolution all-sky maps (eg. for low multipoles where Monte Carlo methods are less reliable).
(b) high-resolution cut-sky maps (eg. for consistency & systematics cross-checks).
(c) hierarchical resolution/sky-fraction analyses.
Power Spectrum - Implementation (II)
(i) Build the dS/dC_b matrices.
(ii) Make an initial guess C_b.
(iii) Calculate D = N + Σ_b C_b dS/dC_b & invert.
(iv) For each b, calculate W_b = D^-1 dS/dC_b.
(v) For each b, calculate dL/dC_b ~ d^T W_b z - Tr[W_b], with z = D^-1 d.
(vi) For each b, b', calculate F_bb' ~ Tr[W_b W_b'].
(vii) Calculate the new C_b += F_bb'^-1 dL/dC_b'.
(viii) Iterate to convergence.
Using as many processors as possible, as efficiently as possible.
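Steps (i)-(viii) collapse nicely in a one-bin toy: estimating the power C of a white signal on top of known white noise. The matrices are dense and tiny, purely for illustration:

```python
import numpy as np

# One-bin Newton-Raphson spectral estimation on a toy white-signal model.
rng = np.random.default_rng(7)
n_pix = 800
C_true, sigma_n = 2.0, 1.0
d = rng.standard_normal(n_pix) * np.sqrt(C_true + sigma_n ** 2)

dSdC = np.eye(n_pix)                        # (i) white signal: S = C * I
N = sigma_n ** 2 * np.eye(n_pix)

C = 1.0                                     # (ii) initial guess
for _ in range(10):                         # (viii) iterate to convergence
    D = N + C * dSdC                        # (iii)
    Dinv = np.linalg.inv(D)
    W = Dinv @ dSdC                         # (iv)
    z = Dinv @ d
    dLdC = 0.5 * (d @ W @ z - np.trace(W))  # (v)
    F = 0.5 * np.trace(W @ W)               # (vi)
    step = dLdC / F                         # (vii)
    C += step
    if abs(step) < 1e-10:
        break
```

For this diagonal case Newton-Raphson lands on the ML solution C = |d|^2/N_pix - sigma_n^2 in essentially a single step; correlated signal and noise make the iteration genuinely necessary.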
Power Spectrum - Implementation (III)
The limiting calculation is

W_b = D^-1 dS/dC_b  for each bin b

so it sets the constraints for the overall implementation.
Maximum efficiency requires minimum communication, so keep the data as dense as possible on the processors: 3 matrices fill the available memory.
Check the other steps - none requires more than 3 matrices. Out-of-core implementation, so I/O efficiency is critical.
Power Spectrum - Implementation (IV)
Dense data requirement sets the processor count:

#PE ~ 3 x 8 x N_p^2 / bytes per processor

Eg. 3 x 8 x (10^5)^2 / 10^9 = 240, but we want to scale to very large systems.
Each bin multiplication is independent - divide the processors into gangs & do many W_b simultaneously.
Derivative terms are bin-parallel too:
- dL/dC_b is trivial.
- d2L/dC_b dC_b' requires some load-balancing.

Aside : Fisher Matrix (I)

Use Tr[AB] = Σ_ij A_ij B_ji = Σ_ij (A^T)_ji B_ji

For all my gang's b
  Read W_b
  Transpose -> W^T_b
  Multiply W^T_b & W_b -> F_bb
  For all b' > b
    Read W_b' over W_b
    Multiply W^T_b & W_b' -> F_bb'

2 matrices in memory; N_b(N_b+1)/2 reads.

Aside : Fisher Matrix (II)

For all my gang's b
  Read W_b
  Transpose -> W^T_b
  Multiply W^T_b & W_b -> F_bb
  Read W_b+1 over W_b
  Multiply W^T_b & W_b+1 -> F_b,b+1
  Transpose -> W^T_b+1
  For all b' > b+1
    Read W_b' over W_b+1
    Multiply W^T_b & W_b' -> F_bb'
    Multiply W^T_b+1 & W_b' -> F_b+1,b'

3 matrices in memory; N_b(N_b+2)/4 reads.
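The Tr[AB] identity behind both schedules is a one-liner to check in numpy: the trace of a product needs only an elementwise multiply against the transpose, never the O(n^3) product itself.

```python
import numpy as np

# Tr[A B] = sum_ij A_ij B_ji = sum_ij (A^T)_ij B_ij
rng = np.random.default_rng(8)
Wb = rng.standard_normal((50, 50))
Wb2 = rng.standard_normal((50, 50))

F_direct = np.trace(Wb @ Wb2)          # O(n^3): forms the full product
F_cheap = np.sum(Wb.T * Wb2)           # O(n^2): elementwise with the transpose
```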
Power Spectrum - Implementation (V)
All gangs need dS/dC_b & D^-1 first, but these terms do not bin-parallelize.
Either
- calculate them using all processors & re-map, or
- calculate them on 1 gang & leave the others idle.
Balance the re-mapping/idling cost against the multi-gang benefit; the optimal trade-off depends on the architecture & data parameters.
For large data with many bins & an efficient communication network, this sustains 50%+ of peak on thousands of processors.
Aside : Parameterizing Architectures
• Portability is crucial, but architectures differ.
• Identify critical components & parameterize limits.
• Tune codes to each architecture.
Eg. ScaLAPACK - block-cyclic matrix distribution is the most efficient data layout (mapped to the re-use pattern). The hardware limit is cache size - set the blocksize so that the requisite number of blocks fit & fill the cache.
Eg. Input/Output - how many processors can efficiently read/write simultaneously (contention vs idling) ?
Conclusions
• Maximum likelihood spectral estimation cannot handle high-resolution all-sky observations, but Monte Carlo methods can (albeit with questions at low-l).
• For sufficiently cut-sky or low-resolution maps ML methods are feasible, and allow complete error-tracking.
• Dense matrix-matrix kernels achieve 50%+ of peak performance (cf. 5% for FFT & SHT).
• Inefficient implementation of data-delivery can reduce these by an order of magnitude or more; re-use each data unit as much as possible before moving on to the next.
• Things are only going to get harder !