The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming...
University Campus Bio-Medico of Rome
The context: bioimage data explosion
High-throughput imaging techniques have led to a bioimage data explosion
Submicrometer resolution
Terabyte-scale images
Heterogeneous images
Need for tools capable of processing these huge datasets
Which requirements?
Need for automation to extract information
Fully- or semi-automated analysis?
Is human assistance needed?
Whole mouse brain mapping
March 26, 2018 2
TeraStitcher: born for Confocal Light Sheet Microscope Images
Generates a 2D matrix of 3D tiles (image stacks)
Tiles are regularly spaced with reasonably reliable nominal positions
- Stack acquisition is performed with a single movement, guaranteeing a reliable, constant inter-slice distance along z
- Long-range movements (on the order of cm) make an alignment step necessary in all directions
Sufficient overlap between adjacent tiles
TeraStitcher: Ultra-terabyte images
Need to process 16-bit images larger than 20,000 x 20,000 x 3,000 voxels (~2.5 TB)
Increasingly high resolutions in X-Y
Multiple channels
Very large slices (e.g. 40K x 30K pixels)
Thousands of slices
Solutions:
Sparse datasets
Compressed formats
Parallelism
Multi-channel support
What is image stitching?
Image stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image (from Wikipedia)
TeraStitcher: overview
Initially developed for confocal light sheet microscopy (CLSM) images
Fast 2D approach to align adjacent stacks
Efficient use of memory resources
Minimum I/O workload (2 reads, 1 write)
Can generate a multi-resolution representation suited for further processing
TeraStitcher: software structure
The software structure provides means for incorporating new functionalities
Object-oriented software architecture and design patterns
Functional decomposition visible to the user through command line options
[Figure: UML class diagram of the software structure. Recoverable elements:]
Stitcher
+ computeDisplacements(algorithm_type: int, …)
+ projectDisplacements()
+ thresholdDisplacements(thres: float)
+ computeTilesPlacement(algorithm_type: int)
+ mergeTiles(blending_type: int, …)
«interface» Displacement
+ evalReliability(): float
+ getDisplacement(direction: int): int
+ threshold(thres: float)
+ project(displ: Displacement)
# def_displacements: int[3]
# displacements: int[3]
MIP-NCC-Displ
- MIP_displacements: int[6]
- NCC_widths: float[3]
- NCC_maxs: float[3]
- reliab_factors: float[3]
+ evalReliability(): float
+ …
«interface» PairwiseDisplAlgo
+ execute(…): Displacement
MIP-NCC
+ execute(…): Displacement
«interface» TilesPlacementAlgo
+ execute(…)
MST
+ execute(…)
Other elements: StackedVolume, Stack, VolumeManager
Relationship labels: «executes», «uses», «produces»; multiplicities 1 to N
Design patterns named: Abstract Factory, Strategy
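The Strategy arrangement in the diagram can be sketched as follows. The names Displacement and PairwiseDisplAlgo and the methods evalReliability, getDisplacement and execute come from the diagram. Everything else is hypothetical: the argument list of execute is elided ("…") on the slide, so the nominal offsets passed here are an assumption, and SimpleDisplacement, NominalAlgo and horizontalShift exist only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Interface named in the diagram; only two of its methods are sketched.
class Displacement {
public:
    virtual ~Displacement() = default;
    virtual float evalReliability() const = 0;
    virtual int getDisplacement(int direction) const = 0;
};

// Hypothetical concrete displacement: one offset per direction (V, H, D).
class SimpleDisplacement : public Displacement {
public:
    SimpleDisplacement(const std::array<int, 3>& d, float reliability)
        : displacements_(d), reliability_(reliability) {}
    float evalReliability() const override { return reliability_; }
    int getDisplacement(int direction) const override {
        return displacements_.at(static_cast<std::size_t>(direction));
    }
private:
    std::array<int, 3> displacements_;
    float reliability_;
};

// Strategy interface. The real parameters are not shown on the slide,
// so passing the nominal offsets is an assumption of this sketch.
class PairwiseDisplAlgo {
public:
    virtual ~PairwiseDisplAlgo() = default;
    virtual std::unique_ptr<Displacement>
    execute(const std::array<int, 3>& nominal) const = 0;
};

// Hypothetical algorithm that returns the nominal offsets unchanged.
class NominalAlgo : public PairwiseDisplAlgo {
public:
    std::unique_ptr<Displacement>
    execute(const std::array<int, 3>& nominal) const override {
        return std::unique_ptr<Displacement>(new SimpleDisplacement(nominal, 0.0f));
    }
};

// The caller depends only on the interface, so algorithms can be swapped.
inline int horizontalShift(const PairwiseDisplAlgo& algo,
                           const std::array<int, 3>& nominal) {
    return algo.execute(nominal)->getDisplacement(1);  // index 1 = H (assumed)
}
```

A new alignment method would then be added by deriving another class from PairwiseDisplAlgo, without touching the calling code.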
TeraStitcher: separating the strategy from the algorithm
Strategy (aims at minimizing memory requirements)
o Divide the volume into layers (substacks) along Z
o For each layer
o Keep in memory a row (or column) of substacks at a time
o For each pair of adjacent substacks, call the algorithm
Algorithm
o Compute the Maximum Intensity Projections (MIPs) along X, Y, and Z for the two substacks
o Apply 2D Normalized Cross-Correlation (NCC) to the three pairs of MIPs to find three 2D displacements
o Output the “most reliable” displacement for each direction
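The first step of the algorithm above, computing the MIPs of a substack, might look like the sketch below. The Substack type, its slice-by-slice memory layout and the function name are assumptions made for the example, not TeraStitcher's actual data structures.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical substack stored slice by slice:
// index = (z * height + y) * width + x.
struct Substack {
    std::size_t width, height, depth;
    std::vector<std::uint16_t> voxels;
    std::uint16_t at(std::size_t x, std::size_t y, std::size_t z) const {
        return voxels[(z * height + y) * width + x];
    }
};

enum class Axis { X, Y, Z };

// Collapse one axis by keeping the brightest voxel along it.
// Output shapes (rows x cols): Z -> height x width,
// Y -> depth x width, X -> depth x height.
std::vector<std::uint16_t> maxIntensityProjection(const Substack& s, Axis axis) {
    const std::size_t rows = (axis == Axis::Z) ? s.height : s.depth;
    const std::size_t cols = (axis == Axis::X) ? s.height : s.width;
    std::vector<std::uint16_t> mip(rows * cols, 0);
    for (std::size_t z = 0; z < s.depth; ++z) {
        for (std::size_t y = 0; y < s.height; ++y) {
            for (std::size_t x = 0; x < s.width; ++x) {
                const std::size_t r = (axis == Axis::Z) ? y : z;
                const std::size_t c = (axis == Axis::X) ? y : x;
                std::uint16_t& out = mip[r * cols + c];
                out = std::max(out, s.at(x, y, z));
            }
        }
    }
    return mip;
}
```

Each projection reduces a 3D substack to a 2D image, which is why the subsequent cross-correlation can be done in 2D on three image pairs instead of in 3D on the full volumes.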
TeraStitcher: global optimization step
Global optimization is performed using computed (or even nominal) alignments and their reliability
The resulting alignments are used to generate the xml file
Manual intervention is possible to correct isolated errors
[Figure: grid of tiles in which each pair of adjacent tiles is labeled with its displacements along V, H, and D, each followed by a value in parentheses. Labels with ∞ read exactly 375 or 0, e.g. V:0(∞) H:375(∞) D:0(∞); labels with finite values deviate slightly, e.g. V:372(1,37) H:1(1,30) D:0(1,40). Axes: horizontal axis H, vertical axis V. Panels: MST along V, MST along H, MST along D.]
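A minimum spanning tree over the tile graph, as named in the figure, can be sketched with Kruskal's algorithm. This is an illustration under stated assumptions, not TeraStitcher's implementation: the value in parentheses is treated as an edge weight where smaller means more trusted and infinity marks a displacement that was not reliably computed, and the Edge type, the function names and the sample numbers used with them are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Edge between two adjacent tiles along one axis (hypothetical type).
struct Edge {
    std::size_t a, b;
    double weight;     // assumed: smaller = more trusted, infinity = fallback
    int displacement;  // offset of tile b relative to tile a
};

// Minimal union-find used by Kruskal's algorithm.
struct DisjointSets {
    std::vector<std::size_t> parent;
    explicit DisjointSets(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    bool unite(std::size_t x, std::size_t y) {
        x = find(x); y = find(y);
        if (x == y) return false;
        parent[x] = y;
        return true;
    }
};

// Kruskal: take edges in order of increasing weight, skipping any edge
// that would close a cycle. Infinite weights sort last.
std::vector<Edge> minimumSpanningTree(std::size_t tileCount, std::vector<Edge> edges) {
    std::stable_sort(edges.begin(), edges.end(),
                     [](const Edge& l, const Edge& r) { return l.weight < r.weight; });
    DisjointSets sets(tileCount);
    std::vector<Edge> tree;
    for (const Edge& e : edges)
        if (sets.unite(e.a, e.b)) tree.push_back(e);
    return tree;
}

// Absolute positions along one axis, obtained by walking the tree from tile 0.
std::vector<int> placeTiles(std::size_t tileCount, const std::vector<Edge>& tree) {
    std::vector<int> pos(tileCount, 0);
    std::vector<bool> placed(tileCount, false);
    if (tileCount == 0) return pos;
    placed[0] = true;
    // Repeated sweeps are sufficient for the small graphs of a sketch.
    for (std::size_t pass = 0; pass < tileCount; ++pass) {
        for (const Edge& e : tree) {
            if (placed[e.a] && !placed[e.b]) {
                pos[e.b] = pos[e.a] + e.displacement; placed[e.b] = true;
            } else if (placed[e.b] && !placed[e.a]) {
                pos[e.a] = pos[e.b] - e.displacement; placed[e.a] = true;
            }
        }
    }
    return pos;
}
```

Running one such tree per axis (V, H, D) matches the three panels of the figure: each tile position is derived only from the most trusted chain of pairwise displacements that reaches it.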
TeraStitcher: developments and issues
Parallelization: maintenance issues
- different alternatives to exploit parallel processing
- different code for sequential and parallel versions
Solution: launching multiple instances on disjoint sub-regions
- command line options to specify the sub-region to process
- use a scripting language (Python) to implement a driver
Alignment computation (computation bound)
- each instance processes a group of adjacent sub-stacks
- merge the XML files with alignment information
Generation of the stitched image (I/O bound)
- the directory structure is first generated
- each instance writes disjoint groups of tiles in parallel
- metadata to be used in subsequent steps are finally generated
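The core of such a driver is the assignment of disjoint, contiguous groups of sub-stacks to instances. The sketch below shows one way to compute that assignment; the Range type and the function name are hypothetical, and the actual driver described on the slide is a Python script whose details are not given.

```cpp
#include <cstddef>
#include <vector>

// Half-open interval [first, last) of sub-stack indices (hypothetical type).
struct Range {
    std::size_t first, last;
};

// Split itemCount adjacent sub-stacks into instanceCount contiguous,
// disjoint groups whose sizes differ by at most one.
std::vector<Range> splitAmongInstances(std::size_t itemCount,
                                       std::size_t instanceCount) {
    std::vector<Range> ranges;
    if (instanceCount == 0) return ranges;
    const std::size_t base = itemCount / instanceCount;
    const std::size_t extra = itemCount % instanceCount;
    std::size_t start = 0;
    for (std::size_t i = 0; i < instanceCount; ++i) {
        // The first 'extra' instances take one additional item.
        const std::size_t size = base + (i < extra ? 1 : 0);
        ranges.push_back({start, start + size});
        start += size;
    }
    return ranges;
}
```

Each range would then be passed to one instance through the command line options that select the sub-region to process.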
Why do we need GPUs?
Tile size: 2048 x 2048 x 2950 voxels (16 bits per voxel)
Tile matrix: 3 x 3
Total dataset size: tile size x number of tiles (>220 GByte)
Stitching on one Power8 (2.061 GHz) core: ~15,000 seconds!
Stitching on one Xeon E5-2640 (2.6 GHz) core: ~20,000 seconds!
GPUs come to the rescue!
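The dataset size quoted above follows directly from the tile dimensions. The short check below redoes the arithmetic at compile time; the function and constant names are chosen for the example only.

```cpp
#include <cstdint>

// Bytes occupied by one tile of nx * ny * nz voxels.
constexpr std::uint64_t tileBytes(std::uint64_t nx, std::uint64_t ny,
                                  std::uint64_t nz, std::uint64_t bytesPerVoxel) {
    return nx * ny * nz * bytesPerVoxel;
}

// 2048 x 2048 x 2950 voxels at 16 bits (2 bytes) per voxel.
constexpr std::uint64_t kTileBytes = tileBytes(2048, 2048, 2950, 2);

// 3 x 3 tile matrix.
constexpr std::uint64_t kDatasetBytes = kTileBytes * 3 * 3;

static_assert(kTileBytes == 24746393600ULL, "about 24.7 GB per tile");
static_assert(kDatasetBytes == 222717542400ULL, "about 222.7 GB in total");
static_assert(kDatasetBytes > 220000000000ULL, "consistent with '>220 GByte'");
```

The figure is in decimal gigabytes; expressed in binary units the same dataset is roughly 207 GiB.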
Normalized Cross Correlation
The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation (NCC) between the Maximum Intensity Projections (MIPs) of the tiles.
NCC is a variant of the classic cross-correlation in which data are normalized by subtracting the mean and dividing by the standard deviation of the two datasets.
TeraStitcher implements a fast NCC from J. P. Lewis, “Fast Template Matching”, Vision Interface (1995).
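Written out, the definition above takes the following standard form, where f and t are the two image windows being compared and the bars denote their means. The index notation is chosen for this note; the second line is a general identity that allows the numerator to be accumulated with only one of the two factors mean-subtracted.

```latex
\[
\mathrm{NCC}(f,t) =
\frac{\sum_{i,j} f'_{ij}\, t'_{ij}}
     {\sqrt{\sum_{i,j} {f'_{ij}}^{2} \; \sum_{i,j} {t'_{ij}}^{2}}},
\qquad
f'_{ij} = f_{ij} - \bar{f}, \quad t'_{ij} = t_{ij} - \bar{t}
\]
\[
\sum_{i,j} f_{ij}\, t'_{ij} = \sum_{i,j} f'_{ij}\, t'_{ij}
\quad \text{because} \quad \sum_{i,j} t'_{ij} = 0
\]
```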
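CPU Normalized Cross Correlation
/* First pass: means of the two dimi x dimj windows.
   stride is the padding skipped at the end of each row. */
f_mean = t_mean = 0.;
for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) {
    for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) {
        f_mean += *pxl1;
        t_mean += *pxl2;
    }
}
f_mean /= (dimi*dimj);
t_mean /= (dimi*dimj);
/* Second pass: numerator and the two normalization factors.
   Row length in memory is stride+dimj. */
numerator = factor1 = factor2 = 0.;
for ( ij=0, pxl1=im1, pxl2=im2; ij<dimi*dimj; ij++) {
    f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean;
    t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean;
    /* the raw pixel is used here; the sum is unchanged because t_prime sums to zero */
    numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime;
    factor1 += f_prime * f_prime;
    factor2 += t_prime * t_prime;
}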
From CPU to GPU NCC
The previous fragment of code is repeated (2*delayu+1) x (2*delayv+1) times at each iteration (there are a few hundred iterations per run).
A typical value of delayu and delayv is ~100
We developed a gpu_NCC kernel that does all the work with a single invocation per iteration, in order to:
- minimize the overhead due to the invocation
- allow overlapping of memory copies from CPU to GPU (using streams)
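The repeated evaluation over all (2*delayu+1) x (2*delayv+1) offsets can be sketched on the CPU as below. This is an illustrative reconstruction, not the TeraStitcher code: the window geometry (a fixed central window in the first image compared against a shifted window in the second), the names ncc, bestOffset, Offset and pitch, and the use of float images are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// NCC of two dimi x dimj windows taken from images whose rows are
// 'pitch' elements long (assumed layout).
double ncc(const float* w1, const float* w2,
           std::size_t dimi, std::size_t dimj, std::size_t pitch) {
    const double n = static_cast<double>(dimi * dimj);
    double f_mean = 0.0, t_mean = 0.0;
    for (std::size_t i = 0; i < dimi; ++i)
        for (std::size_t j = 0; j < dimj; ++j) {
            f_mean += w1[i * pitch + j];
            t_mean += w2[i * pitch + j];
        }
    f_mean /= n;
    t_mean /= n;
    double numerator = 0.0, factor1 = 0.0, factor2 = 0.0;
    for (std::size_t i = 0; i < dimi; ++i)
        for (std::size_t j = 0; j < dimj; ++j) {
            const double f_prime = w1[i * pitch + j] - f_mean;
            const double t_prime = w2[i * pitch + j] - t_mean;
            numerator += f_prime * t_prime;
            factor1 += f_prime * f_prime;
            factor2 += t_prime * t_prime;
        }
    const double denom = std::sqrt(factor1 * factor2);
    return denom > 0.0 ? numerator / denom : 0.0;  // flat windows score 0
}

struct Offset {
    int u, v;      // row and column shift of the second image
    double score;  // NCC at that shift
};

// Exhaustive search: (2*delayu+1) x (2*delayv+1) NCC evaluations.
// Requires rows > 2*delayu and cols > 2*delayv.
Offset bestOffset(const std::vector<float>& im1, const std::vector<float>& im2,
                  std::size_t rows, std::size_t cols, int delayu, int delayv) {
    const std::size_t du = static_cast<std::size_t>(delayu);
    const std::size_t dv = static_cast<std::size_t>(delayv);
    const std::size_t dimi = rows - 2 * du;
    const std::size_t dimj = cols - 2 * dv;
    const float* fixed = im1.data() + du * cols + dv;
    Offset best{0, 0, -2.0};  // NCC is never below -1
    for (int u = -delayu; u <= delayu; ++u)
        for (int v = -delayv; v <= delayv; ++v) {
            const std::size_t r0 = static_cast<std::size_t>(delayu + u);
            const std::size_t c0 = static_cast<std::size_t>(delayv + v);
            const double s = ncc(fixed, im2.data() + r0 * cols + c0, dimi, dimj, cols);
            if (s > best.score) best = Offset{u, v, s};
        }
    return best;
}
```

With delayu = delayv = 100, the inner NCC runs 201 x 201 = 40,401 times per call, which is the cost the single-invocation GPU kernel is meant to absorb.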
GPU NCC
The gpu_NCC kernel relies heavily on shuffle primitives to compute the mean values required by the NCC
#define CALCSUM(v) (v)=warpReduceSumD((v)); \
    if(lane==0) shared[wid]=(v); \
    __syncthreads(); \
    (v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \
    if(wid==0) (v)=warpReduceSumD((v)); \
    if(tid==0) shared[0]=(v); \
    __syncthreads(); \
    (v)=shared[0]; \
    __syncthreads();
#define CALCAVE(v) CALCSUM(v) \
    (v) /= (dimi*dimj);
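The data flow of the CALCSUM macro can be followed with a plain CPU emulation. The body of warpReduceSumD is not shown in the slides; the sketch assumes it is a shuffle-down tree reduction of the kind usually written with CUDA warp shuffle intrinsics such as `__shfl_down_sync`. The function names below are hypothetical, and the loops only mimic what the 32 lanes of a warp would do in parallel.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;

// Emulated shuffle-down reduction over the 32 lanes of one warp.
// In each round, lane i adds the value held by lane i + offset.
// Visiting lanes in ascending order means every read sees the value
// from before this round, as a simultaneous shuffle would.
double warpReduceSumEmulated(std::vector<double> lanes) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2)
        for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane)
            lanes[lane] += lanes[lane + offset];
    return lanes[0];  // only lane 0 holds the full sum
}

// Emulated block-level sum following the steps of CALCSUM:
// 1. each warp reduces its own values,
// 2. lane 0 of each warp stores the partial sum in shared memory,
// 3. the first warp loads the partial sums (zero for unused lanes)
//    and reduces them again.
// threadValues.size() must be a multiple of 32 and at most 32 * 32.
double blockReduceSumEmulated(const std::vector<double>& threadValues) {
    const std::size_t warps = threadValues.size() / kWarpSize;
    std::vector<double> shared(kWarpSize, 0.0);
    for (std::size_t wid = 0; wid < warps; ++wid) {
        const auto first = threadValues.begin() +
                           static_cast<std::ptrdiff_t>(wid * kWarpSize);
        const std::vector<double> lanes(first,
                                        first + static_cast<std::ptrdiff_t>(kWarpSize));
        shared[wid] = warpReduceSumEmulated(lanes);
    }
    return warpReduceSumEmulated(shared);
}
```

On the GPU the two levels are separated by `__syncthreads()`, as in the macro, so that every partial sum is written to shared memory before the first warp reads it.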
Test Platform
IBM S822LC HPC, courtesy of IBM Italy (G. Richelli)
512 GB of RAM
16 Power8 cores (SMT 8)
2 x 960 GB solid state disks
4 P100 GPUs
Time comparison (220 GB test case)
Configuration                         Total time (s)   NCC time (s)
1 Power8 process                      15300            14617
2 Power8 processes                    8191
4 Power8 processes                    4100
16 Power8 processes                   1168
64 Power8 processes                   2457             2200
1 process using one P100              580              70
2 processes, each using a P100        344
4 processes, each using a P100        174
8 processes sharing 4 P100            93
16 processes sharing 2 P100           115
16 processes sharing 4 P100           56               17.5
Results on other GPUs
Running on a Titan V (Volta), the GPU time drops from 70 seconds to about 42.5 seconds
Running on a K80, the GPU time increases to 493 seconds!
The nvprof tool helps in understanding why the time increases much more than expected…
K80 vs. Volta (average values)
The difference in performance is due to the memory subsystem of the Volta
Other Features of the GPU version
There are no changes with respect to the original CPU implementation, just tiny additions
Can compile on platforms without a GPU by turning off a macro in the CMake configuration file
CUDA version activated by means of an environment variable
Simple mechanism that works on any platform
CUDA version tested under Linux, macOS, Windows
We are testing/evaluating an OpenCL version
CUDA code will soon be available on GitHub
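The combination of a compile-time macro and a run-time environment variable could be arranged as in the sketch below. The macro name SKETCH_WITH_CUDA, the variable name SKETCH_USE_CUDA and both functions are hypothetical, since the slides do not give the real names; the rule that an empty value or "0" disables the GPU path is also an assumption.

```cpp
#include <cstdlib>
#include <cstring>

// Interprets an environment value: unset, empty or "0" means disabled.
inline bool flagEnabled(const char* value) {
    return value != nullptr && std::strcmp(value, "") != 0 &&
           std::strcmp(value, "0") != 0;
}

// Decides at run time whether to take the GPU code path.
inline bool useCudaPath() {
#ifdef SKETCH_WITH_CUDA
    // Built with GPU support: the environment variable makes the choice.
    return flagEnabled(std::getenv("SKETCH_USE_CUDA"));
#else
    // Built without GPU support: the CPU path is the only option.
    return false;
#endif
}
```

Keeping the decision behind one function is what lets the CPU code remain unchanged: callers branch on useCudaPath() and the GPU-specific code is compiled only when the macro is defined.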
TeraStitcher: current status
Open source multi-platform software (https://abria.github.io/TeraStitcher)
Available both as a standalone application (with GUI) and as a plugin for Vaa3D
The CPU version is already in use in many organizations, including:
Wyss Center, Geneva, Switzerland
Renier Lab, ICM Brain and Spine Institute, Paris, France
Chung's Lab, MIT, Boston, USA
Adam Glaser, University of Washington, Seattle, USA
Tomer Lab, Columbia University, New York, USA
Deisseroth Lab, Stanford University, CA, USA
Several users have already expressed interest in the CUDA version.
Credits (1/2): image processing, bioinformatics, machine learning
Projects: TeraStitcher, Vaa3D-TeraFly, Brain cell finder
Prof. Giulio Iannello
Alessandro Bria
Leonardo Onofri
Prof. Paolo Frasconi
Paolo Soda
Hanchuan Peng
Roberto Cortini
Credits (2/2): confocal light sheet microscopy, sample processing
Prof. Francesco Pavone
Leonardo Sacconi
Ludovico Silvestri
Irene Costantini
International Center of Computational Neurophotonics
Credits: Jonathan Liu and Adam Glaser (Univ. of Washington) built the microscope and acquired the data. The samples were prepared by: Joshua Vaughan, University of Washington (the expanded brain); Rusty Nicovich, Allen Institute (the entire brain slice); Michael Gerner, University of Washington (mouse lymph node and mouse lung)
Brain cell finder (1/5): overview
Developed for confocal light sheet microscopy (CLSM) images [7]
Applied to localize and count the Purkinje cells in the cerebellum of an L7-GFP mouse
First complete map of a selected neuronal population in a large area of the mouse brain
Brain cell finder (2/5): semantic deconvolution
Boosts weak somata and decreases the voxel intensities in non-soma regions
Use of a neural network (NN) trained to map the original image into an ideal one
10 substacks covering different regions were manually annotated to feed data to the NN
[Figure panels: original image; ideal image (reference); image filtered by the trained neural network. Labels: training, testing.]
Brain cell finder (3/5): mean shift clustering
First we get rid of very dark voxels, which are unlikely to be part of a soma
Seeds are chosen as local maxima exceeding a maximum entropy-based optimal threshold
The mean shift algorithm then:
- Places a spherical kernel of radius R on each seed
- Shifts each point towards the mean value computed as the kernel-weighted average of the data
- Repeats the previous steps until convergence is achieved
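The mean shift iteration described above can be sketched as follows. This is a minimal version under stated assumptions rather than the Brain cell finder implementation: the kernel is flat (every voxel inside the sphere of radius R counts equally), voxel intensities are used as weights, and the types, the function name, the tolerance and the iteration cap are all chosen for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 {
    double x, y, z;
};

struct Voxel {
    Point3 p;
    double intensity;  // used as the weight in the average (assumption)
};

// Moves one seed to the intensity-weighted mean of the voxels inside a
// sphere of the given radius, and repeats until the shift is negligible.
Point3 meanShift(Point3 seed, const std::vector<Voxel>& voxels, double radius,
                 double tolerance = 1e-6, int maxIterations = 100) {
    Point3 center = seed;
    for (int it = 0; it < maxIterations; ++it) {
        double w = 0.0, sx = 0.0, sy = 0.0, sz = 0.0;
        for (const Voxel& v : voxels) {
            const double dx = v.p.x - center.x;
            const double dy = v.p.y - center.y;
            const double dz = v.p.z - center.z;
            if (dx * dx + dy * dy + dz * dz <= radius * radius) {
                w += v.intensity;
                sx += v.intensity * v.p.x;
                sy += v.intensity * v.p.y;
                sz += v.intensity * v.p.z;
            }
        }
        if (w <= 0.0) break;  // nothing inside the kernel: stay in place
        const Point3 next{sx / w, sy / w, sz / w};
        const double shift = std::sqrt((next.x - center.x) * (next.x - center.x) +
                                       (next.y - center.y) * (next.y - center.y) +
                                       (next.z - center.z) * (next.z - center.z));
        center = next;
        if (shift < tolerance) break;  // converged
    }
    return center;
}
```

Seeds that converge to the same location can then be merged, so that each resulting cluster center stands for one candidate soma.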
Brain cell finder (4/5): manifold filter
Exploiting anatomical knowledge
Cells are not scattered randomly in the 3D space but are laid out in manifolds
Isolated or off-manifold cells are discarded
Brain cell finder (5/5): results
The performance of the pipeline was estimated on 56 substacks (~1.09 GigaVoxels) manually annotated with Vaa3D
- F1 = 0.96
A second human operator annotated the same 56 substacks and achieved F1 = 0.98
Processing the whole mouse cerebellum (120 GigaVoxels) yielded a cell count that agrees with stereology data