The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming...

University Campus Bio-Medico of Rome

The context: bioimage data explosion

High-throughput imaging techniques have led to bioimage data explosion

Submicrometer resolution

TeraByte scale images

Heterogeneous images

Need for tools capable to process these huge datasets

Which requirements?

Need for automation to extract information

Fully- or semi-automated analysis?

Is human assistance needed?

Whole mouse brain mapping

March 26, 2018 2


TeraStitcher: born for Confocal Light Sheet Microscope Images

Generates a 2D matrix of 3D tiles (image stacks)

Tiles are regularly spaced with reasonably reliable nominal positions

- Stack acquisition is performed with a single movement guaranteeing a reliable constant inter-slices distance along z

- Long range movements (order of cm) make necessary an alignment step in all directions

Sufficient overlap between adjacent tiles

March 26, 2018 4


TeraStitcher: Ultra-terabyte images

Need to process 16 bits images larger than20,000 x 20,000 x 3,000 voxels (~2.5 TB)

Increasingly high resolutions in X-Y

Multiple channels

Very large slices (e.g. 40K x 30K pixels)

Thousands of slices Solutions:

Sparse datasets

Compressed formats

Parallelism

Multi-channel support

March 26, 2018 5


What is image stitching?

Image stitching is theprocess of combiningmultiple photographicimages withoverlapping fields ofview to produce asegmented panoramaor high-resolutionimage (from Wikipedia)

March 26, 2018 7

Initially developed for confocal light sheet microscopy (CLSM) images

Fast 2D approach to align adjacent stacks

Efficient use of memory resources

Minimum I/O workload (2 readings, 1 writing)

Can generate a multi-resolution representation suited for further processing


TeraStitcher: overview

March 26, 2018 8


TeraStitcher: software structure

The software structure provides means for incorporating new functionalities

Object-oriented software architecture and design patterns

Functional decomposition visible to the user through command line options

Stitcher

StackedVolume

MIP-NCC-Displ

- MIP_displacements: int[6]

- NCC_widths: float[3]

- NCC_maxs: float[3]

- reliab_factors: float[3]

+ evalReliability(): float

+ …

…

MST

+ execute(…)

…

Stack

«interface»

PairwiseDisplAlgo

+ execute(…): Displacement

… MIP-NCC

+ execute(…): Displacement

«executes» «executes»

«uses»

Abstract Factory

VolumeManager

1

N

1 N

«interface»

Displacement

+ evalReliability(): float

+ getDisplacement(direction: int): int

+ threshold(thres: float)

+ project(displ: Displacement)

# def_displacements: int[3]

# displacements: int[3]

«produces»

Strategy

Stitcher

+ computeDisplacements(algorithm_type: int, …)

+ projectDisplacements() + thresholdDisplacements(thres: float)

+ computeTilesPlacement(algorithm_type: int)

+ mergeTiles(blending_type: int,…)

«interface»

TilesPlacementAlgo

+ execute(…)

«uses»

«uses»

March 26, 2018 9

Strategy (aims at minimizing memory requirements)

o Divide the volume into layers (substacks) along Z

o For each layer

o Keep in memory a row (or column) of substacks at the time

o For each pair of adjacent substacks, call

Algorithm

o Compute the Maximum Intensity Projections (MIPs)along X, Y, and Z for the two substacks

o Apply 2D Normalized Cross-Correlation (NCC) to thethree pairs of MIPs to find three 2D displacements

o Output “most-reliable” displacement for each direction


TeraStitcher: separating the strategy from the algorithm

March 26, 2018 10


TeraStitcher: global optimization step

Global optimization is performed using computed (or even nominal) alignments and their reliability

The resulting alignments are used to generate the xml file

Manual intervention is possible to correct isolated errors

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0(∞ )

V:1(1,35)

H:372(1,40)

D:-3(1,09)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:372(1,37)

H:1(1,30)

D:0(1,40)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:371(1,36)

H:2(1,35)

D:2(1,33)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:0(∞ )

H:373(1,15)

D:2(1,11)

V:1(1,40)

H:375(1,26)

D:-1(1,12)

V:0(∞ )

H:375 (∞ )

D:0(∞ )

V:0(1,32)

H:373(1,22)

D:-1(1,10)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:372(1,41)

H:3(1,22)

D:-4(1,35)

V:372(1,21)

H:4(1,35)

D:-1(1,31)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:1(1,37)

H:371(1,35)

D:-3(1,37)

V:3(1,34)

H:373(1,23)

D:0(1,08)

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:372(1,35)

H:3(1,22)

D:-1(1,37)

V:374(1,35)

H:5(1,22)

D:-1(1,12)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:374(1,25)

H:6(1,15)

D:2(1,12)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

horizontal axis H

vert

ical

axi

s V

MST along V

MST along H

MST along D

March 26, 2018 12


TeraStitcher: developments and issues

Parallelization: maintenance issues different alternatives to exploit parallel processing

different code for sequential and parallel version

Solution: launching multiple instances on disjoint sub-regions command line options to specify the sub-region to process

use a scripting language (python) to implement a driver

Alignment computation (computation bound) each instance processes a group of adjacent sub-stacks

merge the xml files with alignment information

Generation of the stitched image (I/O bound) the directory structure is first generated

each instance writes disjoint groups of tiles in parallel

metadata to be used in subsequent steps are finally generated

March 26, 2018 13

Why do we need GPUs?

Tile size: 2048 x 2048 x 2950 voxel (16 bits per voxel)

Tile matrix: 3 x 3

Total dataset size: Tile size x #tiles (>220 GByte)

Stitching on one Power8 (2.061 Ghz) core: ~15000 secs!

Stitching on one Xeon ES2640 (2.6 Ghz) core: ~20000 secs!

GPUs come to rescue!

March 26, 2018 16

Normalized Cross Correlation The most time-consuming part of the stitching

process is the evaluation of the Normalized Cross-Correlation (NCC) between the Maximum IntensityProjections (MIP) of the tiles.

NCC is a variant of the classic Cross Correlation inwhich data are normalized by subtracting the meanand dividing by the standard deviation of the twodatasets.

TeraStitcher implements a fast NCC from J. P. Lewis,“Fast Template Matching”, Vision Interface (1995).

March 26, 2018 17

CPU Normalized Cross Correlationf_mean = t_mean = 0.;

for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) {

for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) {

f_mean += *pxl1;

t_mean += *pxl2;

}

}

f_mean /= (dimi*dimj);

t_mean /= (dimi*dimj);

numerator = factor1 = factor2 = 0.;

for ( ij=0, pxl1=im1, pxl2=im2; ij<dimi*dimj; ij++) {

f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean;

t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean;

numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime;

factor1 += f_prime * f_prime;

factor2 += t_prime * t_prime;

}

March 26, 2018 18

The previous fragment of code is repeated(2*delayu+1) x (2*delayv+1) times at each iteration(there are few hundreds iterations per run).

A typical value of delayu and delayv is ~ 100

We developed a gpu_NCC kernel that does all the workwith a single invocation per iteration:

minimize the overhead due to the invocation

allow to overlap memory copies from CPU to GPU(using streams)

March 26, 2018 19

From CPU TO GPU NCC

The gpu_NCC kernel heavily relies on shuffle primitives tocompute the mean values required by the NCC

#define CALCSUM(v) (v)=warpReduceSumD((v)); \

if(lane==0) shared[wid]=(v); \

__syncthreads(); \

(v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \

if(wid==0) (v)=warpReduceSumD((v)); \

if(tid==0) shared[0]=(v); \

__syncthreads(); \

(v)=shared[0]; \

__syncthreads();

#define CALCAVE(v) CALCSUM(v) \

(v) /= (dimi*dimj);

March 26, 2018 20

GPU NCC

IBM S822LC HPC, courtesy of IBM Italy (G. Richelli)

512 GB of RAM,

16 Power8 cores (SMT 8)

2 960 GB solid state disks

4 P100 GPUs.

March 26, 2018 21

Test Platform

Time comparison (220 GB testcase)1 Power8 process 15300 seconds (14617 for the NCC)

2 Power8 processes 8191 seconds

4 Power8 processes 4100 seconds

16 Power8 processes 1168 seconds!

64 Power8 processes 2457 seconds (2200 for the NCC)!

1 process using one P100 580 seconds (70 for the NCC)

2 processes each one using a P100 344 seconds

4 processes each one using a P100 174 seconds

8 processes sharing 4 P100 93 seconds

16 processes sharing 2 P100 115 seconds

16 processes sharing 4 P100 56 seconds! (17.5 for the NCC)March 26, 2018 22

Results on other GPUs Running on a Titan V(olta) the GPU time reduces from

70 seconds to (about) 42.5 seconds

Running on a K80 the GPU time increases up to 493seconds!

The nvprof tool helps in understanding why timeincreases much more than expected…

March 26, 2018 23

K80 vs. Volta (average values)

March 26, 2018 24

The difference in performance is due to the memory subsystem of the Volta

There are no-changes with respect to the original CPUimplementation, just tiny additions

Can compile on platforms without GPU by turning-off amacro in the CMAKE configuration file

CUDA version activated by means of an env variable

Simple mechanism that works on any platform

CUDA version tested under Linux, MacOS, Windows

We are testing/evaluating an OpenCL version

CUDA code will be soon available from github

March 26, 2018 25

Other Features of the GPU version


TeraStitcher: current status

Open source multi-platform software (https://abria.github.io/TeraStitcher)

Available both as standalone application (with GUI) and as plugin for Vaa3D

The CPU version is already in use in many organizations including

WYSS Center, Geneve Switzerland

Renier Lab, ICM Brain and Spine Institute, Paris, France

Chung's Lab, MIT, Boston, USA

Adam Glaser, Washington University, Seattle, USA

Tomer Lab, Columbia University, New York, USA

Deisseroth Lab, Stanford University, CA, USA

Several users already expressed interest for the CUDA version.

TeraStitcher Brain Cell FinderVaa3D-TeraFlyTeraConverterMarch 26, 2018 26

https://abria.github.io/TeraStitcher


Credits (1/2): image processing, bioinformatics, machine learning

TeraStitcher

Vaa3D-TeraFly

Brain cell finder

Prof. Giulio Iannello

Alessandro Bria Leonardo Onofri

Prof. Paolo Frasconi Paolo Soda

Hanchuan Peng

Roberto Cortini

March 26, 2018 27


Credits (2/2): confocal light sheet microscopy, sample processing

Prof. Francesco Pavone Leonardo SacconiLudovico Silvestri

Irene Costantini

International Center of

Computational Neurophotonics

March 26, 2018 28

Credits: Jonathan Liu and Adam Glaser, Univ. of Washington have built the microscope and acquired the datathe samples were prepared by: Joshua Vaughan, University of Washington (the expanded brain), Rusty Nicovich, Allen Institute (the entire brain slice), Michael Gerner, University of Washington (mouse lymph node and mouse lung)

March 26, 2018 30

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell Finder


Brain cell finder (1/5): overview

Developed for confocal light sheet microscopy (CLSM) images [7]

Applied to localize and count the Purkinje cells in the cerebellum of an L7-GFP mouse

First complete map of a selected neuronal population in a large area of the mouse brain

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell FinderMarch 26, 2018 32

Brain cell finder (2/5): semantic deconvolution

Boosts weak somata and decreases thevoxel intensities in non-soma regions

Use of a neural network (NN) trained tomap the original image into an ideal one

10 substacks covering different regions weremanually annotated to feed data to the NN

Original imageIdeal image (reference)

Image filtered by the trained

neural networktr

ain

ing

test

ing



Brain cell finder (3/5): mean shift clustering

First we get rid of very dark voxelswhich are unlikely to be part of a soma

Seeds are chosen as local maximaexceeding a maximum entropy-basedoptimal threshold

The mean shift algorithm then:

Places a spherical kernel of radius R oneach seed

Shifts each point towards the meanvalue computed as the kernel-weighted average of the data

Repeats the previous steps untilconvergence is achieved



Brain cell finder (4/5): manifold filter

Exploiting anatomical knowledge

Cells are not scattered randomly inthe 3D space but are laid out inmanifolds

Isolated or off-manifold cells arediscarded



Brain cell finder (5/5): results

Performances of the pipeline were estimatedon 56 substacks (~1.09 GigaVoxels) manuallyannotated with Vaa3D

- F1 = 0.96

A second human operator annotated the same

56 substacks and achieved F1 = 0.98

Processing the whole mouse cerebellum (120GigaVoxels) yielded cells, which agrees withstereology data


The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming...

Documents

Transcript of The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming...