The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming...

32

Transcript of The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming...

Page 1: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation
Page 2: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

The context: bioimage data explosion

High-throughput imaging techniques have led to bioimage data explosion

Submicrometer resolution

TeraByte scale images

Heterogeneous images

Need for tools capable to process these huge datasets

Which requirements?

Need for automation to extract information

Fully- or semi-automated analysis?

Is human assistance needed?

Whole mouse brain mapping

March 26, 2018 2

Page 3: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

TeraStitcher: born for Confocal Light Sheet Microscope Images

Generates a 2D matrix of 3D tiles (image stacks)

Tiles are regularly spaced with reasonably reliable nominal positions

- Stack acquisition is performed with a single movement guaranteeing a reliable constant inter-slices distance along z

- Long range movements (order of cm) make necessary an alignment step in all directions

Sufficient overlap between adjacent tiles

March 26, 2018 4

Page 4: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

TeraStitcher: Ultra-terabyte images

Need to process 16 bits images larger than20,000 x 20,000 x 3,000 voxels (~2.5 TB)

Increasingly high resolutions in X-Y

Multiple channels

Very large slices (e.g. 40K x 30K pixels)

Thousands of slices Solutions:

Sparse datasets

Compressed formats

Parallelism

Multi-channel support

March 26, 2018 5

Page 5: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

?

Page 6: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

What is image stitching?

Image stitching is theprocess of combiningmultiple photographicimages withoverlapping fields ofview to produce asegmented panoramaor high-resolutionimage (from Wikipedia)

March 26, 2018 7

Page 7: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Initially developed for confocal light sheet microscopy (CLSM) images

Fast 2D approach to align adjacent stacks

Efficient use of memory resources

Minimum I/O workload (2 readings, 1 writing)

Can generate a multi-resolution representation suited for further processing

University Campus Bio-Medico of Rome

TeraStitcher: overview

March 26, 2018 8

Page 8: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

TeraStitcher: software structure

The software structure provides means for incorporating new functionalities

Object-oriented software architecture and design patterns

Functional decomposition visible to the user through command line options

Stitcher

StackedVolume

MIP-NCC-Displ

- MIP_displacements: int[6]

- NCC_widths: float[3]

- NCC_maxs: float[3]

- reliab_factors: float[3]

+ evalReliability(): float

+ …

MST

+ execute(…)

Stack

«interface»

PairwiseDisplAlgo

+ execute(…): Displacement

… MIP-NCC

+ execute(…): Displacement

«executes» «executes»

«uses»

Abstract Factory

VolumeManager

1

N

1 N

«interface»

Displacement

+ evalReliability(): float

+ getDisplacement(direction: int): int

+ threshold(thres: float)

+ project(displ: Displacement)

# def_displacements: int[3]

# displacements: int[3]

«produces»

Strategy

Stitcher

+ computeDisplacements(algorithm_type: int, …)

+ projectDisplacements() + thresholdDisplacements(thres: float)

+ computeTilesPlacement(algorithm_type: int)

+ mergeTiles(blending_type: int,…)

«interface»

TilesPlacementAlgo

+ execute(…)

«uses»

«uses»

March 26, 2018 9

Page 9: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Strategy (aims at minimizing memory requirements)

o Divide the volume into layers (substacks) along Z

o For each layer

o Keep in memory a row (or column) of substacks at the time

o For each pair of adjacent substacks, call

Algorithm

o Compute the Maximum Intensity Projections (MIPs)along X, Y, and Z for the two substacks

o Apply 2D Normalized Cross-Correlation (NCC) to thethree pairs of MIPs to find three 2D displacements

o Output “most-reliable” displacement for each direction

University Campus Bio-Medico of Rome

TeraStitcher: separating the strategy from the algorithm

March 26, 2018 10

Page 10: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

TeraStitcher: global optimization step

Global optimization is performed using computed (or even nominal) alignments and their reliability

The resulting alignments are used to generate the xml file

Manual intervention is possible to correct isolated errors

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0(∞ )

V:1(1,35)

H:372(1,40)

D:-3(1,09)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:372(1,37)

H:1(1,30)

D:0(1,40)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:371(1,36)

H:2(1,35)

D:2(1,33)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:0(∞ )

H:373(1,15)

D:2(1,11)

V:1(1,40)

H:375(1,26)

D:-1(1,12)

V:0(∞ )

H:375 (∞ )

D:0(∞ )

V:0(1,32)

H:373(1,22)

D:-1(1,10)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:372(1,41)

H:3(1,22)

D:-4(1,35)

V:372(1,21)

H:4(1,35)

D:-1(1,31)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:1(1,37)

H:371(1,35)

D:-3(1,37)

V:3(1,34)

H:373(1,23)

D:0(1,08)

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:0(∞ )

H:375(∞ )

D:0 (∞ )

V:372(1,35)

H:3(1,22)

D:-1(1,37)

V:374(1,35)

H:5(1,22)

D:-1(1,12)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:374(1,25)

H:6(1,15)

D:2(1,12)

V:375(∞ )

H:0(∞ )

D:0(∞ )

V:375(∞ )

H:0(∞ )

D:0(∞ )

horizontal axis H

vert

ical

axi

s V

MST along V

MST along H

MST along D

March 26, 2018 12

Page 11: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

TeraStitcher: developments and issues

Parallelization: maintenance issues different alternatives to exploit parallel processing

different code for sequential and parallel version

Solution: launching multiple instances on disjoint sub-regions command line options to specify the sub-region to process

use a scripting language (python) to implement a driver

Alignment computation (computation bound) each instance processes a group of adjacent sub-stacks

merge the xml files with alignment information

Generation of the stitched image (I/O bound) the directory structure is first generated

each instance writes disjoint groups of tiles in parallel

metadata to be used in subsequent steps are finally generated

March 26, 2018 13

Page 12: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Why do we need GPUs?

Tile size: 2048 x 2048 x 2950 voxel (16 bits per voxel)

Tile matrix: 3 x 3

Total dataset size: Tile size x #tiles (>220 GByte)

Stitching on one Power8 (2.061 Ghz) core: ~15000 secs!

Stitching on one Xeon ES2640 (2.6 Ghz) core: ~20000 secs!

GPUs come to rescue!

March 26, 2018 16

Page 13: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Normalized Cross Correlation The most time-consuming part of the stitching

process is the evaluation of the Normalized Cross-Correlation (NCC) between the Maximum IntensityProjections (MIP) of the tiles.

NCC is a variant of the classic Cross Correlation inwhich data are normalized by subtracting the meanand dividing by the standard deviation of the twodatasets.

TeraStitcher implements a fast NCC from J. P. Lewis,“Fast Template Matching”, Vision Interface (1995).

March 26, 2018 17

Page 14: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

CPU Normalized Cross Correlationf_mean = t_mean = 0.;

for ( i=0, pxl1=im1, pxl2=im2; i<dimi; i++, pxl1+=stride, pxl2+=stride ) {

for ( j=0; j<dimj; j++, pxl1++, pxl2++ ) {

f_mean += *pxl1;

t_mean += *pxl2;

}

}

f_mean /= (dimi*dimj);

t_mean /= (dimi*dimj);

numerator = factor1 = factor2 = 0.;

for ( ij=0, pxl1=im1, pxl2=im2; ij<dimi*dimj; ij++) {

f_prime = pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] - f_mean;

t_prime = pxl2[(ij%dimj)+(stride+dimj)*(ij/dimj)] - t_mean;

numerator += pxl1[(ij%dimj)+(stride+dimj)*(ij/dimj)] * t_prime;

factor1 += f_prime * f_prime;

factor2 += t_prime * t_prime;

}

March 26, 2018 18

Page 15: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

The previous fragment of code is repeated(2*delayu+1) x (2*delayv+1) times at each iteration(there are few hundreds iterations per run).

A typical value of delayu and delayv is ~ 100

We developed a gpu_NCC kernel that does all the workwith a single invocation per iteration:

minimize the overhead due to the invocation

allow to overlap memory copies from CPU to GPU(using streams)

March 26, 2018 19

From CPU TO GPU NCC

Page 16: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

The gpu_NCC kernel heavily relies on shuffle primitives tocompute the mean values required by the NCC

#define CALCSUM(v) (v)=warpReduceSumD((v)); \

if(lane==0) shared[wid]=(v); \

__syncthreads(); \

(v) = (threadIdx.x<(blockDim.x/warpSize))*shared[lane]; \

if(wid==0) (v)=warpReduceSumD((v)); \

if(tid==0) shared[0]=(v); \

__syncthreads(); \

(v)=shared[0]; \

__syncthreads();

#define CALCAVE(v) CALCSUM(v) \

(v) /= (dimi*dimj);

March 26, 2018 20

GPU NCC

Page 17: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

IBM S822LC HPC, courtesy of IBM Italy (G. Richelli)

512 GB of RAM,

16 Power8 cores (SMT 8)

2 960 GB solid state disks

4 P100 GPUs.

March 26, 2018 21

Test Platform

Page 18: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Time comparison (220 GB testcase)1 Power8 process 15300 seconds (14617 for the NCC)

2 Power8 processes 8191 seconds

4 Power8 processes 4100 seconds

16 Power8 processes 1168 seconds!

64 Power8 processes 2457 seconds (2200 for the NCC)!

1 process using one P100 580 seconds (70 for the NCC)

2 processes each one using a P100 344 seconds

4 processes each one using a P100 174 seconds

8 processes sharing 4 P100 93 seconds

16 processes sharing 2 P100 115 seconds

16 processes sharing 4 P100 56 seconds! (17.5 for the NCC)March 26, 2018 22

Page 19: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Results on other GPUs Running on a Titan V(olta) the GPU time reduces from

70 seconds to (about) 42.5 seconds

Running on a K80 the GPU time increases up to 493seconds!

The nvprof tool helps in understanding why timeincreases much more than expected…

March 26, 2018 23

Page 20: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

K80 vs. Volta (average values)

March 26, 2018 24

The difference in performance is due to the memory subsystem of the Volta

Page 21: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

There are no-changes with respect to the original CPUimplementation, just tiny additions

Can compile on platforms without GPU by turning-off amacro in the CMAKE configuration file

CUDA version activated by means of an env variable

Simple mechanism that works on any platform

CUDA version tested under Linux, MacOS, Windows

We are testing/evaluating an OpenCL version

CUDA code will be soon available from github

March 26, 2018 25

Other Features of the GPU version

Page 22: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

TeraStitcher: current status

Open source multi-platform software (https://abria.github.io/TeraStitcher)

Available both as standalone application (with GUI) and as plugin for Vaa3D

The CPU version is already in use in many organizations including

WYSS Center, Geneve Switzerland

Renier Lab, ICM Brain and Spine Institute, Paris, France

Chung's Lab, MIT, Boston, USA

Adam Glaser, Washington University, Seattle, USA

Tomer Lab, Columbia University, New York, USA

Deisseroth Lab, Stanford University, CA, USA

Several users already expressed interest for the CUDA version.

TeraStitcher Brain Cell FinderVaa3D-TeraFlyTeraConverterMarch 26, 2018 26

Page 23: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

Credits (1/2): image processing, bioinformatics, machine learning

TeraStitcher

Vaa3D-TeraFly

Brain cell finder

Prof. Giulio Iannello

Alessandro Bria Leonardo Onofri

Prof. Paolo Frasconi Paolo Soda

Hanchuan Peng

Roberto Cortini

March 26, 2018 27

Page 24: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

Credits (2/2): confocal light sheet microscopy, sample processing

Prof. Francesco Pavone Leonardo SacconiLudovico Silvestri

Irene Costantini

International Center of

Computational Neurophotonics

March 26, 2018 28

Page 25: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Credits: Jonathan Liu and Adam Glaser, Univ. of Washington have built the microscope and acquired the datathe samples were prepared by: Joshua Vaughan, University of Washington (the expanded brain), Rusty Nicovich, Allen Institute (the entire brain slice), Michael Gerner, University of Washington (mouse lymph node and mouse lung)

Page 26: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

March 26, 2018 30

Page 27: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell Finder

Page 28: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

Brain cell finder (1/5): overview

Developed for confocal light sheet microscopy (CLSM) images [7]

Applied to localize and count the Purkinje cells in the cerebellum of an L7-GFP mouse

First complete map of a selected neuronal population in a large area of the mouse brain

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell FinderMarch 26, 2018 32

Page 29: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

Brain cell finder (2/5): semantic deconvolution

Boosts weak somata and decreases thevoxel intensities in non-soma regions

Use of a neural network (NN) trained tomap the original image into an ideal one

10 substacks covering different regions weremanually annotated to feed data to the NN

Original imageIdeal image (reference)

Image filtered by the trained

neural networktr

ain

ing

test

ing

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell FinderMarch 26, 2018 33

Page 30: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

Brain cell finder (3/5): mean shift clustering

First we get rid of very dark voxelswhich are unlikely to be part of a soma

Seeds are chosen as local maximaexceeding a maximum entropy-basedoptimal threshold

The mean shift algorithm then:

Places a spherical kernel of radius R oneach seed

Shifts each point towards the meanvalue computed as the kernel-weighted average of the data

Repeats the previous steps untilconvergence is achieved

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell FinderMarch 26, 2018 34

Page 31: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

Brain cell finder (4/5): manifold filter

Exploiting anatomical knowledge

Cells are not scattered randomly inthe 3D space but are laid out inmanifolds

Isolated or off-manifold cells arediscarded

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell FinderMarch 26, 2018 35

Page 32: The context - NVIDIAon-demand.gputechconf.com/gtc/2018/presentation/s... · The most time-consuming part of the stitching process is the evaluation of the Normalized Cross-Correlation

University Campus Bio-Medico of Rome

Brain cell finder (5/5): results

Performances of the pipeline were estimatedon 56 substacks (~1.09 GigaVoxels) manuallyannotated with Vaa3D

- F1 = 0.96

A second human operator annotated the same

56 substacks and achieved F1 = 0.98

Processing the whole mouse cerebellum (120GigaVoxels) yielded cells, which agrees withstereology data

Vaa3D-TeraFlyTeraStitcher TeraConverter Brain Cell FinderMarch 26, 2018 36