
Phil. Trans. R. Soc. A (2010) 368, 3937–3952 doi:10.1098/rsta.2010.0158

PathGrid: a service-orientated architecture for microscopy image analysis

BY N. A. WALTON1,*, J. D. BRENTON2,3, C. CALDAS2,3, M. J. IRWIN1, A. AKRAM1, E. GONZALEZ-SOLARES1, J. R. LEWIS1, P. H. MACCALLUM2, L. J. MORRIS2 AND G. T. RIXON1

1Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge CB3 0HA, UK

2Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK

3Department of Oncology, University of Cambridge, Hutchison MRC Research Centre, Cambridge CB2 0XZ, UK

This paper describes ‘PathGrid’, an analysis and data integration system, developed initially to meet the demands in the analysis of medical microscopy imaging data. An overview of the current system is given, describing the techniques used in developing the data handling infrastructure and the analysis algorithm development. The use of software created in the context of systems designed for the astronomy domain is noted, specifically infrastructure from the astronomy virtual observatory movement for data discovery, access and workflow management, and astronomical image analysis software adapted for the analysis of high-throughput astronomy imaging surveys. This paper notes the applicability of the techniques from the astronomy domain. The testbed infrastructure deployment is described, emphasizing its speed and ease of use and support. The validity of the analysis techniques is confirmed through the pilot study described here, with the application to a large sample of immunohistochemistry microscopy data obtained in part for assessing the oestrogen receptor status of breast cancers. The analysis showed that the specificity and sensitivity values for the automatic scoring using PathGrid were within the errors of those obtained via a ‘gold standard’ manual pathologist scoring.

Keywords: microscopy; image processing; astronomy; information extraction

1. Introduction

Microscopy of clinical samples has recently begun to generate large image datasets because of the increasing availability of high-throughput automated scanning microscopes and the creation of tissue microarrays (TMAs). TMAs enable the analysis of hundreds of tissue sections on a single slide, resulting in the conservation of tissue and the reduction in inter-experimental variability. Tissue samples collected from patients during surgery and preserved in paraffin (donor blocks) are reviewed by pathologists, who identify tumour tissue and normal control tissue from the samples. Cylindrical cores (usually less than 1 mm in size) are cut from these samples and placed in an array for analysis by immunohistochemistry (IHC) using antibodies to detect a panel of candidate biomarkers. The subsequent manual scoring of TMAs by a trained pathologist is a major bottleneck in their analysis and there is a need for automated approaches to image analysis to provide increased throughput and objective assessment of biomarker expression.

*Author for correspondence ([email protected]).

One contribution of 16 to a Theme Issue ‘e-Science: past, present and future I’.

This journal is © 2010 The Royal Society

Downloaded from http://rsta.royalsocietypublishing.org/ on June 2, 2018

These high-throughput methods, e.g. IHC, underpin research into the discovery and validation of new predictive markers for cancer (e.g. Brenton et al. 2001, 2005; Callagy et al. 2003, 2008; Ahmed et al. 2007; Rexhepaj et al. 2008). New research is increasingly moving towards exploiting a systems approach to pathology (e.g. Cordon-Cardo et al. 2007; Donovan et al. 2008) to discover new predictive markers. This integrative strategy combines morphometric analysis of cancer cells and tissues with other complex molecular datasets and outcome data from clinical trials, together with a range of tools and related information such as genomic information from curated databases. Applying this approach to microscopy promises to accelerate the rate at which new biomarkers can be evolved from discovery and quickly applied in the clinic using standard pathological workflows.

Building on techniques developed in the astrophysics domain, the PathGrid project has developed an integrated data analysis and access system specifically designed to handle effectively the wide range of complexity inherent in microscopy data. The PathGrid solution is fully open, scalable and extensible, and thus is relevant for use in environments where input of, and access to, large, distributed, heterogeneous data is required.

In this paper, the focus is on the technical basis of the PathGrid system, which allows it to provide the vital technological workbench for discovery, supporting the systems pathology approach. The overall system architecture is discussed in §2. The use of analysis algorithms developed initially for the astronomy domain is described in §3. The current PathGrid testbed system is outlined in §4, while experience gained from the initial use and deployment of the PathGrid system is given in §5. The paper closes with conclusions and outlook in §6.

2. PathGrid architecture

The PathGrid (http://www.pathgrid.org) system is composed of a number of distinct components. In the following sections, the use of a data handling infrastructure originally developed through the astronomy virtual observatory (VO) is described (§2a) along with detail on how it has been adopted specifically for PathGrid (§2b). The workflow management system that supports the development and execution of complex workflows is discussed, together with a brief indication of how the processing chain will in future be integrated with a database management system. A full discussion of the analysis algorithms is left until §3.


(a) The virtual observatory

The VO initiative in astronomy has been developed to meet the specific challenges resulting from the rapid growth of data in astronomy, both observational and model data.

Historically, astronomy has been an observationally based science. Study of the cosmos has enabled a better understanding of the physical processes at work in the Universe, and thus allowed astronomers to answer a range of key questions: from how the Universe formed at the time of the Big Bang, through the formation and evolution of galaxies, to the properties of terrestrial extra-solar planets.

A range of large observatories, both ground based (e.g. the European Southern Observatory telescopes; http://www.eso.org) and in space (e.g. the Hubble Space Telescope; http://www.stsci.edu), are producing observational data across the wavelength domain. Technological advances in areas such as detectors have enabled the sky to be observed across the full range of the electromagnetic spectrum. These new observational facilities generate significant data volumes and this, coupled with an increasing need to combine data from differing wavelength regimes (e.g. X-ray and infrared data), leads to significant data and computational challenges.

The VO movement emerged in 2001 with the aim to create a global system to provide uniform access to this distributed data, with project initiatives in the USA (the National Virtual Observatory) and Europe (the European Virtual Observatory) leading the way. In order to coordinate the development of interoperating data services, the International Virtual Observatory Alliance (http://www.ivoa.net) was formed in 2002 (Genova et al. 2002) by representatives from the major VO projects. It has successfully developed a number of interoperability protocols (see http://www.ivoa.net/Documents which gives links to these standards) upon which the VO implementations have been built. This ensures that those VO systems are able to access data and applications provided by data centres publishing their resources conforming to these standards.

In the UK, the AstroGrid project (http://www.astrogrid.org) generated a set of interoperating infrastructure components to enable the publishing of data and applications in a secure environment. AstroGrid was funded over the period 2001–2009, being a consortium (as of 2008) consisting of participating groups from the universities of Bristol, Cambridge, Central Lancashire, Edinburgh, Leicester and Manchester and the Rutherford Appleton Laboratory. From 2009 further development of this infrastructure is being carried out within the context of a number of project initiatives at the European level including the Euro-VO (http://www.euro-vo.org) and the Virtual Atomic and Molecular Data Centre (http://www.vamdc.eu). This ensures future sustainability and continued technical support of the system.

The AstroGrid system is interoperable with data and application services published more generally by a wide range of data centres located globally in the USA, Europe and elsewhere. An astronomer science user of the AstroGrid system is able to make use of the VOEXPLORER client (Tedds et al. 2008), as a tool to search for and discover relevant data and application resources. Queries and manipulation of these data can then be carried out using inbuilt user interfaces relevant for specific service interfaces, or by interoperating clients (handling for instance data visualization). Figure 1 shows an example use case where astrometric data from the Hipparcos catalogue are discovered through VODESKTOP, retrieved from the data centre and displayed in a connected desktop visualization tool, all achieved in a short sequence of simple actions through uniform interfaces. In this manner, a comprehensive range of data is available to any astronomer through a single interface.

Figure 1. This figure shows use of VO software for data discovery and visualization in astronomy. This example shows the use of VODESKTOP, with its VOEXPLORER window open (top left), where the user searches and selects the Hipparcos astrometric catalogue. The user then performs an action on that resource (bottom left), in this case a query of the data. The result of that query generates a dataset, which is transferred to the desktop tool (bottom right) and displayed graphically (top right). This particular visualization shows the distribution of observations of objects on the sky.

The AstroGrid software is available from http://www.astrogrid.org and is published as open source software (with an Academic Free Licence). Walton & Gonzalez-Solares (2009) and references therein describe the AstroGrid system and its use for astronomical research (e.g. Walton 2005).

(b) The virtual observatory applied to PathGrid

As noted in §2a, the PathGrid Service-Orientated Architecture (SOA) is based on that developed in the context of the AstroGrid VO and Euro-VO projects.

AstroGrid provides software components to make Web services for resource discovery (registry component), virtual file storage (VOSpace), database access (DSA/catalogue) and application execution (CEA application-server). The latter two components wrap, respectively, a relational database and data-processing modules developed by PathGrid, providing Web access to those functions. A clear separation, with formal interfaces, is maintained between the AstroGrid code and the PathGrid code. This allows independent maintenance and development.

The PathGrid server-side application modules are called by the Web-service wrapper through a Unix command-line interface. The application modules themselves do not contain Web-service code and need not be written in a language that supports Web services. Further, the application modules are separate programs and can be written in different languages, which makes it easier to incorporate legacy software.
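The separation between wrapper and application module can be illustrated with a minimal sketch. It is not the CEA application-server code; the function name `run_module` and the stand-in command are hypothetical, and the only assumption is that a module is an executable that reports failure through its exit status.

```python
import subprocess
import sys
from typing import Sequence


def run_module(command: Sequence[str], timeout_s: float = 600.0) -> str:
    """Run an external analysis module and return its standard output.

    The wrapper relies only on the command line, the exit status and the
    output streams, so the module may be written in any language.
    """
    completed = subprocess.run(
        list(command),
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode the streams to str
        timeout=timeout_s,
        check=False,          # the exit status is inspected explicitly below
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"module failed with status {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical stand-in for a real analysis module: any executable would do.
    fake_module = [sys.executable, "-c", "print('objects_detected=3')"]
    print(run_module(fake_module).strip())
```

Because the wrapper sees only a command line, a legacy program can be substituted for the stand-in command without any change to the calling code.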

The Web services use both Simple Object Access Protocol (SOAP) and representational state transfer (REST; Fielding 2000) styles. They can be called from desktop applications specific to PathGrid (written in a wide choice of high-level and scripting languages), from generic clients provided by AstroGrid, from other Web applications or from the Taverna workflow system. The AstroGrid code in the Web services handles access control and access to data in VOSpace. The AstroRuntime component is a client-side library supporting access to these services. It encapsulates the details of the Web-service protocols and can itself be called from most languages.

We note the evolution of the VO-based infrastructural components. Earlier implementations carried significant overheads inherent in the Web service-based approach, but these have since been reduced substantially. In the first place, the Java-based implementation has benefited from significant improvements in the Java Virtual Machine, especially with Java SE 6 (http://java.sun.com/performance/reference/whitepapers/6_performance.html). The SOAP interfaces have been optimized. In some areas, the use of the REST style model (Fielding & Taylor 2002) has allowed for a significant simplification of the interfaces with resultant improvements in speed. Finally, the evolution of the workflow enactment engine (Taverna) has seen a significant focus on optimization of performance. Recent releases (e.g. 2.1) reflect this and show that the workflow enactment engine adds only marginal overheads to the execution times of complex workflows (see Taverna 2.1 documentation: http://www.taverna.org.uk/documentation/taverna-2-1/release-notes/). These factors, coupled with our experience from actual use of the testbed systems as described in §5, demonstrate that the PathGrid architecture is suitable for the scale of data inherent in this domain.

(c) Workflow management

In order to allow for the construction and management of a set of processing services into one data analysis pipeline, the PathGrid system contains a workflow component. This is based around the Taverna (Hull et al. 2006; Oinn et al. 2006) workflow management system.

Taverna is a set of tools for designing and running workflows. It consists of a server (or client)-based enactment engine and a desktop client (the Taverna Workbench). It was originally developed for use in the bio-informatics realm but has now been taken up for use across a wide range of disciplines from biology, chemistry and medicine to astronomy and the social sciences, among others. Taverna has a data model view of workflow. It can invoke various types of services: local Java classes, standard WSDL-described Web services and ‘grid’ services.


Figure 2. This figure shows a typical PathGrid analysis workflow. In this case, the user indicates the location of a whole slide of microscopy data and gives a location for the output results. The workflow then takes the input data, reformats them, and processes them through the analysis tasks, outputting the final and intermediate results to the virtual storage.

The PathGrid implementation makes use of the Astro-Taverna (Walton et al. 2008) plugin for Taverna to construct and enact the complex processing chains. This processor plugin for Taverna provides the interface to the AstroGrid AstroRuntime (Winstanley et al. 2007), thus allowing for the integration of the PathGrid VO-based data and application services.

With a PathGrid workflow (see an example in figure 2) the user can execute a multiple-step pipeline as a single-click operation. This covers the login process, file transfer, image conversion, image analysis, generation of catalogues and storage of resulting images and files on local or virtual file and database systems. It provides computing scalability, an interface to caBIG (which now also interfaces with Taverna; see http://cabig.nci.nih.gov/tools/taverna) and automatic handling of submission to ‘grid’ and ‘cloud’ clusters.

In a future development to facilitate greater sharing of the research process, the myExperiment (De Roure et al. 2009) virtual research interface will be offered for storage and sharing of packaged processing Astro-Taverna PathGrid workflows. This is a powerful virtual research environment that makes it easy to find, use and share scientific workflows, and thus will provide a useful underpinning ‘sharing’ technology for the growing numbers of users of PathGrid workflows and services.

(d) Database management

A single TMA slide typically contains several hundred tissue samples (cores), each originating from a single donor (i.e. patient) block. It is important to relate the final image analysis results from the PathGrid system to the original slides and tissue cores to enable the results to be integrated with the clinical and pathological data. An algorithm to detect positively stained nuclei in IHC tissue images has been developed (§3). The number of detected nuclei in a 0.6 mm core image is typically of the order of 1000. A database schema was designed to record selected output parameters from the image analysis, including the number, position and intensity values for the nuclear features. The capacity to query the data should enable the pathologists and the statisticians to perform complex analysis; for example, queries to compare different analyses performed with different input parameters and the selection of anomalous results with irregular staining patterns.
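The kind of query described above can be sketched with a small relational example. This is an illustration only: the table and column names (`core`, `analysis_run`, `nucleus`, `fwhm_pixels`) are hypothetical rather than the PathGrid schema, and SQLite is used in place of the Oracle system so that the sketch is self-contained.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: one row per core, per analysis run and per detected nucleus.
SCHEMA = """
CREATE TABLE core (
    core_id   INTEGER PRIMARY KEY,
    slide_id  TEXT NOT NULL,
    donor_id  TEXT NOT NULL
);
CREATE TABLE analysis_run (
    run_id      INTEGER PRIMARY KEY,
    core_id     INTEGER NOT NULL REFERENCES core(core_id),
    fwhm_pixels REAL NOT NULL
);
CREATE TABLE nucleus (
    run_id    INTEGER NOT NULL REFERENCES analysis_run(run_id),
    x         REAL NOT NULL,
    y         REAL NOT NULL,
    intensity REAL NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO core VALUES (1, 'slide-A', 'donor-001')")
    conn.executemany(
        "INSERT INTO analysis_run VALUES (?, ?, ?)",
        [(1, 1, 8.0), (2, 1, 3.0)],  # same core, two input parameter settings
    )
    conn.executemany(
        "INSERT INTO nucleus VALUES (?, ?, ?, ?)",
        [
            (1, 10.0, 12.0, 0.8),
            (2, 9.5, 11.0, 0.7),
            (2, 11.0, 13.5, 0.9),
            (2, 40.0, 41.0, 0.4),
        ],
    )
    return conn


def counts_per_run(conn: sqlite3.Connection, core_id: int) -> List[Tuple[int, float, int]]:
    """Compare analyses of one core: nuclei detected for each parameter setting."""
    rows = conn.execute(
        """
        SELECT r.run_id, r.fwhm_pixels, COUNT(n.run_id)
        FROM analysis_run AS r
        LEFT JOIN nucleus AS n ON n.run_id = r.run_id
        WHERE r.core_id = ?
        GROUP BY r.run_id, r.fwhm_pixels
        ORDER BY r.run_id
        """,
        (core_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(counts_per_run(build_demo_db(), core_id=1))
```

Keeping the input parameters in their own table lets repeated analyses of the same core be compared with a single join, which is the comparison use case mentioned in the text.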

At this stage, only preliminary capture of the output catalogues to the database system has been implemented. However, this will be a significant focus of future work, especially with the acquisition of increasingly large datasets and resultant output catalogues.

We note that currently the output binary tables are being ingested into an Oracle 9g RDBMS. The interface is provided by the AstroGrid Dataset Access (DSA) component. This provides a service interface to the Oracle database system that makes it compliant with, and accessible from, the PathGrid workbench. In particular, queries can be actioned through a workflow. The database system to support PathGrid will be more fully described in a forthcoming paper.

3. Astronomical algorithms

The PathGrid system incorporates a number of image analysis algorithms developed for the analysis of optical and infrared image data.

The Cambridge Astronomical Survey Unit (CASU; http://casu.ast.cam.ac.uk) at the Institute of Astronomy (IoA), University of Cambridge, is the main UK centre of expertise in the analysis of astronomy image data. It is responsible for the processing of significant volumes of imaging data from a range of major observatories. In particular, it has both developed the analysis algorithms and associated pipelines, and operated these pipelines in support of large public surveys from ESO’s 4 m VISTA infrared telescope (e.g. Dye et al. 2006). This telescope, commissioned in 2009, is now being used in survey mode generating typically some 100 GB of data per night, which will eventually lead to image archives of hundreds of terabytes. The analysis systems have been designed specifically to support high-throughput data flows, and are thus robust, and have a high degree of automation.

The astronomy analysis pipelines are designed to extract the maximum amount of information from the astro-imaging sky surveys, enabling the highest possible science return from these surveys.

The processing chains typically involve a range of operations.

— Image processing to remove instrumental effects to generate a linear photon noise-limited image. Deep stacking is often undertaken to enable the rejection of various artefacts such as cosmic rays and bad/dead pixels.

— Detection and parametrization of objects in the images. The extraction algorithms are able to detect objects against complex background variations (at local and global scales). Optimal detection is via matched filters with image segmentation into objects. For each extracted object, parameter estimations are generated, giving information on, for instance, position, flux and morphology.

— External calibration and object classification. This covers both astrometric (positional) and photometric (flux) calibration of the images and extracted objects. These parameters in turn enable classification of each object to be made, for instance generating a star/galaxy probability for any object based on shape/morphology parameters.

— A range of quality control parameters are automatically generated, allowing for an estimation of image quality, background variation, detector performance, etc.

— Matching of images is often performed, across image bands (e.g. measuring the appearance and property of objects as detected through differing colour filters), or detecting variations in the position or flux of an object in time series data.

Figure 3. This figure shows a CASU pipeline reduced Y, J, K pseudo-colour image from ESO’s VISTA infrared telescope of the Galactic centre (credit: ESO/VISTA, CASU). The structure, complexity and data volume is of a similar order to that found in the IHC microscopy data.

Figure 3 shows an example of a recent commissioning image of the centre of our Milky Way as observed with the VISTA telescope. This image is composed of three infrared bands and shows the complexity and richness of structure at the heart of our Galaxy.

The detection of objects follows a two-stage process. A background is fitted and removed, and remaining objects are then identified and parametrized. The technique for detection of the objects follows the formulation described in Irwin (1985), which uses optimal matched filter detection techniques, feature extraction using thresholded pixel connectivity (Lutz 1979) and deblending of objects through multiple hill climbing, analogous to watershedding (Meyer 1991).
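The two-stage idea can be sketched in a deliberately simplified form. The code below is not the Irwin (1985) implementation: it assumes a constant background estimated by the image median, a fixed threshold and 4-connected pixel grouping, and it omits the matched filter and the deblending step entirely. All function names are hypothetical.

```python
from statistics import median
from typing import Dict, List

Image = List[List[float]]


def subtract_background(image: Image) -> Image:
    """Stage 1 (simplified): remove a constant background, taken as the median pixel."""
    level = median(value for row in image for value in row)
    return [[value - level for value in row] for row in image]


def detect_objects(image: Image, threshold: float) -> List[Dict[str, float]]:
    """Stage 2 (simplified): group 4-connected pixels above threshold and parametrize them."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    residual = subtract_background(image)
    n_rows, n_cols = len(residual), len(residual[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    objects = []
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if seen[r0][c0] or residual[r0][c0] <= threshold:
                continue
            # Flood fill from this seed pixel to collect one connected object.
            stack, pixels = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                pixels.append((r, c))
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if inside and not seen[rr][cc] and residual[rr][cc] > threshold:
                        seen[rr][cc] = True
                        stack.append((rr, cc))
            # Parametrize: pixel count, total flux and flux-weighted centroid.
            flux = sum(residual[r][c] for r, c in pixels)
            objects.append({
                "n_pixels": len(pixels),
                "flux": flux,
                "x": sum(c * residual[r][c] for r, c in pixels) / flux,
                "y": sum(r * residual[r][c] for r, c in pixels) / flux,
            })
    return objects


if __name__ == "__main__":
    demo = [
        [1, 1, 1, 1, 1, 1],
        [1, 9, 9, 1, 1, 1],
        [1, 9, 9, 1, 1, 1],
        [1, 1, 1, 1, 7, 1],
        [1, 1, 1, 1, 1, 1],
    ]
    for obj in detect_objects(demo, threshold=3.0):
        print(obj)
```

Each returned dictionary corresponds to one row of an object catalogue; a production pipeline would add many more parameters, such as morphology measures.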

The implementations of these algorithms into processing pipelines for optical and infrared astronomical imaging data are described in Irwin & Lewis (2001) and Irwin et al. (2004), respectively.

These then are the main analysis algorithms, which have been transferred and adapted for use as the source extraction routines in PathGrid. These routines are applied to the microscopy images, with the full parametrization of each detected object being output to an object catalogue file specific to each of those images.


Figure 4. The image to the left shows a sample slide with the detected objects overplotted. To the right is the same image, where the object detection algorithm has been run with a better adjusted initial estimate for typical object diameter. This results in a much improved efficiency of object detection.

The pilot study showed that simple changes to the configuration parameters of the object detection algorithm were sufficient to achieve a high degree of detection efficiency. An example is shown in figure 4 where a first run of the detection algorithm fails to locate individual cell nuclei. However, by decreasing the parameter governing the initial estimation for the FWHM size of the objects of interest, an excellent detection efficiency is achieved.
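Why the FWHM estimate matters can be illustrated with a one-dimensional toy example. The sketch assumes, for illustration only, a Gaussian smoothing filter whose width is set from the FWHM via sigma = FWHM / (2 sqrt(2 ln 2)); the source does not state the filter shape used in PathGrid, and the function names are hypothetical. Two neighbouring point-like features are resolved with a small FWHM but merge into a single peak when the FWHM is too large.

```python
import math
from typing import List, Sequence


def gaussian_kernel(fwhm: float) -> List[float]:
    """Normalized Gaussian kernel, truncated at three standard deviations."""
    if fwhm <= 0:
        raise ValueError("fwhm must be positive")
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    radius = math.ceil(3.0 * sigma)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal: Sequence[float], fwhm: float) -> List[float]:
    """Convolve the signal with the Gaussian kernel, treating the outside as zero."""
    kernel = gaussian_kernel(fwhm)
    radius = len(kernel) // 2
    smoothed = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        smoothed.append(acc)
    return smoothed


def count_peaks(values: Sequence[float]) -> int:
    """Count strict local maxima, i.e. candidate object detections."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    )


if __name__ == "__main__":
    # Two point-like features six pixels apart on an empty background.
    signal = [0.0] * 41
    signal[17] = 1.0
    signal[23] = 1.0
    for fwhm in (12.0, 3.0):
        print(f"FWHM = {fwhm:4.1f} pixels -> {count_peaks(smooth(signal, fwhm))} peak(s)")
```

With these inputs the wide filter reports one blended peak and the narrow filter reports two, mirroring the qualitative behaviour described for figure 4.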

PathGrid provides a suite of applications covering the whole workflow process, from the conversion of the images as received from the image scanner, through object extraction, to statistical routines for final high level analysis.

Effective use of a number of client side tools from astronomy is made in the handling and visualization of the microscopy data. The bulk binary table catalogue files generated as a result of the image analysis process can be converted into comma separated value (CSV) and extensible markup language (XML) files, using the TOPCAT Stilts libraries (Taylor 2006) if the local user so wishes. TOPCAT (Taylor 2005; see http://www.star.bris.ac.uk/~mbt/topcat/) is also used as an interactive graphical viewer and editor for the PathGrid tabular data. ALADIN (Bonnarel 2000; see http://aladin.u-strasbg.fr/) is used to handle the display of image data. It has the ability to stack images, and also allows for the efficient visualization of catalogue information. Figure 6 shows an example of the use of TOPCAT and ALADIN in combination, with data handling between them and the data as stored in the virtual storage area enabled via use of the Simple Application Messaging Protocol (see http://www.ivoa.net/Documents/latest/SAMP.html; an interoperability standard developed through the IVOA). Note that the bulk data are generated in binary form; however, use of CSV is most appropriate for rapid database ingestion, while the XML representation is suitable for visual display owing to the ease of transforming XML data.


4. PathGrid testbed

The current 2009 deployment of the PathGrid data infrastructure testbed links services at the IoA and Cambridge Research Institute (CRI) of Cancer Research-UK (CR-UK).

The hardware system includes two eight core Dell Poweredge servers, with associated 2 TB disk stores. One server is configured to host the PathGrid community, registry and ‘VOSpace’ disk store, while the other acts as the application server.

Figure 5 shows the configuration of the PathGrid infrastructure modules across the servers. The end user runs the user client software from any remote location. Currently, the end users are located at the CRI, CR-UK.

The initial use of the system, now underway at the CRI, involves the analysis of large sets (2500 samples) of ER (nuclear marker; the analysis of which is described below in §5) and HER2 (membrane marker; the analysis of which is described in our forthcoming paper) microscopy data processed with PathGrid analysis workflows, and is demonstrating the efficacy of the system. This initial deployment has demonstrated a range of key features required for future larger scale rollout. These include secure data transport of the image sets from the scanning microscopes at CRI to the development analysis server at the IoA, the deployment of the client user interface tools at CRI, the actioning of the relevant workflow from that client, with the actual analysis run on the servers at the IoA, and the ingestion of the output catalogues into the development Oracle database at the IoA. The reduced images and data products are accessible via the user clients at CRI.

Figure 6 shows the result of the visualization of one of the image cores and the use of desktop client tools interoperating to handle the interplay between the display of image and the catalogue data.

At this stage, little optimization of the workflows has been undertaken. However, preliminary use of the processing chain on the sample datasets has demonstrated that a full analysis of a typical 180 core (equivalent to one slide) dataset (equating to approx. 500 MB of image data) requires of the order of 1200 s using the current testbed application server. We note here that the processing overheads introduced by the workflow management system are of the order of 20 per cent, which is an acceptable value when balanced against the operational efficiency that use of the workflow system brings. This overhead is mainly because of the exchange of XML-formatted control messages between the server and Taverna workflow. These messages are small and thus they only impose this modest additional overhead.
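The quoted figures imply some simple per-core numbers, worked through in the short sketch below. The inputs are the order-of-magnitude values stated above; the interpretation of the 20 per cent overhead as a fraction of the total run time is an assumption, since the text does not say whether it is measured against the total or against the analysis time alone.

```python
from typing import Dict

# Order-of-magnitude values quoted in the text for one slide.
TOTAL_SECONDS = 1200.0
CORES_PER_SLIDE = 180
IMAGE_MEGABYTES = 500.0
OVERHEAD_FRACTION = 0.20  # ASSUMPTION: taken as a fraction of the total run time


def throughput_summary(
    total_s: float, n_cores: int, size_mb: float, overhead_fraction: float
) -> Dict[str, float]:
    """Derive per-core time, data rate and the split between overhead and analysis."""
    overhead_s = total_s * overhead_fraction
    return {
        "seconds_per_core": total_s / n_cores,
        "megabytes_per_second": size_mb / total_s,
        "overhead_seconds": overhead_s,
        "analysis_seconds": total_s - overhead_s,
    }


if __name__ == "__main__":
    summary = throughput_summary(
        TOTAL_SECONDS, CORES_PER_SLIDE, IMAGE_MEGABYTES, OVERHEAD_FRACTION
    )
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Under that assumption a slide takes roughly 6.7 s per core at about 0.42 MB/s, of which around 240 s in total would be attributable to the workflow system.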

5. Pilot study validation

The initial validation of the PathGrid system was undertaken by developing and evaluating a scoring algorithm applied to a sample ‘nuclear marker’ dataset. In brief, the presence of the oestrogen receptor (ER) protein in data obtained through the Eastern Cancer Registration & Information Centre (ECRIC) campaign (Wishart et al. 2010) was scored and compared with ‘gold standard’ pathologist scoring (Makretsov et al. 2008). The pilot validation described here was performed using the PathGrid testbed system as described in the previous section.

[Figure 5 schematic. Component labels: application server, CEA applications, registry, Oracle, file storage, community, DSA catalogue, VOSpace, client, management server.]

Figure 5. This figure shows a schematic overview of the PathGrid testbed deployment. The servers are located at the IoA and can be accessed in a secure fashion from any remote location. Thus, here the user (client) is located at the CRI, CR-UK.

Figure 6. This figure shows an image section displayed in the ALADIN visualizer with the detected objects overlaid. The catalogue information is displayed in the TOPCAT tabular data viewer.

Pathologists use the Allred classification (Allred et al. 1998; Harvey et al. 1999) to assign a score on a scale from 0 to 8 to each immunostained image core within a slide. This score is composed of two elements: an estimate of the proportion of positively stained tumour cells (0, none; 1, <1%; 2, 1–10%; 3, 10–33%; 4, 33–66%; and 5, >66%) together with an intensity score describing the average intensity of the positive tumour cells (0, none; 1, weak; 2, intermediate; and 3, strong). Added together, these give a range of 0–8. Scores of 0–2 represent a negative result, whereas scores in the range 3–8 indicate a positive result.
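The Allred arithmetic described above can be written down as a minimal sketch. The function names are illustrative, and the handling of values that fall exactly on a category boundary (for example 10% or 33%) is an assumption, since the quoted ranges share their end points.

```python
def proportion_score(percent_positive):
    """Map the percentage of positively stained tumour cells to the 0-5 score.

    Assumption: a value exactly on a shared boundary goes to the lower category.
    """
    if percent_positive <= 0:
        return 0
    if percent_positive < 1:
        return 1
    if percent_positive <= 10:
        return 2
    if percent_positive <= 33:
        return 3
    if percent_positive <= 66:
        return 4
    return 5


def allred_score(percent_positive, intensity):
    """Sum of the proportion score (0-5) and the intensity score (0-3)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity score must be 0, 1, 2 or 3")
    return proportion_score(percent_positive) + intensity


def is_er_positive(score):
    """Scores of 0-2 are negative; scores of 3-8 are positive."""
    return score >= 3
```

For instance, a core with half of its tumour cells stained at intermediate intensity receives a proportion score of 4 and an intensity score of 2, giving a positive total of 6.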

The analysis showed high sensitivity and specificity when the automated PathGrid technique was used to generate an equivalent Allred score and compared with the pathologists’ gold standard scores. Moreover, the automated processing and scoring was significantly faster (Walton et al. 2009), taking minutes for the automated technique compared with hours for the manual scoring. Our forthcoming paper gives a fuller description of the experimental data used in this study.

The validation involved a multi-step process. First, all image data obtained from the imaging microscopy system were ingested into the PathGrid system in the form of standard JPG images as exported by the ARIOL microscope software system.

These JPG images were then converted to the multi-extension FITS file format used in astronomical data analysis (e.g. Hanisch et al. 2001), with each JPG image being decomposed into its three component channels: red (R), green (G) and blue (B). (We note that FITS is an efficient file format, and is used throughout the processing chain, enabling all applications to rapidly access the individual RGB colour channels. Using FITS, the format which the astronomical analysis routines accept, gives significant speed gains compared with algorithms that handle native JPG files. The overhead in the transformation from JPG to FITS is very low.) These individual channels were then processed to enable a ‘colour’ analysis of each image to be undertaken. For convenience from an astronomy imaging perspective, prior to processing each image channel was inverted (x → 255 − x), with the result that a ‘brown’ stain (absence of blue light) now becomes a ‘blue’ glow.
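The channel decomposition and inversion step can be illustrated with a small sketch in plain Python. The image is represented here as rows of 8-bit (r, g, b) tuples, which is an assumption made for illustration; decoding the JPG and writing the planes to multi-extension FITS would need imaging and FITS libraries and are not shown.

```python
def split_and_invert(rgb_rows):
    """Split rows of (r, g, b) pixels into three planes, inverting x -> 255 - x."""
    planes = {"R": [], "G": [], "B": []}
    for row in rgb_rows:
        planes["R"].append([255 - r for r, _, _ in row])
        planes["G"].append([255 - g for _, g, _ in row])
        planes["B"].append([255 - b for _, _, b in row])
    return planes


# One-row toy image: a brown (stained) pixel next to a white background pixel.
toy_image = [[(150, 75, 0), (255, 255, 255)]]
planes = split_and_invert(toy_image)

# After inversion the brown pixel is brightest in B, and the background is zero,
# so the data resemble bright sources on a dark sky.
brown_inverted = (planes["R"][0][0], planes["G"][0][0], planes["B"][0][0])
background_inverted = (planes["R"][0][1], planes["G"][0][1], planes["B"][0][1])
```

The inversion is what lets source detection routines written for bright objects on a dark background be applied without modification.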

The underpinning object detection and morphological analysis was done on a ‘black and white’ image created by coadding the R+G+B channels. With an object list it is then straightforward to place apertures over each detected feature and integrate the flux in each channel with respect to the local background per channel. This enables a detailed colour/intensity analysis to be made for each detected feature. The full object shape descriptors are used to preselect features that are most likely to be nuclei based on their circularity and their size on the image. This, for example, allows simple rejection of the majority of fibroblasts in ER data, which have significant ellipticity.
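A minimal sketch of the coaddition, aperture integration and shape preselection described above follows. It is not the PathGrid pipeline code: the circular aperture, the background passed in as a single number, and the area and ellipticity thresholds are all hypothetical choices made for illustration.

```python
def coadd(r_plane, g_plane, b_plane):
    """Pixel-wise sum R+G+B, giving the 'black and white' detection image."""
    return [
        [r + g + b for r, g, b in zip(r_row, g_row, b_row)]
        for r_row, g_row, b_row in zip(r_plane, g_plane, b_plane)
    ]


def aperture_flux(plane, cx, cy, radius, background):
    """Sum of (pixel - background) over pixels centred within radius of (cx, cy)."""
    total = 0.0
    for y, row in enumerate(plane):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += value - background
    return total


def looks_like_nucleus(area_px, ellipticity,
                       min_area=30, max_area=400, max_ellipticity=0.4):
    """Keep roughly circular features of plausible size (thresholds hypothetical)."""
    return min_area <= area_px <= max_area and ellipticity <= max_ellipticity


# Toy 5x5 plane: background level 10 with a 3x3 bright feature of level 50.
plane = [[10] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        plane[y][x] = 50

flux = aperture_flux(plane, cx=2, cy=2, radius=1, background=10)
detection_image = coadd(plane, plane, plane)
```

Running the same aperture over each of the three colour planes yields the per-channel fluxes used in the colour analysis, while the shape cut removes elongated features such as fibroblasts.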

The degree and intensity of nuclear staining for each detected nuclear region then follows directly from the ratio of blue channel flux (stain) to the average of red and green (reference). Examples summarizing these measures are shown in figure 7, which demonstrates the distribution of the nuclear staining parameters for a heavily stained image and for an unstained image. The dotted lines are internal estimates generated from the loci of points denoting unreliable faint detections (vertical) and the dynamically computed boundary between stained and unstained nuclei.

[Figure 7 plots, panels (a) and (b): ‘ratio’ (vertical axis) against ‘intensity’ (horizontal axis).]

Figure 7. Panel (a) shows the results from an extraction of an example heavily stained image, whereas panel (b) shows that from an unstained image. The ‘ratio’ parameter is a measure of the intensity of the object as detected in the ‘B’ colour when compared with the ‘R+G’ colours. This is plotted against the ‘intensity’ of each object.
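The staining measure for a single detected region is a simple flux ratio, sketched below. The function name is illustrative, the fluxes are taken to be background-subtracted aperture fluxes in the inverted channels, and returning None for a non-positive reference flux is an assumption about how degenerate detections might be handled.

```python
def stain_ratio(flux_r, flux_g, flux_b):
    """Ratio of blue (stain) flux to the mean of red and green (reference) flux."""
    reference = 0.5 * (flux_r + flux_g)
    if reference <= 0:
        return None  # no usable reference signal for this detection
    return flux_b / reference


stained_example = stain_ratio(100.0, 100.0, 150.0)
unstained_example = stain_ratio(100.0, 100.0, 100.0)
```

A region with equal flux in all three inverted channels has a ratio of 1, and stained nuclei lie above that level.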

These distributions of individual nuclear data points are then treated as an ensemble to create an overall score for each image. To duplicate the Allred scoring process as closely as possible, we define two statistics per image: the proportion of stained nuclei, given by the number of ‘blue’ points above the horizontal dashed line divided by the total number of points (the ‘proportion’ statistic); and the median intensity ratio of the blue points relative to the median locus of the points below the line (the ‘intensity’ statistic). For each image, these summary statistics are generated.
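The two per-image statistics can be sketched as follows. In the paper the faint-detection cut and the stained/unstained boundary are computed dynamically; here they are simply passed in as parameters. Interpreting "relative to" as a quotient of medians is also an assumption.

```python
from statistics import median


def image_statistics(nuclei, ratio_boundary, min_intensity):
    """Return (proportion, intensity) statistics for one image.

    nuclei: list of (intensity, ratio) pairs, one per detected nucleus.
    ratio_boundary: ratio separating stained from unstained nuclei (given here).
    min_intensity: detections fainter than this are treated as unreliable.
    """
    reliable = [(i, r) for i, r in nuclei if i >= min_intensity]
    if not reliable:
        return None
    stained = [r for _, r in reliable if r > ratio_boundary]
    unstained = [r for _, r in reliable if r <= ratio_boundary]
    proportion = len(stained) / len(reliable)
    if stained and unstained:
        intensity = median(stained) / median(unstained)
    else:
        intensity = None  # undefined when one of the two groups is empty
    return proportion, intensity


toy_nuclei = [(6000, 1.6), (7000, 1.4), (8000, 0.8), (9000, 1.0), (100, 1.9)]
proportion_stat, intensity_stat = image_statistics(
    toy_nuclei, ratio_boundary=1.2, min_intensity=1000
)
```

In the toy data the faint detection is discarded, and two of the four remaining nuclei lie above the boundary.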

In figure 6, the results for one slide sample of 182 cores are also shown. These are displayed using TOPCAT; note the interplay between windows, whereby a point selected in the proportion/intensity statistic plane is located within the tabular data.

The analysis of Makretsov et al. (2008) compared manual scores with those obtained using the Genetix ARIOL processing software, and a semi-automatic technique using the NIH IMAGEJ (Collins 2007) image analysis tool.

A two-dimensional receiver operating characteristic analysis (cf. Florkowski 2008) of the 273 manually scored IHC samples is used to define the optimal decision boundary based on a combination of the resulting sensitivity (the proportion of positives which are correctly identified as such) and specificity (the proportion of negatives correctly identified as negatives) figures. Figure 8 shows the automatic PathGrid scoring compared with the gold standard manual scoring determined in the earlier study of Makretsov et al. (2008). The results are shown in table 1. It is apparent that the PathGrid algorithm for this nuclear marker gives results comparable to those from both the ARIOL software system and the semi-automatic IMAGEJ technique.
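The definitions of sensitivity and specificity given above translate directly into code. The boundary search below is a deliberately simplified stand-in for the two-dimensional analysis used in the paper: it scans candidate thresholds on a single score and picks the one that maximizes Youden's J (sensitivity + specificity − 1). Both the one-dimensional threshold and the choice of J as the criterion are assumptions.

```python
def sensitivity_specificity(truth, predicted):
    """truth, predicted: equal-length sequences of booleans (True = ER positive)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(scores, truth, candidates):
    """Pick the candidate threshold that maximizes sensitivity + specificity - 1."""
    best = None
    for threshold in candidates:
        predicted = [s >= threshold for s in scores]
        sens, spec = sensitivity_specificity(truth, predicted)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, threshold, sens, spec)
    return best[1], best[2], best[3]


toy_scores = [0.9, 0.8, 0.3, 0.4, 0.1]
toy_truth = [True, True, True, False, False]
threshold, sens, spec = best_threshold(toy_scores, toy_truth, [0.2, 0.35, 0.5])
```

In the two-dimensional case the same criterion would be evaluated over candidate boundary lines in the proportion/intensity plane rather than over scalar thresholds.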

The full analysis of these results, together with a validation study against a membrane marker, is described in our forthcoming paper.


[Figure 8 plot: ‘proportion statistic’ (vertical axis) against ‘intensity statistic’ (horizontal axis). Legend: boundary from ROC analysis; negative: ER 0–2; positive: ER ≥ 3; points are marked by individual ER score.]

Figure 8. The figure shows the comparison of the automatic scoring with values for each of the samples as manually scored. The boundary line is set to optimize the sensitivity/specificity value.

Table 1. The sensitivities and specificities determined on the 273 sample set using the PathGrid analysis algorithm, compared with the values obtained with the ARIOL and IMAGEJ techniques reported in Makretsov et al. (2008). The data used were the same across both studies.

technique    sensitivity (%)    specificity (%)
PathGrid     88.1               92.7
ARIOL        85                 95
IMAGEJ       80                 85

6. Outlook and conclusions

This paper has described the initial pilot development of the PathGrid system and demonstrated that data analysis and data handling techniques developed for astronomical data are applicable to the analysis and data integration of medical microscopy imaging data.

The initial validation of the PathGrid analysis algorithms applied to the case of the ER marker data demonstrates the accuracy of the results compared against gold standard scoring.

The future direction of the programme will involve the development of analysis algorithms and processing pipelines adapted to a wider range of IHC image data. For instance, the algorithm set is now being extended to cover a range of cytoplasmic (e.g. Bcl2) and membrane (e.g. HER2) markers. Initial validation on test sample data for these is currently under way. In particular, our assessment of the algorithm for a ‘membrane marker’ dataset (approx. 2400 samples of HER2 data) shows both significantly improved speed compared with manual pathologist scoring using the ‘Hercep test’ guidelines, and improvements in the specificity and sensitivity of the assessments.


Further, from the infrastructure and data handling aspects, the PathGrid testbed system is currently limited to a small scale deployment at the IoA, Cambridge, and CRI, CR-UK. In order to support an extended set of distributed researchers, involved in wider research collaborations, the client software will be made more widely available.

Work described in this paper was funded through an MRC Discipline Hopper programme award (G0601785) and via a STFC miniPIPSS grant (ST/G003556/1). We acknowledge advice through the Oracle EMEA External Research and Development Programme. Use is made of the software developed by the AstroGrid Virtual Observatory Project, which was funded by the Science and Technology Facilities Council and through the EU’s Framework 6 programme.

References

Ahmed, A. A. & Brenton, J. D. 2005 Microarrays and breast cancer clinical studies: forgetting what we have not yet learnt. Breast Cancer Res. 7, 96–99. (doi:10.1186/bcr1017)

Ahmed, A. A. et al. 2007 The extracellular matrix protein TGFBI induces microtubule stabilization and sensitizes ovarian cancers to paclitaxel. Cancer Cell 12, 514–527. (doi:10.1016/j.ccr.2007.11.014)

Allred, D. C., Harvey, J. M., Berardo, M. & Clark, G. M. 1998 Prognostic and predictive factors in breast cancer by immunochemical analysis. Mod. Pathol. 11, 155–168.

Bonnarel, F., Fernique, P., Bienaymé, O., Egret, D., Genova, F., Louys, M., Ochsenbein, F., Wenger, M. & Bartlett, J. G. 2000 The ALADIN interactive sky atlas. A reference tool for identification of astronomical sources. Astron. Astrophys. Suppl. Ser. 143, 33–40. (doi:10.1051/aas:2000331)

Brenton, J. D., Aparicio, S. A. & Caldas, C. 2001 Molecular profiling of breast cancer: portraits but not physiognomy. Breast Cancer Res. 3, 77–80. (doi:10.1186/bcr274)

Brenton, J. D., Carey, L. A., Ahmed, A. A. & Caldas, C. 2005 Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J. Clin. Oncol. 23, 7350–7360. (doi:10.1200/JCO.2005.03.3845)

Callagy, G., Cattaneo, E., Daigo, Y., Happerfield, L., Bobrow, L. G., Pharoah, P. D. & Caldas, C. 2003 Molecular classification of breast carcinomas using tissue microarrays. Diag. Mol. Pathol. 12, 27–34. (doi:10.1097/00019606-200303000-00004)

Callagy, G. M., Webber, M. J., Pharoah, P. D. & Caldas, C. 2008 Meta-analysis confirms BCL2 is an independent prognostic marker in breast cancer. BMC Cancer 8, 153. (doi:10.1186/1471-2407-8-153)

Collins, T. J. 2007 ImageJ for microscopy. BioTechniques 43(Suppl.), S25–S30. (doi:10.2144/000112517)

Cordon-Cardo, C. et al. 2007 Improved prediction of prostate cancer recurrence through systems pathology. J. Clin. Invest. 117, 1876–1883. (doi:10.1172/JCI31399)

De Roure, D., Goble, C. & Stevens, R. 2009 The design and realisation of the myExperiment virtual research environment for social sharing of workflows. Future Gen. Comput. Syst. 25, 561–567. (doi:10.1016/j.future.2008.06.010)

Donovan, M. J. et al. 2008 Systems pathology approach for the prediction of prostate cancer progression after radical prostatectomy. J. Clin. Oncol. 26, 3923–3929. (doi:10.1200/JCO.2007.15.3155)

Dye, S. et al. 2006 The UKIRT infrared deep sky survey early data release. Mon. Not. R. Astron. Soc. 372, 1227–1252. (doi:10.1111/j.1365-2966.2006.10928.x)

Fielding, R. T. 2000 Architectural styles and the design of network-based software architectures. Doctoral dissertation, University of California, Irvine.

Fielding, R. T. & Taylor, R. N. 2002 Principled design of the modern web architecture. ACM Trans. Internet Technol. 2, 115–150. (doi:10.1145/514183.514185)


Florkowski, C. M. 2008 Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests. Clin. Biochem. Rev. 29(Suppl. i), S83–S87.

Genova, F. et al. 2002 International collaboration for the virtual observatory. Bull. Am. Astron. Soc. 34, 789.

Hanisch, R. J., Farris, A., Greisen, E. W., Pence, W. D., Schlesinger, B. M., Teuben, P. J., Thompson, R. W. & Warnock, A. 2001 Definition of the flexible image transport system (FITS). Astron. Astrophys. 376, 359–380. (doi:10.1051/0004-6361:20010923)

Harvey, J. M., Clark, G. M., Osborne, C. K. & Allred, D. C. 1999 Estrogen receptor status by immunohistochemistry is superior to the ligand-binding assay for predicting response to adjuvant endocrine therapy in breast cancer. J. Clin. Oncol. 17, 1474–1481.

Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock, M., Li, P. & Oinn, T. 2006 Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34, 729–732. (doi:10.1093/nar/gkl320)

Lutz, R. K. 1979 On the realtime analysis of astronomical images. In Proc. 5th Colloquium on Astrophysics, Trieste, Italy, 4–8 June 1979 (eds G. Sedmak, M. Capaccioli & R. J. Allen), p. 218. Trieste, Italy: Osservatorio Astronomico di Trieste.

Irwin, M. J. 1985 Automatic analysis of crowded fields. Mon. Not. R. Astron. Soc. 214, 575–604.

Irwin, M. & Lewis, J. 2001 INT WFS pipeline processing. New Astron. Rev. 45, 105–110. (doi:10.1016/S1387-6473(00)00138-X)

Irwin, M. J., Lewis, J. R., Hodgkin, S. T., Bunclark, P. S., Evans, D. W., McMahon, R. G., Emerson, J. P., Stewart, M. & Beard, S. 2004 VISTA data flow system: pipeline processing for WFCAM and VISTA. Proc. SPIE 5493, 411–422. (doi:10.1117/12.551449)

Makretsov, N., Howart, W., Pharoah, P., Dawson, S.-J., Provenzano, E., Bows, F., Driver, K. & Caldas, C. 2008 Quantitative digital image analysis of estrogen receptor immunostaining as a model for morphological screening of nuclear immunomarkers. Virchows Arch. 452(Suppl 1), S22.

Meyer, F. 1991 Un algorithme optimal pour la ligne de partage des eaux. 8me congrès de reconnaissance des formes et intelligence artificielle 2, 847–857.

Oinn, T. 2006 Taverna: lessons in creating a workflow environment for the life sciences. Concurrency Comput. Pract. Exp. 18, 1067–1100. (doi:10.1002/cpe.993)

Rexhepaj, E., Brennan, D. J., Holloway, P., Kay, E. W., McCann, A. H., Landberg, G., Duffy, M. J., Jirstrom, K. & Gallagher, W. M. 2008 Novel image analysis approach for quantifying expression of nuclear proteins assessed by immunohistochemistry: application to measurement of oestrogen and progesterone receptor levels in breast cancer. Breast Cancer Res. 10, R89. (doi:10.1186/bcr2187)

Taylor, M. B. 2005 TOPCAT & STIL: Starlink Table/VOTable processing software. ASP Conf. Ser. 347, 29.

Taylor, M. B. 2006 STILTS—a package for command-line processing of tabular data. ASP Conf. Ser. 351, 666.

Tedds, J. A., Winstanley, N., Lawrence, A., Walton, N. A., Auden, E. & Dalla, S. 2008 VOExplorer: visualising data discovery in the virtual observatory. ASP Conf. Ser. 394, 159.

Walton, N. A. 2005 Deploying the AstroGrid: science use ready. ASP Conf. Ser. 347, 273.

Walton, N. A. & Gonzalez-Solares, E. 2009 AstroGrid and the virtual observatory. In Jets from young stars V. Lecture Notes in Physics, no. 791, pp. 81–113. Berlin, Germany: Springer.

Walton, N. A., Witherwick, D. K., Oinn, T. & Benson, K. M. 2008 Taverna and workflows in the virtual observatory. ASP Conf. Ser. 394, 309.

Walton, N. A. et al. 2009 PathGrid: the transfer of astronomical image algorithms to the analysis of medical microscopy data. ASP Conf. Ser. 411, 77.

Winstanley, N., Taylor, J. D., Taylor, M. B., Noddle, K. T., Gonzalez-Solares, E. & Lindroos, J. 2007 Astro Runtime: an API to the virtual observatory. ASP Conf. Ser. 376, 571.

Wishart, G. C., Azzato, E. M., Greenberg, D. C., Rashbass, J., Kearins, O., Lawrence, G., Caldas, C. & Pharoah, P. D. 2010 PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer. Breast Cancer Res. 12, R1. (doi:10.1186/bcr2464)
