Session goals Review existing APIs, and how they fit with –overall data architecture –MBAT...


Page 1:

Session goals

• Review existing APIs, and how they fit with
  – overall data architecture
  – MBAT architecture

• Create a strategy for developing and assimilating uniform APIs, and priorities

• Explore consequences for MBAT architecture

Page 2:

Architecture; data types and interfaces

[Diagram; labels recovered from the slide]

Data types: Gene Expression; 2D Images; 2D vector segmentations; 3D Volumes; 4+D Volumes (FMRI); Time Series; Phenotype / behavioral; Surfaces

Sources

Publication

Data Registration

Source wrappers, following uniform web service APIs

CCDB

SRB, other sources

Spatial Registry

BIRNLex, etc.

Catalogs and indexes

Catalog wrappers, following uniform web service APIs

Mediator

Discovery, Retrieval, Analysis, Viz, Integration APIs

Clients: Portlets; MBAT; WOMBAT; Other clients

Page 3:

APIs

• Read data from other atlases/databases, in a uniform way for the data type: API for data retrieval and transformation

• Find relevant data in other atlases/databases: API for atlas catalogs

• View the region of interest in another atlas: API for atlas state exchange

Page 4:

Uniform Web Services API (towards BIRN-ML??)

Web services are a standard way to access remote functionality across platforms and to assemble applications. Atlases access several data types: microarray data, 2D images, 3D volumes, surfaces, segmentations, annotations, phenotype/behavioral data, FMRI, time series, etc. Some of these data types have common representation models (e.g. MAGE). These models are typically large and exist in multiple incarnations, and the level of detail they provide is often not needed for data discovery or for common data access and integration tasks. It would therefore be useful to wrap such data in a common set of services that expose the most essential data characteristics and represent the common-denominator queries against each data type (e.g. getGenes, getProbes, getStructures...) that any dataset of that type must answer. Such services would support multiple clients, including atlases, the BDR interface, the mediator, etc.

Plug-in architecture vs SOA: no contradiction (focus on a single product, vs on a larger system)
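
The idea of a common-denominator service per data type can be sketched in Python as follows. This is only an illustration: the class and method names (GeneExpressionSource, InMemorySource, get_genes, find_genes), the source names, and the substring matching rule are assumptions made for the example, not part of any existing BIRN or MBAT API.

```python
from typing import Protocol


class GeneExpressionSource(Protocol):
    """Hypothetical uniform interface every gene expression wrapper would expose."""

    name: str

    def get_genes(self, keyword: str) -> list[str]: ...


class InMemorySource:
    """Toy stand-in for a wrapped source; real wrappers would call remote services."""

    def __init__(self, name: str, genes: list[str]) -> None:
        self.name = name
        self._genes = genes

    def get_genes(self, keyword: str) -> list[str]:
        # Case-insensitive substring match, chosen only for illustration.
        return [g for g in self._genes if keyword.lower() in g.lower()]


def find_genes(sources: list[GeneExpressionSource], keyword: str) -> dict[str, list[str]]:
    """Client-side fan-out: the same call works against every wrapped source."""
    return {s.name: s.get_genes(keyword) for s in sources}


sources = [
    InMemorySource("source_a", ["Coch", "Tspan2"]),
    InMemorySource("source_b", ["Tspan2", "Drd2"]),
]
result = find_genes(sources, "tspan")
print(result)
```

Because the client depends only on the shared interface, an atlas, a mediator, or a portlet could use the same call without knowing how each source stores its data.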

Page 5:

Issues/steps (for MA and 2D)

1. Figure out how search requests and outputs are implemented in MBAT (http://www.loni.ucla.edu/twiki/bin/view/MouseBIRN/WebServices), the MA module in BIRN (http://microarray.nbirn.net/), and GN (http://www.genenetwork.org/CGIDoc.html).

2. Examine MAGE and see how the same MA requests and output can be expressed in MAGE. Then, depending on the results of (1), either abandon MAGE in favor of some simpler XML (potentially embedded in XCEDE?), or rely on MAGE constructs (and include them in XCEDE wrappers for gene expression sources, as a foreign namespace?). This shall be done vis-à-vis common information requirements of client applications (e.g. GetProbes?, GetGenes?, GetStructures?, etc.).

3. In parallel, review the schema used in the MA module to check whether it sufficiently reflects the information model for GE data, and update as necessary.

4. If we decide on the XCEDE route, make sure the mediator can connect with XCEDE sources, be it a database source or a web source in an XCEDE wrapper (see http://mediator.nbirn.net:8080/axis/services/MedTestBService?wsdl – this would involve passing web service calls via the ExecuteQuery? method, and conversion between XML output and the mediator’s recordset).

5. Identify additional sources or databases to be wrapped in the same API (GN, Gensat, ABA, BIRN MA + GEO + UCSC (VISIGENE) – for MA data; CCDB, ABA, ArcIMS, spatial registry, Gensat – for 2D). Then finalize the signatures.

6. Make sure terms used in queries and in the output are tagged with BIRNLex terms (e.g. develop controlled vocabularies for each term).

7. Implement web services for the GN and MA module (incl. testing/deployment).

8. Based on results of (3), update data publication tools (i.e. software for loading data from common CSV and text files into the MA module); make sure controlled vocabularies are enforced.

9. Make sure AIDB’s XCEDE wrapper supports the services as well(?). Now CCDB-based.

10. Publish and document web services; develop a series of examples of how they can be called from various programming environments and applications.

11. GEO API: connect the region names with MBAT: need semantic registration; possibly scrape the GEO catalog, reconcile labels with MBAT semantics, and have a service wrapper into GEO data.

Page 6:

About XCEDE and MA

• XCEDE is the common schema providing access to BIRN databases.
  – HID and the emerging AIDB are being wrapped in XCEDE (see http://www.na-mic.org/Wiki/index.php/Slicer3:Remote_Data_Handling), and, as deployed at BIRN-CC: http://bcc-dev-mediator.nbirn.net:8080/axis/services/HidQuerierWS?wsdl. Web services are being written against XCEDE, so both HID and AIDB will be accessible through XCEDE web services. The goal, therefore, could be to route common metadata requests against gene expression, 2D image, and 3D volume data through XCEDE, and extend XCEDE to support additional requests.
  – If we switch to CCDB as the image catalog, what components of XCEDE shall be retained?

• MAGE-ML/FUGE/MAGEv2
  – MAGE-ML is derived from the Microarray Gene Expression Object Model (MAGE-OM), which is developed and described using the Unified Modelling Language (UML). MAGE-ML is designed to describe microarray designs, microarray manufacturing information, microarray experiment setup and execution information, gene expression data, and data analysis results. MAGEv2 is being built on top of FuGE as an extension that adds microarray-specific classes (extending Data as ArrayDesign, DesignElementData, etc.; Material as Array, QPCRPlate, etc.; and DimensionElement as DesignElement, extended by Feature, Reporter, and CompositeElement).

• FUGE Home Page
• MAGE Home Page

Page 7:

From XCEDE API

• Gets/Puts:
  – GetProjects, GetProject, GetProjectDetail
  – GetSubject, GetSubjects, GetSubjectDetail
  – GetVisits,…
  – GetStudies,…
  – GetSeries,…
  – Get Data Acquisitions
  – Get Assessments,…
  – getData, Get DataSizeEstimate

• Also some getCapabilities returns (e.g. getMethods)

Page 8:

API Examples: Mediator services

• http://mediator.nbirn.net:8080/axis/services/MedTestBService?wsdl
  – SOAP Method: executeQuery (loginTimeoutSecs, maxByteCountPerBatch, queryID, queryLifeInSeconds, queryParameters.item0.name, queryParameters.item0.value, queryParameters.item1.name, queryParameters.item1.value, queryString, queryTimeoutSecs, resultLifeInSeconds, securityCertificateString)
  – SOAP Methods: fetchNextResultBatch, fetchPreviousResultBatch, fetchCurrentResultBatch, fetchRelativeResultBatch, fetchResultBatch, getErrorMessage, getStatistics
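
The executeQuery / fetchNextResultBatch pair suggests a cursor-style client loop, which might look like the Python sketch below. The slide lists only method and parameter names, so the behavior here is assumed: that executeQuery returns a query identifier, that fetchNextResultBatch returns None once results are exhausted, and that rows are dictionaries. FakeMediatorClient is a hypothetical in-memory stand-in, not the real SOAP service.

```python
from typing import Iterator, Optional


class FakeMediatorClient:
    """In-memory stand-in; a real client would issue SOAP calls to the service."""

    def __init__(self, rows: list[dict], batch_size: int) -> None:
        self._rows = rows
        self._batch_size = batch_size
        self._cursor = 0

    def executeQuery(self, queryString: str, queryTimeoutSecs: int = 30) -> str:
        # Assumption: the call returns an identifier used by later fetches.
        self._cursor = 0
        return "query-1"

    def fetchNextResultBatch(self, queryID: str) -> Optional[list[dict]]:
        # Assumption: None signals that no further batches remain.
        if self._cursor >= len(self._rows):
            return None
        batch = self._rows[self._cursor:self._cursor + self._batch_size]
        self._cursor += self._batch_size
        return batch


def iter_rows(client, query: str) -> Iterator[dict]:
    """Run a query and yield rows batch by batch until the results are exhausted."""
    query_id = client.executeQuery(queryString=query)
    while (batch := client.fetchNextResultBatch(query_id)) is not None:
        yield from batch


client = FakeMediatorClient([{"probe": f"p{i}"} for i in range(5)], batch_size=2)
rows = list(iter_rows(client, "hypothetical query text"))
print(len(rows))
```

Wrapping the batch calls in a generator keeps batching details out of client code, which would matter if several clients (MBAT, portlets) share the same mediator.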

Page 9:

API Examples: BIRN MA

• BIRN Microarray (http://microarray.nbirn.net/get_data.php? )
  – REST service: cmd=<get_probes|get_my_probes>
  – user_id=<int>
  – dset=<all|null>
  – strain=<string>
  – keyword=<string>
  – species=<string>
  – sex=<string>
  – stage=<string>
  – subject_group=<string>
  – anatomy=<string>
  – probe_id=<string>
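
A request to this REST service could be assembled as in the Python sketch below. The endpoint and parameter names come from the slide; the helper name build_query_url, the rejection of unknown parameters, and the example values ("mouse", "male") are assumptions for illustration, and the snippet only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://microarray.nbirn.net/get_data.php"  # endpoint from the slide

# Parameter names taken from the slide; anything else is rejected.
ALLOWED = {
    "cmd", "user_id", "dset", "strain", "keyword", "species",
    "sex", "stage", "subject_group", "anatomy", "probe_id",
}


def build_query_url(**params: str) -> str:
    """Build a GET URL from the supplied filters (hypothetical helper)."""
    unknown = set(params) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    # Drop unset filters so that only the requested constraints are sent.
    query = {k: v for k, v in params.items() if v is not None}
    return f"{BASE_URL}?{urlencode(query)}"


url = build_query_url(cmd="get_probes", species="mouse", sex="male")
print(url)
```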

GN will need several more:

- platformID

- GeneID (proxied by ProbeID here?)

- ExonID

- “bestID” (sort by quality, based on user-selected metric, e.g. highest expression)

Infomodel: Species – probes – structures – genes

Passing SQL queries as opposed to just filters that we have…

Is there a way to unify what is returned from GN, ABA, BIRN-MA?

Matrix (from ABA-Neuroblast): who are best covariants: spatially, semantically, temporally; which genes were most modulated

Page 10:

API examples: Gensat

• GENSAT: http://maloney.loni.ucla.edu:8080/axis/GensatSource.jws?wsdl

• getGene(geneSym, geneName, exprLevel, anatStruc, stage, sex)
  – Don’t have probes; ABA doesn’t have them either (= genes)

• get2DImage(geneSym, geneName, exprLevel, anatStruc, stage, sex, plane)
  – No spatial info

• getDataTypes(dataSourceID)
  – Essentially, a capabilities request

Page 11:

API Examples: GeneNetwork

• http://www.genenetwork.org/webqtl/WebQTL.py?

• cmd=birn (also: genotype, get, trait, map, interval, correlation…)

• species=XXXX
• tissue=XXXX
• symbol=XXXX
• ProbeId=XXXX
• function=XXXX
• Strain=XXXX

http://www.genenetwork.org/CGIDoc.html

Check with Amarnath on ontology mapping for genes in GN

Page 12:

Our expectations for MA data

Methods Summary

   getGenes(String geneCode, String geneName, String geneFunction, String keyWord)
       Get the gene information by gene code, gene name, gene function, or keyword.

   getProbe(String ProbesID, String geneName, String geneCode)
       Get the probe information by probe ID, gene code, or gene name.

   getStructures(String structureName, String geneCode, String geneName)
       Get the structure information by structure name, gene code, or gene name.

More?
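
One way a getGenes call with several optional filters could behave is sketched below in Python. It assumes that every supplied filter must match (an implicit AND) and that the keyword is a case-insensitive substring match over all text fields; the Gene record, the sample data, and these matching rules are hypothetical, since the slide gives only the signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    code: str
    name: str
    function: str


# Made-up records, for illustration only.
GENES = [
    Gene("G1", "alpha", "transport"),
    Gene("G2", "beta", "signaling"),
    Gene("G3", "alpha-like", "signaling"),
]


def get_genes(gene_code: Optional[str] = None, gene_name: Optional[str] = None,
              gene_function: Optional[str] = None,
              keyword: Optional[str] = None) -> list[Gene]:
    """Return genes matching every filter that was supplied (implicit AND)."""
    def matches(g: Gene) -> bool:
        if gene_code is not None and g.code != gene_code:
            return False
        if gene_name is not None and g.name != gene_name:
            return False
        if gene_function is not None and g.function != gene_function:
            return False
        # Assumption: a keyword matches as a substring of any text field.
        if keyword is not None and not any(
            keyword.lower() in field.lower() for field in (g.code, g.name, g.function)
        ):
            return False
        return True

    return [g for g in GENES if matches(g)]


by_function = get_genes(gene_function="signaling")
by_both = get_genes(gene_function="signaling", keyword="alpha")
print([g.code for g in by_function], [g.code for g in by_both])
```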

Page 13:

Allen Brain Atlas API

The API to the Allen Brain Atlas-Mouse Brain consists of a set of services allowing users to programmatically download the complete high resolution images, 3D volumes, and metadata for more than 20,000 genes in the database. In addition to the documentation, a demo has been created to demonstrate the use of the services of the API. The demo's source code is also available…

Page 14:

ABA API details (expressions)

• ImageSeries Structure Expression (ImageSeries ID) -> XML in ABA schema

• Expression Energy Volumes (ImageSeriesID) -> sparse volume file (x,y,z for each voxel where expression energy value > 0, + density of expression)

Comment:Smoothed energy volume for gene Tspan2 imageseriesId 75144618
Dimensions:67,41,58
38,14,4,2.01994e-06
39,14,4,2.37068e-05
40,14,4,3.08554e-05
. . .
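
A reader for such a sparse volume file might look like the Python sketch below. It assumes the layout suggested by the sample on the slide: one "Comment:" line, one "Dimensions:" line with three comma-separated integers, then one x,y,z,value row per voxel. The function name parse_sparse_volume is hypothetical and no other file variants are handled.

```python
def parse_sparse_volume(text: str):
    """Parse header lines and x,y,z,value rows of a sparse volume file."""
    comment, dims, voxels = None, None, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Comment:"):
            comment = line[len("Comment:"):]
        elif line.startswith("Dimensions:"):
            dims = tuple(int(n) for n in line[len("Dimensions:"):].split(","))
        else:
            # Data row: voxel coordinates followed by the expression value.
            x, y, z, value = line.split(",")
            voxels[(int(x), int(y), int(z))] = float(value)
    return comment, dims, voxels


sample = """Comment:Smoothed energy volume for gene Tspan2 imageseriesId 75144618
Dimensions:67,41,58
38,14,4,2.01994e-06
39,14,4,2.37068e-05
40,14,4,3.08554e-05
"""
comment, dims, voxels = parse_sparse_volume(sample)
print(dims, len(voxels))
```

Keeping only non-zero voxels in a dictionary mirrors the sparse file format; a dense array could be filled from it when a full volume is needed.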

Page 15:

ABA API (Genes)

• Genes (GeneSymbol) -> xml (image-series, gene-expressions)

<image-series>
  <age>56</age>
  <geneid>12593</geneid>
  <imageseriesdisplayname>Coch-Coronal-05-2779</imageseriesdisplayname>
  <imageseriesid>71717614</imageseriesid>
  <ncbiaccessionnumber>NM_007728</ncbiaccessionnumber>
  <plane>coronal</plane>
  <probeorientation>antisense</probeorientation>
  <projectname>0310</projectname>
  <riboprobename>RP_050623_02_G08</riboprobename>
  <sex>male</sex>
  <specimenid>05-2779</specimenid>
  <strain>C57BL/6J</strain>
  <templateid>143280</templateid>
  <transcriptgi>31982455</transcriptgi>
  <transcriptid>9068</transcriptid>
  <transcriptname />
  <treatmenttype>ISH</treatmenttype>
</image-series>

<gene-expression>
  <avgdensity>100.0</avgdensity>
  <avglevel>93.9770317077637</avglevel>
  <geneid>12593</geneid>
  <projectcode>0310</projectcode>
  <rgb>#a0d8e8</rgb>
  <structureid>343</structureid>
  <structurelabel>STRd</structurelabel>
  <structurename>Striatum dorsal region</structurename>
</gene-expression>

• No need to put sex or strain
• Need to control for dummy entries on front end
• Resolution is another issue
• Provenance information, when multiple probes (have on their site)
• Status=OK|failed|single_best
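
Extracting values from a gene-expression record like the one above could be sketched in Python as follows. The element names are those shown on the slide; the `<response>` wrapper element and the function name expression_by_structure are assumptions added so that the fragment parses as a single well-formed document.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the slide; the <response> wrapper element is an
# assumption added only so that the snippet is a well-formed document.
doc = """<response>
  <gene-expression>
    <avgdensity>100.0</avgdensity>
    <avglevel>93.9770317077637</avglevel>
    <geneid>12593</geneid>
    <projectcode>0310</projectcode>
    <rgb>#a0d8e8</rgb>
    <structureid>343</structureid>
    <structurelabel>STRd</structurelabel>
    <structurename>Striatum dorsal region</structurename>
  </gene-expression>
</response>"""


def expression_by_structure(xml_text: str) -> dict[str, float]:
    """Map each structure label to its average expression level."""
    root = ET.fromstring(xml_text)
    return {
        ge.findtext("structurelabel"): float(ge.findtext("avglevel"))
        for ge in root.iter("gene-expression")
    }


levels = expression_by_structure(doc)
print(levels)
```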

Page 16:

ABA API Details (images)

• Get Image: http://www.brain-map.org/aba/api/image?zoom=[zoom]&path=[filePath]&mime=[mime]&top=[top]&left=[left]&width=[width]&height=[height]
  – Default output = jpeg; zooms = 0…6; path = filepath to a file in image series
  – Top, left – in image coords, for full size image (implied zoomify images)

• ImageProperties (by path; by ImageID):
  – <IMAGE_PROPERTIES WIDTH="15185" HEIGHT="8817" NUMTILES="2832" NUMTIERS="7" NUMIMAGES="1" VERSION="1.8" TILESIZE="256" />

• ImageSeries (ID)
  – Have GetImage feature implemented in GN (per Rob)
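
The IMAGE_PROPERTIES values can be cross-checked with a little arithmetic, sketched below in Python. The sketch assumes the usual tiled-pyramid convention for Zoomify-style images: each tier halves the width and height of the one above until the whole image fits in a single tile. The function name zoomify_pyramid is hypothetical. Under that assumption the slide's WIDTH, HEIGHT, and TILESIZE reproduce its NUMTIERS (7, matching zooms 0…6) and NUMTILES (2832).

```python
import math


def zoomify_pyramid(width: int, height: int, tile_size: int = 256):
    """Return (number of tiers, total number of tiles) for a halving tile pyramid."""
    tiles_per_tier = []
    w, h = width, height
    while True:
        # Tiles needed to cover the image at the current resolution.
        tiles_per_tier.append(math.ceil(w / tile_size) * math.ceil(h / tile_size))
        if w <= tile_size and h <= tile_size:
            break  # the whole image now fits in a single tile
        w, h = max(1, w // 2), max(1, h // 2)
    return len(tiles_per_tier), sum(tiles_per_tier)


tiers, tiles = zoomify_pyramid(15185, 8817, 256)
print(tiers, tiles)
```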

Page 17:

ABA API Demos

Code Samples

• API Wrapper: This Java class wraps the API URLs in convenience methods. A simple caching scheme is implemented.

• 3D Volume classes: A set of Java classes for reading & writing 3D expression volume files retrieved through the API. Methods are available for reading & writing our text-based volume data in MetaImage-compatible binary format; other methods allow the extraction of a slice of volume data as a Java BufferedImage.

• Images: Define & download regions of interest or complete images at multiple resolutions.

• Visualization: Java classes for displaying and navigating 3D volumes, setting color maps, adjusting image dynamic range.

• Analysis: Generate a median expression volume over any input set of expression volumes. Query volume files and compute an overall gene expression "energy" statistic for each region. Calculate ISH image regions of interest based on 3D regions of interest defined in the Atlas coordinate space.

• More: Several general purpose Java Swing-based UI forms for tasks like downloading & processing data; classes for efficiently indexing Gene-to-image_series mapping data; define/read/write 2D & 3D ROIs.

Data

• Annotated Atlas volumes: 3D volume files annotated with the Allen Brain Atlas structure IDs at each voxel. Volumes available at 25, 100 & 200 micron resolution.

• Brain structure ontology: The major structures of the Allen Brain Atlas and their parent-child relationships; includes the abbreviations and IDs used throughout the ABA data set.

• Gene to image series mapping: A complete list of the genes available in the ABA data set, mapped to the IDs of all of the image series for each. Each entry also references EntrezGene IDs and NCBI accession numbers.

Page 18:

Other existing APIs (spatial)

• The registry has a web service interface, to find available images in ROI:
  – http://smartatlas.nbirn.net:8080/axis/services/ImageMetadataForROI?wsdl
    <request><category>mouse</category><regionofinterest>-2,2,-2,-2,2,-2,2,2,-2,2</regionofinterest><slicenumber>031</slicenumber></request>

• Requesting image fragments:
  – E.g. http://geon15.sdsc.edu/axis/services/ImageQueryService?wsdl
  – the name of the method is getSimpleImageWithSpecs
  – method inputs:
    • host - 132.239.131.188
    • serviceName - slice_15b_warped1194307428796
    • minx - -6.035011 miny - -8.165376 maxx - 12.088584 maxy - 0.812765
    • imageHeight - 800 imageWidth - 600

• There are standards in the GIS world on how you exchange spatial data, e.g. GML simple features that are the basis of many application schemas. E.g.
  – <gml:Point srsName="urn:ogc:def:crs:EPSG:6.6:4269">
      <gml:pos>45.256 -71.92</gml:pos>
    </gml:Point>
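
The ROI request document for the registry service could be generated as in the Python sketch below. The element names (request, category, regionofinterest, slicenumber) and the sample values are taken from the slide; the helper name build_roi_request is hypothetical, and the snippet only serializes the XML payload without calling the service.

```python
import xml.etree.ElementTree as ET


def build_roi_request(category: str, roi: list[float], slice_number: str) -> str:
    """Serialize an ROI query using the element names shown on the slide."""
    request = ET.Element("request")
    ET.SubElement(request, "category").text = category
    # Coordinates are sent as one comma-separated string; "%g" drops trailing zeros.
    ET.SubElement(request, "regionofinterest").text = ",".join("%g" % c for c in roi)
    ET.SubElement(request, "slicenumber").text = slice_number
    return ET.tostring(request, encoding="unicode")


xml_request = build_roi_request("mouse", [-2, 2, -2, -2, 2, -2, 2, 2, -2, 2], "031")
print(xml_request)
```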

Page 19:

Catalog, and catalog services

• The current model:
  – MBAT registers individual source services, queries them for both metadata and data
    • Response time based on the slowest of them

• Catalog-based:
  – For each data type, there is a catalog that stores information for initial discovery
    • E.g. probes: ABA:probe1; geneA…;…. GN:probe1 (ensuring unique probe IDs…)
    • Or images: ABA:image1; type=zoomify;URL=…;
  – Discovery queries (getProbes, getProbeInfo, getTissues, getTissueInfo, etc.) are executed against the catalog, while getData go against data sources
  – The Catalog is synched with data sources periodically (sync services)
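
The catalog-based model can be illustrated with a small in-memory Python sketch. Everything in it is hypothetical: the Source and Catalog classes, their method names, and the sample probes are stand-ins chosen to show the split between discovery (answered from the synchronized catalog) and retrieval (routed to the owning source), with source-prefixed identifiers keeping probe IDs unique.

```python
class Source:
    """Toy data source; a real one would sit behind a web service wrapper."""

    def __init__(self, name, probes):
        self.name = name
        self._probes = probes  # probe id -> (gene, expression values)

    def list_probes(self):
        return {pid: gene for pid, (gene, _) in self._probes.items()}

    def get_data(self, probe_id):
        return self._probes[probe_id][1]


class Catalog:
    """Central index holding just enough metadata for discovery."""

    def __init__(self, sources):
        self._sources = {s.name: s for s in sources}
        self._index = {}

    def sync(self):
        # Prefix each probe with its source name so identifiers stay unique.
        self._index = {
            f"{name}:{pid}": gene
            for name, src in self._sources.items()
            for pid, gene in src.list_probes().items()
        }

    def get_probes(self, gene):
        # Discovery: answered from the index, no source is contacted.
        return sorted(k for k, g in self._index.items() if g == gene)

    def get_data(self, qualified_id):
        # Retrieval: routed to the one source that owns the probe.
        name, pid = qualified_id.split(":", 1)
        return self._sources[name].get_data(pid)


catalog = Catalog([
    Source("ABA", {"probe1": ("geneA", [0.1, 0.4])}),
    Source("GN", {"probe1": ("geneA", [2.0]), "probe2": ("geneB", [3.5])}),
])
catalog.sync()
hits = catalog.get_probes("geneA")
print(hits, catalog.get_data(hits[1]))
```

With this split, discovery latency no longer depends on the slowest source; the cost is that the catalog can be stale between sync runs.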

Page 20:

Feature Requests

• Ability to mix "AND" and "OR" in queries
  – (currently all queries assume "AND" of all parameters)
  – might need a "language" to specify query

• Ability to request "pages" of results
  – for example, show me the first 10 results
  – similar to most search engine results

• More requests: more complete SQL
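
Both requests, mixed AND/OR conditions and paged results, can be sketched with a tiny expression format in Python. The nested-tuple query syntax, the function names matches and search, and the sample records are all assumptions made for illustration; they are one possible shape for the query "language", not a proposal from the session.

```python
def matches(expr, record):
    """Evaluate a nested ("and"|"or", ...) / ("eq", field, value) expression."""
    op = expr[0]
    if op == "eq":
        _, field, value = expr
        return record.get(field) == value
    if op == "and":
        return all(matches(e, record) for e in expr[1:])
    if op == "or":
        return any(matches(e, record) for e in expr[1:])
    raise ValueError(f"unknown operator: {op}")


def search(records, expr, page=1, page_size=10):
    """Filter records, then return one 1-based page of the results."""
    hits = [r for r in records if matches(expr, r)]
    start = (page - 1) * page_size
    return hits[start:start + page_size]


records = [
    {"species": "mouse", "sex": "male", "stage": "adult"},
    {"species": "mouse", "sex": "female", "stage": "P7"},
    {"species": "rat", "sex": "male", "stage": "adult"},
]
# species = mouse AND (sex = female OR stage = adult)
query = ("and", ("eq", "species", "mouse"),
         ("or", ("eq", "sex", "female"), ("eq", "stage", "adult")))
first_page = search(records, query, page=1, page_size=1)
second_page = search(records, query, page=2, page_size=1)
print(first_page, second_page)
```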

Page 21:

Issues/steps (for MA and 2D)

1. Figure out how search requests and outputs are implemented in MBAT (http://www.loni.ucla.edu/twiki/bin/view/MouseBIRN/WebServices), the MA module in BIRN (http://microarray.nbirn.net/), and GN (http://www.genenetwork.org/CGIDoc.html).

2. Examine MAGE and see how the same MA requests and output can be expressed in MAGE. Then, depending on the results of (1), either abandon MAGE in favor of some simpler XML (potentially embedded in XCEDE?), or rely on MAGE constructs (and include them in XCEDE wrappers for gene expression sources, as a foreign namespace?). This shall be done vis-à-vis common information requirements of client applications (e.g. GetProbes?, GetGenes?, GetStructures?, etc.).

3. In parallel, review the schema used in the MA module to check whether it sufficiently reflects the information model for GE data, and update as necessary.

4. If we decide on the XCEDE route, make sure the mediator can connect with XCEDE sources, be it a database source or a web source in an XCEDE wrapper (see http://mediator.nbirn.net:8080/axis/services/MedTestBService?wsdl – this would involve passing web service calls via the ExecuteQuery? method, and conversion between XCEDE and the mediator’s recordset).

5. Identify additional sources or databases to be wrapped in the same API (GN, Gensat, ABA, BIRN MA – for MA data; CCDB, ABA, ArcIMS, spatial registry, Gensat – for 2D). Then finalize the signatures.

6. Make sure terms used in queries and in the output are tagged with BIRNLex terms (e.g. develop controlled vocabularies for each term).

7. Implement web services for the GN and MA module (incl. testing/deployment).

8. Based on results of (3), update data publication tools (i.e. software for loading data from common CSV and text files into the MA module); make sure controlled vocabularies are enforced.

9. Make sure AIDB’s XCEDE wrapper supports the services as well(?).

10. Publish and document web services; develop a series of examples of how they can be called from various programming environments and applications.

Page 22:

API for GE/MA

• What are use cases?

• What is the information model, and what is the catalog?
  a) species -> subjects (age, sex, etc.) -> strains -> probes -> genes -> tissues

• API for discovery… (by tissues, genes, probe sets,…)
  – getSpecies() -> list of species in the registry
  – getSpeciesInfo -> Species metadata from the registry
  – getProbes (species, strain, sex, age)
  – getProbeInfo

• API for retrieval…

Page 23:

[Diagram: information model and calls for GE/MA; labels recovered from the slide]

Boxes:
• Species
• Subjects (Stage|age, sex)
• Strains (genetic manipulations)
• Manufacturer / Genetic manipulations (biowarehouse)
• Probes (resolution, failed or not)
• Genes (from a master list)
• Tissues (incl. pointer to tissue vocabulary)
• Probe-tissue Catalog
• CCDB
• GE values (categorical or numeric)
• provenanceInfo
• normalization, Units, etc.

Calls:
• getSpecies, getSpeciesInfo
• getSubjects, ..
• getGenes({probeseries}), ..
• getGenes({tissue})
• getProbes({Genes})
• getProbeTissues
• getGE({probeTissues})

Discovery stage: ultimately till GetProbeTissues call

DataRetrieval: getGE

Page 24:

API for 2D images

• What are use cases?

• What is the information model, and what is the catalog?

• API for discovery… (by ROI, by labeled regions, by spatial relations…also, getImageInfo? getImageSeriesInfo?)

• API for retrieval (getImage? getImageStack?)

Page 25:

Find expression for a Gene based on Gene Name or abbreviation

[Diagram; labels recovered from the slide]

Boxes:
• Image catalog: Type of image, gene expressed, Spatial characteristics (coord system, Spatial extent, plane, local XYZs)
• Genes (name, abbrev)
• Proteins
• Gene|protein-tissue catalog
• Species, strains, Subjects, projects
• Image stacks and groups
• Imagery servers

Calls:
• getImages
• getImageInfo (size, orientation..)
• getImageStacks
• getImageStackInfo

Stages: discovery; retrieval

Page 26:

• Have “representatives” for each of the GE and 2D sources, who would vet the schema for whether the sources can be mapped into it without significant losses