Post on 29-Jan-2016
1
Cancer Biomedical Informatics Grid
(caBIG) – An Approach towards Data
Access and Integration
Avinash Shanbhag
Director, Core Infrastructure Engineering
National Cancer Institute Center for Bioinformatics
2
National Cancer Institute 2015 Goal
Relieve suffering and death due to cancer by the year 2015
3
Origins of caBIG
Need: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal.
Strategy: Create a scalable, actively managed organization that connects members of the NCI-supported cancer enterprise by building a biomedical informatics network over which data can be seamlessly shared.
4
caBIG Challenges
Handle diversity of data types
Precise “Meaning” of data
Provide local hosting of data
Local access control
Provide tools to “publish” and “access” data easily
High Performance computing will be needed in future
5
Semantic interoperability
Syntactic interoperability
Interoperability
ability of a system to access and use the parts or equipment of another system
6
How to Achieve Interoperability for Data Systems?
Well-documented public API access to data
Based on object-oriented abstraction of underlying data
– No particular technology or tool specified
Abstraction layer must be derived using widely accepted “standards”
– Model Driven Architecture
Information Model is the “Metadata” of the data and needs to be persisted and accessible via API
Need to be able to “unambiguously” and programmatically determine the meaning of data
7
OMG Model Driven Architecture (MDA) Approach
Analyze the problem space and develop the artifacts for each scenario
– Use Cases
Use Unified Modeling Language (UML) to standardize model representations and artifacts.
Design the system by developing artifacts based on the use cases
– Class Diagram: Information Model
– Sequence Diagram: Temporal Behavior
Use meta-model tools to generate the code
8
Limitations of MDA
Limited expressivity for semantics
No facility for runtime semantic metadata management
9
caCORE
Syntactic and Semantic Integration
MDA Plus a whole lot more!
10
caCORE
Bioinformatics Objects
Enterprise Vocabulary
Common Data Elements
SECURITY
11
Use Cases
Description
Actors
Basic Course
Alternative Course
12
Bioinformatics Objects
13
What do all those data classes and attributes actually mean, anyway?
Data descriptors or “semantic metadata” required
Computable, commonly structured, reusable units of metadata are “Common Data Elements” or CDEs.
NCI uses the ISO/IEC 11179 standard for metadata structure and registration
Semantics all drawn from Enterprise Vocabulary Service resources
Common Data Elements
14
Preferred Name
Synonyms
Definition
Relationships
Concept Code
Enterprise Vocabulary Description Logic
15
Semantic metadata example: Agent
<Agent>
<name>Taxol</name>
<nSCNumber>007</nSCNumber>
</Agent>
16
Why do you need metadata?
Class/Attribute | Example Object Data | CIA Metadata | NCI Metadata
Agent | | A sworn intelligence agent; a spy | Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition
Agent.nSCNumber | 007 | Identifier given to an intelligence agent by the National Security Council | Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee
Agent.name | Taxol | CIA code name given to intelligence agents | Common name of chemical compound used as an agent
17
Computable Interoperability
My model: Agent (name, nSCNumber, FDAIndID, CTEPName, IUPACName)
Your model: Drug (id, NDCCode, approver, approvalDate, fdaCode)
Concept codes annotated in both models: C1708, C1708:C41243
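Once elements of two models carry concept codes, they can be aligned programmatically even though their names differ. The following is a minimal sketch of that idea, not caCORE code. The class `ConceptAligner` is hypothetical, and which attribute carries which code is an assumption made for illustration, since the slide only shows that both codes appear in both models.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: align two independently designed models through shared concept codes. */
final class ConceptAligner {

    /**
     * Maps each element of "mine" to the element of "yours" annotated with
     * the same concept code. Elements without a counterpart are omitted.
     */
    static Map<String, String> align(Map<String, String> mine, Map<String, String> yours) {
        // Invert "yours" so an element can be looked up by its concept code.
        Map<String, String> yoursByCode = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : yours.entrySet()) {
            yoursByCode.putIfAbsent(e.getValue(), e.getKey());
        }
        Map<String, String> matches = new TreeMap<>();
        for (Map.Entry<String, String> e : mine.entrySet()) {
            String counterpart = yoursByCode.get(e.getValue());
            if (counterpart != null) {
                matches.put(e.getKey(), counterpart);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Element name -> concept code. The attribute-level assignments are assumed.
        Map<String, String> myModel = new LinkedHashMap<>();
        myModel.put("Agent", "C1708");
        myModel.put("Agent.nSCNumber", "C1708:C41243");
        myModel.put("Agent.CTEPName", "LOCAL:1");   // hypothetical local code

        Map<String, String> yourModel = new LinkedHashMap<>();
        yourModel.put("Drug", "C1708");
        yourModel.put("Drug.fdaCode", "C1708:C41243");
        yourModel.put("Drug.approver", "LOCAL:2");  // hypothetical local code

        System.out.println(align(myModel, yourModel));
    }
}
```

The matching relies only on the codes, so neither side needs to know the other's class or attribute names in advance.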
18
Cancer Data Standards Repository
ISO/IEC 11179 Registry for Common Data Elements – units of semantic metadata
Client for Enterprise Vocabulary: metadata constructed from controlled terminology and annotated with concept codes
Precise specification of Classes, Attributes, Data Types, Permissible Values: Strong typing of data objects.
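To illustrate what "permissible values" and strong typing mean in practice, here is a small sketch of a data element that rejects values outside its registered set. It assumes nothing about the caDSR API: the class `DataElementSketch`, the attribute `approvalStatus`, the code `HYPOTHETICAL:0001`, and the values are all invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of a common data element (CDE) that constrains an attribute's values. */
final class DataElementSketch {
    final String className;        // e.g. "Agent"
    final String attributeName;    // e.g. "approvalStatus" (hypothetical)
    final String conceptCode;      // code taken from a controlled terminology
    final Set<String> permissibleValues;

    DataElementSketch(String className, String attributeName,
                      String conceptCode, String... permissibleValues) {
        this.className = className;
        this.attributeName = attributeName;
        this.conceptCode = conceptCode;
        this.permissibleValues = Collections.unmodifiableSet(
                new LinkedHashSet<>(Arrays.asList(permissibleValues)));
    }

    /** True when the value is one of the registered permissible values. */
    boolean accepts(String value) {
        return permissibleValues.contains(value);
    }

    public static void main(String[] args) {
        DataElementSketch status = new DataElementSketch(
                "Agent", "approvalStatus", "HYPOTHETICAL:0001",
                "APPROVED", "INVESTIGATIONAL");
        System.out.println(status.accepts("APPROVED"));
        System.out.println(status.accepts("approved-ish"));
    }
}
```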
19
caCORE Tools
UML Loader: automatically register UML models as metadata components
CDE Curation: Fine tune metadata and constrain permissible values with data standards
Form Builder: Create standards-based data collection forms
CDE Browser: search and export metadata components
Common Security Module: Provides role based security
20
caCORE Software Development Kit
UML Modeling Tool (any with XMI export)
Semantic Connector (concept binding utility)
UML Loader (model registration in caDSR)
Codegen (middleware code generator)
Security Adaptor (Common Security Module)
caCORE SDK generates a syntactically and semantically interoperable data service system
21
caGrid
caCORE meets grid technology!
22
Use cases not satisfied by caCORE alone
Advertisement
– Service Provider composes service metadata describing the service and publishes it to the grid.
Discovery
– Researcher (or application developer) specifies search criteria describing a service of interest
– The researcher submits the discovery request to a discovery service, which identifies a list of services matching the criteria, and returns the list.
Invocation
– Researcher (or application developer) instantiates the grid service and accesses its resources
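The three use cases can be sketched with an in-memory registry. This is only an illustration of the advertise, discover, invoke sequence. It is not the caGrid or Globus API, and every name in it (`RegistrySketch`, `ServiceEntry`, `exposedConcept`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** In-memory stand-in for a grid service registry. */
final class RegistrySketch {

    /** Service metadata plus a callable endpoint. */
    static final class ServiceEntry {
        final String name;
        final String exposedConcept;             // concept code the service exposes
        final Function<String, String> endpoint; // stands in for the remote service

        ServiceEntry(String name, String exposedConcept, Function<String, String> endpoint) {
            this.name = name;
            this.exposedConcept = exposedConcept;
            this.endpoint = endpoint;
        }
    }

    private final List<ServiceEntry> entries = new ArrayList<>();

    /** Advertisement: the provider publishes its service metadata. */
    void advertise(ServiceEntry entry) {
        entries.add(entry);
    }

    /** Discovery: return every entry matching the caller's search criteria. */
    List<ServiceEntry> discover(Predicate<ServiceEntry> criteria) {
        List<ServiceEntry> hits = new ArrayList<>();
        for (ServiceEntry e : entries) {
            if (criteria.test(e)) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();
        registry.advertise(new ServiceEntry("agentService", "C1708", q -> "agents for " + q));
        registry.advertise(new ServiceEntry("otherService", "HYPOTHETICAL:1", q -> "n/a"));

        List<ServiceEntry> hits = registry.discover(e -> e.exposedConcept.equals("C1708"));
        // Invocation: call the first matching service.
        System.out.println(hits.get(0).endpoint.apply("Taxol"));
    }
}
```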
23
[Diagram: grid connecting several Cancer Centers, NCI, other caBIG service providers, and other toolkits; one node is labeled Gold and the remaining nodes are labeled Silver]
24
caGrid Components
Leverage existing technologies:
– caDSR, EVS, Mobius GME: Common data elements, controlled vocabularies, schema management
– Globus Toolkit (currently version 4.0.1)
• Core grid services infrastructure
• Service deployment, service registry, invocation, base security infrastructure
Additional Core Infrastructure
– Higher-level security services (Dorian)
– Grid service access to metadata components (caDSR, GME, etc)
– Workflow, Identifier services
Service Provider Tooling (Introduce)
– Graphical service development and configuration environment
– Abstractions from service infrastructure for Data and Analytical services
– Deployment wizards
Client Tooling
– High-level APIs for interacting with core components and services
– Graphical Tools
25
caGrid 0.5 Architecture (may be updated for 1.0)
[Architecture diagram. Layer labels: Grid Communication Protocol, Service Description, Service, Business Process, Service Registry, Security, Semantic Service, Resource Management, Functions, Quality of Service, ID Resolution, Transport. Components shown: GSI, GUMS, CAMS, GT3, OGSA-DAI, GLOBUS Toolkit, Analytical, caDSR, EVS, caDSR Index, GME, UI.]
26
Data Object Semantics, Metadata, and Schemas
Object oriented, APIs, well-defined data types
Classes defined in UML and converted into ISO/IEC 11179, registered in the caDSR
Definitions drawn from Enterprise Vocabulary Services (EVS), relationships semantically described
XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
[Diagram: a Grid Service (Service Definition in WSDL, Data Type Definitions in XSD, Service API) and a Grid Client (Client API) exchange Objects serialized as XML. The XSD data type definitions are registered in the Global Model Exchange (GME), and XML objects validate against them. Object definitions are registered in the Cancer Data Standards Repository and semantically described in the Enterprise Vocabulary Services.]
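Validating a serialized object against its registered schema can be sketched with the standard `javax.xml.validation` API. The schema below is a hypothetical one written for the Agent example from earlier in the deck. It is not a schema from the GME, and retrieval from a registry is replaced by an inline string.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Sketch: check that a serialized object conforms to an XML schema. */
final class SchemaCheckSketch {

    // Hypothetical schema for the Agent example; a real one would be fetched from a schema registry.
    static final String AGENT_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Agent'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "<xs:element name='nSCNumber' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns true when the XML document validates against the given schema text. */
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            // Malformed schema, malformed XML, or a validation error.
            return false;
        }
    }

    public static void main(String[] args) {
        String agent = "<Agent><name>Taxol</name><nSCNumber>007</nSCNumber></Agent>";
        System.out.println(isValid(AGENT_XSD, agent));
    }
}
```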
27
Introduce Toolkit
A framework which enables fast and easy creation of caGrid compatible services whether they are data, analytical, custom, or core services.
Provide easy to use graphical service authoring tools.
Hide all “grid-ness” from the developer so that they can concentrate on the domain expert implementation.
Utilize best practice layered grid service architectures.
Handle all service architecture requirements of the caGrid.
– Strong service interface data typing
– Metadata and service registration
– Grid security integration
28
Data Service Access on caGrid
Specialization of caGrid grid services to expose data through a common query interface
Present an object view of data sources
Exposed objects are registered in caDSR and their XML representation in GME
Queries made with caBIG Query Language (CQL) Query objects
Results returned as objects (or identifiers) nested in a CQL Query Result Set
29
Data Service Query Language
30
Data Service Interface
public CQLQueryResultsType processQuery(CQLQueryType query)
Data Provider’s only responsibility is to implement CQL over their local data resource
– A default implementation will be provided for caCORE SDK created systems
caGrid provides grid service implementation to invoke provider’s CQL implementation
Service provides all features necessary for compliance, such as advertisement of data service metadata, and security integration
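A provider-side implementation of a single query operation might look like the sketch below. The real `CQLQueryType` and `CQLQueryResultsType` are not reproduced here. `Query` and `Results` are simplified stand-ins invented for this example, the query is reduced to one attribute equality test, and the "local data resource" is an in-memory list with illustrative values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a data provider implementing one query operation over local data. */
final class DataServiceSketch {

    /** Stand-in for a CQL query: a target class plus one attribute equality test. */
    static final class Query {
        final String targetClass;
        final String attribute;
        final String value;

        Query(String targetClass, String attribute, String value) {
            this.targetClass = targetClass;
            this.attribute = attribute;
            this.value = value;
        }
    }

    /** Stand-in for a CQL result set. */
    static final class Results {
        final String targetClass;
        final List<String[]> rows = new ArrayList<>();

        Results(String targetClass) {
            this.targetClass = targetClass;
        }
    }

    /** The single operation a data provider implements. */
    interface DataService {
        Results processQuery(Query query);
    }

    /** Provider implementation over an in-memory list of {name, nSCNumber} rows. */
    static final class LocalAgentService implements DataService {
        private final List<String[]> agents = Arrays.asList(
                new String[] {"Taxol", "007"},
                new String[] {"OtherAgent", "008"});   // illustrative row

        @Override
        public Results processQuery(Query query) {
            Results results = new Results(query.targetClass);
            if (!"Agent".equals(query.targetClass)) {
                return results;                         // unknown class: empty result set
            }
            int column = "name".equals(query.attribute) ? 0
                       : "nSCNumber".equals(query.attribute) ? 1 : -1;
            for (String[] row : agents) {
                if (column >= 0 && row[column].equals(query.value)) {
                    results.rows.add(row);
                }
            }
            return results;
        }
    }
}
```

Keeping the provider's obligation to one method is what lets the surrounding grid service handle advertisement and security without knowing anything about the local data store.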
31
Data Service Query Scenario
1. Client builds a CQL Query
2. CQL Query is serialized and submitted to the Grid Data Service
3. Grid Data Service deserializes the CQL Query Object and processes it
4. Data Source is queried by the Grid Data Service
5. Grid Data Service builds a CQL Result Set
6. Result Set is serialized and returned to the client
7. Client deserializes result set
8. Result set is iterated with client tools to retrieve objects
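The eight steps of the scenario can be traced in a toy round trip. The sketch below assumes an invented `class|attribute|value` wire format and newline-separated results in place of the real XML serialization, and it uses naive substring matching instead of real query evaluation. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the client/service round trip using an invented wire format. */
final class QueryRoundTripSketch {

    // Steps 1-2: the client builds a query and serializes it for submission.
    static String serializeQuery(String targetClass, String attribute, String value) {
        return targetClass + "|" + attribute + "|" + value;
    }

    // Steps 3-6: the service deserializes the query, scans its data source,
    // builds a result set, and serializes it (one record per line).
    static String serve(String wireQuery, List<String> dataSource) {
        String[] parts = wireQuery.split("\\|", 3);
        String wanted = parts[2];
        StringBuilder wireResults = new StringBuilder();
        for (String record : dataSource) {
            if (record.contains(wanted)) {          // naive matching, for illustration only
                if (wireResults.length() > 0) {
                    wireResults.append('\n');
                }
                wireResults.append(record);
            }
        }
        return wireResults.toString();
    }

    // Steps 7-8: the client deserializes the result set so it can iterate over it.
    static List<String> deserializeResults(String wireResults) {
        List<String> out = new ArrayList<>();
        if (!wireResults.isEmpty()) {
            out.addAll(Arrays.asList(wireResults.split("\n")));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = serializeQuery("Agent", "name", "Taxol");
        String wire = serve(query, Arrays.asList("Agent:Taxol:007", "Agent:Other:008"));
        for (String row : deserializeResults(wire)) {
            System.out.println(row);
        }
    }
}
```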
32
Federated and Aggregated Queries
Componentized library being developed to facilitate limited federated and aggregated queries
An extension language used to describe distributed queries
Library creates and executes a Query Plan for the distributed query, using multiple CQL queries to targeted data services
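An aggregated query can be pictured as one sub-query per targeted data service, with the results combined by the library. The sketch below shows only that fan-out and merge. It assumes each service is a plain function from a query string to rows, which is a simplification invented here and not the caGrid library's design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch: run one query against several data services and merge the results. */
final class FederatedQuerySketch {

    /**
     * Executes a trivial "query plan": the same query is sent to each target
     * in order, and every returned row is tagged with the service it came from.
     */
    static List<String> aggregate(String query, List<String> targets,
                                  Map<String, Function<String, List<String>>> services) {
        List<String> combined = new ArrayList<>();
        for (String target : targets) {
            Function<String, List<String>> service = services.get(target);
            if (service == null) {
                continue;                       // unknown service: skipped in this sketch
            }
            for (String row : service.apply(query)) {
                combined.add(target + ":" + row);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> services = new LinkedHashMap<>();
        services.put("centerA", q -> List.of("Taxol"));            // hypothetical services
        services.put("centerB", q -> List.of("Taxol", "Other"));

        System.out.println(aggregate("Agent", List.of("centerA", "centerB"), services));
    }
}
```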
33
Data Service Client Tooling
APIs provided to discover available data services on the grid based on client-defined criteria (such as exposed data models and concepts)
Object-Oriented API for building queries, querying a given data service, and processing the results
Client tools available to iterate query result sets
– Object iterator deserializes XML into registered objects
– XML iterator simply returns XML documents
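The difference between the two iterators can be shown with a short sketch built on the JDK's DOM parser. The `Agent` class and both iterator factories are hypothetical. They reuse the Agent XML from the earlier semantic metadata example and do not reflect the actual caGrid client tooling.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: two ways to walk a result set of XML documents. */
final class ResultIteratorSketch {

    /** Minimal object form of the Agent example. */
    static final class Agent {
        final String name;
        final String nscNumber;

        Agent(String name, String nscNumber) {
            this.name = name;
            this.nscNumber = nscNumber;
        }
    }

    /** XML iterator: hands back each result document unchanged. */
    static Iterator<String> xmlIterator(List<String> documents) {
        return documents.iterator();
    }

    /** Object iterator: deserializes each document into an Agent on demand. */
    static Iterator<Agent> objectIterator(List<String> documents) {
        Iterator<String> xml = documents.iterator();
        return new Iterator<Agent>() {
            @Override public boolean hasNext() { return xml.hasNext(); }
            @Override public Agent next() { return parse(xml.next()); }
        };
    }

    /** Parses one Agent document; throws IllegalArgumentException on bad input. */
    static Agent parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String name = doc.getElementsByTagName("name").item(0).getTextContent();
            String nsc = doc.getElementsByTagName("nSCNumber").item(0).getTextContent();
            return new Agent(name, nsc);
        } catch (Exception e) {
            throw new IllegalArgumentException("not a valid Agent document", e);
        }
    }
}
```

A client that only forwards or stores results can use the XML iterator and skip parsing entirely, while a client that works with the data model uses the object iterator.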
34
Acknowledgements (caGrid Team)
Ohio State University - Department of BioMedical Informatics: Dave Ervin, Shannon Hastings, Tahsin Kurc, Stephen Langella, Scott Oster, Joel Saltz
Argonne National Lab / University of Chicago: William Allcock, Jarek Gawor, Ravi Madduri, Frank Siebenlist, Michael Wilde
Duke University: A. Jamie Cuticchia, Patrick McConnell
Georgetown University: Colin Freas, Paul A. Kennedy, Chad La Joie
SAIC (http://www.saic.com): Manav Kher
ScenPro/Semantic Bits: Vinay Kumar, David Wellborn, Valerie Bragg
Booz | Allen | Hamilton (http://www.bah.com): Arumani Manisundaram, Michael Keller, Reechik Chatterjee
35
Acknowledgements
NCI: Andrew von Eschenbach, Anna Barker, Wendy Patterson, OC, DCTD, DCB, DCP, DCEG, DCCPS, CCR
Industry Partners: SAIC, BAH, Oracle, ScenPro, Ekagra, Apelon, Terrapin Systems, Panther Informatics
NCICB: Ken Buetow, Peter Covitz, George Komatsoulis, Denise Warzel, Frank Hartel, Sherri De Coronado, Dianne Reeves, Gilberto Fragoso, Jill Hadfield, Leslie Derr