January 2006
Biological data integration by bi-directional schema transformation rules
Alexandra Poulovassilis, Birkbeck, U. of London
The London Knowledge Lab
Purpose-designed building
Science Research Infrastructure Fund: £6m
Research staff and students: 50
Location: Bloomsbury
Open: June 2004
Institute of Education, University of London
Birkbeck College, University of London
Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...
Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies ...
LKL Research Themes
Research at the London Knowledge Lab consists mainly of externally funded projects (funders: EU, EPSRC, ESRC, AHRB, BBSRC, JISC, Wellcome Trust); currently about 25 projects.
Four broad themes guide our work and inform our research strategy:
• new forms of knowledge
• turning information into knowledge
• the changing cultures of new media
• creating empowering technologies for formal and informal learning
Turning Information Into Knowledge
• The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies
• How can people benefit from this information in their learning, working and social lives ?
• What new techniques are necessary for managing, accessing, integrating and personalising such information ?
• How to design and build tools that help people to understand such information and generate new knowledge from it ?
Turning Information Into Knowledge – Information Integration
AutoMed (EPSRC)
– developing tools for semi-automatic integration of heterogeneous information sources
– can handle both structured and semi-structured (RDF/S, XML) data
– can handle virtual, materialised and hybrid integration scenarios
– application in biological data integration, e-learning, P2P data integration
ISPIDER (BBSRC e-Science programme)
– developing an integrated platform of proteomic data sources, enabled as Grid and Web services
– collaboration with groups at EBI, Manchester, UCL
The AutoMed Project
Partners: Birkbeck and Imperial Colleges
Data integration based on schema equivalence
Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages are defined; it is therefore extensible with new modelling languages
Automatically provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
• addT(c,q)
• deleteT(c,q)
• renameT(c,n,n’)
There are also two more primitive transformations for imprecise integration scenarios:
• extendT(c,Range q q’)
• contractT(c,Range q q’)
AutoMed Features
Schema transformations are automatically reversible:
• addT/deleteT(c,q) by deleteT/addT(c,q)
• extendT(c,Range q1 q2) by contractT(c,Range q1 q2)
• renameT(c,n,n’) by renameT(c,n’,n)
Hence bi-directional transformation pathways (more generally transformation networks) are defined between schemas
The queries within transformations allow automatic data and query translation
Schemas may be expressed in a variety of modelling languages
Schemas may or may not have a data source associated with them; thus, virtual, materialised or hybrid integration can be supported
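The reversibility rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the AutoMed toolkit API: the `Step` class, its field names and the two functions are hypothetical, and queries are held as opaque strings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One primitive transformation (hypothetical representation)."""
    kind: str               # "add", "delete", "rename", "extend" or "contract"
    construct: str          # the schema construct c
    args: Tuple[str, ...]   # (q,) for add/delete, (q1, q2) for the Range
                            # of extend/contract, (n, n') for rename


# add <-> delete and extend <-> contract keep their arguments unchanged.
_INVERSE = {"add": "delete", "delete": "add",
            "extend": "contract", "contract": "extend"}


def reverse_step(step: Step) -> Step:
    """Return the step that undoes `step`."""
    if step.kind == "rename":
        old, new = step.args
        return Step("rename", step.construct, (new, old))
    return Step(_INVERSE[step.kind], step.construct, step.args)


def reverse_pathway(pathway: List[Step]) -> List[Step]:
    """Reverse a pathway: invert every step and apply them in reverse order."""
    return [reverse_step(s) for s in reversed(pathway)]
```

Reversing a pathway twice gives back the original pathway, which is what makes a pathway usable in both directions.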
Schema Transformation/Integration Networks
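[Figure: schema network showing the local schemas LS1, LS2, …, LSi, …, LSn, the union-compatible schemas US1, US2, …, USi, …, USn connected by id transformations, and the global schema GS]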
Schema Transformation/Integration Networks (cont’d)
On the previous slide:
• GS is a global schema
• LS1, …, LSn are local schemas
• US1, …, USn are union-compatible schemas
• the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
• the transformation pathway between USi and GS is similar
• the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
AutoMed Architecture
[Figure: AutoMed architecture, with the following components]
• Global Query Processor
• Global Query Optimiser
• Schema Evolution Tool
• Schema Transformation and Integration Tools
• Model Definition Tool
• Schema and Transformation Repository
• Model Definitions Repository
• Wrapper
Comparison with GAV & LAV Data Integration
Global-As-View (GAV) approach: specify GS constructs by view definitions over LSi constructs
Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs
[Figure: a Global Schema connected by View Definitions to three Local Schemas; data sources labelled RDF, XML File and RDB]
GAV Example
student(id,name,left,degree) = [<x,y,z,w> | (<x,y,z,w,_> ∈ ug ∧ <x,_,_,_,_> ∉ phd) ∨ (<x,y,z,w,_> ∈ phd ∧ w = ‘phd’)]
monitors(sno,id) = [<x,y> | (<x,_,_,_,y> ∈ ug ∧ <x,_,_,_,_> ∉ phd) ∨ <x,y> ∈ supervises]
staff(sno,sname,dept) = [<x,y,z> | (<x,y,z,w,_> ∈ tutor ∧ <x,_,_> ∉ supervisor) ∨ <x,y,z> ∈ supervisor]
LAV Example
tutor(sno,sname) = [<x,y> | <x,y,_> ∈ staff ∧ <x,z> ∈ monitors ∧ <z,_,_,w> ∈ student ∧ w ≠ ‘phd’]
ug(id,name,left,degree,sno) = [<x,y,z,w,v> | <x,y,z,w> ∈ student ∧ <v,x> ∈ monitors ∧ w ≠ ‘phd’]
phd, supervises, supervisor are defined similarly
Evolution problems of GAV and LAV
GAV does not readily support evolution of local schemas e.g. adding an ‘age’ attribute to ‘phd’ invalidates some of the global view definitions
In LAV, changes to a local schema impact only the derivation rules defined for that schema e.g. adding an ‘age’ attribute to ‘phd’ affects only the rule defining ‘phd’
But LAV has problems if one wants to evolve the global schema since all the rules defining local schema constructs in terms of the global schema would need to be reviewed
These problems are exacerbated in P2P data integration scenarios where there is no distinction between local and global schemas
AutoMed approach, ‘Growing’ Phase
assuming initially a schema U = S1 + S2
addRel(<<student,id>>, [x | x ∈ <<ug,id>> ∨ x ∈ <<phd,id>>])
addAtt(<<student,name>>, [<x,y> | (<x,y> ∈ <<ug,name>> ∧ x ∉ <<phd,id>>) ∨ <x,y> ∈ <<phd,name>>])
addAtt(<<student,left>>, [<x,y> | (<x,y> ∈ <<ug,left>> ∧ x ∉ <<phd,id>>) ∨ <x,y> ∈ <<phd,left>>])
…
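To make the growing-phase queries concrete, the sketch below evaluates two of them over toy extents using Python sets. It assumes one reading of the slide: student ids are the union of the ug and phd ids, and a student's name comes from ug only when the id is not also a phd id, otherwise from phd. The data values and function names are invented for illustration and are not part of AutoMed.

```python
from typing import Set, Tuple

# Toy extents of the source constructs (hypothetical data).
ug_id: Set[int] = {1, 2}
ug_name: Set[Tuple[int, str]] = {(1, "Ann"), (2, "Bob")}
phd_id: Set[int] = {2, 3}
phd_name: Set[Tuple[int, str]] = {(2, "Robert"), (3, "Cy")}


def student_id_extent() -> Set[int]:
    """Query of addRel(<<student,id>>, ...): ids from either source."""
    return ug_id | phd_id


def student_name_extent() -> Set[Tuple[int, str]]:
    """Query of addAtt(<<student,name>>, ...): the phd source takes
    priority when an id appears in both sources."""
    from_ug = {(x, y) for (x, y) in ug_name if x not in phd_id}
    return from_ug | phd_name
```

With this data, student 2 appears in both sources and takes the name recorded in the phd source, so each student id ends up with a single name.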
AutoMed approach, ‘Shrinking’ Phase
contrAtt(<<tutor,sname>>, Range [<x,y> | <x,y> ∈ <<staff,sname>> ∧ <z,x> ∈ <<ug,sno>>] Any)
contrRel(<<tutor,sno>>, Range [x | x ∈ <<staff,sno>> ∧ <z,x> ∈ <<ug,sno>>] Any)
Similarly contractions for the ug attributes and relation
AutoMed approach, Shrinking Phase (cont’d)
contrAtt(<<phd,title>>, Range Void Any)
delAtt(<<phd,left>>, [<x,y> | <x,y> ∈ <<student,left>> ∧ x ∈ <<phd,id>>])
delAtt(<<phd,name>>, [<x,y> | <x,y> ∈ <<student,name>> ∧ x ∈ <<phd,id>>])
delRel(<<phd,id>>, [x | x ∈ <<student,id>> ∧ <x,’phd’> ∈ <<student,degree>>])
Similarly deletions for supervises and supervisor
AutoMed vs GAV/LAV/GLAV
AutoMed schema transformation pathways capture at least the information available from GAV and LAV rules:
• add/extend transformations correspond to GAV rules
• delete/contract transformations correspond to LAV rules
We discussed this in our ICDE’03 paper, where we termed our integration approach both-as-view (BAV)
In particular, we discussed how GAV and LAV view definitions can be derived from a BAV specification
GLAV rules e :- e’ are captured by BAV transformations of the form add(T,e); …; del(T,e’)
Thus any reasoning or processing that is possible using GAV, LAV or GLAV is also possible using BAV
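As a rough illustration of the correspondence stated above, the following sketch splits a pathway into the steps that carry GAV-style information and those that carry LAV-style information. It is deliberately simplistic: the tuple layout and function name are hypothetical, queries are opaque strings, and the actual derivation of view definitions from a BAV pathway involves more than this partition.

```python
from typing import List, Tuple

# (kind, construct, query) with the query kept as an opaque string.
Step = Tuple[str, str, str]


def split_by_view_style(pathway: List[Step]):
    """Partition a pathway's steps by the kind of rule they correspond to."""
    # add/extend steps define a new construct over existing ones (GAV-like).
    gav_like = [(c, q) for kind, c, q in pathway if kind in ("add", "extend")]
    # delete/contract steps define a removed construct over remaining ones (LAV-like).
    lav_like = [(c, q) for kind, c, q in pathway if kind in ("delete", "contract")]
    return gav_like, lav_like
```

Rename steps carry no query and fall into neither group in this sketch.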
Schema Evolution in BAV
Unlike GAV/LAV/GLAV, the BAV framework readily supports the evolution of both local and global schemas
The evolution of the global or local schema is specified by a schema transformation pathway from the old to the new schema
For example, the figure on the right shows transformation pathways T from an old to a new global or local schema
[Figure: a pathway T from Global Schema S to New Global Schema S’, and a pathway T from Local Schema S to New Local Schema S’]
Global Schema Evolution
Each transformation step t in T: S → S’ is considered in turn
• if t is an add, delete or rename then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway); the extended pathway can be used to regenerate the necessary GAV or LAV views
• if t is a contract then there will be information present in S that is no longer available in S’; again there is nothing further to do
• if t is an extend then domain knowledge is required to determine if the new construct in S’ can in fact be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
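The case analysis above can be sketched as a small function that appends the evolution pathway T to an existing pathway. This is an assumption-laden illustration rather than AutoMed code: steps are plain tuples, and the `derivable` mapping stands in for the domain knowledge that says which extended constructs can be derived and by which query.

```python
from typing import Dict, List, Tuple

# (kind, construct, query) with the query kept as an opaque string.
Step = Tuple[str, str, str]


def extend_pathway(existing: List[Step],
                   evolution: List[Step],
                   derivable: Dict[str, str]) -> List[Step]:
    """Append the evolution steps of T: S -> S' to an existing pathway.

    `derivable` maps a construct to the query that derives it, for those
    extend steps a domain expert has judged derivable (hypothetical input).
    """
    extended = list(existing)
    for kind, construct, query in evolution:
        if kind == "extend" and construct in derivable:
            # Derivable after all: replace the extend step by an add step.
            extended.append(("add", construct, derivable[construct]))
        else:
            # add, delete, rename, contract, or a non-derivable extend:
            # nothing further to do, keep the step as it is.
            extended.append((kind, construct, query))
    return extended
```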
Local Schema Evolution
This is a bit more complicated as it may also require changes to be propagated to the global schema(s)
Again each transformation step t in T: S → S’ is considered in turn
In the case that t is an add, delete, rename or contract step, the evolution can be carried out automatically
If it is an extend, then domain knowledge is required
See our CAiSE’02, ICDE’03 and ER’04 papers for more details
The last of these discusses a materialised data integration scenario where the old/new global/local schemas have an extent
Global Query Processing
We handle query language heterogeneity by translation into/from a functional intermediate query language – IQL
A query Q expressed in a high-level query language on a schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)
View definitions are derived from the transformation pathways between S and the requested data source schemas
These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
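The substitution step can be pictured with a toy expression tree. The sketch below is not IQL and not the AutoMed query processor: expressions are nested tuples of the form (operator, argument, ...), construct names are strings, and the `views` dictionary is a hypothetical stand-in for view definitions derived from a pathway. It also assumes the view definitions are not cyclic.

```python
from typing import Dict, Tuple, Union

# An expression is a construct name (str) or a tuple (operator, arg1, arg2, ...).
Expr = Union[str, Tuple]


def reformulate(query: Expr, views: Dict[str, Expr]) -> Expr:
    """Replace every construct that has a view definition by that definition,
    recursively, until only source schema constructs remain."""
    if isinstance(query, str):
        if query in views:
            return reformulate(views[query], views)
        return query
    op, *args = query
    # The operator name is kept; only the arguments are rewritten.
    return (op, *[reformulate(a, views) for a in args])
```

For instance, with a view defining `<<student,id>>` as the union of `<<ug,id>>` and `<<phd,id>>`, a count over `<<student,id>>` becomes a count over that union.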
Global Query Processing (cont’d)
Query optimisation (currently algebraic) and query evaluation then occur
During query evaluation, the evaluator submits to wrappers sub-queries that they are able to translate into the local query language. Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources
The wrappers translate sub-query results back into the IQL type system
Further query post-processing then occurs in the IQL evaluator
Other AutoMed research at BBK
As well as virtual integration of data sources, we have investigated using AutoMed for materialised data integration e.g. a data warehousing approach
In particular, Hao Fan has worked on incremental view maintenance, data lineage tracing and schema evolution over AutoMed schema transformation pathways
Lucas Zamboulis has been looking at semi-automatic techniques for transforming and integrating heterogeneous XML data
In recent work we have also investigated using correspondences to RDFS schemas to enhance these techniques
Other AutoMed research at BBK (cont’d)
Dean Williams has been working on extracting structure from unstructured text sources
The aim here is to integrate information extracted from unstructured text with structured information available from other sources
Dean is using existing technology (the GATE tool) for the text annotation and IE part of this work
The information extracted from the text is matched with existing structured information to derive new instance data and perhaps also new schema fragments
AutoMed is being used for the schema and data integration aspects of this project
Other AutoMed research at Imperial
Automatic generation of equivalences between different data models
A graphical schema & transformations editor
Data mining techniques for extracting schema equivalences
Optimising schema transformation pathways
ISPIDER Project
Partners: Birkbeck, EBI, Manchester, UCL
Aims:
• Vast, heterogeneous biological data
• Need for interoperability
• Need for efficient processing
• Development of Proteomics Grid Infrastructure, use existing proteomics resources and develop new ones, develop new proteomics clients for querying, visualisation, workflow etc.
Project Aims
myGrid / DQP / AutoMed
myGrid: collection of services/components allowing high-level integration of data/applications for in-silico experiments in biology
DQP
• OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
• Distributed query processing over OGSA-DAI enabled resources
Current research: AutoMed – DQP interoperation
Future research: AutoMed – myGrid workflows interoperation
DQP – AutoMed Interoperability
Data sources wrapped with OGSA-DAI
AutoMed OGSA-DAI wrappers extract data sources’ metadata
Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema
IQL queries submitted to this integrated schema are:
• Reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
• Submitted to DQP for evaluation
[Figure: DQP – AutoMed interoperability. Labelled components: AutoMed Wrappers, AutoMed Repository, OGSA-DAI Activity (×3), DB (×3), AutoMed wrapper (×3), Distributed Query Processor, Integrated AutoMed Schema, AutoMed Schema (×3), AutoMed Query Processor, OGSA-DAI Service (×3), AutoMed DQP wrapper. Labelled arrows: IQL query, OQL query, OQL result, IQL result]
Data source schema extraction
AutoMed wrapper requests the schema of the data source using an OGSA-DAI service
The service replies with the source schema encoded in XML
The AutoMed wrapper creates the corresponding schema in the AutoMed repository
[Figure: the AutoMed wrapper sends a schema request to the OGSA-DAI Service over the DB, receives an XML response, and creates the AutoMed Schema]
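The three steps on this slide can be sketched as follows. The XML layout below is invented purely for illustration (the slide does not show the format of the OGSA-DAI response), and the function simply lists construct names in the `<<table>>` / `<<table,column>>` style used elsewhere in these slides instead of writing to a real AutoMed repository.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical XML-encoded source schema; the real response format may differ.
SAMPLE_RESPONSE = """
<schema>
  <table name="protein">
    <column name="id"/>
    <column name="sequence"/>
  </table>
</schema>
"""


def constructs_from_xml(xml_text: str) -> List[str]:
    """Turn an XML-encoded relational schema into a list of construct names."""
    root = ET.fromstring(xml_text.strip())
    constructs: List[str] = []
    for table in root.findall("table"):
        table_name = table.get("name")
        constructs.append(f"<<{table_name}>>")
        for column in table.findall("column"):
            constructs.append(f"<<{table_name},{column.get('name')}>>")
    return constructs
```

A real wrapper would then store these constructs as a schema in the repository; that step is omitted here because it depends on the toolkit's own API.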
Using AutoMed in the BioMap Project
Relational/XML data sources containing protein sequence, structure, function and pathway data; gene expression data; other experimental data
Wrapping of data sources
Translation of source and global schemas into AutoMed’s XML schema
Domain expert provides matchings between constructs in source and global schemas
Automatic schema restructuring, with automatic generation of schema transformation pathways
See DILS’05 paper for more details
[Figure: BioMap integration. Labelled components: RDB, XML File and RDB data sources; RDB Wrapper (×2) and XML Wrapper; AutoMed Relational Schema (×2) and AutoMed XMLDSS Schema; Transformation pathway (×3); AutoMed Integrated Schema; Integrated Database Wrapper; Integrated Database]
Ongoing and future research
Using the BAV approach for data integration in Grid and P2P environments
The integration may be virtual, materialised or hybrid
P2P query processing over BAV pathways
P2P update processing over BAV pathways
Use of ECA rules and a P2P ECA rule execution engine
Optimisation of ECA rules on semi-structured data