© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 1
Exploring the Cloud of Research Information Systematically
Keith G Jeffery
Director, IT & International Strategy
CCLRC
Anne G S Asserson
Research Department
University of Bergen
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 2
Structure
• The Problem, Proposition, Requirement
• Past Work, Analysis, Conclusion
• Solution
• Additional Aspects
• Future Work
• Conclusion
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 3
university
Funding agency
Research chemist
doctor
synchrotron PhD thesis
Journal paper
presentation
conference
industry
entrepreneur
Classification scheme protein
DNA
The universe of information of relevance
What I want
‘magic’ system
The Problem
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 4
The Problem: The Universe of Relevant Information
• Unstructured, semistructured– Policy papers– Research proposals (part)– Publications
• Structured– Research proposals (part)– Research information records
(commonly metadata)– Research datasets
• Heterogeneity– Character set– Language– Syntax– Semantics
Highly relevant
relevant
Partially relevant
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 5
RelationshipUsual Relation
person
OrgUnit
PERSON
OrgUnit
ORGUNIT
PK PKFK
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 6
Project
OrgUnitPerson
Structured Approach
Linking relations expressing semantics
Pr-Pe Pr-OU
Pe-Pe
Pe-OU
OU-OU
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 7
The research project aims to discover the particular properties of the umouzo worm that cause it to provide local tequila with
its characteristic taste. The method will involve chemical analysis of the worm at various stages of its lifecycle related
to the tequila: its early life in its usual forest habitat, its middle life when collected and held in artificial conditions as stock
before it is placed into the bottle of tequila and finally within thetequila environment at various stages of maturity of the tequila. The research will be conducted by Professor Quoaxocoatl andhis Analytical Biochemistry team, belonging to the Departments
of Chemistry and Biology at the University of Chicken Itza, Mexico using the latest spectroscopic analysis techniques. The
beneficiaries of the research will be the Chicken Itza Tequila company which expect to improve the flavour by understanding
the chemical changes in the worms. The research will be conducted following the ethical code for experimentation on
animals. The funding requested is 1,000,000 US Dollars over 3 years.
Research Text: un- or semi-structured text
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 8
The research project aims to discover the particular properties of the umouzo worm that cause it to provide local tequila with
its characteristic taste. The method will involve chemical analysis of the worm at various stages of its lifecycle related
to the tequila: its early life in its usual forest habitat, its middle life when collected and held in artificial conditions as stock
before it is placed into the bottle of tequila and finally within thetequila environment at various stages of maturity of the tequila. The research will be conducted by Professor Quoaxocoatl andhis Analytical Biochemistry team, belonging to the Departments
of Chemistry and Biology at the University of Chicken Itza, Mexico using the latest spectroscopic analysis techniques. The
beneficiaries of the research will be the Chicken Itza Tequila company which expect to improve the flavour by understanding
the chemical changes in the worms. The research will be conducted following the ethical code for experimentation on animals. The funding requested is 1,000,000 US Dollars
over 3 years.
Research Text: finding / classifying syntax
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 9
The research <project> aims to discover the particular properties of the umouzo worm that cause it to provide local tequila with its characteristic taste. </project> <abstract> the method will involve chemical analysis of the worm at various stages of its lifecycle related to the tequila: its early life in its usual forest habitat, its middle life when collected and held in artificial conditions as stock before it is placed into the bottle of tequila and finally within the tequila environment at various stages of maturity of the tequila. </abstract><person> The research will be <relationship> conducted by </relationship> Professor Quoaxocoatl </person> And <relationship> his </relationship> <orgunit> Analytical Biochemistry team </orggunit> <relationship> belonging to the </relationship> <orgunit> Departments of Chemistry </orgunit> and <orgunit> Biology </orgunit> <relationship> at the </relationship> <orgunit> University of Chicken Itza, Mexico </orgunit><abstract> using the latest spectroscopic analysis techniques. </abstract> The <relationship> beneficiaries </relationship> of the research will be the <orgunit> Chicken Itza Tequila company </orgunit>which expect to <abstract> improve the flavour by understanding the chemical changes in the worms. </abstract> The research will be conducted <condition> following the ethical code for experimentation on Animals. </condition> The funding requested is <funding> 1,000,000 US Dollars </funding>over <duration> 3 years. </duration>
Parsing to a defined schema for research information
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 10
Problem: How to build the ‘magic system’
• (1) how to assure quality data – so that results are accurate;
• (2) how to assist the end-user in formulating correctly the query – to obtain the expected
results; • (3) how to structure the
data to obtain the optimal response – in terms of performance,
recall and relevance including taming heterogeneity.
What I want
‘magic’ system
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 11
Proposition
• 3 questions are related
• Solution based on– Structured data (and using it as metadata)– Formal logic (reproducible)– Knowledge engineering techniques
• Exposed to end user with– User-friendly assistant– Graphic metaphors
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 12
Requirement• The end-user wants a
homogeneous response to a query – which may involve
functional processing in addition to a simple retrieval.
– response in a reasonable time
– with all relevant information (recall)
– some indication of relevance (how closely he answer matches the query).
• Example:
• Total amount of funding by year by university spent on biomedical research
• For last 10 years by midday today
• Ensuring all universities are included
• With computed distance score from ‘biomedical’ in classification scheme
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 13
Requirement: Data Quality• Real world Decision-
making based on the computer system
• Decision-making quality depends critically on the data quality
• end-user presented with graphical representations of statistically-reduced or model-enhanced data
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 14
Requirement: User Query
• How does an end-user express this in a structured query
• or find the (relevant, complete) information via Google?
– Are the years calendar or financial?
– Are the patents measured by number or value of licences granted?
– Only publications in peer-reviewed journals?
– Universities in which countries
“compare the research performance of my university against others across a range of metrics (products, patents, publications) over years”
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 15
Requirement: Data Structure
• Timeliness – when required, usually
immediately
• Relevance – Of results to query
• Recall – Completeness of results
• All require structured data• Or structured metadata
describing semi-structured or unstructured data
• What is the SQL for I want it now?
• How to match complex logic of query (with processing) to result set
• How to know what is in the ‘universe’
• For precise matching / measurement / calculation
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 16
Structure
• The Problem, Proposition, Requirement
• Past Work, Analysis, Conclusion
• Solution
• Additional Aspects
• Future Work
• Conclusion
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 17
Related Work
• Generally relevant– Information management, user interfaces,
performance….. • Copious
– Related directly to CRIS • limited and mainly in CRIS conferences
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 18
Related Work - analysis• integrating heterogeneous
distributed databases of CRIS information including
– open access institutional repositories
– research datasets
– represented by structured metadata
• into a homogeneous canonical form;
• harvesting semistructured information from the web to create a structured metadata index – with limited information in a
structured form– May be only term and URL
• which points to the original semistructured data in its native form;
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 19
Related Work: Conclusions• unstructured or semistructured data is valuable only when
indexed by structured metadata • satisfactory responses to end-user queries only produced
when the data are structured, or indexed by structured metadata;
• satisfactory response to end-user queries over heterogeneous distributed data requires knowledge-based techniques (And KB techniques rely on structured data)
• semistructured harvesting/browsing techniques require much human effort and so do not scale
• Structured data-based techniques allow for further processing, automating and scaling but more R&D is needed to make it work
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 20
Related Work: Conclusion
• Semistructured data indexed
• Satisfactory user query• Basis for advanced KB
techniques• Scalability• Further processing
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 6
Project
OrgUnitPerson
Structured Approach
Linking relations expressing semantics
Pr-Pe Pr-OU
Pe-Pe
Pe-OU
OU-OU
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 21
CERIF-CRIS
Publtext
Scidata
ProjDesc
CVOrg
Desc
Publtext
Scidata
OrgDescCV
ProjDesc
Much human effort in selecting and integrating
Harvesting / browsing
Human effort spent on decision-making
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 22
Structure
• The Problem, Proposition, Requirement
• Past Work, Analysis, Conclusion
• Solution
• Additional Aspects
• Future Work
• Conclusion
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 23
Solution: Quality Structured Information: Data
• Data quality can only be assured if the input or edit of data attribute values is validated and – if necessary – supported with explanation of valid values
• To achieve this requires structured data i.e. data arranged as attribute values in a structure to form information
• validation techniques applicable to each data attribute and relationships between attributes [GoGlJe93].
• validation relies on first order logic and Boolean algebra. This demands structured data (information).
• For CRIS: CERIF – data structured formally as information– Attribute values in controlled vocabularies (domain ontologies)
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 24
Solution: Quality Structured Information: Metadata
• CERIF can act as metadata to other structured, semistructured and unstructured information
• Examples: OA IRs and research datasets and software. Being extended to financial, project and personnel information
• ensures such associated datasets are validated, retrieved and interpreted in a structured and logical context.
Repository
CERIFas metadata
Object e.g. document
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 25
Solution: Expert Advisor and Query Assistant
• Much past work on this area. Process is:– Interact with user to discover not ‘what they say’ but ‘what
they mean’
– Enhance the query appropriately for relevance, recall, performance
– Replay (in natural language) query to user for confirmation
– Execute query including taming heterogeneity
– Provide user with answer integrated (automatically according to user preference) in preferred
• Character set, Language, Syntax, Semantics, Media / Mode
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 26
Solution: Homogeneous View through Knowledge-Based Information Integration• Several known techniques for providing a homogeneous
information (syntactic) & knowledge (semantic) view over heterogeneous data
– http://epubs.cclrc.ac.uk/work-details?w=33728
• May be a stored view (data warehouse)– Problem of currency
• Or on-demand view– Problem of performance
• In any case requires– Schema matching– Query rewriting and distribution– Answer conversion and integration
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 27
Solution: Explanation
• When user receives answer– Even if translated to canonical (CERIF) form
• It may not be understandable so need explanation• This is done using knowledge-based techniques
– The ‘reverse’ of query assistance / improvement
– Using domain ontologies to associate additional information to the answer to explain
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 28
Structure
• The Problem, Proposition, Requirement
• Past Work, Analysis, Conclusion
• Solution
• Additional Aspects
• Future Work
• Conclusion
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 29
Analysis, Modelling, Visualisation
• Decision making today is complex– Multiple parameters– Huge volumes of data, sometimes structured as
information– Information of varying reliability
• So typically the data is processed to represent in a succinct fashion (eg graph, map, video)
• And computer models are used to produce predicted results for comparison with the recorded data
• This cannot be done without structured information which is somehow made consistent
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 30
Push Technology
• Push technology relies on a standard query which is executed when triggered by a (relevant) change in the observed database or an externally defined condition
• Therefore it requires structured data (or structured metadata representing un- or semi-structured data)
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 31
Now, recall the example query
• “compare the research performance of my university against others across a range of metrics (products, patents, publications) over years”
• And let us imagine how it would be handled by the proposed system environment
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 32
Utilising the Solution: The Researcher
• The researcher would find – an easy-to-use assisted interface;
– behind which integration of information from multiple sources would be performed automatically,
– even bringing into a homogeneous context semistructured information (notably publications and research datasets with associated software) via the structured metadata.
• Thus the cloud of research information becomes a well-formed set of quality information.
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 33
Utilising the Solution: The Research Manager
• The research manager in a funding organisation would find – an easy-to-use assisted interface with the required information
from multiple heterogeneous sources integrated. – The interface provides appropriate functions for comparing the
funding; the explanation engine provides background information on the way in which funding is calculated in each country.
– quality information structured in context and with appropriate explanation to assist in interpretation.
– interface assistance to formulate the query to ensure the correct year(s) and departments are selected and the count of publications is done correctly according to the criteria (e.g. against a list of acceptable peer-reviewed publication channels).
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 34
Utilising the Solution: The Innovative Entrepreneur
• The innovative entrepreneur utilises the advanced interface – to formulate a homogeneous query over the
heterogeneous national sources of information. – The query is improved by the inference engine and
domain ontology to overcome the fact that terminology differs from country to country and she wishes to compare like with like in terms of patents, products and their generated wealth through licences or sales.
– More complex is to compare track records in wealth creation of research groups; a graphical representation of value against time is used with annotation to explain the basis of the reasoning leading to the inclusion or exclusion of certain research outputs.
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 35
Structure
• The Problem, Proposition, Requirement
• Past Work, Analysis, Conclusion
• Solution
• Additional Aspects
• Future Work
• Conclusion
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 36
Future Work
• As indicated more R&D required:– Faster and more effective schema integration– generation of software to then automatically
manage the distributed heterogeneous queries and the answer homogenisation
– Faster and more effective tools for building and maintaining domain ontologies (including consistency checking)
– Tools for managing multiple languages and character sets
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 37
Structure
• The Problem, Proposition, Requirement
• Past Work, Analysis, Conclusion
• Solution
• Additional Aspects
• Future Work
• Conclusion
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 38
Conclusion
Keith G JefferyDirector, IT & International [email protected]
Anne G S AssersonResearch DepartmentUniversity of Bergen
The widespread use of web browsers gives the impression that information for decision makers is readily available.
It is, but is of varying quality and requires heavy human involvement to use it. This does not scale.
Structured data, or metadata representing un- or semi-structured data is the only reliable foundation.
Upon this, advanced CRIS systems for decision-making can be built.
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 39
© Keith G Jeffery, Anne G S Asserson IWIRCRIS Copenhagen 2006 20061109 40
Entity represented by a relation or table
One instance
Oneattribute
Another Entity
relationship
Structured Approach
Top Related