
CCCT-04 1

Semantic Extensions to Domain-Specific Markup Languages

Aparna Varde, Elke Rundensteiner, Murali Mani, Mohammed Maniruzzaman and Richard D. Sisson Jr.

Worcester Polytechnic Institute (WPI)

Worcester, Massachusetts, USA

CCCT-04 2

Introduction

• XML, the eXtensible Markup Language: Widespread standard in storing and publishing data.

• Domain-specific markup languages designed with XML tag sets.

• Standardization bodies extend these to include additional semantics.

• Aspects such as domain knowledge and XML constraints are important.

• Focus of Paper: Generic issues in extending markup languages.

CCCT-04 3

Domain-specific markup language

• Medium of communication for potential users of the domain.

• Users: industries, consumers, universities, research organizations, publishers etc.

• Follows XML syntax.

• Encompasses the semantics of the domain.

• Examples

– MML: Medical Markup Language

– MatML: Materials Science Markup Language

[Figure: Markup Language linked to its users: Industries, Consumers, Universities, Research Organizations, Publishers]

CCCT-04 4

MML: Medical Markup Language

• Creates standards for medical data to be stored and accessed worldwide.

• MML module contents, e.g., “basic clinic information”, “surgery record information”.

• Used by primary care physicians, general surgeons etc.

• Specific information in sub-areas such as “ophthalmology” cannot be stored with these modules.

• Thus there is a need for more semantics in MML.

CCCT-04 5

Motivation for extension to markup languages

• Just as ophthalmology is a specific sub-area of the medical domain, other domains have their own specifics.

• Why not define a new markup language for each aspect?

– Typically, the generic language holds basic information that needs cross-referencing, e.g., basic surgical details in ophthalmology.

– Common information should not be stored twice.

• Advisable to extend existing markup language with additional semantics.

CCCT-04 6

Extending the Materials Science Markup Language, MatML

• MatML: Materials Science Markup Language.

• XML for materials property data.

• Heat Treating: controlled heating and cooling of materials to achieve desired mechanical and thermal properties.

• Need to include semantics of Heat Treating in MatML.

• At WPI, Heat Treating extension to MatML is proposed.

• Several issues, both domain-specific and XML-related, are crucial here.

<MatML_doc>
  <Material>
    <BulkDetails>
      ……
    </BulkDetails>
    <ComponentDetails>
      ……
    </ComponentDetails>
    ……
  </Material>
</MatML_doc>

CCCT-04 7

General issues in extending any markup language

• Steps essential in markup language extension.

• Desired language features.

• XML schema constraints.

• Retrieval using XQuery.

CCCT-04 8

Steps essential in markup language extension

1. Understand domain semantics.

2. Model the data.

3. Conduct interviews.

4. Define the ontology.

5. Reiterate the ontology.

6. Outline the initial schema.

7. Revise the schema based on critical reviews.

CCCT-04 9

1. Understand domain semantics

• Acquire domain knowledge: terminology, processes, entities etc.

• This helps determine essential tags to store data in the domain.

• Study existing markup language in detail.

• This is to understand where exactly it needs extension.

CCCT-04 10

2. Model the data

• Build data model after studying domain.

• Use techniques such as Entity-Relationship diagrams.

• Thus represent domain entities, their properties and relationships.

Subset of E-R Diagram for Heat Treating

CCCT-04 11

3. Conduct interviews

• Needs of potential users are important.

• This helps determine entities and attributes in extension.

• Users: industries, universities, research organizations, publishers etc.

• Domain experts can identify needs of users.

• Hence, interview the domain experts.

CCCT-04 12

4. Define the ontology

• Ontology serves as established lingo for the domain.

• Hence defining ontology is important to proceed with design.

• Issues

– Synonyms: two or more words with same meaning, e.g., in financial domain, “salary” and “income”.

– Homographs: one word with multiple meanings, e.g., “share” in financial domain could refer to “sharing of assets” or “shares in the stock market”.

• Clarify such terms with reference to context through ontology.

CCCT-04 13

5. Reiterate the ontology

• Once ontology established, useful to have another round of discussions with experts.

• Additional discussions with domain experts may lead to further clarifications.

– Example: remove existing entities, create new ones, based on terminology.

• Accordingly ontology needs to be altered.

• Use this ontology for schema design.

High-level ontology for Heat Treating

CCCT-04 14

6. Outline the initial schema

• Schema provides structure, i.e., defines grammar for the markup language.

• Once data model and ontology are approved by domain experts, outline the initial schema.

• Adhere to the syntax of the original markup language, so that the new schema can be accommodated as an extension.

Partial snapshot of schema for Heat Treating extension to MatML.

CCCT-04 15

7. Revise the schema based on critical reviews

• Initial schema serves as medium of communication between designers and users.

• This is subject to further changes until domain experts are satisfied.

• Schema revision may involve several iterations.

• Some of these include discussions with standards bodies.

• For proposed extension to be accepted as worldwide standard, it must be approved by experts & standards bodies.

CCCT-04 16

Desired language features

1. Avoid redundancy.

2. Make information non-ambiguous.

3. Provide easy interpretability of data.

4. Capture domain constraints in the schema.

CCCT-04 17

1. Avoid redundancy

• Markup language extension should be such that duplication of storage is avoided.

• Data stored in the original markup language should be cross-referenced in the extension.

• Example

– In medical domain, there should be cross-referencing between “basic clinic information” in the original language and “ophthalmological details” in the extension.

• Schema should be structured accordingly.
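One possible way to express such a cross-reference in XML Schema is with the built-in “xsd:ID” and “xsd:IDREF” types, which link elements within a single document. The slides do not say which mechanism the authors use, so this is only a sketch under that assumption. The element and attribute names (“BasicClinicInformation”, “OphthalmologicalDetails”, “clinicInfoRef”, “PatientName”, “Prescription”) are hypothetical and are not taken from MML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: all names are hypothetical, not taken from MML. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Stored once, in the original language. -->
  <xsd:element name="BasicClinicInformation">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="PatientName" type="xsd:string"/>
      </xsd:sequence>
      <xsd:attribute name="id" type="xsd:ID" use="required"/>
    </xsd:complexType>
  </xsd:element>

  <!-- The extension points back to the basic record instead of copying it. -->
  <xsd:element name="OphthalmologicalDetails">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Prescription" type="xsd:string"/>
      </xsd:sequence>
      <xsd:attribute name="clinicInfoRef" type="xsd:IDREF" use="required"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

With this layout the basic clinic data is stored only once, and the extension carries just a reference to it.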

CCCT-04 18

2. Make information non-ambiguous

• Domain terminology, its semantics, aspects such as synonyms / homographs are significant.

• The schema design should adhere to the ontology to avoid ambiguity.

• Annotations should be included within the schema to enhance clarity.

• Example:

– For spectacle prescriptions in ophthalmology, include meanings of terms “myope” and “hypermetrope” in schema as annotations.
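As a rough illustration, annotations can be attached with the standard “xsd:annotation” and “xsd:documentation” elements. The element name “RefractiveError” and the use of an enumeration are assumptions made for this sketch, not part of MML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: "RefractiveError" is a hypothetical element name. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="RefractiveError">
    <xsd:annotation>
      <xsd:documentation>
        myope: a near-sighted person.
        hypermetrope: a far-sighted person.
      </xsd:documentation>
    </xsd:annotation>
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="myope"/>
        <xsd:enumeration value="hypermetrope"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
```

The annotation does not change what the schema accepts. It only documents the terms for readers of the schema.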

CCCT-04 19

3. Provide easy interpretability of data

• Data is stored using markup language tags.

• Readers should be able to interpret this data without much reference to the literature.

• Thus the schema design should be organized accordingly.

• Example:

– In science and engineering domains, experimental conditions should be stored close to results to enhance readability.

CCCT-04 20

4. Capture domain constraints in the schema

• Certain requirements imposed by the domain need to be captured in schema.

• Done through XML constraints feature.

• Some constraints

– Primary key: To uniquely identify an entity.

– Choice: To declare mutually exclusive elements.

• Example: In financial domain, a person could be either “insolvent” (bankrupt) or “asset-holder” but not both.

CCCT-04 21

XML schema constraints

1. Sequence constraint.

2. Disjunction constraint.

3. Key constraint.

4. Occurrence constraint.

CCCT-04 22

1. Sequence constraint

• To declare a list of elements in order.

• Enclose elements in <xsd:sequence> tags.

• Example:

– In Heat Treating extension, element “QuenchConditions” must occur before “Results”.
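The slide's schema snapshot is not reproduced in this transcript, so the following is a minimal sketch of how this ordering might look. The element names “QuenchConditions” and “Results” come from the slide. The parent element “Quenching” and the “xsd:string” types are placeholders assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: "Quenching" and the xsd:string types are assumed placeholders. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Quenching">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Children must appear in exactly this order. -->
        <xsd:element name="QuenchConditions" type="xsd:string"/>
        <xsd:element name="Results" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A document that places “Results” before “QuenchConditions” would fail validation against this sketch.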

CCCT-04 23

2. Disjunction constraint

• To declare mutually exclusive elements, i.e., only one of them can exist.

• Enclose elements in <xsd:choice> tags.

• Example:

– In Heat Treating, a part can be made by “Casting” OR “Powder Metallurgy”, not both.
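A minimal sketch of this disjunction, assuming a hypothetical parent element “Part” and placeholder “xsd:string” types. The element name “PowderMetallurgy” is written without a space because XML element names cannot contain spaces.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: "Part" and the xsd:string types are assumed placeholders. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Part">
    <xsd:complexType>
      <xsd:choice>
        <!-- Exactly one of these two children may appear. -->
        <xsd:element name="Casting" type="xsd:string"/>
        <xsd:element name="PowderMetallurgy" type="xsd:string"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```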

CCCT-04 24

3. Key constraint

• To declare an attribute to be a primary key, i.e., it must be unique and non-null.

• Indicate the attribute as type “xsd:ID” and its use as “required”.

• Example:

– In Heat Treating, the name of the cooling medium (quenchant) is crucial, and is therefore declared as a key, because the purpose of the experiments is to categorize the quenchants.
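The declaration described above might look roughly as follows. The element name “Quenchant”, the attribute name “name” and the optional “Description” child are assumptions made for this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: element, attribute and child names are hypothetical. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Quenchant">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Description" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
      <!-- xsd:ID makes the value unique within a document; -->
      <!-- use="required" means it cannot be omitted. -->
      <xsd:attribute name="name" type="xsd:ID" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

One practical consequence of this choice is that “xsd:ID” values must be valid XML names: they cannot contain spaces or begin with a digit, so a quenchant name made of several words would need to be written as a single token.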

CCCT-04 25

4. Occurrence constraint

• To declare minimum and maximum permissible occurrences of an element.

• Indicate “minOccurs = x” and “maxOccurs = y” where “x” and “y” denote the minimum and maximum occurrences respectively.

• Value “maxOccurs = unbounded” means no upper bound on number of occurrences.

• Value “minOccurs = 0” means that element need not be stored even once.

• Example:

– In Heat Treating, Cooling Rate must be recorded at a minimum of 8 points in an experiment and there is no upper bound for it. The maximum number of graphs stored per experiment is 3 and it is not necessary that at least one graph be stored.
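The two occurrence rules in this example could be sketched as below. The bounds (8, unbounded, 0, 3) come from the slide. The element names “Experiment”, “CoolingRate” and “Graph” and their types are assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: element names and types are assumed placeholders. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Experiment">
    <xsd:complexType>
      <xsd:sequence>
        <!-- At least 8 cooling-rate points, with no upper bound. -->
        <xsd:element name="CoolingRate" type="xsd:decimal"
                     minOccurs="8" maxOccurs="unbounded"/>
        <!-- Graphs are optional, with at most 3 per experiment. -->
        <xsd:element name="Graph" type="xsd:string"
                     minOccurs="0" maxOccurs="3"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

When “minOccurs” or “maxOccurs” is left out, XML Schema assumes a value of 1 for it.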

CCCT-04 26

Retrieval using XQuery

1. Encourage users to store data in a case-sensitive manner.

2. Use tags to enhance querying efficiency.

CCCT-04 27

1. Encourage users to store data in a case-sensitive manner

• XQuery is case-sensitive

• Hence it is useful to place emphasis on case when storing data using markup language.

• This facilitates retrieval using XQuery.
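A small illustration of why case matters, using hypothetical element names and placeholder values. Under XQuery's default (codepoint) collation, the path expression //Quenchant[Name = "Water"] would be expected to select only the first entry below, since “Water” and “water” are different strings. Element names are case-sensitive as well, so a query for //quenchant would find nothing here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: hypothetical names and placeholder values. -->
<Quenchants>
  <!-- Matched by //Quenchant[Name = "Water"] -->
  <Quenchant>
    <Name>Water</Name>
  </Quenchant>
  <!-- Not matched: the stored value differs in case. -->
  <Quenchant>
    <Name>water</Name>
  </Quenchant>
</Quenchants>
```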

CCCT-04 28

2. Use tags to enhance querying efficiency

• It is possible to anticipate a typical user query in a domain.

• Thus advisable to add a level of abstraction for faster retrieval of information.

• Example:

– In Heat Treating, a user is likely to retrieve name details of quenchant without its property details.

– Hence place tags <NameDetails> and <PropertyDetails> around quenchant information.

– Thus entire path of quenchant need not be traversed for name details.

– This enhances querying efficiency.
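The grouping described above might look like this in a stored document. The tags “NameDetails” and “PropertyDetails” come from the slide. The child elements and their values are placeholders assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: child elements and values are hypothetical placeholders. -->
<Quenchant>
  <NameDetails>
    <CommonName>ExampleQuenchant</CommonName>
    <Supplier>ExampleSupplier</Supplier>
  </NameDetails>
  <PropertyDetails>
    <Property name="ExampleProperty">placeholder value</Property>
  </PropertyDetails>
</Quenchant>
```

With this structure, a path expression such as //Quenchant/NameDetails returns the name information as one unit, without touching anything under “PropertyDetails”.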

CCCT-04 29

Conclusions

• Aspects of extending domain-specific markup languages discussed here.

• These include motivation for extension, steps in extension, language features, XML constraints and retrieval considerations.

• Extension to MatML proposed at CHTE, WPI to include Heat Treating semantics.

• Paper summarizes general issues in extending domain-specific markup languages.

CCCT-04 30

Acknowledgments

• Database Systems Research Group in Department of Computer Science at WPI.

• Quenching Research Team in Department of Materials Science at WPI.

• Center for Heat Treating Excellence and its member companies.