1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of...

34
1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance of Normalization For the Data Search SiCOP Conference 6 February 2007

Transcript of 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of...

Page 1: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

1

Lola M. OlsenGlobal Change Master Directory

NASA’s Goddard Space Flight Center

The Value of Controlled Vocabularies Beyond the DRM 2.0:The Importance of Normalization

For the Data Search

SiCOP Conference

6 February 2007

Page 2: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

2

Presentation Guide

The Global Change Master Directory (GCMD) and the DRM 2.0 Vision and Mission Data Context, Data Descriptions, and Data Sharing Content and Usage Value (most appreciated aspects expressed from users)

Design Evolution MD2 - MD9.7

The Data Reference Model 3.0

Presentation Guide

Page 3: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

3

Strategic Vision:To serve as a trusted source for Earth (and space)

science metadata and related services.To contribute to scientific discovery.

Mission: To provide for the creation of “unique”, high quality

data, services, and ancillary descriptions of data. To design enabling authoring tools (with ability to tag entries)

and robust scientific search software. [The ability to “tag” data at the time of writing is key. Tagging is more difficult when this ability is not integrated through tools and is less likely to be accurate and normalized.] Best to gather information when the “getting is good.”}

To assist the global community in the discovery of the scientific resources within the directory.

The GCMD Vision and Mission

Page 4: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

4

Presentation GuideDoes the GCMD Follow the DRM 2.0?

Page 5: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

5

Presentation GuideData Context

Facilitates discovery of data through an approach to the categorization of data according to taxonomies. Enables the definition of authoritative data assets within the COI, (using unique identifiers).Provides linkages to data described, thereby managing the ‘info glut’, through:

Open API links, such as OPeNDAP (an open source framework that simplifies aspects of science data networking.) Related_URL “controlled keyword” links to data. New “use” metadata associated with detailed variables within the data sets.~ 20 Petabytes of data represented through the GCMD.

Page 6: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

6

0

50

100

150

200

250

AgricultureAtmosphereBiosphere

Biological Classification

Climate Indicators

Cryosphere

Human Dimensions

Land Surface

Oceans

PaleoclimateSolid Earth

Spectral/EngineeringSun-Earth InteractionsTerrestrial Hydropshere

Number of Science Keywords by Topic

Page 7: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

7

0

2

4

6

8

10

12

14

16

Environmental Advisories

Reference and Information ServicesData Management/Data Handling

Education/Outreach

Models

Data Analysis and Visualization

Hazards Management

Number of Services Keywords by Topic

Page 8: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

8

0

500

1000

1500

2000

2500

3000

Data Centers

Projects

InstrumentsLocations Platforms

Ancillary Keywords

0

500

1000

1500

2000

2500

3000

3500

Ancillary Science Science (new) Services

Coming Soon: Orbit Types, Spectral/Frequency Domain,

Launch Sites and 4 Level Taxonomy for Models.

Page 9: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

9

Presentation GuideData Description: “How do we understand what data are

available?”Provides a means to uniformly describe data - thereby supporting its discovery, harmonization, categorization, sharing, and rapid coordination/ communication.

GCMD uses the DIF “standard”. There are many advantages.

Descriptions must be identified UNIQUELY.

Page 10: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

10

1995 - Major steps in evolution through modification to a multilevel Earth science hierarchy:Category > Topic > Term > Variable > Detailed VariableTwo important trends were emerging that would affect evolution:

FGDC and concept of “metadata” for geospatial and other data initiated. Web taking shape.

1997 - DIF evolves from 23 to 34 fields Compatible with mandated FGDC and Dublin Core. Era of metadata initiated. Other “standards” emerging: ANZLIC Web expanding: Search interfaces abound; GCMD ready for this revolution.

1999 - DIF evolves to 35 fields in MD7. [3 added; 2 deleted] DIF creation date and revision history added. New field for paleoclimate data: paleo-temporal coverage. Personnel subfields modified. FGDC mandated, but DIF compatible with all required fields, serving users with added

benefits of unique ID. Conversion tools available: FGDC=><=DIF2002 - DIF acquires new sibling: the SERF, allowing cross linkages between services & data.

Redesign of query language; XML syntax; separation of presentation from business/application logic, with unexpected gifts: SOA architecture; querying multiple data sources for spatial, temporal, RDF and RDBMS databases, full-text; Struts facilitated creation of customized portals.

LDA experiment

2004 - MD9 ISO 19115 compliance and evolves to 36 fields: 3 New fields added: new address; 2 data resolution subfields

The Evolving DIF **International Interoperability Forum functions at the international level through CEOS.

Page 11: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

11

1634

5321 5393

253

1630

3077

2493

3878

4831

985

1805

241

1954

0

1000

2000

3000

4000

5000

6000

AGRICULTURE-9.5%ATMOSPHERE-30.8%BIOSPHERE-31.2%CLIMATE INDICATORS-1.5%CRYOSPHERE-9.4%HUMAN DIMENSIONS-17.8%HYDROSPHERE-14.4%LAND SURFACE-22.5%OCEANS-28.0%PALEOCLIMATE-5.7%SPECTRAL/ENG.-10.5%SUN-EARTH INTERACTIONS-1.4%SOLID EARTH-11.3%

Data Set (DIF) Population by Topic

Page 12: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

12

0

100

200

300

400

500

600

700

800

Data Analysis & Visualization

Data Management/Data Handling

Education/OutreachEnvironmental Advisory

Hazards Management

Reference and Information Services

Metadata Handling

Models

# SERFs

Data Services (SERF) Population by Topic

Page 13: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

13

Presentation GuideData SharingSupports the access to data - enabled by capabilities provided by both the Data Context and Data Description standardization areas through:

1. Ad-hoc requests (such as a query of a data asset) -

an OpenAPI supports ad hoc requests. Example: OPeNDAP.

2. Exchange of data (such as those that consist of fixed, reoccurring transactions among parties):

Examples:GeoConnections (Canada)OAI with NCAR and NOAA Data centers that use docBuilder tools to

submitmetadata descriptions.

Page 14: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

14

JCADM/AMD Collaborations: 18 NADC’s

Page 15: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

15

Page 16: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

16

0

1000000

2000000

3000000

4000000

5000000

6000000

7000000

8000000

Jan-01May-01Sep-01Jan-02May-02Sep-02Jan-03May-03Sep-03Jan-04May-04Sep-04Jan-05May-05Sep-05Jan-06May-06Sep-06

month

#hits

GCMD Hits Recorded Jan 2001-Dec 2006

Page 17: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

17

Maintenance reduction Improving the Discovery of and Access to Data and Services

Ease of use, such as web site navigation. Accuracy of Results

Content Requirements Quality control Integration with metadata authoring tools that allow real-time

updating by data set holders/producers. Integrated keyword and free-text search, with both as

“refinements”. Bidirectional linkages between data sets and data set services. Providing virtual subsets of the directory Standards: ISO 19115/19139; OpenGIS; XML; RDF. NASA needs Science User Working Group Recommendations; user and partner requests. Evolving coding languages and databases [e.g., C to Perl to Java].

Evolution: Project Development Drivers

Page 18: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

18

MD710/99

MD504/97

020100999897969594 03 04

MD210/94

MD404/96

MD604/98

MD8 OPS06/01

MD808/03

DIFs

SERFs 10,000 5,000

200

MD History

5/04

MD7

MD5

• Science keyword hierarchy• FGDC Compatibility• Isite free-text search

MD2 500

2,000

Features

10/96

MD4

• DIFmorph for translatingbetween FGDC and DIF

• PC-based DIF Writing Tool• Transitioned space science

DIFs to NSSDC• First web client distributed DIFWEB tools

• X-Windows client

• JAM client

• First use of Oracle

• DIFmacs Authoring Tool

MD

12/00

05/00

08/00

MD7

• Switched code base from C to Perl• Conference Calendar• Personnel "Role" field• Paleo_Temporal Coverage 1st request in FGDC MD8 DIFbuilder tools

09/01

07/02

MD8 OPS• Switched code base from Perl to Java

• XML syntax for metadata

• OPS for managing metadata Services Prototype Launched First time the coordinators were able

to load their own DIFs/SERFs docBUILDER tools

11/01

• Upgraded Isite free-text• JAVA Applet for geospatial search

and "Advanced" search interface • First web page to use Science

keyword Topics to search• Parent/Child display • Related_URL field added• HCIL and Matrix interface

MD6

Page 19: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

19

06/04

MD8

MD9.3

MD9.1- 9.2

05 07

02/05

DIFs

SERFs 500

1,200 1,000 15,000

03 04 06

MD9.5MD9.6

17,000

03/05

02/06 07/0

6

08/03

• New Home Page

• Portals

• DTD for DIF and SERF

• Open API

MD8

• Struts • Compatible with ISO 19115 metadata standard• Geographic coverage map added to record

MD9.1- 9.2

• Lucene Search engine• Search term highlighted in records• Refinement search by keywords or full text search• User Comment form

MD9.3• Spatial search with google map• Refinement option by data resolution for NASA portals • Support foreign characters record display • Subscription service for science keywords docBUILDER tools available for public

MD9.5

• Relative Temporal coverage added to accommodate data pools• Added two level hierarchy for Related_URL (e.g. support Get Data)

MD9.6

Features

MD9.4 08/05

• Location and data center hierarchy• Increased number of characters for fields• Spatial and temporal resolution range keywords • docBUILDER tool personalized templates

MD9.4

MD History (MD 8 and beyond)

Page 20: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

20

The Hype: Distributed Systems >Check if application is appropriate for

needs.>Determine its ROI. >Offered LDA, as “Local Database Agents”

- not “Latent Dirichlet Allocation”. >Be vigilant for change.>Know when to cut losses.>Scope the future.

The Out-In-The-Wilderness Request: Example: Offline Authoring Tool>Check longevity to assure usefulness when development complete.>Determine the ROI in advance.>Know when to cut your losses.

Following the “Hype” or Listening Too Intently to the “Wilderness Request”.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Page 21: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

21

“The Web doesn’t have a single, comprehensive clearinghouse where you can find all of the data and domains of knowledge covering all geographies …. Instead there are hundreds of …

“Very few geospatial information scientists are working on the challenge beyond the GCMD (Global Change Master Directory), whose database holds more than 15,000 [actually this number is 17,300 +] descriptions of data sets and services covering all aspects of earth and environmental sciences.”

Finding OurselvesLiebhold (May 2005)

O’Reilly Network

Page 22: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

22

Controlled Keywords (& definitions) to reference and retrieve a record or sets of records. Authoring Tools with Update Capability. {Heavy use of controlled

keywords.} Keyword & Full Text Search to Data and Services with ordered “Result

Set”. (No need to build a client to query, although the option to do so is available through Open API.)

Customized Portals - virtual subsets of the directory, created through use of controlled keywords.

The “Get Data” feature, which takes the user directly to the data. Unique data set and services entries. Easy compliance to related standards through XML. Results available through Google. Well-designed home page, with access to full set of services provided.

Internal View of GCMD Value (2007)

Page 23: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

23

MD Software Version 9.7

Support for 2 additional levels of Science Keyword hierarchy.Improved Features for docBUILDER Authoring Tools

Support for writing Platform and Instrument descriptions & new keywordsSupport for “GET DATA” tab. “Text Only” display for 508 compliance. (in docBuilder)Improved multimedia sample.Improved spatial coverage selection.Ability to change entry identifier.Reference Guide for use of international characters and symbols.

RSS Feed, in addition to Keyword Subscription Service, to signal new directory entries.Upgrades to Java 1.5 and Tomcat 5.Location Keywords & “Chronostratigraphic Units” recreated as true taxonomies.

Page 24: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

24

Keyword Functionality Upgrade. Functionality “abstracted” to use a SKOS data model for navigating

arbitrary taxonomies. Integrated SKOS query into query language. Backed by Berkeley DB XML for querying. Example: [skos:Parameters=‘EARTH SCIENCE|ATMOSPHERE’] AND

[skos:Instrument=‘AVHRR’] New Platform/Instrument Display Reflects Taxonomic Changes.

Support for loading, extracting, querying. Support for navigating through new taxonomies. Support for full text search. Support for creating these descriptions in docBUILDER.

MD Software Version 9.7

Page 25: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

25

SKOS Application

Page 26: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

26

Page 27: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

27

Page 28: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

28

Page 29: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

29

Page 1of SERF

Page 30: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

30

Page 2of SERF

Page 31: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

31

Page 1of DIF

Page 32: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

32

Page 2of DIF

Page 33: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

33

docBUILDER Enhancements Option for public vs private view. Automated reminders to metadata authors. Initial testbed for multilingual capabilities using SKOS. Variable Keyword extensions for “use” metadata. Client to ECHO for metadata sharing using web services.

MD Software Version 9.8

Page 34: 1 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance.

34

The Data Reference Model 3.0, Web 3.0 & SOAs

Data Resource Awareness Agent

Data & Information & Knowledge Repository

staticdynamic

Figure 3-1 DRM standardization Areas

Language Logic