Outline

Science, Data, You, and the Future: A variation on “The 3 Little Pigs.” Which “Little Pig” will You be?! A Presentation for “NSF Facilities Users’ Workshop: Working Together to Meet New Observational Challenges.” September 24, 2007, Boulder, CO. Raymond McCord, Oak Ridge National Laboratory*, Oak Ridge, Tennessee. *Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725.


Page 1: Outline

Science, Data, You, and the Future: A variation on “The 3 Little Pigs.” Which “Little Pig” will You be?!

A Presentation for “NSF Facilities Users’ Workshop: Working Together to Meet New Observational Challenges.”

September 24, 2007
Boulder, CO

Raymond McCord
Oak Ridge National Laboratory*
Oak Ridge, Tennessee

*Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725

Page 2: Outline

Outline

• Objectives
• Conclusions (already??)
• Storytelling
• Introduction to “the story”
• Evaluation of current and future issues
• Pathways to the future??
• An ending to “the story”
• Conclusions (again!!)

Page 3: Outline

Objectives

• To present my assessment of current scientific data management practices and issues that need to be addressed in the future.

• To be informative, provocative, and entertaining

To STOP YOU from thinking about supper!!??

Page 4: Outline

Conclusions (already??)

• We are swamped with more information than we can access*
  – *Access is a broad topic (EPDUS = ????)
• Our current practices may not be sustainable and reliable.
  – Exponential vs linear capacity increases
  – Optimization is unbalanced
• Scientific expertise within data centers will improve future data access.
• Science and data management must be integrated.
• Many solutions are NOT technological, but behavioral.
  – Think - Training
• “Data Science” training must be developed and implemented.
• The needed changes will not happen by accident.
  – “My ~30 years of experience and systems observation suggests otherwise!!”
• Sooner is better than later.

Page 5: Outline

Storytelling

• Storytelling is a VERY OLD form of “information technology” (IT)
  – Preservation and access
• When old IT meets new IT
  – Supercomputer implementation
• Just go ask!
• Excuse for Analogies
  – Engages the listener
• “The 3 Little Pigs”
  – “Once upon a time…”

Page 6: Outline

About Raymond

• Trained as a Theoretical Ecologist (landscape ecology)
  – Conducted extensive statistical analysis
• Scientific data analyst (to pay the bills)
  – Tired of rerunning analyses at the last minute to correct data management problems
• Data manager / System “whacker”
  – GIS implementation in “early PC days”
• Implementer and manager of progressively larger environmental information systems!!
  – Requires “research” outside of “science”
  – “Smell the fumes” of many scientific disciplines
  – Very few publications!!??
  – Acquired respect???

Page 7: Outline

Credits

• The concepts presented are derived from managing environmental data and information systems over the past 30 years.
• Variations of these concepts were observed from many disciplines:
  – plant community research
  – impact assessment in marine systems
  – acid rain surveys
  – environmental monitoring and cleanup projects at DOE facilities
  – land use assessment
  – climate change research (atmospheric research)
• These concepts extend to other scientific disciplines.

Page 8: Outline

Quotes from Raymond

• “Storing data is easy. Finding and using data later is NOT…”
• “Systematically and consistently organized data does not occur without cost.”
  – “The existence of ‘no cost’, well-organized data is not supported by the current situation.”
  – “Consider the results from previous science projects with ‘no cost’ for data archiving.”
• “The natural tendency over time for data and information is chaos. Effort must be exerted to overcome this.”
• “Data successfully managed by projects may not be ready to be archived (for permanent access).”

Page 9: Outline

Pop Quiz (Wake UP!!)

• What is the “access combination” to my lock?
  – Hints:
    • “I love it”
    • X=(Yz)/12) - z+1
• How is my necktie related to:
  – Data?
  – Metadata?
  – Scientists?
  – 2 year old children?

• “Why do I care?”

(Answers near the end of the presentation.)

Page 10: Outline

Story time…

Page 11: Outline

“The 3 Little Pigs”

• Characters
  – The Wolf
  – First Pig builds a house of Straw
  – Second Pig builds a house of Sticks
  – Third Pig builds a house of Bricks

• What does this have to do with “Data, Science, You, and The Future…?”

Page 12: Outline

The Wolf

• Unending appetite
• Out of control
• Bad mannered
• Too clever?
• Exponential growth in:
  – Data retention capacity and habits
  – Data re-use demands
• Significant chaos in:
  – Data automation styles
  – Data documentation
• Lack of training in
  – {ditto above!!}

Who will eat whom? Scientists or Data managers?

[Figure: two charts, “Data out” and “Data in”, plotting files and MB over time, with axis ticks from Oct-95 to Oct-06.]

Page 13: Outline

The Straw Building Pig

• Gathered more and more of “what was at hand.”
• Wanted to go back to “being a pig”.
• Metadata catalogs
• Metadata harvesting
• Layers of ontologies
• Automated “data mining”??
• Can we sort through all of the details?
• What about recommendations and priorities for use?
• When will the “straw quality” improve?
  – Changing the “masses”

Stay out of the way of the Scientists

Page 14: Outline

The Sticks Building Pig

• Did a bit more work to gather materials
• Used a bit more structure
• Did not have good specifications
• Only acted after First Pig failed
• Wanted to go back to “being a pig”. (after some effort)
• A mixture of data structures, metadata, and a few standards
• XML Links
• Automated data access
• Data warehouse
  – Business information concept
• How do we know the balance of structure, metadata, and standards?
• What is the evolutionary pathway?
• Many “Sticks” to choose from
• Can we show the improvement over “Straw”?

Work with the Scientists

Page 15: Outline

The Bricks Building Pig

• Did a lot more work to gather (AND PREPARE) materials
• Used significantly more structure
• Required working with “a plan”
• Was a braggart over First and Second Pig
• Wanted to go back to “being a pig”. (after “winning”)
• Metadata standards and more standards
• Internet does not decide (distributed vs central)
• Removes ambiguity of definitions, but contents get “boxed”.
• What about Type I errors vs. Type II errors?
• An “odds box or junk bin” will always remain.
• “Bricks” are:
  – Hard to change
  – Slow and costly to make
• CHANGE is fundamental to SCIENCE (more later!!)

Defeat or stymie the Scientists?

Page 16: Outline

Elements of Data Preservation for Future Access

• A “framework” for assessing improvements for the future

• Restricts flow like irregular plumbing

• “We want {more} Cake!!??”

Page 17: Outline

Elements of “Permanent Access”…

• “Permanent access” to scientific information requires ALL of the following:
  – Existence
  – Permission
  – Discovery
  – Understanding
  – Support
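
The “ALL of the following” requirement can be read as a logical AND over the five elements. The sketch below is one way to express that, assuming a hypothetical `AccessAssessment` checklist with one yes/no flag per element; the class, function names, and example values are illustrative and do not come from the presentation.

```python
from dataclasses import dataclass, fields


@dataclass
class AccessAssessment:
    """Hypothetical yes/no checklist for one dataset (illustrative only)."""
    existence: bool      # information is recorded and retained
    permission: bool     # others are allowed to acquire and use it
    discovery: bool      # it can be found and recognized
    understanding: bool  # its full context can be interpreted
    support: bool        # help exists beyond the initial documentation


def missing_elements(a: AccessAssessment) -> list[str]:
    """Return the names of the elements that are not yet satisfied."""
    return [f.name for f in fields(a) if not getattr(a, f.name)]


def has_permanent_access(a: AccessAssessment) -> bool:
    """Permanent access holds only when every element is satisfied."""
    return not missing_elements(a)


# Made-up example: data that exists and is shareable, but is poorly documented.
example = AccessAssessment(True, True, True, False, False)
print(has_permanent_access(example))  # False
print(missing_elements(example))      # ['understanding', 'support']
```

Listing the missing elements, rather than returning only a yes/no answer, points to where effort is still needed.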

Page 18: Outline

Existence

• Definition
  – Information is recorded and retained.
  – Information can be found and used by “experts”.
• Requirements
  – Information technology is used for recording and retention.
  – Scientists are trained and required to record and retain information.
• Issues
  – The availability of information technology will far surpass the “ability” to use it effectively.
    • Training will be needed to extend “ability” beyond the immediate need.
    • Training must include both fact and philosophy.
  – Plans to use information technology must be “pushed” beyond the immediate objectives.
    • Need to establish reasonable and more “global” plans and objectives.

Page 19: Outline

“Why Don’t I Archive My Data?”

• No incentives - What’s in it for me?
• No acknowledgment - Does a dataset = a publication?
• Give up publication rights - Will somebody scoop me?
• Poor planning - It was not in “the Plan”.
• No resources - Who’s going to pay for it?
• No future - Who will support this later?
• Lack of training - What do I need to do first?
• Unsure about metadata content - How much is enough?

Page 20: Outline

Permission

• Definition
  – Someone “beyond” the originator is allowed to acquire and use the data.
• Requirements
  – Scientists relinquish control of the data.
  – Sponsors and agencies relinquish control of the data.
  – “They” not only allow future use, but encourage it.
• Issues
  – Encourage data re-use.
    • Explain larger research objectives.
    • Reward data citation.
  – Balance openness and protection.
    • Allow early discovery.
    • Prevent resource abuse.
    • Protect individual privacy.

Page 21: Outline

Discovery

• Definition
  – Starts with the inspiration to look “here” for “what you want”.
  – Includes knowing how to find “what you want”.
  – Ends with recognizing it when “what you want” is found.
• Requirements
  – Logical organization.
  – Good and meaningful metadata (categories and keywords).
  – Multiple pathways for discovery.
• Issues
  – Documentation must be significantly extended beyond the “local view” of the data.
  – Documentation development is “not career building” for scientists.
  – Interactions between “developers and users” must be sustained.
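
To make “categories and keywords” and “multiple pathways” concrete, here is a minimal sketch of a metadata catalog that can be reached two ways: by browsing a category or by searching a keyword. The `CatalogEntry` class, both functions, and the entries are hypothetical and serve only as illustration; they do not describe any actual archive system.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record; the fields are illustrative assumptions."""
    name: str
    category: str
    keywords: set[str] = field(default_factory=set)


def find_by_category(catalog, category):
    """Pathway 1: browse by a predefined category."""
    return [e.name for e in catalog if e.category == category]


def find_by_keyword(catalog, keyword):
    """Pathway 2: search by keyword, ignoring letter case."""
    wanted = keyword.lower()
    return [e.name for e in catalog if wanted in {k.lower() for k in e.keywords}]


# Made-up entries, used only to exercise the two pathways.
catalog = [
    CatalogEntry("dataset_a", "radiation", {"shortwave", "surface"}),
    CatalogEntry("dataset_b", "clouds", {"radar", "profile"}),
    CatalogEntry("dataset_c", "radiation", {"longwave", "Surface"}),
]

print(find_by_category(catalog, "radiation"))  # ['dataset_a', 'dataset_c']
print(find_by_keyword(catalog, "surface"))     # ['dataset_a', 'dataset_c']
```

Both pathways depend entirely on the quality of the metadata entered, which is the point of the “good and meaningful metadata” requirement.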

Page 22: Outline

An Initial View of Data…

[Diagram: a single “Measurement”.]

Page 23: Outline

Single Experiment View

[Diagram: a “Measurement” described by date, sample ID, parameter name, and location.]

Page 24: Outline

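Integrated System & Archive View

[Diagram: a central “Measurement” linked to its describing fields and their definitions.]

• Measurement fields: QA flag, media, generator, method, date, sample ID, parameter name, location, records, units
• Supporting definitions shown around the measurement:
  – Sample def.: type, date, location, generator
  – lab, field
  – Method def.
  – words, words; units; method
  – Parameter def.
  – org., type, name, custodian, address, etc.
  – coord., elev. type, depth
  – Record system
  – date; words, words.
  – QA def.
  – Units def.
  – GIS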
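
The step from the single experiment view to the integrated view can be sketched as a record that gains context fields. The field names below follow the labels on the two slides; the class names, types, and example values are assumptions made for illustration, not the design of any actual system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SingleExperimentRecord:
    """Fields shown in the single experiment view."""
    measurement: float
    date: str
    sample_id: str
    parameter_name: str
    location: str


@dataclass
class ArchiveRecord(SingleExperimentRecord):
    """Adds context fields shown in the integrated system and archive view."""
    units: Optional[str] = None
    method: Optional[str] = None
    generator: Optional[str] = None
    media: Optional[str] = None
    qa_flag: Optional[str] = None


def missing_context(record: ArchiveRecord) -> list[str]:
    """List the archive-level fields that have not been filled in."""
    context = ("units", "method", "generator", "media", "qa_flag")
    return [name for name in context if getattr(record, name) is None]


# Made-up values: a record promoted from a project file with little context.
rec = ArchiveRecord(7.2, "2006-10-01", "S-001", "pH", "site_1", units="pH units")
print(missing_context(rec))  # ['method', 'generator', 'media', 'qa_flag']
```

A check like `missing_context` shows why data that served one project well may still not be ready for an archive: the context was never recorded because the original team did not need it.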

Page 25: Outline

Comparison of User Interface Options

• ARM Data Browser
  – Accessible data: Routine ARM data
  – “Shopping” approach: “I am not sure what I want. I need to see what you have available.” Browsing a hierarchy of availability summaries.
• Catalog Interface
  – Accessible data: Routine ARM data
  – “Shopping” approach: “I know what I want. Do you have it?” Searching with predefined selection criteria.
• Thumbnail Browser
  – Accessible data: Most routine ARM data
  – “Shopping” approach: “I will know what I want when I see it.” Searching with a combination of predefined selection criteria and visual review of data plots.
• Web Shopping Cart
  – Accessible data: Routine ARM data and some IOP data
  – “Shopping” approach: “I need to read about what you have, then I will decide.” Discover areas of interest by browsing the ARM web documentation and collect items of interest.
• IOP Data Browser
  – Accessible data: IOP, special, PI, and beta data
  – “Shopping” approach: “I need to look in the odd parts bin.” Direct access to IOP data. Navigate /year/site/iop directory tree. Also use narrow Google search.

([email protected], 1-888-ARM-DATA)

Page 26: Outline

Moving on to … Results-based searching

• An interface of “Statistical Views” (or data) under development for the ARM Archive.
• Not all users want “data.”

[Figure: user interface to select thumbnails of Statistical Views.]
[Figure: detailed view of graph; options to order statistics, data, or data files.]

Page 27: Outline

Understanding

• Definition
  – The interpretation of the full context of the information.
• Requirements
  – Descriptive metadata that correctly “matches up” information that was:
    • Generated from a variety of sources,
    • Collected for a variety of purposes,
    • Retained over a broad range of time.
  – “Understanding” applies to both:
    • Persons who read documentation.
    • Computers that “read” the data format.
• Issues
  – “Language barriers” must be overcome between scientific disciplines.
  – Inadequate documentation and software can make “data” useless.
  – Additional effort will need to be allocated beyond original purpose.
    • Trade off between: current quantity of measurements and future use.

Page 28: Outline

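Sequence of Information Birth

[Diagram: a central “Measurement” linked to its describing fields and their definitions.]

• Measurement fields: QA flag, media, generator, method, date, sample ID, parameter name, location, records, units
• Supporting definitions shown around the measurement:
  – Sample def.: type, date, location, generator
  – lab, field
  – Method def.
  – words, words; units; method
  – Parameter def.
  – org., type, name, custodian, address, etc.
  – coord., elev. type, depth
  – Record system
  – date; words, words.
  – QA def.
  – Units def.
  – GIS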

Page 29: Outline

Support

• Definition
  – Providing help and service beyond the creation of initial information and documentation.
• Requirements
  – Answers user questions beyond the initial documentation.
  – Responds to the evolution of information technology.
  – Includes scientific and technology expertise.
• Issues
  – Maintaining information does not:
    • Fit traditional science program planning.
    • Contain “whiz bang” appeal.
  – Requires development of new “career pathways”.

Page 30: Outline

Research Implies Change …

[Diagram: a cycle of Research → Discovery → New questions → New data requirements → repeat…]

This is not always true for other information systems.

Page 31: Outline

Issues to Consider about Change

• What will change?

• Which changes can be controlled?

• How are changes approved?

• How are users notified about changes?

• How and when can changes be “smoothed” in the cumulative “Archive” view?

Page 32: Outline

Pathways to the Future??

Page 33: Outline

EPDUS = ????? (1)

• Existence
  – Technology has pushed this out of control – a path to chaos
  – Dilution of value causes a recovery problem
  – Develop procedures for retention guidelines
• Permission
  – Plans that encourage permanent access of scientific data are a “management responsibility”
  – Consistent rules to protect privacy, resources, and propriety
• Discovery
  – Significant effort on cataloging and searching
  – Large scale data collections depend on rational metadata
  – Need interrelated discovery pathways (query, catalog, pictures)
  – Results-based views are still very limited from large scale data
  – Inspiration is “an undeveloped frontier”

Page 34: Outline

EPDUS = ????? (2)

• Understanding
  – Expanding human and computer “interpretation” is difficult;
    • Does not keep up with the increase in diversity of information types
  – Web documentation has an inverted outline of scientific publications
    • Web users don’t read !!!
    • (??!!! More later !!!??)
• Support
  – Inclusion of scientific expertise in Data Centers is still debated and limited
  – Programmatic justification of Data Centers outside of (or after!!) measurement program has limited “sponsor appeal”

Page 35: Outline

“Inverted” Documentation Outline!?! Science Publication vs. “Web Reading”

• Science Publication
  – Abstract
  – Introduction
  – Literature review
  – Materials / methods
  – Results
  – Discussion
  – Conclusion
  – References
• “Web Reading”
  – Conclusions
  – Results
  – Abstract
  – Materials / methods
  – Discussion
  – Literature review
  – References
  – Introduction

Reference: McCord (200?) ???

Page 36: Outline

Cross cutting issues

• Training about scientific data management
  – Some for all scientists, graduate program for “data scientists”
  – Reward system for scientific data “reuse”
• Feudal relationship between more “Science” and data preservation
  – More measurements and experiments
  – Bigger computers driving science
  – Stop it!! Cooperation is needed!!
• Scientific input needed for:
  – Metadata creation
    • Mesh with scientific planning
  – Defining priorities and recommendations
    • “An answer is better than NO answer!!” (Going for 0 points??!!)
  – Defining a reasonable boundary between “system and scientist”
    • Handshake needed for QA review, analysis tools, documentation, and automated discovery (?!?)

Page 37: Outline

Looking from the Past to the Future
(Common questions from my peers.)

• “Should I computerize my data?” (~1974-1975)
• “Should I save my {computerized} data?” (~1978-1980)
• “Why would anyone want my data?” (~1980)
• “Can anyone else properly understand my data?” (~1985)
• “Can I have your data?” (~1990)
• “Can I find your data?” (~1993-1994)
• “Will I have to contact you to know how you used your data?” (~1998-1999)
• “Can you tell me who else has used your data?” (~2000 - ????)
• “Can you tell me where to find similar data?” (~2003 - ????)
• “Do you want to know (or get back) how I used your data?” (~2005 - ????)
• “Will you work together with me on ‘our’ data?” (????)
• “Can we work together with our and ‘their’ data?” (????)
• … What next … ??

Timeline callouts: Interactive computing starts; PC gets common; www.??? takes off; Cheap storage; Collaboration is common; Internet “premie”

Page 38: Outline

Conclusions (again!!)

• We are swamped with more information than we can access*
  – *Access is a broad topic (EPDUS = ????)
• Our current practices may not be sustainable and reliable
  – Exponential vs linear capacity increases
  – Optimization is unbalanced
• Scientific expertise within data centers will improve future data access.
• Science and data management must be integrated.
• Many solutions are NOT technological, but behavioral.
  – Think - Training
• “Data Science” training must be developed and implemented.
• The needed changes will not happen by accident.
  – “My ~30 years of experience and systems observation suggests otherwise!!”
• Sooner is better than later.

Page 39: Outline

Story time … (again)

Page 40: Outline

An Ending to the Story…
(More conclusions)

• The best house is probably a combination of:
  – Bricks to build on
  – Sticks (wood) to bend and change with
  – Straw to rest on when sorting out the problem is too early.
• The Wolf needs to be tamed with:
  – Reduction in needless data management and documentation chaos and uninformed practices.
  – More thought (research??) about “our appetite” (priorities) for storage and retention.
• The Wolf and Pigs both need more training!!
• And they live happily ever after…!

Page 41: Outline

Quiz Answers

Page 42: Outline

Pop Quiz (Answer 1)

• What is the “access combination” to my Lock?
• Hints + missing hints
  – “I love it”
    • Decode numeric sequence from word lengths
  – {X = Y^z – ((Y*z)/12)}
    • All unknowns are integers
    • Solution is the integer number showing the sequence
• Combination = 142
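
The word-length hint can be checked with a few lines of Python. The function name is illustrative. The second check assumes the formula reads X = Y^z - (Y*z)/12, with the exponent lost in the slide text; Y = 12 and z = 2 is one integer pair that fits under that reading, not a value given in the presentation.

```python
def decode_combination(phrase: str) -> int:
    """Concatenate the length of each word into one integer."""
    digits = "".join(str(len(word)) for word in phrase.split())
    return int(digits)


print(decode_combination("I love it"))  # 142

# Assumed reading of the second hint: X = Y^z - (Y*z)/12.
# Y = 12, z = 2 is one integer pair that also gives 142.
Y, z = 12, 2
print(Y**z - (Y * z) // 12)  # 142
```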

Page 43: Outline

Pop Quiz (Answer 2)

• How is my necktie related to:
  – Data?
    • They all look alike at first?
  – Metadata?
    • Neckties distinguish the teddy bears
  – Scientists?
    • They distinguish data in varying and “unseen” ways
  – 2 year old children?
    • They distinguish teddy bears in varying and “unseen” ways

Both CRY when given the “wrong one”

Page 44: Outline

Pop Quiz (Answer 3)

• “Why do I care?”
  – Better data access can turbo charge Science.
  – Things are a bigger mess than necessary.
  – Progress toward improvement is too passive and too slow.
  – Independently managing information from each project is like “paying rent” rather than “building equity.”

Page 45: Outline

References

• Information about my current project
  – Atmospheric Radiation Measurement (ARM) Program www.arm.gov
  – ARM Archive www.archive.arm.gov
• Extended version of “The Three Little Pigs”
  – http://math-www.upb.de/~odenbach/pigs/pigs.html
  – Linked to a German Math professor’s web site??
  – An English version is presented
• Very good reference on Data, Science, and the need for new roles
  – “Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century”
  – http://www.nsf.gov/pubs/2005/nsb0540/
  – Sponsored by NSF National Science Board
• Reports on other NSF cyber infrastructure activities to watch and encourage
  – http://www.nsf.gov/od/oci/reports.jsp