20160922 Materials Data Facility TMS Webinar
Transcript of 20160922 Materials Data Facility TMS Webinar
![Page 1: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/1.jpg)
Ben Blaiszik ([email protected]), Kyle Chard, Rachana Ananthakrishnan, Michael Ondrejcek, Kenton McHenry
PIs: Ian Foster ([email protected]), Steven Tuecke, John Towns
materialsdatafacility.org
globus.org
Materials Data Facility - Data Services to Advance Materials Science Research
![Page 2: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/2.jpg)
2
http://dx.doi.org/10.1007/s11837-016-2001-3
MDF Article in JOM (August Issue)
![Page 4: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/4.jpg)
4
Outline
• Overview
  § MDF Overview
  § Globus quick introduction
• MDF Data Publication Service
  § Key MDF data pub service features
  § Publication walk-through
• General Observations and Future Outlook
![Page 5: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/5.jpg)
What is MDF?
5
We are developing production services to make it simpler for materials datasets and resources to be ...
Published, Identified, Described, Curated
Verifiable, Accessible, Preserved
Discovered, Searched, Browsed, Shared
Recommended, Accessed
[Figure: data categories: SRD, Publishable Results, Published Results, Resource Data, Ref Data, Derived Data, Working Data. Figure adapted from Warren et al.]
![Page 6: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/6.jpg)
Data Service Infrastructure
6
• Publication
• Discovery
• Compute for data interaction and viz
• Resource Registration
• APIs
[Figure legend: "+" marks the initial foci]
![Page 7: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/7.jpg)
7
Publication
APIs
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and provenance
• Verify dataset contents over time
• Preserve critical datasets in a state that increases transparency and replicability and helps encourage reuse
![Page 8: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/8.jpg)
8
Discovery
• Search and query datasets in modern ways – e.g. via search against indexed metadata and harvested file contents rather than remembering opaque file paths
Future... (Under Development)
Spotlight for all data you have access to, regardless of location
![Page 9: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/9.jpg)
9
Discovery (Under Development)
• SaaS cloud-hosted solution
• Logical metadata repository to index many external sources
• Flexible queries (boosting, full text, partial matches, etc.)
• Search results are limited by ACLs
![Page 10: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/10.jpg)
10
Discovery (Under Development)
• All MDF-published datasets will be indexed
• May use common schemas (DataCite, Dublin Core, etc.) or domain-specific schemas
• Globus endpoint contents may be indexed (owner enabled)
• Index has the flexibility of no required schema
• Built on Elasticsearch for proven scalability and speed, hosted on scalable AWS resources
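The query features listed on these Discovery slides (full text, boosting, partial matches, ACL-limited results) can be sketched as an Elasticsearch-style query body. This is a minimal illustration only: the function `build_query` and the field names `title`, `description`, and `acl_principals` are assumptions made for the example, not the MDF index schema.

```python
import json


def build_query(text, principals, title_boost=3.0):
    """Build an Elasticsearch-style bool query (field names are hypothetical)."""
    return {
        "query": {
            "bool": {
                # Scored clauses: full-text match across two fields with the
                # title weighted more heavily, plus a partial (prefix) match.
                "should": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": [f"title^{title_boost}", "description"],
                        }
                    },
                    {"match_phrase_prefix": {"title": text}},
                ],
                "minimum_should_match": 1,
                # Unscored filter: keep only records readable by at least one
                # of the caller's identities or groups (the ACL limit).
                "filter": [{"terms": {"acl_principals": principals}}],
            }
        }
    }


query = build_query("steel fatigue", ["public", "group:my-lab"])
print(json.dumps(query, indent=2))
```

Putting the ACL check in `filter` rather than `should` means it restricts the result set without affecting relevance scores.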
![Page 11: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/11.jpg)
11
Discovery (Under Development)
Custom boosting
Facets
Test-indexed data
![Page 12: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/12.jpg)
13
Globus Background
https://www.globus.org
![Page 13: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/13.jpg)
Globus Platform-as-a-Service (PaaS)
14
Data sharing
• Share directly from your storage device (laptop or cluster)
• File and directory-level ACLs
User groups
• Manage user group creation and administration flows
• Share data with user groups
Data transfer
• High-performance data transfer from a web browser
• Optimize transfer settings and verify transfer integrity
• Add your laptop to the Globus cloud with Globus Connect Personal
Identity management
• Create and manage a unique identity linked to external identities for authentication
Publication Discovery
![Page 14: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/14.jpg)
REST APIs, Clients, and Docs
15
• New version of core services released in Feb.
• New Python SDK available
  § https://github.com/globusonline/globus-sdk-python
• Jupyter Notebook Examples
  § https://github.com/globus/globus-jupyter-notebooks
• Sample Data Portal
  § https://github.com/globus/globus-sample-data-portal
• (alpha) MDF Data Publication Service API
![Page 15: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/15.jpg)
Globus Background
16
[Figure: Globus moves the data for you between two secure endpoints, A and B (e.g. a laptop and midway). You submit a transfer request; Globus notifies you once the transfer is complete.]
Endpoint
• E.g. laptop or server running a Globus client (analogous to a Dropbox client)
• Enables advanced file transfer and sharing
• Currently GridFTP, future GridFTP + HTTP
Some Key Features
• REST API for automation and interoperability
• Web UI for convenience
• Optimizes and verifies transfers
• Handles auto-restarts
• Battle tested with big data
![Page 16: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/16.jpg)
Globus Web UI
17
![Page 17: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/17.jpg)
19
Data Publication
Where are we Now?
![Page 18: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/18.jpg)
20
Materials Data Publication/Discovery is Often a Challenge
Data Collection Data Storage and Process Publication
![Page 19: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/19.jpg)
21
Materials Data Publication/Discovery is Often a Challenge
Data Collection
???
• Networked storage, sometimes many TB
• Unique identifier data for search/cite
• Custom metadata descriptions
• Data curation workflow
• Automation capabilities
Data Storage and Process Publication
Want to Discover / Use
Want to Publish
Don’t put under desk!
Needed to close the loop
![Page 20: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/20.jpg)
22
Data Collection
???
• Need storage, sometimes many TB
• Need to uniquely identify data for search/cite
• Need custom metadata descriptions
• Need a data curation workflow
• Need automation capabilities
Data Storage and Process Publication
Want to Discover / Use
Want to Publish
Materials Data Publication/Discovery is Often a Challenge
Don’t put under desk!
![Page 21: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/21.jpg)
Collection Model
23
• Collections might be a research group or a research topic...
• Collections have specified
  § Mapping to storage endpoint
    § Currently handled as automatically created shared endpoints
  § Metadata schemas
  § Access control policies
  § Licenses
  § Curation workflows
• Collections contain
  § Datasets
    § Data
    § Metadata
• Metadata Persistence
  § Metadata log file with dataset
  § Metadata replicated in search index
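The collection model above can be sketched as a pair of data structures. This is a hypothetical illustration, not the MDF service's data model: the class names, field names, and the rule that a dataset must carry metadata for every required schema are assumptions drawn from the bullet points.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """A dataset holds data (file paths here) and its metadata."""
    title: str
    files: List[str] = field(default_factory=list)
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Collection:
    """Collection-level policies plus the datasets they govern."""
    name: str
    storage_endpoint: str        # mapping to a storage endpoint
    metadata_schemas: List[str]  # schemas required of every dataset
    access_policy: str           # e.g. "public" or a group name
    license: str
    curation: str = "none"       # "none", "metadata", or "metadata+files"
    datasets: List[Dataset] = field(default_factory=list)

    def add(self, dataset: Dataset) -> None:
        # Reject datasets missing metadata for any required schema.
        missing = [s for s in self.metadata_schemas if s not in dataset.metadata]
        if missing:
            raise ValueError(f"missing metadata for schemas: {missing}")
        self.datasets.append(dataset)


# Placeholder values throughout; none of these names come from the slides.
group = Collection("example-group", "example#endpoint", ["dublin_core"],
                   "public", "CC-BY-4.0")
group.add(Dataset("Fatigue runs", ["run1.h5"],
                  {"dublin_core": {"creator": "A. Researcher"}}))
```

Keeping policies on the collection rather than on each dataset mirrors the slide's point that schemas, access control, licenses, and curation are set once per collection.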
![Page 22: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/22.jpg)
Hybrid Distributed Model
24
[Figure: Cloud Metadata Index and Tools (centralized resource) linked to Globus endpoints: Petrel @ Argonne (1.7 PB), BlueWaters Condo @ UIUC (100 TB), EP 1, EP 2, EP 3, Campus RDS, ElectroCat EP. Labels: DOE, NSF (XSEDE).]
![Page 23: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/23.jpg)
Publish Large Datasets
25
• Distributed data model leverages Globus production capabilities for file transfer (i.e. dataset assembly), user authentication, and access control groups
• 100s of TB of reliable storage @ NCSA, and more storage at Argonne
  § Globus endpoint at ncsa#mdf on Nebula
  § Expandable to many PBs as necessary
  § Automated tape backup for reliability (in progress)
• Researchers can optionally use their own local or institutional storage
![Page 24: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/24.jpg)
Uniquely Identify Datasets
26
• Associate a unique identifier with a dataset
  § DOI, Handle
• Improve dataset discovery and citability
  § Aligning incentives and understanding the culture will be critical to driving adoption
[Figure: plot of dataset downloads over time]
• Your work has been cited 153 times in the last year
• Researchers from 30 institutions have downloaded your datasets
Future...
![Page 25: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/25.jpg)
Share Data with Flexible ACLs
27
• Share data publicly, with a set of users, or keep data private
Leverage Curation Workflows
• Collection administrators can specify the level of curation workflow required for a given collection, e.g.
  § No curation
  § Curation of metadata only
  § Curation of metadata and files
![Page 26: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/26.jpg)
Customize Metadata
28
• Build a custom metadata schema for your specific research data
• Re-use existing metadata schemas
• Working in conjunction with NIST researchers to define these schemas
• Can we build a system that allows schema:
  § Inheritance
    § E.g. a schema “polymers” might inherit and expand upon the “base material” of NIST
  § Versioning
    § E.g. Understand contextually how to map fields between versions
  § Dependence
    § E.g. Allows the ability to build consensus around schemas
Future...
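The schema inheritance idea on this slide (a "polymers" schema expanding on a "base material" schema) can be sketched with plain dictionaries. Everything here is an assumption for illustration: the schema layout, the field names `composition` and `molecular_weight`, and the helpers `inherit` and `validate` are hypothetical and do not describe a NIST or MDF schema format.

```python
def inherit(base, extension):
    """Return a new schema with the base fields plus the extension's fields."""
    merged = dict(base["fields"])
    merged.update(extension["fields"])  # the extension may add or override fields
    return {
        "name": extension["name"],
        "version": extension["version"],
        # Record which schema and version this one was derived from.
        "parent": f'{base["name"]}@{base["version"]}',
        "fields": merged,
    }


def validate(record, schema):
    """List the problems found when checking a record against a schema."""
    problems = []
    for name, expected in schema["fields"].items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems


base_material = {"name": "base_material", "version": 1,
                 "fields": {"composition": str}}
polymers = inherit(base_material, {"name": "polymers", "version": 1,
                                   "fields": {"molecular_weight": float}})

print(validate({"composition": "C2H4", "molecular_weight": 28.05}, polymers))
```

Recording the parent name and version alongside the merged fields is one way to keep enough context for the versioning question raised on the same slide.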
![Page 27: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/27.jpg)
29
MDF Submission Walkthrough
![Page 28: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/28.jpg)
Example Use Case
30
Publishing Big, Remote Data
Collected multiple TB of data at a light source
Bundle the data with metadata and provenance
Want a citable DOI to share the raw and derived data with the community
Want their data to be discoverable by free text search and custom metadata
![Page 29: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/29.jpg)
MDF Collection Home
31
![Page 30: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/30.jpg)
MDF Collections
32
Recall: Policies Set at the Collection Level
• Required metadata, schemas
• Data storage location
• Metadata curation policies
![Page 31: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/31.jpg)
MDF Metadata Entry
33
• Scientist or representative describes the data they are submitting
• For this collection Dublin Core and a custom metadata template are required
![Page 32: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/32.jpg)
MDF Custom Metadata
34
![Page 33: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/33.jpg)
Dataset Assembly
35
• Shared endpoint is auto-created on collection-specified data store
• Scientist transfers dataset files to a unique publish endpoint
• Dataset may be assembled over any period of time
• When submission is finished, dataset will be rendered immutable via checksum
[Figure: transfer from the scientist's endpoint (e.g. NU) to the collection data store (e.g. UIUC Nebula)]
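The slide says a finished dataset is rendered immutable via checksum but does not describe the mechanism. One common approach, shown here as an assumption rather than MDF's implementation, is to record a per-file SHA-256 manifest at submission time and compare against it later. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """Return paths added, removed, or modified since the manifest was built."""
    current = build_manifest(root)
    return sorted(
        path
        for path in set(current) | set(manifest)
        if current.get(path) != manifest.get(path)
    )
```

A manifest built when the submission is finished can be stored with the dataset; an empty result from `changed_files` at any later time indicates the contents are unchanged.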
![Page 34: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/34.jpg)
![Page 35: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/35.jpg)
Dataset Curation (Optional)
37
• Optionally specified in collection configuration
• Can be approved or rejected (i.e. sent back to the submitter)
![Page 36: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/36.jpg)
Mint a Permanent Identifier
38
Can be a DOI or Handle
![Page 37: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/37.jpg)
Dataset Record
39
![Page 38: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/38.jpg)
Dataset Discovery
40
![Page 39: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/39.jpg)
47
General Observations and Future Outlook
![Page 40: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/40.jpg)
48
Publication Year 1 Milestones
APIs
• Opened to the public in March 2016
• Provisioned reliable storage to support researchers sharing open materials data (~200 TB)
• MDF data volume approaching ~ 6 TB of materials data
• Started building deep relationships with many of the key materials data generating groups and communities
• Ingested a dataset > 1 TB in size
• Ingested a dataset with > 1.5M files
![Page 41: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/41.jpg)
Integration with the Community is Key
49
[Figure: community integration with Materials Project, OQMD, Citrination, Materials Commons, NCSA-PIRE, HV/TMS, MBDH, and other facilities (APS, SNS, NSLS, …), institutional repositories, and publishers. Link labels: Metadata, Publishing; Metadata, MD, Pub., Compute; Metadata, Publishing.]
![Page 42: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/42.jpg)
Understanding Incentives is Critical
50
Increasing Impact
• Increase paper citations [1]
• Add dataset citation capabilities
Smoothing Dislocations
• [Distance] Enable simple sharing among collaborators (near and far)
• [Personnel] Ease transitions between students
• [Format] Lessen need for ad hoc resource sharing (e.g. via group websites)
Meeting Award Requirements
• Simplify DMP compliance
[1] Citation increase of 30% (10.7717/peerj.175) to 60% (10.1371/journal.pone.0000308) [caveat: bio research]
![Page 43: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/43.jpg)
Lessons Learned
51
• The demand is there from researchers and institutions
• Lots of cross-over with centers and projects
  § (NIST) CHiMaD
  § (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute
  § (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke*
• Data Heterogeneity is a challenge
  § Metadata is the major sticking point
• Friction points
  § Need more flexible data objects e.g. {"temperature": 100, "unit": "K"}
  § Need file or directory based metadata
  § Immutable datasets alone are not enough → Versioning
  § Data gathering in retrospect
  § Schema generation and interoperability
    § Working with and following developments at NIST, RDA, Citrine et al.
  § Differing institutional approval processes
  § Lack of programmatic interface (planned).
• Support for data interactivity and visualization
• Smart versioning for large file-based datasets
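The "flexible data objects" friction point can be made concrete with a short sketch. Storing a value together with its unit, as in the slide's example object, lets records reported in different units be compared. The conversion table and the helper `temperature_in_kelvin` are assumptions for illustration, and the second record is a made-up placeholder.

```python
# Conversions to kelvin for the units this sketch understands.
TO_KELVIN = {
    "K": lambda v: v,
    "C": lambda v: v + 273.15,
}


def temperature_in_kelvin(obj):
    """Convert a {"temperature": ..., "unit": ...} object to kelvin."""
    try:
        convert = TO_KELVIN[obj["unit"]]
    except KeyError:
        raise ValueError(f"unsupported or missing unit in {obj!r}")
    return convert(obj["temperature"])


records = [
    {"temperature": 100, "unit": "K"},  # the object shown on the slide
    {"temperature": 25, "unit": "C"},   # placeholder value
]

# Sort records by temperature regardless of the unit each one was reported in.
ordered = sorted(records, key=temperature_in_kelvin)
```

A bare number such as `100` would not carry enough information for this comparison, which is the motivation for the richer object.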
![Page 44: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/44.jpg)
Wider Data Community
52
• Curated and described datasets
• Well-posed problems
• Community to share analyses
• Challenges to start “sprints”
• Great APIs and clients
• Examples to get started
• Hundreds of video tutorials
Materials Project, OQMD, Citrination, Materials Commons
• Less inherently intuitive problems
• Sometimes need advanced compute capabilities
• Often many TB
![Page 45: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/45.jpg)
53
Broader Trends
• Continuous integration, QA, and testing
• Containerized solutions, microservice architecture, abstracting software from hardware
• Automation
• Internet of Things (IoT) – connect everything
• Machine Learning / AI
• Natural Language Processing (Siri, chatbots or “slack”bots, etc.)
• Search rules the world – ok this was 20 years ago…
What are the analogs and applications in the materials community?
Materials Project, OQMD, Citrination, Materials Commons
• Less inherently intuitive problems
• Sometimes need advanced compute capabilities
• Often many TB
![Page 46: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/46.jpg)
54
Experimentation Ahead
No team commitments here!
Open source opportunities, contact:
![Page 47: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/47.jpg)
Use Case: Scenario Generator-Consumer
55
• Data generator
  § Generates data periodically (perhaps from an instrument)
  § Pushes data to a public channel
  § Schema is validated before inclusion in channel stream
• Data consumer
  § Polls channel periodically
  § Wants to pull datasets by property
[Figure: a Data Generator creates datasets in the dataset channel "MDF-composites"; a Data Consumer sends a query (q) to the channel and receives the result]
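The generator-consumer scenario can be sketched with an in-memory stand-in for the channel. This is not an MDF API (the slides present the scenario as experimental): the `Channel` class, its methods, the schema fields, and the numeric values are all hypothetical placeholders. Only the channel name "MDF-composites" comes from the slide.

```python
class Channel:
    """In-memory stand-in for a public dataset channel."""

    def __init__(self, name, required_fields):
        self.name = name
        self.required_fields = required_fields  # field name -> expected type
        self._stream = []

    def push(self, dataset):
        # Validate against the channel schema before the dataset is included.
        for key, expected in self.required_fields.items():
            if not isinstance(dataset.get(key), expected):
                raise ValueError(f"dataset fails schema check on field {key!r}")
        self._stream.append(dataset)
        return len(self._stream)  # position just past the new dataset

    def poll(self, since=0, where=None):
        """Return (datasets added after `since` that satisfy `where`, new cursor)."""
        new = self._stream[since:]
        if where is not None:
            new = [d for d in new if where(d)]
        return new, len(self._stream)


# Generator side: push datasets as they are produced (placeholder values).
channel = Channel("MDF-composites", {"material": str, "modulus_gpa": float})
channel.push({"material": "CFRP", "modulus_gpa": 70.0})
channel.push({"material": "GFRP", "modulus_gpa": 25.0})

# Consumer side: poll and pull datasets by property.
stiff, cursor = channel.poll(where=lambda d: d["modulus_gpa"] > 50)
```

Passing the returned cursor as `since` on the next poll returns only datasets added in the meantime, which keeps periodic polling cheap.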
![Page 48: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/48.jpg)
Automated Data Aggregation (consumer)
56
![Page 49: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/49.jpg)
Aggregate, Perform ML
58
• Combine cloud-published dataset, scikit-learn, pandas to predict steel fatigue and “reproduce” data from journal publication
![Page 50: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/50.jpg)
Aggregate, Perform ML, Visualize
59
• Combine cloud-published dataset, scikit-learn, pandas to predict steel fatigue and validate journal publication
![Page 51: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/51.jpg)
What’s Currently Available?
60
• Web interface to support data publication (public-facing APIs coming soon)
• 100s of TB of storage at NCSA (scalable to many PB), with more at Argonne (1.7 PB total on Petrel – not all for materials…)
• Help with developing metadata schemas to describe your research datasets
MDF Tutorial on GitHub
https://github.com/blaiszik/materials-data-facility-training
![Page 52: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/52.jpg)
What are we looking for?
61
• Early adopters, willing to get their hands dirty with the service and give honest feedback
• Key integration points where metadata is picked up automatically!
• Key datasets and resources of all sizes, shapes, raw or derived, that might help us understand the process better
![Page 53: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/53.jpg)
Thanks to Our Sponsors!
62
U.S. Department of Energy
![Page 54: 20160922 Materials Data Facility TMS Webinar](https://reader031.fdocuments.in/reader031/viewer/2022020314/5883a8751a28ab3b488b53cb/html5/thumbnails/54.jpg)
Publication REST APIs Discovery
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and provenance
• Verify dataset contents over time
• Handle big (and small) data: We have already ingested datasets with > 1.5M files and > 1 TB in size
• Search and query datasets in modern ways
• Index metadata and harvest file contents
• Simple user interfaces (i.e., after Google and Amazon)
Opened to external users in Mar. 2016
~ 6 TB of data published
Materialsdatafacility.org