Cloud Dataverse: A Data repository platform for an OpenStack Cloud
CLOUD DATAVERSE
Mercè Crosas (1), Orran Krieger (2), Piyanai Saowarattitada (2)
(1) Institute for Quantitative Social Science (IQSS), Harvard University
(2) Massachusetts Open Cloud (MOC), Boston University
DATA REPOSITORIES NEED CLOUDS
CLOUDS NEED DATA REPOSITORIES
This Talk
1. The Need
– The rise of big data-centric computation
– The rise of modern data repositories
2. Our Platforms
– Dataverse: A premier open-source data repository platform
– MOC: Top collaborative OpenStack cloud with Big Data compute
3. The Solution
– Cloud Dataverse: Bringing MOC and Dataverse together
AWS sees the value in data
“When data is made publicly available on AWS, anyone can analyze any volume of data without needing to download or store it themselves.”
Data and Compute lead to Discovery
A wide range of fields and industries
can benefit from access to data
But, AWS public datasets miss key
aspects needed in data repositories
• Incentives to share data
• Citation to each version of the data
• Metadata for Discoverability
• Tiered access to non-public data
• Commitment to data archival & preservation
The scientific community has been thinking
about data archives and repositories for some
time
Timeline of data archives and repositories, by field:
Social sciences
• 1957: Roper Center (public and operational)
• 1960: Zentral Archiv für Empirische Sozialforschung (Germany)
• 1962: ICPSR
• 1964: Steinmetz Archive (Netherlands)
• 1965: ODUM Data Archive
• 1967: UK Data Archive
Life sciences
• 1970: Protein Data Bank
• 1982: European Nucleotide Archive, GenBank
Astronomy
• 1966: National Space Science Data Center
• 1987: Astrophysics Data System
Earth sciences
• 1990: EOSDIS
• 1995: Pangaea
Number of data repositories grows
with growth of data sharing
Timeline years shown: 2006, 2009, 2011, 2013
Milestones shown: Dryad, Figshare, Zenodo, DataCite, Data Citation Principles
Chart: number of data repositories (all types), 2012 to 2016. Source: re3data.org
> 1,500 research data repositories
Today’s repositories incentivize data sharing by giving
credit to data authors through formal citation
Persistent citations to datasets published in data repositories
Bibliography
The Dataverse open-source platform
enables building any type of data repository
Agriculture data
Repository in Fudan, China
Data from 20 Universities
Public data repository
Science Consortium
Challenges
• Datasets have to be small
• Hard to copy 40 PB over the internet
• Not everyone has the right compute infrastructure
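A rough back-of-the-envelope check on the 40 PB bullet above. This is a minimal sketch: the link speeds are hypothetical values chosen for illustration, a petabyte is taken as 10^15 bytes, and the link is assumed to run at full line rate with no protocol overhead.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link running at full line rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte, in bytes

# Hypothetical link speeds, not taken from the talk.
for label, rate in [("1 Gb/s", 1e9), ("10 Gb/s", 1e10), ("100 Gb/s", 1e11)]:
    days = transfer_days(40 * PETABYTE, rate)
    print(f"{label}: {days:,.0f} days")
```

Under these assumptions, even a dedicated 10 Gb/s link needs about a year to move 40 PB, which is why moving the compute to the data is attractive.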
DATAVERSE NEEDED A CLOUD
The Massachusetts Open Cloud – an Open Cloud eXchange
Imagine shrinking Pacific Research Platform to the size of a building
Consortium comparable to Pacific Research Platform
• Huge community covering every field of research
• Collaborations across the globe
• Massive data and computational requirements
• Massive student population covering every discipline
Widths are proportional to enrollment
MGHPCC Data Center
15 MW, 90,000 square feet + can grow
Tens of thousands of HPC users, potentially many more cloud users
The MOC partnership
Today’s model of Cloud
What we need:
an “Open Cloud eXchange (OCX)”
C3DDB
HPC
Big Data
Web
OpenStack is great, but… where is the data?
• We need to:
– share data between providers
– expose cloud metadata to researchers/companies
• Our scientific users need:
– in-situ compute on public & community data sets
– control over with whom and how their data is shared
– a reduced barrier to exploiting rich tools to compute on the data
• Our commercial & public sector partners need:
– to share data with researchers/startups
– to reduce the risk/barrier of publishing data
– a model to expose technology in an environment with rich data
The MOC needs a modern dataset repository
OpenStack needs a dataset repository project
Dataverse Before Cloud Dataverse
Diagram labels: Data depositor, Data users, Compute
Dataverse After Cloud Dataverse
Diagram labels: Data depositor, Data users, Swift (Object Storage), Nova (Compute, shown twice), Horizon, Sahara (Analytics), Giji
What’s missing in the cloud?
Diagram labels: UI, Data depositor, Data users, Swift (Object Storage), Nova (Compute, shown twice), Horizon, Sahara (Analytics), Giji
What’s missing in Dataverse?
Diagram labels: UI, Data depositor, Data users, Swift (Object Storage), Nova (Compute, shown twice), Horizon, Sahara (Analytics), Giji
So what is Cloud Dataverse?
Diagram labels: Swift (Object Storage)
DEMO: Billion Object Platform (BOP) GeoTweets
Diagram labels: Data users/analyst, Swift (Object Storage), Horizon, Tweets, BOP GeoTweets COLD report, Nova (Compute, shown twice), Sahara (Analytics)
Summary: BOP GeoTweets Cold Demo
Diagram labels: Giji, BOP, Data depositor
A Year in the Life of Cloud Dataverse
Timeline dates shown: Summer 2016, Fall 2016, October 2016, December 2016, January 2017, May 2017, Summer 2017
Milestones shown:
• Dataverse community review
• POC
• Barcelona OpenStack Summit #vBrownBag
• MOC Annual Workshop: POC preview
• Full collaborative development begins
• Boston OpenStack Summit (May 2017): Swift per repository, URI, demos
• Worldwide Data Federation (Summer 2017)
DATA REPOSITORIES NEED CLOUDS
CLOUDS NEED DATA REPOSITORIES
With Cloud Dataverse, we combine the power and scalability of an OpenStack cloud with access to data through a feature-rich repository.
THANKS