Stanley C. Ahalt, PhD Director, Renaissance Computing...
Data Science
Stanley C. Ahalt, PhD
Director, Renaissance Computing Institute
Professor of Computer Science, UNC-Chapel Hill
Director, Biomedical Informatics Core, NC TraCS
I was tasked by the meeting’s organizers to:
• Summarize what I have heard during the last two days.
• Stimulate further thought and discussion around possible next steps.
My Framework: RENCI
[Diagram: RENCI project areas and capabilities]
Enviro/Coastal Sciences:
E1: Storm Surge Modeling
E2: NSF SSI (HydroShare)
E3: WSSI Workspace
Health Sciences:
H1: UNC CTSA
H2: Sequencing
H3: Secure Research
H4: Decision Support
C1: WSSI
C2: REACH NC
C3: E-iRODS
C4: DataNet
C5: NCDS
C5: CIBER (NARA)
C6: ORCA/BEN (NSF GENI)
C7: ExoGENI
Diagram labels: HPC, Visualization, Virtual Organizations, Science of Cyberinfrastructure, Networks, Data, Software, Analytics
Cyberinfrastructure Catalyzing Research
(Overly simplistic) Summary:
• Data is easy:
– to generate
– to collect
– to accumulate
• We are getting better or even good at:
– Dealing with “well resourced” data (LHC, ENCODE, …)
– Analyzing data in very creative ways
– Realizing the secondary and tertiary utility of data
• But data is still hard:
– to make findable
– to view as a common-good
– to share (sometimes)
– to manage
– to make understandable
– to sustain
– to value
– to cite
– to control (security)
And there are some obvious things we lack
• Robust, shared cyber-infrastructure
• Broadly shared meta-collections (Amazon)
• Widely used sets of robust tools (Google Docs)
But despite the challenges
• There have been a remarkable number of data collections created, curated, and used.
List of Collections Mentioned in the past 2 days
ENCODE, QIIME, CDC NHANES, EPA HPVIS, EPA IUR, EPA TRI, Household Products DB, Cosmetic Voluntary Reg., UK Pharmaceutical Usage DB, EPA Pesticide Usage Data, ATSDR Tox Profiles, DEA NFLIS, ECOTOX DB, CESAR, DOE Indoor Air, NHEXAS, CTEPP, EPA NATA, EPA AIRS/AFS
EBI, NCBI, DDBJ, WormBase, VectorBase, SGD, NDAR, Protein Data Bank, GenBank, FlyBase, CMIP3, PRISM, CTD, Tox21, IRIS, HPVIS, ChemSpider, SEER, VDW, TCGA, BAM, CGHub
UN IPCC GHG, NIOSH NOES, EPA CHAD, EPA NHAPS, EU ESIS, EPA ACToR, ToxRefDB, ToxCastDB, ExpoCastDB, BioRefDB, DevToxDB, PubChem, US Census, CMS Medicare Enrollment, MCAPS, GEO/SRA, Ensembl, Factorbook, UniProt, GO
And despite the challenges…
• There has been a great deal of excellent research and science
– Many papers, and production has increased
• And excellent innovation:
– Mattingly’s CTD relationship “algebra”
– Cohen-Hubal’s relationship diagrams in ACToR
– Richard’s efforts to quantify the structures that give rise to properties.
Our problems and opportunities have arisen from technology
[Chart: Digital (%) and Paper (%) shares alongside storage cost ($/GB), for 1986, 1993, 2000, 2006]
Digital (%): 1%, 3%, 25%, 94%
Paper (%): 99%, 97%, 75%, 6%
$/GB data labels: $1,000.00, $10.00, $0.10
The Five Vs: And now we have “Big” Data
• Volume: The Large Hadron Collider discards 99.999% of its data because the data cannot be processed!
• Velocity: Retail transactions, communications, industrial sensor data demand real-time analysis and action.
• Variety: Health data includes images, test results, medical histories, doctor’s notes.
• Veracity: Data quality is essential for discovery and informed decision making
• Value: How important or rare is the data, and what do we keep and for how long?
Data use cases are heterogeneous
• Importance of each V varies, even within the same data set
Data management and analytics hardware and expertise are expensive and time consuming
• Can be barriers to entry, especially for small organizations and new researchers
Furthermore, our problems are social
• We are cats, not dogs.
• We hoard data because it used to be scarce and because it gave us an advantage.
• We lack the mechanisms (financial and social) to unify small and large data communities.
• Change is hard!
Each discipline has a different perspective on the problem
[Cartoon: observers each describing the same data]
“This is environmental data!”
“This is government!”
“This must be finance!”
“Astrophysics”
“Genomics!”
(Big) Data challenges require thoughtful approaches
• To work through the social issues.
• To build better algorithms and visualizations
• To improve data management techniques
• To lower equipment and staffing costs
• To develop the workforce
(Big) Data must have the status of a science, using scientific methods to accumulate a body of knowledge
NCDS Update
Defining Data Science
Data science: the systematic study of the organization and use of digital data in order to accelerate research discoveries, improve critical decision-making processes, and enable a data-driven economy.
The cost of data is a big deal
[Chart: Quantity of Global Digital Data, in zettabytes (axis 0.0 to 9.0), for 2005, 2010, 2012, 2015]
1 ZB = 1 billion TB
Current rate of growth fills the Library of Congress every 10 sec.
From Allen’s talk: 1.8 ZB/year
Source: EMC/IDC Digital Universe Study 2011. NSF.
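To give a feel for the scale on this slide, the short sketch below converts the 1.8 ZB/year figure into terabytes and into a per-second rate. It assumes decimal units (1 ZB = 10^21 bytes, 1 TB = 10^12 bytes) and a 365-day year; the function names are illustrative only and are not from the talk.

```python
# Unit conversion sketch. Decimal (SI) units are an assumption.
ZB = 10**21                      # bytes in one zettabyte
TB = 10**12                      # bytes in one terabyte
SECONDS_PER_YEAR = 365 * 24 * 3600


def zb_to_tb(zettabytes):
    """Convert zettabytes to terabytes."""
    return zettabytes * ZB / TB


def tb_per_second(zb_per_year):
    """Average rate in TB/s for a given yearly volume in ZB."""
    return zb_to_tb(zb_per_year) / SECONDS_PER_YEAR


if __name__ == "__main__":
    print("1 ZB in TB:", zb_to_tb(1))
    print("1.8 ZB/year in TB/s:", round(tb_per_second(1.8), 1))
```

Under these assumptions, 1.8 ZB/year averages out to a few tens of terabytes every second.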
From Allen’s talk: 1.8 ZB/year
Assumed cost per kWh: $0.09
Average watts per DDN shelf: 1750
Hours per year: 8760
For 1.8 ZB this works out to $13.8 B/yr energy cost
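The arithmetic behind this estimate can be sketched as follows. The slide gives the price per kWh, the shelf wattage, the hours per year, and the total, but not the storage capacity of a shelf, so the code back-calculates the shelf count and capacity that the $13.8 B figure implies. That implied capacity is an inference from the slide's numbers, not a figure from the talk, and decimal units (1 ZB = 10^21 bytes) are assumed.

```python
# Values taken from the slide.
COST_PER_KWH = 0.09        # dollars per kWh
WATTS_PER_SHELF = 1750     # average draw of one DDN shelf
HOURS_PER_YEAR = 8760
TOTAL_COST = 13.8e9        # dollars per year for 1.8 ZB

# Assumption: decimal units.
TOTAL_BYTES = 1.8e21       # 1.8 ZB


def annual_cost_per_shelf():
    """Energy cost in dollars to run one shelf for a year."""
    kwh_per_year = WATTS_PER_SHELF / 1000 * HOURS_PER_YEAR
    return kwh_per_year * COST_PER_KWH


def implied_shelf_count():
    """Number of shelves implied by the slide's total cost."""
    return TOTAL_COST / annual_cost_per_shelf()


def implied_tb_per_shelf():
    """Shelf capacity (TB) implied by spreading 1.8 ZB over those shelves."""
    return TOTAL_BYTES / implied_shelf_count() / 1e12


if __name__ == "__main__":
    print("Cost per shelf per year ($):", round(annual_cost_per_shelf(), 2))
    print("Implied shelves (millions):", round(implied_shelf_count() / 1e6, 2))
    print("Implied TB per shelf:", round(implied_tb_per_shelf()))
```

Read this way, the slide's total corresponds to roughly ten million shelves, each costing about $1,380 per year to power.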
The cost of storing the human genome
• 10 PB = storage for 100,000 full human genomes at low coverage (1)
• 2 PB = storage for 100,000 human exomes at medium coverage (2)
Or:
• $5 to $25 million for UNC Health Care System to store every patient’s genome once on enterprise data storage
– Not including archived copies, not including analysis data sets
Or:
• $15 to $75 billion for the US to store every patient’s genome once
– This is the cost of disk space alone and is not a one-time cost
(1) Empirical data, assuming ~100 GB per sample for compressed fastq, bam, vcf, and ancillary data files at coverage between 3-15x
(2) Empirical data, assuming ~20 GB per sample at around 30x, only storing compressed fastq and bam files
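The petabyte totals follow directly from the per-sample sizes in the footnotes. A minimal check, assuming those sizes are gigabytes in decimal units (1 PB = 10^6 GB); the function name is illustrative:

```python
GB_PER_PB = 10**6  # decimal units (assumption)


def storage_pb(samples, gb_per_sample):
    """Total storage in petabytes for a number of samples."""
    return samples * gb_per_sample / GB_PER_PB


if __name__ == "__main__":
    # Full genomes at low coverage, ~100 GB per sample (footnote 1)
    print("Genomes:", storage_pb(100_000, 100), "PB")
    # Exomes at medium coverage, ~20 GB per sample (footnote 2)
    print("Exomes:", storage_pb(100_000, 20), "PB")
```

The totals only match the slide when the per-sample sizes are read as gigabytes rather than gigabits.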
End-to-end clinical genomics informatics
• Blood draw to clinically relevant variants
• High performance analysis pipelines
• Large-scale data storage systems
• System-level workflow management
• Laboratory information management systems
• Orchestration around multiple storage and computer systems
• Closed loop system with independent validation paths (CLIA lab and exome chips)
The economics of volume
The current Transmission Gap
Site-based:
• CPU Cycle: 6 - 27 picocents
• 1 bit storage/year: 6 picocents
Cloud-based:
• CPU Cycle: 0.58 picocents
• 1 bit storage/year: 5.3 - 6 picocents
Cycles are 10x to 50x cheaper in the cloud!
1 bit network transfer: 800 - 6000 picocents (100x to 1000x more costly)
Adapted from: Radu Sion, Stony Brook University, 2009
A pico-cent($) is approx equivalent to a pico-cent(€)
This will not change
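One way to read the transmission gap is as a break-even question: how many CPU cycles must a job spend on each transferred bit before cheaper cloud cycles pay for the network transfer? The sketch below works that out from the slide's figures. It assumes the 6 - 27 picocent cycle cost is site-based and the 0.58 picocent cost is cloud-based (the reading that matches "10x to 50x cheaper in the cloud"). The break-even framing and all names are an illustration, not something stated on the slide.

```python
# Costs in picocents, taken from the slide (site/cloud assignment assumed).
SITE_CYCLE = (6.0, 27.0)          # low and high cost of one site-based CPU cycle
CLOUD_CYCLE = 0.58                # cost of one cloud-based CPU cycle
TRANSFER_PER_BIT = (800.0, 6000.0)  # low and high cost of moving one bit


def cloud_cycle_advantage():
    """How many times cheaper a cloud cycle is (low, high)."""
    return SITE_CYCLE[0] / CLOUD_CYCLE, SITE_CYCLE[1] / CLOUD_CYCLE


def break_even_cycles_per_bit():
    """Cycles per transferred bit needed for the cloud to pay off.

    Best case: cheapest transfer and largest per-cycle saving.
    Worst case: most expensive transfer and smallest per-cycle saving.
    """
    best = TRANSFER_PER_BIT[0] / (SITE_CYCLE[1] - CLOUD_CYCLE)
    worst = TRANSFER_PER_BIT[1] / (SITE_CYCLE[0] - CLOUD_CYCLE)
    return best, worst


if __name__ == "__main__":
    low, high = cloud_cycle_advantage()
    print("Cloud cycles cheaper by:", round(low, 1), "to", round(high, 1), "x")
    best, worst = break_even_cycles_per_bit()
    print("Break-even cycles per bit:", round(best), "to", round(worst))
```

Under these assumptions, a workload needs somewhere between tens and roughly a thousand cycles of computation per bit moved before shipping data to the cloud is worthwhile.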
Consider EXISTING “Data Ecosystems.”
– Commodity Internet:
• Email, Facebook, YouTube, Twitter, iTunes, Google
– BTB:
• Supply chain coordination
• Billing
• Customer management
• Distribution
– Financial Internet
• Banking
• Futures markets
• HFT
– “Physics Net”:
• LHC data
– Earth Science Information Partners (ESIP)
– Others
What other kinds of “Data Ecosystems” are forming?
• “Smart Grid”
– See IBM adverts
• “Industrial Internet”
– Heavy machinery. See recent announcement by GE
• “Health Care Internet” (a la 20 EPIC sites)
– Usable patient records available everywhere
• “Genome-net”
– Compare rare variants from select sub-populations in real time
In ALL of the cases above, there are clear financial incentives in play.
What other kinds of “Data Ecosystems” might we like to see form?
(Brainstorming…)
• “Climate-net” (must be global)
• “Aqua-net” (must be global)
• “Enviro-health net” (should be global)
I assert that in these cases, the fiscal drivers are not as obvious.
To overcome this impediment:
• Energy
– Passion (start with a compact group and grow)
– Funding
– Organizational forms that permit broader participation
– Incentives: DB citations, tool citations, Citizen Science, Crowd Sourcing, “Branding”
• Leadership
– Put forward a “framework” that MIGHT work (see USGS!)
– Pivot when it doesn’t work
• Persistence
– Complex science is hard, it will take time to do good
THANK YOU!