Stanley C. Ahalt, PhD Director, Renaissance Computing...

Data Science. Stanley C. Ahalt, PhD. Director, Renaissance Computing Institute; Professor of Computer Science, UNC-Chapel Hill; Director, Biomedical Informatics Core, NC TraCS

Transcript of Stanley C. Ahalt, PhD Director, Renaissance Computing...

Page 1: Stanley C. Ahalt, PhD Director, Renaissance Computing ...nas-sites.org/emergingscience/files/2013/01/Ahalt-NAS-Jan10-2013... · Director, Renaissance Computing Institute Professor

Data Science

Stanley C. Ahalt, PhD
Director, Renaissance Computing Institute
Professor of Computer Science, UNC-Chapel Hill
Director, Biomedical Informatics Core, NC TraCS

Page 2

I was tasked by the meeting’s organizers to:

• Summarize what I have heard during the last two days.
• Stimulate further thought and discussion around possible next steps.

Page 3

My Framework: RENCI

[Diagram: Cyberinfrastructure Catalyzing Research]

Capabilities: HPC, Visualization, Virtual Organizations, Science of Cyberinfrastructure, Data, Networks, Software, Analytics

Enviro/Coastal:
• E1: Storm Surge Modeling
• E2: NSF SSI (HydroShare)
• E3: WSSI Workspace

Health Sciences:
• H1: UNC CTSA
• H2: Sequencing
• H3: Secure Research
• H4: Decision Support

Cyberinfrastructure:
• C1: WSSI
• C2: REACH NC
• C3: E-iRODS
• C4: DataNet
• C5: NCDS
• C5: CIBER (NARA)
• C6: ORCA/BEN (NSF GENI)
• C7: ExoGENI

Page 4

(Overly simplistic) Summary:

• Data is easy:
– to generate
– to collect
– to accumulate
– to share (sometimes)

• But data is still hard
– to make findable
– to view as a common-good
– to manage
– to make understandable
– to sustain
– to value
– to cite
– to control (security)

• We are getting better or even good at:
– Dealing with “well resourced” data (LHC, ENCODE, …)
– Analyzing data in very creative ways
– Realizing the secondary and tertiary utility of data

Page 5

And there are some obvious things we lack

• Robust, shared cyber-infrastructure
• Broadly shared meta-collections (Amazon)
• Widely used sets of robust tools (Google Docs)

But despite the challenges

• There have been a remarkable number of data collections created, curated, and used.

Page 6

List of Collections Mentioned in the past 2 days

ENCODE, QIIME, CDC NHANES, EPA HPVIS, EPA IUR, EPA TRI, Household Products DB, Cosmetic Voluntary Reg. DB, EPA Pesticide Usage Data, ATSDR Tox Profiles, DEA NFLIS, ECOTOX DB, CESAR, DOE Indoor Air, NHEXAS, CTEPP, EPA NATA, EPA AIRS/AFS

EBI, NCBI, DDBJ, WormBase, VectorBase, SGD, NDAR, Protein Data Bank, GenBank, FlyBase, CMIP3, PRISM, CTD, Tox21, IRIS, HPVIS, ChemSpider, SEER, VDW, TCGA, BAM, CGHub

UN IPCC GHG, NIOSH NOES, EPA CHAD, EPA NHAPS, EU ESIS, UK Pharmaceutical Usage, ACToR (EPA), ToxRefDB, ToxCastDB, ExpoCastDB, BioRefDB, DevToxDB, PubChem, US Census, CMS Medicare Enrollment, MCAPS, GEO/SRA, Ensembl, Factorbook, UniProt, GO

Page 7

And despite the challenges…

• There has been a great deal of excellent research and science
– Many papers, and production has increased

• And excellent innovation:
– Mattingly’s CTD relationship “algebra”
– Cohen-Hubal’s relationship diagrams in ACTOR
– Richard’s efforts to quantify the structures that give rise to properties.

Page 8

Our problems and opportunities have arisen from technology

[Chart: Digital (%) and Paper (%) on a 0% to 120% axis, and $/GB on a logarithmic axis from $0 to $100,000, for 1986, 1993, 2000, and 2006. Data labels visible on the chart: 99%, 97%, 94%, 75%, 25%, 6%, 3%, 1%; $1,000.00, $10.00, $0.10.]

Page 9

The Five Vs: And now we have “Big” Data

• Volume: The Large Hadron Collider discards 99.999% of its data because the data cannot be processed!
• Velocity: Retail transactions, communications, industrial sensor data, demand real-time analysis and action.
• Variety: Health data includes images, test results, medical histories, doctor’s notes.
• Veracity: Data quality essential for discovery and informed decision making
• Value: How important or rare is the data, and what do we keep and for how long?

Data use cases are heterogeneous
• Importance of each V varies, even within same data set

Data management and analytics hardware and expertise are expensive and time consuming
• Can be barriers to entry, especially for small organizations and new researchers

Page 10

Furthermore, our problems are social

• We are cats, not dogs.
• We hoard data because it used to be scarce and because it gave us an advantage.
• We lack the mechanisms (financial and social) to unify small and large data communities.
• Change is hard!

Page 11

Each discipline has a different perspective on the problem

[Figure: speech bubbles reading “This is environmental data!”, “This is government!”, “This must be finance!”, “Astrophysics”, and “Genomics!”]

Page 12

(Big) Data challenges require thoughtful approaches

• To work through the social issues.
• To build better algorithms and visualizations
• To improve data management techniques
• To lower equipment and staffing costs
• To develop the workforce

(Big) Data must have the status of a science, using scientific methods to accumulate a body of knowledge

Page 13

Defining Data Science

Data science: the systematic study of the organization and use of digital data in order to accelerate research discoveries, improve critical decision-making processes, and enable a data-driven economy.


Page 14

The cost of data is a big deal

[Chart: Quantity of Global Digital Data, in zettabytes (y-axis 0.0 to 9.0), for 2005, 2010, 2012, and 2015. 1 ZB = 1 billion TB.]

• Current rate of growth fills the Library of Congress every 10 sec.
• From Allen’s talk: 1.8ZB/year

Source: EMC/IDC Digital Universe Study 2011. NSF.

Page 15

From Allen’s talk: 1.8ZB/year

• Assumed cost per kWh: $0.09
• Average watts per DDN shelf: 1750
• Hours per year: 8760

For 1.8ZB this works out to $13.8 B/yr energy cost
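The arithmetic behind this estimate can be sketched as follows. The three inputs and the $13.8 B/yr total come from the slide. The slide does not say how much data one DDN shelf holds, so this sketch only backs out the shelf count and per-shelf capacity that the quoted total implies. Both are derived values, not figures from the talk, and the sketch assumes 1 ZB = 10^21 bytes.

```python
# Inputs stated on the slide.
COST_PER_KWH_USD = 0.09      # assumed cost per kWh
WATTS_PER_SHELF = 1750       # average watts per DDN shelf
HOURS_PER_YEAR = 8760
QUOTED_TOTAL_USD = 13.8e9    # $13.8 B/yr energy cost quoted for 1.8 ZB

# Assumption (not on the slide): decimal units, 1 ZB = 10**21 bytes.
DATA_BYTES = 1.8e21


def annual_energy_cost_per_shelf(watts, hours, usd_per_kwh):
    """Yearly electricity cost in USD of one shelf running continuously."""
    kwh_per_year = watts / 1000 * hours
    return kwh_per_year * usd_per_kwh


per_shelf_usd = annual_energy_cost_per_shelf(
    WATTS_PER_SHELF, HOURS_PER_YEAR, COST_PER_KWH_USD
)

# Derived values: what the quoted total implies, not numbers from the talk.
implied_shelves = QUOTED_TOTAL_USD / per_shelf_usd
implied_tb_per_shelf = DATA_BYTES / implied_shelves / 1e12

print(f"Energy cost per shelf: ${per_shelf_usd:,.2f} per year")
print(f"Implied number of shelves: {implied_shelves:,.0f}")
print(f"Implied capacity per shelf: {implied_tb_per_shelf:,.0f} TB")
```

Under these assumptions one shelf costs roughly $1,380 per year to power, and the quoted total corresponds to about ten million shelves holding on the order of 180 TB each.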

Page 16

The cost of storing the human genome

• 10 PB = storage for 100,000 full human genomes at low coverage (1)
• 2 PB = storage for 100,000 human exomes at medium coverage (2)

Or:
• $5 to $25 million for UNC Health Care System to store every patient’s genome once on enterprise data storage
– Not including archived copies, not including analysis data sets

Or:
• $15 to $75 billion for the US to store every patient’s genome once
– This is the cost of disk space alone and is not a one-time cost

(1) Empirical data, assuming ~100 GB per sample for compressed fastq, bam, vcf, and ancillary data files at coverage between 3-15x
(2) Empirical data, assuming ~20 GB per sample at around 30x, storing only compressed fastq and bam files
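The storage totals follow directly from the footnotes, as the sketch below shows. It reads the per-sample sizes as gigabytes and uses decimal units (1 PB = 1,000,000 GB). The second half is an illustration only: it assumes, which the slide does not state, that the $5 to $25 million figure corresponds to roughly the 10 PB case, and backs out the price per gigabyte that would imply.

```python
GB_PER_PB = 1_000_000  # decimal units (assumption)


def total_petabytes(samples, gb_per_sample):
    """Total storage in PB for a number of samples of a given size."""
    return samples * gb_per_sample / GB_PER_PB


def implied_usd_per_gb(total_cost_usd, petabytes):
    """Price per GB implied by a total cost and a total volume."""
    return total_cost_usd / (petabytes * GB_PER_PB)


# Figures from the slide footnotes.
genomes_pb = total_petabytes(100_000, 100)  # full genomes, ~100 GB each
exomes_pb = total_petabytes(100_000, 20)    # exomes, ~20 GB each

# Hypothetical pairing of the cost range with the 10 PB case.
low_usd_per_gb = implied_usd_per_gb(5e6, genomes_pb)
high_usd_per_gb = implied_usd_per_gb(25e6, genomes_pb)

print(f"Genomes: {genomes_pb} PB, exomes: {exomes_pb} PB")
print(f"Implied price: ${low_usd_per_gb:.2f} to ${high_usd_per_gb:.2f} per GB")
```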

Page 17

End-to-end clinical genomics informatics

• Blood draw to clinically relevant variants
• High performance analysis pipelines
• Large-scale data storage systems
• System-level workflow management
• Laboratory information management systems
• Orchestration around multiple storage and computer systems
• Closed loop system with independent validation paths (CLIA lab and exome chips)

Page 18

The economics of volume: The current Transmission Gap

Site-based:
• CPU cycle: 6-27 picocents
• 1 bit storage/year: 6 picocents

Cloud-based:
• CPU cycle: 0.58 picocents
• 1 bit storage/year: 5.3-6 picocents

Cycles are 10x to 50x cheaper in the cloud!

1 bit network transfer: 800-6000 picocents
– 100x to 1000x more costly

This will not change

Adapted from: Radu Sion, Stony Brook University, 2009
A pico-cent ($) is approx equivalent to a pico-cent (€)
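The slide's figures can be checked with a few lines of arithmetic. The sketch below assumes the 6-27 picocent figure is the site-based cost of a CPU cycle and 0.58 picocents the cloud-based cost, which is the reading consistent with the "10x to 50x" label. The break-even calculation is an illustration that does not appear on the slide: it asks how many CPU cycles of work each transferred bit must carry before the cycle savings in the cloud cover the cost of moving that bit.

```python
# Per-unit costs in picocents, as read from the slide.
SITE_CPU_CYCLE = (6.0, 27.0)    # site-based cost per cycle, low and high
CLOUD_CPU_CYCLE = 0.58          # cloud-based cost per cycle
NETWORK_BIT = (800.0, 6000.0)   # cost to transfer one bit, low and high


def cycle_cost_ratio(site_cost, cloud_cost):
    """How many times more a site-based cycle costs than a cloud cycle."""
    return site_cost / cloud_cost


def break_even_cycles_per_bit(transfer_cost, site_cost, cloud_cost):
    """Cycles per transferred bit at which cloud savings equal transfer cost."""
    return transfer_cost / (site_cost - cloud_cost)


ratio_low = cycle_cost_ratio(SITE_CPU_CYCLE[0], CLOUD_CPU_CYCLE)
ratio_high = cycle_cost_ratio(SITE_CPU_CYCLE[1], CLOUD_CPU_CYCLE)

# Best case: cheapest transfer, most expensive site cycles.
best_case = break_even_cycles_per_bit(
    NETWORK_BIT[0], SITE_CPU_CYCLE[1], CLOUD_CPU_CYCLE
)
# Worst case: most expensive transfer, cheapest site cycles.
worst_case = break_even_cycles_per_bit(
    NETWORK_BIT[1], SITE_CPU_CYCLE[0], CLOUD_CPU_CYCLE
)

print(f"Site vs cloud cycle cost: {ratio_low:.1f}x to {ratio_high:.1f}x")
print(f"Break-even: {best_case:.0f} to {worst_case:.0f} cycles per bit")
```

With these inputs the cycle-cost ratio lands at roughly 10x to 47x, in line with the slide's label, and the break-even point ranges from tens to about a thousand cycles per transferred bit.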

Page 19

Consider EXISTING “Data Ecosystems.”

– Commodity Internet:
• Email, Facebook, YouTube, Twitter, iTunes, Google

– BTB:
• Supply chain coordination
• Billing
• Customer management
• Distribution

– Financial Internet
• Banking
• Futures markets
• HFT

– “Physics Net”:
• LHC data

– Earth Science Information Partners (ESIP)

– Others

Page 20

What other kinds of “Data Ecosystems” are forming?

• “Smart Grid”
– See IBM adverts
• “Industrial Internet”
– Heavy machinery. See recent announcement by GE
• “Health Care Internet” (ala 20 EPIC sites)
– Usable patient records available everywhere
• “Genome-net”
– Compare rare variants from select sub-populations in real time

In ALL of the cases above, there are clear financial incentives in play.

Page 21

What other kinds of “Data Ecosystems” might we like to see form?

(Brainstorming…)

• “Climate-net” (must be global)
• “Aqua-net” (must be global)
• “Enviro-health net” (should be global)

I assert that in these cases, the fiscal drivers are not as obvious.

Page 22

To overcome this impediment:

• Energy
– Passion (start with a compact group and grow)
– Funding
– Organizational forms that permit broader participation
– Incentives: DB citations, tool citations, Citizen Science, Crowd Sourcing, “Branding”
• Leadership
– Put forward a “framework” that MIGHT work (see USGS!)
– Pivot when it doesn’t work
• Persistence
– Complex science is hard, it will take time to do good

Page 23

THANK YOU!