Keynote on 2015 Yale Day of Data
![Page 1: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/1.jpg)
Big Data & Analytics: Five Trends and Five Research Challenges
Robert Grossman University of Chicago
& Open Data Group
September 18, 2015
![Page 2: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/2.jpg)
Part 1 What is Big Data?
Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.
Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.
![Page 3: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/3.jpg)
• Volume
• Velocity
• Variety
• Veracity
• Value

• Megabytes
• Gigabytes
• Terabytes
• Petabytes
• Exabytes
• Zettabytes
![Page 4: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/4.jpg)
The Name Changes
1830 statistics
1980 computationally intensive statistics
1993 data mining & knowledge discovery in databases
1997 business analytics
2004 predictive analytics
2011 big data, data science & data analytics
Source: Google Trends, www.google.com/trends
![Page 5: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/5.jpg)
What is Big Data? (Operations POV)
A marketing term introduced by O’Reilly: Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Source: Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.
![Page 6: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/6.jpg)
What is Big Data? (POV: New Types of Data that IT Cannot Manage)
| Period | New types of data | Term used |
|---|---|---|
| 1990s | Clicks on the Internet, POS transactions | Data mining |
| 2000s | Unstructured data, graph data | Predictive analytics |
| 2010s | Mobile data, IoT data | Big data |
![Page 7: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/7.jpg)
What Is Small Data?
• 100 million movie ratings
• 480 thousand customers
• 17,000 movies
• From 1998 to 2005
• Less than 2 GB of data.
• Fits into memory, but very sophisticated models were required to win.
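The "less than 2 GB" figure can be sanity-checked with a little arithmetic. The sketch below assumes a hypothetical fixed-width record layout (the slide does not say how the ratings were stored); the field names and byte widths are illustrative only.

```python
# Hypothetical fixed-width layout for one rating record. This is an
# assumption for illustration, not the format of the original data set.
BYTES_PER_FIELD = {
    "customer_id": 4,  # 480 thousand customers fit in a 32-bit integer
    "movie_id": 2,     # 17,000 movies fit in a 16-bit integer
    "rating": 1,       # a small integer score
    "date": 2,         # days since a reference date, covering 1998 to 2005
}

NUM_RATINGS = 100_000_000  # 100 million ratings, from the slide


def dataset_size_gb(num_records: int, layout: dict) -> float:
    """Return the size in GB (10**9 bytes) of num_records fixed-width records."""
    return num_records * sum(layout.values()) / 1e9


size_gb = dataset_size_gb(NUM_RATINGS, BYTES_PER_FIELD)
print(f"Approximate in-memory size: {size_gb:.1f} GB")
```

Under these assumptions the whole data set fits comfortably in the memory of a single machine, which is the point of calling it small data.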
![Page 8: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/8.jpg)
What are the origins of big data?
![Page 9: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/9.jpg)
Basic Choice with Hardware: Scale Up or Out
Scale up:
• More memory, more processors, more disk ($K)
• Specialized hardware (e.g. connects) ($100K)
• Specialized devices ($M)

Scale out:
• One machine
• Cluster (racks) ($100K)
• Cyber pod ($M)
• Distributed cyber pods ($10M+)
![Page 10: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/10.jpg)
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).
![Page 11: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/11.jpg)
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
![Page 12: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/12.jpg)
Source: Terence Kawaja, http://www.slideshare.net/tkawaja
![Page 13: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/13.jpg)
What is Big Data? (My computer is a data center POV)
• The leaders in big data analytics measure data in megawatts.
– As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
– Facebook’s new Prineville data center is 30 MW.
![Page 14: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/14.jpg)
Part 2 What is Analytics?
Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.
![Page 15: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/15.jpg)
What is Analytics?
Short Definition
• Using data to make decisions.
Longer Definition
• Using data to take actions and make decisions using models that are statistically valid and empirically derived.
Definition of Statistics from ASA web page:
• Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …
Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.
![Page 16: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/16.jpg)
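[Figure: timeline of names for the field, with representative data sources and algorithms]
1984 Computationally Intensive Statistics
1993 Data Mining & KDD
2004 Predictive Analytics
2011 Big Data & Data Science
Data sources: Direct marketing, POS, Internet, Devices/IoT
Algorithms: ID3 & C4.5, PageRank, Spanner TX algorithm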
![Page 17: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/17.jpg)
![Page 18: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/18.jpg)
1. Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively).
2. Compute where to put additional armor to maximize the chance that planes return.
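The slide states the problem but not its catch: bullet-hole counts can only be recorded on planes that return, so the data suffer from survivorship bias. The simulation below is a minimal sketch of that effect. The per-section lethality values are made-up assumptions, not historical estimates, and hits are assumed to land uniformly across the four sections.

```python
import random

SECTIONS = ["tail", "wing", "fuselage", "other"]

# Hypothetical probability that a single hit in each section downs the plane.
# These numbers are illustrative assumptions only.
LETHALITY = {"tail": 0.05, "wing": 0.05, "fuselage": 0.10, "other": 0.60}


def simulate(num_planes: int, hits_per_plane: int, seed: int = 0):
    """Return (holes on all planes, holes on returning planes) per section."""
    rng = random.Random(seed)
    all_hits = dict.fromkeys(SECTIONS, 0)
    observed = dict.fromkeys(SECTIONS, 0)
    for _ in range(num_planes):
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        for section in hits:
            all_hits[section] += 1
        # A plane returns only if it survives every hit it takes.
        survived = all(rng.random() > LETHALITY[section] for section in hits)
        if survived:
            for section in hits:
                observed[section] += 1
    return all_hits, observed


all_hits, observed = simulate(num_planes=20_000, hits_per_plane=4)
total_observed = sum(observed.values())
for section in SECTIONS:
    share = observed[section] / total_observed
    print(f"{section:9s} share of holes on returning planes: {share:.2f}")
```

In this toy model the most lethal section shows the fewest holes on returning planes, even though it is hit as often as any other. Armoring where the observed holes cluster would therefore protect the wrong place.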
![Page 19: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/19.jpg)
Part 3. Data Science
![Page 20: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/20.jpg)
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
Some fields have a single instrument, costing a billion dollars or more, that generates big data.
![Page 21: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/21.jpg)
A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.
Some fields have hundreds or thousands of million dollar instruments that in aggregate produce big data.
![Page 22: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/22.jpg)
Some fields have millions of hundred dollar sensors that in aggregate produce big data.
![Page 23: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/23.jpg)
Math & Statistics
Computer Science
Disciplinary Science
Data Science
![Page 24: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/24.jpg)
Understanding Salmon (A Cautionary Tale)
Source: Salmo salar, (Atlantic Salmon), wikipedia.org
![Page 25: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/25.jpg)
Methods
Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.
Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.
Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
![Page 26: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/26.jpg)
Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain, further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels, a total of 16 voxels were significant.
![Page 27: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/27.jpg)
The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
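To see why uncorrected multiple tests mislead, consider the salmon study's search volume. The sketch below assumes a per-voxel threshold of 0.001 and independent tests. Both are simplifying assumptions for illustration (neighbouring voxels are correlated in practice, and the slide does not give the study's exact thresholds). It then shows the Bonferroni correction, one standard and conservative remedy.

```python
NUM_VOXELS = 8064  # search volume quoted on the slide
ALPHA = 0.001      # assumed per-voxel significance threshold (illustrative)

# Expected number of false positives if every voxel is pure noise.
expected_false_positives = NUM_VOXELS * ALPHA

# Probability of at least one false positive, assuming independent tests.
prob_any_false_positive = 1 - (1 - ALPHA) ** NUM_VOXELS

# Bonferroni correction: split the family-wise error rate across all tests.
FAMILY_WISE_ALPHA = 0.05
bonferroni_threshold = FAMILY_WISE_ALPHA / NUM_VOXELS

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {prob_any_false_positive:.4f}")
print(f"Bonferroni per-voxel threshold: {bonferroni_threshold:.2e}")
```

Under these assumptions, a handful of "active" voxels in a dead fish is what pure noise would be expected to produce.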
![Page 28: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/28.jpg)
Part 4. What Instrument Do we Use to Make Discoveries in Data Science?
How do we build a “datascope?”
![Page 29: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/29.jpg)
experimental science: 1609 (30x), 1670 (250x)
simulation science: 1976 (10x-100x)
data science
![Page 30: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/30.jpg)
experimental science: 1609 (30x), 1670 (250x)
simulation science: 1976 (10x-100x)
data science: 2004 (10x-100x), “Cyberpod”
![Page 31: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/31.jpg)
Could we continuously re-analyze the world’s cancer data?
![Page 32: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/32.jpg)
Complex statistical models over small data that are highly manual and updated infrequently.
Simpler statistical models over large data that are highly automated and updated frequently.
memory databases
GB TB PB
W KW MW
datapods
cyber pods
![Page 33: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/33.jpg)
Part 5 Five Trends
Source: Google Trends, for term “data commons”, www.google.com/trends.
![Page 34: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/34.jpg)
Trend 1 Data Commons
Source: NEXRAD, NOAA, www.noaa.org
![Page 35: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/35.jpg)
The Standard Model of Biomedical Computing No Longer Works
Public data repositories
Private local storage & compute
Network download
Local data ($1K)
Community software
Software, sweat and tears ($100K)
![Page 36: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/36.jpg)
Data Commons
Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
![Page 37: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/37.jpg)
Open Science Data Cloud (Open Cloud Consortium, 2012)
NCI Data Commons (UChicago, Nov 2015)
Bionimbus Protected Data Cloud (UChicago, 2013)
NOAA Data Commons (Open Cloud Consortium, Oct 2015)
![Page 38: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/38.jpg)
![Page 39: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/39.jpg)
![Page 40: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/40.jpg)
Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.
![Page 41: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/41.jpg)
Hospitals, medical research centers and doctors
Data commons containing genomic and clinical data.
Patients
Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.
![Page 42: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/42.jpg)
Trend 2 Analytics of Things, People and Places
Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/
![Page 43: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/43.jpg)
People and things generating streaming data that are relevant for research.
![Page 44: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/44.jpg)
Places that generate data
Source: Jane Macfarlane, Here, a Division of Nokia.
![Page 45: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/45.jpg)
Trend 3 Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis
Source: M. Bostock, http://bl.ocks.org/mbostock/4063318
![Page 46: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/46.jpg)
Portable Format for Analytics (PFA)
Predictive Model Markup Language (PMML)
Grammar of Graphics
d3.js
![Page 47: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/47.jpg)
Trend 4 More Policies That Make Data Available and Analytics Repeatable
![Page 48: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/48.jpg)
Executive Order 13642 (May 9, 2013): Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)
OMB Guidance
President’s Executive Order
![Page 49: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/49.jpg)
Trend 5 Translational Data Science
How do we translate data-driven discoveries into actions that impact society?
![Page 50: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/50.jpg)
Imaging Informatics
Clinical Informatics
Bioinformatics
Public Health Informatics
Basic Research
Applied Research
Practice (dx, treatment and prevention)
Molecular & cellular processes
Tissues & organs
Individuals (patients)
Groups & populations
Quality & outcomes
Translational Informatics
![Page 51: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/51.jpg)
New algorithms, new statistical models (data science)
Applications to genomics, analysis of EMR, etc.
Software stacks for data intensive computing (data engineering)
Data driven discoveries
Data driven diagnosis
Data driven therapeutics
Develop software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)
Translational Data Science
![Page 52: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/52.jpg)
Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.
![Page 53: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/53.jpg)
Part 6 Five Challenges
![Page 54: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/54.jpg)
Challenge 1. Is More Different?
Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
Do New Phenomena Emerge at Scale in Data?
![Page 55: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/55.jpg)
Challenge 2. One Million Genomes
• Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
• Think of this as one hundred studies with 10,000 patients each over three years.
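The figures on this slide follow from simple arithmetic, reproduced in the sketch below. Decimal units are assumed (1 TB = 10^12 bytes), and the 10x compression ratio is the one implied by going from about 1000 PB to about 100 PB; the variable names are illustrative.

```python
TB, PB, EB = 10**12, 10**15, 10**18  # decimal units (an assumption)

num_genomes = 1_000_000
bytes_per_patient = 1 * TB    # tumor plus normal tissue, from the slide
cost_per_genome_usd = 1_000   # from the slide
compression_ratio = 10        # implied by 1000 PB shrinking to about 100 PB

raw_bytes = num_genomes * bytes_per_patient
compressed_bytes = raw_bytes // compression_ratio
sequencing_cost_usd = num_genomes * cost_per_genome_usd

num_studies = 100
patients_per_study = num_genomes // num_studies

print(f"Raw data: {raw_bytes // PB} PB ({raw_bytes // EB} EB)")
print(f"Compressed: {compressed_bytes // PB} PB")
print(f"Sequencing cost: ${sequencing_cost_usd:,}")
print(f"Patients per study: {patients_per_study:,}")
```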
![Page 56: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/56.jpg)
Challenge 3. Datapods
• Databases have fundamentally changed the way we manage and analyze scientific data.
• NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.
• If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.
• Call this a “datapod.”
• It could support open source data commons and allow them to peer.
![Page 57: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/57.jpg)
Challenge 4. A Billion Predictive Models
• Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models.
• Applications
– George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).
– 1 million cancer genomes x 1,000 models / genome.
– Urban science: instrumenting cities.
– Consumer marketing: large advertisers will see 1-3 billion different consumers.
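A segmented model is, at its simplest, one small model per segment of the data. The sketch below shows that idea on a toy scale, fitting an ordinary least-squares line per segment key. The function name, record format, and choice of a linear model are assumptions for illustration; the slide's heterogeneous models, and the automation needed to reach billions of them, go well beyond this.

```python
from collections import defaultdict
from statistics import mean


def fit_segmented_models(records):
    """Fit one least-squares line y = a + b*x per segment.

    `records` is an iterable of (segment_key, x, y) tuples.
    Returns {segment_key: (intercept, slope)}.
    """
    by_segment = defaultdict(list)
    for key, x, y in records:
        by_segment[key].append((x, y))

    models = {}
    for key, points in by_segment.items():
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
        # Fall back to a constant model when x does not vary in the segment.
        slope = 0.0 if sxx == 0 else sxy / sxx
        models[key] = (y_bar - slope * x_bar, slope)
    return models


# Toy data: two hypothetical segments with different underlying trends.
records = [
    ("a", 0, 1.0), ("a", 1, 3.0), ("a", 2, 5.0),
    ("b", 0, 10.0), ("b", 1, 9.0), ("b", 2, 8.0),
]
models = fit_segmented_models(records)
for key, (intercept, slope) in sorted(models.items()):
    print(f"segment {key}: y = {intercept:.1f} + {slope:.1f} * x")
```

Because each segment is fitted independently, the loop over segments parallelizes naturally, which is one reason the approach is attractive at scale.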
![Page 58: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/58.jpg)
Challenge 5. HDSI
• Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.
• Think of Human Data Science Interaction (HDSI) as the study of how humans interact with the software supporting data science analysis at the scale of datapods, with a billion models and trillions of hypotheses.
• How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?
![Page 59: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/59.jpg)
Questions?
rgrossman.com @bobgrossman
![Page 60: Keynote on 2015 Yale Day of Data](https://reader030.fdocuments.in/reader030/viewer/2022020203/589bc6851a28ab082b8b62d3/html5/thumbnails/60.jpg)
For More Information
cdis.uchicago.edu
www.opendatagroup.com
rgrossman.com