Transcript of Data and Society Lecture 3: Data-driven Science
Fran Berman, Data and Society, CSCI 4370/6370
Data and Society Lecture 3: Data-driven Science
2/12/16
Announcements
• Section 1 Exam February 26.
– Practice test given out at the end of lecture today.
Today (2/12/16)
• Any questions about Lecture 2?
• Lecture 3: Data-driven Science
– Some history
– IT-enabled Applications
– Data and Computing at SDSC
– Guiding the future of data Science
• Break
• Data Roundtable (L2)
Date / First “half” / Second “half”
Section 1: The Data Ecosystem – Fundamentals
• January 29 – Class introduction; Digital data in the 21st Century (L1) – Data Roundtable / Fran
• February 5 – Data Stewardship and Preservation (L2) – L1 Data Roundtable / 5 students
• February 12 – Data-driven Science (L3) – L2 Data Roundtable / 5 students
• February 19 – Future infrastructure: Internet of Things (L4) – L3 Data Roundtable / 5 students
• February 26 – Section 1 Exam – L4 Data Roundtable / 5 students
Section 2: Data and Innovation – How has data transformed science and society?
• March 4 – Paper assignment description – Section 1 Data Roundtable / 5 students
• March 11 – Data and Health: Phil Bourne guest lecture (L5) – Section 2 Data Roundtable / 5 students
• March 18 – Spring Break / no class
• March 25 – Data and Entertainment (L6) – L5 Data Roundtable / 5 students
• April 1 – Big Data Applications (L7) – L6 Data Roundtable / 5 students
Section 3: Data and Community – Social infrastructure for a data-driven world
• April 8 – Data in the Global Landscape (L8); Section 2 paper due – L7 Data Roundtable / 5 students
• April 15 – Digital Rights (L9) – L8 Data Roundtable / 5 students
• April 22 – Bulent Yener guest lecture, Data Security (L10) – L9 Data Roundtable / 5 students
• April 29 – Digital Governance and Ethics (L11) – L10 Data Roundtable / 5 students
• May 6 – Section 3 Exam – L11 Data Roundtable / 5 students
We are here
Modeling, simulation, and analysis are critical tools in addressing science and societal challenges:
• What is the potential impact of global warming?
• How will natural disasters affect urban centers?
• What therapies can be used to cure or control cancer?
• Can we accurately predict market outcomes?
• What plants work best for biofuels?
• Is there life on other planets?
Computational science: an increasing focus in the ’80s and ’90s (data issues often in the background …)
• Many reports in 80’s and early 90’s focused on the potential of information technologies (primarily computers and high-speed networks) to address key scientific and societal challenges
• First federal “Blue Book” in 1992 focused on key computational problems including
– Weather forecasting
– Cancer genes
– Predicting new superconductors
– Aerospace vehicle design
– Air pollution
– Energy conservation and turbulent combustion
– Microsystems design and packaging
– Earth’s biosphere
– Broader education resources
Enabling IT: Increasing focus on a broader spectrum of resources
[Figure: applications plotted against three resource axes – COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more BW):
• Home, lab, campus, desktop applications
• Compute-intensive HPC applications
• Data-intensive applications
• Data-intensive and compute-intensive HPC applications
• Compute-intensive grid, distributed, and cloud applications
• Data-oriented grid, distributed, and cloud applications]
More key resources: software; human resources (workforce)
• ’80s, ’90s +: Computational Science
• ’90s, ’00s +: Informatics
• ’00s, ’10s +: Data Science
In the beginning … The Branscomb Pyramid, circa 1993
Branscomb Pyramid provides a framework to associate computational power with community use.
Original Branscomb Committee Report (“From Desktop to TeraFlop”) at http://www.csci.psu.edu/docs/branscomb.txt
The Branscomb Pyramid, circa 2016
• Small-scale devices and personal computers (MF, GF)
• Small-scale campus/commercial clusters (TF)
• Large-scale campus/commercial resources, center supercomputers (TF, PF)
• Leadership class (PF, EF)
Opportunities for Innovation at all levels …
Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
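The prefix ladder above can be applied mechanically. A small helper (illustrative, not from the lecture) that labels a FLOP/s figure with the metric prefixes in the table:

```python
# Metric prefixes from the table, each a factor of 10^3 apart.
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def label_flops(flops):
    """Return e.g. 1.5e15 -> '1.5 PetaFLOPS' by repeatedly dividing by 1000."""
    magnitude = 0
    while flops >= 1000 and magnitude < len(PREFIXES) - 1:
        flops /= 1000.0
        magnitude += 1
    return f"{flops:g} {PREFIXES[magnitude]}FLOPS"

# The 2016 pyramid tiers in these units:
assert label_flops(1.5e15) == "1.5 PetaFLOPS"   # leadership class (PF)
assert label_flops(2e9) == "2 GigaFLOPS"        # personal devices (GF)
```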
Also in 1993: The Top500 List created to rank supercomputers
• TOP500 list ranks and details the 500 most powerful supercomputers in the world
• Most powerful = performance on the LINPACK benchmark.
• Rankings provide invaluable statistics on supercomputer trends by country, vendor, sector, processor characteristics, etc.
• List compiled by Hans Meuer of University of Mannheim, Jack Dongarra of University of Tennessee, and Erich Strohmaier and Horst Simon of NERSC / LBNL. List comes out in November and June each year.
http://top500.org/
What the Top500 List measures
Rmax and Rpeak values are in TFlops
• Computers assessed based on their performance on the LINPACK Benchmark – calculating the solution to a dense system of linear equations.
– User may scale the size of the problem and optimize the software in order to achieve the best performance for a given machine
– Algorithm used must conform to LU factorization with partial pivoting (operation count for the algorithm must be 2/3 n^3 + O(n^2) double-precision floating point operations).
• Rpeak values calculated using the advertised clock rate of the CPU. (theoretical performance)
• Rmax = maximal LINPACK performance achieved (actual performance)
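The Rpeak/Rmax distinction above can be made concrete. A minimal sketch (assumed machine numbers, not a real system or the official benchmark code): the flop count the Top500 credits to a LINPACK run, a theoretical Rpeak from the advertised clock rate, and a measured Rmax from wall-clock time:

```python
def linpack_flops(n):
    """Operation count credited for LU with partial pivoting on an
    n x n dense system: 2/3 n^3 + 2 n^2 double-precision flops."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def rpeak(cores, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires its maximum flops each cycle."""
    return cores * clock_hz * flops_per_cycle

# Hypothetical cluster: 1,024 cores at 2.0 GHz, 8 flops per cycle per core.
peak = rpeak(1024, 2.0e9, 8)                 # 16.384 TFlop/s (theoretical)
n, wall_seconds = 100_000, 58.0              # user-chosen problem size and run time
rmax = linpack_flops(n) / wall_seconds       # measured performance
efficiency = rmax / peak                     # always below 1.0 in practice
```

The user is free to scale `n` upward; a larger problem usually raises the measured efficiency because the O(n^3) compute dominates overheads.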
Rensselaer CCI Blue Gene Q on current Top500 list (November 2015):
• 97th most powerful supercomputer in the world
• 30th most powerful Academic supercomputer in the world
• 5th most powerful Academic supercomputer in the US
Performance Development (Slide courtesy of Jack Dongarra)
[Chart: Top500 performance development, 1993–2011, on a log scale from 100 MFlop/s to 100 PFlop/s. In 1993 the #1 system (N=1) delivered 59.7 GFlop/s, the #500 system 400 MFlop/s, and the list total (SUM) 1.17 TFlop/s. By 2011, N=1 reached 10.5 PFlop/s, N=500 reached 51 TFlop/s, and SUM reached 74 PFlop/s. Performance at a given rank trails the #1 spot by roughly 6–8 years. For comparison: Jack’s laptop (12 Gflop/s); Jack’s iPad2 & iPhone 4s (1.02 Gflop/s).]
Fast forward to 2016: Coordinated / integrated research information infrastructure is needed on all axes to drive application innovation
Applications span the COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more BW) axes:
• Protein analysis and modeling of function and structures
• Storage and analysis of data from the CERN Large Hadron Collider
• Development of biofuels
• Cosmology
• Seti@home, MilkyWay@Home, BOINC
• Real-time disaster response
Data has emerged as the “4th Paradigm” for research and discovery
Experimental methods
Theoretical modeling
Computational methods
Data analysis
Increasing Federal Expectations for Data
Data Management Plans:
• Grantee specification of plans for use, stewardship and preservation of data from their projects.
• Mandatory part of proposal process for increasing number of federal agencies
Agency Plans for Public Access of Research Data and Publications
• Relevant to federal R&D agencies with over $100M in research expenditures (NSF, NIH, DOE, etc.)
• Agencies required to provide strategies for increasing / enhancing discoverability, access, dissemination, reproducibility, stewardship, preservation
Multi-paradigm applications
• Terashake Earthquake Simulation
• Large Hadron Collider Data Analysis
Earthquake Simulation
Background:
• The Earth is constantly evolving through the movement of “plates”.
• In plate tectonics, the Earth's outer shell (lithosphere) is posited to consist of seven large and many smaller moving plates.
• As the plates move, their boundaries collide, spread apart, or slide past one another, resulting in geological processes such as earthquakes, tsunamis, volcanoes, and mountain building, typically at plate boundaries.
Why Earthquake Simulations are Important
Terrestrial earthquakes damage homes, buildings, bridges, highways
Tsunamis come from earthquakes in the ocean
• If we understand how earthquakes can happen, we can
– Predict which places might be hardest hit
– Reinforce bridges and buildings to increase safety
– Prepare police, fire fighters and doctors in high-risk areas to increase their effectiveness
• Information technologies drive more accurate earthquake simulation
The magnitude 6.7 Northridge, California earthquake of 1994 caused an estimated $20B in damage
Major Earthquakes on the San Andreas Fault, 1680-present
1906 M 7.8
1857 M 7.8
1680 M 7.7
?
What would be the impact of an earthquake on the lower San Andreas fault?
Simulation decomposition strategy leverages parallel high performance computers
– Southern California partitioned into “cubes” then mapped onto processors of high performance computer
– Data choreography used to move data in and out of memory during processing
Builds on data and models from the Southern California Earthquake Center; the kinematic source (from Denali) focuses on Cajon Creek to Bombay Beach
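The cube decomposition described above can be sketched in a few lines. This is an illustrative toy (not SCEC's actual code): partition a 3D cell grid into equal blocks and assign one block per processor.

```python
def decompose(cells, procs):
    """Split a (nx, ny, nz) cell grid across a (px, py, pz) processor grid.
    Returns one ((x0,x1), (y0,y1), (z0,z1)) cell-range tuple per processor."""
    blocks = []
    for axis in range(3):
        n, p = cells[axis], procs[axis]
        assert n % p == 0, "assume cells divide evenly among processors"
        size = n // p
        blocks.append([(i * size, (i + 1) * size) for i in range(p)])
    # Cartesian product of per-axis blocks: one sub-volume per processor.
    return [(bx, by, bz) for bx in blocks[0] for by in blocks[1] for bz in blocks[2]]

# Toy layout: TeraShake's 3000 x 1500 x 400 mesh (200 m cells) on 240
# processors arranged 10 x 6 x 4 -- each owns a 300 x 250 x 100 sub-volume.
parts = decompose((3000, 1500, 400), (10, 6, 4))
assert len(parts) == 240
```

The data choreography then consists of streaming each processor's sub-volume in and out of memory as the time steps advance.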
TeraShake Simulation
Simulation of a magnitude 7.7 earthquake on the lower San Andreas Fault
• Physics-based dynamic source model – simulation of mesh of 1.8 billion cubes with spatial resolution of 200 m
• Simulated first 3 minutes of a magnitude 7.7 earthquake, 22,728 time steps of 0.011 second each
• The TeraShake 1 and 2 simulations generated 45+ TB of data
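A quick arithmetic check of the mesh size quoted above (my calculation, using the 600 x 300 x 80 km^3 domain and 200 m spatial resolution given on these slides):

```python
km = 1000  # meters
domain = (600 * km, 300 * km, 80 * km)   # TeraShake domain extents
resolution = 200                          # meters per cube edge
cells = 1
for extent in domain:
    cells *= extent // resolution
# 3000 * 1500 * 400 = 1.8 billion cubes, matching the slide.
assert cells == 1_800_000_000
```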
Behind the Scenes: TeraShake Data Choreography
• Resources must support a complicated orchestration of computation and data movement
• 47 TB of output data for 1.8 billion grid points
• Continuous I/O at 2 GB/sec
• 240 processors on SDSC DataStar for 5 days, with 1 TB of main memory
• “Fat nodes” of DataStar with 256 GB used for pre-processing and post-run visualization
• 10–20 TB of data archived per day
• Data staged between the parallel file system and “data parking” storage
• Finer-resolution simulations require even more resources; TeraShake scaled to run on petascale architectures
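A back-of-envelope check of those I/O figures (my arithmetic, not from the slides): how long does writing 47 TB take at the quoted continuous 2 GB/s, and what sustained rate does archiving 10–20 TB per day imply?

```python
TB = 1e12   # decimal units, as storage sizes are usually quoted
GB = 1e9

output_bytes = 47 * TB
write_seconds = output_bytes / (2 * GB)    # 23,500 s
write_hours = write_seconds / 3600         # ~6.5 hours of pure I/O

# Archiving 20 TB/day implies a sustained rate of about 0.23 GB/s.
archive_rate_gbs = 20 * TB / 86400 / GB
```

The point of the exercise: even a "fast" 2 GB/s pipe spends hours on a single simulation's output, which is why the I/O had to run continuously alongside the computation.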
TeraShake and Data
• Data Management
– 10 terabytes moved per day during execution over 5 days
– Derived data products registered into the SCEC digital library (total SCEC library held 168 TB)
• Data post-processing:
– Movies of seismic wave propagation
– Seismogram formatting for interactive on-line analysis
– Derived data: velocity magnitude, displacement vector field, cumulative peak maps, statistics used in visualizations

TeraShake Resources
Computers and systems
• 80,000 hours on IBM Power4 (DataStar)
• 256 GB-memory p690 used for testing; p655s used for production run; TeraGrid used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours of post-processing for high-resolution rendering
People
• 20+ people for IT support
• 20+ people in domain research
Storage
• SAM-QFS archival storage
• HPSS backup
• SRB collection with 1,000,000 files
TeraShake at Petascale – better prediction accuracy creates greater resource demands
Estimated figures for a simulated 240-second period, 100-hour run-time:
                            TeraShake domain          PetaShake domain
                            (600x300x80 km^3)         (800x400x100 km^3)
Fault system interaction    NO                        YES
Inner scale                 200 m                     25 m
Resolution of terrain grid  1.8 billion mesh points   2.0 trillion mesh points
Magnitude of earthquake     7.7                       8.1
Time steps                  20,000 (.012 sec/step)    160,000 (.0015 sec/step)
Surface data                1.1 TB                    1.2 PB
Volume data                 43 TB                     4.9 PB

Information courtesy of the Southern California Earthquake Center
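The mesh-point rows of the table can be sanity-checked from the domain sizes and inner scales (my arithmetic): mesh points scale as domain volume divided by the cube of the inner scale.

```python
def mesh_points(dims_km, scale_m):
    """Number of grid cells for a box of dims_km (km) at scale_m (m) spacing."""
    nx, ny, nz = (int(d * 1000 / scale_m) for d in dims_km)
    return nx * ny * nz

tera = mesh_points((600, 300, 80), 200)    # TeraShake: 1.8e9
peta = mesh_points((800, 400, 100), 25)    # PetaShake: ~2.0e12
assert tera == 1_800_000_000
assert peta == 2_048_000_000_000           # the table's "2.0 trillion"

# Refining the inner scale from 200 m to 25 m is a factor of 8 per axis,
# i.e. 8^3 = 512x more mesh points before the larger domain is counted.
```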
Application Evolution
• TeraShake evolved into PetaSHA, PetaShake, CyberShake, etc. at SCEC
• Evolving applications improving resolution, models, simulation accuracy, scope of results, etc.
• PetaSHA foci:
– Create a hierarchy of simulations for the 10 most probable large (M>7) ruptures in southern California
– Validation of earthquake simulations using well-recorded regional events (M<=6.7) and assimilation of regional waveform data into community velocity models
– Validation of hazard curves and extension of maps to higher frequencies and more extensive geographic coverage, creating rich new database for earthquake scientists and engineers
The Large Hadron Collider (LHC)
• LHC is the world's most powerful particle collider.
• LHC's goal is to allow physicists to test the predictions of different theories of particle physics and high-energy physics (in particular the properties of the Higgs boson) and the large family of new particles predicted by supersymmetric theories.
• LHC contains seven detectors, each designed for a different kind of research.
• LHC was built near Geneva between 1998 and 2008 in collaboration with over 10,000 scientists and engineers from over 100 countries.
• LHC lies in a 17 mile circumference tunnel beneath the France-Switzerland border.
• LHC collisions produce 10’s of PBs of data per year.
– Subset of data analyzed by a distributed grid of 170+ computing centers in 36 countries
A collider is a type of particle accelerator with two directed beams of particles. In particle physics, colliders are used as a research tool: they accelerate particles to very high kinetic energies and let them impact other particles. Analysis of the byproducts of these collisions gives scientists good evidence of the structure of the subatomic world and the laws of nature governing it. Many of these byproducts are produced only by high-energy collisions, and they decay after very short periods of time; thus many of them are hard or nearly impossible to study in other ways.
Information from Jamie Shiers and Wikipedia
Higgs and Beyond
“A major goal of Run 1 of the LHC was to find evidence of the Higgs boson…” Its discovery was announced on July 4, 2012. After a prolonged downtime to prepare for running at almost twice the energy of Run 1, the LHC restarted in 2015 and will run (mainly during the summer months) for another 3 years before yet another upgrade. Major goals of Run 2 are to search “beyond the Standard Model”, including searches for dark energy and dark matter.
Information from Jamie Shiers, CERN
Worldwide LHC Computing Grid
Image from http://wlcg.web.cern.ch/
Data: Outlook for HL-LHC
• The LHC – including all foreseen upgrades – will run until circa 2040. By that time, between 10 and 100 EB of data will have been gathered.
• These data (the uninteresting stuff has already been discarded) should be preserved for a number of decades.
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
• At least 0.5 EB / year (x 10 years of data taking)
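The estimate above can be carried through explicitly (my arithmetic, not CERN's figures): 0.5 EB of new RAW data per year over 10 years of data taking, before derived data and simulation are added on top.

```python
EB = 1e18  # exabyte, decimal units

raw_per_year = 0.5 * EB
years = 10
raw_total = raw_per_year * years      # 5 EB of RAW data alone

# Derived data (ESD, AOD), simulation, and user data multiply the RAW
# volume; multipliers of roughly 2x to 20x would span the 10-100 EB
# range quoted for circa 2040.
low, high = raw_total * 2, raw_total * 20
```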
[Chart: data volume per run in PB (0–450) for ALICE, ATLAS, CMS, and LHCb across Run 1 through Run 4, growing steeply with each run. “We are here!” marks Run 2.]
Slide adapted from Jamie Shiers, CERN
LHC – Stewardship and Preservation Challenges
• Significant volumes of high energy physics data are thrown away “at birth” – i.e. via very strict filters (aka triggers) before writing to storage. To a first approximation, all remaining data needs to be preserved for a few decades.
– LHC data particularly valuable as reproducibility of experiments is tremendously expensive and almost impossible to achieve
• Tier 0 and 1 sites currently provide bit preservation at scale
– Data more usable and accessible when services coupled with bit preservation
– Tier 0 and Tier 1 sites are in the process of “self certification” according to ISO 16363.
• CERN also developing an advanced Data Management Plan that will be updated roughly annually based on “an intelligent super-set” of the Horizon 2020, DoE and NSF guidelines – with a few of our own for good measure.
Slide adapted from Jamie Shiers, CERN
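The "filter at birth" idea above can be shown in miniature. This is an illustrative sketch only, not an LHC trigger: keep only events whose measured quantity crosses a strict threshold, discarding the rest before they ever reach storage.

```python
import random

def trigger(events, threshold):
    """Yield only events deemed worth storing; everything else is dropped."""
    for energy in events:
        if energy >= threshold:
            yield energy

random.seed(42)
# Simulated event "energies" (arbitrary units, exponentially distributed);
# a strict threshold keeps only a tiny fraction of the stream.
events = [random.expovariate(1.0) for _ in range(100_000)]
kept = list(trigger(events, threshold=7.0))
fraction_kept = len(kept) / len(events)    # well under 1%
```

Real triggers apply physics-motivated selections in custom hardware and software at enormous rates, but the economics are the same: storage is sized for the survivors, not the raw stream.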
Post-collision
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25 2012 | Page 6
After the collisions have stopped
• Finish the analyses! But then what do you do with the data?
– Until recently, there was no clear policy on this in the HEP community
– It's possible that older HEP experiments have in fact simply lost the data
• Data preservation, including long-term access, is generally not part of the planning, software design, or budget of an experiment
– So far, HEP data preservation initiatives have in the main not been planned by the original collaborations, but have rather been the effort of a few knowledgeable people
• The conservation of tapes is not equivalent to data preservation!
– “We cannot ensure data is stored in file formats appropriate for long term preservation”
– “The software for exploiting the data is under the control of the experiments”
– “We are sure most of the data are not easily accessible!”
Slide adapted from Jamie Shiers, CERN
• Cyberinfrastructure is the organized aggregate of information technologies coordinated to address problems in science and society
• Cyberinfrastructure components:
– Digital data
– Computers
– Wireless and wireline networks
– Personal digital devices
– Scientific instruments
– Storage
– Software
– Sensors
– People …
“Cyberinfrastructure” (aka e-Infrastructure)
“If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.”
NSF Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure
(“Atkins Report”, 2003)
Cyberinfrastructure (CI): an emerging national focus in 2000 and beyond
• Publication of the Atkins report accelerated CI as a critical national focus within federal R&D investments, especially at NSF
• CI was elevated from a division within the CISE directorate to an Office within NSF (Atkins became first OCI Director)
• San Diego Supercomputer Center (SDSC), a pioneer in data-intensive computing, focused on leadership in data cyberinfrastructure and data-enabled applications
Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
Building Data Cyberinfrastructure at SDSC, 2001 - 2009
SDSC in a Nutshell:
• 1985 – 2004: NSF supercomputer / cyberinfrastructure center hosted at UCSD. 2004 – present: UCSD center with national impact.
• Multi-faceted facility focused on cyberinfrastructure-enabled applications, high performance computing, data stewardship
– Several hundred research, technology, and production-systems-focused staff
– Home to 100+ national research projects, allocated national machines, research data infrastructure
– Funded by NSF, Department of Energy, NIH, DHS, Library of Congress, National Archives and Records Administration, UC system, etc.
SDSC Strategic Focus in the 2000’s: Support for data-oriented science from the small to the large scale
• Special needs at the extremes …
• Data Cyberinfrastructure should support
– Petabyte sized collections
– 100 PetaByte archive
– Collections which must be preserved 100 years or more
– Data-oriented simulation, analysis, and modeling at 10-100X university/research lab-level capacities
– Professional data services, software, curation beyond what is feasible in university, campus, and research lab facilities
SDSC Data Cyberinfrastructure / Selected Projects
[Diagram: SDSC Data CI spans data portals, data visualization, data management, HPC data, data analytics, data services, data storage (including Data Oasis), and data preservation.]
SDSC Data Central
• One of the first general-purpose programs of its kind to support research and community data collections and databases
• Data Central was available without charge to the scientific community and provided a facility to store, manage, analyze, mine, share and publish data collections, enabling access and collaboration in the broader scientific community
• Project led by Natasha Balac at SDSC
Who could apply
• Open to researchers affiliated with US educational institutions
• Proposals were merit-reviewed quarterly by Data Allocations Committee
Types of Allocations:
• Expedited Allocations
– 1 TB or less of disk & tape 1st year
– 5 GB Database 1st year
– Yearly review
• Medium Allocations
– Under 30 TB
• Large Allocations
– Larger than 30 TB
DataCentral Allocated Collections included
Seismology 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences 50-year Downscaling of Global Analysis over California Region
Earth Sciences NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics AMANDA data
Biology AfCS Molecule Pages
Biomedical Neuroscience BIRN
Networking Backbone Header Traces
Networking Backscatter Data
Biology Bee Behavior
Biology Biocyc (SRI)
Art C5 landscape Database
Geology Chronos
Biology CKAAPS
Biology DigEmbryo
Earth Science Education ERESE
Earth Sciences UCI ESMF
Earth Sciences EarthRef.org
Earth Sciences ERDA
Earth Sciences ERR
Biology Encyclopedia of Life
Life Sciences Protein Data Bank
Geosciences GEON
Geosciences GEON-LIDAR
Geochemistry Kd
Biology Gene Ontology
Geochemistry GERM
Networking HPWREN
Ecology HyperLter
Networking IMDC
Biology Interpro Mirror
Biology JCSG Data
Government Library of Congress Data
Geophysics Magnetics Information Consortium data
Education UC Merced Japanese Art Collections
Geochemistry NAVDAT
Earthquake Engineering NEESIT data
Education NSDL
Astronomy NVO
Government NARA
Anthropology GAPP
Neurobiology Salk data
Seismology SCEC TeraShake
Seismology SCEC CyberShake
Oceanography SIO Explorer
Networking Skitter
Astronomy Sloan Digital Sky Survey
Geology Sensitive Species Map Server
Geology SD and Tijuana Watershed data
Oceanography Seamount Catalogue
Oceanography Seamounts Online
Biodiversity WhyWhere
Ocean Sciences Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering TeraBridge
Various TeraGrid data collections
Biology Transporter Classification Database
Biology TreeBase
Geoscience Tsunami Data
Education ArtStor
Biology Yeast regulatory network
Biology Apoptosis Database
Cosmology LUSciD
Focus of Computing Procurements: Supercomputers that supported Data-Intensive Applications
• A balanced system to support data-oriented applications requires a trade-off of flops and other key system characteristics
• Balanced system provides support for tightly-coupled and strong I/O applications
– Grid platforms not a strong option
– Data local to computation
– I/O rates exceed WAN capabilities
– Continuous and frequent I/O is latency-intolerant
• Scalability is key
– Need high-bandwidth and large-capacity local parallel file systems
– Need large-capacity flexible “parking” storage for post-processing
– Need high-bandwidth and large-scale archival storage
• Application performance determines the best configuration
[Chart: DoD applications plotted by spatial locality vs. temporal locality (“Locality, Data Stride 0 Ignored”), with reference benchmarks Linpack, STREAM, RandomAccess, and Overflow marked for IBM Power3; applications range along a spectrum from compute-oriented to data-oriented.]
Data Activities at SDSC in 2016
• Data Oasis – 4 PB of storage for users of SDSC Gordon and Comet in XSEDE
• Data-focused projects:
– NSF Big Data Hub West
– Health Cyberinfrastructure -- Cancer Data Infrastructure
– Big Data Benchmarking
– Data and Information Virtualization
– Data Integration
– Data Modeling
– Graph Analytics
– Spatial Data
– Predictive Analytics
– Time Series and Streaming Data
– Visualization, etc.
Data Science – an emerging field
• Slides based on slides by Rob Rutenbar for the NSF CISE AC Data Science Subcommittee
• [Wikipedia] Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
How do we develop a roadmap for data science?
• Curriculum & pedagogy – Who teaches foundations? Where? To whom?
• Research – Who is doing data science research? Who is doing research enabled by data science? How can this be fostered/facilitated? What are the programmatic needs?
• Infrastructure – What is needed to support data sci?
• Verticals – Disruptions of the end-to-end computing stack
• Ethics & society – What informs use and collection of data? Social impacts?
How do we connect the dots in data science?
Data Science
Education and training
• Should help create data-literate citizenry
• Should prepare students for data-driven jobs in all sectors
• Should prepare students for jobs in data science in all sectors
Research and innovation
• Should support cutting edge research on key problems in the understanding, use and supporting environments for data
• Should support integration of data as driver for innovation in all fields
Infrastructure
• Should support data education and training
• Should support data research at sufficient scale
• Should provide stewardship and preservation of valuable data
• Should support state-of-the art data services
Social Policy and Governance
• Should provide basis for appropriate social policy and regulation around the appropriate and ethical use of data
Foundational question to guide data science as it emerges: What kind of discipline is data science?
• Statistics?
• Machine learning?
• Hybrid discipline?
– X-analytics
– X-informatics (e.g. data expansion of bio-informatics)
– data + X (e.g. data equivalent of computational biology)
• Discipline on its own?
Maybe all of the above, but … what mechanisms can be used to grow and invest in the discipline to maximize its potential?
Org? Where Does Data Science Live, Grow?
[Diagram: possible organizational homes for data science (DS) – a subfield within CS, a subfield within Statistics, a joint CS/Stat effort, a standalone DataSci unit alongside CS and Stat, or DS embedded within many domain departments (A, B, C, …, Z).]
Where does foundational pedagogy get developed? Aimed at what audiences?
NSF CISE Current Data Science Investment Strategy
• To develop new techniques and technologies to derive knowledge from data
• To manage, curate, and serve data to domain research communities
• For a growing emerging discipline
• To support interdisciplinary science and communities
Future Investment Strategies: Proposed Focus of the NSF CISE Advisory Committee on Data Science
• What is data science?
• Where is data science? (i.e. within higher education)
• Who is data science? (i.e. within areas and professional occupations in the workforce)
• Why is data science a priority?
• PREPARING A WORKFORCE FOR DATA SCIENCE-ENABLED JOBS
– What do we need to teach and how?
• Recommendations for curriculum development programs
• Recommendations for graduate training and education programs
• Recommendations for public-private data science internships
• ACCELERATING THE STATE OF THE ART OF DATA SCIENCE
– What are the big problems in data science? What are the big problems that data science enables?
• Recommendations for NRC study on data science grand challenges
• Recommendations for CISE programs that encourage research in data science grand challenges
• Recommendations for joint CISE and other directorate (or agency) programs that encourage research in data science grand challenges
• Recommendations for joint CISE and private sector programs on data science grand challenges
• SUPPORTING DATA SCIENCE AND DATA-ENABLED RESEARCH AND EDUCATION
– What kind of infrastructure is needed to support successful data science research and education?
• Recommendations for sustainable and reliable at-scale infrastructure and data collections
• Recommendations for public-private partnerships to support data science infrastructure
• Recommendations for broad training and education in using data science infrastructure and resources (data literacy …)
Lecture Materials (not already on slides)
• Southern California Earthquake Center, http://www.scec.org/
• Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
• LHC, www.wikipedia.com
• Worldwide LHC Computing Grid website, http://wlcg-public.web.cern.ch/tier-centres
• San Diego Supercomputer Center, www.sdsc.edu
Two Weeks: L4 Roundtable February 26
• “How the Internet of Things got hacked,” Wired, December, 2015,
http://www.wired.com/2015/12/2015-the-year-the-internet-of-things-got-hacked/ (Theo B)
• “The Measured Life,” MIT Technology Review, June 21, 2011, http://www.technologyreview.com/featuredstory/424390/the-measured-life/ (Brenda T)
• “GM and Lyft are building a network of self-driving cars,” Wired, January 4, 2016, http://www.wired.com/2016/01/gm-and-lyft-are-building-a-network-of-self-driving-cars/ (Chris P)
• “Hijackers remotely kill a jeep on the highway – with me in it,” Wired, July 2, 2015, http://www.wired.com/2015/07/hackers-remotely-kill-jeep-highway/ (TK W)
• “Robot doctors, online lawyers and automated architects: the future of the professions?”, The Guardian, June 15, 2014, http://www.theguardian.com/technology/2014/jun/15/robot-doctors-online-lawyers-automated-architects-future-professions-jobs-technology (Sri I)
Next week: L3 Data Roundtable for February 19
• “Scientists Say that a Neptune-Sized Planet Lurks Beyond Pluto,” Science, January 20, 2016, http://www.sciencemag.org/news/2016/01/feature-astronomers-say-neptune-sized-planet-lurks-unseen-solar-system (Kiana M.)
• “Birdwatchers help Science fill gaps in the Migratory Story,” NYTimes, January 30, 2016, http://www.nytimes.com/2016/01/29/science/bird-watchers-help-science-fill-gaps-in-the-migratory-story.html?rref=collection%2Fsectioncollection%2Fscience&action=click&contentCollection=science&region=rank&module=package&version=highlights&contentPlacement=8&pgtype=sectionfront (Amelia GB)
• “NCAR announces powerful new supercomputer for advanced atmospheric, geosciences modeling,” Scientific Computing, January 11, 2016, http://www.scientificcomputing.com/news/2016/01/ncar-announces-powerful-new-supercomputer-advanced-atmospheric-geosciences-modeling (Kienan K.)
• “Scientists Use Stargazing Technology in the Fight against Cancer,” Time, February, 2013 http://healthland.time.com/2013/02/27/scientists-use-stargazing-technology-in-the-fight-against-cancer/ (Courtney T.)
• “Digital Keys for Unlocking the Humanities’ Riches”, the New York Times, November 16, 2010, http://www.nytimes.com/2010/11/17/arts/17digital.html?pagewanted=all&_r=0 (Jessica J.)
Today: Readings for Lecture 2 Data Roundtable
• “Got Data? A Guide to Digital Preservation in the Information Age,” CACM (December, 2008) http://www.cs.rpi.edu/~bermaf/CACM08.pdf (Caitlin C.)
• “A Digital Life,” Scientific American (March, 2007) http://www.scientificamerican.com/article/a-digital-life/ (Rob R.)
• “Thirteen Ways of Looking at … Digital Preservation,” D-Lib Magazine (August, 2004), http://www.dlib.org/dlib/july04/lavoie/07lavoie.html (Amreen A.)
• “The Lost NASA Tapes: Restoring Lunar Images after 40 Years in the Vault”, ComputerWorld (June, 2009), http://www.computerworld.com/article/2525935/computer-hardware/the-lost-nasa-tapes--restoring-lunar-images-after-40-years-in-the-vault.html?page=2 (Jordan D.)
• “Preserving the Internet”, CACM (January 2016), http://cacm.acm.org/magazines/2016/1/195738-preserving-the-internet/fulltext (Dan L.)