Innovation in a Complex World:
Examples and Challenges
www.microsoft.com/science
Dr Daron Green
Senior Director, Microsoft Research
• Context
• Innovation in action
– Data deluge
– Data visualization
– Data sharing
• Challenges/impediments
– Things we haven’t worked out
– What’s stopping us making progress
– Areas of concern
Overview
Microsoft Research At A Glance
• Redmond, Washington: Sep 1991
• San Francisco, California: Jun 1995
• Cambridge, United Kingdom: July 1997
• Beijing, China: Nov 1998
• Silicon Valley, California: July 2001
• Bangalore, India: Jan 2005
• Cambridge, Massachusetts: July 2008
MSR India
Microsoft Research Mission Statement
• Expand the state of the art in each of the areas in which we do research
• Rapidly transfer innovative technologies into Microsoft products
• Ensure that Microsoft products have a future
Life Sciences
Multidisciplinary Research
New Materials, Technologies & Processes
Math and Physical Science
Social Sciences
Earth Sciences
Computer & Information Sciences
Context: Science @ Microsoft
• Data collection
– Sensor networks, satellite surveys, high throughput laboratory instruments, astronomical telescopes, supercomputers, LHC …
• Data processing, analysis, visualization
– Legacy codes, workflows, data mining, indexing, searching, graphics …
• Archiving
– Digital repositories, libraries, preservation, …
A Data Deluge in Science
SensorMap
Functionality: Map navigation
Data: sensor-generated temperature, video camera feed, traffic feeds, etc.
Scientific visualizations
NSF Cyberinfrastructure report, March 2007
• A thousand years ago – Experimental Science
– Description of natural phenomena
• Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
• Last few decades – Computational Science
– Simulation of complex phenomena
• Today – eScience or Data-centric Science
– Unify theory, experiment, and simulation
– Using data exploration and data mining
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
Scientists overwhelmed with data…
Computer Scientists and IT companies have technologies that will help innovate
Emergence of a New Research Paradigm?
• Data management along research pipeline:
Implications
• Capture (inc. metadata)
• Processing
• Storage
• Retrieval
• Sharing
• Visualization
• Publication
• Archival
Handling the data deluge…
Three examples:
• Machine Learning and HIV/AIDS research
• Advanced Database technologies and Environmental Science
• Oceanographic Workflows
Fighting HIV with Computer Science
• A major problem: Over 40 million infected
– Drug treatments are effective but are an expensive lifelong commitment
• Vaccine needed for third world countries
– Effective vaccine could eradicate disease
• Methods from computer science are helping with the design of a vaccine
– Machine learning: Finding biological patterns that may stimulate the immune system to fight the HIV virus
– Optimization methods: Compressing these patterns into a small, effective vaccine
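The "compressing patterns into a small vaccine" step can be pictured as a coverage problem. The sketch below is not the Microsoft Research method. It assumes, purely for illustration, that each candidate fragment matches a set of viral strains, and it uses a standard greedy set-cover heuristic to pick a few fragments that together cover as many strains as possible. The function name, fragment names and strain labels are all invented.

```python
def greedy_cover(candidates, budget):
    """Pick up to `budget` candidates that greedily maximise strain coverage.

    candidates: dict mapping a fragment name to the set of strains it covers
                (hypothetical data, not real epitopes).
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Choose the fragment adding the most strains not yet covered;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining fragment adds new coverage
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical fragments and the strains each one matches.
fragments = {
    "frag_A": {"s1", "s2", "s3"},
    "frag_B": {"s3", "s4"},
    "frag_C": {"s4", "s5", "s6"},
    "frag_D": {"s1", "s6"},
}
chosen, covered = greedy_cover(fragments, budget=2)
print(chosen, sorted(covered))
```

Greedy selection is only a heuristic; it is used here because it is short and easy to follow, not because the slides name it.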
Computational Biology Web Tools
Better vaccine design through improved understanding of HIV evolution
Goals
• Use machine learning and visualization tools developed at Microsoft, which require HPC, to build maps of within-individual evolution of the HIV virus
Progress so far
• Discovered ‘decoy epitopes’ that could have predicted recent failure of Merck vaccine
• Algorithms and medical results published in Science and Nature Medicine
• MSR Computational Biology Tools published (Open Source on CodePlex)
Handling the data deluge…
Three examples:
• Machine Learning and HIV/AIDS research
• Advanced Database technologies and Environmental Science
• Oceanographic Workflows
Carbon-Climate Data
• What is the role of photosynthesis in global warming?
– Measurements of CO2 in the atmosphere show 16-20% less than emissions estimates predict
– The difference is either due to plants or ocean absorption.
• Communal field science – each investigator acts independently.
• Cross site studies and integration with modeling increasingly important
[Figure: LaThuile_NEE (gC m-2 yr-1) plotted against Pub_NEE (gC m-2 yr-1); both axes run from -1500 to 1500]
Ameriflux Data
In collaboration with Berkeley Water Center
• 149 Ameriflux sites across the Americas reporting a minimum of 22 common measurements
• Carbon-Climate Data published to and archived at Oak Ridge
• Total data reported to date on the order of 192M half-hourly measurements since 1994
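As a rough scale check on these figures, the arithmetic below converts the reported total into equivalent "complete site-years". It assumes exactly 22 variables per site and gap-free half-hourly records, which real flux sites will not have, so the result is an order-of-magnitude illustration only.

```python
HALF_HOURS_PER_YEAR = 48 * 365      # half-hourly samples per variable per year
MEASUREMENTS_PER_SITE = 22          # the slide's minimum number of common measurements
TOTAL_MEASUREMENTS = 192_000_000    # "on the order of 192M"
SITES = 149

# Values one site would report in a year with no gaps (an assumption).
per_site_year = HALF_HOURS_PER_YEAR * MEASUREMENTS_PER_SITE
site_years = TOTAL_MEASUREMENTS / per_site_year

print(f"{per_site_year:,} values per complete site-year")
print(f"about {site_years:.0f} complete site-years, "
      f"or {site_years / SITES:.1f} years per site on average")
```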
Scientific Data Servers for Hydrology
• Sharepoint site www.fluxnet.org
– 921 site-years of data from 240 sites around the world; 80+ site-years now being added
– 60+ paper writing teams
– American data subset is public and served more widely
– Summary data products greatly simplify initial data discovery
• Used modern Relational Database technologies
– Scientists can access data through Data Cubes
– Allows simple data viewing without need for knowledge of SQL language
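To show what a data cube offers a scientist, here is a minimal sketch in plain Python. It is not the Ameriflux or Fluxnet implementation. It assumes three dimensions (site, year, variable), pre-aggregates hypothetical records into cells, and then answers "slice" questions without any SQL. The site codes, values and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical half-hourly records: (site, year, variable, value).
records = [
    ("US-Site1", 2004, "NEE", -2.0),
    ("US-Site1", 2004, "NEE", -4.0),
    ("US-Site1", 2005, "NEE", -1.0),
    ("CA-Site2", 2004, "NEE", 3.0),
    ("CA-Site2", 2004, "TA", 11.0),
]


def build_cube(rows):
    """Aggregate raw rows into {(site, year, variable): [count, total]}."""
    cube = defaultdict(lambda: [0, 0.0])
    for site, year, variable, value in rows:
        cell = cube[(site, year, variable)]
        cell[0] += 1
        cell[1] += value
    return cube


def slice_mean(cube, site=None, year=None, variable=None):
    """Mean over every cell matching the given dimensions (None means all)."""
    count = 0
    total = 0.0
    for (s, y, v), (n, t) in cube.items():
        if site in (None, s) and year in (None, y) and variable in (None, v):
            count += n
            total += t
    return total / count if count else None


cube = build_cube(records)
print(slice_mean(cube, variable="NEE", year=2004))  # mean NEE across all sites in 2004
print(slice_mean(cube, site="US-Site1", variable="NEE"))  # mean NEE at one site, all years
```

Because the cells store counts and totals rather than raw rows, each query touches only a handful of aggregates, which is the property that makes cube-style browsing quick.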
Ameriflux Data Availability: All Data
[Figure: data availability by site and year, 1991 to 2006, for Ameriflux sites in Brazil, Canada and the USA]
Mashup of Ameriflux Sites
Handling the data deluge…
Three examples:
• Machine Learning and HIV/AIDS research
• Advanced Database technologies and Environmental Science
• Oceanographic Workflows
Trident – Scientific Workbench
• Visually program workflows, through a web browser.
• Libraries of activities and workflows, to save and reuse workflows.
• Abstract parallelism for HPC, to test on desktop and then run on cluster.
• Adaptive workflows, to detect and respond to events in real-time.
• Automatic provenance capture, for all workflows and data products.
• Costing model, estimating resources required to run a workflow.
• Integrated data storage and access, allowing researchers to store data in a SQL database, in local files or in the cloud (Microsoft SDS, Amazon S3).
• Fault tolerance, to facilitate smart reruns and what-if analysis.
• Reproducible research
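Two of the features listed above, automatic provenance capture and smart reruns, can be sketched in a few lines. This is not Trident's API. The names `run_step`, `fingerprint` and `provenance_log` are hypothetical, and the sketch assumes every step's inputs are JSON-serialisable. Each call is logged with a hash of its inputs, and a step whose inputs have not changed is served from a cache instead of being recomputed.

```python
import hashlib
import json
import time

provenance_log = []  # one record per step execution or cache hit
_cache = {}          # maps an input fingerprint to a previously computed result


def fingerprint(step_name, inputs):
    """Stable hash of a step name and its JSON-serialisable inputs."""
    payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step_name, func, inputs):
    """Run func(**inputs), recording provenance and reusing cached results."""
    key = fingerprint(step_name, inputs)
    reused = key in _cache
    if not reused:
        _cache[key] = func(**inputs)
    provenance_log.append({
        "step": step_name,
        "inputs": inputs,
        "fingerprint": key,
        "reused": reused,
        "timestamp": time.time(),
    })
    return _cache[key]


def smooth(values, window):
    """Simple moving average, standing in for a real analysis activity."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


first = run_step("smooth", smooth, {"values": [1, 2, 3, 4], "window": 2})
second = run_step("smooth", smooth, {"values": [1, 2, 3, 4], "window": 2})  # rerun, served from cache
third = run_step("smooth", smooth, {"values": [1, 2, 3, 4], "window": 3})   # what-if with a new parameter
print(first, third, [record["reused"] for record in provenance_log])
```

The log is what makes a result reproducible: given the recorded step names and inputs, the same chain of calls can be replayed later.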
Trident Scientific Workflow Workbench
What it provides to the scientists
• Three dominant issues:
– People: lack of alignment in benefits, incentives and budget…or, put another way, the way we respond to money, process, metrics, measurement and recognition…
– Technology: Transition to many/multi-core
– Privacy: risk of exposing personal information
However… Challenges/Impediments
Remote management of long-term conditions
The underlying challenge…
• Thousands of successful(?) pilots but none ‘make it big’
• Many, many papers published
• It has been shown† that:
– Largely no motivation for adoption by health practitioners because there is…
– no alignment of benefits, incentives and budgets
• Or, stated another way, it is dangerous to assume people will adopt an innovation just because it is ‘obviously’ the right thing to do.
• Consider the whole context for the innovation (people, money, metrics, reward structures, process, skills etc.); it’s not just the technology.
• Sometimes the key innovation is in the business design
†Dr Daron G Green and Prof Terry Young; Value Propositions for Information Systems in Healthcare; HICSS - Proceedings of the 41st Annual Hawaii International Conference on System Sciences, p257, 2008
• Three dominant issues:
– People: lack of alignment in benefits, incentives and budget…or, put another way, the way we respond to money, process, metrics, measurement and recognition…
– Technology: Multi-Core Transition
– Privacy: inadvertently exposing personal information
Challenges/Impediments
[Figure: power density (W/cm2), log scale from 1 to 10,000, against year from ‘70 to ‘10, for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium®, with reference levels for a hot plate, nuclear reactor, rocket nozzle and the Sun’s surface. Source: Intel Developer Forum, Spring 2004 - Pat Gelsinger]
CPU Architecture
• Heat becoming an unmanageable problem
The End of Moore’s Law as We Know It
• Future of silicon chips
– “100’s of cores on a chip in 2015”
(Justin Rattner, Intel)
• Challenge for IT industry and Computer Science community
– How can we make parallel computing on a chip easy for developers of consumer applications?
• Challenge for the Scientific Community
– How will the Multi-Core transition affect scientific computing?
• Three dominant issues:
– People: lack of alignment in benefits, incentives and budget…or, put another way, the way we respond to money, process, metrics, measurement and recognition…
– Technology: Multi-Core Transition
– Privacy: inadvertently exposing personal information
Challenges/Impediments
• With web users becoming producers of information…
• We leave the footprint of our lives in digital trails…
• It is becoming easier for “data snoopers” to reconstruct the identity of an individual or an organization by cross-linking information from different sources.
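The cross-linking risk described above can be made concrete with a small linkage sketch. All data here is invented, and the choice of quasi-identifiers (postcode, birth year, sex) is an assumption for illustration. A released dataset with names removed is joined to a public directory on the attributes both share. A released row that matches exactly one directory entry has effectively been re-identified.

```python
# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
released = [
    {"postcode": "AA1", "birth_year": 1944, "sex": "F", "query": "landscapers"},
    {"postcode": "BB2", "birth_year": 1980, "sex": "M", "query": "hiking boots"},
]

# Hypothetical public directory holding names and the same attributes.
directory = [
    {"name": "Person A", "postcode": "AA1", "birth_year": 1944, "sex": "F"},
    {"name": "Person B", "postcode": "BB2", "birth_year": 1980, "sex": "M"},
    {"name": "Person C", "postcode": "BB2", "birth_year": 1980, "sex": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link(released_rows, directory_rows, keys=QUASI_IDENTIFIERS):
    """Return, for each released row, the directory names sharing its keys."""
    index = {}
    for person in directory_rows:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    return [(row, index.get(tuple(row[k] for k in keys), []))
            for row in released_rows]


matches = link(released, directory)
for row, names in matches:
    status = "re-identified" if len(names) == 1 else "not uniquely matched"
    print(row["query"], "->", names, status)
```

In this toy data the first row is unique on its quasi-identifiers and so maps to a single name, while the second is shared by two people and stays ambiguous.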
Challenge: Data for Open Innovation
• “Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears.”
The New York Times, Aug 2006:
Thelma Arnold's identity was betrayed by the records of her Web searches
A face is exposed for searcher no. 4417749
Online Privacy
• We leave our traces online at multiple sites such as social networks, blogs, forums etc.
– Re-identify users from movie mentions in forums to user ratings of movies [Frankowski’06]
• However, researchers seek to gain insights, undertake experiments with real-world data and businesses need tools and analysis to understand market trends and needs…
• Research and Innovation is inhibited due to the lack of a framework to disseminate information in a safe way
• Open innovation roadblocks due to shortcomings in
– Data confidentiality/privacy
– Different data regulations per country
• More research needed on technical (semantics), legal, societal solutions and processes to enable open innovation in an information-based society
In need of a framework for open innovation
• Three dominant issues:
– People: lack of alignment in benefits, incentives and budget…what is the business design that underpins your innovation?
– Technology: Multi-Core Transition…just how will this work out?
– Privacy: inadvertently exposing personal information…what personal/business risks are we prepared to accept?
Challenges/Impediments
1) BT originally tried to sell here
2) …then we aspired to be here…
3) …and needed to understand what functionality/value was required
Comprehensive analysis of:
- NHS Stakeholder vs benefit
- NHS Stakeholder vs incentives
- NHS Stakeholder vs budget availability
- defining the scope of the service
Starting point
Simplified benefits
No significant benefit to these care providers
PCT sees benefit and dis-benefit:
- Benefits of service are extremely diffuse
- Medication and strips costs ↑
- GP visits and A&E admissions ↓ over time
- Compliance increases: Yr 1 <£10k benefit growing to £225k by Yr 10 (payback over very long timescales)
- Near term: BT CDM solution roughly cash neutral to PCT
Patients clearly benefit provided they are motivated to use service
Plays into political agenda:
- Access
- Choice
- Increased private sector involvement in patient care
- New role of pharmacies
Incentives summary
Incentives dominated by financial imperatives
Current incentives operate against adoption of service
Implementation of service largely irrelevant given current incentives
Requires regular updates to ensure personal motivation
Budget availability summary
Lack of incentives and appropriate metrics leads to no real acknowledgement of the problem and no defined budget
Patients see costs for diabetes (and other LTCs) as being the responsibility of the NHS
Summary overlay [benefits/incentives/budget]
Alignment of benefits, incentives and budget availability does not appear at lower levels of stakeholder stack.
Explains why many hospital/PCT/SHA pilots and other initiatives in this area have failed. This is a ‘no profit zone’ for a CDM service in UK.
Accrual of benefits at upper levels in NHS/DoH encourages national-scale service...however all management of long-term conditions is devolved to ‘lower’ levels of the NHS