Acs denver dirks potenzone 30 aug2011
Enriched research documents at the cutting edge: When research papers no longer make sense on paper
Rudy Potenzone, SciencePoint Solutions
Lee Dirks, Education & Scholarly Communication, Microsoft Research | Connections
Presented at the American Chemical Society National Meeting, Denver, CO, August 30, 2011, at the Skolnik Award Symposium in Honor of Sandy Lawson
Agenda
• Part 1 – The Scientific Paper
• Part 2 – Emergence of the ePaper
• Part 3 – Of Workflows and Add-ins
• Part 4 – Impact of the ePaper
• Part 5 – A Glimpse to the Future
Part 1 – The Scientific Paper
A Brief History of Enriched Scientific Papers
• Research papers have long enjoyed the ability to exist on paper with enriched content
• Embed figures and associated electronic items
– Chemical structures that included full bonding and structural information
– Crystallographic databases
– Spectral databases
– Biological sequence and pathway databases
– Supplemental material repositories
Issues with External Repositories
• Often not complete
• Poorly audited, with some notable exceptions
• References between the paper and the files are often lost or incorrect
• There is a real loss of context due to the separation of all the information
• Reproducibility is not certain!
Part 2 – Emergence of the ePaper
My Bio – a Content Perspective
• The NIH/EPA Chemical Information System
– SANSS, MSSS, FRSS, etc.
• Chemical Abstracts Service
– CA, Registry, CASREACT, CHEMCATS, SciFinder
• MDL Information Systems/Elsevier
– ACD, various synthesis, Beilstein
• LION bioscience, Ingenuity Systems
– SRS and Ingenuity Pathway Analysis (IPA)
• CambridgeSoft
– ACX, etc.
Why Are We NOT Focusing On Authoring Tools?
On the Verge of a Major Revolution
• Technology that enables authors to create elaborate versions of results of research
• Capturing the full context of research in progress:
– The formal scientific report
– The very METHODS used
– Full data repository
– Complete workflows
• With the resulting documentation offering information for completely reproducible results
Envisioning a New Era of Research Reporting
• Dynamic Documents
• Reputation & Influence
• Reproducible Research
• Interactive Data
• Collaboration
Benefits of a Scientific ePaper
• Helping to improve the quality of science
• Facilitating the intellectual transfer of the core discoveries
• Fully documenting the provenance of the research
• Preserving the knowledge with complete context
• Services easily accessible on top of the data
– a new value-added layer
– visualization and analysis
– discovery through simulation and modeling
– etc.
• Accessible Reproducible Research!!
Jill P. Mesirov. Accessible Reproducible Research. Science, Vol. 327, 22 Jan 2010 (from http://www.sciencemag.org/cgi/content/full/327/5964/415/DC1)
Reproducible Research
Scientific publications have at least two goals:
1. to announce a result and
2. to convince readers that the result is correct.
3. Preservation of knowledge
Fully Reproducible Content Driving Better Science
• Rich Original Content
• Content Sharing Services
• Full Data Content Embedded
• Workflow Process Embedded
Part 3 – Of Workflows and Add-ins
Redefining the Document
• Microsoft introduced its open document format – Open XML – in Office 2007
Project "Chem4Word" – Chemical Drawing in Microsoft Word
Semantic chemistry for students and publishers

    <?xml version="1.0" ?>
    <cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
      <molecule id="m1">
        <atomArray>
          <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
          <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
          <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
          <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
          <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
          <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
          <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
          <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
        </atomArray>
        <bondArray>
          <bond atomRefs2="a1 a2" order="1" />
          <bond atomRefs2="a2 a3" order="1" />
          <bond atomRefs2="a2 a4" order="2" />
          <bond atomRefs2="a1 a5" order="1" />
          <bond atomRefs2="a1 a6" order="1" />
          <bond atomRefs2="a1 a7" order="1" />
          <bond atomRefs2="a3 a8" order="1" />
        </bondArray>
      </molecule>
    </cml>

V1.0 now available (binary and open source): http://research.microsoft.com/chem4word/
Data: Semantics stored in Chemistry Markup Language (CML)
Intent: Recognizes chemical dictionary and ontology terms
Author/edit 1D and 2D chemistry. Change chemical layout styles.
Intelligence: Verifies validity of authored chemistry
http://www.nytimes.com/2010/04/08/technology/personaltech/08askk.html?_r=1
Relationships: Navigate and link referenced chemistry
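Because the chemistry is stored as CML, any XML-aware tool can read it back. The following is a minimal sketch in Python, not Chem4Word's own code: it parses the molecule shown on the Chem4Word slide with the standard library and derives a molecular formula and the bond list. The function names (read_molecule, hill_formula) are illustrative, and the 2D coordinates are rounded here for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute of the CML on the slide.
NS = {"cml": "http://www.xml-cml.org/schema"}

# The molecule from the slide; coordinates rounded to three decimals.
CML_TEXT = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.915" y2="0.770" />
      <atom id="a2" elementType="C" x2="-1.581" y2="1.540" />
      <atom id="a3" elementType="O" x2="-0.248" y2="0.770" />
      <atom id="a4" elementType="O" x2="-1.581" y2="3.080" />
      <atom id="a5" elementType="H" x2="-4.249" y2="1.540" />
      <atom id="a6" elementType="H" x2="-2.915" y2="-0.770" />
      <atom id="a7" elementType="H" x2="-4.249" y2="0.000" />
      <atom id="a8" elementType="H" x2="1.086" y2="1.540" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def read_molecule(cml_text):
    """Return ({atom id: element}, [(atom id, atom id, bond order)]) for the first molecule."""
    root = ET.fromstring(cml_text)
    molecule = root.find("cml:molecule", NS)
    elements = {
        atom.get("id"): atom.get("elementType")
        for atom in molecule.iterfind("cml:atomArray/cml:atom", NS)
    }
    bonds = []
    for bond in molecule.iterfind("cml:bondArray/cml:bond", NS):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, int(bond.get("order"))))
    return elements, bonds


def hill_formula(counts):
    """Format element counts in Hill order: C, then H, then the rest alphabetically."""
    symbols = sorted(counts)
    if "C" in counts:
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if "H" in counts else []) + rest
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


elements, bonds = read_molecule(CML_TEXT)
formula = hill_formula(Counter(elements.values()))
double_bonds = [(a, b) for a, b, order in bonds if order == 2]
print(formula, len(bonds), double_bonds)
```

The point of the sketch is that atoms, bonds and bond orders stay machine-readable inside the document rather than being flattened into a picture.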
GenePattern Reproducible Research Add-in
Source code and binary: http://GenepatternWordAddin.codeplex.com
Services: Connects to GenePattern database
Data: Resulting data (and provenance) stored within Word document
Data: Control and execute query pipelines into GenePattern
Relationships: Inline graphics are synchronized to dataset
Research Information Centre (RIC) Project
Virtual Research Environment (VRE) Toolkit for SharePoint
Version 1.1 (Open Source under Ms-PL): http://ric.codeplex.com/
Collaborative environment for research groups
Personal site for each researcher and project site for each project
Document management, federated search, social networking, real-time communication, blogs, wikis
Project Overview: http://research.microsoft.com/ric/ and http://research.microsoft.com/vre/
[Diagram labels: molecules, text, experiments, measurements, documents, data, scientists]
oreChem – The Chemical Semantic Web
• Peter Murray-Rust
• Jim Downing
• Nico Adams
• Carl Lagoze
• Geoffrey Fox
• Jeremy Frey
• Simon Coles
• Lee Giles
• Karl Mueller
• Prasenjit Mitra
Mash-up (re-use) of data
Semantic storage
Compound document authoring
Demonstrating:
• Large collaboration project focusing on interoperability
• At-source capture of chemistry data
• Chemical structure search
• Compound object authoring
• Retrospective harvesting of chemistry data
• Reuse through common ORE data model
• Semantic authoring
• Virtualized triple storage
“RSC Publishing and Southampton University drive the chemical semantic web…”
Enabling the Chemical Semantic Web
Recent developments of interest
Elsevier's Article of the Future Competition: the Grand Challenge & Article of the Future contest is an ongoing collaboration between Elsevier and the scientific community to redefine how a scientific article is presented online.
PLoS Currents: Influenza: a rapid research note service, run in conjunction with NIH & Google Knol. It provides an open-access online resource for immediate, open communication and discussion of new scientific data, analyses, and ideas in the field of influenza. All content is moderated by an expert group of influenza researchers but, in the interest of timeliness, does not undergo in-depth peer review.
Nature Precedings: connects thousands of researchers and provides a platform for sharing new and preliminary findings with colleagues on a global scale, via pre-print manuscripts, posters and presentations. Claim priority and receive feedback on your findings prior to formal publication.
Mendeley (and Papers): called the “iTunes” for academic papers; 400,000+ users have signed up and a staggering 30+ million scientific papers have been uploaded.
Several Commercial Data Sharing + Analysis Services
• Swivel
• IBM’s “Many Eyes”
• Gapminder & Google’s Trendalyzer
• Metaweb’s “Freebase”
• CSA’s “Illustrata”
Harvard’s “Dataverse” Project
http://thedata.org
Via web application software, data citation standards, and statistical methods, the Dataverse Network project increases scholarly recognition and distributed control for authors, journals, archives, teachers, and others who produce or organize data; facilitates data access and analysis for researchers and students; and ensures long-term preservation whether or not the data are in the public domain. [From the Institute for Quantitative Social Science (IQSS) at Harvard University]
Taverna
• Taverna is an open source and domain-independent Workflow Management System
– A suite of tools used to design and execute scientific workflows and aid in silico experimentation.
• Taverna has been created by the myGrid team and funded through OMII-UK. The project has guaranteed funding until 2014.
• The Taverna Suite is written in Java and includes the Taverna Engine (used for enacting workflows) that powers both the Taverna Workbench (desktop client) and the Taverna Server.
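To make "enacting a workflow" concrete, here is a minimal sketch in Python of what any workflow engine does at its core: run each step once the steps it depends on have produced output. This is not Taverna's API or workflow format (Taverna itself is written in Java); the enact function and the toy steps "clean", "mean" and "report" are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def enact(steps, dependencies, inputs):
    """Run every step after the steps it depends on, passing their outputs along.

    steps:        {step name: callable}
    dependencies: {step name: tuple of upstream names, in argument order}
    inputs:       {input name: value} for nodes that are data, not steps
    """
    results = dict(inputs)
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        if name in steps:
            upstream = [results[dep] for dep in dependencies.get(name, ())]
            results[name] = steps[name](*upstream)
    return results, order


# A toy three-step workflow over one raw input (all names are hypothetical).
steps = {
    "clean": lambda raw: [x for x in raw if x is not None],
    "mean": lambda clean: sum(clean) / len(clean),
    "report": lambda clean, mean: f"n={len(clean)} mean={mean:.2f}",
}
dependencies = {
    "clean": ("raw",),
    "mean": ("clean",),
    "report": ("clean", "mean"),
}
inputs = {"raw": [1.0, None, 2.0, 3.0]}

results, order = enact(steps, dependencies, inputs)
print(order)
print(results["report"])
```

Because the dependency graph is explicit data rather than buried in a script, the same description can be shared, re-run and recorded, which is what the workflow tools in this part of the talk build on.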
More on Taverna
• Integrated with other myGrid tools
– social networking and workflow sharing environment for scientists
– curated catalogue of Web services for Life Sciences
Log what, where, when, who
For data and for publications
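As a rough illustration of the "what, where, when, who" idea, the sketch below records each event as a small structured entry and serializes the log as JSON lines. It is a hypothetical Python example, not the format used by any tool in this talk; the field values (the balance location and the UTC timezone in particular) are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    what: str
    where: str
    when: str  # ISO 8601 timestamp
    who: str


def log_event(log, what, where, who, when=None):
    """Append one entry; the timestamp defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    entry = LogEntry(what=what, where=where, when=when.isoformat(), who=who)
    log.append(entry)
    return entry


def to_json_lines(log):
    """Serialize the log, one JSON object per line, so it can travel with the data."""
    return "\n".join(json.dumps(asdict(entry), sort_keys=True) for entry in log)


log = []
entry = log_event(
    log,
    what="Started reflux",
    where="fume hood 2 (hypothetical location)",
    who="gvh",
    when=datetime(2004, 1, 30, 13, 30, tzinfo=timezone.utc),  # timezone is assumed
)
print(to_json_lines(log))
```

The same entry shape works whether the thing being logged is a dataset, an instrument reading or a manuscript revision.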
[Diagram: CombeChem plan and process record for a synthesis, 30 January 2004 (gvh, hrm, gms). Diagram labels: To Do List, Plan, Process Record, Provenance]

Plan
• Dissolve 4-fluorinated biphenyl in butanone
• Add K2CO3 powder
• Heat at reflux for 1.5 hours
• Cool and add Br11OCB
• Heat at reflux until completion
• Cool and add water (30 ml)
• Extract with DCM (3 x 40 ml)
• Combine organics, dry over MgSO4 & filter
• Remove solvent in vacuo
• Fuse compound to silica & column in ether/petrol

Process Record
1. Add: Sample of 4-fluorinated biphenyl (Weigh: 0.9031 grammes); Butanone (Measure: 40 ml)
2. Add: Sample of K2CO3 powder (Weigh: 2.0719 g)
3. Reflux
4. Cool
5. Add: Sample of Br11OCB (Weigh: 1.5918 g)
6. Reflux
7. Cool
8. Add: Water (Measure: 30 ml)
9. Liquid-liquid extraction: DCM (Measure: 3 of 40 ml)
10. Dry: MgSO4
11. Filter (Buchner)
12. Remove solvent by rotary evaporation
13. Fuse: Silica
14. Column chromatography: Ether/petrol ratio

Annotations and other observations
• Butanone dried via silica column and measured into 100 ml RB flask.
• Used 1 ml extra solvent to wash out container.
• Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
• Inorganics dissolve, 2 layers. Added brine ~20 ml.
• Organics are yellow solution.
• Washed MgSO4 with DCM ~50 ml.
• Measure: excess

Observation Types
• weight – grammes
• measure – ml, drops
• annotate – text
• temperature – K, °C

Key
• Process
• Input
• Literal
• Observation

Future Questions
• Whether to have many subclasses of processes or fewer with annotations
• How to depict destructive processes
• How to depict taking lots of samples
• What is the observation/process boundary? e.g. MRI scan

Ingredient List
• Fluorinated biphenyl: 0.9 g
• Br11OCB: 1.59 g
• Potassium Carbonate: 2.07 g
• Butanone: 40 ml
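The process record on the CombeChem slide can be read as a small data model: processes take inputs, and inputs carry typed observations. The sketch below is one hypothetical way to express that in Python and to derive the ingredient list from the recorded weights and volumes. The class names, the step numbers assigned to each "Add" process, and the choice to round to two decimal places are assumptions for illustration; only the reagent names and measured values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    kind: str     # one of the slide's observation types, e.g. "weight" or "measure"
    value: float
    unit: str


@dataclass
class Input:
    name: str
    observations: list = field(default_factory=list)


@dataclass
class Process:
    number: int   # step number in the process record (assumed)
    name: str
    inputs: list = field(default_factory=list)


# Only the reagent-adding steps that feed the slide's ingredient list are encoded.
RECORD = [
    Process(1, "Add", [
        Input("Fluorinated biphenyl", [Observation("weight", 0.9031, "g")]),
        Input("Butanone", [Observation("measure", 40, "ml")]),
    ]),
    Process(2, "Add", [
        Input("Potassium Carbonate", [Observation("weight", 2.0719, "g")]),
    ]),
    Process(5, "Add", [
        Input("Br11OCB", [Observation("weight", 1.5918, "g")]),
    ]),
]


def ingredient_list(record):
    """Summarize every weighed or measured input as 'name amount unit'."""
    lines = []
    for process in record:
        for item in process.inputs:
            for obs in item.observations:
                if obs.kind in ("weight", "measure"):
                    # Rounding to two decimals is an assumption for this sketch.
                    lines.append(f"{item.name} {round(obs.value, 2):g} {obs.unit}")
    return lines


for line in ingredient_list(RECORD):
    print(line)
```

The entries come out in process order rather than in the order shown on the slide; the amounts are derived from the observations instead of being typed in a second time.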
myGrid Open Suite of Tools
• Client User Interfaces: Workflow GUI Workbench and 3rd party plug-ins
• Workflow Repository
• Service Catalogue
• Programming and APIs
• Web Portals
• Activity and Service Plug-in Manager
• Provenance Store
• Workflow Server
• Open Provenance Model
• Secure Service Access, and Programming APIs
Recycling, Reuse, Repurposing
http://www.myexperiment.org/
• Share
• Search
• Re-use
• Re-purpose
• Execute
• Communicate
• Record
Project Trident – Scientific Workflow Workbench
Built on Windows Workflow Foundation
Author, Execute and Monitor Workflows
Version 1.2 (Open Source under Apache 2.0 License): http://tridentworkflow.codeplex.com/
Compose and modify workflows via drag & drop canvas
View data products, performance metrics, and provenance data
KNIME
• KNIME (Konstanz Information Miner)
• A user-friendly and comprehensive Open-Source platform for:
– Data integration
– Processing
– Analysis
– Exploration
• Growing vendor adoption
– PerkinElmer, Schrödinger, Tripos, CCG, ChemAxon, etc.
Accelrys Pipeline Pilot: Chemistry
Accelrys Pipeline Pilot: ADME
Accelrys Pipeline Pilot: Biology
Accelrys Pipeline Pilot: Genomics
Imagine…
• Live research reports
– multiple end-user ‘views’
– dynamically tailor presentations
• An authoring environment that absorbs and encapsulates
– research workflows
– outputs from the lab experiments
• A report that can be dropped into an electronic lab workbench and reconstitute an entire experiment
• Dynamically mash up data and workflows across experiments
• Apply new analyses and visualizations and perform new in silico experiments
Part 4 – Impact of the ePaper
Impact of These Innovations
• On Science
• On the Business of Science
• On the Scientific Community
• And Other Emotional Factors . . .
Overall Impacts
Authors will be somewhat inconvenienced to learn new things . . . But as readers and consumers it will clearly be beneficial!
Across Industry and Academia it will be a positive advance
The vendors will be skeptical and reluctant to change – but will move with the spending community!
On the Scientific Community
• This will provide a significantly more capable platform for science
– Extending collaboration
– Easing validation of research
– Offering transfer of knowledge and ease of extension of research projects
• But it DOES further erode the status quo system of rewards and tenure!
And Other Emotional Factors
Is There An Elephant In This Room??
• The Publishers??
• CAS?? Other A&I companies??
• Well what about Electronic Lab Notebooks??
On the Business of Science
• Publishers will need to continue to evolve to find a role as “cool provider” of these tools and become a “hot” distribution center
• A&I companies will need to redefine their role
• Software vendors have a real opportunity, if they can adapt . . .
The Value of the A & I Layers: Abstracting and Indexing in the Future
The Old Days
• Abstracting was Key
• True Assessment of Content
Today
• Indexing is Key
• Precision and Recall
• “Beats” Google every time
Going Forward
• Indexing with Context “Built-In”
• Will Abstracting, or more correctly ‘Content Monitoring’, become the value add?
• Or be a reliable data aggregator?
Part 5 – A Glimpse to the Future
Challenge Or Opportunity
• Rich Content Sources
• Direct Search Tools
• Reproducible Science
• Complete Provenance
The Opportunity Before Us
• Faster Development in an Increasingly Complex World
– Improving reproducibility of scientific results
– Data sharing and collaboration services
– Reliable maintenance of provenance
– Faster availability and efficient query tools
– Secure and/or controlled access to data
– Finding related data and research partners
– Assurance that data will be preserved
• A Brave New World for Scientific Discovery and Research
– Cross-domain partnerships
– Enhanced broad availability of data and prior research
• Improved Knowledge Transfer
– Both upstream and downstream
– Realizing the promise of translational medicine
Thank You!
Rudy Potenzone, SciencePoint Solutions
Lee Dirks, Education & Scholarly Communication, Microsoft Research | Connections
[email protected] or [email protected] – http://www.microsoft.com/scholarlycomm/
Facebook: Scholarly Communication at Microsoft