Page 1: Bloomsbury ePublishing conference

Plotting the future of scientific data

Dr Andrew Walkingshaw

Unilever Centre for Molecular Science Informatics

Department of Chemistry

University of Cambridge

27th June, 2008

Page 2: Bloomsbury ePublishing conference

Mineralogy (crocoite)

Page 3: Bloomsbury ePublishing conference

Schrödinger and his equation

Page 4: Bloomsbury ePublishing conference

Informatics shouting meta-meta-white-thing

“Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access and communicate information. It also develops its own conceptual and theoretical foundations and utilizes foundations developed in other fields. Since the advent of computers, individuals and organizations increasingly process information digitally. This has led to the study of informatics that has computational, cognitive and social aspects, including study of the social impact of information technologies.”

— http://en.wikipedia.org/wiki/Informatics

Page 5: Bloomsbury ePublishing conference

Informatics is about the tools people (and machines) use to talk about things.

Page 6: Bloomsbury ePublishing conference

The seven stages of visualization

• acquire data

• parse data

• filter data

• mine data

• represent data

• refine representation

• add interactivity

“Visualizing Data” by Ben Fry: cover © O’Reilly Media

Page 7: Bloomsbury ePublishing conference

True names

Page 8: Bloomsbury ePublishing conference

The semantic layer cake

Page 9: Bloomsbury ePublishing conference

The really important bit

Names!

Page 10: Bloomsbury ePublishing conference

The Web’s built out of names.

Page 11: Bloomsbury ePublishing conference

The Web is URLs plus links plus HTTP...

so the Web is a graph.

The Semantic Web adds information about the links between bits of data.

Page 12: Bloomsbury ePublishing conference

RDF (Resource Description Framework) is a model for expressing the links between URIs.

(You can write RDF using many different syntaxes, but they’re all equivalent.)

<http://subject/> <http://predicate/> <http://object/> links subject to object with a link of type predicate.

Collect these triples together and you’ve got a global, distributed graph of data.
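As a rough illustration of the triple model, here is a minimal Python sketch that stores a few triples as plain tuples, indexes them by subject, and writes them out in an N-Triples-like form. All of the example.org URIs and the function names are hypothetical; only the subject/predicate/object shape comes from the slide.

```python
from collections import defaultdict

# Hypothetical URIs, chosen only to show the shape of a triple.
triples = [
    ("http://example.org/paper/1", "http://purl.org/dc/terms/contributor", "http://example.org/person/a"),
    ("http://example.org/paper/2", "http://purl.org/dc/terms/contributor", "http://example.org/person/a"),
    ("http://example.org/person/a", "http://example.org/vocab#memberOf", "http://example.org/org/x"),
]


def build_graph(triples):
    """Index triples by subject: subject -> list of (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def to_ntriples(triples):
    """Serialise URI-only triples, one '<s> <p> <o> .' statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


graph = build_graph(triples)
print(to_ntriples(triples))
```

Because every node is named by a URI, triples collected from different sources can simply be appended to the same list and they join up wherever the names match.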

Page 13: Bloomsbury ePublishing conference

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information.

4. Include links to other URIs, so that they can discover more things.

— Tim Berners-Lee

The Rules of Linked Data
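Rules 2 and 3 amount to an ordinary HTTP lookup in which the client says what kind of description it would like back. The sketch below only builds such a request with Python's standard library and does not send it. The URI is hypothetical, and asking for application/rdf+xml is an assumption about what a given server offers.

```python
import urllib.request


def linked_data_request(uri):
    """Build (but do not send) a GET request asking for an RDF description of uri."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


# Hypothetical name for a thing; passing the request to
# urllib.request.urlopen would perform the actual lookup.
request = linked_data_request("http://example.org/thing/1")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```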

Page 14: Bloomsbury ePublishing conference

The web of linked data

Page 15: Bloomsbury ePublishing conference

You can query RDF graphs using SPARQL, the RDF query language.

A quick example – capitals in Africa:

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}

So RDF-formatted data is easy to reuse in programs.
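To make the pattern matching in the capitals query concrete, here is a small Python sketch that answers the same question over an in-memory set of triples, without a SPARQL engine. The data, the `ex:` identifiers and the function names are invented for illustration; only the `abc:` property names come from the query.

```python
ABC = "http://example.com/exampleOntology#"

# Hypothetical data using the property names from the query.
triples = {
    ("ex:cairo", ABC + "cityname", "Cairo"),
    ("ex:cairo", ABC + "isCapitalOf", "ex:egypt"),
    ("ex:egypt", ABC + "countryname", "Egypt"),
    ("ex:egypt", ABC + "isInContinent", ABC + "Africa"),
    ("ex:paris", ABC + "cityname", "Paris"),
    ("ex:paris", ABC + "isCapitalOf", "ex:france"),
    ("ex:france", ABC + "countryname", "France"),
    ("ex:france", ABC + "isInContinent", ABC + "Europe"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the data."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def african_capitals():
    """Return sorted (capital, country) pairs, mirroring the query's WHERE clause."""
    results = []
    for x, predicate, y in triples:
        if predicate != ABC + "isCapitalOf":
            continue  # binds ?x and ?y
        if ABC + "Africa" not in objects(y, ABC + "isInContinent"):
            continue  # keeps only countries in Africa
        for capital in objects(x, ABC + "cityname"):
            for country in objects(y, ABC + "countryname"):
                results.append((capital, country))
    return sorted(results)


print(african_capitals())
```

A real SPARQL engine does this join generically for any pattern; the hand-written loop is only meant to show what the query asks for.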

Page 16: Bloomsbury ePublishing conference

“In cyberspace, people become places.”

— Geoff Ryman, author of ‘253’, in 1996 (http://www.ryman-novel.com/info/about.htm)

Page 17: Bloomsbury ePublishing conference

The information business

"Ultimately, Reuters' news is the raw material for analysis and application by investors and downstream news organizations. Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity. If you don't think of what you produce as the 'final product' but rather as a step in an information pipeline, what do you do differently to add value for downstream consumers? In Reuters' case, Devin thinks you add hooks to make your information more programmable. This is a really important insight, and one I'm going to be chewing on for some time."

— Tim O'Reilly (http://radar.oreilly.com/archives/2008/02/reuters-ceo-sees-semantic-web.html)

Photo by Thomas Hawk (cc-by-nc): http://www.flickr.com/photos/thomashawk/205664675/

Page 18: Bloomsbury ePublishing conference

The news industry...

Page 19: Bloomsbury ePublishing conference

Silo-breaking (real-time search)

Page 20: Bloomsbury ePublishing conference

The long tail of (non)-news (Blogs)

Page 21: Bloomsbury ePublishing conference

... but this is just the past, only faster. What’s genuinely new?

The programmer-journalist.

Page 22: Bloomsbury ePublishing conference

Everyblock: a dynamic newspaper.

Page 23: Bloomsbury ePublishing conference

Crimes in SoMa Live facts

Page 24: Bloomsbury ePublishing conference

DayLife: programmable news

Page 25: Bloomsbury ePublishing conference

OpenCalais: the 5 Ws of journalism

Calais is a service which automatically detects the “atoms of news” in free text: who, what, where, why, when.

Page 26: Bloomsbury ePublishing conference

What might the dynamic journal look like?

Page 27: Bloomsbury ePublishing conference

Journals: sources of names (DOI etc.)

Page 28: Bloomsbury ePublishing conference

Journals: sources of names (DOI etc.)

Names!

Page 29: Bloomsbury ePublishing conference

Journals: sources of data (a page from CrystalEye)

Page 30: Bloomsbury ePublishing conference

Crystallography

• CIF is the standard data format for crystallography:

http://www.iucr.org/iucr-top/cif/

• CrystalEye aggregates “supplementary data” from journals:

http://wwmm.ch.cam.ac.uk/crystaleye/

Photo by cobalt123 (cc-by-nc): http://www.flickr.com/photos/cobalt/94863441/

Page 31: Bloomsbury ePublishing conference

Underneath...

• The data is stored in an open format – CML – which we can program against.

• CrystalEye (the website) is just one presentation of CrystalEye (the data)...

• which itself is an aggregation and transformation of the experimental data underlying journal articles.

Page 32: Bloomsbury ePublishing conference

In triple time...

• Data in RDF would be good — mostly because of SPARQL.

• A bit of sleight of hand: CML...

<cml:property dictRef="castep:Etot">
  <cml:scalar units="cmlunits:eV"> 125.35 </cml:scalar>
</cml:property>

to RDF (in pseudo-NTriples):

<http://doc-url> <castep:Etot> [125.35, "cmlunits:eV"]

• So we can go from CML to RDF automatically via our Golem toolkit (http://www.lexical.org.uk/golem/)
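The CML-to-triple step can be sketched in a few lines of Python. This is not the Golem toolkit's code or API, just an illustration of the mapping using the standard library's XML parser. The namespace URI bound to the cml: prefix is an assumption (the slide does not show the declaration), and `property_to_triple` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the cml: prefix; the slide omits the declaration.
CML_NS = "http://www.xml-cml.org/schema"

cml = f"""
<cml:property xmlns:cml="{CML_NS}" dictRef="castep:Etot">
  <cml:scalar units="cmlunits:eV"> 125.35 </cml:scalar>
</cml:property>
"""


def property_to_triple(doc_url, xml_text):
    """Map one cml:property element to (subject, predicate, (value, units))."""
    prop = ET.fromstring(xml_text)
    scalar = prop.find(f"{{{CML_NS}}}scalar")
    value = float(scalar.text.strip())
    # The dictionary reference names the quantity, so it becomes the predicate.
    return (doc_url, prop.get("dictRef"), (value, scalar.get("units")))


print(property_to_triple("http://doc-url", cml))
```

The point of the sleight of hand is that the dictRef already behaves like a predicate name, so no extra modelling is needed to get from the XML to a triple.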

Photo by sillydog (cc-by-nc-sa): http://www.flickr.com/photos/sillydog/72697229/

Page 33: Bloomsbury ePublishing conference

What CrystalEye doesn’t exploit

Bibliographic data, like:

• author names

• institutions

• date of publication

Photo by Thomas Hawk (cc-by-nc): http://www.flickr.com/photos/thomashawk/205664675/

Page 34: Bloomsbury ePublishing conference

Indexing

• Here’s a poser —

• What papers has an author written?

• Who were their coauthors?

• SPARQL makes this sort of question easy:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file
WHERE {
  ?file dc:contributor '%s' .
}

• And add a front end.
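The two questions in the poser reduce to simple lookups once the contributor links are available as data. Here is a minimal Python sketch over hypothetical (paper, contributor) pairs, standing in for what a query like the one above would return; the identifiers, names and functions are all invented for illustration.

```python
# Hypothetical (paper, contributor) pairs, e.g. extracted from dc:contributor triples.
contributions = [
    ("paper:1", "A. Author"),
    ("paper:1", "B. Builder"),
    ("paper:2", "A. Author"),
    ("paper:2", "C. Chemist"),
    ("paper:3", "D. Other"),
]


def papers_by(author):
    """What papers has an author written?"""
    return sorted({paper for paper, name in contributions if name == author})


def coauthors_of(author):
    """Who were their coauthors? Everyone else credited on the same papers."""
    papers = set(papers_by(author))
    return sorted(
        {name for paper, name in contributions if paper in papers and name != author}
    )


print(papers_by("A. Author"))
print(coauthors_of("A. Author"))
```

Matching on the literal author string is the weak point here, as it is in the query: two people with the same name, or one person with two spellings, will be conflated or split.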

Photo by Reeding Lessons (cc-by-nc-sa): http://www.flickr.com/photos/reedinglessons/2238990839/

Page 35: Bloomsbury ePublishing conference

Automatic indexing (look, no cards)

Page 36: Bloomsbury ePublishing conference

More axes

• By author is good. By subject would be even better.

• We can tag papers automatically, depending on the chemical names and terms in them:

http://oscar3-chem.sourceforge.net/

• OpenCalais for chemistry. (Sort of.)
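As a toy illustration of tagging by the terms in a text, the sketch below looks up a small hand-made lexicon. This is a deliberate simplification and is not how OSCAR3 works; the lexicon, the tags and the `tag_text` function are all hypothetical.

```python
import re

# Hypothetical lexicon mapping terms to subject tags.
LEXICON = {
    "benzene": "aromatic",
    "toluene": "aromatic",
    "sodium chloride": "salt",
    "crocoite": "mineral",
}


def tag_text(text):
    """Return the sorted set of tags whose terms appear as whole words in text."""
    lowered = text.lower()
    tags = set()
    for term, tag in LEXICON.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            tags.add(tag)
    return sorted(tags)


print(tag_text("Crystal structure of a sodium chloride and Benzene solvate"))
```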

Photo by Global Mouser (cc-by-nc-sa): http://www.flickr.com/photos/hryciuk/234940727/

Page 37: Bloomsbury ePublishing conference

Integration

Photo by solofotones (cc-by-nc-sa): http://www.flickr.com/photos/solofotones/1839531915/

Documents about chemistry use the same names no matter where they come from (journals, blogs, mainstream news), so we can use shared terminology to connect journal articles and blogs.

Page 38: Bloomsbury ePublishing conference

Clustering

• What papers are tagged with a given tag?

• What tags co-occur frequently in our corpus?

• Another approach: brute-force. Just cluster all the papers by title – it works much better than you’d expect!
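Both clustering questions can be approached with very little code. The Python sketch below counts how often pairs of tags appear on the same paper, and scores two titles by the overlap of their words (Jaccard similarity). The sample data is invented, and word overlap is only one assumed reading of "cluster by title"; the slide does not say which method was used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged corpus: paper id -> set of tags.
paper_tags = {
    "paper:1": {"zeolite", "silicate"},
    "paper:2": {"zeolite", "silicate", "aluminium"},
    "paper:3": {"peptide"},
}


def tag_cooccurrence(paper_tags):
    """Count how many papers carry each unordered pair of tags."""
    counts = Counter()
    for tags in paper_tags.values():
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


def title_similarity(a, b):
    """Jaccard similarity of the word sets of two titles, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


print(tag_cooccurrence(paper_tags).most_common(1))
print(title_similarity("Crystal structure of X", "Crystal structure of Y"))
```

Pairs of papers whose title similarity exceeds some threshold can then be grouped; choosing that threshold is an empirical matter.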

Page 39: Bloomsbury ePublishing conference

What new questions can we make this data answer?

For example: how has the global distribution of work in crystallography changed over the last few years?

Photo by Leo Reynolds (cc-by-nc-sa): http://www.flickr.com/photos/lwr/12364944/

Page 40: Bloomsbury ePublishing conference

Can we map institutions to locations?

http://geonames.org/ has geodata; you can use it to geotag the authors’ institutional affiliations.

We managed to geotag about 30,000 papers in our corpus successfully.
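The geotagging step can be pictured as a lookup from a place name in the affiliation string to a latitude and longitude. The Python sketch below uses a tiny local table as a stand-in for the geodata; the table, its approximate coordinates and the `geotag` function are hypothetical, and it does not call any GeoNames service.

```python
# Hypothetical stand-in for a gazetteer: place name -> approximate (lat, lon).
GAZETTEER = {
    "cambridge": (52.2, 0.12),
    "tokyo": (35.7, 139.7),
}


def geotag(affiliation):
    """Return (lat, lon) for the last comma-separated part that names a known place."""
    parts = [part.strip().lower() for part in affiliation.split(",")]
    # Addresses usually run from specific to general, so search from the end.
    for part in reversed(parts):
        if part in GAZETTEER:
            return GAZETTEER[part]
    return None  # affiliation could not be geotagged


print(geotag("Department of Chemistry, University of Cambridge, Cambridge, UK"))
```

Returning None for unmatched affiliations reflects the point that only part of a corpus can be geotagged successfully.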

Photo by rogiro (cc-by-nc): http://www.flickr.com/photos/riot/71878108/

Page 41: Bloomsbury ePublishing conference

Making data beautiful

We can learn a lot from artists and graphic designers! http://processing.org/ is great for writing visualizations...

Image by scloopy (cc-by-nc-sa): http://www.flickr.com/photos/onecm/438336731/

Page 42: Bloomsbury ePublishing conference

The process(ing)

• Plot week-by-week

• Each unique location associated with a paper that week gets plotted.

• Points fade out over an eight-week period.

• So:

From http://en.wikipedia.org/wiki/Equirectangular_projection
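The plotting steps above come down to two small calculations: turning a latitude and longitude into pixel coordinates on an equirectangular map, and computing how faded a point is from its age in weeks. The sketch below is in Python rather than Processing, and the linear fade is an assumption; the slides say only that points fade out over eight weeks.

```python
FADE_WEEKS = 8  # points disappear after this many weeks


def project(lat, lon, width, height):
    """Equirectangular projection: degrees -> pixel (x, y), origin at top left."""
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    return x, y


def opacity(age_weeks):
    """Opacity of a point age_weeks after its paper's week (assumed linear fade)."""
    if age_weeks < 0 or age_weeks >= FADE_WEEKS:
        return 0.0
    return 1.0 - age_weeks / FADE_WEEKS


# A point on the equator at the prime meridian lands in the middle of the map.
print(project(0.0, 0.0, 360, 180), opacity(4))
```

Drawing one frame per week then means plotting each unique location from the last eight weeks at its projected position with its current opacity.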

Page 43: Bloomsbury ePublishing conference

The world of... Crystallography 2000-2007

Page 44: Bloomsbury ePublishing conference

The world of... Crystallography 2000-2007

Page 45: Bloomsbury ePublishing conference

Where next?

• What’s going to happen to:

• scientific data capture?

• scientific publishing?

• These are cultural and economic questions as much as anything else; we have the technology.

• What other data can we free, reuse, or find correlations with?

Photo by baboon™ (cc-by-nc): http://www.flickr.com/photos/baboon/92082777/

Page 47: Bloomsbury ePublishing conference

Acknowledgments (I) - my colleagues

• Prof. Peter Murray-Rust, Nick Day, Jim Downing (for CrystalEye)

• Dr Peter Corbett (for OSCAR 3)

• Dr Joe Townsend, Alan Tonge (SPECTRa-T)

• Dr Nico Adams, Nick England, Dr Lezan Hawizy (Polymer Informatics)

• Prof. Martin Dove, Dr Richard Bruin, Dr Kevin Yang, Dr Dan Wilson (MaterialsGrid)

• Dr Toby White (eMinerals project; FoX and http://cmlcomp.org/)

Page 48: Bloomsbury ePublishing conference

Acknowledgments (II) - academic collaborators

• Prof. Kurt Mikkelsen (DALTON)

• Dr Thomas Steinke (VAMP)

• the CASTEP Development Group (http://castep.org/)

• CrystalEye (http://wwmm.ch.cam.ac.uk/crystaleye/):

• the International Union of Crystallography (http://www.iucr.org/)

• the Royal Society of Chemistry (http://rsc.org/)

• OSCAR (http://oscar3-chem.sourceforge.net/):

• the Royal Society of Chemistry (http://rsc.org/)

• Nature Publishing Group (http://nature.com/)

Page 49: Bloomsbury ePublishing conference

Acknowledgments (III)

for access to the Talis N2 platform beta