Bloomsbury ePublishing conference

Plotting the future of scientific data

Dr Andrew Walkingshaw

Unilever Centre for Molecular Science Informatics

Department of Chemistry

University of Cambridge

27th June, 2008

Mineralogy (crocoite)

Schrödinger and his equation

Informatics shouting meta-meta-white-thing

“Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access and communicate information. It also develops its own conceptual and theoretical foundations and utilizes foundations developed in other fields. Since the advent of computers, individuals and organizations increasingly process information digitally. This has led to the study of informatics that has computational, cognitive and social aspects, including study of the social impact of information technologies.”

— http://en.wikipedia.org/wiki/Informatics


Informatics is about the tools people (and machines) use to talk about things.

The seven stages of visualization

• acquire data

• parse data

• filter data

• mine data

• represent data

• refine representation

• add interactivity

“Visualizing Data” by Ben Fry: cover © O’Reilly Media

True names

The semantic layer cake

The really important bit

Names!

The Web’s built out of names.

The Web is URLs plus links plus HTTP...

so the Web is a graph.

The Semantic Web adds information about the links between bits of data.

RDF (Resource Description Framework) is a model for expressing the links between URIs.

(You can write RDF using many different syntaxes, but they’re all equivalent.)

<http://subject/> <http://predicate/> <http://object/> links subject to object with a link of type predicate.

Collect these triples together and you’ve got a global, distributed graph of data.

[Diagram: node http://subject joined to node http://object by an arc labelled http://predicate]
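The triple model can be sketched in a few lines of Python. This is only an illustration of the idea, not a real RDF library: the URIs under http://example.org/ and the helper names `build_graph` and `objects` are made up for the example.

```python
from collections import defaultdict

# Hypothetical triples: (subject, predicate, object). All URIs are invented.
triples = [
    ("http://example.org/paper/1", "http://example.org/terms/author",
     "http://example.org/person/alice"),
    ("http://example.org/paper/2", "http://example.org/terms/author",
     "http://example.org/person/alice"),
    ("http://example.org/paper/2", "http://example.org/terms/cites",
     "http://example.org/paper/1"),
]


def build_graph(triples):
    """Index triples as subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Follow every link of type `predicate` that leaves `subject`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


graph = build_graph(triples)
cited = objects(graph, "http://example.org/paper/2",
                "http://example.org/terms/cites")
print(cited)
```

Because every node and every link type is a URI, triples collected from different sources can simply be concatenated into the same list.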

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information.

4. Include links to other URIs, so that they can discover more things.

— Tim Berners-Lee

The Rules of Linked Data

The web of linked data

You can query RDF graphs using SPARQL, the RDF query language.

A quick example – capitals in Africa:

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}

So RDF-formatted data is easy to reuse in programs.
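To see what the query above is asking for, here is a minimal hand-rolled sketch of triple-pattern matching in plain Python. It is not a SPARQL engine, and a real program would hand the query to one; the `ex:` identifiers and the toy data are assumptions made for illustration, reusing the property names from the example ontology.

```python
ABC = "http://example.com/exampleOntology#"

# Toy data (invented): two cities, two countries, two continents.
triples = [
    ("ex:city1", ABC + "cityname", "Nairobi"),
    ("ex:city1", ABC + "isCapitalOf", "ex:country1"),
    ("ex:country1", ABC + "countryname", "Kenya"),
    ("ex:country1", ABC + "isInContinent", ABC + "Africa"),
    ("ex:city2", ABC + "cityname", "Paris"),
    ("ex:city2", ABC + "isCapitalOf", "ex:country2"),
    ("ex:country2", ABC + "countryname", "France"),
    ("ex:country2", ABC + "isInContinent", ABC + "Europe"),
]

# The WHERE clause as four patterns; terms starting with "?" are variables.
patterns = [
    ("?x", ABC + "cityname", "?capital"),
    ("?x", ABC + "isCapitalOf", "?y"),
    ("?y", ABC + "countryname", "?country"),
    ("?y", ABC + "isInContinent", ABC + "Africa"),
]


def match(pattern, triple, binding):
    """Return `binding` extended to fit `triple`, or None if it cannot fit."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns, triples):
    """Join the patterns one by one, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


results = sorted((b["?capital"], b["?country"])
                 for b in query(patterns, triples))
print(results)
```

The shared variables `?x` and `?y` are what join the patterns together: each one must refer to the same node everywhere it appears.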


“In cyberspace, people become places.”

— Geoff Ryman, author of ‘253’, in 1996 (http://www.ryman-novel.com/info/about.htm)


The information business

"Ultimately, Reuters' news is the raw material for analysis and application by investors and downstream news organizations. Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity. If you don't think of what you produce as the 'final product' but rather as a step in an information pipeline, what do you do differently to add value for downstream consumers? In Reuters' case, Devin thinks you add hooks to make your information more programmable. This is a really important insight, and one I'm going to be chewing on for some time."

— Tim O'Reilly (http://radar.oreilly.com/archives/2008/02/reuters-ceo-sees-semantic-web.html)

Photo by Thomas Hawk (cc-by-nc): http://www.flickr.com/photos/thomashawk/205664675/



The news industry...

Silo-breaking (real-time search)

The long tail of (non)-news (Blogs)

... but this is just the past, only faster. What’s genuinely new?

The programmer-journalist.

Everyblock: a dynamic newspaper.

Crimes in SoMa Live facts

DayLife programmable news

The 5 Ws of journalism OpenCalais

Calais is a service which automatically detects the “atoms of news” in free text: who, what, where, why, when.

What might the dynamic journal look like?

Journals: sources of names DOI etc


Names!

Journals: sources of data (a page from CrystalEye)

Crystallography

• CIF is the standard data format for crystallography:

http://www.iucr.org/iucr-top/cif/

• CrystalEye aggregates “supplementary data” from journals:

http://wwmm.ch.cam.ac.uk/crystaleye/

Photo by cobalt123 (cc-by-nc): http://www.flickr.com/photos/cobalt/94863441/







http://www.flickr.com/photos/felix42/76536001/


Underneath...

• The data is stored in an open format – CML – which we can program against.

• CrystalEye (the website) is just one presentation of CrystalEye (the data)...

• which itself is an aggregation and transformation of the experimental data underlying journal articles.

In triple time...

• Data in RDF would be good — mostly because of SPARQL.

• A bit of sleight of hand: CML...

<cml:property dictRef="castep:Etot">
  <cml:scalar units="cmlunits:eV"> 125.35 </cml:scalar>
</cml:property>

to RDF (in pseudo-NTriples):

<http://doc-url> <castep:Etot> [125.35, "cmlunits:eV"]

• So we can go from CML to RDF automatically via our Golem toolkit (http://www.lexical.org.uk/golem/)
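As a rough illustration of that mapping (and not of how Golem itself works), the sketch below uses Python's standard `xml.etree.ElementTree` to turn each CML property into a (document, property, value-with-units) tuple. The wrapper `<cml:cml>` element, the namespace URI, and the function name `cml_to_triples` are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI; the wrapper document is invented for illustration.
CML_NS = "http://www.xml-cml.org/schema"

doc = f"""
<cml:cml xmlns:cml="{CML_NS}">
  <cml:property dictRef="castep:Etot">
    <cml:scalar units="cmlunits:eV"> 125.35 </cml:scalar>
  </cml:property>
</cml:cml>
""".strip()


def cml_to_triples(xml_text, doc_url):
    """Emit one pseudo-triple per <cml:property> that holds a scalar."""
    ns = {"cml": CML_NS}
    root = ET.fromstring(xml_text)
    triples = []
    for prop in root.iterfind(".//cml:property", ns):
        scalar = prop.find("cml:scalar", ns)
        if scalar is None or scalar.text is None:
            continue  # skip properties without a scalar value
        value = float(scalar.text.strip())
        # dictRef names the property; units travel with the value.
        triples.append((doc_url, prop.get("dictRef"),
                        (value, scalar.get("units"))))
    return triples


cml_triples = cml_to_triples(doc, "http://doc-url")
print(cml_triples)
```

The dictionary reference plays the role of the predicate, which is why the conversion can be automatic: the CML already names what each number means.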

Photo by sillydog (cc-by-nc-sa): http://www.flickr.com/photos/sillydog/72697229/




What CrystalEye doesn’t exploit

Bibliographic data, like:

• author names

• institutions

• date of publication

Photo by Thomas Hawk (cc-by-nc): http://www.flickr.com/photos/thomashawk/205664675/



Indexing

• Here’s a poser —

• What papers has an author written?

• Who were their coauthors?

• SPARQL makes this sort of question easy:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file WHERE {
  ?file dc:contributor '%s' .
}

• And add a front end.
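The two questions in the poser can be sketched over plain (file, contributor) pairs, standing in for the `dc:contributor` triples the query matches. The records and names below are invented, not CrystalEye data, and `build_index` and `coauthors` are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical (file, contributor) pairs; none of this is real data.
contributions = [
    ("file-1", "A. Author"), ("file-1", "B. Writer"),
    ("file-2", "A. Author"), ("file-2", "C. Scribe"),
    ("file-3", "B. Writer"),
]


def build_index(contributions):
    """Build both directions of the author/paper relation."""
    papers_by_author = defaultdict(set)
    authors_by_paper = defaultdict(set)
    for paper, author in contributions:
        papers_by_author[author].add(paper)
        authors_by_paper[paper].add(author)
    return papers_by_author, authors_by_paper


def coauthors(author, papers_by_author, authors_by_paper):
    """Everyone who shares at least one paper with `author`."""
    found = set()
    for paper in papers_by_author.get(author, ()):
        found |= authors_by_paper[paper]
    found.discard(author)
    return found


papers_by_author, authors_by_paper = build_index(contributions)
print(sorted(papers_by_author["A. Author"]))
print(sorted(coauthors("A. Author", papers_by_author, authors_by_paper)))
```

Matching on the literal name string is the weak point of this approach: two spellings of one author count as two people, which is another reason names matter.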

Photo by Reeding Lessons (cc-by-nc-sa): http://www.flickr.com/photos/reedinglessons/2238990839/







Automatic indexing (look, no cards)

More axes

• By author is good. By subject would be even better.

• We can tag papers automatically, depending on the chemical names and terms in them:

http://oscar3-chem.sourceforge.net/

• OpenCalais for chemistry. (Sort of.)

Photo by Global Mouser (cc-by-nc-sa): http://www.flickr.com/photos/hryciuk/234940727/





Integration

Photo by solofotones (cc-by-nc-sa): http://www.flickr.com/photos/solofotones/1839531915/

Documents about chemistry use the same names no matter where they come from (journals, blogs, mainstream news), so we can use shared terminology to connect journal articles and blogs.

http://www.flickr.com/photos/bretarnett/136222945/


Clustering

• What papers are tagged with a given tag?

• What tags co-occur frequently in our corpus?

• Another approach: brute-force. Just cluster all the papers by title – it works much better than you’d expect!
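The first two questions reduce to simple counting once each paper has a set of tags. A minimal sketch, assuming an invented corpus of three tagged papers (the tags and the helper names are illustrative only):

```python
from collections import Counter
from itertools import combinations

# Hypothetical corpus: paper identifier -> set of automatically assigned tags.
tagged = {
    "paper-1": {"benzene", "toluene"},
    "paper-2": {"benzene", "toluene", "zeolite"},
    "paper-3": {"zeolite"},
}


def papers_with_tag(tagged, tag):
    """All papers carrying a given tag."""
    return sorted(paper for paper, tags in tagged.items() if tag in tags)


def cooccurrence(tagged):
    """Count how often each unordered pair of tags appears on the same paper."""
    counts = Counter()
    for tags in tagged.values():
        # Sorting makes ("a", "b") and ("b", "a") the same key.
        counts.update(combinations(sorted(tags), 2))
    return counts


pair_counts = cooccurrence(tagged)
print(papers_with_tag(tagged, "zeolite"))
print(pair_counts.most_common(1))
```

Pairs with high counts are candidates for "these topics belong together", which is one simple starting point for clustering.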

What new questions can we make this data answer?

For example: how has the global distribution of work in crystallography changed over the last few years?

Photo by Leo Reynolds (cc-by-nc-sa): http://www.flickr.com/photos/lwr/12364944/



Can we map institutions to locations?

http://geonames.org/ has geodata; you can use it to geotag the authors’ institutional affiliations.

We managed to geotag about 30,000 papers in our corpus successfully.
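The lookup step might look something like the sketch below. It does not call GeoNames: a small in-memory gazetteer stands in for the service, the coordinates are approximate, and the matching rule (take the last comma-separated part of the affiliation that names a known place) is an assumption, not the method actually used.

```python
# Stand-in gazetteer: place name -> (latitude, longitude), approximate values.
# A real run would look these up in GeoNames instead.
GAZETTEER = {
    "cambridge": (52.2053, 0.1218),
    "tokyo": (35.6895, 139.6917),
}


def geotag(affiliation, gazetteer=GAZETTEER):
    """Return (place, lat, lon) for the last recognisable place name, or None."""
    parts = [part.strip() for part in affiliation.split(",")]
    # Affiliations usually end with city and country, so scan from the end.
    for part in reversed(parts):
        key = part.lower()
        if key in gazetteer:
            lat, lon = gazetteer[key]
            return part, lat, lon
    return None


tagged_location = geotag(
    "Department of Chemistry, University of Cambridge, Cambridge, UK")
print(tagged_location)
print(geotag("Institute of Unknown Studies"))
```

Affiliations that match nothing come back as `None`, so they can be counted and inspected rather than silently dropped.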

Photo by rogiro (cc-by-nc): http://www.flickr.com/photos/riot/71878108/




Making data beautiful

We can learn a lot from artists and graphic designers! http://processing.org/ is great for writing visualizations...

Image by scloopy (cc-by-nc-sa): http://www.flickr.com/photos/onecm/438336731/




The process(ing)

• Plot week-by-week

• Each unique location associated with a paper that week gets plotted.

• Points fade out over an eight-week period.

• So:

From http://en.wikipedia.org/wiki/Equirectangular_projection
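The plotting steps can be sketched as two small functions: one maps latitude and longitude onto an equirectangular map, the other fades a point over eight weeks. This is written in Python rather than Processing, and the linear fade is an assumption; the slides only say that points fade out over the period.

```python
FADE_WEEKS = 8  # points disappear after this many weeks


def equirectangular(lat, lon, width, height):
    """Map degrees of latitude/longitude to pixel coordinates.

    Longitude -180..180 spans the width; latitude 90..-90 spans the height,
    with north at the top of the image.
    """
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    return x, y


def opacity(week_plotted, current_week, fade_weeks=FADE_WEEKS):
    """Opacity in [0, 1]: full when first plotted, zero once fully faded."""
    age = current_week - week_plotted
    if age < 0 or age >= fade_weeks:
        return 0.0
    return 1.0 - age / fade_weeks


# One frame of the animation: draw every location still visible this week.
points = [(52.2053, 0.1218, 10), (35.6895, 139.6917, 4)]  # (lat, lon, week)
frame = [
    (equirectangular(lat, lon, 720, 360), opacity(week, 12))
    for lat, lon, week in points
    if opacity(week, 12) > 0.0
]
print(frame)
```

The equirectangular projection is convenient here because the mapping is linear in both axes, so no trigonometry is needed per point.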



The world of... Crystallography 2000-2007

Where next?

• What’s going to happen to:

• scientific data capture?

• scientific publishing?

• These are cultural and economic questions as much as anything else; we have the technology.

• What other data can we free, reuse, or find correlations with?

Photo by baboon™ (cc-by-nc): http://www.flickr.com/photos/baboon/92082777/


Photo by psd (cc-by): http://www.flickr.com/photos/psd/2086641/



Acknowledgments (I) - my colleagues

• Prof. Peter Murray-Rust, Nick Day, Jim Downing (for CrystalEye)

• Dr Peter Corbett (for OSCAR 3)

• Dr Joe Townsend, Alan Tonge (SPECTRa-T)

• Dr Nico Adams, Nick England, Dr Lezan Hawizy (Polymer Informatics)

• Prof. Martin Dove, Dr Richard Bruin, Dr Kevin Yang, Dr Dan Wilson (MaterialsGrid)

• Dr Toby White (eMinerals project; FoX and http://cmlcomp.org/)


Acknowledgments (II) - academic collaborators

• Prof. Kurt Mikkelsen (DALTON)

• Dr Thomas Steinke (VAMP)

• the CASTEP Development Group (http://castep.org/)

• CrystalEye (http://wwmm.ch.cam.ac.uk/crystaleye/):

• the International Union of Crystallography (http://www.iucr.org/)

• the Royal Society of Chemistry (http://rsc.org/)

• OSCAR (http://oscar3-chem.sourceforge.net/):

• the Royal Society of Chemistry (http://rsc.org/)

• Nature Publishing Group (http://nature.com/)


Acknowledgments (III)

for access to the Talis N2 platform beta