Post on 19-Jan-2015

The Seven Deadly Sins of Bioinformatics

Professor Carole Goble
carole.goble@manchester.ac.uk

The University of Manchester, UK
The myGrid project

OMII-UK

The traditional sins…
• Lust
• Gluttony
• Greed
• Sloth
• Wrath
• Envy
• Pride

http://en.wikipedia.org/wiki/Seven_deadly_sins
[Stevens and Lord]

Methodology
• Email a handful of bioinformaticians.
• Stand well back.
• Collect.
• Edit.

• Therapy on the cheap.
• We all felt better.

I am grateful to…
• Phil Lord (University of Newcastle)
• Anil Wipat (University of Newcastle)
• Matthew Pocock (University of Newcastle)
• Robert Stevens (University of Manchester)
• Paul Fisher (University of Manchester)
• Duncan Hull (Manchester Centre for Systems Biology)
• Norman Paton (University of Manchester)
• Marco Roos (University of Amsterdam)
• Rodrigo Lopez (EBI)
• Tom Oinn (EBI)
• Andy Law (Roslin Institute)
• Graham Cameron (EBI)

Sins
1. Parochialism and Insularity
2. Exceptionalism
3. Autonomy or death!
4. Vanity: Pride and Narcissism
5. Monolith Megalomania
6. Scientific method Sloth
7. Instant Gratification

Sin 1

Parochialism
• “being provincial, being narrow in scope, or considering only small sections of an issue.” http://en.wikipedia.org/wiki/Parochialism

Insularity
• “a person, group of people, or a community that is only concerned with their limited way of life and not at all interested in new ideas or other cultures.” http://en.wikipedia.org/wiki/Insularity

Reinvention
• Reinventing the Wheel. Rediscovering the same problems. Rediscovery of techniques & methods.

Creating…
• Yet another identity scheme. Yet another representation mechanism for data.
• Yet another ontology. Yet another data warehouse.
• Yet another integration framework. Yet another query or ontology or workflow language.

Result? Misery. Or more work for the boys.

Comparative Genomics? Tsk! It's Comparative Bioinformatics

Bioinformatics is about mapping one schema to another, one format to another, one id scheme to another.

What a waste of time. What a handy distraction from doing some Real Science™.

Names and Identity Crisis

Q92983 O00275 O00276 O00277 O00278 O00279 O00280 O14865 O14866 P78507

• WSL-1 protein
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death domain receptor 3
• WSL protein
• Apoptosis-inducing receptor AIR
• Apo-3
• Lymphocyte-associated receptor of death
• LARD
• GENE: Name=TNFRSF25

Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor

P78515 Q93036 Q93037 Q99722 Q99830 Q99831 Q9BY86 Q9UME0 Q9UME1 Q9UME5

Annotation history: http://www.expasy.org/uniprot/Q93038
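To make the identity problem on this slide concrete, here is a minimal sketch of an alias index that resolves any of the listed accessions or names to a single entry. It assumes that every accession and name on the slide refers to Q93038; the function names (`build_alias_index`, `resolve`) are hypothetical and not part of any UniProt tooling.

```python
from typing import Dict, Iterable, Optional

# Accessions and names copied from the slide. Treating all of them as
# aliases of Q93038 is an assumption made for this sketch.
PRIMARY_ACCESSION = "Q93038"

OTHER_ACCESSIONS = [
    "Q92983", "O00275", "O00276", "O00277", "O00278", "O00279", "O00280",
    "O14865", "O14866", "P78507", "P78515", "Q93036", "Q93037", "Q99722",
    "Q99830", "Q99831", "Q9BY86", "Q9UME0", "Q9UME1", "Q9UME5",
]

NAMES = [
    "WSL-1 protein", "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP", "Death domain receptor 3",
    "WSL protein", "Apoptosis-inducing receptor AIR", "Apo-3",
    "Lymphocyte-associated receptor of death", "LARD", "TNFRSF25",
]


def normalise(label: str) -> str:
    """Collapse whitespace and ignore case so lookups tolerate sloppy input."""
    return " ".join(label.split()).casefold()


def build_alias_index(primary: str, accessions: Iterable[str],
                      names: Iterable[str]) -> Dict[str, str]:
    """Map every known alias (and the primary accession itself) to the primary."""
    index = {normalise(primary): primary}
    for alias in list(accessions) + list(names):
        index[normalise(alias)] = primary
    return index


def resolve(index: Dict[str, str], label: str) -> Optional[str]:
    """Return the primary accession for a label, or None if it is unknown."""
    return index.get(normalise(label))


ALIAS_INDEX = build_alias_index(PRIMARY_ACCESSION, OTHER_ACCESSIONS, NAMES)
```

For example, `resolve(ALIAS_INDEX, "apo-3")` and `resolve(ALIAS_INDEX, "O00275")` both lead back to the same entry, which is exactly the mapping work the slide complains about having to redo by hand.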

Andy Law's Third Law
• “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”... and is frequently many, many more.

http://bioinformatics.roslin.ac.uk/lawslaws.html

The Selfish Scientist

“A biologist would rather share their toothbrush than their (gene) names”
Mike Ashburner, Professor of Genetics, University of Cambridge, UK

Amongst the many

Some causes of the Identity Crisis
• Conflation of the ID for a thing, something to call the thing, a description of the thing, with the thing itself (reference/referent)
• Internal vs external IDs
• Opaque vs human-interpretable IDs
• Situation-dependent 'parts' of a resource get different IDs
  – e.g. the gene in a disease process vs the disease in a metabolic process
• Annotation attribution and log differentiation
  – Two organisations attach annotations to two IDs, state they are referring to the same thing, they now have provenance about which of them asserted which facts

[Pocock]

Id Reinvention
• Global Identity naming mechanism for data objects in the Life Sciences
• LSIDs and URIs and PURLs. WS-Naming and all its friends
• Half the debaters haven’t actually read the LSID or URL or PURL specs. Or provided use cases.
• Web Pages are not Data Assets.
• “you could do this with HTTP based identifiers given <insert hack>”.
• The debate rages! 124 messages in the last week.
• W3C Semantic Web Health Care and Life Sciences Interest Group public-semweb-lifesci@w3.org

urn:lsid:uniprot.org:{db}:{id}
http://purl.uniprot.org/{db}/{id}
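The two identifier templates on this slide carry the same `{db}` and `{id}` parts, so converting between them is mechanical. The sketch below only handles the two templates exactly as written on the slide; it is not a general LSID or PURL resolver, and the function names are hypothetical.

```python
import re

# Patterns mirror the two templates shown on the slide:
#   urn:lsid:uniprot.org:{db}:{id}
#   http://purl.uniprot.org/{db}/{id}
LSID_PATTERN = re.compile(r"^urn:lsid:uniprot\.org:([^:/]+):([^:/]+)$")
PURL_PATTERN = re.compile(r"^http://purl\.uniprot\.org/([^:/]+)/([^:/]+)$")


def lsid_to_purl(lsid: str) -> str:
    """Rewrite a uniprot.org LSID into the HTTP PURL form."""
    match = LSID_PATTERN.match(lsid)
    if match is None:
        raise ValueError(f"not a uniprot.org LSID: {lsid!r}")
    db, ident = match.groups()
    return f"http://purl.uniprot.org/{db}/{ident}"


def purl_to_lsid(purl: str) -> str:
    """Rewrite a purl.uniprot.org URL into the LSID form."""
    match = PURL_PATTERN.match(purl)
    if match is None:
        raise ValueError(f"not a purl.uniprot.org URL: {purl!r}")
    db, ident = match.groups()
    return f"urn:lsid:uniprot.org:{db}:{ident}"
```

The `uniprot` database segment used when trying this out (for example `urn:lsid:uniprot.org:uniprot:Q93038`) is an illustrative value for `{db}`, not something the slide specifies.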

Andy Law’s First (Format) Law
• “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”
• Different codes to signify the sex of animals.
• crimap uses '0' female and '1' male.
• Keightly algorithm: '1' female and '0' male.
• Knott & Haley QTL analysis algorithm: '1' female and '2' male.
• When they use '3' and '4', then we'll know they're doing it deliberately.

http://bioinformatics.roslin.ac.uk/lawslaws.html
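The sex-code example above is the kind of format mapping that ends up being rewritten in every lab. A minimal sketch of a translation table follows; the code values are taken from the slide, while the tool keys (`crimap`, `keightly`, `knott_haley`) and function names are labels invented for this illustration.

```python
from enum import Enum


class Sex(Enum):
    FEMALE = "female"
    MALE = "male"


# Code tables as quoted on the slide; the dictionary keys are hypothetical labels.
SEX_CODES = {
    "crimap": {"0": Sex.FEMALE, "1": Sex.MALE},
    "keightly": {"1": Sex.FEMALE, "0": Sex.MALE},
    "knott_haley": {"1": Sex.FEMALE, "2": Sex.MALE},
}


def decode_sex(tool: str, code: str) -> Sex:
    """Interpret a tool-specific sex code."""
    try:
        return SEX_CODES[tool][code]
    except KeyError:
        raise ValueError(f"unknown tool or code: {tool!r}, {code!r}") from None


def encode_sex(tool: str, sex: Sex) -> str:
    """Produce the code a given tool expects for a sex."""
    for code, value in SEX_CODES[tool].items():
        if value is sex:
            return code
    raise ValueError(f"{tool!r} has no code for {sex}")


def translate_sex(code: str, source: str, target: str) -> str:
    """Convert a sex code from one tool's convention to another's."""
    return encode_sex(target, decode_sex(source, code))
```

Note that the same character, '1', means male in one convention and female in the other two, so a file passed between tools without translation is silently wrong rather than visibly broken.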

Reinvention of Ontology tools
• OBO and OWL?
• OBOEdit and Protégé-OWL?

The Montagues and The Capulets..

Let me get my bullet-proof vest …

The Montagues and The Capulets …
SOFG 2004, KCap 2005, Comparative and Functional Genomics 2004

[Diagram: three camps]
• Capulets: Life Scientists. Pragmatists. A means to an end. Content providers.
• Montagues: Knowledge Representation. Theoreticians. Aesthetics. The end. Mechanism providers.
• Philosophers: Spiritual guides. Endurants, Perdurants, Being, Substance, Event.

The “Oh No” OBO

Yet another database …
• Organism databases
• Counter example: Generic Model Organism Database Toolkit. FlyBase, WormBase, SGD, BeeBase and many other large and small community databases

BioBabel
• bioperl
• biojava
• biopython
• bioruby
• biophp
• biosql
• biouml
• biofoo
• biobar

Integration
• Workflow Management Systems
• Counter example
• Taverna

http://www.mygrid.org.uk

Any more?
• Another Web 2.0 Web Site? Another Web interface to a database? Another portal?
• Whole database systems. ACeDB is not a lone case.
• Genome data compilers for E. coli, Drosophila, Plant species, etcetera reuse each other's code?
• Text miners require synonyms and reinvent the wheel to get them in many cases.
• Add your favourite here….

Reuse Rocks. Collaboration through workflow and web services

VL-e Project
• “Instant collaboration” with Martijn Schuemie (Rotterdam) through a web service that discloses their protein synonym data.
• Exchanging services and (sub)workflows with food scientists.
• Web services make that easier.

[Diagram labels: Scientific experiment: a meta workflow; Sub workflow 1 to 4; Taverna, Kepler, Triana, VLAMG; Workflow bus; Generic Grid middleware]

Workflow bus: provide services for 1) Interoperability and integration, 2) Composition, 3) Provenance, 4) Enactment, 5) Human-in-the-loop computing

Recycling, Reuse, Repurposing
• A Trypanosomiasis in Cattle workflow (by Paul) reused without change for Trichuris muris Infection (by Jo).
• Identified the biological pathways believed to be involved in the ability of mice to expel the parasite.
• Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. Users add value.

Warning! Reuse is Hard
• Writing reusable workflows is hard.
  – Local services
  – Permissions. Licences
  – What does it DO?
• Writing reusable services is hard.
  – What does it DO?
  – Predicting the unknown required by the unknown.
• Finding workflows, services and tools is hard
  – Where do you go?? What does it DO??
• Creating web services is still a bottleneck. For quick solutions it is still seen as too much extra trouble.

Bullying and the Borg
• If a group is working in a field, you get bullied for trying out something different.
  – Can YOU think of an example??
• You may actually be doing something different, but you use some common words.
• “Why do this? It's already been solved by Foo - the massively unwieldy, slow-moving, monolithic, meeting-paralysed international effort for Things Mentioning Foo”.

Reinvention or Invention? Pre-dating
• BioMOBY pre-dates the (Semantic) Web service revolution
• OBO and OBO-Edit pre-date OWL and Protégé-OWL
  – 20 years of Knowledge Representation.
• Taverna pre-dates a reliable Open Source BPEL engine
  – 20 years of functional programming.
• There ARE features that Bioinformatics needs that other solutions don’t cater for.

A few months in the laboratory (or the computer) can save a few hours in the library (or on Google).
Westheimer's Law (with additions).

No tool is an island…
• Assume
  – only we will use it, whatever it may be.
  – that it will be freestanding and unlinked to anything else.
  – that it will always work and will keep on working.
  – that everyone will understand it.
• “Well I know what I mean. And so does my mate. So I don’t need to specify it. Or document it properly. Or keep the metadata up to date.”
• Never mind the interface, just look at my implementation!
• Metadata matters. Models matter.
• Interfaces matter. Services matter.

Not just bioinformatics
Computer Science is Guilty!

Why don’t biologists modularise OWL ontologies properly?

Er, well, like how should we do it “properly” and where are the tools to help us?

We don’t know and we haven’t got any. But here are some vague guidelines.

W3C Semantic Web for Life Sciences mailing list, 2005

Standards are boring (but important)
• “Blue collar Science” (John Quackenbush)
• Nobody is going to win a Nobel prize for creating a standard schema, ontology or whatever. (Duncan Hull)
• “Standardise where you need standards, don’t where you don’t. Standardise messages not structures” (Graham Cameron)
• Drive on the left or the right?

Self promotion
• Not making shareable reusable software, because we can publish every single monolithic software solution.
• And get promoted.
• Applies equally to databases and ontologies.
• Production vs Novelty

Not all software and databases are equal.

Trust
I don’t trust your code
I don’t trust your data
I don’t trust you will still be around in 1 year

Sin 2

Exceptionalism
• Biologist exceptionalism
• Biological exceptionalism
• Biology exceptionalism

A cause of Reinvention Syndrome
“Bioinformatics is special”
“Domain-specific outcomes require domain-specific approaches and technologies”

Biologist exceptionalism

• I know there is already a gene name for that gene, but, I don't like it and it doesn't fit in with my schema.

• It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place.

I’m different. We are all individuals.

Biological exceptionalism
• “Biology is all exception.”
• “Don’t complicate everyone’s life for the sake of a few esoteric cases”. Cameron’s 5th Commandment of Curation
• Exceptionalism paralysis.
• Gather requirements expansively, prune ruthlessly
• The EMBL/GenBank/DDBJ Feature Table

We are so much more complex…
• “There are proteins, and there are records about proteins. Records come in different formats. If I make a statement using this url, is it about the record? or the protein?” Alan Ruttenberg
• “[Usually] we have one entry per gene. We have several entries for a single gene when description of variations are too complicated to describe in FT lines (of course, this criteria depends on the annotator). For viruses, it is much more messy, due to ribosomal frame-shifts. Formalise that!” Eric Jain, UniProtDB

er…decomposition and untangling?

Other Sciences….
• CERN: UML meta-modelling mechanisms in order to migrate models over time without losing data.
• Ensembl: “Our data models are complicated - I don't think specifying them will help. We need to understand them instead.”
• And?
• Confusing meta-mechanisms with models

Biology Exceptionalism
• Biology is harder than anything else in the whole wide world because there is lots of it and it's complicated.
• Drawing graphs of data sets over time.
• Physics wipes you off the map.
• The real problem is complexity not scale.
• The number of data sets, their diversity and how they overlap.
• How they change.
• Their Reliability.

Sin 3
• Autonomy or death!
• Combined with churn and indifference to users.
• Compounded by the Early Adopter tendency of the community and a monopoly mentality.
• “Hell is other people’s systems” as Jean-Paul Sartre would have said if he had been a bioinformatician.

Autonomy is death!
• Change my interface / format whenever I feel like it, despite the fact I wanted lots of users and I have lots of users who depend on this. And I won’t bother to debug either or provide backwards compatibility.
  – BioMART changed 4 times in the past year.
  – NCBI changes as it fancies.
  – Ensembl relational schema.
  – Early BioJava.
• This is just unprofessional.
• Stable Metadata matters. Stable Models matter. Stable Interfaces matter. Stable Services matter.

Lincoln Stein said a while ago…

An interface is a contract between data provider and data consumer
• Document interface; warn if it is unstable
• Do not make changes lightly
  – Even little fiddly changes can break things
  – Provide plenty of advance warning
• When possible, maintain legacy interfaces until clients can port their scripts

Support as many interfaces as you can
• HTML (least desired)
• Text only (better)
• HTTP-XML (even better)
• SOAP-XML (sweet!)
• Easy Interfaces + Power User Interfaces

…and he could say it again today.
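One way to read "maintain legacy interfaces" and "provide plenty of advance warning" in code is to keep the old entry point as a thin wrapper that warns and delegates. This is a minimal sketch under that reading; `fetch_record`, `get_entry` and the record content are hypothetical names and data, not any real service's API.

```python
import warnings
from xml.sax.saxutils import escape, quoteattr


def fetch_record(accession: str, fmt: str = "text") -> str:
    """Current interface: return a record in the requested format."""
    # Placeholder content; a real provider would look the record up.
    description = "example record"
    if fmt == "text":
        return f"{accession}\t{description}"
    if fmt == "xml":
        return (f"<record accession={quoteattr(accession)}>"
                f"{escape(description)}</record>")
    raise ValueError(f"unsupported format: {fmt!r}")


def get_entry(acc: str) -> str:
    """Legacy interface, kept so existing client scripts keep working."""
    warnings.warn(
        "get_entry() is deprecated; port your script to fetch_record()",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return fetch_record(acc, fmt="text")
```

Clients calling `get_entry` get the same text output as before plus a warning they can act on, while new clients can ask `fetch_record` for either the text or the XML form.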

Workflow commodities
• Workflow published with its paper and its data set.
• So what happens when I want to run this workflow again?
• Is the service dead?
• Is the dataset still there?
• Was it designed to be reproduced or reused in the first place?

The myGrid Semantic Sweatshop: notice how tired they look

[Photo: Franck Tanoh, Katy Wolstencroft]

Churn, Churn, Churn
• “Stability is more important than Standards or Smartness. Discuss”
• Constant churn and change for change's sake.
  – Impact on everyone else who uses the previous mechanism.
• A few voices, very loud, vested interest, for their application, win.
• You know what? Why don’t we stick with something for a while and rally behind it? Or at least figure out the cost of change.
• Maybe this is a sin inherited from Computer Science.

Sin 4

Vanity
Pride
Narcissism: conceit, egotism or simple selfishness.

Applied to a social group, denotes elitism or an indifference to the plight of others

I know it all.
• Claiming to know everything about biology and everything about computers.
• This is really irritating to both biologists and computer scientists.
• Even they don’t claim to know everything about biology or computer science.
• Computer scientists do know a lot of stuff. And they publish too.
• “Biologists are the experts on everything because we produce the data”

And what would you suggest, Mr. Smartie Pants?

Think like me!
• Building interfaces that only you can use.
• Not actually using your tools in the field.
• I understand workflows
• Workflows are for biologists.
• My granny can do workflows...
• Designing good experiments is hard.
• Workflows are computational experimental protocols. Ergo….
• Writing workflows should be expected to be hard.
• Writing good workflows is really hard.
• Writing good reusable workflows is really really hard.

Misunderstanding and disrespecting users

A good User Experience outweighs smart features.
Can I use it?
Is the user interface familiar?
Does it fit with my needs?

Sin 5
• Monolith Megalomania
• delusions of grandeur.
• obsession with grandiosity and extravagance.
• Data mining - “my data is mine, and your data is mine”

More, more, more!
• Integration – the more the merrier. No.
  – Every link is a potential dead link.
  – Every dependency can find its way on to your critical path.
  – Monolithic solutions always fail.
• Put it all in a warehouse.
  – ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART blah blah blah…
  – Toolkits: Information Integrator, GMOD, BioMART, BioWarehouse, blah blah…
  – 50% of warehouses fail.
• “Uber-tools” and “Uber-databases”
  – Biomart, Ensembl, etc etc….

[Cameron]

The trouble with warehouses
• 30% of data migration projects fail (Source: Standish Group)
• 50% of data warehousing / Business Intelligence projects fail (Source: NCR)
• “Warehouses work? Piffle. They never manage to maintain synchrony with the source data. Mostly they fall down of their own weight!” Graham Cameron, EMBL-EBI
• "Our ability to capture and store data far outpaces our ability to process and exploit it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again. Data tombs also represent missed opportunities." Usama Fayyad, Yahoo! Research! Laboratories!
• “We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable” Microsoft 2020 Science report.

More More More
• “Emacs of Biology”
• End-user apps/libraries in bioinformatics workbenches with loads of crap bundled in, none of it kept up to date, none of it properly integrated.
• Keep it simple and modular
• Don’t reinvent Eclipse.

Distributed Annotation System Mash-Up
http://www.biodas.org

[Diagram: a Reference Server (AC003027, AC005122, M10154) and three Annotation Servers, whose labels include AC003027, M10154, AC005122, WI1029, AFM820, AFM1126 and WI443]

Sin 6
• Scientific Method Sloth
• It's easier to think of a new name than use someone else’s.
• I want my own view over data and views are difficult, so I’ll create my own database.
• Leads to Reinvention, Exceptionalism
• Often the result of Instant Gratification

Ennui
• Garbage in, garbage out
  – Running analysis over the wrong datasets
  – E.g. Identifying chicken proteins in mouse cells.
• Configuration traditionalism
  – Not changing the parameters of BLAST. Ever.
• Top list ennui
  – If there is a list, only looking at the first one.
  – Look no further than the first Blast hit / first Google hit.
• Arbitrary cut-offs on rank-ordered result list
  – Absolute truth above, absolute falsehood below
  – E.g. differentially expressed genes in microarray analyses.
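As a small illustration of the "top list ennui" point, the sketch below returns every hit whose score is close to the best one instead of only the first. The hit names, scores and the 5% tolerance are invented for the example; choosing the tolerance is itself a judgment call of the kind the slide warns about, so it is exposed as a parameter rather than hidden.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (hit name, score); higher scores are better


def near_top_hits(hits: List[Hit], tolerance: float = 0.05) -> List[Hit]:
    """Return all hits scoring within `tolerance` (a fraction) of the best hit."""
    if not hits:
        return []
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    best_score = ranked[0][1]
    threshold = best_score - abs(best_score) * tolerance
    return [hit for hit in ranked if hit[1] >= threshold]
```

With hypothetical hits scoring 200.0, 198.0 and 120.0, the first two are returned together, which at least prompts the question of why the first should be preferred over the second.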

Quality Delusions
• The bioinformatics does not have to be sound, because we only trust wet-lab results anyway.
• Worrying about errors in experimental data but believing that derived data is always true.
• Believing TrEMBL is always right.
• Believing computational gene predictions are always correct.

Black Box Science
• Producing irreproducible bioinformatics analyses
  – Not collecting the provenance of the analysis.
  – Not testing during software development.
• Try re-running experiments described in the journal Bioinformatics from more than 5 years ago
• UniGene
  – What is happening during UniGene clustering?
  – ‘Human’ descriptions (via NCBI) are not exact.
  – The Human Transcriptome Map project and other microarray analysts ended up reclustering UniGene [Marco Roos].

“No experiment is reproducible.” Wyszowski's Law

“An experiment is reproducible until another laboratory tries to repeat it.” Alexander Kohn
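"Collecting the provenance of the analysis" can start very small. The sketch below records what was run, with which parameters, on which input, and when, in a form that can be saved next to the results. The field names and the example tool name are assumptions for illustration; this is not the provenance model of Taverna or any other workflow system.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from typing import Any, Dict


def provenance_record(input_bytes: bytes, parameters: Dict[str, Any],
                      tool: str, tool_version: str) -> Dict[str, Any]:
    """Describe one analysis run so that it can be re-checked later."""
    return {
        "tool": tool,
        "tool_version": tool_version,
        # Sorted so that two runs with the same settings serialise identically.
        "parameters": dict(sorted(parameters.items())),
        # A checksum identifies the exact input without storing it twice.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


def to_json(record: Dict[str, Any]) -> str:
    """Serialise a provenance record for storage alongside the results."""
    return json.dumps(record, indent=2, sort_keys=True)
```

A call such as `provenance_record(data, {"evalue": 1e-5}, "example_tool", "0.1")` uses a made-up tool name and parameter; the point is only that the parameters people never change, and the datasets they ran over, get written down.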

Sin 7
• Instant Gratification
• Greed? Gluttony?
• Always the immediate return.
• Never investing for the future.
• The quick and dirty fix.
• Refusing to model or abstract.
• Refusing to plan for recording and exchanging.
• Just getting the next quick fix.
• The pressure to deliver now and pay later

www.CartoonStock.com

Hackery
• Deliver now, pay later
  – Producing crap, non-reusable software because only the biological results matter for publication X.
  – Collect! Analyse! Er…now what?
• Spaghetti-ism
  – Over-indulgence in Perl
  – Over-indulgence in ASCII art flat files.
  – Modelling a system by hacking up XSD fragments on a whiteboard.
  – Writing Perl scripts that resemble my high-school BASIC of the 80s.

Blind faith in XML
• It’s in XML, thus all data integration problems are solved.
  – Er…no.
  – All those vocabularies, e.g. SBML, GenBank XML etc
• The good thing about XML is that it is human readable.
  – Arrrrgh!
• Insisting that XML is not text.
• Insisting that XML is text

Blind Faith in Foo.
• There's a new thing to use.
• we don't understand it yet.
• so it sucks up all the stuff we already know we don't understand.
• Lack of appreciation about exactly what the new technology addresses in itself before trying to make it work for us.

Pioneering development methods
• Development by anecdote
  – I heard in the pub that the way to go was Foo.
  – Though I have no idea what Foo is or why it is the way to go.
• Design by hacking
  – It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place.
  – Hmmm…..We call that Extreme Programming or Emergent Semantics or Web 2.0 in CS.

Sin Summary

Maybe only one “original sin” in bioinformatics.

Parochialism and Insularity
Exceptionalism
Autonomy or death!
Vanity: Pride and Narcissism
Monolith Megalomania
Scientific method Sloth
Instant Gratification

Reinvention
Churn

Can we become less sinful? Why do these sins exist?

Are bioinformaticians particularly naughty?
No naughtier than Computer Scientists. And it's all very hard.
Though they are naughty…

Why?
• Selfish Scientist – Self-interested Scientist
  – Reputation, need to get results right now, win.
  – Fear of dependency, fear of being left behind.
  – Understand the incentives and barriers to adoption.
• Bioinformatics as it is practiced
  – Social and funding structure perpetuates this.
  – Production vs Research.
• Real, inherent issues. It is hard.
• Hybrid exhaustion and pressure.
  – Biology + Computing + Bioinformatics

Luddism? Surely not!
• Refusing to have biology go beyond a cottage industry.
• Being scared to do it properly.
• Railing against big science
• The cult of amateurism.

[Stevens]

Practical Steps?
• Create means to share know-how
  – Understanding outside my expertise, e.g. sources of error.
  – A comprehensive catalogue of web services
  – A Facebook for workflow builders.
  – Learn from others. Even Computer Science. And other Sciences.
  – Try and create a culture of raising quality. Somehow.

FaceBook & Bazaar for Workflow e-Scientists

myexperiment.org

Trials start August 2007!

Delivery Bulge

Practical Steps for IT Platforms?
• Stop building monolithic solutions
  – Strong force in business enterprises
• Component-ise Bioinformatics
  – Loosely coupled systems
  – Stable APIs, standardised metadata.
  – Design to combine.
  – Sort out the b***dy naming/id problem
  – If you can’t agree, agree on the bridge.
• Raise the level of abstraction
  – Less Perl, more workflows
  – Enable users to extract the data they need without hassling you.

Practical Steps?
• Presume and design for incremental change
  – Minimise disruption.
• Presume others use our stuff
  – And respect that
  – Describe to build Trust
• Presume others add value to our stuff
  – Be easily part of loosely coupled systems. Lightweight programming models.
  – Presume, and enable, content and function mashing.

Web 2.0 Design Patterns

• http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

26/2/2007 | myExperiment | Slide 92

1. The Long Tail

2. Data is the Next Intel Inside

3. Users Add Value

4. Network Effects by Default

5. Some Rights Reserved

6. The Perpetual Beta

7. Cooperate, Don't Control

8. Software Above the Level of a Single Device

Practical Steps?
Presume scientific practice naughtiness
  – Try to deal with it, or expose it?
  – Transparency and accurate collection and reporting.
  – Provenance.
  – A prerequisite to publication.
  – The end of Black Box Science.
  – Peer pressure.

E.g. Workflows, but will a scientist give away their secrets or expose their mistakes?

The Final Word

Sin writes histories, goodness is silent.

 

Thomas Fuller