Social insect evolution: genomics opportunities & approaches
@yannick__ — http://yannick.poulet.org
NextBUG, 2014-10-15
© Alex Wild & others
Atta leaf-cutter ants (© National Geographic)
Oecophylla weaver ants (© ameisenforum.de; © forestryimages.org; © wynnie@flickr)
Forelius pusillus hides the nest entrance at night (Tofilski et al. 2008).
Workers staying outside die: « preventive self-sacrifice ».
Dorylus driver ants: ants with no home
© BBC
© Dirk Mezger
Ritualized fighting in Camponotus gigas (© Carsten Brühl; Pfeiffer & Linsenmair 2001)
Army ant milling - “spiral of death”
Animal biomass (Brazilian rainforest), from Fittkau & Klinge 1973:
Soil fauna excluding earthworms, ants & termites: 148
Ants & termites: 114
Other insects: 49.6
Earthworms: 17.3
Mammals: 14.5
Birds: 5.3
Spiders: 4.7
Reptiles: 3.7
Amphibians: 2.8
Well-studied:
• behavior
• morphology
• evolutionary context
• ecology
This changes everything: 454, Illumina, SOLiD... Any lab can sequence anything!
Major research areas
Genes/mechanisms for the evolution of social behavior?
SCIENCE, Vol. 331, 25 February 2011, p. 1067 — Reports
Solenopsis invicta fire ants are a big problem! And very well studied!
Ascunce et al 2011
Solenopsis invicta fire ant: two social forms

Single-queen form:
• 1 large queen • Independent founding • Highly territorial • Many sizes of workers

Multiple-queen form:
• 2-100 smaller queens • Dependent founding • No inter-colony aggression • All workers similar size
Fire ants
Population genetics: allozyme screen (“starch gel”) — Ken Ross, L. Keller
=> “Gp-9” locus associated with social form
Social form is completely associated with the Gp-9 locus (Ken Ross and colleagues; Laurent Keller and colleagues).
Single-queen form (>15%): queens are Gp-9 BB.
Multiple-queen form (<5%): queens are Gp-9 Bb; Gp-9 bb females are rare.
Sex chromosomes: X, Y
“Social chromosomes”: SB (carrying Gp-9 B), Sb (carrying Gp-9 b)?
Wang et al., Nature 2013
Major research areas
Genes/mechanisms for differences (e.g., lifespan?)?
Genes/mechanisms for evolution of social behavior?
genome evolution social evolution
This changes everything: 454, Illumina, SOLiD... Any lab can sequence anything!
Genomics is hard.
• Biology/life is complex.
• The field is young.
• Biologists lack computational training.
• Generally, analysis tools suck: badly written, badly tested, hard to install, output quality often questionable.
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!
Inspiration?
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
Best Practices for Scientific Computing
Greg Wilson (Software Carpentry), D.A. Aruliah (University of Ontario Institute of Technology), C. Titus Brown (Michigan State University), Neil P. Chue Hong (Software Sustainability Institute), Matt Davis (Space Telescope Science Institute), Richard T. Guy (University of Toronto), Steven H.D. Haddock (Monterey Bay Aquarium Research Institute), Katy Huff (University of Wisconsin), Ian M. Mitchell (University of British Columbia), Mark D. Plumbley (Queen Mary University of London), Ben Waugh (University College London), Ethan P. White (Utah State University), and Paul Wilson (University of Wisconsin)
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [19, 52]. However, 90% or more of them are primarily self-taught [19, 52], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation.

We believe that software is just another kind of experimental apparatus [63] and should be built, checked, and used as carefully as any physical apparatus. However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is [21, 20]. This can lead to serious errors impacting the central conclusions of published research [43]: recent high-profile retractions, technical comments, and corrections because of errors in computational methods include papers in Science [6], PNAS [39], the Journal of Molecular Biology [5], Ecology Letters [37, 8], the Journal of Mammalogy [33], and Hypertension [26].

In addition, because software is often used for more than a single project, and is often reused by other scientists, computing errors can have disproportional impacts on the scientific process. This type of cascading impact caused several prominent retractions when an error from another group’s code was not discovered until after publication [43]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others.

This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommendations are based on several decades of collective experience both building scientific software and teaching computing to scientists [1, 65], reports from many other groups [22, 29, 30, 35, 41, 50, 51], guidelines for commercial and open source software development [61, 14], and on empirical studies of scientific computing [4, 31, 59, 57] and software development in general (summarized in [48]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can be used for focusing on the underlying scientific questions.
1. Write programs for people, not computers.
Scientists writing software need to write code that both executes correctly and can be easily read and understood by other programmers (especially the author’s future self). If software cannot be easily read and understood it is much more difficult to know that it is actually doing what it is intended to do. To be productive, software developers must therefore take several aspects of human cognition into account: in particular, that human working memory is limited, human pattern matching abilities are finely tuned, and human attention span is short [2, 23, 38, 3, 55].

First, a program should not require its readers to hold more than a handful of facts in memory at once (1.1). Human working memory can hold only a handful of items at a time, where each item is either a single fact or a “chunk” aggregating several facts [2, 23], so programs should limit the total number of items to be remembered to accomplish a task. The primary way to accomplish this is to break programs up into easily understood functions, each of which conducts a single, easily understood, task. This serves to make each piece of the program easier to understand in the same way that breaking up a scientific paper using sections and paragraphs makes it easier to read. For example, a function to calculate the area of a rectangle can be written to take four separate coordinates:

    def rect_area(x1, y1, x2, y2):
        ...calculation...

or to take two points:

    def rect_area(point1, point2):
        ...calculation...

The latter function is significantly easier for people to read and remember, while the former is likely to lead to errors…
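A runnable version of the paper’s sketch, for comparison (the `...calculation...` bodies are elided in the original; the area computation below is my guess at the intended implementation):

```python
def rect_area_coords(x1, y1, x2, y2):
    """Area from four separate coordinates: easy to mix up argument order."""
    return abs(x2 - x1) * abs(y2 - y1)

def rect_area(point1, point2):
    """Area from two (x, y) points: fewer items for the reader to hold in memory."""
    (x1, y1), (x2, y2) = point1, point2
    return abs(x2 - x1) * abs(y2 - y1)

# Both compute the same area; the two-point version "chunks" related facts together.
area = rect_area((1, 2), (4, 6))
```

The two-point signature also makes call sites harder to get wrong: a point is passed as one unit, so its x and y cannot be accidentally swapped with another point’s coordinates.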
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
Inspiration?
• Technologies
• Planning for mistakes
• Automated testing
• Continuous integration
• Writing for people: use a style guide
Code for people: use a style guide
• For R: http://r-pkgs.had.co.nz/style.html

R style guide extract

Coding for people: indent your code!
Programming better
• variable naming
• coding width: 100 characters
• indenting
• Follow conventions, e.g. “Google R Style”
• Versioning: Dropbox & http://github.com/
• Automated testing
• “being able to use, understand and improve your code in 6 months & in 60 years” — paraphrasing Damian Conway
preprocess_snps <- function(snp_table, testing=FALSE) {
  if (testing) {
    # run a bunch of tests of extreme situations.
    # quit if a test gives a weird result.
  }
  # real part of function.
}
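The same “plan for mistakes” pattern, sketched in Python rather than R (the filtering logic and field names are hypothetical, just to make the sketch runnable): the function first exercises itself on extreme inputs in testing mode, then does the real work.

```python
def preprocess_snps(snp_table, testing=False):
    """Filter a SNP table; in testing mode, first self-check extreme situations."""
    if testing:
        # Run a bunch of tests of extreme situations;
        # fail loudly (AssertionError) if a test gives a weird result.
        assert preprocess_snps([]) == []                                  # empty input
        assert preprocess_snps([{"id": "s1", "quality": -1}]) == []       # nonsense quality
    # Real part of the function: keep only SNPs with a positive quality score.
    return [snp for snp in snp_table if snp.get("quality", 0) > 0]
```

Calling `preprocess_snps(rows, testing=True)` during development catches regressions at the point of use, without a separate test harness.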
Line length Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
R style guide extract
Before (inconsistent spacing, mismatched quote character):

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))

After (consistent spacing around '=', one argument per line):

ant_measurements <- read.table(file      = '~/Downloads/Web/ant_measurements.txt',
                               header    = TRUE,
                               sep       = '\t',
                               col.names = c('colony', 'individual', 'headwidth', 'mass'))
Code for people: use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
• For Ruby: https://github.com/bbatsov/ruby-style-guide
Automatically check your code:

install.packages("lint")   # once
library(lint)              # every time
lint("file_to_check.R")
Four tools that (hopefully) suck less.
1. SequenceServer

“Can you BLAST this for me?” — Once I wanted to set up a BLAST server. Sure, I can help you…

Anurag Priyam, mechanical engineering student, IIT Kharagpur.
Aim: an open-source, idiot-proof web interface for custom BLAST.

Antgenomes.org — SequenceServer: BLAST made easy (well, we’re trying...)
http://www.sequenceserver.com/ (requires a BLAST+ install)
1. Install:
gem install sequenceserver

Do you have BLAST-formatted databases? If not:
sequenceserver format-databases /path/to/fastas

2. Configure:
# ~/.sequenceserver.conf
bin: ~/ncbi-blast-2.2.25+/bin/
database: /Users/me/blast_databases/

3. Launch:
sequenceserver
### Launched SequenceServer at: http://0.0.0.0:4567
Web server: Anurag Priyam & Git community — http://sequenceserver.com
BLAST runs via ssh on a 48-core, 512 GB RAM machine.
2. Bionode
Module counts: Node = “npm”.
Reusable, small and tested modules.
Examples: BASH, JavaScript, bionode.io (online shell)
bionode-ncbi urls assembly Solenopsis invicta | grep genomic.fna
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG/ GCA_000188075.1_Si_gnG_genomic.fna.gz
bionode-ncbi download sra arthropoda | bionode-sra
bionode-ncbi download gff bacteria
var ncbi = require('bionode-ncbi')
ncbi.urls('assembly', 'Solenopsis invicta', gotData)
function gotData(urls) {
  var genome = urls[0].genomic.fna
  download(genome)
}
# Get descriptions for papers related to an SRA search
bionode ncbi search sra Solenopsis invicta |
tool-stream extractProperty uid |
bionode ncbi link sra pubmed |
tool-stream extractProperty destUID |
bionode ncbi search pubmed
Difficulty writing scalable, reproducible and complex bioinformatic pipelines. Solution: Node.js everywhere — streams.

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()

ncbi
  .search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)

fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)

fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
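The same streaming idea, sketched with Python generators rather than Node streams (the record shapes and stage names are hypothetical, not the bionode API): each stage consumes its input lazily, and a fan-out stage feeds two independent downstream consumers.

```python
import itertools

def search(records):
    """Source stage: yield records one at a time (stands in for an NCBI search)."""
    for record in records:
        yield record

def extract_property(stream, key):
    """Transform stage: pull one field out of each record, like tool-stream's extractProperty."""
    for record in stream:
        yield record[key]

def fork(stream, n=2):
    """Fan-out stage: itertools.tee gives n independent iterators over one stream."""
    return itertools.tee(stream, n)

# Hypothetical records, loosely shaped like SRA search results.
records = [{"uid": "101", "biosample": "S1"}, {"uid": "102", "biosample": "S2"}]
branch_a, branch_b = fork(search(records))
uids = list(extract_property(branch_a, "uid"))
biosamples = list(extract_property(branch_b, "biosample"))
```

As with Node streams, no stage needs the whole dataset in memory at once, which is what makes the pipeline style attractive for large genomic downloads.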
Working with Gene predictions
Gene prediction: dozens of software algorithms, dozens of predictions.
20% failure rate: • missing pieces • extra pieces • incorrect merging • incorrect splitting
Visual inspection... and manual fixing required.
1 gene = 5 minutes to 3 days
Yandell & Ence 2013, Nature Reviews Genetics
(figure: genomic DNA sequence with “Evidence” tracks and a “Consensus” gene model)
3. GeneValidator
Monica Dragan — https://github.com/monicadragan/GeneValidator
Ismail Moghul — https://github.com/IsmailM/GeneValidatorApp
Run on:
★whole geneset: identify most problematic predictions
★alternative models for a gene (choose best)
★individual genes (while manually curating)
Warning: Work in Progress
gem install GeneValidator gem install GeneValidatorApp
http://afra.sbcs.qmul.ac.uk/genevalidator
4. Afra: Crowdsourcing gene model curation
Gene prediction: dozens of software algorithms, dozens of predictions. 20% failure rate: missing pieces, extra pieces, incorrect merging, incorrect splitting. Visual inspection... and manual fixing required.
1 gene = 20 minutes to 3 days. 15,000 genes × 20 species = impossible.
Yandell & Ence 2013, Nature Reviews Genetics
(figure: genomic DNA sequence with “Evidence” tracks and a “Consensus” gene model)
Algorithm discovery by protein folding game players
Firas Khatib, Seth Cooper, Michael D. Tyka, Kefan Xu, Ilya Makedon, Zoran Popović, David Baker, and Foldit Players
Department of Biochemistry, Department of Computer Science and Engineering, and Howard Hughes Medical Institute, University of Washington, Box 357370, Seattle, WA 98195
Contributed by David Baker, October 5, 2011 (sent for review June 29, 2011)
Foldit is a multiplayer online game in which players collaborate and compete to create accurate protein structure models. For specific hard problems, Foldit player solutions can in some cases outperform state-of-the-art computational methods. However, very little is known about how collaborative gameplay produces these results and whether Foldit player strategies can be formalized and structured so that they can be used by computers. To determine whether high performing player strategies could be collectively codified, we augmented the Foldit gameplay mechanics with tools for players to encode their folding strategies as “recipes” and to share their recipes with other players, who are able to further modify and redistribute them. Here we describe the rapid social evolution of player-developed folding algorithms that took place in the year following the introduction of these tools. Players developed over 5,400 different recipes, both by creating new algorithms and by modifying and recombining successful recipes developed by other players. The most successful recipes rapidly spread through the Foldit player population, and two of the recipes became particularly dominant. Examination of the algorithms encoded in these two recipes revealed a striking similarity to an unpublished algorithm developed by scientists over the same period. Benchmark calculations show that the new algorithm independently discovered by scientists and by Foldit players outperforms previously published methods. Thus, online scientific game frameworks have the potential not only to solve hard scientific problems, but also to discover and formalize effective new strategies and algorithms.

citizen science | crowd-sourcing | optimization | structure prediction | strategy

Citizen science is an approach to leveraging natural human abilities for scientific purposes. Most such efforts involve visual tasks such as tagging images or locating image features (1–3). In contrast, Foldit is a multiplayer online scientific discovery game, in which players become highly skilled at creating accurate protein structure models through extended game play (4, 5). Foldit recruits online gamers to optimize the computed Rosetta energy using human spatial problem-solving skills. Players manipulate protein structures with a palette of interactive tools and manipulations. Through their interactive exploration Foldit players also utilize user-friendly versions of algorithms from the Rosetta structure prediction methodology (6) such as wiggle (gradient-based energy minimization) and shake (combinatorial side chain rotamer packing). The potential of gamers to solve more complex scientific problems was recently highlighted by the solution of a long-standing protein structure determination problem by Foldit players (7).

One of the key strengths of game-based human problem exploration is the human ability to search over the space of possible strategies and adapt those strategies to the type of problem and stage of problem solving (5). The variability of tactics and strategies stems from the individuality of each player as well as multiple methods of sharing and evolution within the game (group play, game chat), and outside of the game [wiki pages (8)]. One way to arrive at algorithmic methods underlying successful human Foldit play would be to apply machine learning techniques to the detailed logs of expert Foldit players (9). We chose instead to rely on a superior learning machine: Foldit players themselves.

As the players themselves understand their strategies better than anyone, we decided to allow them to codify their algorithms directly, rather than attempting to automatically learn approximations. We augmented standard Foldit play with the ability to create, edit, share, and rate gameplay macros, referred to as “recipes” within the Foldit game (10). In the game each player has their own “cookbook” of such recipes, from which they can invoke a variety of interactive automated strategies. Players can share recipes they write with the rest of the Foldit community or they can choose to keep their creations to themselves.

In this paper we describe the quite unexpected evolution of recipes in the year after they were released, and the striking convergence of this very short evolution on an algorithm very similar to an unpublished algorithm recently developed independently by scientific experts that improves over previous methods.

Results
In the social development environment provided by Foldit, players evolved a wide variety of recipes to codify their diverse strategies to problem solving. During the three and a half month study period (see Materials and Methods), 721 Foldit players ran 5,488 unique recipes 158,682 times and 568 players wrote 5,202 recipes. We studied these algorithms and found that they fell into four main categories: (i) perturb and minimize, (ii) aggressive rebuilding, (iii) local optimize, and (iv) set constraints. The first category goes beyond the deterministic minimize function provided to Foldit players, which has the disadvantage of readily being trapped in local minima, by adding in perturbations to lead the minimizer in different directions (11). The second category uses the rebuild tool, which performs fragment insertion with loop closure, to search different areas of conformation space; these recipes are often run for long periods of time as they are designed to rebuild entire regions of a protein rather than just refining them (Fig. S1). The third category of recipes performs local minimizations along the protein backbone in order to improve the Rosetta energy for every segment of a protein. The final category of recipes assigns constraints between beta strands or pairs of residues (rubber bands), or changes the secondary structure assignment to guide subsequent optimization.
Different algorithms were used with very different frequenciesduring the experiment. Some are designated by the authors aspublic and are available for use by all Foldit players, whereasothers are private and available only to their creator or theirFoldit team. The distribution of recipe usage among differentplayers is shown in Fig. 1 for the 26 recipes that were run over1,000 times. Some recipes, such as the one represented by theleftmost bar, were used many times by many different players,while others, such as the one represented by the pink bar in the
Author contributions: F.K., S.C., Z.P., and D.B. designed research; F.K., S.C., M.D.T., andF.P. performed research; F.K., S.C., M.D.T., K.X., and I.M. analyzed data; and F.K., S.C., Z.P.,and D.B. wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.1To whom correspondence should be addressed. E-mail: dabaker@u.washington.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1115898108/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1115898108 PNAS ∣ November 22, 2011 ∣ vol. 108 ∣ no. 47 ∣ 18949–18953
BIOPH
YSICSAND
COMPU
TATIONALBIOLO
GY
PSYC
HOLO
GICALAND
COGNITIVESC
IENCE
S
http://Fold.it
Crowd-sourcing the visual inspection + correction of gene models.
Challenges
• Recruiting & retaining contributors
Recruiting & retaining contributors
Plan A: get students.
• Increase accessibility:
  • Make tasks small & simple
  • Need excellent tutorials & training
  • Need an intelligent "mothering" user interface.
• Provide rewards:
  • Better grades
  • Learning experience
  • Good karma (helping science)
  • Prestige & pride (on Facebook; points & badges "leaderboard", with certificates, in publications)
  • Opportunities to develop expertise & responsibilities
Crowd-sourcing the visual inspection + correction of gene models.
Challenges
• Recruiting & retaining contributors
• Ensuring quality
Ensuring quality
• Excellent tutorials/training
• Make tasks small & simple
• Redundancy
• Review of conflicts by senior users.
Curation workflow:
Begin: needs curation → create initial tasks.
Being curated → Curate → Submit (three redundant curation tasks in parallel).
Auto-check of the submissions:
• Consistent: create next required task → Done
• Inconsistent: create a "review" task
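The redundancy-plus-review flow above can be sketched in a few lines. This is a minimal illustration, not Afra's actual code: the names (`auto_check`, `run_workflow`, `REDUNDANCY`) and the "all curators must agree" rule are assumptions made for the example.

```python
# Minimal sketch of the redundant-curation workflow (hypothetical names,
# not Afra's real API). Each gene model is curated independently several
# times; an auto-check compares the submissions.
from collections import Counter

REDUNDANCY = 3  # how many independent curations per gene model

def auto_check(submissions):
    """Compare redundant curations of one gene model.

    Returns ("done", result) when all curators agree, or
    ("review", submissions) so a senior user can resolve the conflict."""
    counts = Counter(submissions)
    result, n = counts.most_common(1)[0]
    if n == len(submissions):       # consistent: everyone submitted the same fix
        return ("done", result)
    return ("review", submissions)  # inconsistent: create a "review" task

def run_workflow(gene_model, curate):
    """Create REDUNDANCY curation tasks, collect submissions, auto-check."""
    submissions = [curate(gene_model) for _ in range(REDUNDANCY)]
    return auto_check(submissions)
```

Agreement here means exact equality of submissions; a real system would compare curated gene structures more loosely (e.g. identical exon boundaries).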
Crowd-sourcing the visual inspection + correction.
Challenges
http://afra.sbcs.qmul.ac.uk
Anurag Priyam http://github.com/yeban/afra
• Recruiting & retaining contributors
• Ensuring quality
Warning: Work in Progress
Timelines
• Rolled out to:
• 8 MSc students
• 20 3rd year students
• Need to improve tutorials/guidance/documentation
• Roll out to 200 first years (few months)
• Expand
Summary
• Ants are cool
• Exciting times & big challenges
• Inspiration from people working with computers more/longer
• SequenceServer - set up custom BLAST servers
• Bionode - modular streams for bioinformatics
• GeneValidator - identifying problems with gene predictions
• Afra - infrastructure to crowdsource gene curation to the masses
Recruiting: genome hacker / bioinformatics support
GitHub
Thanks!
y.wurm@qmul.ac.uk
@yannick__
http://yannick.poulet.org
Colleagues & Collaborators @ QMUL & UNIL
• Anurag Priyam @yeban
• Monica Dragan
• Ismail Moghul
• Vivek Rai
• Bruno Vieira @bmpvieira
Maybe
Generally
• genome evolution
• social evolution
Single- vs. multiple-queenness
• in fire ants
• in similar independent species
• one or many loci?
• one or many genes?
• convergence?
Social parasitism
Strengths of selection in social evolution
concepts & mechanisms
Medically relevant questions
Candidate gene studies
• Vitellogenin
• Sex determination genes
• functional testing...
Tools for genomics work on emerging model organisms
Molecular response to social upheaval