Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

  • Date posted: 23-Aug-2014
  • Category: Science

Screencast video: https://www.youtube.com/watch?v=oe7pjHJU-z4
Talk info: http://1.usa.gov/1kPcRxC

Transcript of Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

  • Crowdsourcing Biology: The Gene Wiki, BioGPS, and GeneGames.org. Citizen Science! Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). May 14, 2014, CBIIT. Slides: slideshare.net/andrewsu
  • Few genes are well annotated. [Figure: GO annotation counts for the 20,473 protein-coding genes, sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; callouts at 41% and 65%. Data: NCBI, February 2013.]
  • ...because the literature is sparsely curated? [Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
  • ...because the literature is sparsely curated? [Figure: average capacity of human scientist, 1983 to 2013; y-axis from 0 to 40.]
  • 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
  • Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), split into a Short Head and a Long Tail.] Examples (Short Head vs. Long Tail): News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol.
  • Wikipedia is reasonably accurate.
  • Wikipedia has breadth and depth. [Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.]
  • We can harness the Long Tail of scientists to directly participate in the gene annotation process.
  • From crowdsourcing to structured data: The Gene Wiki; Citizen Science.
  • Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts, Review article.]
  • Filtering, extracting, and summarizing PubMed. [Diagram labels: Documents, Concepts.]
  • Wiki success depends on a positive feedback loop. [Diagram: a cycle linking Gene Wiki page utility, number of users, and number of contributors.]
  • 10,000 gene stubs within Wikipedia. Stub components labeled on the slide: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references. (Huss, PLoS Biol, 2008)
  • Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
  • Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. [Figure: editor count and edit count.] (Good, NAR, 2011)
  • A review article for every gene is powerful: references to the literature and hyperlinks to related concepts. Examples: Reelin: 98 editors, 703 edits since July 2002; Heparin: 358 editors, 654 edits since June 2003; AMPK: 109 editors, 203 edits since March 2004; RNAi: 394 editors, 994 edits since October 2002.
  • Making the Gene Wiki more computable. [Diagram labels: Free text, Structured annotations.]
  • Filling the gaps in gene annotation. [Diagram labels: Wikilink; GO exact match; Gene Wiki mapping; NCBI Entrez Gene: 334; Candidate assertion; GO:0006897.] Results: 6319 novel GO annotations; 2147 novel DO annotations.
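The labels on this slide suggest a pipeline in which a wikilink on a gene's page is matched exactly against GO term names to propose a candidate annotation. The following is a minimal sketch of that idea, not the authors' implementation: the function name, the normalization rule, and the list of wikilinks are all hypothetical, and only the gene ID (334) and the term ID (GO:0006897, named "endocytosis" in GO) come from the slide.

```python
from typing import Dict, List, Tuple


def _norm(title: str) -> str:
    """Normalize a page title or term name for exact matching (an assumed rule)."""
    return title.replace("_", " ").strip().lower()


def candidate_go_assertions(
    gene_id: str,
    wikilinks: List[str],
    go_names: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term name."""
    # Index GO term names once so each wikilink is a dictionary lookup.
    index = {_norm(name): go_id for go_id, name in go_names.items()}
    seen = set()
    assertions = []
    for link in wikilinks:
        go_id = index.get(_norm(link))
        if go_id is not None and go_id not in seen:
            seen.add(go_id)  # report each GO term at most once per gene
            assertions.append((gene_id, go_id))
    return assertions


# Hypothetical input: wikilinks found on the Gene Wiki page mapped to Entrez Gene 334.
go_names = {"GO:0006897": "endocytosis"}
wikilinks = ["Protein", "Endocytosis", "Cell membrane"]
print(candidate_go_assertions("334", wikilinks, go_names))
```

Exact matching favors precision over recall; a candidate produced this way would still need review before being treated as an annotation.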
  • Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed only) vs. p-value (PubMed + GW), with regions marked "more significant PubMed + GW" and "more significant PubMed only"; highlighted term: muscle contraction.] (Good BM et al., BMC Genomics, 2011)
  • Making the Gene Wiki more computable. [Diagram labels: Free text, Structured annotations, Analyses.]
  • Expansion through outreach and incentives. Genes shown: SP-A1, SP-A2, KIF11, LIG3, MIR155, EPHX2.
  • Cardiovascular Gene Wiki Portal (Peipei Ping, UCLA). CAMK2D -- CaM kinase II subunit delta; CSRP3 -- Cysteine and glycine-rich protein 3; GJA1 -- Gap junction alpha-1 protein / Connexin-43; MAPK14 -- Mitogen-activated protein kinase 14 / p38-; MYL7 -- Myosin regulatory light chain 2, atrial isoform; MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform; PECAM1 -- Platelet endothelial cell adhesion molecule / CD31; RYR2 -- Ryanodine receptor 2; ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2; TNNI3 -- Troponin I, cardiac muscle; TNNT2 -- Troponin T, cardiac muscle.
  • The Long Tail of scientists is a valuable source of information on gene function.
  • From crowdsourcing to structured data: The Gene Wiki; Citizen Science.
  • Gene databases are numerous and overlapping: those pictured on the slide, and hundreds more.
  • Why is there so much redundancy? [Diagram labels: Users, Requests, Resources, Time, Community development.] BioGPS emphasizes community extensibility.
  • Why do developers define the gene report view? BioGPS emphasizes user customizability.
  • http://biogps.org: community extensibility and user customizability.
  • Utility: A simple and universal plugin interface.
  • Utility: A simple and universal plugin interface. A total of >540 gene-centric online databases are registered as BioGPS plugins.
  • Users: BioGPS has critical mass. >6400 registered users; 14,000 unique visitors per month; 155,000 page views per month. Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC. [Figure: daily pageviews.]
  • Contributors: Explicit and implicit knowledge. 540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains.
  • Gene Annotation Query as a Service (http://mygene.info). High performance: 3M hits/month. Highly scalable: 13k species, 16M genes. Weekly data updates; JSON output; REST interface; Python/R/JS libraries.
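To illustrate what a REST interface with JSON output looks like from a client's side, here is a minimal sketch using only the Python standard library. The slide gives only the site address; the `/v3/query` path and the `q`, `species`, and `fields` parameters are assumptions based on the service's public API, the helper names are hypothetical, and the sample payload is hand-written rather than a recorded response.

```python
import json
from typing import List, Optional, Sequence, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed API root; the slide itself only names http://mygene.info.
BASE_URL = "https://mygene.info/v3"


def build_query_url(
    term: str,
    species: str = "human",
    fields: Sequence[str] = ("symbol", "name", "entrezgene"),
) -> str:
    """Build a gene query URL (parameter names are assumptions, see lead-in)."""
    params = {"q": term, "species": species, "fields": ",".join(fields)}
    return f"{BASE_URL}/query?{urlencode(params)}"


def parse_hits(payload: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Extract (symbol, entrezgene) pairs from a JSON response body."""
    data = json.loads(payload)
    return [(hit.get("symbol"), hit.get("entrezgene")) for hit in data.get("hits", [])]


def query_genes(term: str, **kwargs) -> List[Tuple[Optional[str], Optional[str]]]:
    """Perform the HTTP GET; requires network access, so it is not called below."""
    with urlopen(build_query_url(term, **kwargs), timeout=10) as response:
        return parse_hits(response.read().decode("utf-8"))


# Offline demonstration with a hand-written payload in the assumed response shape.
sample = '{"hits": [{"symbol": "TP53", "entrezgene": "7157", "name": "tumor protein p53"}]}'
print(build_query_url("symbol:TP53"))
print(parse_hits(sample))
```

Keeping URL construction and response parsing separate from the network call lets both be checked without contacting the service; in practice the Python client library mentioned on the slide would likely be the more convenient route.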
  • The Long Tail of bioinformaticians can collaboratively build a gene portal.
  • From crowdsourcing to structured data: The Gene Wiki; Citizen Science.
  • The biomedical literature is growing fast. [Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
  • Information Extraction: 1. Find mentions of high-level concepts in text. 2. Map mentions to specific terms in ontologies. 3. Identify relationships between concepts.
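The three steps can be sketched with a toy dictionary-based extractor. This is an illustration only, not a method described in the talk: the lexicon, the placeholder IDs (the `EX:` prefix is invented and refers to no real ontology), and the same-sentence co-occurrence rule for step 3 are all assumptions.

```python
import re
from itertools import combinations

# Hypothetical mini-lexicon: surface form -> (concept type, placeholder ontology ID).
LEXICON = {
    "breast cancer": ("disease", "EX:D001"),
    "lung cancer": ("disease", "EX:D002"),
    "brca2": ("gene", "EX:G001"),
}


def find_mentions(sentence):
    """Steps 1 and 2: find lexicon terms in the text and attach their ontology IDs."""
    mentions = []
    for term, (concept_type, ontology_id) in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            mentions.append(
                {"text": match.group(0), "start": match.start(),
                 "type": concept_type, "id": ontology_id}
            )
    return sorted(mentions, key=lambda m: m["start"])


def cooccurrence_relations(text):
    """Step 3 (crude): relate a gene and a disease mentioned in the same sentence."""
    relations = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for a, b in combinations(find_mentions(sentence), 2):
            if {a["type"], b["type"]} == {"gene", "disease"}:
                gene, disease = (a, b) if a["type"] == "gene" else (b, a)
                relations.add((gene["id"], disease["id"]))
    return sorted(relations)


text = "BRCA2 is the familial breast cancer gene. Lung cancer was not studied."
print(find_mentions(text))
print(cooccurrence_relations(text))
```

Dictionary lookup and co-occurrence are the simplest possible baselines for each step; they miss synonyms, composite mentions, and negation, which is part of why annotated corpora are needed.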
  • Disease mentions in PubMed abstracts: the NCBI Disease corpus. 793 PubMed abstracts (100 development, 593 training, 100 test); 12 expert annotators (2 annotate each abstract); 6,900 disease mentions. Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
  • Four types of disease mentions. Specific Disease: "Diastrophic dysplasia"; Disease Class: "Cancers"; Composite Mention: "prostatic, skin, and lung cancer"; Modifier: "..the familial breast cancer gene, BRCA2..". Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of t