International Cancer Genomics Consortium (ICGC) Data Coordinating Center
The International Cancer Genome Consortium (ICGC) Data Coordinating Center (DCC)
November 14th 2013
B.F. Francis Ouellette [email protected]
• Senior Scientist & Associate Director,
Informatics and Biocomputing, Ontario Institute for
Cancer Research, Toronto, ON
• Associate Professor, Department of Cell and Systems Biology,
University of Toronto, Toronto, ON.
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
Slides are on slideshare.net
• http://www.slideshare.net/bffo/ebi-oncogenomics-nov2013ouellettever03
http://goo.gl/HP613K
Disclaimer
I do not (and will not) profit in any way, shape or form,
from any of the brands, products or companies I may
mention.
Cancer therapy is like
beating the dog with
a stick to get rid of
his fleas.
- Anna Deavere Smith,
Let me down easy
http://goo.gl/Yhbsj
The revolution in cancer
research can be summed up
in a single sentence:
cancer is in essence,
a genetic disease.
- Bert Vogelstein
Cancer: A Disease of the Genome
Challenge in Treating Cancer:
Every tumor is different
Every cancer patient is different
Large-Scale Studies of Cancer Genomes
Johns Hopkins
• > 18,000 genes analyzed for mutations
• 11 breast and 11 colon tumors
• L.D. Wood et al, Science, Oct. 2007
Wellcome Trust Sanger Institute
• 518 genes analyzed for mutations
• 210 tumors of various types
• C. Greenman et al, Nature, Mar. 2007
TCGA (NIH)
• Multiple technologies
• Brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma)
• F.S. Collins & A.D. Barker, Sci. Am, Mar. 2007
Lessons learned (2007)
• Heterogeneity within and across tumor types
• High rate of abnormalities (driver vs passenger)
• Sample quality matters
International Cancer Genome Consortium
• Collect ~500 tumour/normal pairs from each of 50 different major
cancer types;
• Comprehensive genome analysis of each T/N pair:
– Genome
– Transcriptome
– Methylome
– Clinical data
• Make the data available to the research community & public.
Identify
genome
changes
…GATTATTCCAGGTAT… …GATTATTGCAGGTAT… …GATTATTGCAGGTAT…
Rationale for the ICGC
• The scope is huge, such that no country can do it all.
• Coordinated cancer genome initiatives will reduce
duplication of effort for common and easy to acquire
tumor samples and ensure complete studies for many
less frequent forms of cancer.
• Standardization and uniform quality measures across
studies will enable the merging of datasets, increasing
power to detect additional targets.
• The spectrum of many cancers varies across the
world for many tumor types, because of environmental,
genetic and other causes.
• The ICGC will accelerate the dissemination of genomic
and analytical methods across participating sites, and
the user community
International Cancer Genome Consortium (ICGC) Goals
• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe
• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors
• Make the data available to the research community rapidly and with minimal restrictions, to accelerate research into the causes and control of cancer
50 tumor types and/or subtypes
500 tumors + 500 controls per subtype
50,000 Human Genome Projects!
Nature (2010) 464:993
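The "50,000 Human Genome Projects" figure follows directly from the numbers on this slide. A quick sketch of the arithmetic (the variable names are ours, chosen for illustration):

```python
# Scale of the ICGC goal, using only the figures quoted on the slide.
cancer_types = 50          # tumor types and/or subtypes
tumors_per_type = 500      # tumor genomes per subtype
controls_per_type = 500    # matched control (normal) genomes per subtype

genomes_per_type = tumors_per_type + controls_per_type
total_genomes = cancer_types * genomes_per_type

print(f"{genomes_per_type:,} genomes per type, {total_genomes:,} genomes overall")
```

Each tumour/normal pair counts as two genomes, which is why 25,000 donors translate into 50,000 genomes.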
Analysis Data Types
• Simple Somatic Mutations
• Copy Number Alterations
• Structural Somatic Mutations
• Gene Expression (micro-arrays and RNASeq)
• miRNA Expression (RNASeq)
• Epigenomics (Arrays and Methylation)
• Splicing Variation
• Protein Expression
OICR’s mission
To build innovative research
programs that will have an impact
on the prevention, early detection,
diagnosis and treatment of
cancer.
OICR Informatics & Biocomputing Senior Staff
Lincoln Stein
Director, I&B
Sr. PI
Vincent Ferretti
Assoc. Director,
Bioinf. Software Dev
Sr. PI
Francis Ouellette
Assoc. Director, I&B
Paul Boutros
Jr. PI
David Sutton
Director, IT
Tatiana Lomasko
Program Manager
Brian Shoichet
Sr. PI
May 2013
Jared Simpson
OICR Fellow
May 2013
Lakshmi Muthuswamy
Jr. PI
http://icgc.org
ICGC Map – November 2013
67 projects launched
ICGC Committees & Working Groups
http://icgc.org/icgc/committees-and-working-groups
ICGC Project Teams @ OICR
• ICGC Secretariat
– Executive Chair: Thomas Hudson
– Senior Project Manager: Jennifer Jennings
– Administrative Coordinator: Jaypee Banlawi
• (with the support of the Web Development team)
• ICGC Data Coordination Center (DCC)
– DCC Leader: Lincoln Stein
– DCC Co-Leader: Francis Ouellette
– DCC Software Development Team Leader: Vincent
Ferretti (+6 FTE)
– DCC Data Curation: Hardeep Nahal (+1 FTE)
DCC Activities
DCC activities are split between two groups:
• Software Development
– DCC portal
– Submission tool
• Curation (and Content Management)
– Data level management
– Submitter “handling”
– Coordination with secretariat
– User support
http://dcc.icgc.org/team
ICGC Data Coordination Centre
A “comprehensive management system” providing:
• Secure mechanism for uploading data
• Track uploads and perform integrity checks
• Regular progress reporting (data audit)
• Quality checks (coverage, correctness, etc.)
• Enable distribution of raw data to public repositories
• Provide essential metadata to public repositories
• Integrate with other public repositories via standard data
formats, ontologies, etc.
ICGC Data Coordination Centre (2)
Provides the following support to experimental
biologists, computational biologists, and other
researchers:
• Download of complete dataset, or subsets
• Restrict protected data to authorized users (controlled access)
• Search data by gene or specimen, or lists thereof
• Interactive system for identifying specimens of interest, finding what
data sets are available for those specimens, selecting data slices
across those specimens (e.g., counts of the number of somatic
mutations observed in a region within the UTR of a gene of interest), and
running basic analytic tests on those data slices
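To make the "data slice" idea concrete, here is a minimal sketch of counting somatic mutations per donor inside a genomic region. The record layout, donor IDs and coordinates are invented for illustration; this is not the portal's data model or API.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    donor_id: str
    chromosome: str
    position: int  # 1-based coordinate


def count_in_region(mutations, chromosome, start, end):
    """Count mutations per donor inside [start, end] on one chromosome."""
    counts = {}
    for m in mutations:
        if m.chromosome == chromosome and start <= m.position <= end:
            counts[m.donor_id] = counts.get(m.donor_id, 0) + 1
    return counts


# Made-up records and made-up coordinates, for illustration only.
mutations = [
    Mutation("donor_A", "12", 1050),
    Mutation("donor_A", "12", 1200),
    Mutation("donor_B", "12", 1100),
    Mutation("donor_B", "12", 5000),  # outside the region
    Mutation("donor_C", "7", 1100),   # different chromosome
]

region_counts = count_in_region(mutations, "12", 1000, 2000)
print(region_counts)
```

The per-donor counts are the kind of slice on which a basic analytic test could then be run.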
ICGC Data Types
• Clinical Data
– Hosted by DCC via data portal
– Was 100% open access, but currently 9 data elements have been flagged by DACO as controlled access and are under review by IDAC
• Experimental Analysis Data
– Hosted by DCC via data portal
– Somatic is open access, germline is controlled
• “Raw” Sequencing Data (+ array data, etc.)
– Hosted at other public repositories
– Primary repository for ICGC sequence data is EBI EGA
– TCGA raw data hosted at CGhub
ICGC datasets to date
[Figure: ICGC Data Portal Cumulative Donor Count for Member Projects. X-axis: monthly, Dec 2011 to Sept 2013, with Releases 7 through 14 marked; y-axis: number of donors, 1,000 to 10,000. Credit: Hardeep Nahal]
• Cancer types: 41
• Donors: 8,532 (18,056 specimens)
• Simple somatic mutations: 1,995,134
• Copy number mutations: 18,526,593
• Structural rearrangements: 18,614
• Genes affected* by simple somatic mutations: 22,074
• Genes affected* by non-synonymous coding mutations: 19,150
• Genes affected* by copy number mutations: 20,341
• Genes affected* by structural rearrangements: 1,884
• *out of 22,259 protein coding genes annotated in Ensembl Human release 69
• Open tier and controlled data currently available
ICGC dataset version 14
September 2013
Hardeep Nahal
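The "genes affected" counts are easier to compare as fractions of the 22,259 protein coding genes quoted on the slide. A small sketch of that arithmetic, using only the slide's numbers (the dictionary keys are our labels):

```python
TOTAL_PROTEIN_CODING_GENES = 22_259  # Ensembl Human release 69, per the slide

genes_affected = {
    "simple somatic mutations": 22_074,
    "non-synonymous coding mutations": 19_150,
    "copy number mutations": 20_341,
    "structural rearrangements": 1_884,
}

# Fraction of annotated protein coding genes hit by each mutation class.
fractions = {
    kind: count / TOTAL_PROTEIN_CODING_GENES
    for kind, count in genes_affected.items()
}

for kind, fraction in fractions.items():
    print(f"{kind}: {fraction:.1%}")
```

In release 14, nearly every protein coding gene carries at least one simple somatic mutation in some donor, while fewer than one in ten is affected by a structural rearrangement.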
Key DCC Activities for 2013
• Improved data & metadata curation at EGA; better
linking of data held at DCC to ICGC data in other
repositories (currently not perfect)
• Improved data quality/integrity checking through
new submission/validation system; review of
submission file specifications
• Integration of new data submission system and
portal infrastructure with project and user
information managed at ICGC.org
Moratorium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
Where do you find that information?
• We actually make it hard to find, but we are
working on that! (this is an example of where ICGC
would like to do what TCGA does!)
• http://cancergenome.nih.gov/publications/publicationguidelines
Where do you find that information?
For ICGC data:
• Need to find the policy!
• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
• Find text:
• Published → no embargo
• < 100 tumors → 2 years
• > 100 tumors → 1 year
• Find date: in README on FTP file
• (exception in README)
• This is bad, we know it, and we are fixing it!
• In doubt? Contact us! [email protected]
Time limits for publication moratoriums:
All data shall become free of a publication moratorium when either:
1) the data is published by the ICGC member project
2) one year after a specified quantity of data (e.g. genome dataset from 100 tumours per project) has been released via the ICGC database or other public databases.
3) In all cases data shall be free of a publication moratorium two years after its initial release.
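The three rules above can be read as "take the earliest applicable date". The sketch below encodes that reading; the function and parameter names are hypothetical, the dates in the usage lines are invented, and any exception listed in a project's README would override it. When in doubt, contact the DCC.

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def moratorium_end(initial_release: date,
                   threshold_release: Optional[date] = None,
                   published: Optional[date] = None) -> date:
    """Earliest date on which the data is free of the moratorium.

    initial_release:   first release of the data (rule 3, two-year cap)
    threshold_release: release that reached the specified quantity,
                       e.g. 100 tumours (rule 2, one year later)
    published:         publication date by the member project (rule 1)
    """
    candidates = [add_years(initial_release, 2)]
    if threshold_release is not None:
        candidates.append(add_years(threshold_release, 1))
    if published is not None:
        candidates.append(published)
    return min(candidates)


# Invented dates, for illustration only.
print(moratorium_end(date(2012, 3, 1)))
print(moratorium_end(date(2012, 3, 1), threshold_release=date(2012, 9, 1)))
print(moratorium_end(date(2012, 3, 1), published=date(2012, 12, 15)))
```

Taking the minimum is what makes rule 3 a hard cap: a late threshold release or a late publication can never push the end of the moratorium past two years from the initial release.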
[Diagram labels: DACO, ICGC, dbGaP, EGA, TCGA, ERA, BAM, Open, Germ Line, + EGA id; ICGC BAM/FASTQ; TCGA BAM/FASTQ; ICGC Open Data (includes TCGA Open Data); COSMIC Open Data]
Raw Data Availability at EGA by Project and Data Type
• https://www.ebi.ac.uk/ega/organisations/EGAO00000000024
Cooperation with EBI EGA Repository for
Controlled Access Raw Data
• Concerted efforts with EGA staff to support
coordinated data submissions to both ICGC DCC
& EGA
• Infrastructure to grant controlled data access
automatically on approval of ICGC DACO web
application forms
What do the users see?
• Important to have a data portal that represents the
richness of the data that we generate, but to also
make sure biologists and clinicians can actually
use the data & make discoveries!
• Important to have a scalable technology that will
support 50,000 human genomes, and thousands of
concurrent users (we don’t have that many yet)
Uniform Annotations
• Annotating Simple Somatic Mutations (SSM) and Simple
Germline Variations (SGV)
• DCC is currently implementing the snpEff software
◦ Recommended by the ICGC Bioinformatics Analysis
Working Group
◦ Returns Sequence Ontology's controlled vocabulary
regarding mutation-induced changes
(www.sequenceontology.org)
• ICGC members will not be required to annotate
SSM and SGV for the ICGC data releases
http://icgc.org
Select “Pancreatic cancer – Canada”
… But where is the data?
http://dcc.icgc.org/
Highlights of the new portal: dcc.icgc.org
• Faceted search capabilities for variants, genes and
donors
– Makes interactive data exploration fast and easy
• Mutation aggregation & counts across donors and cancers
– e.g., # of pancreatic cancer donors with the KRAS G12D mutation
• Standardized gene consequence across all projects
• Genome browser
• Data download
• Protein domains
• Links to repositories
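The "mutation aggregation" highlight amounts to counting distinct donors per mutation and cancer type. A minimal sketch with invented donor IDs and observations (this is not the portal's implementation or data):

```python
from collections import defaultdict

# Hypothetical (donor, primary site, gene, protein change) observations.
observations = [
    ("D1", "pancreas", "KRAS", "G12D"),
    ("D1", "pancreas", "TP53", "R175H"),
    ("D2", "pancreas", "KRAS", "G12D"),
    ("D2", "pancreas", "KRAS", "G12D"),  # same donor, second specimen
    ("D3", "pancreas", "KRAS", "G12V"),
    ("D4", "breast", "KRAS", "G12D"),
]


def donors_per_mutation(observations):
    """Count distinct donors for each (primary site, gene, change)."""
    donors = defaultdict(set)
    for donor_id, primary_site, gene, change in observations:
        donors[(primary_site, gene, change)].add(donor_id)
    return {key: len(ids) for key, ids in donors.items()}


counts = donors_per_mutation(observations)
print(counts[("pancreas", "KRAS", "G12D")])
```

Collecting donor IDs in a set means a donor with several specimens carrying the same mutation is counted once, which is what a "number of donors" figure requires.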
Technologies
Chaplin Web GUI
Indexing
Processing
&
Data Model
Core
Brian O’Connor/
Vincent Ferretti
KRAS search
• Summary
• Cancer type distribution
• Other links (Cosmic, Entrez, etc)
• Mutation profile in protein
• Domains
• Genomic Context
• Mutation profile
• Most common mutations
http://dcc.icgc.org/genes/ENSG00000133703
http://goo.gl/qUzuAi
Donor
• Donor ID
• Primary site
• Cancer Project
• Gender
• Tumor Stage
• Vital Status
• Disease Status
• Release type
• Age at diagnosis
• Available data types
• Analysis types
Genes
Mutations
• Consequences
• Type
• Platform
• Verification status
Exporting data
Can do bulk download of the data …
[Diagram labels: ICGC BAM/FASTQ; TCGA BAM/FASTQ; ICGC Open Data (includes TCGA Open Data); COSMIC Open Data; DACO, ICGC, dbGaP, EGA, TCGA, ERA, BAM, Open, Germ Line, + EGA id]
ICGC Data Categories
ICGC Open Access Datasets
• Cancer Pathology
– Histologic type or subtype
– Histologic nuclear grade
• Donor
– Gender
– Age range
• RNA expression (normalized)
• DNA methylation
• Genotype frequencies
• Somatic mutations (SNV, CNV and Structural Rearrangement)
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome Data
– Patient demography
– Risk factors
– Examination
– Surgery/Drugs/Radiation
– Sample/Slide
– Specific histological features
– Protocol
– Analyte/Aliquot
• Gene Expression (probe-level data)
• Raw genotype calls (germline)
• Gene-sample identifier links
• Genome sequence files
Most of the data in the portal is publicly available without restriction. However,
access to some data, such as germline mutations, requires authorization by the Data
Access Compliance Office (DACO).
Module 1: Cancer Genomic Databases bioinformatics.ca
http://icgc.org/daco
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
– Region of residence
– Risk factors
– Examination
– Surgery
– Radiation
– Sample
– Slide
– Specific histological features
– Analyte
– Aliquot
– Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC Open Access (OA) Datasets
• Cancer Pathology
– Histologic type or subtype
– Histologic nuclear grade
• Patient/Person
– Gender, Age range,
– Vital status, Survival time
– Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and Loss of Heterozygosity
• Newly discovered somatic variants
http://goo.gl/w4mrV
Identify yourself
Fill out a detailed form, which includes:
• Contact and Project Information
• Information Technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print and get your
institution to sign off on your behalf
DACO approved projects:
59 groups - 75% academic
(~400 people)
DACO/DCC User Data Access Process
• Users approved through DACO are now automatically granted access to
ICGC controlled access datasets available through the ICGC Data Portal
and the EBI’s EGA repository
[Diagram: DACO Web Application, DCC User Registry, DCC Data Portal, EBI EGA; labels: "application approved by DACO", "user accounts activated"]
Future Work for the DCC
• Work with projects to improve in a number of areas:
– Clinical data content
– Increasing frequency of data releases
• Better metadata collection from the EGA
– Working with EGA to better match metadata requirements for ICGC member
submissions; will enable reliable linking by Sample ID, Donor ID, etc. between data
portal and EGA. Will allow direct link to DACO approved users
– Projects will be required to provide this required metadata at submission time,
existing EGA datasets will be updated.
• Improve access to projects’ analysis methods
– Suggested publishing analysis SOPs in Standards in Genomic Sciences at most
recent ICGC workshop; haven’t seen any interest in doing this from member projects.
– DCC to host centralized web page(s) for each project’s analysis methods; use
permalink in submission files.
• Better documentation … always need more!
• Better transparency of processes
• Better links to publications
Future Work for the DCC
• New releases:
– Release 15: finished before Christmas
• All data submissions sent in again, plus new data
• (no methylation data)
– Release 16: incremental submission + Methylation data,
released before May
– Release 17: adopt incremental for all data types, and
increase frequency of releases.
New Project: ICGC PANCANCER analysis
• 2,000 Whole genome sequencing
– 6 cloud infrastructures across the world
– Appropriate policy and tool availability
– Agreed upon shared pipelines, and others
– Shared datasets
– Petabytes of files, 10,000s of cores
– Mutation analysis, as well as CNV, Structural, others
when feasible (RNA and methylome).
Challenges and Opportunity
• Targeted sequencing for Patient
Selection
• Consent
• Combinations
• Corrected features and #features >>
#samples
• Noisy and incomplete data
• Speed and cost
Adapted from Paul Rejto, Pfizer
We are also hiring!
FGED’s mission:
To be a positive agent of
change in the effective
sharing and reproducibility
of functional genomic data
fged.org
Acknowledgments
DCC Software Developers
Vincent Ferretti
Brian O’Connor
Junjun Zhang
Anthony Cros
Jonathan Guberman
Bob Tiernay
Shane Wilson
Long Yao
Daniel Chang
Jerry Lam
Stuart Watt
ICGC Project leaders at the OICR:
• Tom Hudson
• John McPherson
• Lincoln Stein
• Paul Boutros
• Lakshmi Muthuswamy
• Vincent Ferretti
• Francis Ouellette
• Jennifer Jennings
Ouellette Lab
Michelle Brazas
Emilie Chautard
Nina Palikuca
Matthew Ziembicki
Web Dev
Miyuki Fukuma
Kamen Wu
Joseph Yamada
Salman Badr
Pipeline Development
& Evaluation
Morgan Taschuk
Rob Denroche
Peter Ruzanov
Zhibin Lu
DCC Data Coordinator
Hardeep Nahal
http://oicr.on.ca http://icgc.org
… and all the patients and their
families that are putting their
hopes into our work!
• FGED
Alvis Brazma
Roger Bumgarner
Cesare Furlanello
Michael Miller
Francis Ouellette
John Quackenbush –
Dana-Farber
Michael Reich
Gabriella Rustici
Chris Stoeckert
Ronald Taylor
Steve Trutane
Jennifer Weller
Brian Wilhelm
Neil Winegarden
Informatics and Biocomputing at the OICR
Maya and Pascale, 2012
http://icgc.org
This presentation: http://goo.gl/HP613K
Video tutorial: https://vimeo.com/75522669