Multiple mouse reference genomes and strain specific gene annotations
Thomas Keane, Wellcome Trust Sanger Institute, @drtkeane
Sequence variation
➢ 36 inbred strains with whole-genome Illumina sequencing
➢ SNPs, indels, and structural variants
➢ Are there more inbred strains with deep whole-genome Illumina sequencing?
➢ LG/J, SM/J, and JF1/MsJ pending
Anthony Doran, WTSI
Genome assemblies
➢ REL-1412: Illumina mate-pair-based de novo scaffolds
➢ REL-1504: Pseudo-chromosomes
○ Alignment synteny with GRCm38
○ Evaluation with PacBio WGS/cDNA showed excessive reference bias
➢ REL-1509: Pseudo-chromosomes based on breakpoint graphs
○ Dovetail Genomics scaffolds for CAST/EiJ, PWK/PhJ, and SPRET/EiJ
1. Contigs: paired-end Illumina
2. Scaffolds: large fragment ends (3, 6, 10 kb, Dovetail, BAC ends)
3. Pseudo-chromosomes (e.g. Chr1): whole-genome alignments
PacBio alignments
➢ Use PacBio long reads alignment contiguity to validate the chromosome sequence
➢ Compare the number of inconsistently mapped reads
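One way to make "alignment contiguity" concrete is sketched below. It is only an illustration: the definition of an inconsistently mapped read (longest single alignment block covering less than a chosen fraction of the read) and the 0.9 threshold are assumptions, not the criteria used for the slides.

```python
def longest_alignment_fraction(read_length, blocks):
    """Fraction of a read covered by its longest single alignment block.

    blocks: list of (read_start, read_end) half-open intervals on the read.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    longest = max((end - start for start, end in blocks), default=0)
    return longest / read_length


def count_inconsistent(reads, threshold=0.9):
    """Count reads whose longest block covers less than `threshold` of the read.

    reads: iterable of (read_length, blocks). The threshold is hypothetical.
    """
    return sum(
        1
        for read_length, blocks in reads
        if longest_alignment_fraction(read_length, blocks) < threshold
    )


# Toy comparison: the same two reads aligned to two candidate assemblies.
assembly_a = [(10000, [(0, 9500)]), (8000, [(0, 8000)])]
assembly_b = [(10000, [(0, 4000), (4000, 10000)]), (8000, [(0, 8000)])]
print(count_inconsistent(assembly_a), count_inconsistent(assembly_b))
```

Under this assumed definition, the assembly with the lower count is the one the long reads support more contiguously.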
Figure: PacBio WGS and cDNA alignments (PWK/PhJ)
Dovetail Genomics: CAST/EiJ, PWK/PhJ, SPRET/EiJ
A) High molecular weight (50+ kbp) input DNA
B) Reconstitute chromatin from the input DNA
C) Addition of a fixative agent (e.g., formaldehyde) produces crosslinks
D) Crosslinked chromatin digested with a restriction endonuclease to generate sticky-ended fragments
E+F) Sticky ends are filled in with biotinylated nucleotides, and DNA ligase is added to perform blunt-end ligation of the many ends within a given chromatin aggregate
G) Chromatin is removed, and the DNA is purified and processed to remove biotin from unligated ends
Biotin-containing fragments are then enriched and a sequencing library is prepared
http://dovetailgenomics.com/
Dovetail Scaffolds

REL-1412
Strain     Length (Gbp)  Scaffolds  N50 (Mbp)  Largest (Mbp)  % Ns
CAST/EiJ   2.69          382,843    0.644      4.75           11.4
PWK/PhJ    2.53          271,282    0.390      4.0            6.3
SPRET/EiJ  2.66          297,604    0.361      2.82           9.4

REL-1412+Dovetail
Strain     Length (Gbp)  Scaffolds  N50 (Mbp)  Largest (Mbp)  % Ns
CAST/EiJ   2.69          367,627    22.216     90.4           11.5
PWK/PhJ    2.58          251,844    24.066     100.6          7.44
SPRET/EiJ  2.66          272,127    23.475     88.6           9.5
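For readers unfamiliar with the N50 and % Ns columns, a minimal sketch of how such statistics are conventionally computed follows. The function names and the toy scaffold lengths are illustrative only and are not taken from the assemblies above.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def percent_n(sequence):
    """Percentage of gap characters (N or n) in a sequence string."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


toy_scaffolds = [80, 70, 50, 40, 30, 20]  # hypothetical lengths
print(n50(toy_scaffolds))       # 70
print(percent_n("ACGTNNNNAC"))  # 40.0
```

Joining scaffolds raises N50 without adding sequence, which is consistent with the tables above: total length barely changes while N50 increases from under 1 Mbp to over 20 Mbp.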
PacBio WGS alignments
➢ Proportion of WGS reads whose hits are in mixed orientations rather than all in one orientation (lower is better)
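The orientation metric can be sketched as follows, assuming each alignment hit has already been reduced to a (read name, strand) pair. That input format is an assumption made for illustration; the slides do not describe the actual pipeline.

```python
from collections import defaultdict


def mixed_orientation_fraction(hits):
    """Fraction of reads that have hits on both strands.

    hits: iterable of (read_name, strand) pairs, strand being '+' or '-'.
    """
    strands_by_read = defaultdict(set)
    for read_name, strand in hits:
        strands_by_read[read_name].add(strand)
    if not strands_by_read:
        return 0.0
    mixed = sum(1 for strands in strands_by_read.values() if len(strands) > 1)
    return mixed / len(strands_by_read)


toy_hits = [
    ("r1", "+"), ("r1", "+"),  # consistent
    ("r2", "+"), ("r2", "-"),  # mixed: possible misassembly or inversion
    ("r3", "-"),               # consistent
    ("r4", "-"), ("r4", "-"),  # consistent
]
print(mixed_orientation_fraction(toy_hits))  # 0.25
```

A read with hits on both strands may also reflect a genuine inversion in the strain, so the fraction is best read as a relative measure between assembly versions.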
Complex regions - Nlrp1 paralogs
Figure panels: pseudo-chromosomes (pre-Dovetail) vs. post-Dovetail
➢ A dozen highly polymorphic complex loci
○ Major urinary proteins (MUPs), H2/MHC, IRG, Nlrp, etc.
Jingtao Lilue, WTSI
Pseudo-chromosomes (REL-1509)
Strain       Length (Gbp)  Sequences (>2kb)  N50 (Mbp)  Largest (Mbp)  %N
129S1_SvImJ  2.73          7,153             134.54     202.56         0.15
A_J          2.63          4,687             129.07     194.20         0.11
AKR_J        2.71          5,954             132.98     199.99         0.13
BALB_cJ      2.63          3,824             129.64     194.91         0.11
C3H_HeJ      2.70          4,069             133.07     200.88         0.14
C57BL_6NJ    2.81          3,893             139.12     208.92         0.18
CAST_EiJ     2.65          2,976             133.75     200.42         0.14
CBA_J        2.92          5,465             144.78     216.63         0.21
DBA_2J       2.61          4,104             128.21     192.93         0.11
FVB_NJ       2.59          5,013             127.06     191.00         0.11
LP_J         2.73          3,498             135.16     203.66         0.16
NOD_ShiLtJ   2.98          5,551             147.35     223.33         0.23
NZO_HlLtJ    2.70          7,022             132.96     199.80         0.14
PWK_PhJ      2.60          5,085             127.27     191.61         0.11
SPRET_EiJ    2.63          5,405             131.95     198.85         0.11
WSB_EiJ      2.69          2,238             133.18     200.11         0.16
➢ Propose to make REL-1509 the first annotated reference genomes for the laboratory strains
Gene prediction approach
Evidence: Gencode M7 annotation (C57BL/6J) and strain-specific RNA-Seq
Ian Fiddes, UCSC
Stefanie König, U. Greifswald
Mario Stanke, U. Greifswald
➢ TransMap - lift over as much of the Gencode C57BL/6J genome annotation as possible
○ Local Augustus - refine the liftover to allow small adjustments based on strain-specific RNA-Seq
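The idea of lifting an annotation from the reference onto a strain assembly can be illustrated with a toy coordinate projection through ungapped alignment blocks. This is not TransMap's implementation; the block representation, function names, and coordinates below are all hypothetical.

```python
from typing import List, Optional, Tuple

# One ungapped, same-strand alignment block:
# (reference_start, strain_start, length), 0-based coordinates.
Block = Tuple[int, int, int]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Project a reference position onto the strain, or None if unaligned."""
    for ref_start, strain_start, length in blocks:
        if ref_start <= pos < ref_start + length:
            return strain_start + (pos - ref_start)
    return None


def lift_exon(start: int, end: int, blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Project a half-open exon [start, end) by lifting its first and last base.

    Returns None when either end falls in unaligned sequence.
    """
    new_start = lift_position(start, blocks)
    new_last = lift_position(end - 1, blocks)
    if new_start is None or new_last is None:
        return None
    return (new_start, new_last + 1)


# Two blocks separated by a 10 bp reference gap and a 5 bp strain gap.
toy_blocks = [(100, 1000, 50), (160, 1055, 40)]
print(lift_exon(110, 140, toy_blocks))  # (1010, 1040)
print(lift_exon(150, 155, toy_blocks))  # None: exon lies in the gap
```

Exons that straddle an indel or fall in unaligned sequence are the cases where a purely alignment-based projection breaks down, which is where a refinement step using strain-specific RNA-Seq becomes useful.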
How many genes have at least one fully correct transcript?
Ian Fiddes, UCSC
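The question above amounts to a per-gene tally, sketched below. The input is assumed to be one record per projected transcript with a precomputed "fully correct" flag; how that flag is defined is not stated in the slides, so it is treated here as given.

```python
from collections import defaultdict


def genes_with_correct_transcript(transcripts):
    """Count genes having at least one fully correct transcript.

    transcripts: iterable of (gene_id, transcript_id, is_fully_correct).
    Returns (number of genes with >= 1 correct transcript, total genes).
    """
    has_correct = defaultdict(bool)
    for gene_id, _transcript_id, is_fully_correct in transcripts:
        has_correct[gene_id] = has_correct[gene_id] or bool(is_fully_correct)
    return sum(has_correct.values()), len(has_correct)


toy_transcripts = [  # hypothetical identifiers
    ("geneA", "tx1", False),
    ("geneA", "tx2", True),
    ("geneB", "tx3", False),
    ("geneC", "tx4", True),
]
print(genes_with_correct_transcript(toy_transcripts))  # (2, 3)
```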
➢ Comparative gene prediction: Augustus CGP
○ Generate gene predictions based primarily on RNA-Seq evidence
○ Allows for predictions of new transcripts and exons absent in C57BL/6J
Figure: TransMap, TransMap + local Augustus, and Augustus CGP predictions, drawing on Gencode M7 (C57BL/6J) and strain-specific RNA-Seq evidence, feed a consensus gene set
Efcab13-Efcab3 hybrid
Stefanie König, U. Greifswald
Charlie Steward, WTSI
What about human?
NOT VALIDATED (YET)!
Dnah14: dynein, axonemal, heavy chain 14
Stefanie König, U. Greifswald
Charlie Steward, WTSI
Gene extensions - Dnah14
NOT VALIDATED (YET)!
Complex regions - Nlrp1 paralogs
Jingtao Lilue, WTSI
Figure panels: C57BL/6J, PWK/PhJ, PWK/PhJ assembly
How can I look at the genomes?
http://hgwdev-mus-strain.sdsc.edu
Mark Diekhans, UCSC
Ian Fiddes, UCSC
Change co-ordinate system to strain of interest
How can I look at the genomes?
http://mice-geval.sanger.ac.uk
Developed and maintained by the Genome Reference Informatics Team
Kerstin Howe,WTSI
Acknowledgements
➢ Wellcome Trust Sanger Institute
○ Anthony Doran, Kim Wong, Dirk-Dominik Dolle, Jingtao Lilue, Monica Abrudan
○ David Adams, Richard Durbin, Kerstin Howe, Jennifer Harrow, Charles Steward, Mark Thomas, Ruth Bennet, Jo Wood, James Torrance, Will Chow, Mike Quail, Matt Dunn, Marcela Sjoberg, James Gilbert, Ed Griffiths, Anne Ferguson-Smith
➢ UCSC
○ Benedict Paten, Joel Armstrong, Mark Diekhans, Dent Earl, Ian Fiddes
➢ EBI
○ David Thybert, Duncan Odom, Paul Flicek
➢ University of Greifswald
○ Mario Stanke, Stefanie König
➢ Salk Institute
○ Son Pham, Mikhail Kolmogorov
➢ Yale
○ Fabio Navarro, Cristina Sisu, Mark Gerstein
➢ Wellcome Trust Centre for Human Genetics
○ Jonathan Flint, Richard Mott, Leo Goodstadt
➢ Jackson Laboratory
○ Laura Reinholdt, Anne Czechanski
➢ URLs
○ http://www.sanger.ac.uk/science/data/mouse-genomes-project
○ http://hgwdev-mus-strain.sdsc.edu
○ http://mice-geval.sanger.ac.uk/index.html
2014-2017 2015-2018
Sequence Variation Infrastructure Group, WTSI
BioNano Genomics optical mapping
10 kb mate-pair consistency