Clinical significance of transcript alignment discrepancies gne - 20141016
![Page 1: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/1.jpg)
Reece Hart, Ph.D.
[email protected] (23andme.com)
Genentech, 2014-10-16
The Clinical Significance of Transcript Alignment Discrepancies
… and tools to help you deal with them.
Available on SlideShare (http://www.slideshare.net/reecehart)
![Page 2: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/2.jpg)
The fidelity of transcript-genome mapping matters.
➢ Variants are identified and computed on in genome coordinates (g.)
➢ Variants are analyzed and communicated using transcript coordinates (c.)
➢ Mapping is needed in both directions: genome to transcript (g. to c.) and transcript to genome (c. to g.)
![Page 3: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/3.jpg)
Motivation 1: Discordant exon coordinates
NCBI and UCSC report different coordinates for CARD9, NM_052813.3, exon 12.
[Figure: UCSC (BLAT) and NCBI (Splign) alignments of NM_052813.3; exon 12 is displaced by 322 nt.]
Consequences:
1. An assay that targets the wrong genomic region will generate uninformative sequence data.
2. A genomic variant will be interpreted as exonic when it is intronic, or vice versa.
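Consequence 2 can be illustrated with a minimal Python sketch. The exon coordinates below are invented for illustration and are not the real CARD9 coordinates; only the 322 nt displacement is taken from the slide. The function name `classify` and the half-open interval convention are assumptions of this sketch.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based half-open genomic interval


def classify(pos: int, exons: List[Exon]) -> str:
    """Classify a genomic position against one exon structure."""
    if any(start <= pos < end for start, end in exons):
        return "exonic"
    first_start = min(start for start, _ in exons)
    last_end = max(end for _, end in exons)
    if first_start <= pos < last_end:
        return "intronic"
    return "outside transcript"


# Hypothetical exon structure for one transcript as reported by one source.
source_a = [(1000, 1200), (2000, 2150), (3000, 3400)]

# The same transcript from a second source, with the last exon displaced by 322 nt.
DISPLACEMENT = 322
last_start, last_end = source_a[-1]
source_b = source_a[:-1] + [(last_start + DISPLACEMENT, last_end + DISPLACEMENT)]

variant_pos = 3100
for name, exons in (("source A", source_a), ("source B", source_b)):
    print(name, classify(variant_pos, exons))
```

The same genomic position is called exonic under one exon structure and intronic under the other, which is the interpretation hazard the slide describes.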
![Page 4: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/4.jpg)
Motivation 2: Indels confound mapping
NM_006158.3 (NEFL) contains an indel in the CDS.
[Figure: alignments of NM_006158.3 to the genome.] The deletion is justified differently between alignments!
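Why "justification" matters can be shown with a short Python sketch: inside a repeat, the same deletion can be written at several positions that all yield the same resulting sequence. The sequence below is invented (it is not the NEFL sequence), and `shift_deletion` is a hypothetical helper, not part of any tool named in these slides.

```python
def shift_deletion(seq: str, start: int, end: int, direction: str) -> tuple:
    """Shift the deletion seq[start:end] (0-based, half-open) as far as possible
    to the left or right without changing the sequence that remains."""
    if direction == "left":
        # Moving one base left is neutral when the base entering the deleted
        # window equals the base leaving it.
        while start > 0 and seq[start - 1] == seq[end - 1]:
            start -= 1
            end -= 1
    elif direction == "right":
        while end < len(seq) and seq[start] == seq[end]:
            start += 1
            end += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start, end


def apply_deletion(seq: str, start: int, end: int) -> str:
    """Return seq with seq[start:end] removed."""
    return seq[:start] + seq[end:]


reference = "GATCAGCAGCAGTT"          # toy reference with a CAG repeat
left = shift_deletion(reference, 6, 9, "left")
right = shift_deletion(reference, 6, 9, "right")
print(left, right)
print(apply_deletion(reference, *left) == apply_deletion(reference, *right))
```

Two tools that justify in opposite directions report different coordinates for the same event, so coordinates should be normalized to one convention before they are compared.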
![Page 5: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/5.jpg)
Motivation 3: Data management challenges
➢ Mutable data (!)
➢ Sporadic failures
➢ Inconsistent data from a single source
➢ Inconsistent data across sources
➢ Opaque and implicit data definitions
➢ Historical alignment data not available

| Source | AC | Reference | Exons |
| --- | --- | --- | --- |
| EUtils | NM_005168.3 | GRCh37.p10 | 1146 / 125 / 320 / 1998 |
| | NM_005168.4 | NG_008492.1 | 1398 / 125 / 320 / 1998 |
| seqgene | NM_005168.3 | GRCh37.p10 | 102 / 1046 / 125 / 321 / 143 / 1855 |
| UCSC | NM_005168.4 | hg19 | 1398 / 135 / 244 / 76 / 1997 |
![Page 6: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/6.jpg)
Motivation 4: Use Ensembl for Variant Effect Prediction
➢ RefAgree: Do transcript and genome sequences agree?
➢ Transcript Equivalence: Which RefSeq and Ensembl transcripts are equivalent?
[Diagram: RefSeq (NM), Ensembl (ENST), and Genome (GRCh37), annotated with ➊ SNV, ➋ Indel, ➌, and ➍ Historical Transcripts; other sources shown: UCSC (NM), LRG, BIC, …]
![Page 7: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/7.jpg)
Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011). MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658
![Page 8: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/8.jpg)
Challenges and Solutions in Transcript Management
➢ Biological
● Alternative splicing
● Paralogs
● Natural polymorphisms
● Alternative references
➢ Technical / Logistical
● Multiple transcript sources
● Multiple alignment methods
● Multiple references
● Genome-transcript sequence differences
● Historical transcript alignments
➢ Existing resources
● RefSeq, UCSC, Ensembl
● Locus Reference Genomic
● Mutalyzer
➢ See also
● McCarthy DJ, et al. Genome Medicine 6:26 (2014).
● Garla V, et al. Bioinformatics 27(3): 416–8 (2011).
![Page 9: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/9.jpg)
![Page 10: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/10.jpg)
Part 1
The Universal Transcript Archive
![Page 11: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/11.jpg)
UTA solves four issues with transcript management.
➊ SNV: transcript ≠ genome reference
➋ InDel: transcript ≠ genome reference
➌ Exon coordinate differences between sources for the same accession
➍ Historical transcript alignments no longer available
[Diagram: RefSeq NM_01234.4, RefSeq NM_01234.5, and UCSC NM_01234.5 compared with the genome reference, with base labels T and A.]
![Page 12: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/12.jpg)
Universal Transcript Archive (UTA)
Multiple sources, multiple versions, multiple alignment methods in one database

| transcript | reference | method | exons |
| --- | --- | --- | --- |
| NM_01234.4 | NM_01234.4 | self | exon set |
| NM_01234.4 | NC_000012.3 | splign | exon set |
| NM_01234.5 | NM_01234.5 | self | exon set |
| NM_01234.5 | NC_000012.3 | splign | exon set |
| NM_01234.5 | AC_45678.9 | splign | exon set |
| NM_01234.5 | NC_000012.3 | blat | exon set |
| ENST012345 | ENST012345 | self | exon set |
| ENST012345 | NC_000012.3 | genebuild | exon set |
![Page 13: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/13.jpg)
Universal Transcript Archive (UTA), continued: each exon set carries exon alignments, which record issues ➊ and ➋
exon alignments
NM_01234.4  NC_000012.3  0  50=
NM_01234.4  NC_000012.3  1  100=1X49=
NM_01234.4  NC_000012.3  2  5=1I44=
Alignments use coordinates from source databases.
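The per-exon alignment strings can be read mechanically. The sketch below assumes they are extended CIGAR strings such as `100=1X49=`, in which `=` is a run of matches, `X` a substitution, and `I`/`D` an insertion or deletion; the slide does not spell out this encoding, so treat the operator meanings and the helper names as assumptions.

```python
import re
from typing import Dict, List, Tuple

_CIGAR_OP = re.compile(r"(\d+)([=XID])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split an extended CIGAR string into (length, operator) pairs."""
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return ops


def summarize(cigar: str) -> Dict[str, int]:
    """Count aligned bases per operator for one exon alignment."""
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in parse_cigar(cigar):
        counts[op] += length
    return counts


for exon_cigar in ("50=", "100=1X49=", "5=1I44="):
    print(exon_cigar, summarize(exon_cigar))
```

An exon whose summary has a non-zero `X` count carries a substitution relative to the reference (issue ➊), and a non-zero `I` or `D` count marks an indel (issue ➋).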
![Page 14: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/14.jpg)
[Build of the previous slide: the same UTA table, highlighting issue ➌ (exon coordinate differences between sources for the same accession).]
![Page 15: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/15.jpg)
[Build of the previous slide: the same UTA table, highlighting issue ➍ (historical transcript alignments no longer available).]
![Page 16: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/16.jpg)
“RefAgree” Statistics by Protein Coding Transcript
Sequence concordance between RefSeq and GRCh37 primary assembly (issues ➊➋)
c.f. Garla V, et al. Bioinformatics 27(3): 416–8 (2011).
34531 NM transcripts (Jan 2014)
● 760 (0.2%) with length discrepancies
● 3481 (10%) with substitutions
● 321 (0.9%) with deletions
● 255 (0.7%) with insertions
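As a quick arithmetic check, the fractions can be recomputed from the counts on the slide. This is only a sketch of the calculation. Note that 760 of 34531 works out to about 2.2%, not the 0.2% printed in the slide text, so one of those two figures may have been garbled in transcription.

```python
TOTAL_NM_TRANSCRIPTS = 34531  # NM transcripts, Jan 2014 (from the slide)

counts = {
    "length discrepancies": 760,
    "substitutions": 3481,
    "deletions": 321,
    "insertions": 255,
}

# Percentage of all NM transcripts in each category, rounded to one decimal.
percentages = {
    label: round(100 * n / TOTAL_NM_TRANSCRIPTS, 1) for label, n in counts.items()
}

for label, pct in percentages.items():
    print(f"{counts[label]:>5}  {pct:>5}%  with {label}")
```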
![Page 17: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/17.jpg)
Exon structures have unique fingerprints
Identifying ENST-NM equivalences with fingerprints

=> select N.hgnc, N.es_fingerprint, N.tx_ac, E.tx_ac
   from uta_20140210.tx_exon_set_summary_mv N
   join uta_20140210.tx_exon_set_summary_mv E
     on N.es_fingerprint = E.es_fingerprint
    and N.tx_ac ~ '^NM_' and E.tx_ac ~ '^ENST'
    and N.alt_aln_method = 'transcript' and E.alt_aln_method = 'transcript';

| hgnc | es_fingerprint | tx_ac | tx_ac |
| --- | --- | --- | --- |
| AFF2 | db0e20be1a2bb687c33227d2e6bf9d53 | NM_002025.3 | ENST00000370460 |
| UBE3A | d1eace7da295c45378fa5f898f2f03f6 | NM_130838.1 | ENST00000438097 |
| ANXA8L1 | 1f6fd4f3fe9854aa468489ec7f507512 | NM_001098845.1 | ENST00000359178 |
| APOL5 | 939a9e9e4a46ef9aef862cf9b369afe6 | NM_030642.1 | ENST00000249044 |
| ARID4B | 524fc954d10b08a4014e86aee81d0358 | NM_016374.5 | ENST00000264183 |
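The fingerprint idea can be sketched in Python. The slide does not say how UTA computes `es_fingerprint`; the version below assumes an MD5 digest over a canonical string of exon coordinates, which matches the 32-hex-digit values shown but is otherwise a guess. The accessions and coordinates are toy values.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]


def exon_set_fingerprint(exons: List[Exon]) -> str:
    """Hash a canonical (sorted) rendering of the exon coordinates."""
    canonical = ";".join(f"{start},{end}" for start, end in sorted(exons))
    return hashlib.md5(canonical.encode("ascii")).hexdigest()


def equivalent_pairs(exon_sets: Dict[str, List[Exon]]) -> List[Tuple[str, str]]:
    """Pair NM_ and ENST accessions whose exon sets share a fingerprint."""
    by_fingerprint = defaultdict(list)
    for tx_ac, exons in exon_sets.items():
        by_fingerprint[exon_set_fingerprint(exons)].append(tx_ac)
    pairs = []
    for accessions in by_fingerprint.values():
        refseq = [ac for ac in accessions if ac.startswith("NM_")]
        ensembl = [ac for ac in accessions if ac.startswith("ENST")]
        pairs.extend((nm, enst) for nm in refseq for enst in ensembl)
    return sorted(pairs)


# Toy exon sets in transcript coordinates (hypothetical accessions).
toy_exon_sets = {
    "NM_TOY1.1": [(0, 100), (100, 250)],
    "ENST_TOY1": [(100, 250), (0, 100)],   # same structure, different order
    "NM_TOY2.1": [(0, 100), (100, 260)],   # one exon end differs
}
print(equivalent_pairs(toy_exon_sets))
```

Hashing reduces the comparison of whole exon structures to an equality join on a single column, which is what the SQL query above does.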
![Page 18: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/18.jpg)
NCBI (Splign) v. UCSC (BLAT) Alignment Statistics
Splign and BLAT provide significantly different exon structures for 886 transcripts (issue ➌)
Of 32358 transcripts with exon structures, are the Splign and BLAT alignments similar?
● Yes: 31472 transcripts (97.3%)
● No: 886 transcripts (2.7%)
“Similar” means either
1) identical exon coordinates, or
2) coordinates that differ only by short 3' terminal artifacts
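The two-part definition of "similar" might be coded as below. This is a sketch under stated assumptions: exons are listed in 5'-to-3' order on the plus strand, and "short" is taken to mean at most 20 nt, a threshold chosen here for illustration because the slide does not give one.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates


def exon_structures_similar(a: List[Exon], b: List[Exon], max_tail_diff: int = 20) -> bool:
    """True if two exon structures are identical, or differ only in the
    3' end of the last exon by at most max_tail_diff nucleotides."""
    if a == b:
        return True                      # criterion 1: identical coordinates
    if not a or len(a) != len(b):
        return False
    if a[:-1] != b[:-1]:
        return False                     # any internal difference is significant
    (a_start, a_end), (b_start, b_end) = a[-1], b[-1]
    # criterion 2: only the 3' terminus differs, and only slightly
    return a_start == b_start and abs(a_end - b_end) <= max_tail_diff


splign = [(1000, 1200), (2000, 2150), (3000, 3400)]
blat_tail = [(1000, 1200), (2000, 2150), (3000, 3405)]      # 5 nt 3' difference
blat_displaced = [(1000, 1200), (2000, 2150), (3322, 3722)]  # exon moved 322 nt

print(exon_structures_similar(splign, blat_tail))
print(exon_structures_similar(splign, blat_displaced))
```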
![Page 19: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/19.jpg)
Characterization of transcript discrepancies
Whether alignments provided by NCBI and UCSC agree with GRCh37 primary sequence (T = agrees, F = does not agree).

| | Splign T | Splign F |
| --- | --- | --- |
| BLAT T | 14 | 18 |
| BLAT F | 545 | 311 |

886 transcripts with significant discrepancies
![Page 20: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/20.jpg)
Characterization of transcript discrepancies
Reference agreement (blue) and alignment “simplicity” (green)
[Figure: the Splign/BLAT reference-agreement table from the previous slide (14, 18, 545, 311), with each cell broken down into a further Splign/BLAT T/F table. Extracted cell values: 200(0), 4(97), 90(82), 16(84); 6(41), 12(180); 434(7), 110(652); 14(11).]
886 transcripts with significant discrepancies
![Page 21: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/21.jpg)
ACMG “Must Report” Genes
Green, R. C., Berg, J. S., Grody, W. W., Kalia, S. S., Korf, B. R., Martin, C. L., … Biesecker, L. G. (2013). ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genetics in Medicine : Official Journal of the American College of Medical Genetics, 15(7), 565–74. doi:10.1038/gim.2013.73
![Page 22: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/22.jpg)
Summary of Splign-BLAT gene-wise coordinate deltas.

| delta | # genes | # ACMG must report |
| --- | --- | --- |
| =0 | 15206 | 45 |
| >=1 | 183 | 8 |
| >=10 | 116 | 0 |
| >=25 | 6 | 0 |
| >=50 | 5 | 0 |
| >=250 | 13 | 0 |
| >=1000 | 94 | 3 |

delta ≝ minimum per gene of maximum per transcript of difference of exon coordinates between NCBI and UCSC.
● delta = 0: identical exon structures
● delta >= 1 (8 ACMG genes, all trivial diffs): LDLR, MYL2, PRKAG2, SDHB, SDHC, TGFBR1, TGFBR2, WT1
● delta >= 1000 (3 ACMG genes): MYBPC3, MYH7, TNNI3
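The delta definition (minimum per gene of the maximum per transcript of the exon coordinate differences) can be written out as a short Python sketch. It assumes both sources report the same number of exons for a transcript, which the slide does not address, and it uses invented coordinates and hypothetical function names.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]
ExonPair = Tuple[List[Exon], List[Exon]]  # (NCBI exons, UCSC exons)


def transcript_delta(ncbi: List[Exon], ucsc: List[Exon]) -> int:
    """Largest absolute start or end difference over corresponding exons."""
    if len(ncbi) != len(ucsc):
        raise ValueError("this sketch assumes equal exon counts in both sources")
    return max(
        max(abs(n_start - u_start), abs(n_end - u_end))
        for (n_start, n_end), (u_start, u_end) in zip(ncbi, ucsc)
    )


def gene_delta(transcripts: Dict[str, ExonPair]) -> int:
    """Smallest transcript-level delta among a gene's transcripts."""
    return min(transcript_delta(ncbi, ucsc) for ncbi, ucsc in transcripts.values())


# One hypothetical gene with two transcripts.
toy_gene = {
    "tx1": ([(100, 200), (300, 400)], [(100, 200), (300, 722)]),  # delta 322
    "tx2": ([(100, 200), (300, 400)], [(100, 200), (300, 403)]),  # delta 3
}
print(gene_delta(toy_gene))
```

Taking the minimum over transcripts means a gene is counted in a high-delta row only when every one of its transcripts disagrees by at least that much, so the table is a conservative summary.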
![Page 23: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/23.jpg)
Part 2
Using HGVS “Nomenclature”
(http://www.hgvs.org/mutnomen/)
![Page 24: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/24.jpg)
HGVS Python Package
http://bitbucket.org/hgvs/hgvs/
➢ Parser
● HGVS → Python object
● Based on a Parsing Expression Grammar
➢ Formatter
● Python object → HGVS
➢ Validator
● intrinsic & extrinsic validation
➢ Mapping tools (indel-aware!)
● g. ↔ c. → p. (m, n, r also supported)
● transcript-to-transcript liftover
● uses UTA data
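The parser and formatter roles can be illustrated with a toy round trip. This is not the hgvs package's implementation: the package uses a Parsing Expression Grammar, whereas this sketch swaps in a regular expression that handles only simple substitutions with an optional intronic offset. The names `Substitution`, `parse`, and `format_variant` are hypothetical.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    ac: str      # accession, e.g. a transcript or chromosome accession
    kind: str    # coordinate type: g, c, m, or n
    pos: int     # base position
    offset: int  # intronic offset (0 if none)
    ref: str
    alt: str


_PATTERN = re.compile(
    r"^(?P<ac>[A-Z]+_?\d+\.\d+):(?P<kind>[gcmn])\."
    r"(?P<pos>\d+)(?P<offset>[+-]\d+)?(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse(text: str) -> Substitution:
    """HGVS string -> Python object (simple substitutions only)."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"not a supported substitution: {text!r}")
    return Substitution(
        m["ac"], m["kind"], int(m["pos"]), int(m["offset"] or 0), m["ref"], m["alt"]
    )


def format_variant(v: Substitution) -> str:
    """Python object -> HGVS string."""
    offset = f"{v.offset:+d}" if v.offset else ""
    return f"{v.ac}:{v.kind}.{v.pos}{offset}{v.ref}>{v.alt}"


variant = parse("NM_182763.2:c.688+403C>T")
print(variant)
print(format_variant(variant))
```

Keeping the parsed variant as a structured object is what makes validation and coordinate mapping possible without string manipulation.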
![Page 25: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/25.jpg)
Example: Variant liftover between transcripts
Map
➀ from NM_182763.2:c.688+403C>T
➁ to NC_000001.10:g.150550916G>A
➂ to NM_001197320.1:c.281C>T
with Splign alignments
[Figure: NM_001197320.1 (NP_001184249.1) and NM_182763.2 (NP_877495.1) aligned to NC_000001.10, with steps ➀, ➁, and ➂ marked.]
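With the hgvs package, the liftover on this slide might look like the sketch below. It uses the `AssemblyMapper` interface found in current hgvs releases, which may differ from the interface available when this talk was given. The wrapper function `liftover` is hypothetical, and running it requires the hgvs package plus network access to a UTA database and a sequence source.

```python
def liftover(hgvs_c: str, target_tx_ac: str, assembly: str = "GRCh37") -> str:
    """Map a c. variant to the genome and back onto another transcript."""
    # Imported lazily so this module still loads where hgvs is not installed.
    import hgvs.assemblymapper
    import hgvs.dataproviders.uta
    import hgvs.parser

    parser = hgvs.parser.Parser()
    # connect() uses the public UTA instance unless configured otherwise.
    hdp = hgvs.dataproviders.uta.connect()
    mapper = hgvs.assemblymapper.AssemblyMapper(
        hdp, assembly_name=assembly, alt_aln_method="splign"
    )

    var_c = parser.parse_hgvs_variant(hgvs_c)      # step 1: source transcript variant
    var_g = mapper.c_to_g(var_c)                   # step 2: genomic variant
    var_c2 = mapper.g_to_c(var_g, target_tx_ac)    # step 3: target transcript variant
    return str(var_c2)


# Example call for the slide's variant (needs hgvs and network access):
# liftover("NM_182763.2:c.688+403C>T", "NM_001197320.1")
```

Passing `alt_aln_method="splign"` selects the NCBI alignments stored in UTA, matching the "with Splign alignments" note on the slide.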
![Page 26: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/26.jpg)
Developer Info
Testing
➢ 91% code coverage
➢ 25665 test variants
● ~200 hand curated, rest from dbSNP
● 23436 sub, 1254 del, 908 ins, 45 delins, 22 dup
● 44 distinct transcripts, many selected for difficulty
➢ >99% concordance with Mutalyzer
● using >100K variants from ClinVar
Upcoming directions (all issues are publicly readable)
➢ multi-variant alleles
➢ release LRG
➢ GRCh38
➢ API changes
![Page 27: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/27.jpg)
Conclusions
➢ The fidelity of reference-transcript mapping matters
● For ~800 transcripts, Splign and BLAT generate significantly different alignments
● These differences might affect the interpretation of clinically-relevant genes (including 3 ACMG must report genes)
➢ Current resources have important limitations
➢ Two tools may help you deal with these limitations
● UTA: Freely available archive of transcripts from multiple sources
● HGVS: Comprehensive parsing, formatting, manipulation, and validation of variants
![Page 28: Clinical significance of transcript alignment discrepancies gne - 20141016](https://reader034.fdocuments.in/reader034/viewer/2022052316/557bf5d0d8b42ab9388b4636/html5/thumbnails/28.jpg)
Acknowledgements
➢ Invitae
● Vince Fusaro
● John Garcia
● Emily Hare
● Kevin Jacobs
● Geoff Nilsen
● Rudy Rico
● Jody Westbrook
● http://goo.gl/dq2uoW
http://bitbucket.com/hgvs/hgvs
http://bitbucket.com/uta/uta
➢ Code (Python)
➢ Documentation & Examples
➢ Issues
➢ BED files
➢ Code testing is public
Or just: pip install hgvs