IGB genome genometry data models by Gregg Helt and Cyrus Harmon
![Page 1: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/1.jpg)
Genometry
Gregg Helt, Cyrus Harmon
![Page 2: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/2.jpg)
Genometry
• Motivation and Purpose
• Points of Reference
• Genometry interfaces
• Genometry manipulations
• Genometry implementation
• Representation examples
• Prototype apps
• Current status, future work
![Page 3: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/3.jpg)
Motivation and Goals
• Desire for a more unified data model to represent relationships between biological sequences, such as:
– Annotations
– Alignments
– Sequence composition
• More networked, less hierarchical (genome-centric, transcript-centric)
• Simplicity
• Expressivity / Flexibility
• Memory and Computational Efficiency
• Use by others to provide core functionality for various Affy projects
![Page 4: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/4.jpg)
Points of Reference
• com.neomorphic.bio models
• Genisys DB and Genisys IDL
• EBI mapping models
• Apollo data models
• BioPerl
• BioJava
• Closest similarity to bio alignment models and Genisys alignment models
![Page 5: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/5.jpg)
Basic Annotations
Transcript T Genome G
Transcript T
G: 1000..5000
Exon E1 G:1000..1200
Exon E2 G:3000..3500
Exon E3 G:4500..5000
![Page 6: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/6.jpg)
Genometry Annotations – Specify All Coordinates
Transcript T Genome G
Transcript T
G: 1000..5000 T:0..1200
Exon E1 G:1000..1200
T:0..200
Exon E2 G:3000..3500
T:200..700
Exon E3 G:4500..5000 T:700..1200
![Page 7: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/7.jpg)
Genometry Annotations – All coordinates are relative to BioSeqs
Transcript T Genome G
TranscriptAnnot T1 G: 1000..5000
T:0..1200
ExonAnnot E1 G:1000..1200
T:0..200
ExonAnnot E2 G:3000..3500
T:200..700
ExonAnnot E3 G:4500..5000 T:700..1200
Transcript T
Genome G
![Page 8: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/8.jpg)
Genometry Annotations – SeqSpans encapsulate a range along a BioSeq
Transcript T Genome G
TranscriptAnnot T1: G:1000..5000 T:0..1200
ExonAnnot E1: G:1000..1200 T:0..200
ExonAnnot E2: G:3000..3500 T:200..700
ExonAnnot E3: G:4500..5000 T:700..1200
![Page 9: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/9.jpg)
Genometry Core Core
• BioSeq
– length, residues (optional)
• SeqSpan
– start, end, BioSeq
• SeqSymmetry
– SeqSpans (breadth)
– SeqSymmetry parent / child hierarchy (depth)
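The three core types can be sketched in a few lines. This is a minimal Python illustration, not the actual Genometry interfaces: the class and field names follow the slide, while the helper methods (`span_on`, `depth`), the `pair` function, and the genome length of 10,000 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BioSeq:
    """A sequence: a name, a length, and optional residues."""
    name: str
    length: int
    residues: Optional[str] = None


@dataclass(frozen=True)
class SeqSpan:
    """A start..end range along one BioSeq."""
    start: int
    end: int
    seq: BioSeq

    @property
    def length(self) -> int:
        return abs(self.end - self.start)


@dataclass
class SeqSymmetry:
    """Spans on one or more BioSeqs (breadth) plus child symmetries (depth)."""
    spans: List[SeqSpan] = field(default_factory=list)
    children: List["SeqSymmetry"] = field(default_factory=list)

    def span_on(self, seq: BioSeq) -> Optional[SeqSpan]:
        """Return this symmetry's span on `seq`, or None (hypothetical helper)."""
        for span in self.spans:
            if span.seq is seq:
                return span
        return None

    def depth(self) -> int:
        """Number of levels in the parent / child hierarchy below and including this one."""
        return 1 + max((child.depth() for child in self.children), default=0)


# Transcript example from the annotation slides. The genome length is assumed.
genome = BioSeq("G", 10_000)
transcript = BioSeq("T", 1_200)


def pair(g_start: int, g_end: int, t_start: int, t_end: int) -> SeqSymmetry:
    """Build a breadth-2 symmetry with one span on G and one on T."""
    return SeqSymmetry([SeqSpan(g_start, g_end, genome),
                        SeqSpan(t_start, t_end, transcript)])


transcript_annot = pair(1000, 5000, 0, 1200)
transcript_annot.children = [
    pair(1000, 1200, 0, 200),     # ExonAnnot E1
    pair(3000, 3500, 200, 700),   # ExonAnnot E2
    pair(4500, 5000, 700, 1200),  # ExonAnnot E3
]
```

Note that neither BioSeq holds a pointer to the annotation; the symmetry alone carries the relationship between G and T.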
![Page 10: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/10.jpg)
Expressiveness of Core Core
• “Standard” annotations
• Singleton annotations
• Alternative Splicing
• Pairwise alignments
• Annotations with depth > 2
• Annotations with breadth > 2
• Indels
• Structure of analyzed sequence
• Fuzzy locations
• All without explicit pointers from BioSeq to annotation
![Page 11: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/11.jpg)
Genometry Modelling of Insertions and Deletions #1a
Genome G: …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T: GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
• G:1000..2017 T:0..34
– G:1000..1017 T:0..18: insertion in transcript relative to genome (deletion in genome relative to transcript)
  G:1000..1006 T:0..6
  G:1006..1017 T:7..18
– G:2000..2017 T:18..34: deletion in transcript relative to genome (insertion in genome relative to transcript)
  G:2000..2010 T:18..28
  G:2011..2017 T:28..34
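In this representation an indel shows up as a mismatch between the gaps separating adjacent child spans on the two sequences. A small Python sketch of that check follows; the function name `find_indels` and the tuple layout are hypothetical, and the coordinates are the ones on this slide.

```python
from typing import List, Tuple

# One child symmetry as (g_start, g_end, t_start, t_end).
Child = Tuple[int, int, int, int]


def find_indels(children: List[Child]) -> List[Tuple[str, int, int]]:
    """Compare the gap between neighbouring children on G and on T.

    Assumes the children belong to one parent symmetry and are sorted
    in the same order on both sequences.
    """
    events = []
    for (_, g_end, _, t_end), (g_next, _, t_next, _) in zip(children, children[1:]):
        g_gap = g_next - g_end
        t_gap = t_next - t_end
        if t_gap > g_gap:
            # Extra transcript bases with no genome counterpart.
            events.append(("insertion in transcript", t_end, t_next))
        elif g_gap > t_gap:
            # Genome bases with no transcript counterpart.
            events.append(("deletion in transcript", g_end, g_next))
    return events


left_children = [(1000, 1006, 0, 6), (1006, 1017, 7, 18)]
right_children = [(2000, 2010, 18, 28), (2011, 2017, 28, 34)]

print(find_indels(left_children))   # insertion located on T
print(find_indels(right_children))  # deletion located on G
```

The check is run per parent symmetry, so the large genomic gap between the two aligned regions is not reported as a deletion.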
![Page 12: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/12.jpg)
Genometry Modelling of Insertions and Deletions #1b
Genome G: …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T: GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
Coordinate markers on G: g0 g1 g2 g3 g4 g4+1 g5
Coordinate markers on T: t0 t1 t1+1 t2 t3 t4 t5
• G:g0..g5 T:t0..t5
– G:g0..g2 T:t0..t2: insertion in transcript relative to genome (deletion in genome relative to transcript)
  G:g0..g1 T:t0..t1
  G:g1..g2 T:t1+1..t2
– G:g3..g5 T:t3..t5: deletion in transcript relative to genome (insertion in genome relative to transcript)
  G:g3..g4 T:t3..t4
  G:g4+1..g5 T:t4..t5
![Page 13: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/13.jpg)
Genometry Modelling of Insertions and Deletions #2
Genome G: …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T: GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
Coordinate markers on G: g0 g1 g2 g3 g4 g4+1 g5
Coordinate markers on T: t0 t1 t1+1 t2 t3 t4 t5
• G:g0..g5 T:t0..t5
– G:g0..g2 T:t0..t2: insertion in transcript relative to genome (deletion in genome relative to transcript)
  G:g0..g1 T:t0..t1
  G:g1..g2 T:t1+1..t2
  Inserted base: T:t1..t1+1 “C”:0..1
– G:g3..g5 T:t3..t5: deletion in transcript relative to genome (insertion in genome relative to transcript)
  G:g3..g4 T:t3..t4
  G:g4+1..g5 T:t4..t5
  Deleted base: G:g4..g4+1 “G”:0..1
![Page 14: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/14.jpg)
Genometry Modelling of Insertions and Deletions #3
Genome G: …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T: GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
Coordinate markers on G: g0 g1 g2 g3 g4 g4+1 g5
Coordinate markers on T: t0 t1 t1+1 t2 t3 t4 t5
• G:g0..g5 T:t0..t5
– G:g0..g2 T:t0..t2: insertion in transcript relative to genome (deletion in genome relative to transcript)
  G:g0..g1 T:t0..t1
  G:g1..g2 T:t1+1..t2
  Inserted base: T:t1..t1+1 G:g1..g1
– G:g3..g5 T:t3..t5: deletion in transcript relative to genome (insertion in genome relative to transcript)
  G:g3..g4 T:t3..t4
  G:g4+1..g5 T:t4..t5
  Deleted base: G:g4..g4+1 T:t4..t4
![Page 15: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/15.jpg)
Genometry Modelling of Insertions and Deletions #4
Genome G: …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T: GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
Coordinate markers on G: g0 g1 g2 g3 g4 g4+1 g5
Coordinate markers on T: t0 t1 t1+1 t2 t3 t4 t5
• G:g0..g5 T:t0..t5
– G:g0..g2 T:t0..t2: insertion in transcript relative to genome (deletion in genome relative to transcript)
  G:g0..g1 T:t0..t1
  G:g1..g2 T:t1+1..t2
  Inserted base: T:t1..t1+1 G:g1..g1 “C”:0..1
– G:g3..g5 T:t3..t5: deletion in transcript relative to genome (insertion in genome relative to transcript)
  G:g3..g4 T:t3..t4
  G:g4+1..g5 T:t4..t5
  Deleted base: G:g4..g4+1 T:t4..t4 “G”:0..1
![Page 16: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/16.jpg)
Modelling SNPs with Genometry: Two Approaches
SeqA = reference chromosome
SeqB = exactly the same as the reference chromosome, except for one SNP
SeqA: …GGCAAGGAATGATC… (SNP at x..x+1)
SeqB: …GGCAAGTAATGATC… (SNP at x..x+1)
I. SNPs as annotations of differences between sequences
I.a. annotation of just reference seq: SeqA:x..x+1
I.b. annotation of reference seq w/ variant base: “T”:0..1 SeqA:x..x+1
I.c. annotation of reference and variant seq: SeqB:x..x+1 SeqA:x..x+1
II. SNPs as gaps in similarity between two sequences
• SeqA:0..m SeqB:0..n
– SeqA:0..x SeqB:0..x
– SeqA:x+1..m SeqB:x+1..n
Also shown on the slide: “T”:0..1 SeqB:x..x+1
![Page 17: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/17.jpg)
![Page 18: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/18.jpg)
Sequence-oriented annotations
• AnnotatedBioSeq
– Contains a collection of SeqSymmetries that annotate the sequence
– Interfaces to retrieve annotations covered by a span within the sequence
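A rough Python sketch of a sequence that owns its annotations and answers span queries is shown here. The class `AnnotatedSeq` and its method names are hypothetical stand-ins rather than the Genometry interface, annotations are reduced to labelled ranges, and coordinates are assumed to be half-open (end exclusive), which is consistent with adjacent exons such as T:0..200 and T:200..700 on the earlier slides.

```python
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (label, start, end), end exclusive


class AnnotatedSeq:
    """A sequence that keeps its own list of annotations (hypothetical name)."""

    def __init__(self, name: str, length: int) -> None:
        self.name = name
        self.length = length
        self.annotations: List[Annotation] = []

    def add(self, label: str, start: int, end: int) -> None:
        self.annotations.append((label, start, end))

    def overlapping(self, start: int, end: int) -> List[str]:
        """Labels of annotations sharing at least one position with start..end."""
        return [label for label, s, e in self.annotations
                if s < end and start < e]

    def within(self, start: int, end: int) -> List[str]:
        """Labels of annotations lying entirely inside start..end."""
        return [label for label, s, e in self.annotations
                if start <= s and e <= end]


genome = AnnotatedSeq("G", 10_000)  # the length is an assumed value
genome.add("E1", 1000, 1200)
genome.add("E2", 3000, 3500)
genome.add("E3", 4500, 5000)

print(genome.overlapping(1100, 3200))  # exons touched by the query span
print(genome.within(900, 3600))        # exons fully covered by the query span
```

A linear scan is enough for a sketch; a real implementation would likely index the annotations to meet the efficiency goal stated earlier.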
![Page 19: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/19.jpg)
Annotation Networks
• Can traverse networks of annotations, alternating between AnnotatedBioSeqs and SeqSymmetries
AnnotatedBioSeqs: Annotated GenomicSeq G, Annotated mRNASeq M, Annotated ProteinSeq P
SeqSymmetries:
• domainOnProtein: proteinSpanA
• protein2mRNA: proteinSpanB, mrnaSpanB
• mRNA2genomic: genomicSpanC, mrnaSpanC
– m2gSub0: gSpanC0, mSpanC0
– m2gSub1: gSpanC1, mSpanC1
– m2gSub2: gSpanC2, mSpanC2
![Page 20: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/20.jpg)
Sequence Composition
• CompositeBioSeq
– Contains a SeqSymmetry describing the mapping of BioSeqs used in composition to the CompositeBioSeq itself
![Page 21: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/21.jpg)
Sequence Composition Representations
• Sequence Assembly / Golden Path / etc.
• Piecewise data loading / lazy data loading
• Genotypes
• Chromosomal Rearrangements
• Primer construction
• Reverse Complement
• Coordinate Shifting
![Page 22: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/22.jpg)
Genometry Modelling of Reverse Complement
Sequence B = reverse complement of Sequence A
BioSeq A, length: x
Composite BioSeq B, length: x
Sym AB (composition of B): A:0..x B:x..0
SeqA: AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA
SeqB: TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT
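The composition A:0..x ↔ B:x..0 implies that a position p on A corresponds to x − p on B. The Python sketch below checks this against the two sequences on the slide. The function names are hypothetical, and half-open (between-base) coordinates are assumed.

```python
from typing import Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(residues: str) -> str:
    """Complement every base, then reverse the string."""
    return residues.translate(COMPLEMENT)[::-1]


def map_span_to_reverse(start: int, end: int, length: int) -> Tuple[int, int]:
    """Map a span on A into B's coordinates via A:0..length <-> B:length..0.

    The result keeps the orientation flip, so its start is greater than its end.
    """
    return length - start, length - end


seq_a = "AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA"
seq_b = reverse_complement(seq_a)

# The first four bases of A land on the last four positions of B.
b_start, b_end = map_span_to_reverse(0, 4, len(seq_a))
piece_on_b = seq_b[min(b_start, b_end):max(b_start, b_end)]
print(b_start, b_end, piece_on_b)
```

Because B is defined only by the symmetry, its residues never need to be stored; they can be derived from A on demand, as `reverse_complement` does here.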
![Page 23: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/23.jpg)
MultiSequence Alignments
• MultiSeqAlignment
– Alignments sliced “horizontally” -- each “row” in an alignment is a CompositeBioSeq whose composition maps another BioSeq to the same coord space as the alignment
• Can also slice vertically (synteny)
![Page 24: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/24.jpg)
Alignment Representations
• Can represent same alignment as either MultiSeqAlignment or Synteny
• Transformation from horizontal slicing (MultiSeqAlignment) to vertical slicing (Synteny)
![Page 25: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/25.jpg)
Complete Genometry Core Models
• Mutability
• Curations
![Page 26: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/26.jpg)
Genometry Manipulations
• Symmetry Intersection (AND)
• Symmetry Union (OR)
• Symmetry Inverse (NOT)
• Symmetry Mutual Exclusion (XOR)
• Symmetry Transformation / Mapping
![Page 27: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/27.jpg)
Symmetry Combination Operations
SymA SymB
XOR(A, B)
AND(A, B)
OR(A, B)
NOT(A)
NOT(B)
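The logical operations can be illustrated on plain interval lists. This Python sketch is a simplification: it projects each symmetry onto a single sequence and treats it as a list of half-open (start, end) intervals, whereas the slides apply the operations to full symmetries. The function names, the example intervals, and the sequence length are all assumptions.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def combine(a: List[Interval], b: List[Interval], length: int,
            keep: Callable[[bool, bool], bool]) -> List[Interval]:
    """Return merged intervals of 0..length where keep(in_a, in_b) holds."""
    # Every interval boundary splits the sequence into elementary pieces.
    points = sorted({0, length, *(p for iv in a + b for p in iv)})
    result: List[Interval] = []
    for left, right in zip(points, points[1:]):
        in_a = any(s <= left and right <= e for s, e in a)
        in_b = any(s <= left and right <= e for s, e in b)
        if keep(in_a, in_b):
            if result and result[-1][1] == left:
                result[-1] = (result[-1][0], right)  # extend the previous piece
            else:
                result.append((left, right))
    return result


def sym_and(a, b, length):
    return combine(a, b, length, lambda x, y: x and y)


def sym_or(a, b, length):
    return combine(a, b, length, lambda x, y: x or y)


def sym_xor(a, b, length):
    return combine(a, b, length, lambda x, y: x != y)


def sym_not(a, length):
    return combine(a, [], length, lambda x, _: not x)


sym_a = [(100, 300), (500, 700)]
sym_b = [(200, 600)]
seq_length = 1000

print(sym_and(sym_a, sym_b, seq_length))
print(sym_or(sym_a, sym_b, seq_length))
print(sym_xor(sym_a, sym_b, seq_length))
print(sym_not(sym_a, seq_length))
```

All four operations share one sweep over the boundary points and differ only in the predicate, which keeps the sketch short.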
![Page 28: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/28.jpg)
Genometry Transformations
• Every symmetry of breadth > 1 describes a mapping between different sequences
• Therefore every symmetry can be used to transform coordinates of other symmetries from one sequence to another
• Because sequence annotations, alignments, and composition are all based on symmetries, can use any of them as mappings
• Discontiguous linear mapping algorithm
• Results of transformation are also symmetries
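One way to picture a discontiguous linear mapping is as a list of aligned blocks, each a contiguous linear piece. The Python sketch below maps a transcript span onto the genome through the exon blocks from the annotation slides. It is an illustration under simplifying assumptions (same orientation, equal block lengths on both sides, half-open coordinates), not the algorithm used in Genometry, and `map_span` is a hypothetical name.

```python
from typing import List, Tuple

# One aligned block as (src_start, src_end, dst_start, dst_end).
Block = Tuple[int, int, int, int]


def map_span(start: int, end: int, blocks: List[Block]) -> List[Tuple[int, int]]:
    """Map start..end from source to destination coordinates.

    Each block the span overlaps contributes one destination piece,
    so the result can be discontiguous.
    """
    pieces = []
    for src_start, src_end, dst_start, _ in blocks:
        lo, hi = max(start, src_start), min(end, src_end)
        if lo >= hi:
            continue  # the span does not reach into this block
        offset = dst_start - src_start
        pieces.append((lo + offset, hi + offset))
    return pieces


# Transcript T -> genome G, one block per exon.
exons = [
    (0, 200, 1000, 1200),
    (200, 700, 3000, 3500),
    (700, 1200, 4500, 5000),
]

print(map_span(150, 400, exons))  # crosses the E1 / E2 boundary
```

The returned pieces have the same shape as the child spans of a symmetry, which matches the point that transformation results are themselves symmetries.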
![Page 29: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/29.jpg)
Coordinate Mapping
Example – mapping domain from protein coords to genomic coords
(note that the domain mapped to the spliced transcript overlaps only two of the three exons, so the resulting domain2genomic symmetry ends up with only two children)
AnnotatedBioSeqs (BioSeq): Annotated GenomicSeq G, Annotated mRNASeq M, Annotated ProteinSeq P
SeqSymmetries (SeqAnnot):
• domainOnProtein: proteinSpanA
• protein2mRNA: proteinSpanB, mrnaSpanB
• mRNA2genomic: genomicSpanC, mrnaSpanC
– m2gSub0: gSpanC0, mSpanC0
– m2gSub1: gSpanC1, mSpanC1
– m2gSub2: gSpanC2, mSpanC2
“Growing” domain2genomic result (MutableSeqSymmetry):
• Start: domain2genomic: proteinSpanA
• After transform via protein2mRNA: domain2genomic: proteinSpanA, mrnaSpanA
• After transform via mRNA2genomic: domain2genomic: proteinSpanA, mrnaSpanA, genomicSpanA
– d2gSub0: pSpanA0, mSpanA0, gSpanA0
– d2gSub1: pSpanA1, mSpanA1, gSpanA1
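The two-step transformation can be tried with made-up numbers. In this Python sketch every coordinate for the protein, the coding region, and the domain is hypothetical; only the exon blocks reuse values from the annotation slides. The mapping is a simplified per-block linear interpolation that assumes the same orientation on both sides, and it is not the Genometry implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]             # (start, end), end exclusive
Block = Tuple[int, int, int, int]  # (src_start, src_end, dst_start, dst_end)


def map_span(span: Span, blocks: List[Block]) -> List[Span]:
    """Map one span through a list of linear blocks, scaling where needed."""
    start, end = span
    pieces = []
    for src_start, src_end, dst_start, dst_end in blocks:
        lo, hi = max(start, src_start), min(end, src_end)
        if lo >= hi:
            continue
        # Scale is 3 for protein -> mRNA (one residue per codon), 1 for mRNA -> genome.
        scale = (dst_end - dst_start) / (src_end - src_start)
        pieces.append((dst_start + round((lo - src_start) * scale),
                       dst_start + round((hi - src_start) * scale)))
    return pieces


def map_spans(spans: List[Span], blocks: List[Block]) -> List[Span]:
    """Map several spans and collect all resulting pieces in order."""
    return [piece for span in spans for piece in map_span(span, blocks)]


# Hypothetical protein of 300 residues encoded by mRNA positions 100..1000.
protein2mrna = [(0, 300, 100, 1000)]
# mRNA -> genome, one block per exon.
mrna2genomic = [
    (0, 200, 1000, 1200),
    (200, 700, 3000, 3500),
    (700, 1200, 4500, 5000),
]

domain_on_protein = (10, 60)  # hypothetical domain location
domain_on_mrna = map_span(domain_on_protein, protein2mrna)
domain_on_genome = map_spans(domain_on_mrna, mrna2genomic)

print(domain_on_mrna)
print(domain_on_genome)
```

With these numbers the domain reaches into only the first two exons, so the genomic result has two pieces, mirroring the two children noted on the slide.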
![Page 30: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/30.jpg)
![Page 31: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/31.jpg)
![Page 32: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/32.jpg)
Step 1 Details of “split” mapping
Mapping symmetry:
• mRNA2genomic: genomicSpanC, mrnaSpanC
– m2gSub0: gSpanC0, mSpanC0
– m2gSub1: gSpanC1, mSpanC1
– m2gSub2: gSpanC2, mSpanC2
Steps:
• Step 1a “sit still”
• Step 1b “roll back”
• Step 1c “roll forward”
• Step 1 (loop 2) [a, b, c]
• Step 2 “roll up”
Growing domain2genomic result, in the order shown on the slide:
• domain2genomic: proteinSpanA, mrnaSpanA
• d2gSub0: mSpanA0
• d2gSub0: mSpanA0, pSpanA0
• d2gSub0: mSpanA0, pSpanA0, gSpanA0
• d2gSub1: mSpanA1, pSpanA1, gSpanA1
• domain2genomic: proteinSpanA, mrnaSpanA, genomicSpanA
– d2gSub0: pSpanA0, mSpanA0, gSpanA0
– d2gSub1: pSpanA1, mSpanA1, gSpanA1
![Page 33: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/33.jpg)
![Page 34: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/34.jpg)
Transformations Applications
• Mapping Affy probes to genome
• Mapping contig annotations to larger genomic assemblies
• Mapping protein annotations to genome
• Mapping genomic annotations to proteins and transcripts (SNPs, for example)
• Sequence slice-and-dice with annotation propagation
• Propagation of annotations across versioned sequences (such as Golden Path)
• Deep mappings (for example, SNP to genomeA to transcriptB to proteinC to homolog proteinD to transcriptE to genomeF to putative SNP location in genomeF – symmetry path of depth 5)
• Etc., etc.
![Page 35: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/35.jpg)
Prototypes & Applications
• GenometryTest
• Generic Genometry Viewer
• ProtAnnot (Ann)
• GPView (Cyrus)
• AlignView (Eric)
• ContigViewer (Peter, Barry)
• Unibrow (Transcriptome Group)
![Page 36: IGB genome genometry data models by Gregg Helt and Cyrus Harmon](https://reader033.fdocuments.in/reader033/viewer/2022051207/53fb759f8d7f729c2e8b5837/html5/thumbnails/36.jpg)
Genometry Summary
• Genometry presents a unified model for location-based sequence relationships
• Sequence annotation, composition, and alignment are all based on SeqSymmetry
• Provides powerful genometry manipulations -- any SeqSymmetry can be used to map other SeqSymmetries across sequences / coordinate spaces
• Work in progress