Genome Annotation · genome annotation. It uses: – Homology to existing, well-annotated genomes...


Genome annotation

● What is the function of each part of the genome?
● Where are the genes?
● What is the mRNA sequence (transcription, splicing)?
● What is the protein sequence?
● What does that protein do?
● rRNA, tRNA, miRNA, etc.


How do we know?

● Homology – similar sequence
● Expression data
● Prediction based on sequence itself:
– Open Reading Frames
– Prediction of tRNA, other well-understood structures
– Non-random evolution
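The "Open Reading Frames" idea above can be sketched in a few lines of Python. This is only an illustration, not how DOGMA or any particular tool does it: the function name `find_orfs`, the `min_codons` parameter, and the example sequence are all made up here. It scans only the three forward frames, so the reverse strand would need the reverse complement.

```python
# Sketch: scan the three forward reading frames for ATG ... stop stretches.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs: 0-based start, end exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # min_codons counts the start and stop codons as well
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"  # hypothetical toy sequence
print(find_orfs(example))
```

An ORF found this way is only a candidate gene; homology and expression data are still needed to decide whether it is real.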


Prediction based on homology


Expression data

● Align mRNA sequence to the genome
– Tells you where expressed sequences are
– Predict function based on homology or known domains


Structure predictions


Before using DOGMA

● Before you spend time annotating, make sure your genome is in good shape!
– No Ns or other ambiguous bases (R, Y, M, K, W, S, B, D, H, V)
– Align your raw reads to the assembled genome with bwa, and identify SNPs and indels with samtools
– Fix all of these SNPs and indels
– Align the raw reads again and make sure there are no more SNPs or indels
– Use samtools tview to look at every base in the genome
  ● Is there even coverage across the genome? Gaps in coverage = errors
  ● Are there any SNPs and/or indels left to fix?
● Once these errors and omissions are fixed, you are ready!
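The first check in the list (no Ns or other ambiguous bases) is easy to script. A minimal sketch, assuming the sequence has already been read into a string without the FASTA header or line breaks; the function name `ambiguous_bases` is made up for this example.

```python
from collections import Counter


def ambiguous_bases(sequence):
    """Count every character that is not an unambiguous A, C, G, or T."""
    counts = Counter(sequence.upper())
    return {base: n for base, n in counts.items() if base not in "ACGT"}


# Hypothetical toy sequence containing two Ns, one R, and one Y
print(ambiguous_bases("ATGNNCGTRYA"))
```

An empty result means the sequence contains only A, C, G, and T. It says nothing about SNPs, indels, or coverage, which still need the read alignment steps above.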


Using DOGMA

● DOGMA is a program specifically designed for plastid genome annotation. It uses:
– Homology to existing, well-annotated genomes
– Predictions of tRNA structure
– ORF prediction based on start and stop codons
● This is a powerful but buggy program.
– DO NOT ever click refresh or back, as that often leads to unfixable errors. If you do accidentally hit the back button, just close the window rather than hitting forward again.
– If you start seeing weird bugs, it is probably because you hit the back or refresh button at some point. You may have to restart the whole annotation, or just deal with the weird bugs.


DOGMA step-by-step

● Go to http://dogma.ccbb.utexas.edu/


Starting DOGMA

● Choose a user ID
● Enter your chloroplast name
● Choose chloroplast
● Choose your file
– The FASTA file with your final, corrected genome sequence as one single sequence
– A FASTA file has a header line, then lines of sequence:

>Sequence_name
ATGAATTAATATATAATAATATACGTTTGTGTTAT

● Click submit – wait a minute for your annotation to run
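Before uploading, it can help to confirm that the file really is one single FASTA record. A small sketch under that assumption; `read_single_fasta` is a hypothetical helper written for this example, not part of DOGMA.

```python
def read_single_fasta(text):
    """Return (name, sequence) from FASTA text, insisting on exactly one record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1 or not lines[0].startswith(">"):
        raise ValueError(f"expected exactly one FASTA record, found {len(headers)} header(s)")
    name = lines[0][1:]
    sequence = "".join(lines[1:]).upper()  # join the wrapped sequence lines
    return name, sequence


fasta_text = """>Sequence_name
ATGAATTAATATATAATAATATACGTTTGTGTTAT
"""
name, sequence = read_single_fasta(fasta_text)
print(name, len(sequence))
```

A genome still split into several contigs would raise the `ValueError` here, which is a sign the assembly is not yet ready for annotation.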


You now have a draft annotation


Finalizing your annotation

● Each gene has to be confirmed to be correct. You just have to go through one by one and do it – it takes time!

● Just click on a gene (e.g. trnH) and more information will be given about that gene


trnH

● trnH looks pretty good – a tRNA:
– Same sequence as the wheat, rice, tobacco, and corn trnH
– Complete
– Structure looks good (click on tRNA figure button)


● The structure looks reasonable and there are no errors listed


● Since everything looks good, click “Commit”

● Now that gene has black bars on the top and bottom, marking it as done

● Let's move on to the next gene


rps19

● This is a protein-coding gene.

● The amino acid sequence and length look similar to other sequences listed


RNA editing

● Protein-coding genes should start with a 'start codon', and an “M” amino acid (methionine)
● This one starts with a V, and so does Oryza!
● The standard start codon is ATG, not GTG
● G --> A edits (actually, C to U in the mRNA encoded by this sequence) are very common.
● GTG sequences are sometimes edited, so they can be starts!


The Editosome


Confirming this

● Sequencing the mRNA vs pre-mRNA (ideal)
● We will just compare to other well-annotated genomes – RNA editing is often conserved
● Let's look at Oryza sativa


Oryza sativa rps19

● It is a ribosomal protein – used for making proteins from mRNA, as part of the ribosome
● It doesn't start with a 'V' like DOGMA says!
● Let's look at the coding DNA sequence – click on 'CDS'


Confirming this

● GTG does not encode M, it encodes V!
● The O. sativa plastid genome on GenBank is annotated in a way that can only be explained by RNA editing
● Sorghum is closely related, so may be the same


rps19 is known to have an altered start

● Well-characterized “new start codon” by RNA editing

Wolf et al. 2004


Finding the stop codon


Continue on with more genes


rps16


BLASTn against that gene as annotated in a closely related genome - important tool!
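If BLASTn is run on the command line with tabular output (`-outfmt 6`) rather than through the web page, the hit coordinates can be pulled out with a short script. This is a sketch only: it assumes the reference gene is the query and your genome is the subject, and the two hit rows below are invented for illustration, not real results. The function name `subject_intervals` is also made up.

```python
def subject_intervals(tabular_text):
    """Parse BLAST tabular (outfmt 6) rows into sorted (start, end) subject intervals."""
    intervals = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        fields = line.split("\t")
        sstart, send = int(fields[8]), int(fields[9])
        # Minus-strand hits are reported with sstart > send, so normalise the order
        intervals.append((min(sstart, send), max(sstart, send)))
    return sorted(intervals)


# Hypothetical hits: a two-exon reference gene matching two separate blocks
hits = "\n".join([
    "ref_rps16\tmy_genome\t98.5\t40\t1\t0\t1\t40\t5200\t5161\t1e-15\t70.1",
    "ref_rps16\tmy_genome\t97.0\t200\t6\t0\t41\t240\t4300\t4101\t1e-90\t340",
])
print(subject_intervals(hits))
```

Two well-separated intervals for one gene suggest two exons with an intron between them; the exact exon boundaries still need to be checked by eye.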


Add in a second exon, and join the two exons

Notice a very clear bug: the way the exons are drawn is not correct, at least in my browser! However, the product should be correct.


Check that the products look correct for our first few proteins with “Extract Sequences”
● They should start with M and end with *, except in cases of RNA editing, which we deal with later
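The "starts with M, ends with *" check can also be done outside DOGMA on an extracted coding sequence. A minimal sketch, assuming the standard codon table and a CDS string with the exons already joined; `translate` and `check_product` are names invented for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence; codons with ambiguous bases become X."""
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))


def check_product(cds):
    """Return the protein and a list of things that look wrong with it."""
    protein = translate(cds)
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not protein.startswith("M"):
        problems.append(f"starts with {protein[:1] or '?'} instead of M")
    if not protein.endswith("*"):
        problems.append("no stop codon at the end")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    return protein, problems


print(check_product("ATGAAATAA"))  # toy CDS with a normal start
print(check_product("GTGAAATAA"))  # toy CDS with a GTG start, as seen for rps19
```

A flagged start is not automatically an error: as with rps19, it is a prompt to compare against a well-annotated relative before changing anything.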


Functional annotation

● How do we know what these genes are doing?
– Homology
● Okay, how does that help us? How do we actually categorize genes into meaningful functions?
– GO = Gene Ontology


Cool... how does it work?

● Yet another great idea, not yet easy to use
– Multiple, different organizations curating ontologies
  ● e.g. Plant Ontology, Crop Ontology, GRAMENE, and many species-specific databases (e.g. TAIR)
– Multiple levels of ontology – how to compare?
● Fortunately, it is easier for chloroplasts!
– Small genomes (few genes to deal with)
– Less gene redundancy / duplication
– Easier to identify orthology / homology
– Function is better understood
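Because chloroplast gene names follow consistent prefixes, a first rough grouping by function can be made from the names alone. The sketch below is a coarse illustration, not Gene Ontology itself: it assigns no GO terms or IDs, the prefix table is deliberately incomplete, and the names `PREFIX_GROUPS` and `coarse_group` are invented for this example.

```python
# Coarse functional groups keyed by the usual three-letter gene-name prefix
PREFIX_GROUPS = {
    "psa": "photosystem I",
    "psb": "photosystem II",
    "pet": "cytochrome b6f complex",
    "atp": "ATP synthase",
    "ndh": "NADH dehydrogenase",
    "rps": "small ribosomal subunit protein",
    "rpl": "large ribosomal subunit protein",
    "rpo": "RNA polymerase",
    "trn": "transfer RNA",
    "rrn": "ribosomal RNA",
}


def coarse_group(gene_name):
    """Look up a gene's group from its prefix; anything unknown is 'unclassified'."""
    return PREFIX_GROUPS.get(gene_name[:3].lower(), "unclassified")


for gene in ["trnH", "rps19", "rps16", "ycf1"]:
    print(gene, "->", coarse_group(gene))
```

Genes that fall through as "unclassified" are the ones where homology searches and curated ontologies are actually needed.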