23-4-2015 Dr. Amira A. AL-Hosary 1 - Assiut University Bioinformatics22.23-… · An overview on...
23-4-2015 Dr. Amira A. AL-Hosary 1
Interpretation of sequence results
Amira A. AL-Hosary
PhD of infectious diseases
Department of Animal Medicine
(Infectious Diseases)
Faculty of Veterinary Medicine
Assiut University
Egypt
An overview on DNA sequencing:
• DNA sequencing involves determining the linear nucleotide order of a segment of DNA.
• There are several methods of sequencing, but most are based on the Sanger method.
• This is an enzymatic method that synthesizes DNA in vitro.
• It uses a modified PCR reaction in which both normal deoxynucleotides and labeled dideoxynucleotides are included in the reaction mix. Each dideoxynucleotide is labeled with a fluorescent dye (each base has a different color).
• Template is single-stranded DNA that you want to sequence.
• Primer is a short fragment of DNA that binds to one end of the template DNA.
• Deoxynucleotides (dNTPs) extend the primer, forming a DNA chain. All four nucleotides (A,T,G,C in deoxynucleotide form) are added to the sequencing reaction.
• Dideoxynucleotides (ddNTPs) are another form of nucleotide that inhibits extension of the primer. Once a ddNTP has been incorporated into the DNA chain, no further nucleotides can be added.
• DNA polymerase incorporates the nucleotides and dideoxynucleotides into the growing DNA chain.
• Buffer is a solution that stabilizes the reagents and products in the sequencing reaction.
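The interplay of the reagents listed above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the primer is ignored, exactly one terminated fragment is produced per position, and the function names and the example template are hypothetical. It shows why sorting the terminated fragments by length and reading the dye on each terminal ddNTP recovers the sequence of the newly synthesized strand.

```python
# Toy model of Sanger chain termination; an illustration, not real chemistry.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template):
    """Return the strand the polymerase builds, written 5'->3'.

    It is the reverse complement of the template (also written 5'->3').
    """
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


def terminated_fragments(template):
    """List one chain-terminated product per position as (length, labeled base)."""
    new_strand = synthesized_strand(template)
    return [(length, base) for length, base in enumerate(new_strand, start=1)]


def read_sequence(fragments):
    """Order fragments by size, as electrophoresis does, and read the terminal dyes."""
    return "".join(base for _, base in sorted(fragments))


template = "GATTACA"  # hypothetical template, written 5'->3'
print(read_sequence(terminated_fragments(template)))  # TGTAATC
```

The read is the reverse complement of the template, which is why a sequence obtained with a reverse primer has to be reverse-complemented before it is compared with the forward strand.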
At the end of the sequencing reaction, the products are separated by size on a polyacrylamide gel (either a large, thin slab gel or a narrow capillary tube filled with gel solution) that is scanned with a laser detection device.
As each band moves past the detector, the laser excites the dye, and the color of the fluorescence is read by a photocell and recorded on a computer.
Manual reading Vs. Automated reading of the Sequencing results:
In manual reading, the products of the sequencing reactions are loaded in four parallel lanes on a gel and the bands are read by eye.
In automated reading, a computer collects and analyzes the data and reads the sequence of the DNA. Automated sequencing is therefore much faster and more efficient than manual sequencing.
Sequencer | Ion Torrent PGM | 454 GS FLX | HiSeq 2000 | SOLiDv4 | PacBio | Sanger 3730xl
Manufacturer | Ion Torrent (Life Technologies) | 454 Life Sciences (Roche) | Illumina | Applied Biosystems (Life Technologies) | Pacific Biosciences | Applied Biosystems (Life Technologies)
Amplification approach | Emulsion PCR | Emulsion PCR | Bridge amplification | Emulsion PCR | Single-molecule; no amplification | PCR
Data output per run | 100-200 Mb | 0.7 Gb | 600 Gb | 120 Gb | 100-700 Mb | 1.9∼84 Kb
Accuracy | 99% | 99.9% | 99.9% | 99.94% | 88.0% (>99.9% CCS) | 99.999%
Time per run | 2 hours | 24 hours | 3–10 days | 7–14 days | 2-3 hours | 20 minutes - 3 hours
Read length | 200-400 bp | 700 bp | 100x100 bp paired end | 50x50 bp paired end | 5,500-10,000 bp | 400-900 bp
Cost per run | $350 USD | $7,000 USD | $6,000 USD (30x human genome) | $4,000 USD | $125-300 USD | $4 USD (single read/reaction)
Cost per Mb | $1.00 USD | $10 USD | $0.07 USD | $0.13 USD | $0.20 - $3.00 USD | $2400 USD
Cost per instrument | $80,000 USD | $500,000 USD | $690,000 USD | $495,000 USD | $695,000 USD | $95,000 USD
1- The Band:
Interpreting Sequencing Results
Automated DNA sequencers generate:
1- A four-color chromatogram showing the results of the sequencing run.
2- A text file of sequence data.
• When you obtain a sequence you should proofread it to ensure that all ambiguous sites are correctly called and determine the overall quality of your data.
• Base Designations
• “A” designation: green peaks
• “G” designation: black peaks
• “T” designation: red peaks
• “C” designation: blue peaks
• “N” designation: peaks that, for whatever reason, are not clear enough to designate as A, G, T, or C.
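As a first proofreading step, the text file of base calls can be checked for ambiguous positions. The sketch below is a minimal illustration; the function names and the example read are hypothetical, and the color map simply restates the designations listed above.

```python
from collections import Counter

# Trace colors as listed in the base designations above.
TRACE_COLORS = {"A": "green", "G": "black", "T": "red", "C": "blue"}


def ambiguous_positions(sequence):
    """Return 1-based positions whose call is not a clear A, G, T or C."""
    return [pos for pos, base in enumerate(sequence.upper(), start=1)
            if base not in TRACE_COLORS]


def n_fraction(sequence):
    """Fraction of the read that was called as 'N'."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return counts.get("N", 0) / total if total else 0.0


read = "ACGTNNACGT"  # hypothetical base calls
print(ambiguous_positions(read))  # [5, 6]
print(n_fraction(read))           # 0.2
```

Each reported position is a place to go back to the chromatogram and inspect the peaks by eye.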
Interpreting Sequencing Chromatograms
Good sequence generally begins roughly around base 20.
Beginning of Sequence
End of sequence
With a little practice, you can scan a chromatogram in less than a minute and spot problems.
It is not necessary to read each and every base.
An example of excellent sequence. Note the evenly-spaced peaks and the lack of baseline 'noise'
Background noise
This example has a little baseline noise, but the 'real' peaks are still easy to call, so there's no problem with this sample.
Noise like this most commonly arises when the sample signal itself is too weak, when the sample is contaminated with salts, or when primer binding is inefficient.
Types of Polymorphisms
1- Transitions: A ↔ G or C ↔ T (purines to purines OR pyrimidines to pyrimidines)
2- Insertions: an extra base is present when compared to the Anderson reference sequence.
3- Deletions: a base is missing when compared to the Anderson reference sequence.
4- Mis-Called (a) Irregular spacing:
A common one for us is the G-A dinucleotide, which leaves a little extra space between the two peaks.
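The transition, insertion and deletion classes defined on this slide can be assigned mechanically once a read has been aligned to the reference. The sketch below assumes the two sequences are already aligned to equal length with "-" marking gaps; the function names and example strings are hypothetical. Changes between a purine and a pyrimidine are labeled as transversions, a class the slide does not list.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_site(ref_base, sample_base):
    """Classify one aligned column of reference versus sample."""
    ref_base, sample_base = ref_base.upper(), sample_base.upper()
    if ref_base == sample_base:
        return "match"
    if ref_base == "-":
        return "insertion"   # extra base in the sample
    if sample_base == "-":
        return "deletion"    # base missing from the sample
    pair = {ref_base, sample_base}
    if not pair <= (PURINES | PYRIMIDINES):
        return "ambiguous"   # e.g. an 'N' call
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def list_differences(aligned_ref, aligned_sample):
    """Return (column, ref_base, sample_base, kind) for every non-matching column."""
    if len(aligned_ref) != len(aligned_sample):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    for column, (ref, sample) in enumerate(zip(aligned_ref, aligned_sample), start=1):
        kind = classify_site(ref, sample)
        if kind != "match":
            differences.append((column, ref, sample, kind))
    return differences


for difference in list_differences("ACG-TA", "GCGCT-"):
    print(difference)
```

Column numbers refer to the alignment, not to reference coordinates, so they shift after any insertion.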
4- Mis-Called (b) Mis-call a nucleotide:
Sometimes the computer will mis-call a nucleotide when a human could do better.
Most often, this occurs when the base caller calls a specific nucleotide although the peak was really ambiguous and should have been called 'N'.
4- Mis-Called (b)
The real problem comes when the base caller attempts to interpret a gap as a real nucleotide. Note the real T peak (nt 58) and the real C peak (nt 60), with the G barely visible between them. Despite its size, the baseline-noise G peak was picked as if it were real. The clues to spot are (i) the oddly-spaced letters, with the G squeezed in, and (ii) the gap in the 'real' peaks, containing a low noise peak. This is a great example of why a weak sample, with its consequent noisy chromatogram, is untrustworthy.
5- Heterozygous (double) peaks:
A single peak position within a trace may show two peaks of different colors instead of just one. This is common when sequencing a PCR product derived from diploid genomic DNA, where polymorphic positions will show both nucleotides simultaneously. Note that the base caller may list that base position as an 'N', or it may simply call the larger of the two peaks.
Here's a great example of a PCR amplicon from genomic DNA, with a clear heterozygous single-nucleotide polymorphism (SNP).
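A double peak can be flagged from the peak heights of the four dye channels at one position. The sketch below is only an illustration: the function name, the input format (a dict of peak heights) and the 0.3 secondary-to-primary ratio are assumptions, not settings from any particular base caller. Heterozygous positions are reported with the standard IUPAC two-base ambiguity codes instead of 'N'.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_position(peak_heights, min_ratio=0.3):
    """Call one trace position from a dict such as {"A": 900, "C": 20, "G": 850, "T": 15}.

    min_ratio is a hypothetical cutoff: the second-highest peak must reach this
    fraction of the highest peak for the position to be called heterozygous.
    """
    ranked = sorted(peak_heights.items(), key=lambda item: item[1], reverse=True)
    (top_base, top_height), (second_base, second_height) = ranked[0], ranked[1]
    if top_height <= 0:
        return "N"  # no signal at all
    if second_height / top_height >= min_ratio:
        return IUPAC_TWO_BASE[frozenset((top_base, second_base))]
    return top_base


print(call_position({"A": 900, "C": 20, "G": 850, "T": 15}))   # R (A and G)
print(call_position({"A": 1000, "C": 50, "G": 30, "T": 10}))   # A
```

A fixed ratio is a crude criterion; in a noisy trace it will also flag baseline noise, so flagged positions still need to be checked against the chromatogram.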
6- Loss of resolution later in the gel:
As the gel progresses, it loses resolution. This is normal; peaks broaden and shift, making it harder to make them out and call the bases accurately. The sequencer will continue attempting to "read" this data, but errors become more and more frequent.
This is a typical example of data from a very good sample: the spacing between the basecall letters at the top is regular, which is often a good indication of the reliability of the data.
There are only a few base calls that can be considered reliable. The G at 981 may in fact be two G's, the N could be a G or an A, and who knows how many A's there are afterwards.
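Because the start and the end of a read are the least reliable parts, reads are usually trimmed before further analysis. The sketch below is a simple illustration that uses only the base calls: it skips the first bases (the slides suggest good sequence begins around base 20) and cuts the read where a sliding window first contains too many 'N' calls. The function name, window size and N limit are hypothetical; real pipelines normally trim on per-base quality scores instead.

```python
def trim_read(sequence, skip_start=20, window=10, max_n=2):
    """Drop the first skip_start bases and cut at the first N-rich window.

    window and max_n are illustrative values, not recommended settings.
    """
    seq = sequence.upper()
    end = len(seq)
    last_window_start = len(seq) - window + 1
    for start in range(skip_start, max(skip_start, last_window_start)):
        if seq[start:start + window].count("N") > max_n:
            end = start
            break
    return seq[skip_start:end]


# Hypothetical read: noisy start, clean middle, degraded end.
read = "NNANN" + "ACGTACGTAC" + "GNNNANNNTN"
print(trim_read(read, skip_start=5, window=5, max_n=2))  # ACGTACGTA
```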
7- Non-discrete peaks:
These may occur when several of the same nucleotide appear in a row.
For example, if the sequence includes the region TAAAAAT, the five A's may be represented by one wavy peak as opposed to 5 distinct peaks.
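Runs of a single nucleotide are easy to locate in the called sequence, which tells you where to inspect the chromatogram for merged peaks. This is a minimal sketch; the function name and the default minimum run length of 4 are assumptions.

```python
import re


def homopolymer_runs(sequence, min_length=4):
    """Return (1-based start, base, run length) for each run of one nucleotide."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(match.start() + 1, match.group(1), len(match.group(0)))
            for match in pattern.finditer(sequence.upper())]


print(homopolymer_runs("TAAAAAT"))  # [(2, 'A', 5)]
```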
8- Good sequence with bad base calling:
The analysis failed; ask the sequencing service to reanalyze the sequence.
9- Abrupt Truncation: DNA template has a secondary structure:
Secondary structures create a distortion that makes it impossible for elongation to continue and so the sequence ends abruptly.
The sequence ends after approximately 200 bp.
10- Gradual truncation: Due to too much DNA
• Quantitate your template DNA carefully, and use the concentrations recommended for your application.
11- Repetitive regions:
• The nucleotide composition, as well as the size, of a repetitive region can play a large role in the success of sequencing through such an area.
• In general, G-C and G-T (often seen in bisulfite-treated DNA) repeats tend to be the most troublesome, though the newest version of Applied Biosystems BigDye Terminator v3.1 contains some modifications that have allowed for some striking improvements in certain previously difficult templates.
Methylation-specific PCR (MSP)
MSP used in quantitative PCR provides quantitative information about the methylation state of a given CpG island.
12- Negative samples / No DNA:
Chromatograms displaying peaks from which no usable sequence can be obtained may be due to an absence of DNA. These chromatograms generally have one or two predominant colors.
13- DNA contamination:
14- Excess dye peaks at the beginning of the sequence
Cause related to sequencing: poor removal of unincorporated dye terminators during the post-sequencing cleanup.
15- Sharp peaks / spikes in the sequence
They are caused by tiny air bubbles within the liquid polymer or by small pieces of dried polymer that have flaked off and entered a capillary.
16- Dye blobs:
Dye blobs are unincorporated dye terminator molecules that have passed through the cleanup columns and remain in solution with the purified DNA loaded into the sequencers. They are most often seen with samples that have low signal strength.
17- Reaction failed, No sequencing data
Realize, too, that it's easy for a human to miss these. If you want to be sure you've detected all of the polymorphic positions, you should be using a computer program to scan your chromatograms.
Interpreting Sequencing Results
Determining homology:
In other words, is your sequence similar to any other published sequences and if so, to what degree?
This can be accomplished using BLAST (Basic Local Alignment Search Tool), a program supported by the National Center for Biotechnology Information (NCBI).
The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. It is accessible at: http://www.ncbi.nlm.nih.gov/BLAST/ (GenBank database; National Center for Biotechnology Information, National Institutes of Health).
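The question "to what degree?" is usually answered with a percent identity. The sketch below is not BLAST and does not search a database; it only shows, under the assumption that two sequences have already been aligned to equal length with "-" for gaps, how percent identity over the aligned columns can be computed. The function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns in which both sequences carry the same base.

    Columns where both sequences have a gap are ignored; a gap opposite a base
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGTACGT", "ACGTTCGT"))  # 87.5
print(percent_identity("ACG-T", "ACGAT"))        # 80.0
```

BLAST additionally reports a score and an E-value for each hit, which describe how likely a match of that quality is to occur by chance; identity alone does not capture that.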
BLAST: Basic Local Alignment Search Tool (blast.ncbi.nlm.nih.gov/)
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. No BLAST database contains all the sequences at NCBI.
Click the “Blast!” button at the bottom to submit your sequence data.
This screen will come up next. Finally (sometimes after a lengthy wait), a new window will appear showing any “hits” your sequence made. The results will be color coded and annotated.
The bars show which regions of your sequence are similar to other published sequences; the colors indicate the alignment score of each hit.
Clicking on a “gi” link at the beginning of any line will take you to the GenBank accession page for a sequence showing similarity to yours. There you can find a wealth of information about the published sequence to which yours showed some homology.
INTERPRETATION OF SEQUENCES CODING FOR PROTEIN
Translation and Open Reading Frame Search
Regions of DNA that encode proteins are first transcribed into messenger RNA and then translated into protein.
By examining the DNA sequence alone, we can determine the sequence of amino acids that will appear in the final protein.
In translation, codons of three nucleotides determine which amino acid will be added next to the growing protein chain.
It is therefore important to decide at which nucleotide translation starts and where it stops; the stretch between these two points is called an open reading frame.
Once a gene has been sequenced it is important to determine the correct open reading frame (ORF).
Every region of DNA has six possible reading frames, three in each direction.
The reading frame that is used determines which amino acids will be encoded by a gene.
Typically only one reading frame is used in translating a gene and this is often the longest open reading frame.
Once the open reading frame is known the DNA sequence can be translated into its corresponding amino acid sequence. An open reading frame starts with an ATG (Met) in most species and ends with a stop codon (TAA, TAG or TGA).
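The search for the longest open reading frame across all six frames can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, treats only ATG as a start codon, ignores ORFs that lack a stop codon, and the function names are hypothetical. It is not the algorithm of any particular ORF-finding tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna):
    """Return (strand, frame, protein) for every ATG...stop stretch in six frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    protein = "".join(CODON_TABLE.get(seq[j:j + 3], "X")
                                      for j in range(start, i, 3))
                    orfs.append((strand, frame + 1, protein))
                    start = None
    return orfs


def longest_orf(dna):
    """Return the ORF with the longest protein, or None if there is none."""
    return max(find_orfs(dna), key=lambda orf: len(orf[2]), default=None)


example = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
print(longest_orf(example))  # ('+', 1, 'MPKLNSVEGFSSFEDDV')
```

Choosing the longest ORF is a heuristic, as the text notes with "often"; a short real gene can be overlooked and a long spurious frame can be picked.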
For example, the following sequence of DNA can be read in six reading frames: three in the forward and three in the reverse direction.
The three reading frames in the forward direction are shown with the translated amino acids below each DNA sequence.
Frame 1 starts with the "a", Frame 2 with the "t" and Frame 3 with the "g". Stop codons are indicated by an "*" in the protein sequence.
5' atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa 3'

1  atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
    M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *
2  tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
    C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
3  gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata
    A   Q   A   E   *   R   R   G   V   F   I   I   *   G   R   C   I
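The three forward translations above can be reproduced with a few lines of code. This is a minimal sketch that assumes the standard genetic code; the function name is hypothetical, and incomplete codons at the end of a frame are simply dropped.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_frame(dna, frame):
    """Translate a DNA string in forward frame 1, 2 or 3; '*' marks a stop codon."""
    seq = dna.upper()[frame - 1:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


example = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
for frame in (1, 2, 3):
    print(frame, translate_frame(example, frame))
# 1 MPKLNSVEGFSSFEDDV*
# 2 CPS*IA*RGFHHLRTMY
# 3 AQAE*RRGVFII*GRCI
```

Only frame 1 runs from a start codon to the end without an internal stop, which is why it is the candidate open reading frame for this sequence.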
Translation: Each sequence must be translated into its amino acid (aa) sequence using the ExPASy Translate tool.
The bars show which regions of your amino acid (aa) sequence are similar to other published sequences; the colors indicate the alignment score of each hit.
Thanks a lot with my Best Regards and My Best wishes
Amira A. AL-Hosary E-mail: Amiraelhosary @yahoo.com
Mob. (002) 01004477501