Introduction to CLC Main Workbench ILRI Training ...
Joyce Njoki Nzioki, BecA-ILRI Hub, Nairobi, Kenya http://hub.africabiosciences.org/ http://www.Ilri.org/
Introduction to CLC Main Workbench ILRI Training / Ethiopia Training
27 August 2015
Getting started with CLC
CLC Main Workbench is a software package that supports analysis of sequence data. Functions include:
• Sequence assembly
• Primer design
• Alignment and Phylogeny
• Blast / Database searches
• Additional plugins
Getting around in CLC
• CLC has a main menu with the features available as shown above
• The File menu has options to manipulate data
• The most useful menu is the TOOLBOX, which has various analysis options to manipulate data
Sequenced Data
• You can view your sequence data by opening the sequence files (trace files), which have the extension .ab1 / .abi
• NOTE: In order to obtain good sequencing results, you MUST download and examine your sequencing chromatogram. If you are using just the text data, you could be publishing data that is completely invalid!
• Software used for viewing includes: CLC bio, BioEdit, TracerView
Manipulating Data in CLC
Creating folders
• It is best to organize data in the navigation area in folders.
• To create a folder go to File | New | Folder
• Or click on the new folder icon on the tool bar
• Name the folder
Manipulating Data in CLC
Importing Data
• Importing allows you to bring sequenced data into CLC from where it is stored on your computer.
• Go to File | Import or click the import icon on the tool bar.
• Navigate to where your sequences are stored on your computer
• Select the file format to import; for sequenced data, select Trace files (.abi/.ab1/.scf/.phd)
• Select the folder to save the sequences to
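Before importing, it can help to check which files in a folder are trace files at all. The sketch below is not part of CLC; it only uses the extensions listed in the import step, and the function names are hypothetical.

```python
from pathlib import Path

# Extensions taken from the "Trace files" import option listed above.
TRACE_EXTENSIONS = {".abi", ".ab1", ".scf", ".phd"}


def is_trace_file(name):
    """Return True if the file name ends in one of the trace-file extensions."""
    return Path(name).suffix.lower() in TRACE_EXTENSIONS


def find_trace_files(folder):
    """Return the names of trace files directly inside `folder`, sorted."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and is_trace_file(p.name)
    )
```

For example, `is_trace_file("sample_F.ab1")` is true and `is_trace_file("notes.txt")` is false.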
Troubleshooting sequenced data “the good”
• Good quality peaks are smooth, distinct or well formed, evenly spaced and with little baseline noise
Troubleshooting sequenced data “the bad”
• A failed sequencing reaction: the chromatograms look messy, with many ‘N’s in the sequence.
• Non-usable sequenced data: can be due to a low concentration of DNA template, or to no primer or the wrong primer being added.
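A quick way to flag reads with many ‘N’s is to compute the fraction of N calls in the text sequence. This is only a rough screen and does not replace looking at the chromatogram. The 5% cut-off is an assumed example value, not a recommendation from this training.

```python
def n_fraction(sequence):
    """Fraction of base calls that are 'N' (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def looks_failed(sequence, max_n_fraction=0.05):
    """Flag a read whose N fraction exceeds the (assumed) cut-off."""
    return n_fraction(sequence) > max_n_fraction
```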
Troubleshooting sequenced data “double peaks”
• Double peaks: multiple peaks of the same or different height at the same position. This is due to clone contamination, a heterozygous position (SNP), or a contaminated PCR reaction.
• Can be corrected using degenerate codes: N (A, C, G or T), Y (C or T), R (A or G)
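The degenerate codes follow the standard IUPAC nucleotide ambiguity table. As a small illustration, the lookup below maps the set of bases seen at a double peak to its code. The function name is hypothetical; the table itself is the standard one.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguity_code(bases):
    """Return the IUPAC code covering all of the given bases."""
    key = frozenset(b.upper() for b in bases)
    if key not in IUPAC_CODES:
        raise ValueError("not a set of A/C/G/T bases: %r" % (bases,))
    return IUPAC_CODES[key]
```

A double peak of C and T would be recorded as `ambiguity_code("CT")`, which gives Y.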
Troubleshooting sequenced data “stuttering”
• Sequence data quality is poor after stretches of 7 or more nucleotides of the same base. This is due to polymerase slippage during DNA synthesis; it is a limitation of Sanger sequencing.
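To know where stuttering is likely, you can scan a sequence for runs of 7 or more identical bases. This is a minimal sketch with a hypothetical function name; it only locates the runs and says nothing about the actual read quality after them.

```python
import re


def homopolymer_runs(sequence, min_length=7):
    """Return (start, length, base) for each run of one base >= min_length.

    `start` is a 0-based position in the sequence.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One base followed by at least (min_length - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.end() - m.start(), m.group(1))
        for m in pattern.finditer(sequence.upper())
    ]
```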
Troubleshooting sequenced data “drop off”
• The DNA sequence suddenly stops or the peak intensity drops off substantially. This is caused by secondary structures such as hairpin loops, or by GC/GT-rich regions.
Troubleshooting sequenced data “mis-called bases”
• Nucleotides that have been erroneously inserted into a sequence will appear oddly spaced relative to their neighboring bases
Trim 3’ and 5’ ends
• At the 5’ end, sequences do not start off clearly until about base 20-30. This is due to not fully activated Taq polymerase or poor termination near the primer.
• At the 3’ end, towards bases 500-800, the quality degrades as well, due to diminishing bases.
Trimming sequences
• After carefully scrutinizing your sequence you can determine where your reliable sequence starts and ends.
• You can delete or trim the unreliable sequence from each end of your sequence file.
• As the gel run progresses it loses resolution and the reads become more error-prone. Trim sequences when the errors become too frequent for your purpose.
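Finding where the reliable sequence starts and ends can be sketched as a simple quality scan. The example assumes you have one Phred-like quality score per base (not stated in these slides), and the threshold and window size are hypothetical values. It is not the trimming algorithm CLC uses.

```python
def trim_bounds(qualities, min_quality=20, window=5):
    """Return (start, end) of the region to keep, with `end` exclusive.

    The kept region starts at the first window of `window` consecutive
    bases that all reach `min_quality`, and ends at the last such window.
    Returns (0, 0) if no window passes.
    """
    n = len(qualities)
    good = [q >= min_quality for q in qualities]

    # Scan from the 5' end for the first fully good window.
    start = None
    for i in range(n - window + 1):
        if all(good[i:i + window]):
            start = i
            break
    if start is None:
        return (0, 0)

    # Scan from the 3' end for the last fully good window.
    end = start + window
    for j in range(n, window - 1, -1):
        if all(good[j - window:j]):
            end = j
            break
    return (start, end)
```

With the returned bounds, `sequence[start:end]` is the part of the read you would keep.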
Quality Control using CLC
• The first step in sequence analysis is to check the quality of reads and trim sequences where need be to eliminate poor quality or vector contamination.
• When the trimming is done, the parts of the sequences that are trimmed are not actually removed; instead, trim annotations are saved to the sequences. These annotated sections are ignored in further analysis.
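The idea of keeping the full read and only annotating the trimmed parts can be pictured with a small data structure. This is an illustration of the concept only; the class and method names are hypothetical and say nothing about how CLC stores annotations internally.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedRead:
    name: str
    sequence: str
    # Each trim is a (start, end) pair, 0-based, with `end` exclusive.
    trims: list = field(default_factory=list)

    def add_trim(self, start, end):
        """Annotate a region as trimmed without changing the sequence."""
        if not 0 <= start <= end <= len(self.sequence):
            raise ValueError("trim region outside the sequence")
        self.trims.append((start, end))

    def usable_sequence(self):
        """Return the sequence with all annotated regions left out."""
        masked = set()
        for start, end in self.trims:
            masked.update(range(start, end))
        return "".join(
            base for i, base in enumerate(self.sequence) if i not in masked
        )
```

The original `sequence` stays intact, so a trim can be reviewed or removed later without re-importing the trace.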
Assemble sequence
Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct that longer sequence.
I. Reference assembly – reference guided assembly.
II. De novo assembly – assembling without the aid of a reference genome.
De novo assembly
• In most cases forward and reverse primers are used, so you obtain both a forward and a reverse sequence.
• Assembling aligns the two sequences at the point where they overlap to give a contiguous sequence called a contig.
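The overlap step can be sketched as follows: reverse-complement the reverse read so both reads are on the same strand, then join them at the longest exact overlap. This is a simplification with hypothetical function names; real assemblers, CLC included, use alignment that tolerates mismatches and gaps.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_reads(forward, reverse, min_overlap=10):
    """Merge a forward and a reverse read into a contig.

    Looks for the longest suffix of `forward` that exactly matches a
    prefix of the reverse-complemented `reverse` read. Returns None if
    no overlap of at least `min_overlap` bases is found.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    rc = reverse_complement(reverse)
    for k in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None
```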
Conflicts
The example shows a conflict in which the forward strand (F) shows the base call “A” and the reverse strand (R) shows a gap.
Resolving conflicts
• We assess the quality of the reads at this position. The reverse sequence (R) has a low-quality chromatogram (this is often the case towards the ends of a sequence). However, the forward strand (F) clearly has good quality peaks and can be trusted.
Resolving conflicts
Other conflicts may occur between two nucleotides. Judgment on how to resolve such conflicts should be based on:
• Quality of reads on both strands (take data from the most consistent sequence)
• Two differing bases may be called on the two sequences because the position is genuinely a SNP, so judgment should be based on the quality of the reads and also on background knowledge of the sequences being analyzed.
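The quality-based part of this judgment can be written down as a simple rule. The sketch assumes a Phred-like quality score for each call and an arbitrary margin of 10; both are assumptions for illustration. When neither read is clearly better, it returns None so that the position is checked by eye, since it may be a genuine SNP.

```python
def resolve_conflict(fwd_base, fwd_quality, rev_base, rev_quality,
                     min_margin=10):
    """Pick the base call for one position of a forward/reverse pair.

    A gap is written as "-". Returns the agreed or clearly better
    supported call, or None when the position needs manual inspection.
    """
    if fwd_base == rev_base:
        return fwd_base
    if fwd_quality - rev_quality >= min_margin:
        return fwd_base
    if rev_quality - fwd_quality >= min_margin:
        return rev_base
    return None  # similar quality on both strands: inspect the traces
```

For the earlier example, a good quality “A” on the forward strand against a low quality gap on the reverse strand resolves to “A”.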
Consensus sequences
Once you have assembled the reads and resolved conflicts, you can extract a consensus sequence that is used in further analysis.
The End