Sequence analysis
Bioinformatics course 2018
Dr Dominique Anderson
SANBI
Define ‘BIOINFORMATICS’
• “the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics” ~ Merriam-Webster dictionary
• “Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying "informatics" techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.” ~ Luscombe et al 2001
• “The computer-based discipline that includes methods for storage, retrieval and analysis of biological data such as RNA, DNA and protein sequences, structures and genetic interactions, by constructing electronic databases.” ~ Collins Dictionary of Biology
Sequence analysis
• Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution
• The comparison of sequences in order to find similarity, often to infer whether they are related (homologous)
• Identification of intrinsic features of the sequence such as active sites, post-translational modification sites, gene structures, reading frames, distributions of introns and exons, and regulatory elements
• Identification of sequence differences and variations such as point mutations and single nucleotide polymorphisms (SNPs) in order to find genetic markers
• Revealing the evolution and genetic diversity of sequences and organisms
• Identification of molecular structure from sequence alone
• So in this section, we take a break from the command line and explore other tools for sequence analysis
• The amount of data being generated is massive
• Remember – experimental validation
• Sequence analysis depends on your objectives
• We will explore 2 cases
  • Single sequence to protein to function
  • Multiple sequences
About NGS sequencing
• First human genome sequenced
  • 15-year multi-team project
  • Cost approximately $2.7 billion
• And now?
  • You can have your genome sequenced on the NovaSeq (Illumina) in an hour
  • Cost of approximately $1000
• The goal is the $100 human genome
Application and industry
• Precision medicine market: estimated at $46 billion in 2016
https://www.gminsights.com/industry-analysis/precision-medicine-market
Databases for sequence analysis
• Primary databases
  • are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure
  • GenBank, ArrayExpress, PDB
• Secondary databases
  • comprise data derived from the results of analysing primary data
  • UniProt, Ensembl, InterPro
• Curated vs non-curated
BLAST glossary
• Basic Local Alignment Search Tool
• Identity
  • The extent to which two (nucleotide or amino acid) sequences have the same residues at the same positions in an alignment, often expressed as a percentage
• Score
  • Calculated as the sum of substitution and gap scores
• E value
  • Expect value: the number of different alignments with scores equivalent to or better than S (the alignment score) that are expected to occur in a database search by chance. The lower the E value, the more significant the score and the alignment
• Global vs local alignment
  • Global: alignment of two nucleic acid or protein sequences over their entire length
  • Local: alignment of a high-scoring region of two nucleic acid or protein sequences
• Homologs, orthologs and paralogs
  • Homologs - biological components that descend from a single component in a common ancestor
  • Orthologs - homologs in different species that arose from a single component present in the common ancestor of the species; they may or may not have a similar function
  • Paralogs - homologs within a single species that arose by gene duplication
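The glossary terms above can be made concrete with a small sketch. It computes percent identity and a score (sum of substitution and gap scores) for an already-aligned pair, and contrasts the best global score with the best local score using textbook dynamic programming. The scoring values (+1 match, -1 mismatch, -2 gap) and the toy sequences are assumptions chosen for illustration; BLAST itself uses substitution matrices, affine gap penalties and heuristics that are not reproduced here.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative scoring scheme (assumption)

def percent_identity(aln_a, aln_b):
    """Percentage of alignment columns where both rows carry the same residue."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)

def alignment_score(aln_a, aln_b):
    """Sum of substitution scores and gap scores over all alignment columns."""
    score = 0
    for x, y in zip(aln_a, aln_b):
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score

def best_score(a, b, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            table[i][0] = i * GAP
        for j in range(1, cols):
            table[0][j] = j * GAP
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cell = max(table[i - 1][j - 1] + sub,
                       table[i - 1][j] + GAP,
                       table[i][j - 1] + GAP)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]

# Toy sequences (hypothetical): they share only the region ACGT.
print(percent_identity("ACGT-A", "ACCTTA"))
print(alignment_score("ACGT-A", "ACCTTA"))
print(best_score("GGGACGT", "ACGTCCC", local=True))
print(best_score("GGGACGT", "ACGTCCC", local=False))
```

The local score rewards the shared ACGT region, while the global score is penalized for the unrelated flanks. This is why a local method suits database searches with partially related sequences.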
Scenario 1
• The deep-sea vent metagenome
• You have assembled contigs from the sequence data
• Now you want to annotate the sequence
• You will predict ORFs
• You will predict proteins and parameters
• You will look at 3D structural predictions for proteins of interest
• You will link proteins to possible functions and pathways
• https://www.ncbi.nlm.nih.gov/assembly/GCA_900216165.1/
• Contig file
• Save NODE 1 as a separate file
• BLASTn for a first line of investigation
• ORF finding
  • NCBI ORFfinder
  • GeneMark
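Saving one contig from a multi-FASTA file can be done by hand in a text editor, but a few lines of code do the same job. The sketch below is a minimal FASTA reader and writer. The record name `NODE_1` and the file names in the commented usage are assumptions, since the actual headers in the downloaded contig file may look different.

```python
def parse_fasta(lines):
    """Read FASTA-formatted lines into a dict of {record id: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def format_fasta(name, seq, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"

# Toy input (hypothetical headers and sequences).
demo = [">NODE_1 demo contig", "ACGT", "TTGA", ">NODE_2", "GGCC"]
contigs = parse_fasta(demo)
print(format_fasta("NODE_1", contigs["NODE_1"], width=4))

# With real files (names are placeholders):
# with open("contigs.fasta") as handle:
#     contigs = parse_fasta(handle)
# with open("node1.fasta", "w") as out:
#     out.write(format_fasta("NODE_1", contigs["NODE_1"]))
```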
Nucleotide to protein
• ORF finders will translate the nucleotide sequence for you
• If you need to do this manually, there are MANY tools you can use
• Look at ExPASy tools
  • https://www.expasy.org/resources/
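To see what an ORF finder does under the hood, here is a minimal sketch of translation and ORF detection with the standard genetic code. It scans only the three forward reading frames and treats ATG as the only start codon; real ORF finders typically also scan the reverse strand and may allow alternative start codons. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}

def translate(seq):
    """Translate a nucleotide sequence codon by codon; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def find_orfs(seq, min_codons=2):
    """Return (start, end, protein) for ATG...stop ORFs in the 3 forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Length counts the stop codon; the protein excludes it.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(seq[start:i])))
                start = None
    return orfs

print(translate("ATGGCCTAA"))
print(find_orfs("CCATGGCCAAATAGGG"))
```

Coordinates are 0-based with an exclusive end, following Python slicing, so `seq[start:end]` returns the ORF including its stop codon.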
Now you have a protein sequence, so what??
• It is better to use a protein sequence than a nucleotide sequence
  • Degeneracy in the genetic code means more accurate results with protein
  • NT changes do not always result in AA changes
• Sequence analysis relies on alignment, no matter which tool you use to do it
• The 4-letter NT code carries less information than the 20-letter AA code
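The point about degeneracy can be checked directly. The sketch below counts how many codons encode each amino acid in the standard genetic code, then compares two coding sequences at the nucleotide level and at the protein level. The sequences are made-up examples, and `compare_coding` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid: most amino acids have several.
CODONS_PER_AA = Counter(CODON_TABLE.values())

def translate(seq):
    """Translate an in-frame coding sequence with the standard code."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))

def compare_coding(seq_a, seq_b):
    """Return (nucleotide differences, amino acid differences) for two in-frame sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be the same length and a multiple of 3")
    nt_changes = sum(x != y for x, y in zip(seq_a, seq_b))
    aa_changes = sum(x != y for x, y in zip(translate(seq_a), translate(seq_b)))
    return nt_changes, aa_changes

print(CODONS_PER_AA["L"], CODONS_PER_AA["M"])   # leucine vs methionine
print(compare_coding("ATGGCTCTGAAA", "ATGGCCTTGAAG"))  # synonymous changes only
print(compare_coding("ATGGCT", "ATGGAT"))              # one change that alters the protein
```

In the first comparison three nucleotides differ but the protein is unchanged, so a protein-level alignment scores these sequences as identical while a nucleotide-level alignment does not.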
What next?
• BLASTp
  • What, who, how much (relatedness)
  • Infers homology (guilt by association) but calculates conservation/similarity
• ProtParam
  • All important information when you want to design downstream experimental validation
  • Special note on the GRAVY number (grand average of hydropathy): it summarizes the hydropathy of a protein, which influences its folding
• 2D predictions (PSIPRED)
• 3D model predictions
• What about functions and other experiments?
  • Re-invent the wheel
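As an illustration of the GRAVY number, the sketch below averages Kyte-Doolittle hydropathy values over all residues of a protein, which is how the grand average of hydropathy is defined. The example peptides are made up. Positive values indicate a more hydrophobic protein, negative values a more hydrophilic one.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(protein):
    """Grand average of hydropathy: sum of residue values divided by length."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    unknown = set(protein) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)

# Hypothetical peptides: one hydrophobic, one charged.
print(round(gravy("LIVAVLLI"), 3))
print(round(gravy("KDERKDEE"), 3))
```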
Function predictions
• My favorite databases
  • KEGG
  • InterPro
  • UniProt
  • STRING – functional protein association networks
• The Gene Ontology (GO) project provides a hierarchical controlled vocabulary split into 3 categories
  • Biological process
  • Molecular function
  • Cellular component
Multiple sequence alignment
• Multiple sequence alignments can tell you where the conserved and variable regions of a sequence are, which is important for understanding the biology of the sequences under investigation
• Can also demonstrate evolutionary trends and conservation
• Practical applications, such as being able to design PCR primers that will amplify sequences from a number of different species
Example
• Download the sequences for glyceraldehyde 3-phosphate dehydrogenase from NCBI
• Add these PROTEIN sequences to UniProt
• Run alignment
  • Find conserved regions
• Can you identify active sites for G3PD? (key: look at 97% conservation)
• You can look at phylogenetic trees (to be covered in the phylogenetics section)
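Finding conserved regions in a finished alignment comes down to scoring each column. The sketch below is one simple way to do it: for every column it reports the fraction of sequences sharing the most common residue, then lists the columns at or above a threshold. The 0.97 default mirrors the 97% hint above; the four aligned peptides are invented and are not real G3PD sequences.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    fractions = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps never count as conserved
        if not residues:
            fractions.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        fractions.append(top_count / len(column))
    return fractions

def conserved_positions(alignment, threshold=0.97):
    """0-based column indices whose conservation meets the threshold."""
    return [i for i, f in enumerate(column_conservation(alignment)) if f >= threshold]

# Toy alignment (hypothetical sequences).
toy = ["MKVGING", "MKIGING", "MRVGI-G", "MKVGVNG"]
print(column_conservation(toy))
print(conserved_positions(toy))
```

With only four sequences, a 97% threshold passes fully conserved columns only; with hundreds of sequences it tolerates a few divergent ones.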
SNP analysis
• Used to detect polymorphisms within a population
• A single nucleotide polymorphism (SNP) is a variation at a single site in DNA
• For example, DNA strand 1 differs from DNA strand 2 at a single base-pair location
• Multiple tools are available: MUMmer is frequently used, and commercial options include CLC bio and DNA Baser
• Look at SNiPlay
• https://omictools.com/
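The "strand 1 differs from strand 2 at a single location" idea can be sketched in a few lines. The code assumes two sequences of equal length that are already aligned without gaps, and the example strands are made up. Dedicated tools do far more, including alignment, indel handling and quality filtering.

```python
def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()))
            if r != s]

# Toy strands (hypothetical): they differ at one site only.
strand_1 = "AAGCCTA"
strand_2 = "AAGCTTA"
print(find_snps(strand_1, strand_2))
```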
RNA-seq analysis
• RNA-seq technology is very useful for differential expression analysis between specific conditions
• Following sequencing, the short reads generated are mapped to a genome or transcriptome
• Mapped data are normalized and differentially expressed genes are identified using statistical models
• Relevance of the produced data is finally evaluated from a biological context
• There is no consensus about which methodology is most appropriate or which approach ensures the validity of the results in terms of robustness, accuracy and reproducibility
• Reference: RNAontheBENCH: computational and empirical resources for benchmarking RNA-seq quantification and differential expression methods
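To make the "normalize, then compare" step tangible, here is a deliberately simplified sketch: read counts are scaled to counts per million (CPM) so that libraries of different depth are comparable, and a log2 fold change is computed per gene. The gene names, counts and pseudocount are assumptions. This is not the statistical modelling mentioned above, since it has no replicates, dispersion estimates or significance tests, which dedicated packages such as DESeq2 or edgeR provide.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each library sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no mapped reads")
    return {gene: 1e6 * c / total for gene, c in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2(treated / control) on CPM values; the pseudocount avoids division by zero."""
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {gene: math.log2((cpm_treated[gene] + pseudocount) /
                            (cpm_control[gene] + pseudocount))
            for gene in cpm_control}

# Toy counts (hypothetical): the treated library is sequenced twice as deep.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 400, "geneB": 600, "geneC": 1000}
for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```

geneB has twice the raw count in the treated sample, yet its fold change is zero after normalization because the whole library is twice as deep. This is the reason normalization comes before any comparison.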
Conclusions
• The main take-away point is that if you have a DNA sequence (derived from any method), you can analyze the sequence and infer biological functions
• Remember that experimental validation will be key
• References to look at:
  • http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190152
  • http://www.bioinf.jku.at/teaching/current/ws_sapvl/BioInf_I_Notes.pdf
  • https://omictools.com/