CSCE555 Bioinformatics
![Page 1: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/1.jpg)
CSCE555 Bioinformatics
Lecture 2
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu
![Page 2: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/2.jpg)
Roadmap
DNA, Chromosomes, Genomes
Genome Sequencing and whole genomes
DNA Sequence Representation, Models
Sequence Retrieval, Manipulation
Basic Analysis and Questions of Genomes
Summary
04/21/23 2
![Page 3: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/3.jpg)
Tools to Learn Concepts Quickly
Wikipedia.org
◦Search “Genome” to bring up a great deal of related information
◦In Google, type “keywords wiki”
Google search tips
◦Find info from university websites: Genome site:edu
◦Find info as PowerPoint files: Genome tutorial filetype:ppt
![Page 4: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/4.jpg)
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms.
Backbone: sugars and phosphate groups
DNA is a long polymer of simple units called nucleotides
Bases
A: adenine C: cytosine G: guanine T: thymine
![Page 5: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/5.jpg)
Microbial Genome: Clostridium sp. OhILAs
CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG
AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT
TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA
GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT
ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT
ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT
AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT
TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT
TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG
AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT
CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC
GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT
TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT
AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC
AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT
TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC
TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC
AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT
Complementary Base Pairing:
A-T
C-G
Exercise: write a program to output the complementary sequence.
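One possible answer to the exercise above, as a minimal Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not part of the lecture, and the code assumes the input contains only the letters A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complement read in the opposite direction,
    i.e. the opposite strand written 5' to 3'."""
    return complement(seq)[::-1]


# First 16 bases of the Clostridium sequence shown above.
fragment = "CTGCTGTACTAGGATG"
print(complement(fragment))
print(reverse_complement(fragment))
```

The reverse complement is included because the two strands of DNA run in opposite directions, so it is usually the form needed when comparing a sequence with its partner strand.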
![Page 6: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/6.jpg)
Genome of organisms
The genome of an organism is the complete DNA sequence of one set of chromosomes
![Page 7: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/7.jpg)
Sequencing: Basic Ideas
Current lab techniques can sequence small (say 700 base pairs) DNA pieces.
◦ Use restriction enzymes to cut DNA pieces
◦ Sort pieces of different sizes using gel electrophoresis and use the sorting to read them
Mapping and Walking
◦ Sequence one piece, get 700 letters, make a primer that allows you to read the next 700, and work sequentially down the clone
◦ Estimate for human genome sequencing using this method: 100 years
Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes
◦ Obtain random sequence reads from a genome
◦ Assemble them into contigs on the basis of sequence overlaps
Straightforward for simple genomes (with no or few repeat sequences)
◦ Merge reads containing overlapping sequence
Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches
![Page 8: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/8.jpg)
How Sequencing Works
Beckman CEQ 8000
![Page 9: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/9.jpg)
Sequencing small DNA pieces
Use DNA cloning or PCR to make multiple copies.
Put them in 4 test tubes marked G, A, T and C.
In test tube G, use restriction enzymes that cut at G.
Repeat the above step for the other test tubes.
Run gel electrophoresis separately on the contents of each test tube.
The resulting data are shown in the table below.
Reading the table, G has lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9, 18; and C has lengths 3, 10, 17.
This gives us the sequence GACTTAGATCAGGAAACTG.
Gel readout (fragment length, shortest first, and the lane G, A, T or C in which its band appears):
Length  Lane
1       G
2       A
3       C
4       T
5       T
6       A
7       G
8       A
9       T
10      C
11      A
12      G
13      G
14      A
15      A
16      A
17      C
18      T
19      G
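The readout step can be sketched in a few lines of Python. The fragment lengths are the ones listed on this slide; the function name `read_gel` is an illustrative choice, and the sketch assumes every length from 1 to n appears in exactly one lane.

```python
# Fragment lengths observed in each lane, as listed on the slide.
lanes = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}


def read_gel(lanes):
    """Rebuild the sequence: a fragment of length n in lane X means
    that position n of the sequence is base X."""
    base_at = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at[length] = base
    # Read the bands from the shortest fragment to the longest.
    return "".join(base_at[pos] for pos in sorted(base_at))


print(read_gel(lanes))
```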
![Page 10: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/10.jpg)
Methods for very large scale sequencing
A hierarchical approach
◦ Map on a large scale (physical mapping), sequence specific clones whose position in the genome is known
Shotgun sequencing
◦ “Tear up” the genome and sequence random fragments until it is done
Sequence tagged connectors (STC)
◦ Sequence the ends of many clones and use this info to pick overlapping clones
![Page 11: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/11.jpg)
“Shotgun” sequencing
Clone to sequence
Copy
Sub-clone
Sequence and “assemble”
….GTCTACCTGTACTGATCTAGC...
…. CCTGTACTGATCTAGCATTA...
…. GTACTGATCTAGCATTACG...
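To make the “assemble” step concrete, here is a toy Python sketch that merges the three reads above by their overlaps. It is a deliberate simplification, not a real assembler: it assumes the reads are error-free, already listed left to right, and free of repeats. The names `overlap` and `assemble_in_order` are illustrative.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def assemble_in_order(reads):
    """Merge reads one after another, dropping the overlapping prefix
    of each new read. With no overlap the read is simply appended."""
    contig = reads[0]
    for read in reads[1:]:
        contig += read[overlap(contig, read):]
    return contig


# The three reads from the slide, without the leading and trailing dots.
reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
print(assemble_in_order(reads))
```

Real shotgun assembly also has to discover the order of the reads, handle both strands and sequencing errors, and cope with repeats, which is why repeat-rich genomes are much harder.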
![Page 12: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/12.jpg)
Emerging Sequence Methods
Sequencing by Hybridization (SBH)
Mass Spectrophotometric Sequencing
Direct Visualization of Single DNA Molecules by Atomic Force Microscopy (AFM)
Single Molecule Sequencing Techniques
Single Nucleotide Cutting
Nanopore Sequencing
Readout of Cellular Gene Expression
![Page 13: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/13.jpg)
Whole Genomes of Species
Bacterial Genomes
Eukaryotic Genomes
Human Genome Project
Other Animal and Plant Genomes
Model Genomes
The genomes of more than 180 organisms have been sequenced since 1995
http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml
![Page 14: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/14.jpg)
Sizes of Genomes
You will learn to download all these genomes onto your computer’s hard drive
Refer to Table 1.1 Page 2 of Intro to Comp Genomics book.
![Page 15: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/15.jpg)
Roadmap
DNA, Chromosomes, Genomes
Genome Sequencing and whole genomes
DNA Sequence Representation, Models
Sequence Retrieval, Manipulation
Basic Analysis and Questions of Genomes
Summary
![Page 16: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/16.jpg)
DNA Sequence Representation
DNA sequence: a string of letters with alphabet {A, C, G, T}
Protein sequence: a string of amino acids with alphabet {ARNDCEQGHILKMFPSTWYV}
◦20 standard amino acids
Genetic code:
![Page 17: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/17.jpg)
Genetic Code: Codon
DNA (ATCG)
RNA (AUCG)
Three bases of DNA encode an amino acid
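The three-bases-per-amino-acid rule can be sketched as a lookup table in Python. The table below is the standard genetic code written in the compact TCAG ordering; the function name `translate` and the example sequence are illustrative, and the sketch assumes the input starts in the correct reading frame and contains only A, C, G, T (or U).

```python
from itertools import product

# Standard genetic code: codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...), one letter per amino acid, '*' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the
    first stop codon. RNA input is accepted by mapping U to T."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example: start codon, two codons, stop codon.
print(translate("ATGGCCTGGTAA"))
```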
![Page 18: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/18.jpg)
Genetic Code with Degeneracy
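Degeneracy means that most amino acids are encoded by more than one codon: 64 codons map onto 20 amino acids plus stop. A short Python sketch, using the standard genetic code in TCAG ordering, groups codons by the amino acid they encode; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in TCAG codon order, '*' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Group the 64 codons by the amino acid (or stop) they encode.
codons_for = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    codons_for[aa].append("".join(codon))

# Print the amino acids from most to least degenerate.
for aa, codons in sorted(codons_for.items(), key=lambda item: -len(item[1])):
    print(aa, len(codons), " ".join(codons))
```

Leucine, serine and arginine each have six codons, while methionine and tryptophan have only one.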
![Page 19: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/19.jpg)
Representation of Sequences
Single DNA sequence
◦ATCCTTAAGGAAA
Multiple sequences with similarity
◦Regular Expression
◦ATAAA
◦ACAAAA
◦ATAAAAAA
◦A[TC]A+
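The pattern A[TC]A+ can be checked directly with Python's `re` module. The first three sequences are the ones on the slide; the fourth, `AGAAA`, is a hypothetical counter-example added to show a non-match.

```python
import re

# A, then T or C, then one or more A.
motif = re.compile(r"A[TC]A+")

sequences = ["ATAAA", "ACAAAA", "ATAAAAAA", "AGAAA"]

# fullmatch requires the whole sequence to fit the pattern.
matches = {seq: bool(motif.fullmatch(seq)) for seq in sequences}
for seq, ok in matches.items():
    print(seq, "matches" if ok else "does not match")
```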
![Page 20: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/20.jpg)
Representation of Sequences
Probabilistic Model: Position-specific scoring matrices (PSSM)
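A PSSM can be sketched as one score table per alignment column. The Python example below uses log-odds scores against a uniform background with a pseudocount of 1; these choices, the aligned sites, and the names `build_pssm` and `score` are all assumptions made for illustration rather than the method used in the lecture.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length aligned sequences.
    Returns one {base: score} dictionary per column."""
    pssm = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for base in alphabet:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, seq):
    """Sum the column scores of `seq`; higher means a better fit to the motif."""
    return sum(column[base] for column, base in zip(pssm, seq))


# Hypothetical aligned binding sites, all of length 5.
sites = ["ATAAA", "ACAAA", "ATAAA", "ATAAT"]
pssm = build_pssm(sites)
print(score(pssm, "ATAAA"), score(pssm, "GGGGG"))
```

Unlike a regular expression, which only answers yes or no, the PSSM gives a graded score, so near-matches can be ranked.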
![Page 21: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/21.jpg)
Representation of Sequence: FASTA format
A text-based format for representing either nucleic acid sequences or peptide sequences; it allows sequence names and comments to precede the sequences.
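A minimal FASTA reader in Python, as a sketch: each record starts with a `>` line holding the name and comments, followed by one or more sequence lines. The sample records and the function name `parse_fasta` are hypothetical.

```python
import io


def parse_fasta(handle):
    """Read FASTA text from a file-like object and return
    {header line without '>': sequence joined across lines}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


# Hypothetical two-record file held in memory for the example.
sample = io.StringIO(
    ">seq1 example record\n"
    "ATCCTTAAGG\n"
    "AAA\n"
    ">seq2 another example\n"
    "ACAAAA\n"
)
records = parse_fasta(sample)
for header, sequence in records.items():
    print(header, len(sequence))
```

For a file on disk, `parse_fasta(open(path))` would work the same way; whole genomes fit comfortably in memory at bacterial sizes.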
![Page 22: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/22.jpg)
Roadmap
DNA, Chromosomes, Genomes
Genome Sequencing and whole genomes
DNA Sequence Representation, Models
Sequence Retrieval, Manipulation
Basic Analysis and Questions of Genomes
Summary
![Page 23: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/23.jpg)
Sequence Retrieval, Manipulation
Where to download genome/sequence data
◦Online databases: EMBL, GenBank
◦Entrez cross-database search (life science search engine)
◦Google
![Page 24: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/24.jpg)
![Page 25: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/25.jpg)
Example: Download H. influenzae Genome
First bacterial genome: H. influenzae, 1830 Kb
http://www.ncbi.nlm.nih.gov/sites/entrez
NC_007146
Haemophilus influenzae 86-028NP, complete genome
DNA; circular; Length: 1,914,490 nt
Replicon Type: chromosome
Created: 2005/06/27
![Page 26: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/26.jpg)
Genome Information of H. influenzae
![Page 27: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/27.jpg)
Download the Complete Genome Sequence in FASTA Format
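The slides download the genome through the Entrez web page. As an alternative, the same FASTA file can probably be fetched from a script through NCBI's E-utilities `efetch` endpoint; using that route here is an assumption, not something shown in the lecture. The sketch below only builds and prints the request URL; `download_fasta` is a hypothetical helper that needs network access and is defined but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (assumed route; the slides use the web page).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str) -> str:
    """Build the efetch URL that returns a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_fasta(accession: str, path: str) -> None:
    """Hypothetical helper: save the FASTA record to `path` (requires network)."""
    with urlopen(efetch_fasta_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())


# Accession of the H. influenzae 86-028NP genome named on the earlier slide.
url = efetch_fasta_url("NC_007146")
print(url)
```

Calling `download_fasta("NC_007146", "NC_007146.fasta")` would then write the roughly 1.9 Mb genome to disk; NCBI asks scripted clients to limit their request rate.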
![Page 28: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/28.jpg)
Roadmap
DNA, Chromosomes, Genomes
Genome Sequencing and whole genomes
DNA Sequence Representation, Models
Sequence Retrieval, Manipulation
Basic Analysis and Questions of Genomes
Summary
![Page 29: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/29.jpg)
Simple Questions and Analysis of Genome Sequence
Frequencies of bases A/C/G/T by simple counting
Sliding windows to check local density
K-mers: frequent/unusual words
◦2-mers: AT AG AC TA TG TC etc.
◦3-mers
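Base counting and k-mer counting both reduce to tallying substrings. A minimal Python sketch follows; the function names are illustrative, the example string is the first 30 bases of the Clostridium sequence shown earlier, and the sequence is assumed to be non-empty and upper-case.

```python
from collections import Counter


def base_frequencies(seq: str) -> dict:
    """Fraction of A, C, G and T among the unambiguous bases of `seq`."""
    counts = Counter(seq)
    total = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / total for base in "ACGT"}


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


fragment = "CTGCTGTACTAGGATGCTGGTGGAGAGAGC"
print(base_frequencies(fragment))
print(kmer_counts(fragment, 2).most_common(3))  # most frequent 2-mers
print(kmer_counts(fragment, 3).most_common(3))  # most frequent 3-mers
```

Unusual words can be found the same way by comparing each observed count with the count expected from the single-base frequencies.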
![Page 30: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/30.jpg)
Page 627
Genomic landscape: GC content analysis
The overall GC content of the human genome is 41%.
A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right.
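The windowed GC analysis can be sketched in Python as follows. The function names are illustrative; for a real genome the window would be 20,000 bases as on the slide, while the toy call below uses a window of 4 so the result is easy to check by hand.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step=None) -> list:
    """GC content of each full window of length `window`.
    Windows do not overlap unless a smaller `step` is given."""
    step = step or window
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


# Toy example: two non-overlapping windows, then overlapping windows.
print(windowed_gc("GGCCAATT", 4))
print(windowed_gc("GGCCAATT", 4, step=2))
```

A histogram of the values returned for 20 kb windows gives the kind of distribution described above.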
![Page 31: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/31.jpg)
Fig. 17.15, Page 628. Source: IHGSC (2001)
GC content of the human genome: mean 41%
![Page 32: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/32.jpg)
Genomic landscape: CpG islands
Dinucleotides of CpG are under-represented in genomic DNA, occurring at one fifth the expected frequency.
CpG dinucleotides are often methylated on cytosine (and may subsequently be deaminated to thymine).
Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions.
Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression.
They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
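The under-representation of CpG can be measured as an observed/expected ratio. The Python sketch below uses the common definition in which the expected CpG count is (number of C × number of G) / sequence length; that formula and the function name are assumptions for illustration, since the slides do not state how the expected frequency is computed.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected if C and G
    occurred independently: (C count * G count) / sequence length."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")
    expected = c_count * g_count / len(seq)
    return observed / expected


# Toy sequences: CpG-rich, neutral, and CpG-free.
for toy in ("CGCG", "CCGG", "AATT"):
    print(toy, cpg_observed_expected(toy))
```

On this scale, the genome-wide figure of one fifth quoted above corresponds to a ratio of about 0.2, while CpG islands stand out as windows with a much higher ratio.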
![Page 33: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/33.jpg)
Broad genomic landscape: CpG islands
Findings:
◦50,267 CpG islands in human genome
◦28,890 after masking repeats with RepeatMasker
◦5-15 CpG islands per megabase
◦(about <40 genes per megabase)
![Page 34: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/34.jpg)
Summary
DNA, Chromosome, Genome
Sequence models
Sequence database, retrieval
Whole genome sequence analysis
![Page 35: CSCE555 Bioinformatics](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814f2a550346895dbcb5f1/html5/thumbnails/35.jpg)
Slides Credits
Slides in this presentation are partially based on slides from the Internet.