CSE/Beng/BIMM 182: Biological Data Analysis
-
Upload
otto-sexton -
Category
Documents
-
view
24 -
download
1
description
Transcript of CSE/Beng/BIMM 182: Biological Data Analysis
![Page 1: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/1.jpg)
CSE/Beng/BIMM 182: Biological Data Analysis
Instructor: Vineet Bafna
TA: Yuan Zhao
Course Link
![Page 2: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/2.jpg)
Today
• We will explore the syllabus through a series of questions?
• Please ASK• All logistical information will be given at the
end
![Page 3: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/3.jpg)
Introduction to the class:Databases
• Biological databases are diverse– Often, little more than large text files
• Database technology is about formally representing data and the inter-relationships among the data objects.
• This course is not about databases, but about the data itself.• We will ‘look’ at many biological databases (keep a count!) but not at
their formal structure. Instead, we will ask:– How can we represent the data?– How can we query this data?
• In order to understand the data, we need to know a little Biology.
![Page 4: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/4.jpg)
Life begins with Cell
• A cell is a smallest structural unit of an organism that is capable of independent functioning
• All cells have some common features
![Page 5: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/5.jpg)
All life depends on 3 critical molecules
• Protein– Form enzymes, send signals to other cells, regulate gene activity.– Form body’s major components (e.g. hair, skin, etc.).
• DNA– Hold information on how cell works
• RNA– Act to transfer short pieces of information to different parts of cell– Provide templates to synthesize into protein
![Page 6: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/6.jpg)
The molecules of Life and Bioinformatics
• DNA, RNA, and Proteins can all be represented as strings!
• DNA/RNA are string over a 4 letter alphabet(A,C,G,T/U).
• Protein Sequences are strings over a 20 letter alphabet.
• This allows us to store and query them as text.
![Page 7: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/7.jpg)
History of Genbank
• In 1982 Goad's efforts were rewarded when the National Institutes of Health funded Goad's proposal for the creation of GenBank, a national nucleic acid sequence data bank. By the end of 1983 more than 2,000 sequences (about two million base pairs) were annotated and stored in GenBank.
Walter Goad, 1942-2000
![Page 8: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/8.jpg)
Sequence data
![Page 9: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/9.jpg)
![Page 10: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/10.jpg)
How do we query a sequence database?
• By name• By sequence• ‘Relational’ queries
are barely applicable
![Page 11: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/11.jpg)
Quiz:DNA sequence databases
Suppose you have a 100nt sequence, and you want to know if it is human, what will you do?
How much time will it take? Or, how many steps? (Query=m, Database = n)
• What if you were interested in identifying the human homolog of a mouse sequence ( 85% identical)? How much time will it take? What if the query was 10Kbp? What if it was the entire genome?
ACGGATCGGCGAATCGAATCGTGGGCCTTAdatabase
AATCGTquery
![Page 12: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/12.jpg)
BLAST
• Allows querying sequence databases with sequence queries.
• It is the prototypical search tool.
• The paper describing it was the most cited paper in the 90s.
![Page 13: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/13.jpg)
Quiz:BLAST
What do you do if BLAST does not return a ‘hit’?
What does it mean if BLAST returns a sequence that is 60% identical? Is that significant (are the sequences evolutionarily related)?
Suppose Protein sequences A & B are 40% identical, and A &C are 40% identical. If we know that A&B are evolutionarily related, what does that say about A & C?
![Page 14: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/14.jpg)
Non sequence based queries
• Biological databases are not limited to sequences.
![Page 15: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/15.jpg)
Protein Sequences have structure
Quiz: Can you search using a structure query?
![Page 16: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/16.jpg)
Ex2: Sequences have motifs
How to represent and query such motifs?
![Page 17: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/17.jpg)
Quiz: Protein Sequence Analysis
• You are interested in all protein sequences that have the following pattern:– [AC]-x-V-x(4)-{ED}
• This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
• How can you search a protein sequence database for any such pattern?
• What if the database was a collection of patterns ?
![Page 18: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/18.jpg)
Database of Protein Motifs
![Page 19: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/19.jpg)
Quiz: Protein Sequence Analysis
Proteins fold into a complex 3D shape. Can you predict the fold by looking at the sequence?
What is a domain? How can you represent a domain? How can you query?
![Page 20: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/20.jpg)
Quiz: Biology
• DNA is the only inherited material. Proteins do most of the work, so DNA must somehow contain information about the proteins.
• How is the information about proteins encoded in DNA? What is the region encoding this information called?
![Page 21: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/21.jpg)
DNA, RNA and flow of information
• A gene is expressed in two steps1) Transcription: RNA synthesis2) Translation: Protein synthesis
![Page 22: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/22.jpg)
DNA, RNA, and the Flow of Information
TranslationTranscription
Replication
![Page 23: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/23.jpg)
Quiz:
How would you find genes in genomic sequence?
What is splicing? Alternative splicing? How can you (computationally) tell if a gene has alternative splice forms?
What is a gene?
![Page 24: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/24.jpg)
Quiz:Transcription?
• What causes transcription to switch on or off? How can we find transcription factor binding sites?
• The number of transcripts of a gene is indicative of the activity of the gene. Can we count the number of transcripts? Can we tell if the number of copies is abnormally high, or abnormally low?
![Page 25: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/25.jpg)
Quiz: Translation
• How is Protein Sequencing done?
Many proteins are post-translationally modified. How can you identify those proteins?
• What is a mass spectrometer?
![Page 26: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/26.jpg)
Quiz: Translation
• Are all genes translated?
• Can you predict non-coding genes in the genome? Can you predict structure for RNA?
• What is special about RNA?
![Page 27: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/27.jpg)
RNA sequences have Structure
![Page 28: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/28.jpg)
Quiz:RNA
• How can you predict secondary, and tertiary structure of RNA?
• Given an RNA query (sequence + structure), can you find structural homologs in a database? EX: tRNA
![Page 29: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/29.jpg)
Packaging
• All of the transcripts are encoded in DNA, which is packaged into the genome.
• Many databases (much of sequence) are devoted to storing entire genomic sequences.
![Page 30: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/30.jpg)
Genome Sequencing
• How is the genome sequence determined? Sequences can only be read 500-1000bp at a time. How long is the human genome?
• If human genome is of length X(=3Gb), and each shotgun fragment is of length y, how many fragments do we need to get X
• What is shotgun sequencing?
![Page 31: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/31.jpg)
Quiz: Sequencing
• Suppose you have fragments, and you want to assemble them into the genome, how would you do it? – How would you determine the overlaps– Layout, Consensus?
![Page 32: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/32.jpg)
1997
What was the main point of the debate?
![Page 33: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/33.jpg)
2001
![Page 34: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/34.jpg)
Sequencing Populations
• It took a long time (10-15 yrs) to produce the draft sequence of the human genome.
• Now, entire populations can have their DNA sequenced. Why do we care?
![Page 35: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/35.jpg)
Personalized genomics
April’08 Bafna
![Page 36: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/36.jpg)
UCSD Bix
23andMe
Sep’07
![Page 37: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/37.jpg)
UCSD BixSep’07
![Page 38: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/38.jpg)
Quiz:Population genetics
• We are all similar, yet we are different. How substantial are the differences?– Why are some people more likely to get a disease
then others?– If you had DNA from many sub-populations, Asian,
European, African, can you separate them?– How is disease gene mapping done?
![Page 39: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/39.jpg)
Variations in DNA
• What is a SNP?• What is DNA
fingerprinting?• What can you
study with these variations?
![Page 40: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/40.jpg)
How do these individual differences occur?
• Mutation• Recombination
![Page 41: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/41.jpg)
Mutations
000001010111000110100101000101010010000000110001111000000101100110
Infinite Sites Assumption:Each site mutates at most once
![Page 42: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/42.jpg)
Recombination
1101010100010111101010001010110100
11010101010110100
![Page 43: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/43.jpg)
Genotypes and Haplotypes• Each individual has two “copies” of each chromosome. • At each site, each chromosome has one of two alleles
• Current Genotyping technology doesn’t give phase
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
01
1 01 1 0 0 1 0
1 0 Genotype for the individual
![Page 44: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/44.jpg)
SNP databases
• Quiz: Given a database of ‘variations’ in a population (EX: dbSNP), how do you use it to map disease genes?
• Given database from different ethnicities, how do we check the ethnicity of a specific individual?
![Page 45: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/45.jpg)
Summary
• Biological data is complex.• Hard to standardize representation, and
harder to query such data• Important to understand this diversity and the
variety of tools available for querying.
![Page 46: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/46.jpg)
Course Outline
• Informal description of various data repositories
• Tools for querying this data– Underlying algorithms– Implementation issues
• Assignments– Using & building simple versions of these tools.
![Page 47: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/47.jpg)
Perl/Python
• Advanced programming skills are not required except in optional projects..
• Facility for handling and manipulating data is important and will be covered in this course.
• Perl/Python are appropriate scripting languages. You can do a lot by learning a little.
![Page 48: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/48.jpg)
Grading
• 40% assignments, 20% Mid-term, 20% Final, 20% Project• For all assignments, you are free to discuss among
yourselves, and use web resources unless otherwise stated.– You must write the assignment yourself.– Cite all sources and collaborators!
• The final exam will be take home and no collaboration is allowed.
• Academic honesty is more important than grades!
![Page 49: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/49.jpg)
Assignment 1
• Online now. (link)• Due in class the following week, but is fairly
simple to accomplish with a scripting language.
![Page 50: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/50.jpg)
Project
• You can team up (<= 3) to do the project.• Some project require more biology, others require
serious programming.• There are 3 checkpoints, after the first midterm.• For the final project, you must make a 15min
presentation at the end of the class.
![Page 51: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062408/568132ae550346895d995edb/html5/thumbnails/51.jpg)
QUESTIONS?