From motif search to gene expression analysis
Protein Motifs
Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile:
P[ED]XK[RW][RK]X[ED]
or as a PWM (position weight matrix)
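A consensus such as P[ED]XK[RW][RK]X[ED] maps almost directly onto a regular expression. The sketch below assumes that X stands for "any amino acid" and that bracketed letters are alternatives; the function names and the test sequence are made up for illustration.

```python
import re

def consensus_to_regex(consensus):
    """Translate a consensus pattern into a regex, assuming X means any residue."""
    return re.compile(consensus.replace("X", "."))

def find_motif(consensus, sequence):
    """Return (start, matched_text) for every non-overlapping match."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]

# Hypothetical protein fragment, not a real sequence
hits = find_motif("P[ED]XK[RW][RK]X[ED]", "MAPEAKRKGDLL")
print(hits)
```

A consensus only says whether a residue is allowed at a position; a PWM additionally scores how likely each residue is there.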
Protein Domains
• In addition to short motifs, proteins are characterized by domains.
• Domains are long motifs (30-100 aa) and are considered the building blocks of proteins (evolutionary modules).
The zinc-finger domain
Some domains can be found in many proteins with different functions:
…while other domains are found only in proteins with a certain function.
MBD = Methylated DNA Binding Domain
Varieties of protein domains
Page 228
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
Pfam
A database that contains a large collection of multiple sequence alignments of protein domains, based on profile hidden Markov models (HMMs).
Compared with a PWM, a profile HMM is a model that considers dependencies between neighboring columns of the matrix (through match, insert, and delete state transitions) and is thus much more powerful.
http://pfam.sanger.ac.uk/
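Profile HMM (Hidden Markov Model) can accurately represent an MSA
[Figure: profile HMM for alignment columns 16-19, with match states M16-M19, insert states I16-I19, and delete states D16-D19]
Aligned sequences (columns 16 17 18 19):
  D R T R
  D R T S
  S - - S
  S P T R
  D R T R
  D P T S
  D - - S
  D - - S
  D - - S
  D - - R
Match-state emission probabilities:
  M16: D 0.8, S 0.2
  M17: P 0.4, R 0.6
  M18: T 1.0
  M19: R 0.4, S 0.6
Delete states emit no residue (X).
Transition labels: 100% (four edges) and 50% (two edges); half of the sequences have gaps in columns 17-18.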
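The match-state emission probabilities in the figure can be recovered by counting residues per alignment column. The sketch below does this for the ten aligned sequences on the slide, assuming plain frequency estimates with no pseudocounts (real profile HMM software adds priors); the function names are illustrative.

```python
from collections import Counter

# The ten aligned sequences from the slide, columns 16-19 ("-" marks a gap)
alignment = ["DRTR", "DRTS", "S--S", "SPTR", "DRTR",
             "DPTS", "D--S", "D--S", "D--S", "D--R"]

def match_emissions(rows):
    """Per column: residue frequencies among the sequences that are not gapped."""
    emissions = []
    for column in zip(*rows):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        emissions.append({aa: n / len(residues) for aa, n in counts.items()})
    return emissions

def gap_fractions(rows):
    """Per column: fraction of sequences that pass through the delete state."""
    return [column.count("-") / len(column) for column in zip(*rows)]

emissions = match_emissions(alignment)
gaps = gap_fractions(alignment)
for position, (emit, gap) in enumerate(zip(emissions, gaps), start=16):
    print(position, emit, "gap fraction:", gap)
```

Counting gives D 0.8 / S 0.2 at column 16 and a gap fraction of 0.5 at columns 17 and 18, which agrees with the 50% transitions in the figure.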
Gene Expression Analysis
Gene Expression
DNA → RNA → protein
[Figure: poly(A)-tailed mRNA transcripts of gene1, gene2, and gene3]
Studying Gene Expression 1987-2011
Spotted microarray (first high throughput gene expression experiments)
DNA chips
RNA-seq (Next Generation Sequencing)
Classical versus modern technologies to study gene expression
Classical methods (spotted microarrays, DNA chips)
- Require prior knowledge of the RNA transcript
- Good for studying the expression of known genes
Next-generation RNA sequencing
- Does not require prior knowledge
- Good for discovering new transcripts
Classical Methods
1. Spotted microarrays: two-channel cDNA microarrays
2. DNA chips: one-channel microarrays (Affymetrix, Agilent)
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Experimental Protocol: Two-channel cDNA arrays
1. Design an experiment (probe design)
2. Extract RNA molecules from cell
3. Label molecules with fluorescent dye
4. Pour solution onto microarray
– Then wash off excess molecules
5. Shine laser light onto array
– Scan for presence of fluorescent dye
6. Analyze the microarray image
Transforming raw data to ratio of expression
Expression ratio: $\log_2(\mathrm{Cy5} / \mathrm{Cy3})$
The ratio of expression is indicated by the intensity of the color:
Red = high mRNA abundance in the experiment sample
Green = high mRNA abundance in the control sample
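As a small worked example of the log2(Cy5/Cy3) transformation, the sketch below converts spot intensities into log ratios. The intensity values and the function name are made up for illustration, and it assumes background-corrected, strictly positive intensities.

```python
import math

def log2_ratio(cy5, cy3):
    """log2 of experiment (Cy5, red) over control (Cy3, green) intensity."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# Hypothetical spot intensities: (Cy5, Cy3)
spots = {"gene1": (400.0, 100.0), "gene2": (100.0, 400.0), "gene3": (250.0, 250.0)}
ratios = {gene: log2_ratio(cy5, cy3) for gene, (cy5, cy3) in spots.items()}
print(ratios)
```

Taking the logarithm makes the scale symmetric: a four-fold increase gives +2 and a four-fold decrease gives -2, while equal abundance gives 0.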
One-channel DNA chips
• Each sequence is represented by a probe set labeled with one fluorescent dye
• Target hybridizes to complementary probes only
• The fluorescence intensity is indicative of the expression of the target sequence
Affymetrix Chip
RNA-seq
NEXT…
Clustering genes according to their expression profiles.
[Figure axes: Genes, Experiments]
WHY?
What can we learn from the clusters?
• Identify gene function
– Similar expression can imply similar function
• Diagnostics and therapy
– Different gene expression can indicate a disease state
– Genes which change expression in a disease can be good candidates for drug targets
HOW?
Different clustering approaches
• Unsupervised
- Hierarchical clustering
- Partition methods (K-means)
• Supervised methods
- Analysis of variance
- Discriminant analysis
- Support Vector Machine (SVM)
Clustering
Clustering organizes things that are close into groups.
- What does it mean for two genes to be close?
- Once we know this, how do we define groups?
What does it mean for two genes to be close?
We need a mathematical definition of the distance between the expression of two genes.
$\mathrm{Gene1} = (E_{11}, E_{12}, \ldots, E_{1N})'$
$\mathrm{Gene2} = (E_{21}, E_{22}, \ldots, E_{2N})'$
For example, the Euclidean distance between gene 1 and gene 2:
$d(\mathrm{Gene1}, \mathrm{Gene2}) = \sqrt{\sum_{i=1}^{N} (E_{1i} - E_{2i})^2}$
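The Euclidean distance between two expression profiles can be computed directly from the definition. This is a minimal sketch; the function name and the example profiles are illustrative, not measured data.

```python
import math

def euclidean_distance(gene1, gene2):
    """Square root of the summed squared differences over the N experiments."""
    if len(gene1) != len(gene2):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(gene1, gene2)))

# Hypothetical log-ratio profiles over N = 4 experiments
profile_a = [1.0, 2.0, 0.5, -1.0]
profile_b = [1.0, 2.0, 0.5, -1.0]
profile_c = [-1.0, -2.0, 0.5, 1.0]

print(euclidean_distance(profile_a, profile_b))  # identical profiles
print(euclidean_distance(profile_a, profile_c))  # opposite behavior
```

Euclidean distance is only one possible choice; correlation-based measures are also common when the shape of the profile matters more than its magnitude.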
Once we know this, how do we define groups?
Hierarchical Clustering
Michael Eisen, 1998: Generate a tree based on similarity (similar to a phylogenetic tree).
Each gene is a leaf on the tree.
Distances reflect similarity of expression.
[Figure axes: Genes, Experiments]
Gene Cluster
Internal nodes represent different functional groups (A, B, C, D, E)
One gene may belong to more than one cluster
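Building the tree can be sketched as repeatedly merging the two closest clusters until a single cluster remains. The code below is a minimal illustration that assumes Euclidean distance and average linkage (other distance measures and linkage rules are possible); the gene names and profiles are made up.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]  # every gene starts as its own leaf
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # average distance over all gene pairs drawn from the two clusters
            d = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
            d /= len(a) * len(b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical expression profiles over three experiments
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [-1.0, -2.0, -3.0],
    "geneD": [-0.9, -2.2, -3.1],
}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

Each merge corresponds to an internal node of the tree, so a gene belongs to every nested cluster on the path from its leaf to the root.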
Clusters can be presented by graphs
What can we learn from clusters with similar gene expression?
EXAMPLE: hnRNP A1 and SRp40
hnRNP A1 and SRp40 are not clear homologs based on BLAST E-value, but they have very similar gene expression patterns in different tissues.
Are hnRNP A1 and SRp40 functional homologs?
[Figure: SRp40 and hnRNP A1, with elements labeled SF]
YES!
What can we learn from clusters with similar gene expression?
• Similar expression between genes can mean:
– The genes have similar function
– One gene controls the other
– All genes are controlled by a common regulatory gene
How can we use microarrays for diagnostics?
Gene-Expression Profiles in Hereditary Breast Cancer
• Breast tumors studied: BRCA1, BRCA2, and sporadic tumors
• Log-ratio measurements of 3226 genes for each tumor after initial data filtering
cDNA Microarrays: Parallel Gene Expression Analysis
RESEARCH QUESTION
Can we distinguish BRCA1 from BRCA2 cancers based solely on their gene expression profiles?
How can microarrays be used as a basis for diagnostics?
5 Breast Cancer Patients
        Patient 1   Patient 2   Patient 3   Patient 4   Patient 5
Gen1        +           -           -           +           +
Gen2        +           +           -           +           -
Gen3        -           +           +           +           -
Gen4        +           +           +           -           -
Gen5        -           -           +           -           +
How can microarrays be used as a basis for diagnostics?
After reordering the patients and the genes:
        Patient 1   Patient 2   Patient 4   Patient 3   Patient 5
Gen1        +           -           +           -           +
Gen3        -           +           +           +           -
Gen4        +           +           -           +           -
Gen2        +           +           +           -           -
Gen5        -           -           -           +           +
[Figure labels: Informative Genes; BRCA1, BRCA2]
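One way to read the reordered table is that an informative gene has the same call within each patient group and opposite calls between the groups. The sketch below applies that rule to the +/- table from the slide. The grouping (patients 1, 2, 4 versus patients 3, 5) is an assumption read off the column order, and the function name is illustrative.

```python
# +/- calls per gene for patients 1..5, in the original column order
calls = {
    "Gen1": "+--++",
    "Gen2": "++-+-",
    "Gen3": "-+++-",
    "Gen4": "+++--",
    "Gen5": "--+-+",
}

# Assumed grouping, taken from the reordered columns (0-based patient indices)
group_a = [0, 1, 3]  # patients 1, 2, 4
group_b = [2, 4]     # patients 3, 5

def informative_genes(table, first, second):
    """Genes whose call is uniform within each group and differs between groups."""
    selected = []
    for gene, row in table.items():
        calls_first = {row[i] for i in first}
        calls_second = {row[i] for i in second}
        if len(calls_first) == 1 and len(calls_second) == 1 and calls_first != calls_second:
            selected.append(gene)
    return selected

informative = informative_genes(calls, group_a, group_b)
print(informative)
```

With real data the calls are continuous log ratios rather than +/-, so a statistical test per gene would replace the exact-match rule used here.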
Specific Examples
Cancer Research
Ramaswamy et al., 2003, Nat Genet 33:49-54
Hundreds of genes that differentiate between cancer tissues in different stages of the tumor were found.
The arrow shows an example of tumor cells which were not detected correctly by histological or other clinical parameters.
Supervised approaches for predicting gene function based on microarray data
Support Vector Machine
• An SVM begins with a set of genes that have a common function (red dots). In addition, a separate set of genes that are known not to be members of the functional class (blue dots) is specified.
• Using this training set, the SVM learns to differentiate between the members and non-members of a given functional class based on expression data.
• Having learned the expression features of the class, the SVM can recognize new genes as members or as non-members of the class based on their expression data.
Using SVMs to diagnose tumors based on expression data
Each dot represents a vector of the expression pattern taken from a microarray experiment, for example the expression pattern of all genes from a cancer patient.
How do SVMs work with expression data?
In this example, red dots can be primary tumors and blue dots are from the metastasis stage.
The SVM is trained on data which was classified based on histology.
After training the SVM, we can use it to diagnose the unknown tumor (marked "?" in the figure).
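The train-then-diagnose workflow can be illustrated with a very small linear SVM. This sketch assumes a linear kernel trained by sub-gradient descent on the regularized hinge loss, which is a simplification of what SVM packages do; the two-gene expression vectors, labels, and parameter values are all made up for illustration.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # sample is inside the margin: move the boundary away from it
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # sample is safely classified: only apply regularization
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating line."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical training set: log ratios of two genes per tumor,
# +1 = primary tumor (red dots), -1 = metastasis (blue dots)
training = [[2.0, 2.5], [3.0, 3.0], [2.5, 3.5],
            [-2.0, -1.5], [-3.0, -2.0], [-2.5, -3.0]]
histology = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(training, histology)
unknown_tumor = [2.2, 2.8]
diagnosis = predict(w, b, unknown_tumor)
print("primary" if diagnosis == 1 else "metastasis")
```

In practice each tumor is described by thousands of genes rather than two, and the classifier should be evaluated on tumors that were not used for training.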
Projects 2012-13
Key dates
13.12 Lists of suggested projects published
** You are highly encouraged to choose a project yourself or find a relevant project which can help in your research
22.1 Submission of project overview (one page)
- Title
- Main question
- Major tools you are planning to use to answer the questions
Final week – meetings on projects
12.3 Poster submission
20.3 Poster presentation
Instructions for the final project
Introduction to Bioinformatics 2012-13
2. Planning your research
After you have described the main question or questions of your project, you should carefully plan your next steps.
A. Make sure you understand the problem and read the necessary background to proceed.
B. Formulate your working plan, step by step.
C. After you have a plan, start by extracting the necessary data and decide on the relevant tools to use at the first step. When running a tool, make sure to summarize the results and extract the relevant information you need to answer your question. It is recommended to save the raw data for your records; don't present raw data in your final project. Your initial results should guide you towards your next steps.
D. When you feel you have explored all the tools you can apply to answer your question, you should summarize and draw conclusions. Remember that NO is also an answer, as long as you are sure it is NO. Also remember that this is a course project, not only a homework exercise.
3. Summarizing final project in a poster (in pairs)
Prepare in PPT, poster size 90-120 cm
Title of the project
Names and affiliation of the students presenting
The poster should include 5 sections:
Background: should include a description of your question (can add a figure)
Goal and Research Plan: describe the main objective and the research plan
Results (main section): present your results in 3-4 figures, describe each figure (figure legends), and give a title to each result
Conclusions: summarize the conclusions of your project in points
References: list the references of the papers/databases/tools used for your project
Examples of posters will be presented in class