Tag-based expression/function analysis

Transcript of Tag-based expression/function analysis

Page 1: Tag-based expression/function analysis
Page 2: Tag-based expression/function analysis

Tag-based expression/function analysis

Data files are on the course web page (link under today’s date), and also at: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

Page 3: Tag-based expression/function analysis

Where are we now?
• R to do statistics
• Genome browsers and Galaxy to visualize genes and genomics data
• Analyzing expression by microarrays + R and Bioconductor
• Tag analysis
• Proteomics

Page 4: Tag-based expression/function analysis

What we want in transcriptomics
• Know which transcripts are transcribed, and how much they are transcribed
– Implicitly also which transcripts exist in the cell, and what they look like!

• Intuitively, we could get all this information by sequencing all mRNAs in one cell

Page 5: Tag-based expression/function analysis

General problems with cDNA sequencing:

Reverse transcriptase falls off
Hard to sequence long transcripts

Many cDNAs are identical, but some occur only once per cell (or less!). Need to sequence MANY cDNAs
Very expensive if you want to sequence all molecules

Page 6: Tag-based expression/function analysis

Solutions:

1) Do not sequence: use probes and hybridization: microarrays and tiling arrays (this is where we are now!)

2) Only sequence parts of transcripts: tag sequencing (this is where we are heading)

Page 7: Tag-based expression/function analysis

Thought exercise

• What are the pros/cons of hybridization (micro/tiling arrays) vs sequencing? 2 minutes with your neighbor

Page 8: Tag-based expression/function analysis

Albin’s take

Hybridization
• + Cheap (per “gene”)
• + Mature methods
• + Standardized
• - Complex normalization needed
• - Cross-hybridization
• - Highly dependent on annotation of probes
• - Dependent on designed probes for genes
• - Cannot deal with repeats
• +/- Integrative signal (more on next slide)

Sequencing
• - Expensive (now, but changing)
• - “Unbiased” - no designed probes
• - Non-standard computational methods
• - More demanding processing (now)
• - Much easier statistics in the end
• + Less noisy
• + Much higher resolution - up to nucleotide level
• + Location information
• +/- Sampled signal (more on next slides)

Page 9: Tag-based expression/function analysis

Hybridization: integrative

We have many identical probes. Each time a probe gets a hybridization event, we add a little to the signal.

This includes non-optimal hybridization events - just something labeled that hybridizes will give some signal

Page 10: Tag-based expression/function analysis

Sequencing: sampling

The number of cDNAs in a library is VERY LARGE

We pick only some of them to do sequencing, randomly

Blind sampling (does not know anything about RNAs)

We map sequences back to the genome (a kind of quality check)
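To make the sampling idea concrete, here is a minimal R sketch. The transcript names, abundances and the number of sequenced cDNAs are made-up assumptions, not course data. It draws a random sample from a library and compares the observed counts with what random sampling would give on average.

```r
# Hypothetical library: three transcripts with very different abundances
set.seed(1)                                            # fixed seed, reproducible sketch
abundance <- c(geneA = 10000, geneB = 100, geneC = 1)  # assumed cDNA copies in the library
n_sequenced <- 500                                     # we can only afford a small sample

# Draw n_sequenced cDNAs at random, with probability proportional to abundance
counts <- rmultinom(1, size = n_sequenced, prob = abundance)[, 1]
print(counts)

# Average tag count we would expect for each transcript under random sampling
expected <- n_sequenced * abundance / sum(abundance)
print(round(expected, 2))
```

Rare transcripts will often get zero tags in a sample of this size, which is why MANY cDNAs have to be sequenced.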

Page 11: Tag-based expression/function analysis

Why is this interesting?
• Sequencing approaches are generally better than hybridization in quality, and you can also do more diverse experiments

• New sequencers make it possible to do this almost as cheaply as with hybridization – normal research groups can now buy the capacity of an old sequencing centre

• It is basically the technology of the future

Page 12: Tag-based expression/function analysis

5 types of sequencing data for expression (and functional) studies

• Non-subtracted cDNA
• ESTs
• SAGE
• CAGE
• RNA-seq

Page 13: Tag-based expression/function analysis

Why so many techniques?

• Historical reasons – technology development over time

• Some of these technologies are only for expression – others also give other information (and different information)

• Difference in costs - efficiency

Page 14: Tag-based expression/function analysis

Non-subtracted cDNA

• Theoretically possible to sequence all cDNAs in a cell

• Very, very expensive!
• Hard to get true expression, since amplification is length-dependent
• Not really necessary to have the whole cDNA for expression?

Page 15: Tag-based expression/function analysis

Expressed sequence tags (ESTs)

Sequence from 5’ and 3’ ends – until the reverse transcriptase falls off

Cheaper than full-length cDNAs

Problems: many ESTs are simply trash – the result of over-enthusiastic sequencing

For longer genes, no coverage of the middle part

Page 16: Tag-based expression/function analysis

How can we use ESTs?

• View the ESTs as a random sample from a pool of transcripts:
– The number of ESTs found from a transcript should be proportional to the concentration of that transcript in the cell = the expression

• How do we know which transcript an EST comes from?

Page 17: Tag-based expression/function analysis

UniGene: clustering ESTs to “genes”

Back in the 90s, the idea was to use a lot of ESTs to find, and puzzle together, genes

The UniGene database is one of the outcomes of this. Slightly obsolete, but useful at times

Basically, it tries to cluster ESTs and cDNAs to functional units: “genes”

Bonus: we can use this to look at expression of these genes – because we can count ESTs from different libraries

Page 18: Tag-based expression/function analysis

Thought exercise: How?

• Say that we have two lung EST libraries (= two collections of tags) from two patients, one of whom has lung cancer

• How can we prove that a given gene, like RARA, is significantly altered in expression in lung cancer?

• Think R! What do we need, and what tests should we use?

• 2 minutes with your neighbor

Page 19: Tag-based expression/function analysis

“Electronic Northern blot”

• In a nutshell: Fill in the following contingency table for a given gene

              | ESTs from tissue A | ESTs from tissue B
RARA          |                    |
Rest of ESTs  |                    |

Fisher exact test situation!

We can do this within unigene for single genes

Page 20: Tag-based expression/function analysis

Side-story for non-life-scientists: Northern what?

• Northern blot is classical method for detecting RNA molecules

• Related to Southern and Western blot (DNA and protein detection methods)

Page 21: Tag-based expression/function analysis
Page 22: Tag-based expression/function analysis

However…

• An electronic Northern is just a clever name, although it has the same goals - finding RNAs

• It is nothing more than a statistical over-representation test of mRNAs, by use of ESTs

Page 23: Tag-based expression/function analysis

Unigene:

• http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene

• …or just google for unigene

Page 24: Tag-based expression/function analysis

EST hits from different tissues

Public microarray data (nice for comparison - but not important now)

Let’s look at the tissue constraints of human RARA…

Page 25: Tag-based expression/function analysis
Page 26: Tag-based expression/function analysis

Note that the sample sizes are very different!
1 tag out of 282332 is not the same as 1 tag out of 131488

Page 27: Tag-based expression/function analysis

What is TPM?

TPM = Tags per million
A normalization to be able to compare libraries of different sizes. Used very often for tag-based expression.

“How many tags would my gene have if the sample size were 1 million?”

…so, 10^6 * (#tags in my gene)/(#total tags)
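As a small illustration of the formula, the following R sketch computes TPM for the two library sizes mentioned earlier, each with a single tag for the gene. The function name `tpm` is only a convenience chosen for this example.

```r
# Tags per million: rescale a raw tag count to a library size of one million
tpm <- function(tags_in_gene, total_tags) {
  1e6 * tags_in_gene / total_tags
}

# One tag in each of the two libraries of different size
tpm(1, 282332)
tpm(1, 131488)
```

The same raw count gives more than twice the TPM in the smaller library, which is why raw counts should not be compared directly.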

Page 28: Tag-based expression/function analysis

Challenge

• Is the RARA gene significantly different in expression in eye vs blood?

Page 29: Tag-based expression/function analysis

              | ESTs from blood | ESTs from eye
Gene X        | 12              | 12
Rest of ESTs  | 124139-12       | 210756-12

Page 30: Tag-based expression/function analysis

> a <- matrix(c(12, 12, 124139-12, 210756-12), nrow=2, byrow=TRUE)
> fisher.test(a)

Fisher's Exact Test for Count Data

data: a
p-value = 0.2078
# so, despite a roughly 1.7-fold higher TPM in blood, not significant

Page 31: Tag-based expression/function analysis

So ESTs are fantastic?
…not really!
Sometimes useful, but
There are too few of them, and the libraries are very diverse
…and they are way too expensive to make routinely in a normal lab

Basically, ESTs are rarely used now, but it is data worth considering

Page 32: Tag-based expression/function analysis

Modern tag sequencing

• SAGE, CAGE and RNA-seq

Page 33: Tag-based expression/function analysis

Underlying idea:

• Only sequence as much as you need: 5', 3' or whole cDNA (in pieces)

• Map tags to known cDNAs or the genome (Thought exercise: what is the difference?)

Page 34: Tag-based expression/function analysis

SAGE

Page 35: Tag-based expression/function analysis

SAGE

• After sequencing:
– Mask out adapters and primers
– Make a database of all possible hits in mRNAs following the restriction site (white board demo)
– Map tags to this database, or the genome

• Mapping is surprisingly tricky
– We cannot use BLAST or BLAT alignments (too short sequences)
– Sequencing errors exist, as well as RNA editing
– Some species have very few known mRNAs

Page 36: Tag-based expression/function analysis

Common approach

First identify all unique tags, and how many times we have seen them
AAAGATGCTGC 67
CAGTCGATCGAT 192
…
Correlate these tags with our gene database. Sum up all the tags for each gene
Do expression analysis!
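The counting steps above can be sketched in a few lines of R. The tag sequences, the `tag2gene` lookup and the gene names are all hypothetical; a real lookup would be built from an mRNA database as described for SAGE.

```r
# Hypothetical raw tag sequences as they might come off the sequencer
tags <- c("AAAGATGCTGC", "CAGTCGATCGAT", "AAAGATGCTGC",
          "TTTGGGCCCAAA", "CAGTCGATCGAT", "AAAGATGCTGC")

# Step 1: unique tags and how many times each was seen
tag_counts <- table(tags)

# Step 2: hypothetical tag-to-gene lookup (two tags belong to the same gene)
tag2gene <- c(AAAGATGCTGC = "geneA", CAGTCGATCGAT = "geneB", TTTGGGCCCAAA = "geneA")

# Step 3: sum the counts of all tags that belong to the same gene
genes <- tag2gene[names(tag_counts)]
gene_counts <- tapply(as.vector(tag_counts), genes, sum)
print(gene_counts)
```

Tags that are missing from the lookup would get `NA` as gene and drop out of the sums, so in practice it is worth checking how many tags remain unassigned.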

Page 37: Tag-based expression/function analysis

How can we analyze count data?

• The difference from microarrays is that we deal with integers

• The more counts for a gene, the more expressed it is - theoretically a linear relation. We are theoretically counting actual RNA molecules

• Very much like the EST case, we can make statistics based on contingency tables if we have two samples

Page 38: Tag-based expression/function analysis

Data flow for tags

…is a bit too complex for this course to do in real life - takes time and requires programming (and a big computer)

Mapping of tags to genes is complex, and no standard solutions are adopted (yet)

Statistical analysis often involves making multiple Fisher exact tests - this involves some R programming

To get a feeling for the data, we will instead use a website to do these things for us
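For those curious what the "multiple Fisher exact tests" might look like in R, here is a minimal sketch. The gene names, counts and library sizes are invented for illustration, and the multiple-testing correction with `p.adjust` is an addition of this sketch, not something the web tool is known to do.

```r
# Hypothetical per-gene tag counts in two libraries (normal vs cancer)
counts <- data.frame(
  gene   = c("geneA", "geneB", "geneC"),
  normal = c(12, 150, 3),
  cancer = c(40, 160, 2)
)
# Assumed library sizes (total mapped tags), not taken from real data
total_normal <- 50000
total_cancer <- 60000

# One 2x2 Fisher exact test per gene: gene vs rest of tags, normal vs cancer
p_values <- sapply(seq_len(nrow(counts)), function(i) {
  tab <- matrix(c(counts$normal[i], counts$cancer[i],
                  total_normal - counts$normal[i],
                  total_cancer - counts$cancer[i]),
                nrow = 2, byrow = TRUE)
  fisher.test(tab)$p.value
})

# Many tests are run, so adjust the p-values (Benjamini-Hochberg, as one option)
counts$p_value <- p_values
counts$p_adj   <- p.adjust(p_values, method = "BH")
print(counts)
```

Each table has the same layout as the single-gene EST example: the gene in the first row, the rest of the library in the second.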

Page 39: Tag-based expression/function analysis

Typical data after mapping:

Tag Frequency
AAAAAAAAAA 173
AAAAAAAAAG 1
AAAAAAAAAT 1
AAAAAAAATA 2
AAAAAAACAA 1
AAAAAAACTA 2
AAAAAAATAA 1

We want to go from here to actual counts per gene: we will let a web system do this for us

Page 40: Tag-based expression/function analysis

• In the data directory, I have collected two such files: SAGE_Colon…, corresponding to normal and cancer colon

• These are linked in the web page, also here: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

• Then, go to http://cgap.nci.nih.gov/SAGE/
• This page has many SAGE-related analyses. We will try Digital Gene Expression Displayer (DGED)

Page 41: Tag-based expression/function analysis

Challenge

• Using DGED
• Use the “Two of your files” option to use the two colon samples. Select “short tags”
• Try to understand what the statistical test does (accept defaults)
• What types of genes are “over-expressed” in colon i) cancer tissue vs normal tissue, ii) normal tissue vs cancer tissue

Page 42: Tag-based expression/function analysis

Thought exercise

• What are the limitations with SAGE?

Page 43: Tag-based expression/function analysis

Albin’s take

• We can only measure expression – the location of tags in genes has no functional meaning

• Dependent on gene annotation - we can map to the genome, but hard to interpret such data (what genes?)

• Compared to array data: very few standard analysis methods

• Limited sequencing depth

Page 44: Tag-based expression/function analysis

5’ tagging

• Three methods that really do the same thing. The differences lie in chemistry, throughput and length of tags
– CAGE
– 5’SAGE
– 5’ Oligo-capping

• We will use CAGE as an example (“Cap Analysis of Gene Expression”)

Page 45: Tag-based expression/function analysis

Sequencing and mapping to the genome

CAGE

Page 46: Tag-based expression/function analysis

CAGE vs …

• SAGE
– Conceptually the same thing, but you catch the 5’ end of the gene: the transcription start site and thereby the promoter – which is a functional entity
– Higher number of tags
– 5’ ends give functional data apart from expression

Page 47: Tag-based expression/function analysis

Issues

• Only capped transcripts
– Some real transcripts are not capped
– Some capped transcripts are not full-length

• Associating 5’ ends with gene products is sometimes problematic
– We only know starts of genes, not the length

• Tag length is borderline for mapping - 20-21 bp
• Not clear how to define cutoffs - how many tags make a “real biological promoter”?
• Under-sampling: we miss a lot of promoters because there are so many of them

Page 48: Tag-based expression/function analysis

Strengths
We are actually looking at promoters, not genes
Find novel promoters - sometimes within known genes
We can look at expression at promoter level - for instance define “tissue-specific” promoters
We can get a first unbiased look at where promoters are, and how much they are used in a given cell

Page 49: Tag-based expression/function analysis

CAGE concepts

• The atomic unit in CAGE is the tag, mapped to the genome. The tag comes from a given experiment (and has a label)

• What positional information is the most relevant for analysis?

[Figure: the tag, 20-21 bp long, mapped to the genome, with a question mark at each end]

Page 50: Tag-based expression/function analysis

Only 5’ ends are interesting!

• …since the 20 bp length is only for mapping purposes.

• What if we have many tags overlapping one another? How can we represent this?
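One possible answer to the representation question, sketched in R: reduce every mapped tag to its 5’ end and count tags per position. The data frame layout (columns `chr`, `start`, `end`, `strand`) and the coordinates are assumptions made for this example, not a standard CAGE file format.

```r
# Hypothetical mapped CAGE tags: start/end are genome coordinates, strand is + or -
tags <- data.frame(
  chr    = c("chr1", "chr1", "chr1", "chr1", "chr1"),
  start  = c(100, 100, 102, 300, 281),
  end    = c(119, 120, 121, 319, 300),
  strand = c("+", "+", "+", "+", "-")
)

# The 5' end is the start on the plus strand and the end on the minus strand
tags$five_prime <- ifelse(tags$strand == "+", tags$start, tags$end)

# Overlapping tags collapse into one count per (chromosome, strand, position)
tss_counts <- aggregate(count ~ chr + strand + five_prime,
                        data = transform(tags, count = 1), FUN = sum)
print(tss_counts)
```

Note that the plus-strand and minus-strand tags ending up at position 300 stay separate, since they come from different strands.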

Page 51: Tag-based expression/function analysis

Some soon-to-be-outdated terminology

Page 52: Tag-based expression/function analysis

So…

• Unlike SAGE, CAGE can be viewed as a “barplot” on the genome, at nucleotide level

• How to cluster nearby CAGE tags to a meaningful “promoter” is an open problem
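Since clustering is an open problem, the following R sketch shows only one naive heuristic: start a new cluster whenever the gap between neighbouring 5’ end positions exceeds a threshold. The positions, counts and the 20 bp threshold `max_gap` are arbitrary assumptions for illustration.

```r
# Hypothetical 5' end positions on one strand of one chromosome, with tag counts
pos   <- c(100, 101, 103, 110, 250, 252, 600)
count <- c(  5,  20,   3,   1,   8,   2,   1)

# "Barplot" view as a table: one row per nucleotide position
print(data.frame(pos, count))

# Naive clustering: new cluster whenever the gap to the previous position
# is larger than max_gap (an arbitrary, assumed threshold)
max_gap <- 20
cluster <- cumsum(c(1, diff(pos) > max_gap))

# Summarise each cluster: span on the genome and total tag count
clusters <- data.frame(
  start = as.vector(tapply(pos, cluster, min)),
  end   = as.vector(tapply(pos, cluster, max)),
  tags  = as.vector(tapply(count, cluster, sum))
)
print(clusters)
```

A different threshold gives different "promoters", which is exactly why the definition is hard to settle.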

Page 53: Tag-based expression/function analysis

Within a promoter…

• …we can do exactly the same Fisher exact tests as before (as SAGE or ESTs do for whole genes)

• What is the advantage/disadvantage of doing this on promoters instead of genes? (2min)

Page 54: Tag-based expression/function analysis

The big answer: alternative promoters with different tissue usage
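A small R sketch of what this could look like for one gene with two alternative promoters. The promoter names, tissues and tag counts are invented. Note that this variant tests promoter against promoter within the gene, rather than promoter against the rest of the library.

```r
# Hypothetical tag counts for two alternative promoters of one gene in two tissues
usage <- matrix(c(40,  5,
                   8, 30),
                nrow = 2, byrow = TRUE,
                dimnames = list(promoter = c("P1", "P2"),
                                tissue   = c("liver", "brain")))
print(usage)

# Fisher exact test on the promoter-by-tissue table:
# is the choice of promoter independent of the tissue?
test <- fisher.test(usage)
print(test$p.value)

# Fraction of the gene's tags coming from each promoter, per tissue
print(prop.table(usage, margin = 2))
```

Summing the two promoters per tissue gives 48 and 35 tags, so a gene-level comparison would hide most of the difference that is obvious at promoter level.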

Page 55: Tag-based expression/function analysis

CAGE resources
• Genomic element viewer (very similar to UCSC browser)
– CAGE tags and cDNA landscapes
– Easiest by the links on fantom.gsc.riken.jp/3

Page 56: Tag-based expression/function analysis
Page 57: Tag-based expression/function analysis
Page 58: Tag-based expression/function analysis

Clicking on CAGE clusters gives two options:
CAGE analysis viewer
CAGE basic viewer

Page 59: Tag-based expression/function analysis

CAGE resources

• Basic CAGE viewer
– Comprehensive browser of CAGE tags and CAGE tag clusters, and library information

Page 60: Tag-based expression/function analysis
Page 61: Tag-based expression/function analysis

Challenge
• Look at the RARA gene in the MM5 assembly in the genomic elements viewer (browser) (so, NOT UCSC).

• How many alternative promoters does it have?

• Are any of these biased towards certain tissues?

Page 62: Tag-based expression/function analysis

Some points

• Not that easy to say which of these promoters are “significant”

• Easy to get overwhelmed by numbers when counting tags

Page 63: Tag-based expression/function analysis

Back to work…

• We can treat CAGE tag counts, or really TPMs in a promoter as expression

• We can do the same analyses as in microarrays - including the typical heatmap

• We will do a small exploratory study of some CAGE data
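As a rough preview of the heatmap idea, here is an R sketch on simulated data. The TPM matrix, the promoter names and the tissue names are all made up, and the log transform with a pseudocount is a common convention assumed here rather than a step prescribed by the exercise.

```r
# Hypothetical TPM matrix: rows are promoters, columns are tissues
set.seed(42)
tpm <- matrix(rexp(5 * 4, rate = 1 / 50), nrow = 5,
              dimnames = list(paste0("promoter", 1:5),
                              c("liver", "brain", "lung", "heart")))

# Log-transform with a pseudocount of 1 so that zeros are defined
log_tpm <- log2(tpm + 1)

# Clustered heatmap from base R, scaled per row (per promoter)
heatmap(log_tpm, scale = "row")
```

Scaling per row makes promoters with very different overall expression levels comparable in the same plot.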

Page 64: Tag-based expression/function analysis

• http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

Page 65: Tag-based expression/function analysis

Walk-through of CAGE exercise

• Also at http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

• …together with updated slides
• And linked from web page