MASSIVELY PARALLEL SIGNATURE SEQUENCING

Gene expression analysis by massively parallel signature sequencing (MPSS). By: Dr. Ashish C Patel, Assistant Professor, Vet College, AAU, Anand


DNA microarrays, serial analysis of gene expression (SAGE), cDNA sequencing and a variety of other technologies are available for analysing the expression of hundreds to thousands of genes simultaneously.

Each of these existing technologies has limitations when it comes to generating complete data sets for building relational databases.

Massively Parallel Signature Sequencing (MPSS) is an open-ended platform that analyses the level of gene expression in a sample by counting the number of individual mRNA molecules produced by each gene.

Massively Parallel Signature Sequencing (MPSS) is a sequencing technique developed by Sydney Brenner. It relies on a bacteria-free, bead-based library preparation known as “Megaclone” technology.

In MPSS, mRNA transcripts do not need to be known in advance and can be discovered de novo.

Genes expressed at low levels can be quantified by MPSS.

All clones in a microbead library can be sequenced simultaneously (hence “massively parallel”).

MPSS produces data in a digital format. It captures data by counting virtually all mRNA molecules in a tissue or cell sample.

All genes are analysed simultaneously, and bioinformatics tools are used to sort out the number of mRNAs from each gene relative to the total number of molecules in the sample.

Even genes that are expressed at low levels can be quantified with high accuracy.

Counting mRNAs with MPSS is based on the ability to uniquely identify every mRNA in a sample by generating a 17-base sequence for each mRNA at a specific site upstream of its poly(A) tail.

This 17-base sequence is used as the mRNA's identification signature.

To measure the level of expression of any given gene, the total number of signatures for that gene’s mRNA is counted.
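As an illustration of how a signature can identify a transcript, here is a minimal Python sketch. It assumes that the signature is the DpnII recognition sequence GATC plus the 13 bases that follow it, taken at the GATC site closest to the poly(A) tail. The function name and the toy sequence are hypothetical, and transcripts whose last site leaves fewer than 17 bases are simply skipped.

```python
from typing import Optional

DPNII_SITE = "GATC"      # DpnII recognition sequence
SIGNATURE_LENGTH = 17    # assumed: GATC plus the 13 bases that follow


def extract_signature(transcript: str) -> Optional[str]:
    """Return the 17-base signature at the 3'-most DpnII site, or None.

    `transcript` is the sense strand written 5'->3' in DNA letters,
    without the poly(A) tail.
    """
    seq = transcript.upper()
    pos = seq.rfind(DPNII_SITE)  # 3'-most GATC, or -1 if there is none
    if pos == -1 or pos + SIGNATURE_LENGTH > len(seq):
        # No DpnII site, or too few bases after it (simplification).
        return None
    return seq[pos:pos + SIGNATURE_LENGTH]


# Toy transcript with two GATC sites; only the 3'-most one is used.
toy_transcript = "ATGGATCCCAAA" + "GATC" + "ACGTACGTACGTA" + "TT"
toy_signature = extract_signature(toy_transcript)
print(toy_signature)
```

A transcript without any GATC site returns `None`, which corresponds to the case of a gene that cannot be detected with DpnII.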

Principle of MPSS

A sample's mRNAs are first converted to cDNA using reverse transcriptase. Each cDNA is fused to a small oligonucleotide "tag", which allows the cDNA to be PCR-amplified and then coupled to microbeads. After several rounds of sequence determination, using hybridization of fluorescently labeled probes, a sequence signature of ~16-20 bp is determined from each bead.

Fluorescent imaging captures the signal from all of the beads, so DNA sequences are determined from all the beads in parallel. Approximately 1,000,000 sequence reads are obtained per experiment.

Procedure of Cloning and Sequencing cDNA Fragments on Beads

MPSS signatures for mRNAs in a sample are generated by sequencing double-stranded (ds) cDNA fragments cloned onto microbeads using the Lynx Megaclone technology.

Poly(A) mRNA molecules are converted into double-stranded cDNA molecules using a biotinylated oligo(dT) primer. Streptavidin is used to purify the biotinylated cDNA.

The cDNA is digested with DpnII, and the cDNA fragments are cloned into a specially designed plasmid vector containing a unique barcode tag.

A total of 16.8 x 10^6 (about 16.8 million) different 32-base sequences are available in the reference tag library, and each cDNA clone contains a different sequence.
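The figure of about 16.8 million is consistent with a tag assembled from eight 4-base "words", each chosen from a set of eight words. That construction is an assumption here and is not stated in the slides; the short check below only shows that the arithmetic matches a 32-base tag.

```python
WORDS_PER_TAG = 8   # assumed number of words in one tag
WORD_CHOICES = 8    # assumed number of distinct words to choose from
WORD_LENGTH = 4     # assumed bases per word

tag_length = WORDS_PER_TAG * WORD_LENGTH     # bases in one tag
tag_count = WORD_CHOICES ** WORDS_PER_TAG    # distinct tags available

print(f"{tag_length}-base tags, {tag_count:,} possible sequences")
```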

The library of cDNA inserts with oligonucleotide tags is PCR-amplified.

The resulting linear molecules are partially digested with an exonuclease to make the 32-base tag single-stranded.

The 32-base tags at the end of each of the cDNA molecules are hybridised to the complementary 32-base tags on the microbeads.

The end-product is a microbead with approximately 100,000 identical cDNA molecules covalently attached to the surface.

Adaptor ligation

Encoded adaptors are ligated to the ends of the cDNA molecules attached to the microbeads.

Decoder hybridization

Sixteen different fluorescently labelled decoder probes are then sequentially hybridised to the encoded adaptors in order to determine the first four nucleotides at the end of each molecule.

Sequences of encoded adaptors

• Four-base overhang sequences complementary to the bead-bound cDNA ends are shown in bold.

• Decoder binding sites are shown in lowercase.

17-base signature determination

The encoded adaptor from the first round is then removed by digestion with BbvI, which exposes the next four nucleotides as a four-base single-stranded overhang.

The process is repeated several times in order to generate a total of 17 bases of sequence.
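The decoding logic can be sketched in Python as follows. This is only a toy model: it assumes that the sixteen decoder probes correspond to the sixteen combinations of overhang position (1 to 4) and base (A, C, G, T), and the probe numbering is hypothetical. Each round yields one four-base word, and words from successive rounds are concatenated.

```python
from itertools import product

BASES = "ACGT"
# Hypothetical numbering: probe i reports one (position, base) pair.
DECODERS = dict(enumerate(product(range(4), BASES)))


def decode_round(positive_probes):
    """Turn the four probes that gave a signal on a bead into a 4-base word."""
    word = [None] * 4
    for probe in positive_probes:
        position, base = DECODERS[probe]
        if word[position] is not None:
            raise ValueError("two probes report the same position")
        word[position] = base
    if None in word:
        raise ValueError("a position was not decoded")
    return "".join(word)


def assemble(rounds):
    """Concatenate the words read in successive ligation/cleavage rounds."""
    return "".join(decode_round(r) for r in rounds)


# Two toy rounds of positive probes for one bead.
print(assemble([[2, 4, 11, 13], [0, 5, 10, 15]]))
```

In the instrument each bead is followed through all rounds by imaging, so the same assembly would run once per bead position.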

Approximately one million microbeads are loaded into a specially designed flow-cell in a way that allows them to stack together along channels and form a tightly packed monolayer in the flow-cell.

The flow-cell is connected to a computer-controlled microfluidics network that delivers different reagents for the sequencing reactions.

Sequencing

A high-resolution CCD camera is positioned directly over the flow-cell in order to capture fluorescent images from the microbeads at specific stages of the sequencing reactions.

MPSS system

Dot plot showing the reproducibility between MPSS runs

The plot shows a total of 10,799 signatures that were generated from two independent MPSS runs, each drawn as a dot or an x. The X and Y coordinates represent the number of times each signature was counted in the two runs. Signatures represented by a dot fall within a 0.99 confidence interval, while those represented by an x fall outside this interval.

Accuracy of MPSS for gene expression measurements

Factors that contribute to the false signatures include errors in the yeast genomic sequence, and errors introduced through reverse transcription, PCR, and incorrect ligation of encoded adaptors to non-complementary overhangs or single-stranded tag complements on the microbeads.

The accuracy of gene expression measurements was also assessed by comparing expression levels of genes measured by MPSS and other conventional methods.

For example, gene expression in THP-1 cells was measured by MPSS analysis. A database of 1,619,000 signatures from MPSS analysis was generated from cDNAs derived from induced THP-1 cells. Separately, 1,839 clones were selected from the same cDNA library and conventionally sequenced.

The relative frequencies of the most highly expressed genes were similar and the error from MPSS analysis was extremely low, reflecting the advantage of large samples of templates.

A few of the expression measurements are not in agreement, such as apoferritin heavy-chain transcript (HSAFH1) and B94 protein mRNA (HUMB94).

Percentage total for the MPSS data is the average abundance. Percentage total for the EST data is the number of sequences clustered out of the 1,839 selected clones, with the 99% confidence interval. (A 1% relative abundance corresponds to about 2,500 microbead signatures for the MPSS data and to about 18 sequences for the EST data.)

DATA HANDLING AND CALCULATION OF RNA ABUNDANCE

A typical MPSS experiment with about one million microbeads will yield 250,000 to 400,000 high quality ~16-20 base signature sequences.

MPSS datasets are additive in nature, which means that datasets from multiple analyses of the same starting mRNA sample can be combined.

A combined dataset can involve in excess of one million mRNAs counted per sample, which increases sensitivity for all genes being analysed, particularly those that are expressed at very low levels within the sample.

Each signature sequence in an MPSS dataset is compared with all other signatures, and all identical signatures are counted.
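A minimal Python sketch of this counting step, with two runs of the same sample pooled before counting. The signature strings are made-up examples, not real data.

```python
from collections import Counter

# Hypothetical 17-base signatures from two runs of the same mRNA sample.
run_1 = ["GATCACGTACGTACGTA", "GATCTTTTGGGGCCCCA", "GATCACGTACGTACGTA"]
run_2 = ["GATCACGTACGTACGTA", "GATCGGGGAAAATTTTC"]

counts_1 = Counter(run_1)      # identical signatures are tallied together
counts_2 = Counter(run_2)
pooled = counts_1 + counts_2   # additive: per-signature counts are summed

for signature, n in pooled.most_common():
    print(signature, n)
```

Because the counts are simple tallies, adding a further run only requires adding another `Counter` to the pooled total.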

The level of expression of any single gene is calculated as:

The number of signatures from that gene
---------------------------------------------------------------------
The total number of signatures for all mRNAs in the dataset

The data for each gene are usually reported as transcripts per million (TPM). The numbers of genes that are expressed at varying levels within the sample can then be tabulated, for example genes expressed at greater than 1,000, 100 to 1,000, 10 to 100 and less than 10 TPM.
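The TPM calculation and the grouping into abundance classes can be sketched in Python as below. The gene names and counts are invented for illustration, and the treatment of values that fall exactly on a class boundary is an assumption.

```python
def to_tpm(counts):
    """Convert raw signature counts per gene to transcripts per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset contains no signatures")
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def abundance_class(tpm):
    """Assign a TPM value to one of four abundance classes.

    Boundary handling (e.g. exactly 100 TPM) is an assumption.
    """
    if tpm > 1000:
        return ">1,000"
    if tpm >= 100:
        return "100-1,000"
    if tpm >= 10:
        return "10-100"
    return "<10"


# Hypothetical counts; "other" stands for all remaining signatures.
counts = {"gene_a": 2500, "gene_b": 150, "gene_c": 20, "gene_d": 5,
          "other": 997_325}
tpm = to_tpm(counts)
classes = {gene: abundance_class(value)
           for gene, value in tpm.items() if gene != "other"}
print(classes)
```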

Troubleshooting MPSS data

MPSS signature sequences can be connected to known genes by comparison with data in the available genomic sequence and expressed sequence tag (EST) databases.

This process is not always efficient. Sometimes a signature for a gene is not found in a particular sample, for example when the gene does not contain a DpnII site or when there is a sequence polymorphism in the DpnII site.

These problems can be easily overcome by digesting the cDNA with an alternative enzyme.

Incomplete sequence representation of a gene in the current EST and cDNA clone databases can also complicate the process of assigning a signature sequence to a gene.

The sequence that corresponds to an MPSS signature for a specific gene may not be present in an EST sequence database.

For example, the signature sequence for the T-cell transcription factor NFATc did not appear at any significant level in human T-cell-related MPSS datasets when signatures were assigned using the RefSeq database.

Therefore, careful and thoughtful analysis of the available data may be necessary during the process of assigning an MPSS signature to a gene.

STATISTICAL ANALYSIS OF MPSS DATA

From a statistical point of view, MPSS data are categorical. The large number of measurements of a given signature in the dataset (typically ten to 1,000 or more), together with the size of the entire dataset (typically over one million), makes it possible to evaluate whether a particular gene signature is differentially expressed between samples.

To test whether a gene is differentially expressed between two samples, the Z-test employed for the analysis of SAGE datasets can be used.

If x1 and x2 represent the abundance of a specific signature in samples 1 and 2, respectively, and n1 and n2 represent the total number of signatures generated for all mRNAs in samples 1 and 2, then the counts x1 and x2 each follow a binomial distribution, with sample proportions p1 = x1/n1 and p2 = x2/n2.

Since n1 and n2 are large in MPSS (on the order of 10^6), the difference (p1 - p2) follows an approximately normal distribution, which gives the test statistic

Z = (p1 - p2) / sqrt( p q (1/n1 + 1/n2) )

where the unknown parameters p and q can be estimated as ^p = (x1 + x2) / (n1 + n2) and ^q = 1 - ^p.

This test statistic shows an inverse relationship between the level of expression and the size of the difference that can be evaluated between samples.

For example, at p < 0.001, it is possible to detect a two-fold change for a gene that is expressed at a level of 30-40 copies per million.

For genes that are expressed at a higher abundance, it is possible to detect a much smaller difference.

A 40 percent difference can be determined for genes that are expressed at about 200 copies per million.

This contrasts with microarrays, where a significance test is possible only if the experiment is replicated several times and where differential expression can usually be detected only for genes with relatively high levels of expression and with a large difference between samples.
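The Z-test on two MPSS samples can be sketched in Python as below, using the pooled estimates ^p and ^q defined earlier. The two-sided p-value, the function name and the example counts (one million signatures per sample) are assumptions made for illustration.

```python
import math


def z_test(x1, n1, x2, n2):
    """Pooled two-proportion Z-test for one signature in two samples.

    x1, x2: counts of the signature; n1, n2: total signatures per sample.
    Returns (Z, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)   # pooled estimate of p
    q_hat = 1.0 - p_hat
    se = math.sqrt(p_hat * q_hat * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        raise ValueError("signature absent from (or saturating) both samples")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Hypothetical counts per one million signatures.
z_low, p_low = z_test(35, 1_000_000, 70, 1_000_000)      # two-fold change
z_high, p_high = z_test(200, 1_000_000, 280, 1_000_000)  # 40 percent change
print(f"35 vs 70 TPM:   Z = {z_low:.2f}, p = {p_low:.1e}")
print(f"200 vs 280 TPM: Z = {z_high:.2f}, p = {p_high:.1e}")
```

With these toy inputs both comparisons fall below p = 0.001, in line with the detection limits quoted above, although the exact threshold depends on the dataset sizes and on whether a one- or two-sided test is used.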

COMPARISON OF MPSS WITH cDNA SEQUENCING, SAGE AND MICROARRAY TECHNOLOGIES

cDNA sequencing, SAGE and other technologies are similar to MPSS in that they are digital in nature and count mRNA molecules in the sample.

Direct sequencing of cDNAs was the first digital technology for measuring gene expression.

MPSS Vs. cDNA: Both MPSS and direct cDNA sequencing involve the generation of a cDNA library as the first step of analysis.

Once the cDNA library is made, sequencing of cDNA clones involves the purification and sequencing of DNA using standard procedures that are both costly and time consuming.

With Megaclone, at least one million cDNA molecules are cloned onto beads and with MPSS, over one million clones are sequenced simultaneously.

MPSS Vs. SAGE: SAGE is also a transcript-counting technique that generates a tag sequence for each mRNA, but MPSS has two advantages over it. First, the SAGE tag is 14 nucleotides long, compared with the 17-nucleotide signature of MPSS.

Tags of 14 nucleotides are 80 per cent unique, while the 17-nucleotide signatures generated with MPSS are approximately 95 per cent unique in the human genome (Zhang et al., 1999).
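One reason longer signatures are more often unique is that the number of possible sequences grows four-fold with every added base. The short Python check below only counts possible sequences; it does not reproduce the 80 and 95 per cent figures, which depend on the actual genome sequence.

```python
def possible_sequences(length: int) -> int:
    """Number of distinct DNA sequences of a given length (4 bases)."""
    return 4 ** length


sage_space = possible_sequences(14)   # 14-nucleotide SAGE tag
mpss_space = possible_sequences(17)   # 17-nucleotide MPSS signature
fold = mpss_space // sage_space       # three extra bases: 4 ** 3

print(f"14-mers: {sage_space:,}")
print(f"17-mers: {mpss_space:,} ({fold}x more)")
```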

With the human genome sequence available, a much higher percentage of MPSS signature sequences than of SAGE tags map to unique locations on the genome.

Secondly, MPSS makes it possible to efficiently produce a very large dataset of signature sequences.

Many SAGE tag sets comprise only 20,000 to 60,000 sequenced mRNAs. Given that a large percentage of genes are expressed at a level of 0.01 per cent or less, it is questionable whether a dataset of 20,000 to 60,000 sequenced mRNAs has enough depth to allow the quantitation and analysis of all genes within a sample, particularly those expressed at very low levels in the cell.

An MPSS dataset of one million or more signature sequences is more likely to provide a depth of analysis that will allow genes expressed at low levels to be accurately quantified.
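The effect of dataset size on a gene expressed at 0.01 per cent can be illustrated with a small Python calculation. It assumes that every sequenced tag is an independent random draw from the mRNA pool, which is a simplification; the function names are hypothetical.

```python
def expected_count(fraction, depth):
    """Expected number of tags for a gene at the given relative abundance."""
    return fraction * depth


def prob_unseen(fraction, depth):
    """Probability that the gene is not sampled at all (independent draws)."""
    return (1.0 - fraction) ** depth


FRACTION = 0.0001  # 0.01 per cent of all mRNA molecules

for depth in (20_000, 60_000, 1_000_000):
    print(f"{depth:>9,} tags: expect {expected_count(FRACTION, depth):6.1f}, "
          f"P(missed) = {prob_unseen(FRACTION, depth):.2e}")
```

Under these assumptions a 20,000-tag dataset is expected to contain only about two tags for such a gene, whereas a one-million-signature dataset is expected to contain about one hundred, which is enough for a meaningful count.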

MPSS Vs. Microarray: MPSS is most notable in that it is a technology that has the potential to capture virtually all genes present within the sample, and not just those that have been placed on the microarray.

No prior knowledge of a gene's sequence is required for MPSS.

Microarrays have the limitation that homologous genes can cross-hybridise, which makes it impossible to detect individual members of highly homologous gene families that have not been annotated previously.

With MPSS, however, the signature sequence in the 3’ untranslated region can be different for individual family members. Therefore, it is possible, in many cases, to differentiate highly homologous genes from each other.

The advantage of microarray is the high throughput analysis of multiple samples.

The microarray and MPSS technologies can be viewed as complementary in nature: different tools for different types of experiments.

For example, to generate in-depth and quantitative gene expression data for building complex relational databases, MPSS may be the technology of choice.

After these databases are mined for interesting biological information, it may be necessary to test whether sets of genes are differentially expressed in a large number of samples (e.g. tumours of a specific type). Here, the microarray platform may be the technology of choice.

Access to both MPSS and at least one of the microarray technologies would seem to be ideal for most investigators.

MPSS has the advantage that it provides in-depth quantitation of virtually all genes that are expressed in a sample.

Since there is no requirement for prior knowledge of any gene or genome, it is possible to generate quantitative gene expression datasets from any organism.

Because an MPSS dataset involves one million or more signature sequences, it has the sensitivity to accurately quantitate genes that are expressed at very low levels within a cell.

No other single technology has these performance characteristics.

In one study, MPSS was employed to determine the transcriptomic profile of adult human hepatocytes, with the aim of discovering the molecular basis of hepatocyte function.

A total of 10,279 UniGene clusters, representing 7,475 known genes, were detected in human hepatocytes.

In addition, 1,819 unique MPSS signatures matching the antisense strand of 1,605 non-redundant UniGene clusters (such as APOC1, APOC2, APOB and APOH) were highly expressed in hepatocytes.

Some of the antisense transcripts expressed in hepatocytes could play important roles in transcriptional interference via a cis-/trans-regulation mechanism.

The apolipoproteins encoded by genes such as APOC1, APOC2, APOB and APOH constitute the essential structural proteins of certain lipoproteins involved in lipid transport.

In another study, two lineage-related prostate cancer cell lines, LNCaP and C4-2, were used for transcriptome analysis with the aim of identifying genes associated with prostate cancer progression.

In the LNCaP cell line, 3,180 genes were detected only by Affymetrix arrays and 1,169 genes were detected only by MPSS.

Similarly, in the C4-2 cell line, 4,121 genes were detected only by Affymetrix arrays and 1,014 genes were detected only by MPSS.

A combination of transcription profiling technologies such as DNA array and MPSS provides a more robust means to assess the expression profile of an RNA sample.

Finally, genes that were differentially expressed in cell lines were also differentially expressed in primary prostate cancer and its metastases.

In a further study, MPSS was used to determine the transcriptomes of 32 normal human tissues, revealing the patterns of expression of almost 20,000 genes with high sensitivity and specificity.

The differences in gene expression between cell and tissue types are largely determined by transcripts derived from a limited number of tissue-specific genes.

Using massively parallel signature sequencing (MPSS), another study identified a total of 4,535 genes that are differentially expressed between normal brain and glioblastoma multiforme (GBM) tissue.

The expression changes of three up-regulated genes, CHI3L1, CHI3L2 and FOXM1, and two down-regulated genes, neurogranin and L1CAM, were confirmed by quantitative PCR.

An extended TGF-β signaling network was constructed, with gene expression changes between GBM and normal brain overlaid on it.