CS491JH: Data Mining in Bioinformatics Introduction to Microarray Technology Technology Background...

Post on 03-Jan-2016


CS491JH: Data Mining in Bioinformatics

Introduction to Microarray Technology

•Technology Background

•Data Processing Procedure

•Characteristics of Data

•Data integration and Data mining

Substrates for High Throughput Arrays

•Nylon Membrane: single label, P33

•Glass Slides: dual label, Cy3 and Cy5

•GeneChip: single label, biotin-streptavidin

GeneChip® Probe Arrays

GeneChip Probe Array: 1.28 cm

Hybridized Probe Cell: 24 µm

Millions of copies of a specific oligonucleotide probe

More than 200,000 different complementary probes

Single stranded, labeled RNA target

Oligonucleotide probe

Image of Hybridized Probe Array

GeneChip® Expression Array Design

Gene Sequence (5´ to 3´)

Multiple oligo probes

Probes designed to be Perfect Match

Probes designed to be Mismatch

Procedures for Target Preparation

Cells

Poly (A)+ / Total RNA

cDNA

IVT (Biotin-UTP, Biotin-CTP)

Labeled transcript

Fragment (heat, Mg2+)

Labeled fragments

Hybridize (16 hours)

Wash & Stain

Scan

Microarray Technology

NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab

Printing Arrays on 50 slides

Cells from condition A, Cells from condition B

Total RNA or mRNA

cDNA

Label Dye 1, Label Dye 2

Mix

GSI Lumonics

Ratio of expression of genes from two sources: equal, over, under

NSF / U of Illinois Microarray Workshop - Steve Clough / Vodkin Lab


Beta Actin

PKG

HPRT

Beta 2 microglobulin

Rubisco

AB binding protein

Major latex protein homologue (MSG)

Cattle and Soy Controls

Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng) and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).

Fetal Spleen-Cy3 vs. Adult Spleen-Cy5

Labeled spots: IgM, IgM heavy chain, MYLK, COL1A2

Placenta vs. Brain – 3800 Cattle Placenta Array cy3 cy5

GenePix Image Analysis Software

GeneFilter Comparison Report

GeneFilter names: O2#1 8-20-99adjfinal, N2#1finaladj

Intensities: RAW (GF1, GF2) and NORMALIZED (GF1, GF2)

ORF NAME   GENE NAME  CHRM  F  G  R  C  RAW GF1  RAW GF2  NORM GF1   NORM GF2   DIFFERENCE  RATIO
YAL001C    TFC3       1     1  A  1  2  12.03    7.38     403.83     209.79     194.04      1.92
YBL080C    PET112     2     1  A  1  3  53.21    35.62    1,786.11   1,013.13   772.98      1.76
YBR154C    RPB5       2     1  A  1  4  79.26    78.51    2,660.73   2,232.86   427.87      1.19
YCL044C    -          3     1  A  1  5  53.22    44.66    1,786.53   1,270.12   516.41      1.41
YDL020C    SON1       4     1  A  1  6  23.80    20.34    799.06     578.42     220.64      1.38
YDL211C    -          4     1  A  1  7  17.31    35.34    581.00     1,005.18   -424.18     -1.73
YDR155C    CPH1       4     1  A  1  8  349.78   401.84   11,741.98  11,428.10  313.88      1.03
YDR346C    -          4     1  A  1  9  64.97    65.88    2,180.87   1,873.67   307.21      1.16
YAL010C    MDM10      1     1  A  2  2  13.73    9.61     461.03     273.36     187.67      1.69
YBL088C    TEL1       2     1  A  2  3  8.50     7.74     285.38     220.01     65.37       1.30
YBR162C    -          2     1  A  2  4  226.84   293.83   7,614.82   8,356.39   -741.57     -1.10
YCL052C    PBN1       3     1  A  2  5  41.28    34.79    1,385.79   989.41     396.38      1.40
YDL028C    MPS1       4     1  A  2  6  7.95     6.24     266.99     177.34     89.65       1.51
YDL219W    -          4     1  A  2  7  16.08    11.33    539.93     322.20     217.74      1.68
YDR163W    -          4     1  A  2  8  19.13    14.19    642.17     403.56     238.61      1.59
YDR354W    TRP4       4     1  A  2  9  62.24    40.74    2,089.48   1,158.64   930.84      1.80
YAL018C    -          1     1  A  3  2  10.72    8.81     359.75     250.60     109.15      1.44
YBL096C    -          2     1  A  3  3  10.91    8.98     366.40     255.40     111.00      1.43
YBR169C    SSE2       2     1  A  3  4  17.33    27.81    581.80     790.84     -209.05     -1.36
YCL060C    -          3     1  A  3  5  17.99    24.75    603.96     703.75     -99.79      -1.17
YDL036C    -          4     1  A  3  6  14.22    8.86     477.39     251.94     225.44      1.89
YDL227C    HO         4     1  A  3  7  25.61    31.52    859.71     896.46     -36.75      -1.04
YDR171W    HSP42      4     1  A  3  8  102.08   98.37    3,426.83   2,797.58   629.25      1.22
YDR362C    -          4     1  A  3  9  16.32    12.95    547.96     368.39     179.57      1.49
YAL026C    DRS2       1     1  A  4  2  11.32    7.97     379.85     226.53     153.33      1.68
YBL102W    SFT2       2     1  A  4  3  55.88    63.74    1,875.82   1,812.81   63.02       1.03
YBR177C    -          2     1  A  4  4  63.31    29.03    2,125.20   825.60     1,299.60    2.57
YCL068C    -          3     1  A  4  5  8.33     4.47     279.51     127.16     152.35      2.20
YDL044C    MTF2       4     1  A  4  6  11.73    6.96     393.88     198.07     195.81      1.99
YDL235C    YPD1       4     1  A  4  7  38.71    30.20    1,299.33   858.83     440.50      1.51
YDR179C    -          4     1  A  4  8  12.77    11.05    428.60     314.12     114.48      1.36
YDR370C    -          4     1  A  4  9  16.70    15.30    560.62     435.13     125.49      1.29
YAL034C    FUN19      1     1  A  5  2  20.89    24.21    701.32     688.59     12.73       1.02
YBL111C    -          2     1  A  5  3  22.38    13.67    751.39     388.69     362.70      1.93
YBR185C    MBA1       2     1  A  5  4  38.42    19.96    1,289.61   567.78     721.83      2.27
YCLX03C    -          3     1  A  5  5  8.69     3.66     291.77     104.11     187.66      2.80
YDL052C    SLC1       4     1  A  5  6  52.37    49.87    1,758.05   1,418.33   339.73      1.24
YDL243C    -          4     1  A  5  7  15.56    12.95    522.24     368.30     153.94      1.42
YDR186C    -          4     1  A  5  8  16.48    15.01    553.30     426.75     126.55      1.30
YDR378C    -          4     1  A  5  9  31.13    28.08    1,045.01   798.50     246.50      1.31
YAL040C    CLN3       1     1  A  6  2  126.65   107.34   4,251.70   3,052.61   1,199.08    1.39
YBR006W    -          2     1  A  6  3  22.74    11.10    763.49     315.55     447.94      2.42
YBR193C    -          2     1  A  6  4  14.81    15.55    497.07     442.20     54.87       1.12
YCLX11W    -          3     1  A  6  5  161.96   175.34   5,436.86   4,986.41   450.44      1.09
YDL060W    -          4     1  A  6  6  29.84    37.13    1,001.65   1,055.98   -54.34      -1.05
YDR003W    -          4     1  A  6  7  23.99    23.22    805.48     660.25     145.22      1.22
YDR194C    MSS116     4     1  A  6  8  66.58    47.16    2,235.07   1,341.29   893.78      1.67
YDR386W    -          4     1  A  6  9  11.27    5.75     378.27     163.46     214.81      2.31
YAL047C    -          1     1  A  7  2  15.54    11.30    521.74     321.28     200.46      1.62
YBR012W-B  -          2     1  A  7  3  54.70    79.97    1,836.29   2,274.15   -437.86     -1.24
YBR201W    DER1       2     1  A  7  4  21.67    19.57    727.49     556.64     170.85      1.31
YCR007C    -          3     1  A  7  5  25.02    15.96    840.01     453.76     386.25      1.85
YDL068W    -          4     1  A  7  6  18.32    13.11    614.83     372.78     242.05      1.65
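The DIFFERENCE and RATIO columns of the report appear to follow a simple rule: the difference is normalized GF1 minus normalized GF2, and the ratio is the larger value divided by the smaller, given a negative sign when GF2 is the larger. This rule is inferred from the rows shown, not from the software's documentation, so the sketch below is an assumption; the function name `difference_and_ratio` is hypothetical.

```python
def difference_and_ratio(gf1, gf2):
    """Return (difference, signed fold ratio) for two normalized intensities.

    Assumed rule, inferred from the report rows:
      ratio = gf1 / gf2      when gf1 >= gf2
      ratio = -(gf2 / gf1)   when gf1 <  gf2
    so the magnitude is always >= 1 and the sign gives the direction.
    """
    if gf1 <= 0 or gf2 <= 0:
        raise ValueError("intensities must be positive")
    diff = gf1 - gf2
    ratio = gf1 / gf2 if gf1 >= gf2 else -(gf2 / gf1)
    return round(diff, 2), round(ratio, 2)


# Normalized GF1/GF2 values copied from two rows of the report.
rows = {
    "YAL001C": (403.83, 209.79),
    "YDL211C": (581.00, 1005.18),
}
for orf, (gf1, gf2) in rows.items():
    print(orf, difference_and_ratio(gf1, gf2))
```

For these two rows the computed values match the report (194.04 and 1.92 for YAL001C, -424.18 and -1.73 for YDL211C).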

Microarray Data Process

1. Experimental Design

2. Image Analysis – raw data

3. Normalization – “clean” data

4. Data Filtering – informative data

5. Model building

6. Data Mining (clustering, pattern recognition, etc.)

7. Validation
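As an illustration of step 4 (data filtering), one possible filter keeps only genes whose log ratio between two samples exceeds a cutoff. This is a minimal sketch under stated assumptions: the cutoff of 0.3 is taken from the scatterplot slide, it is assumed to be on a log10 scale (0.3 is roughly a twofold change), and the ratio is assumed to be fetal over adult. The function name `flag_gene` and the intensity values are hypothetical.

```python
import math


def flag_gene(fetal, adult, threshold=0.3):
    """Label a gene by the log10 ratio of its normalized intensities.

    Assumption: ratio is fetal / adult on a log10 scale.
    """
    log_ratio = math.log10(fetal / adult)
    if log_ratio > threshold:
        return "higher in fetal"
    if log_ratio < -threshold:
        return "higher in adult"
    return "unchanged"


# Hypothetical normalized intensities: (fetal, adult)
genes = {
    "geneA": (2500.0, 1000.0),
    "geneB": (400.0, 1000.0),
    "geneC": (1100.0, 1000.0),
}

# Keep only the genes that pass the filter.
informative = {
    name: flag_gene(fetal, adult)
    for name, (fetal, adult) in genes.items()
    if flag_gene(fetal, adult) != "unchanged"
}
print(informative)
```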

Scatterplot of Normalized Data

Axes: Adult, Fetal

Highlighted points: above 0.3 and below -0.3

Characteristics of Data

Data can be viewed as an N×M matrix (N >> M):

N is the number of genes

M is the number of data points for each gene

Or N×(M+K)

K is the number of features describing each gene (genome location, functional description, metabolic pathway, etc.)
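The matrix view can be sketched in a few lines. This is only one possible representation, using the standard library; the class name `GeneRecord` and the choice of features are hypothetical. The expression values and annotations are taken from the first three rows of the GeneFilter report (normalized GF1 and GF2), so here M = 2 and K = 2.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    orf: str
    expression: list                                # M data points for this gene
    features: dict = field(default_factory=dict)    # K descriptive features


records = [
    GeneRecord("YAL001C", [403.83, 209.79], {"gene_name": "TFC3", "chromosome": 1}),
    GeneRecord("YBL080C", [1786.11, 1013.13], {"gene_name": "PET112", "chromosome": 2}),
    GeneRecord("YBR154C", [2660.73, 2232.86], {"gene_name": "RPB5", "chromosome": 2}),
]
feature_names = ["gene_name", "chromosome"]

# N x M matrix: one row per gene, one column per data point.
expression_matrix = [rec.expression for rec in records]

# N x (M + K) matrix: expression columns followed by feature columns.
wide_matrix = [
    rec.expression + [rec.features.get(name) for name in feature_names]
    for rec in records
]

n_genes = len(expression_matrix)       # N
n_points = len(expression_matrix[0])   # M
n_features = len(feature_names)        # K
print(n_genes, n_points, n_features)
```

In a real experiment N would be in the thousands while M stays small, which is the N >> M situation described above.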

Model for Data Analysis

•Gene Expression is a Dynamic Process

•Each Microarray Experiment is a snapshot of the process

•Need basic biological knowledge to build a model

For Example:

Assumption – In most experiments, only a small set of genes (100s to 1000s) is significantly affected.
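One common way this assumption is used is in normalization: if most genes are unchanged, the typical log ratio between two samples should be zero, so the median log ratio can be subtracted from every gene. The sketch below illustrates that idea with the standard library; it is an example technique, not necessarily the normalization used in this course, and the intensities are hypothetical.

```python
import math
from statistics import median


def median_center_log_ratios(sample_a, sample_b):
    """Return log2(a/b) per gene, shifted so that the median is zero.

    Relies on the assumption that most genes are unaffected, so the
    median log ratio reflects a global bias (labeling, scanning) rather
    than biology.
    """
    log_ratios = [math.log2(a / b) for a, b in zip(sample_a, sample_b)]
    shift = median(log_ratios)
    return [lr - shift for lr in log_ratios]


# Hypothetical intensities: sample A is globally twice as bright as B,
# and only the last gene is truly different.
sample_a = [200.0, 400.0, 800.0, 1600.0, 6400.0]
sample_b = [100.0, 200.0, 400.0, 800.0, 400.0]

centered = median_center_log_ratios(sample_a, sample_b)
print(centered)
```

After centering, the four unchanged genes sit at zero and only the affected gene stands out.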

Data Mining

Need for Data Mining

•Data volumes are too large for traditional analysis methods

Large number of records and high-dimensional data

•Only a small portion of the data is analyzed

•Decision support process becomes more complex

Functions of Data Mining

Use the data to build predictors – prediction, classification, deviation detection, segmentation

Generate more sophisticated summaries and reports to aid understanding of the data – find clusters and partitions in the data

Data Mining Methods

Classification, Regression (Predictive Modeling)

Clustering (Segmentation)

Association Discovery (Summarization)

Change and deviation detection

Dependency Modeling

Information Visualization
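To make the classification entry in the list above concrete, here is a minimal nearest-centroid classifier. It is chosen only because it is short and deterministic; it is not claimed to be the method used in any of the studies cited in these slides. The class labels, function names, and expression profiles are all hypothetical.

```python
import math


def centroid(profiles):
    """Mean profile of a list of equal-length expression profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def train(labelled):
    """labelled: dict mapping class label -> list of training profiles."""
    return {label: centroid(profiles) for label, profiles in labelled.items()}


def predict(model, profile):
    """Return the label whose centroid is closest to the profile."""
    return min(model, key=lambda label: euclidean(model[label], profile))


# Hypothetical training samples, each measured on three genes.
training = {
    "class_a": [[1.0, 0.1, 0.2], [0.9, 0.2, 0.1]],
    "class_b": [[0.1, 1.0, 0.9], [0.2, 0.8, 1.0]],
}
model = train(training)
print(predict(model, [0.8, 0.3, 0.2]))
print(predict(model, [0.1, 0.9, 0.8]))
```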

Clustered display of data from a time course of serum stimulation of primary human fibroblasts. Labeled clusters:

•Cholesterol Biosynthesis

•Cell Cycle

•Immediate Early Response

•Signaling and Angiogenesis

•Wound Healing and Tissue Remodeling

Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998), p. 14865
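Clustered displays of this kind are built by hierarchical clustering of genes with a correlation-based similarity. The sketch below shows the general idea with Pearson correlation and average linkage over pairwise similarities. It is a simplification for illustration and not the exact procedure of the cited paper; the gene names and time-course values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two profiles (both must have nonzero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_similarity(c1, c2, profiles):
    """Average pairwise correlation between members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def hierarchical_cluster(profiles):
    """Repeatedly merge the two most similar clusters.

    Returns the merge history as (cluster_a, cluster_b, similarity).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], profiles),
        )
        similarity = cluster_similarity(best[0], best[1], profiles)
        clusters = [c for c in clusters if c not in best]
        clusters.append(best[0] + best[1])
        merges.append((best[0], best[1], round(similarity, 3)))
    return merges


# Hypothetical time-course profiles for three genes.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 2.0],
}
merges = hierarchical_cluster(profiles)
for merge in merges:
    print(merge)
```

Genes g1 and g2 rise together and are merged first; g3 falls over time and joins last with a negative similarity.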

Self Organizing Maps

Molecular Classification of Cancer

Gene Expression Profile of Aging and Its Retardation by Caloric Restriction

Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla

Expression Landscape of cell-cycle regulated genes in yeast

Multidimensional data visualization