CHAPTER 3
PREPROCESSING ON MICROARRAY DATA
3.1 INTRODUCTION
One of the most critical steps in a data mining process is the
preparation and transformation of the initial dataset into a suitable form. This
task has received little attention in the research literature, mostly because it is
considered too application specific. However, in most data mining applications,
some parts of the data preparation process, or sometimes even the entire process,
can be described independently of the application and the data mining method.
Many transformations may be needed to produce features more suitable for
selected data mining methods such as prediction or classification. In most
cases, human assistance is required to find the best transformation for a
given method or application (Yong Shi 2008).
In general, real-world databases are highly susceptible to noisy,
missing, and inconsistent data due to their typically huge size, often several
gigabytes or more. In order to improve the quality of the data, preprocessing
is mandatory. There are a number of data preprocessing techniques: 1) data
cleaning is applied to remove noise and correct inconsistencies in the data, 2)
data integration merges data from multiple sources into a coherent data store,
such as a data warehouse or a data cube, and 3) data transformations, such as
normalization, are applied to make the data suitable for further analysis. These
may improve the accuracy and efficiency of data mining techniques. Data
reduction methods are used to reduce the data size, for instance by aggregating,
eliminating redundant features, or clustering. These data preprocessing
techniques, when applied prior to mining, can substantially improve the
overall quality of the patterns mined. These methods are organized into the
following categories: data cleaning, data integration and transformation, and
data reduction (Yong Shi 2008, Hawkins 1980, Pang-Ning Tan et al 2009). In
this research work, two preprocessing techniques, namely outlier detection
(anomaly detection) and dimensionality reduction, are used to improve the
quality of the data and, as a sequel, the efficacy of the clustering method.
3.2 ANOMALIES DETECTION AND REMOVAL
Real-world data tend to be incomplete, noisy, and inconsistent, as
discussed above. As outliers (anomalies) can significantly impact the quality
of analysis of microarray gene expression data, they are given particular attention here.
The goal of anomaly detection is to find objects that are different from most
other objects. Often, anomalous objects are known as outliers (Pang-Ning Tan
et al 2009).
Anomaly detection is an important branch of data mining, which is the
discovery of data that deviate greatly from other data patterns (Hawkins 1980,
Mansur 2005). Detecting and removing outliers is very important in data
mining. For example, errors in large databases are extremely common, so an
important property of a data mining algorithm is robustness with respect to
outliers in the database. For instance, a relatively small number of outliers can
alter the set of clusters produced by a clustering algorithm (Bernard Chen et al
2005). Machine learning researchers often use the concept of noise rather than
that of outliers (Jorma Laurikkala et al 2000, Jan Dupac et al 2002). There are a
variety of anomaly detection approaches from several areas, including
statistics, machine learning, and data mining. All try to capture the idea that
anomalous data objects are unusual or in some way inconsistent with other
objects. Some common causes of anomalies are human errors often found
during data collection and data entry, and errors while integrating data from
different sources.
Data mining is a process of extracting valid, previously unknown,
and ultimately comprehensible information from large datasets and using it
for many applications (Yu et al 2002). However, many problems exist in
mining large datasets, such as data redundancy, unspecific attribute values,
incomplete data, and outliers (Breunig et al 2000). An outlier is
defined as a data point which is very different from the rest of the data based on
some measure (Pang-Ning Tan et al 2009, Chao Yan et al 2001, Jorma
Laurikkala et al 2000, Barnett and Lewis 1987). Such a point often contains
useful information on abnormal behavior of the system described by the data
(Aggarwal and Yu 2005, Jiawei Han and Micheline Kamber 2005). On the
other hand, many data mining algorithms in the literature find outliers as a
by-product of clustering algorithms. From the viewpoint of a clustering
algorithm, outliers are objects not located in the clusters of a dataset (Breunig et al
2000).
The outlier detection problem is one of the most interesting problems in
data mining research. Recently, a few studies have been conducted on
outlier detection for large datasets (Aggarwal and Yu 2005). Many data
mining algorithms try to minimize the influence of outliers or eliminate them
altogether. However, this could result in the loss of important hidden
information, since one person's noise could be another person's signal (Knorr
et al 2000, Koji-Kadota et al 2003).
Outliers can render the data abnormal. Since normality is one of the
assumptions for many statistical tests, finding and eliminating the
influence of outliers may render the data normal and appropriate for analysis
using those tests. Just because a value is extreme compared to the
rest of the data does not necessarily mean it is an anomaly, or
invalid, or should be removed. The subject chose to respond with that value,
so removing that value is arbitrarily throwing away data simply because it
does not fit the normality assumption. Conducting research is about
discovering empirical reality. If the subject chose to respond with that value,
then that data is a reflection of reality, so removing the outlier is the antithesis
of why one conducts research. One solution is to analyze the data both with the
outlier and without the outlier, because each analysis gives a separate type of
information (Suresh and Dinakaran 2009, www.psychwiki.com/wiki 2009).
Outlier detection, or outlier mining, is the process of identifying
outliers in a set of data. The outlier detection technique finds applications in
finance, marketing, fraud detection, intrusion detection, ecosystem
distributions, public health, medicine and the analysis of gene expression data
(Yu et al 2002). Thus, outlier detection and analysis is an interesting and
important data mining task. The outlier detection process in data mining is shown
in Figure 3.1.
Figure 3.1 Outlier detection in Data Mining
3.3 NEED FOR OUTLIER ANALYSIS
Outlier analysis of gene expression data is crucial: a gene exception may
yield a) no harm and no benefit, b) a harmful genetic defect (such as
haemophilia, that is, uncontrolled bleeding) or c) an improvement (such as
immunity to a certain disease). The last two are no doubt of most biological
importance. As discussed above, outlier analysis is similar to clustering,
which finds clusters containing similar patterns. Although some clustering
algorithms (Ester et al 1996, Zhang et al 1996, Sheikholeslami 1998, Agrawal
et al 1998) can be applied to outlier detection, they are actually insensitive to
outliers, as they are mainly meant for clustering. Their results are often
inaccurate (Chao Yan et al 2001), so outlier analysis needs its own
algorithms. There have been many algorithms for outlier analysis in recent
years. Yet these algorithms are all vulnerable to high dimensional data like
microarray gene expression data. Gene expression data are inherently linked
to high dimensionality.
In order to handle sparsity problems in high dimensionality,
algorithms need to be developed exclusively for such data. They should
provide interpretability in terms of the factors contributing to the abnormality.
Proper measures must be identified in order to account for the physical
significance of the definition of an outlier in k-dimensional subspace. They
should remain computationally efficient for very high dimensional
problems. They should give importance to local data behaviour while
determining whether a point is an outlier. Biological experiments often
pinpoint quite a large number of segments in a sequence, thus leading
inevitably to high dimensional data. Additionally, more advanced
technologies like microarrays may yield data with higher dimensionality (Chao
Yan et al 2001), as longer sequences can be put to the test. The gene
expression datasets tested in the present work have dimensionality ranging from 2 to 90.
In some cases, there may be hundreds of dimensions. An outlier detection
algorithm for gene expression data must have the ability to deal with high
dimensionality.
3.4 OUTLIER ANALYSIS ON MICROARRAY DATA
Among the many outlier detection methods discussed in section
2.6, the distance based technique was chosen to detect outliers present in
microarray datasets such as human serum, yeast and cancer. This technique is
considered more suitable for microarray gene expression data, which are
expressed at different points of time. In the following sections, the results of the
outlier detection technique on the four gene expression datasets
mentioned above are discussed with suitable examples.
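The distance based idea can be sketched in a few lines of Python. This is a minimal illustration, not the exact implementation used in this work: it scores each object by its mean Euclidean distance to its k nearest neighbours and reports the N highest-scoring objects as outliers. The function names, parameter values and toy data are all illustrative assumptions.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row of X by its mean distance to its k nearest neighbours."""
    # Pairwise Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row and skip the first column (the zero self-distance).
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)

def top_n_outliers(X, n, k=5):
    """Return indices of the n rows with the largest k-NN distance score."""
    scores = knn_outlier_scores(X, k)
    return np.argsort(scores)[::-1][:n]

# Toy data: nine tightly grouped points plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(9, 3)), [[5.0, 5.0, 5.0]]])
print(top_n_outliers(X, n=1, k=3))  # index of the far point
```

In practice the input parameter N plays the same role as in the experiments below: it caps how many of the highest-scoring values are reported as outliers.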
Another outlier detection method used in this research is the graphical
method. Among the many graphical methods, the box plot was constructed for the
original dataset to indicate the outliers. Box plots are an excellent tool
for conveying location and variation information in data sets. A box plot is
formed by a vertical axis, which represents the response variable, and a horizontal
axis, which represents the factor of interest. There is a useful variation of the
box plot that more specifically identifies outliers.
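The outlier rule underlying the box plot can be sketched directly. This is a generic illustration assuming Tukey's standard fences (values outside Q1 - 1.5 IQR and Q3 + 1.5 IQR), the convention most box plot implementations use; the function name and sample values are illustrative.

```python
import numpy as np

def boxplot_outliers(values, whis=1.5):
    """Flag values outside Tukey's fences: [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - whis * iqr, q3 + whis * iqr
    values = np.asarray(values)
    return values[(values < lo) | (values > hi)]

# A small sample with one extreme expression value.
data = [0.1, 0.2, 0.15, 0.3, 0.25, 0.18, 0.22, 6.44]
print(boxplot_outliers(data))  # -> [6.44]
```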
3.4.1 Human Serum Data
Outliers are detected and removed from the human serum dataset
(http://genome-www.stanford.edu/serum/ referred on 04.03.2009), and the
cleaned data are applied to the Hybrid Clustering Technique in the next chapter.
A sample input dataset for human serum is given in Table 3.1.
Table 3.1 Sample input dataset for human serum
GeneIndex 15MIN 30MIN 1HR 2HR 4HR 6HR 8HR 12HR 16HR 20HR 24HR 28HR
361771 -0.47 -3.32 -0.81 0.11 -0.6 -1.36 -1.03 -1.84 -1 -0.6 -0.94 -0.84
120386 -0.45 1.62 1.83 0.03 0.33 0.25 -0.07 0.23 -0.4 -0.1 -0.36 -0.32
26474 1.42 3.03 3.67 0.58 0.66 0.78 0.3 -0.38 0.19 -0.01 -0.17 0.11
162772 0.56 2.05 2.43 0 1.36 0.06 -0.58 -0.04 -0.76 0.16 0.21 0.07
254436 0.01 2.24 3.41 1.58 1.86 0.69 0.08 -0.22 0.74 0.61 -0.32 -0.23
510136 -0.07 -0.14 0.01 0.1 2.8 1.34 0.56 0.55 0.48 0.18 0.33 -0.3
23464 -0.54 -0.27 -1.06 0.43 1.66 1.7 1.52 0.64 0.21 0.2 -0.12 0.23
364959 0.07 0.5 -0.09 0.01 1.57 1.71 1.54 0.86 -0.09 -0.49 -0.64 0.71
108837 0.25 0.82 0.78 0.61 2.26 2.61 1.77 1.17 0.66 -0.18 -0.29 1.14
328692 1.42 1.27 1.91 2.63 5.28 6.44 4.68 3.89 2.75 1.44 1.28 0.53
The data set consists of 517 objects and 12 attributes each,
where the objects are genes and the attributes are expression values of the
corresponding gene at different time points. Outliers are removed from this
dataset using the distance based outlier detection technique. The original data set
has been screened by this method and a maximum of twenty-eight outliers out
of 6204 values were identified, as shown in Table 3.2.
Table 3.2 Outliers detected on human serum dataset
3.3400 3.3800 3.4000 3.4100 3.4400 3.4600 3.4700 3.4900 3.5200 3.5400
3.5500 3.5700 3.6300 3.6500 3.6700 3.7800 3.8900 3.9300 4.0400 4.0500
4.1200 4.1700 4.3300 4.4000 4.6800 4.8200 5.2800 6.4400
Figure 3.2 shows a box plot for the given human serum dataset,
which represents outliers in the form of circles. There are 28 outliers
designated in this plot among about 6204 observations. Though this method
of detecting outliers is suitable for datasets having a small number of objects
and attributes, it is not suitable for datasets having a large number of objects
with high dimensional data, because the values of objects are not visible due
to the overlapping of points and the restricted plot area, as shown in the figure.
Also, some of the points identified as outliers by the box plot are not really so,
for example the value -3.4 (lower extreme outlier). So it is not advisable to
use graphical methods for detecting outliers on high dimensional data, as the
results may not be reliable.
Figure 3.2 Box plot outliers on human serum data
A scatter plot is drawn for objects (genes) against expression values
expressed in different time points as shown in Figure 3.3. The x axis
represents genes and the y axis represents the expression values. It shows
clearly that some of the values above 3.340 are far away from other values, so
they can be assumed to be outliers.
Figure 3.3 Scatter plot outliers on human serum data
3.4.2 Yeast Data
Outliers in the microarray dataset are detected, and the result for the yeast
data is given in Table 3.4. The data set consists of 201 objects and 12
attributes, where the objects are genes and the attributes are expression values
of the corresponding gene at different points of time. Table 3.3 provides a sample
of the microarray yeast dataset; it is not the full dataset, only a part of the
original. The full dataset is given in the appendix. The distance based outlier
detection technique is used to detect and remove outliers in the yeast data.
Table 3.3 Sample input dataset for yeast microarray data
15 MIN 30 MIN 1 HR 2 HR 3 HR 4 HR 5 HR 6 HR 7 HR 8 HR 9 HR 10 HR
-0.57 -3.32 -0.81 0.11 -0.6 -1.26 -1.03 -1.84 -1 -0.6 -0.94 -0.84
0.62 0.07 0.2 0.29 -0.89 -0.45 -0.29 -0.29 -0.15 -0.45 -0.42 0.43
-0.56 0.08 -0.01 -0.79 -1.25 -2.18 -2.32 -1.84 -1 -0.97 -0.92 -1.64
-0.09 -0.07 -0.22 -0.6 -0.74 -1.06 -1.15 -1.06 -0.89 -0.89 -0.56 -0.86
-0.56 0.08 -0.01 -0.79 -1.25 -2.18 -2.32 -1.84 -1 -0.97 -0.92 -1.64
-0.09 -0.07 -0.22 -0.6 -0.74 -1.06 -1.15 -1.06 -0.89 -0.89 -0.56 -0.86
-0.03 -0.18 -0.58 -0.43 -0.64 -1.22 -1.64 -2.18 -2.06 -2 -1.12 -2.18
-0.18 -0.38 -0.54 -0.32 -0.84 -1.32 -1.47 -2.18 -2.32 -2.06 -1.12 -1.94
0.24 -0.45 -0.27 -0.34 -0.71 -1.36 -2.06 -2.56 -1.4 -1.51 -1.4 -1.18
0.15 0.04 -0.4 -0.32 -0.74 -0.94 -1.15 -1.64 -1.6 -0.97 -0.71 -0.81
-0.43 -0.07 -0.17 -0.45 -1.47 -2.32 -2.18 -2.4 -2.12 -2.64 -2.4 -1.84
-0.3 -0.1 -0.42 0.24 -0.89 -1.69 -1.56 -2.12 -1.69 -1.47 -1.47 -0.92
0.07 -0.06 -0.03 -0.2 -0.89 -1.69 -1.69 -1.29 -1.06 -1.47 -1.47 -1.32
0.11 0.56 0.55 0.03 -0.79 -1.03 -1.03 -0.67 -0.58 -0.81 -0.69 -0.67
0.38 0.34 0.1 0.07 -0.64 -1.25 -1.06 -0.71 -0.51 -0.79 -0.81 -0.25
Table 3.4 Outliers detected on yeast dataset
-2.1800 -2.1800 -2.2500 -2.3200 -2.3200 -2.3200 -2.3200 -2.3200 -2.3200 -2.4000
-2.4000 -2.4000 -2.4700 -2.5600 -2.6400 -2.6400 -2.6400 -3.3200
As discussed in section 2.6.4, a box plot is constructed for the
original dataset to indicate the outliers. Figure 3.4 shows a box plot for the yeast
dataset in which outliers are represented in the form of hollow circles. There
are only 18 outliers designated by this plot out of 2412 observations. This
method is not suitable for datasets having a large number of objects with high
dimensional data, as discussed in section 2.6.4.
Figure 3.4 Box plot outliers on yeast data
A scatter plot is drawn for objects (genes) against expression values
at different time points, as shown in Figure 3.5. The x axis
represents genes and the y axis the expression values of the yeast data.
It shows that some of the values at or beyond -2.180 may be considered outliers.
Figure 3.5 Scatter plot outliers on yeast data
3.4.3 Lung Cancer Data
The microarray lung cancer dataset is analyzed in this section to detect
outliers. This dataset consists of 20 objects and 4 attributes, where the
objects are genes and the attributes are expression values of the corresponding
gene at different time points. Sample input data for the cancer dataset are given in
Table 3.5. Outliers are detected in the dataset using two approaches. The
first approach is algorithmic, in which a distance based algorithm is used. An
input parameter N (maximum number of outliers) was set to five, but the method
found only four outliers out of 80 values. The outlier values are 6.176,
12.000, 12.589 and 12.727. The detected outliers are removed from the data
set before using it for further analysis. The second approach, the box plot, is
graphical: a box plot is constructed for the original dataset to indicate
outliers.
Table 3.5 Sample input dataset for lung cancer microarray data
Gene Index 2 HR 3 HR 4 HR 5 HR
1415670 8.621339512 7.5 8.6213 9.258
1415671 11.06818663 11.068 12.589 10.568
1415672 10.44145373 9.457 10.441 8.5648
1415677 8.513605331 7.254 8.5136 8.568
1415681 9.275654563 7 7.1 7.8
1415682 7.147069562 11 12.0 11.5
1415687 12.72740879 10 9.5 8
1415688 8.928937957 8.5 9 8.2
1415689 6.176931606 7 7.2 7.
A box plot is formed by a vertical axis, which represents the response
variable, and a horizontal axis, which represents the factor of interest. There is a
useful variation of the box plot that more specifically identifies outliers.
Figure 3.6 shows a box plot for the lung cancer dataset where outliers are
represented in the form of hollow circles. There are only four outliers
designated by this plot among about 80 observations. Figure 3.7 shows the
expression levels of genes, the x axis representing genes and the y axis the
expression values. It shows clearly that some of the values above 11.5 and
below 7.0 are far away from the other values, and hence these values are
considered outliers.
Figure 3.6 Box Plot outliers on lung cancer data
Figure 3.7 Expression levels of lung cancer genes
3.4.4 Blood Cancer Data
Blood cancer is a generalized term for malignancy that attacks
the blood, bone marrow, or lymphatic system. It refers to the abnormal growth of
cells normally found in the blood. There are three kinds of blood cancer:
leukemia, lymphoma, and multiple myeloma. In this research work, the focus
is on one of the most prevalent types of blood cancer, leukemia, which results
from the malignant transformation of white blood cells.
White blood cells help fight infections and are a key component of
the body's immune system. In people with leukemia, the bone marrow
produces an abnormal amount of white blood cells. These abnormal white
blood cells (leukemia cells) may crowd out normal white blood cells, red
blood cells, and platelets. This makes it hard for the blood cells to function
properly.
Outliers in the microarray dataset are detected, and the result for blood
cancer is given in Table 3.6. The dataset consists of 1023 objects and 25
attributes, where the objects are genes and the attributes are expression values
of the corresponding gene at different points of time. The distance based outlier
detection technique is used to detect and remove outliers in the blood cancer
data. The full blood cancer dataset is given in appendix-4, as its size is
too large to be accommodated in this section.
Table 3.6 Outliers detected on blood cancer dataset
0.3475 0.3476 0.3477 0.3482 0.3484 0.3488 0.3490 0.3494
0.3499 0.3500 0.3503 0.3504 0.3520 0.3522 0.3524 0.3526
0.3531 0.3532 0.3535 0.3536 0.3540 0.3541 0.3542 0.3544
0.3548 0.3552 0.3557 0.3558 0.3559 0.3562 0.3563 0.3565
0.3567 0.3569 0.3573 0.3574 0.3579 0.3586 0.3586 0.3589
0.3596 0.3608 0.3621 0.3625 0.3626 0.3628 0.3629 0.3632
0.3633 0.3640 0.3641 0.3642 0.3647 0.3647 0.3647 0.3647
0.3654 0.3654 0.3658 0.3658 0.3658 0.3661 0.3664 0.3679
0.3679 0.3679 0.3685 0.3697 0.3702 0.3705 0.3707 0.3709
0.3712 0.3714 0.3715 0.3717 0.3718 0.3721 0.3723 0.3726
0.3727 0.3740 0.3742 0.3747 0.3755 0.3761 0.3763 0.3772
0.3774 0.3775 0.3779 0.3780 0.3782 0.3787 0.3789 0.3791
0.3796 0.3796 0.3804 0.3804 0.3806 0.3810 0.3813 0.3821
0.3822 0.3823 0.3831 0.3833 0.3835 0.3843 0.3845 0.3847
0.3847 0.3849 0.3852 0.3854 0.3854 0.3854 0.3857 0.3860
0.3863 0.3869 0.3877 0.3878 0.3880 0.3888 0.3888 0.3891
0.3913 0.3926 0.3927 0.3937 0.3944 0.3944 0.3946 0.3951
0.3956 0.3958 0.3979 0.3980 0.3994 0.4004 0.4005 0.4008
0.4021 0.4029 0.4029 0.4030 0.4039 0.4044 0.4044 0.4046
0.4050 0.4050 0.4052 0.4052 0.4062 0.4064 0.4074 0.4078
0.4086 0.4090 0.4093 0.4100 0.4100 0.4108 0.4110 0.4116
0.4116 0.4118 0.4123 0.4128 0.4138 0.4140 0.4151 0.4159
0.4167 0.4170 0.4172 0.4178 0.4181 0.4184 0.4185 0.4187
0.4187 0.4188 0.4189 0.4196 0.4224 0.4234 0.4239 0.4243
0.4249 0.4252 0.4253 0.4258 0.4258 0.4274 0.4275 0.4275
0.4276 0.4281 0.4289 0.4292 0.4297 0.4300 0.4301 0.4307
0.4307 0.4308 0.4314 0.4317 0.4320 0.4328 0.4328 0.4330
0.4339 0.4339 0.4344 0.4352 0.4354 0.4360 0.4361 0.4363
0.4368 0.4370 0.4379 0.4384 0.4385 0.4386 0.4390 0.4392
0.4396 0.4404 0.4405 0.4408 0.4412 0.4428 0.4433 0.4434
0.4435 0.4436 0.4441 0.4442 0.4450 0.4455 0.4467 0.4473
0.4474 0.4475 0.4476 0.4477 0.4486 0.4488 0.4488 0.4492
0.4499 0.4503 0.4517 0.4518 0.4522 0.4531 0.4531 0.4534
0.4535 0.4552 0.4552 0.4554 0.4574 0.4586 0.4587 0.4589
0.4590 0.4591 0.4610 0.4616 0.4619 0.4628 0.4634 0.4638
As discussed in section 2.6.4, a box plot is constructed for the
original dataset to indicate the outliers. Figure 3.8 shows a box plot for the blood
cancer dataset in which outliers are represented in the form of asterisks. There
are 1319 outliers designated by this plot out of 25575 observations.
Figure 3.8 Box plot outliers on blood cancer data
This method is not suitable for datasets having a large number of
objects with high dimensional data, as discussed in section 2.6.4.
A scatter plot is drawn for objects (genes) against expression values
at different time points, as shown in Figure 3.9. The x axis
represents genes and the y axis the expression values of the blood cancer data. It
shows that some of the values above 0.3475 can be considered outliers.
Figure 3.9 Scatter plot outliers on blood cancer data
As discussed in the previous sections, results are very sensitive to
outliers: even a small number of outliers can make a drastic change in the final
result. The outliers detected in the four datasets are summarized in Table 3.7. It is
obvious that this susceptibility increases as the percentage of outliers increases.
Table 3.7 Outliers detected in datasets
Dataset         Original observations   Outliers detected
Human serum     6204                    28
Yeast           2412                    18
Lung cancer     80                      4
Blood cancer    25575                   1319
3.5 PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a classical technique to
reduce the dimensionality of a data set consisting of a large number of
variables. This is achieved by transforming the data to a new set of variables
(principal components) which are uncorrelated and ordered so that the first
few retain most of the variance present in all of the original variables. As a
result, dimensionality reduction using PCA can yield relatively low-
dimensional data, making it possible to apply techniques that do not work
well with high dimensional data like microarray gene expression data. More
details are given in section 2.7.1.
Figure 3.10 Framework of dimensionality reduction
The dimensionality reduction technique is applied to the gene expression
dataset from which outliers have been removed, as described in the previous
section. Figure 3.10 shows the process of performing dimensionality reduction;
the following sections demonstrate how the dimensionality of each dataset is
considerably reduced. The PCA algorithm is as follows:
Algorithm: Principal Component Analysis (PCA)
Input : Multidimensional data in a data matrix in which the rows are
genes and columns are conditions.
Output : Dimensions reduced dataset
Step 1 : Calculate mean value of each object in data matrix.
Step 2 : Calculate the covariance of data matrix.
Step 3 : Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 4 : Find the number of components and form a feature vector.
Step 5 : Transpose the feature vector and the mean-adjusted data matrix.
Step 6 : Derive the new dataset with reduced dimensions by multiplying the
feature vector and the mean-adjusted data.
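The steps above can be sketched in Python with NumPy. This is a minimal illustration of the algorithm, not the exact implementation used in this work; the random matrix merely stands in for a real gene expression dataset, and the function name is an assumption.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Reduce X (rows = genes, columns = conditions) to n_components dimensions.

    Follows the steps above: mean-adjust, covariance, eigendecomposition,
    feature-vector selection, projection.
    """
    # Step 1: mean of each column, then mean-adjust the data.
    mean = X.mean(axis=0)
    centred = X - mean
    # Step 2: covariance matrix of the conditions.
    cov = np.cov(centred, rowvar=False)
    # Step 3: eigenvectors and eigenvalues (symmetric matrix -> eigh).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: keep the eigenvectors with the largest eigenvalues (feature vector).
    order = np.argsort(eigvals)[::-1][:n_components]
    feature_vector = eigvecs[:, order]
    # Steps 5-6: project the mean-adjusted data onto the feature vector.
    return centred @ feature_vector

rng = np.random.default_rng(1)
X = rng.normal(size=(517, 12))            # e.g. 517 genes x 12 time points
reduced = pca_reduce(X, n_components=3)
print(reduced.shape)  # (517, 3)
```

The first column of the reduced data carries the most variance, the second the next most, and so on, which is what makes truncation to the first few components meaningful.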
In this section, the results of Principal Component Analysis (PCA)
applied to the microarray datasets are discussed. Four different microarray gene
expression datasets with a dimension range of 2 to 25 are taken for analysis. It is
observed that the dimensions are reduced considerably after PCA is
performed on these datasets, and it is believed that the reduced number of
dimensions would produce the same results as the original dimensions. This
was experimented on four datasets: a) the human serum dataset, b) the yeast dataset,
c) the lung cancer dataset and d) the blood cancer dataset. The description of the
datasets and their results are discussed in the following sub-sections.
3.5.1 Human Serum Data
This dataset contains 517 genes as rows and 12 dimensions as
columns. The dimensions have to be reduced in order to make the dataset fit
for cluster analysis. As discussed in section 2.7.1, a large number of dimensions
may lead to inaccuracy and may not comply with the requirements of many
clustering techniques.
PCA finds eigenvectors and eigenvalues relevant to the data
using a covariance matrix, as given in Table 3.8. Eigenvectors can be
thought of as preferential directions of a data set, or in other words, main
patterns in the data. PCA can be applied in two profiles, PCA on genes and
PCA on conditions; in this research it is restricted to PCA on genes only. For
PCA on genes, an eigenvector would be represented as an expression profile
that is most representative of the data, and eigenvalues can be thought of as a
quantitative assessment of how much a component represents the data. The
higher the eigenvalue of a component, the more representative it is of the data.
Eigenvalues can also be representative of the level of explained
variance as a percentage of the total variance. By themselves, eigenvalues are not
informative; the percentage of variance explained depends on how well all
the components summarize the data. In theory, the sum of all components
explains 100% of the variability in the data.
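The per-component and cumulative percentages of explained variance follow directly from the eigenvalues. A minimal sketch, using illustrative eigenvalues rather than the thesis data:

```python
import numpy as np

def explained_variance(eigenvalues):
    """Per-component and cumulative percentage of variance from eigenvalues."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    pct = 100.0 * ev / ev.sum()        # each component's share of total variance
    return pct, np.cumsum(pct)         # cumulative share, ending at 100%

# Illustrative eigenvalues (not the thesis data).
pct, cum = explained_variance([7.72, 1.99, 0.81, 0.30, 0.24, 0.20])
print(cum.round(1))
```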
Table 3.8 Co-variance matrix for human serum data
V1    V2    V3    V4    V5    V6    V7    V8    V9    V10   V11
0.160 0.146 0.138 0.115 0.083 0.092 0.082 0.075 0.052 0.053 0.056
0.146 0.443 0.385 0.221 0.291 0.322 0.307 0.249 0.087 0.091 0.074
0.138 0.385 0.545 0.350 0.423 0.436 0.414 0.309 0.132 0.122 0.111
0.115 0.221 0.350 0.460 0.440 0.462 0.468 0.348 0.184 0.162 0.142
0.083 0.291 0.423 0.440 1.096 1.199 1.193 0.963 0.512 0.477 0.396
0.092 0.322 0.436 0.462 1.199 1.517 1.574 1.304 0.739 0.691 0.588
0.082 0.307 0.414 0.468 1.193 1.574 1.796 1.547 0.895 0.862 0.734
0.075 0.249 0.309 0.348 0.963 1.304 1.547 1.549 0.977 0.962 0.859
0.052 0.087 0.132 0.184 0.512 0.739 0.895 0.977 0.897 0.899 0.876
0.053 0.091 0.122 0.162 0.477 0.691 0.862 0.962 0.899 1.039 1.042
0.056 0.074 0.111 0.142 0.396 0.588 0.734 0.859 0.876 1.042 1.146
Table 3.9 shows the eigenvalues and the cumulative percentage of
variance for the eigenvalues from principal component analysis of the human serum
data. Among them, the first three PCs, having the maximum variances, are taken
for our analysis. One should choose the components (variables) having the largest
variance in the most significant eigenvectors. The first four eigenvalues and their
respective percentages of variance, as given in the table, account for 93
percent of the total variation in the independent variables. Similarly, the first
three eigenvalues account for 90 percent of the total variation in the
independent variables. The remaining eigenvalues have much less variance,
and therefore these variables are no doubt interrelated with the rest of the
important variables. Co-variances of eigenvectors are calculated for all
variables in order to find the eigenvalues of the principal components.
Table 3.9 Eigenvalues and their percentage of variance for human serum data

Principal Component   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
 1    7.72484      66.199     66.199
 2    1.9893       17.048     83.247
 3    0.807397      6.9191    90.166
 4    0.30341       2.6001    92.766
 5    0.236446      2.0263    94.792
 6    0.1978        1.6951    96.487
 7    0.116029      0.99433   97.482
 8    0.0880234     0.75433   98.236
 9    0.0743113     0.63682   98.873
10    0.0647379     0.55478   99.248
11    0.0389406     0.33371   99.762
12    0.027799      0.23823   99.999
Choosing the number of PCs is also an important issue. There are
12 principal components for the human serum data, of which only the components
having the maximum total variance are to be chosen. This can be done with the
help of a scree plot. In Figure 3.11, the number of components to be chosen is
three, determined from the point at which the curve's steepness changes sharply.
Hence the first three components can be used to capture most of the total variance.
The scree graph approach to deciding the number of PCs is rather ad hoc and
subjective.
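Because the scree graph approach is subjective, a complementary numeric rule is often used: keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. A sketch of this rule, applied to the human serum eigenvalues from Table 3.9 with an assumed 90% threshold; the function name is illustrative.

```python
import numpy as np

def n_components_for(eigenvalues, threshold=90.0):
    """Smallest number of components whose cumulative variance >= threshold (%)."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cum = 100.0 * np.cumsum(ev) / ev.sum()
    # First index where the cumulative percentage reaches the threshold.
    return int(np.searchsorted(cum, threshold) + 1)

# Eigenvalues of the human serum data (Table 3.9).
eigs = [7.72484, 1.9893, 0.807397, 0.30341, 0.236446, 0.1978,
        0.116029, 0.0880234, 0.0743113, 0.0647379, 0.0389406, 0.027799]
print(n_components_for(eigs, threshold=90.0))  # -> 3
```

This agrees with the scree plot: the first three components cover just over 90% of the total variance.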
Figure 3.11 Scree plot for human serum data to determine number of
components
The reduced representation of the human serum data, shown in Table 3.10,
is used to cluster patterns in the dataset. The number of clusters
produced by the Hybrid Clustering Technique after PCA is performed is considered
optimum.
Table 3.10 Reduced representations of human serum data
15MIN 30MIN 1HR   2HR   4HR   6HR   8HR   12HR  16HR  20HR  24HR  28HR
0.160 0.146 0.138 0.116 0.083 0.092 0.082 0.076 0.052 0.053 0.057 0.081
0.146 0.444 0.385 0.221 0.291 0.322 0.307 0.249 0.087 0.091 0.074 0.094
0.138 0.385 0.545 0.350 0.423 0.436 0.414 0.309 0.132 0.122 0.111 0.129
0.115 0.221 0.350 0.460 0.440 0.462 0.468 0.348 0.185 0.162 0.142 0.178
0.083 0.291 0.423 0.440 1.097 1.199 1.193 0.963 0.512 0.477 0.396 0.341
0.092 0.322 0.436 0.462 1.199 1.517 1.574 1.304 0.739 0.691 0.588 0.535
0.082 0.307 0.414 0.468 1.193 1.574 1.796 1.547 0.895 0.862 0.735 0.680
0.076 0.250 0.310 0.349 0.963 1.304 1.547 1.550 0.978 0.962 0.860 0.734
0.052 0.087 0.132 0.184 0.512 0.740 0.895 0.978 0.898 0.899 0.876 0.7381
0.053 0.091 0.122 0.162 0.477 0.691 0.862 0.963 0.899 1.039 1.042 0.850
0.056 0.074 0.111 0.142 0.396 0.588 0.734 0.859 0.876 1.042 1.146 0.910
0.081 0.094 0.134 0.178 0.341 0.535 0.680 0.734 0.738 0.850 0.910 1.047
3.5.2 Yeast Data
This dataset contains 201 rows and 12 columns. The rows
represent genes and the columns are the dimensions of each gene. It is
obvious that the complexity of analysis increases if the dimensionality of the
dataset is high. Those dimensions need to be reduced in order to make the dataset
amenable to cluster analysis. The significance of reducing the dimensions of
microarray data is discussed in sections 2.7 and 2.7.2.
PCA calculates eigenvectors and eigenvalues relevant to the
data using the covariance matrix given in Table 3.11.
Table 3.11 Co-variance matrix for yeast dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0.090 0.051 0.041 0.045 0.027 0.029 0.051 0.053 0.013 0.014 0.007 0.027
0.051 0.204 0.096 0.027 0.032 0.039 0.043 0.033 -0.019 -0.038 -0.036 -0.019
0.041 0.096 0.105 0.030 0.027 0.025 0.030 0.038 -0.002 -0.010 -0.011 0.009
0.045 0.027 0.030 0.083 0.039 0.033 0.038 0.035 0.012 0.017 0.011 0.023
0.027 0.032 0.027 0.039 0.162 0.119 0.080 0.045 -0.013 -0.003 -0.018 -0.030
0.029 0.039 0.025 0.033 0.119 0.199 0.156 0.100 0.040 0.042 0.022 0.042
0.051 0.043 0.030 0.038 0.080 0.156 0.239 0.174 0.072 0.095 0.077 0.102
0.053 0.033 0.038 0.035 0.045 0.100 0.174 0.283 0.156 0.172 0.153 0.130
0.013 -0.019 -0.002 0.012 -0.013 0.040 0.072 0.156 0.243 0.222 0.207 0.181
0.014 -0.038 -0.010 0.017 -0.003 0.042 0.095 0.172 0.222 0.303 0.279 0.207
0.007 -0.036 -0.001 0.011 -0.018 0.022 0.077 0.153 0.207 0.279 0.318 0.203
0.027 -0.019 0.009 0.023 -0.030 0.042 0.102 0.130 0.181 0.207 0.203 0.316
An eigenvector would be represented as an expression profile that
is most representative of the data, and eigenvalues can be thought of as a
quantitative assessment of how much a component represents the data. The
higher the eigenvalue of a component, the more representative it is of the
data.
Eigenvalues and the cumulative percentage of variance for the
eigenvalues are calculated on the yeast data by applying PCA, as shown in
Table 3.12. Among them, the first three PCs, having the maximum variances, are
considered in the analysis.
Table 3.12 Eigenvalues and their percentage of variance for yeast data

Principal Component   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
1 1.16049 45.529 45.529
2 0.522087 20.483 66.012
3 0.230884 9.0582 75.070
4 0.1486 5.83 80.900
5 0.126627 4.9679 85.860
6 0.0984549 3.8627 89.730
7 0.076368 2.9961 91.959
8 0.0539829 2.1179 94.076
9 0.0385623 1.5129 95.589
10 0.0366699 1.4387 97.027
11 0.031587 1.2392 98.296
12 0.0245814 0.9644 99.983
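The quantities tabulated above come from an eigen-decomposition of the covariance matrix. A minimal sketch follows, using illustrative random data rather than the actual yeast measurements.

```python
import numpy as np

# Illustrative data standing in for the yeast expression matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 12))
cov = np.cov(A, rowvar=False)

# eigh is appropriate because a covariance matrix is symmetric.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]            # sort descending, as in Table 3.12
eigvecs = eigvecs[:, ::-1]

pct = 100.0 * eigvals / eigvals.sum()   # percentage of variance
cum = np.cumsum(pct)                    # cumulative percentage

print(round(cum[-1], 3))  # 100.0: all components together explain everything
```

The cumulative column necessarily ends at (approximately) 100%, which is the property exploited when deciding how many components to retain.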
Another difficult task is choosing the number of PCs. Table 3.12 contains
12 principal components, and it is required to select the components that
account for the maximum share of the total variance in the independent
variables. Of the 12 principal components on the yeast data, the first four
carry the maximum total variance. There is a sharp change of steepness in
Figure 3.12 at the value three on the x axis, which indicates the number of
components to be chosen.
Figure 3.12 Scree plot for yeast data to determine the number of
components
Table 3.13 shows a reduced representation of the yeast dataset that can
be used for further analysis, such as clustering similar patterns in the
dataset. The Hybrid Clustering Technique (HCT), discussed in Chapter 4, is
used to cluster the data patterns. The dimensions are reduced from 12 to 4
after PCA is applied, yet there is no difference in the final results, as
expected. Therefore, the PCA-applied data are more efficient, and applying
clustering techniques to these data produces optimum results.
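The reduced representation itself is obtained by projecting the centred data onto the leading eigenvectors. The sketch below uses illustrative data; the choice k = 3 is an assumption following the scree plot discussion above.

```python
import numpy as np

# Illustrative gene-by-condition matrix, not the actual yeast data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 12))

Xc = X - X.mean(axis=0)                  # centre each condition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # eigenvectors in descending order
eigvecs = eigvecs[:, order]

k = 3                                    # e.g. chosen from the scree plot
X_reduced = Xc @ eigvecs[:, :k]          # project onto the first k PCs

print(X_reduced.shape)  # (100, 3): 12 dimensions reduced to 3
```

Each gene keeps one coordinate per retained component, so the clustering step that follows operates on far fewer dimensions.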
Table 3.13 Reduced representation of yeast data
15 MIN 30 MIN 1 HR 2 HR 3 HR 4 HR 5 HR 6 HR 7 HR 8 HR 9 HR 10 HR
0.085 0.048 0.038 0.038 0.025 0.029 0.050 0.053 0.012 0.017 0.011 0.030
0.048 0.202 0.092 0.027 0.032 0.042 0.045 0.034 -0.03 -0.04 -0.04 -0.02
0.038 0.092 0.104 0.027 0.028 0.026 0.030 0.039 -0.01 -0.09 -0.01 0.006
0.082 0.028 0.027 0.081 0.036 0.033 0.037 0.032 0.010 0.014 0.011 0.021
0.025 0.031 0.027 0.037 0.162 0.125 0.083 0.048 -0.01 -0.00 -0.01 -0.03
0.029 0.042 0.025 0.033 0.125 0.206 0.146 0.107 0.039 0.043 0.025 0.044
0.050 0.045 0.030 0.037 0.083 0.164 0.243 0.183 0.072 0.099 0.078 0.105
0.053 0.034 0.039 0.032 0.048 0.107 0.183 0.292 0.159 0.179 0.159 0.134
0.012 -0.02 -0.01 0.010 -0.01 0.040 0.072 0.16 0.243 0.224 0.205 0.181
0.018 -0.04 -0.01 0.013 -0.00 0.042 0.098 0.179 0.224 0.306 0.284 0.211
0.010 -0.03 -0.01 0.011 -0.01 0.025 0.079 0.158 0.205 0.284 0.329 0.204
0.029 -0.02 0.00 0.022 -0.03 0.044 0.105 0.134 0.181 0.211 0.204 0.319
3.5.3 Lung Cancer Data
The lung cancer gene expression dataset is smaller than the datasets used
in the previous sections. It consists of 20 rows and 4 columns, where the
rows represent genes and the columns are the dimensions of each gene. PCA
calculates the eigenvectors and eigenvalues of the data from its covariance
matrix, given in Table 3.15.
Table 3.14 Eigenvalues and their percentage of variance of cancer data

Principal Components   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
         1             5.15584           63.142                      63.142
         2             1.72315           21.103                      84.245
         3             0.834011          10.214                      94.459
         4             0.45248            5.5414                    100.00
91
Table 3.14 shows the eigenvalues and the cumulative percentage of variance
from principal component analysis of the cancer dataset. Among them, the
first two PCs, which have the maximum variances, are taken for analysis.
Table 3.15 Co-variance matrix for lung cancer dataset

        V1        V2        V3        V4
V1    2.2651    1.1119    0.82628   0.5424
V2    1.1119    1.8782    1.3949    1.1108
V3    0.82628   1.3949    1.8904    1.2022
V4    0.5424    1.1108    1.2022    2.1318
A scree plot is drawn to choose the PCs that account for the maximum share
of the total variance in the independent variables, as shown in Figure 3.13.
The point at which the knee bends down towards the x axis is two; this value
is taken as the appropriate number of PCs that covers the total variance.
Figure 3.13 Scree plot for lung cancer data to determine the number of
components
Table 3.16 shows a reduced representation of the cancer dataset that can
be used for further analysis, such as clustering similar patterns in the
dataset. The Hybrid Clustering Technique is applied to this reduced data. It
is demonstrated that the PCA-applied dataset produces optimum results after
applying the Hybrid Clustering Technique.
Table 3.16 Reduced representation of lung cancer data
2 HR 3 HR 4 HR 5 HR
2.265 1.111 0.826 0.542
1.112 1.878 1.394 1.110
0.826 1.394 1.890 1.202
0.542 1.110 1.202 2.132
3.5.4 Blood Cancer Data
This dataset contains 1023 genes as rows and 25 dimensions as columns. The
dimensions have to be reduced in order to make the dataset fit for cluster
analysis. As discussed in Section 2.7.1, a large number of dimensions may
lead to inaccuracy and may not comply with the requirements of many
clustering techniques.
Eigenvectors can be thought of as the preferential directions of a dataset,
or in other words, the main patterns in the data. PCA can be applied to two
profiles, one on genes and another on conditions; in this research it is
restricted to PCA on genes only. For PCA on genes, an eigenvector can be
interpreted as an expression profile that is most representative of the data,
and an eigenvalue can be regarded as a quantitative assessment of how much
its component represents the data. The higher the eigenvalue of a component,
the more representative it is of the dataset.
Eigenvalues can also be seen as the level of explained variance, expressed
as a percentage of the total variance. By themselves, eigenvalues are not
informative; the percentage of variance explained depends on how well all
the components summarize the data. In theory, the sum of all components
explains 100% of the variability in the data.
Table 3.17 shows the eigenvalues and the cumulative percentage of variance
from principal component analysis of the blood cancer data. Among them, the
first three PCs, which have the maximum variances, are taken for this
analysis. One should choose the components (variables) having the largest
variance in the most significant eigenvectors. The first four eigenvalues
and their respective percentages of variance, as given in the table, provide
about 92 percent of the total variation in the independent variables.
Similarly, the first three eigenvalues account for about 90 percent of the
total variation in the independent variables.
Table 3.17 Eigenvalues and their percentage of variance of blood cancer data

Principal Components   Eigenvalue     Percentage of Variance   Cumulative Percentage of Variance
         1             5.17517E+06         84.376                      84.376
         2             267685               4.3643                     88.740
         3             105617               1.722                      90.462
         4             96301.9              1.5701                     92.032
         5             70243.6              1.1452                     93.177
         6             54264.5              0.88472                    94.062
         7             49205.1              0.80224                    94.864
         8             44421.6              0.72425                    95.588
         9             43432.4              0.70812                    96.296
        10             35126.6              0.5727                     96.868
        11             28881.5              0.47088                    97.339
        12             24576.4              0.40069                    97.749
        13             20498.1              0.3342                     98.074
        14             17639.1              0.28759                    98.362
        15             15249.1              0.24862                    98.610
        16             13004                0.21202                    98.822
        17             11499.7              0.18749                    99.010
        18             9769.19              0.15928                    99.169
        19             9512.29              0.15509                    99.324
        20             9275.67              0.15123                    99.475
        21             8586.16              0.13999                    99.615
        22             6981.69              0.11383                    99.729
        23             6355.08              0.10361                    99.832
        24             5796.82              0.094511                   99.927
        25             4402.97              0.071786                   99.999
Seven of the eigenvalues have very little variance, and therefore the
corresponding variables are no doubt interrelated with the rest of the
important variables. The covariance matrix is calculated over all variables
in order to find the eigenvalues of the principal components.
Choosing the number of PCs is also an important issue. There are 25
principal components on the blood cancer data, of which only the components
having the maximum total variance are to be chosen. This can be done with
the help of a scree plot. In Figure 3.14, the number of components to be
chosen is three, read from the point at which the curve becomes steep. Hence
the first three components are used to capture most of the total variance.
The scree graph approach to deciding the number of PCs is, however, rather
ad hoc and subjective.
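A common, less subjective alternative is to keep the smallest number of components whose cumulative percentage of variance crosses a chosen threshold. The sketch below, written for this illustration, applies that rule to the leading percentages of variance from Table 3.17.

```python
# Leading percentages of variance from Table 3.17 (blood cancer data).
pct = [84.376, 4.3643, 1.722, 1.5701, 1.1452, 0.88472,
       0.80224, 0.72425, 0.70812, 0.5727, 0.47088]

def n_components(pct, threshold):
    """Smallest number of components whose cumulative variance
    percentage reaches the given threshold."""
    total = 0.0
    for i, p in enumerate(pct, start=1):
        total += p
        if total >= threshold:
            return i
    return len(pct)

print(n_components(pct, 90.0))  # 3: matches the scree plot choice above
```

With a 90% threshold this rule also selects three components, agreeing with the scree plot reading while being reproducible.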
Figure 3.14 Scree plot for blood cancer data to determine number of
components
The number of clusters produced by the Hybrid Clustering Technique after
performing PCA is considered optimum. As discussed in the previous sections,
the efficacy of the new clustering technique is increased after reducing the
dimensions of the original dataset. Table 3.18 shows the percentage of
reduction in dimensions for the four datasets, ranging from 40% to 66%.
Table 3.18 Original dimensions versus reduced dimensions
Dataset         Original dimensions   Reduced dimensions   Percentage of reduction
Human serum             12                     8                   66 %
Yeast                   12                     8                   66 %
Lung cancer              4                     2                   50 %
Blood Cancer            25                    10                   40 %
3.6 CONCLUSION
Inconsistent data such as outliers need to be removed before clustering to
improve the quality of the results. It is found that the algorithmic method
called distance-based outlier detection is reliable for gene expression
data, whereas the graphical method known as the box plot is suitable only
for small datasets. Dimensionality reduction is carried out using Principal
Component Analysis on four datasets, and the efficiency of the new hybrid
clustering algorithm is significantly improved due to the considerable
reduction in dimensions, as discussed in the following chapter.
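The distance-based outlier detection mentioned above can be sketched as follows. This is a minimal illustration on synthetic data; the particular variant (flagging points whose distance to their k-th nearest neighbour is unusually large) is an assumption, not necessarily the exact procedure used in this chapter.

```python
import numpy as np

# 50 ordinary points plus one injected far-away point (index 50).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(50, 4)),
               [[10.0, 10.0, 10.0, 10.0]]])

def kth_nn_distance(X, k=3):
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero distance to the point itself

scores = kth_nn_distance(X, k=3)
# Flag points whose k-th NN distance is far above the typical value.
outliers = np.where(scores > scores.mean() + 3 * scores.std())[0]
print(outliers)  # flags the injected point at index 50
```

On gene expression data the same idea applies with genes as points, which is what makes the method scale beyond what a box plot can visualise.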