CHAPTER 3
PREPROCESSING ON MICROARRAY DATA
3.1 INTRODUCTION
One of the most critical steps in a data mining process is the
preparation and transformation of the initial dataset into a suitable form. This
task has received little attention in the research literature, mostly because it is
considered too application specific. However, in most data mining applications,
some parts of the data preparation process, or sometimes even the entire process,
can be described independently of the application and the data mining method.
Many transformations may be needed to produce features more suitable for
selected data mining methods such as prediction or classification. In most
cases, human assistance is required to find the best transformation for a
given method or application (Yong Shi 2008).
In general, real-world databases are highly susceptible to noisy,
missing, and inconsistent data due to their typically huge size, often several
gigabytes or more. In order to improve the quality of the data, preprocessing
is mandatory. There are a number of data preprocessing techniques: 1) data
cleaning is applied to remove noise and correct inconsistencies in the data, 2)
data integration merges data from multiple sources into a coherent data store,
such as a data warehouse or a data cube, and 3) data transformations, such as
normalization, are applied to make the data suitable for further analysis. These
may improve the accuracy and efficiency of data mining techniques. Data
reduction methods are used to reduce the data size, for instance by aggregating,
eliminating redundant features, or clustering. These data preprocessing
techniques, when applied prior to mining, can substantially improve the
overall quality of the patterns mined. These methods are organized into the
following categories: data cleaning, data integration and transformation, and
data reduction (Yong Shi 2008, Hawkins 1980, Pang-Ning Tan et al 2009). In
this research work, two preprocessing techniques, namely outlier detection
(anomaly detection) and dimensionality reduction, are used to improve the
quality of the data and, as a sequel, the efficacy of the clustering method.
3.2 ANOMALIES DETECTION AND REMOVAL
Real-world data tend to be incomplete, noisy, and inconsistent, as
discussed above. As outliers (anomalies) can significantly impact the quality
of analysis of microarray gene expression data, they are given particular attention here.
The goal of anomaly detection is to find objects that are different from most
other objects. Often, anomalous objects are known as outliers (Pang-Ning Tan
et al 2009).
Anomaly detection is an important branch of data mining, which is the
discovery of data that deviate greatly from other data patterns (Hawkins 1980,
Mansur 2005). Detecting and removing outliers is very important in data
mining. For example, errors in large databases are extremely common, so an
important property of a data mining algorithm is robustness with respect to
outliers in the database. For instance, a relatively small number of outliers can
alter the set of clusters produced by a clustering algorithm (Bernard Chen et al
2005). Machine learning researchers often use the concept of noise rather than
that of outliers (Jorma Laurikkala et al 2000, Jan Dupac et al 2002). There are a
variety of anomaly detection approaches from several areas, including
statistics, machine learning, and data mining. All try to capture the idea that
anomalous data objects are unusual or in some way inconsistent with other
objects. Some common causes of anomalies are human errors often found
during data collection and data entry, and errors while integrating data from
different sources.
Data mining is a process of extracting valid, previously unknown,
and ultimately comprehensible information from large datasets and using it
for many applications (Yu et al 2002). However, many problems exist in
mining large datasets, such as data redundancy, unspecific attribute values,
incomplete data, and outliers (Breunig et al 2000). An outlier is
defined as a data point which is very different from the rest of the data based on
some measure (Pang-Ning Tan et al 2009, Chao Yan et al 2001, Jorma
Laurikkala et al 2000, Barnett and Lewis 1987). Such a point often contains
useful information on abnormal behavior of the system described by the data
(Aggarwal and Yu 2005, Jiawei Han and Micheline Kamber 2005). On the
other hand, many data mining algorithms in the literature find outliers as a
by-product of clustering algorithms. From the viewpoint of a clustering
algorithm, outliers are objects not located in the clusters of a dataset (Breunig et al
2000).
The outlier detection problem is one of the most interesting problems in
data mining research. Recently, a few studies have been conducted on
outlier detection for large datasets (Aggarwal and Yu 2005). Many data
mining algorithms try to minimize the influence of outliers or eliminate them
altogether. However, this could result in the loss of important hidden
information, since one person's noise could be another person's signal (Knorr
et al 2000, Koji-Kadota et al 2003).
Outliers can render the data abnormal. Since normality is one of the
assumptions for many statistical tests, finding and eliminating the
influence of outliers may render the data normal and appropriate for analysis
using those tests. Just because a value is extreme compared to the
rest of the data does not necessarily mean it is an anomaly, or
invalid, or should be removed. The subject chose to respond with that value,
so removing that value is arbitrarily throwing away data simply because it
does not fit the normality assumption. Conducting research is about
discovering empirical reality. If the subject chose to respond with that value,
then that data is a reflection of reality, so removing the outlier is the antithesis
of why one conducts research. One solution is to analyze the data both with the
outlier and without the outlier, because each analysis gives a separate type of
information (Suresh and Dinakaran 2009, www.psychwiki.com/wiki 2009).
Outlier detection, or outlier mining, is the process of identifying
outliers in a set of data. The outlier detection technique finds applications in
finance, marketing, fraud detection, intrusion detection, ecosystem
distributions, public health, medicine and the analysis of gene expression data
(Yu et al 2002). Thus, outlier detection and analysis is an interesting and
important data mining task. The outlier detection process in data mining is shown
in Figure 3.1.
Figure 3.1 Outlier detection in Data Mining
3.3 NEED FOR OUTLIER ANALYSIS
Outlier analysis of gene expression data is crucial: a gene exception may
yield a) no harm and no benefit, b) a harmful genetic defect (such as
haemophilia, that is, uncontrolled bleeding) or c) an improvement (such as
immunity to a certain disease). The last two are no doubt of most biological
importance. As discussed above, outlier analysis is similar to clustering,
which finds clusters containing similar patterns. Although some clustering
algorithms (Ester et al 1996, Zhang et al 1996, Sheikholeslami 1998, Agrawal
et al 1998) can be applied to outlier detection, they are actually insensitive to
outliers, as they are mainly meant for clustering. Their results are often
inaccurate (Chao Yan et al 2001), so outlier analysis needs its own
algorithms. There have been many algorithms for outlier analysis in recent
years. Yet these algorithms are all vulnerable to high dimensional data like
microarray gene expression data. Gene expression data are inherently linked
to high dimensionality.
In order to handle sparsity problems in high dimensionality,
algorithms need to be developed exclusively for such data. They should
provide interpretability in terms of the factors contributing to the abnormality.
Proper measures must be identified in order to account for the physical
significance of the definition of an outlier in k-dimensional subspace. They
should remain computationally efficient for very high dimensional
problems. They should give importance to local data behaviour while
determining whether a point is an outlier. Biological experiments often
pinpoint quite a large number of segments in a sequence, thus leading
inevitably to high dimensional data. Additionally, more advanced
technologies like microarrays may yield data with higher dimensionality (Chao
Yan et al 2001), as longer sequences can be put to the test. The gene
expression datasets tested in the present work have dimensionality ranging from 2 to 90.
In some cases, there may be hundreds of dimensions. An outlier detection
algorithm for gene expression data must have the ability to deal with high
dimensionality.
3.4 OUTLIER ANALYSIS ON MICROARRAY DATA
Among the many outlier detection methods discussed in section
2.6, the distance based technique was chosen to detect outliers present in
microarray datasets such as human serum, yeast and cancer. This technique is
considered more suitable for microarray gene expression data, which are
expressed at different points of time. In the following sections, the results of the
outlier detection technique on the four gene expression datasets
mentioned above are discussed with suitable examples.
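The distance based idea can be sketched in a few lines of Python. This is a minimal illustration, not the exact implementation used in this work: it scores each object by its mean Euclidean distance to its k nearest neighbours and reports the N highest-scoring objects as outliers. The function names, parameter values and toy data are all illustrative assumptions.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row of X by its mean distance to its k nearest neighbours."""
    # Pairwise Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row and skip the first column (the zero self-distance).
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)

def top_n_outliers(X, n, k=5):
    """Return indices of the n rows with the largest k-NN distance score."""
    scores = knn_outlier_scores(X, k)
    return np.argsort(scores)[::-1][:n]

# Toy data: nine tightly grouped points plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(9, 3)), [[5.0, 5.0, 5.0]]])
print(top_n_outliers(X, n=1, k=3))  # index of the far point
```

In practice the input parameter N plays the same role as in the experiments below: it caps how many of the highest-scoring values are reported as outliers.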
Another outlier detection method used in this research is the graphical
method. Among the many graphical methods, the box plot was constructed for the
original dataset to indicate the outliers. Box plots are an excellent tool
for conveying location and variation information in data sets. A box plot is
formed by a vertical axis, which represents the response variable, and a horizontal
axis, which represents the factor of interest. There is a useful variation of the
box plot that more specifically identifies outliers.
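The outlier rule underlying the box plot can be sketched directly. This is a generic illustration assuming Tukey's standard fences (values outside Q1 - 1.5 IQR and Q3 + 1.5 IQR), the convention most box plot implementations use; the function name and sample values are illustrative.

```python
import numpy as np

def boxplot_outliers(values, whis=1.5):
    """Flag values outside Tukey's fences: [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - whis * iqr, q3 + whis * iqr
    values = np.asarray(values)
    return values[(values < lo) | (values > hi)]

# A small sample with one extreme expression value.
data = [0.1, 0.2, 0.15, 0.3, 0.25, 0.18, 0.22, 6.44]
print(boxplot_outliers(data))  # -> [6.44]
```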
3.4.1 Human Serum Data
Outliers are detected and removed from the human serum dataset
(http://genome-www.stanford.edu/serum/ referred on 04.03.2009), and the
cleaned data are applied to the Hybrid Clustering Technique in the next chapter.
A sample input dataset for human serum is given in Table 3.1.
Table 3.1 Sample input dataset for human serum
GeneIndex 15MIN 30MIN 1HR 2HR 4HR 6HR 8HR 12HR 16HR 20HR 24HR 28HR
361771 -0.47 -3.32 -0.81 0.11 -0.6 -1.36 -1.03 -1.84 -1 -0.6 -0.94 -0.84
120386 -0.45 1.62 1.83 0.03 0.33 0.25 -0.07 0.23 -0.4 -0.1 -0.36 -0.32
26474 1.42 3.03 3.67 0.58 0.66 0.78 0.3 -0.38 0.19 -0.01 -0.17 0.11
162772 0.56 2.05 2.43 0 1.36 0.06 -0.58 -0.04 -0.76 0.16 0.21 0.07
254436 0.01 2.24 3.41 1.58 1.86 0.69 0.08 -0.22 0.74 0.61 -0.32 -0.23
510136 -0.07 -0.14 0.01 0.1 2.8 1.34 0.56 0.55 0.48 0.18 0.33 -0.3
23464 -0.54 -0.27 -1.06 0.43 1.66 1.7 1.52 0.64 0.21 0.2 -0.12 0.23
364959 0.07 0.5 -0.09 0.01 1.57 1.71 1.54 0.86 -0.09 -0.49 -0.64 0.71
108837 0.25 0.82 0.78 0.61 2.26 2.61 1.77 1.17 0.66 -0.18 -0.29 1.14
328692 1.42 1.27 1.91 2.63 5.28 6.44 4.68 3.89 2.75 1.44 1.28 0.53
The data set consists of 517 objects and 12 attributes each,
where the objects are genes and the attributes are expression values of the
corresponding gene at different time points. Outliers are removed from this
dataset using the distance based outlier detection technique. The original data set
has been screened by this method and a maximum of twenty-eight outliers out
of 6204 values were identified, as shown in Table 3.2.
Table 3.2 Outliers detected on human serum dataset
3.3400 3.3800 3.4000 3.4100 3.4400 3.4600 3.4700 3.4900 3.5200 3.5400
3.5500 3.5700 3.6300 3.6500 3.6700 3.7800 3.8900 3.9300 4.0400 4.0500
4.1200 4.1700 4.3300 4.4000 4.6800 4.8200 5.2800 6.4400
Figure 3.2 shows a box plot for the given human serum dataset,
which represents outliers in the form of circles. There are 28 outliers
designated in this plot among about 6204 observations. Though this method
of detecting outliers is suitable for datasets having a small number of objects
and attributes, it is not suitable for datasets having a large number of objects
with high dimensional data, because the values of objects are not visible due
to the overlapping of points and the restricted plot area, as shown in the figure.
Also, some of the points identified as outliers by the box plot are not really so,
for example the value -3.4 (lower extreme outlier). So it is not advisable to
use graphical methods for detecting outliers on high dimensional data, as the
results may not be reliable.
Figure 3.2 Box plot outliers on human serum data
A scatter plot is drawn for objects (genes) against expression values
expressed in different time points as shown in Figure 3.3. The x axis
represents genes and the y axis represents the expression values. It shows
clearly that some of the values above 3.340 are far away from other values, so
they can be assumed to be outliers.
Figure 3.3 Scatter plot outliers on human serum data
3.4.2 Yeast Data
Outliers in the microarray dataset are detected, and the result for the yeast
data is given in Table 3.4. The data set consists of 201 objects and 12
attributes, where the objects are genes and the attributes are expression values
of the corresponding gene at different points of time. Table 3.3 provides a sample
of the microarray yeast dataset; it is not the full dataset, only a part of the
original. The full dataset is given in the appendix. The distance based outlier
detection technique is used to detect and remove outliers in the yeast data.
Table 3.3 Sample input dataset for yeast microarray data
15 MIN 30 MIN 1 HR 2 HR 3 HR 4 HR 5 HR 6 HR 7 HR 8 HR 9 HR 10 HR
-0.57 -3.32 -0.81 0.11 -0.6 -1.26 -1.03 -1.84 -1 -0.6 -0.94 -0.84
0.62 0.07 0.2 0.29 -0.89 -0.45 -0.29 -0.29 -0.15 -0.45 -0.42 0.43
-0.56 0.08 -0.01 -0.79 -1.25 -2.18 -2.32 -1.84 -1 -0.97 -0.92 -1.64
-0.09 -0.07 -0.22 -0.6 -0.74 -1.06 -1.15 -1.06 -0.89 -0.89 -0.56 -0.86
-0.56 0.08 -0.01 -0.79 -1.25 -2.18 -2.32 -1.84 -1 -0.97 -0.92 -1.64
-0.09 -0.07 -0.22 -0.6 -0.74 -1.06 -1.15 -1.06 -0.89 -0.89 -0.56 -0.86
-0.03 -0.18 -0.58 -0.43 -0.64 -1.22 -1.64 -2.18 -2.06 -2 -1.12 -2.18
-0.18 -0.38 -0.54 -0.32 -0.84 -1.32 -1.47 -2.18 -2.32 -2.06 -1.12 -1.94
0.24 -0.45 -0.27 -0.34 -0.71 -1.36 -2.06 -2.56 -1.4 -1.51 -1.4 -1.18
0.15 0.04 -0.4 -0.32 -0.74 -0.94 -1.15 -1.64 -1.6 -0.97 -0.71 -0.81
-0.43 -0.07 -0.17 -0.45 -1.47 -2.32 -2.18 -2.4 -2.12 -2.64 -2.4 -1.84
-0.3 -0.1 -0.42 0.24 -0.89 -1.69 -1.56 -2.12 -1.69 -1.47 -1.47 -0.92
0.07 -0.06 -0.03 -0.2 -0.89 -1.69 -1.69 -1.29 -1.06 -1.47 -1.47 -1.32
0.11 0.56 0.55 0.03 -0.79 -1.03 -1.03 -0.67 -0.58 -0.81 -0.69 -0.67
0.38 0.34 0.1 0.07 -0.64 -1.25 -1.06 -0.71 -0.51 -0.79 -0.81 -0.25
Table 3.4 Outliers detected on yeast dataset
-2.1800 -2.1800 -2.2500 -2.3200 -2.3200 -2.3200 -2.3200 -2.3200 -2.3200 -2.4000
-2.4000 -2.4000 -2.4700 -2.5600 -2.6400 -2.6400 -2.6400 -3.3200
As discussed in section 2.6.4, a box plot is constructed for the
original dataset to indicate the outliers. Figure 3.4 shows a box plot for the yeast
dataset in which outliers are represented in the form of hollow circles. There
are only 18 outliers designated by this plot out of 2412 observations. This
method is not suitable for datasets having a large number of objects with high
dimensional data, as discussed in section 2.6.4.
Figure 3.4 Box plot outliers on yeast data
A scatter plot is drawn for objects (genes) against expression values
at different time points, as shown in Figure 3.5. The x axis
represents genes and the y axis the expression values of the yeast data.
It shows that some of the values at or beyond -2.180 may be considered outliers.
Figure 3.5 Scatter plot outliers on yeast data
3.4.3 Lung Cancer Data
The microarray lung cancer dataset is analyzed in this section to detect
outliers. This dataset consists of 20 objects and 4 attributes, where the
objects are genes and the attributes are expression values of the corresponding
gene at different time points. Sample input data for the cancer dataset are given in
Table 3.5. Outliers are detected in the dataset using two approaches. The
first approach is algorithmic, in which a distance based algorithm is used. An
input parameter N (maximum number of outliers) was set to five, but the method
found only four outliers out of 80 values. The outlier values are 6.176,
12.000, 12.589 and 12.727. The detected outliers are removed from the data
set before using it for further analysis. The second approach, the box plot, is
graphical: a box plot is constructed for the original dataset to indicate
outliers.
Table 3.5 Sample input dataset for lung cancer microarray data
Gene Index 2 HR 3 HR 4 HR 5 HR
1415670 8.621339512 7.5 8.6213 9.258
1415671 11.06818663 11.068 12.589 10.568
1415672 10.44145373 9.457 10.441 8.5648
1415677 8.513605331 7.254 8.5136 8.568
1415681 9.275654563 7 7.1 7.8
1415682 7.147069562 11 12.0 11.5
1415687 12.72740879 10 9.5 8
1415688 8.928937957 8.5 9 8.2
1415689 6.176931606 7 7.2 7.
A box plot is formed by a vertical axis, which represents the response
variable, and a horizontal axis, which represents the factor of interest. There is a
useful variation of the box plot that more specifically identifies outliers.
Figure 3.6 shows a box plot for the lung cancer dataset where outliers are
represented in the form of hollow circles. There are only four outliers
designated by this plot among about 80 observations. Figure 3.7 shows the
expression levels of genes, the x axis representing genes and the y axis the
expression values. It shows clearly that some of the values above 11.5 and
below 7.0 are far away from the other values, and hence these values are
considered outliers.
Figure 3.6 Box Plot outliers on lung cancer data
Figure 3.7 Expression levels of lung cancer genes
3.4.4 Blood Cancer Data
Blood cancer is a generalized term for malignancy that attacks
the blood, bone marrow, or lymphatic system. It refers to the abnormal growth of
cells normally found in the blood. There are three kinds of blood cancer:
leukemia, lymphoma, and multiple myeloma. In this research work, the focus
is on one of the most prevalent types of blood cancer, leukemia, which results
from the malignant transformation of white blood cells.
White blood cells help fight infections and are a key component of
the body's immune system. In people with leukemia, the bone marrow
produces an abnormal amount of white blood cells. These abnormal white
blood cells (leukemia cells) may crowd out normal white blood cells, red
blood cells, and platelets. This makes it hard for the blood cells to function
properly.
Outliers in the microarray dataset are detected, and the result for blood
cancer is given in Table 3.6. The dataset consists of 1023 objects and 25
attributes, where the objects are genes and the attributes are expression values
of the corresponding gene at different points of time. The distance based outlier
detection technique is used to detect and remove outliers in the blood cancer
data. The full blood cancer dataset is given in appendix-4, as its size is
too large to be accommodated in this section.
Table 3.6 Outliers detected on blood cancer dataset
0.3475 0.3476 0.3477 0.3482 0.3484 0.3488 0.3490 0.3494
0.3499 0.3500 0.3503 0.3504 0.3520 0.3522 0.3524 0.3526
0.3531 0.3532 0.3535 0.3536 0.3540 0.3541 0.3542 0.3544
0.3548 0.3552 0.3557 0.3558 0.3559 0.3562 0.3563 0.3565
0.3567 0.3569 0.3573 0.3574 0.3579 0.3586 0.3586 0.3589
0.3596 0.3608 0.3621 0.3625 0.3626 0.3628 0.3629 0.3632
0.3633 0.3640 0.3641 0.3642 0.3647 0.3647 0.3647 0.3647
0.3654 0.3654 0.3658 0.3658 0.3658 0.3661 0.3664 0.3679
0.3679 0.3679 0.3685 0.3697 0.3702 0.3705 0.3707 0.3709
0.3712 0.3714 0.3715 0.3717 0.3718 0.3721 0.3723 0.3726
0.3727 0.3740 0.3742 0.3747 0.3755 0.3761 0.3763 0.3772
0.3774 0.3775 0.3779 0.3780 0.3782 0.3787 0.3789 0.3791
0.3796 0.3796 0.3804 0.3804 0.3806 0.3810 0.3813 0.3821
0.3822 0.3823 0.3831 0.3833 0.3835 0.3843 0.3845 0.3847
0.3847 0.3849 0.3852 0.3854 0.3854 0.3854 0.3857 0.3860
0.3863 0.3869 0.3877 0.3878 0.3880 0.3888 0.3888 0.3891
0.3913 0.3926 0.3927 0.3937 0.3944 0.3944 0.3946 0.3951
0.3956 0.3958 0.3979 0.3980 0.3994 0.4004 0.4005 0.4008
0.4021 0.4029 0.4029 0.4030 0.4039 0.4044 0.4044 0.4046
0.4050 0.4050 0.4052 0.4052 0.4062 0.4064 0.4074 0.4078
0.4086 0.4090 0.4093 0.4100 0.4100 0.4108 0.4110 0.4116
0.4116 0.4118 0.4123 0.4128 0.4138 0.4140 0.4151 0.4159
0.4167 0.4170 0.4172 0.4178 0.4181 0.4184 0.4185 0.4187
0.4187 0.4188 0.4189 0.4196 0.4224 0.4234 0.4239 0.4243
0.4249 0.4252 0.4253 0.4258 0.4258 0.4274 0.4275 0.4275
0.4276 0.4281 0.4289 0.4292 0.4297 0.4300 0.4301 0.4307
0.4307 0.4308 0.4314 0.4317 0.4320 0.4328 0.4328 0.4330
0.4339 0.4339 0.4344 0.4352 0.4354 0.4360 0.4361 0.4363
0.4368 0.4370 0.4379 0.4384 0.4385 0.4386 0.4390 0.4392
0.4396 0.4404 0.4405 0.4408 0.4412 0.4428 0.4433 0.4434
0.4435 0.4436 0.4441 0.4442 0.4450 0.4455 0.4467 0.4473
0.4474 0.4475 0.4476 0.4477 0.4486 0.4488 0.4488 0.4492
0.4499 0.4503 0.4517 0.4518 0.4522 0.4531 0.4531 0.4534
0.4535 0.4552 0.4552 0.4554 0.4574 0.4586 0.4587 0.4589
0.4590 0.4591 0.4610 0.4616 0.4619 0.4628 0.4634 0.4638
As discussed in section 2.6.4, a box plot is constructed for the
original dataset to indicate the outliers. Figure 3.8 shows a box plot for the blood
cancer dataset in which outliers are represented in the form of asterisks. There
are 1319 outliers designated by this plot out of 25575 observations.
Figure 3.8 Box plot outliers on blood cancer data
This method is not suitable for datasets having a large number of
objects with high dimensional data, as discussed in section 2.6.4.
A scatter plot is drawn for objects (genes) against expression values
at different time points, as shown in Figure 3.9. The x axis
represents genes and the y axis the expression values of the blood cancer data. It
shows that some of the values above 0.3475 can be considered outliers.
Figure 3.9 Scatter plot outliers on blood cancer data
As discussed in the previous sections, results are very sensitive to
outliers: even a small number of outliers can make a drastic change in the final
result. The outliers detected in the four datasets are summarized in Table 3.7. It is
obvious that this susceptibility increases as the percentage of outliers increases.
Table 3.7 Outliers detected in datasets
Dataset         Original observations   Outliers detected
Human serum     6204                    28
Yeast           2412                    18
Lung cancer     80                      4
Blood cancer    25575                   1319
3.5 PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a classical technique to
reduce the dimensionality of a data set consisting of a large number of
variables. This is achieved by transforming the data to a new set of variables
(principal components) which are uncorrelated and ordered so that the first
few retain most of the variance present in all of the original variables. As a
result, dimensionality reduction using PCA can yield relatively low-
dimensional data, making it possible to apply techniques that do not work
well with high dimensional data like microarray gene expression data. More
details are given in section 2.7.1.
Figure 3.10 Framework of dimensionality reduction
The dimensionality reduction technique is applied to the gene expression
dataset from which outliers have been removed, as described in the previous
section. Figure 3.10 shows the process of performing dimensionality reduction;
the following sections demonstrate how the dimensionality of each dataset is
considerably reduced. The PCA algorithm is as follows:
Algorithm: Principal Component Analysis (PCA)
Input : Multidimensional data in a data matrix in which the rows are
genes and columns are conditions.
Output : Dimensions reduced dataset
Step 1 : Calculate mean value of each object in data matrix.
Step 2 : Calculate the covariance of data matrix.
Step 3 : Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 4 : Find the number of components and form a feature vector.
Step 5 : Transpose the feature vector and the mean-adjusted data matrix.
Step 6 : Derive the new dataset with reduced dimensions by multiplying the
feature vector and the mean-adjusted data.
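The steps above can be sketched in Python with NumPy. This is a minimal illustration of the algorithm, not the exact implementation used in this work; the random matrix merely stands in for a real gene expression dataset, and the function name is an assumption.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Reduce X (rows = genes, columns = conditions) to n_components dimensions.

    Follows the steps above: mean-adjust, covariance, eigendecomposition,
    feature-vector selection, projection.
    """
    # Step 1: mean of each column, then mean-adjust the data.
    mean = X.mean(axis=0)
    centred = X - mean
    # Step 2: covariance matrix of the conditions.
    cov = np.cov(centred, rowvar=False)
    # Step 3: eigenvectors and eigenvalues (symmetric matrix -> eigh).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: keep the eigenvectors with the largest eigenvalues (feature vector).
    order = np.argsort(eigvals)[::-1][:n_components]
    feature_vector = eigvecs[:, order]
    # Steps 5-6: project the mean-adjusted data onto the feature vector.
    return centred @ feature_vector

rng = np.random.default_rng(1)
X = rng.normal(size=(517, 12))            # e.g. 517 genes x 12 time points
reduced = pca_reduce(X, n_components=3)
print(reduced.shape)  # (517, 3)
```

The first column of the reduced data carries the most variance, the second the next most, and so on, which is what makes truncation to the first few components meaningful.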
In this section, the results of Principal Component Analysis (PCA)
applied to the microarray datasets are discussed. Four different microarray gene
expression datasets with a dimension range of 2 to 25 are taken for analysis. It is
observed that the dimensions are reduced considerably after PCA is
performed on these datasets, and it is believed that the reduced number of
dimensions would produce the same results as the original dimensions. This
was experimented on four datasets: a) the human serum dataset, b) the yeast dataset,
c) the lung cancer dataset and d) the blood cancer dataset. The description of the
datasets and their results are discussed in the following sub-sections.
3.5.1 Human Serum Data
This dataset contains 517 genes as rows and 12 dimensions as
columns. The dimensions have to be reduced in order to make the dataset fit
for cluster analysis. As discussed in section 2.7.1, a large number of dimensions
may lead to inaccuracy and may not comply with the requirements of many
clustering techniques.
PCA finds eigenvectors and eigenvalues relevant to the data
using a covariance matrix, as given in Table 3.8. Eigenvectors can be
thought of as preferential directions of a data set, or in other words, main
patterns in the data. PCA can be applied in two profiles, PCA on genes and
PCA on conditions; in this research it is restricted to PCA on genes only. For
PCA on genes, an eigenvector would be represented as an expression profile
that is most representative of the data, and eigenvalues can be thought of as a
quantitative assessment of how much a component represents the data. The
higher the eigenvalue of a component, the more representative it is of the data.
Eigenvalues can also be representative of the level of explained
variance as a percentage of the total variance. By themselves, eigenvalues are not
informative; the percentage of variance explained depends on how well all
the components summarize the data. In theory, the sum of all components
explains 100% of the variability in the data.
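The per-component and cumulative percentages of explained variance follow directly from the eigenvalues. A minimal sketch, using illustrative eigenvalues rather than the thesis data:

```python
import numpy as np

def explained_variance(eigenvalues):
    """Per-component and cumulative percentage of variance from eigenvalues."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    pct = 100.0 * ev / ev.sum()        # each component's share of total variance
    return pct, np.cumsum(pct)         # cumulative share, ending at 100%

# Illustrative eigenvalues (not the thesis data).
pct, cum = explained_variance([7.72, 1.99, 0.81, 0.30, 0.24, 0.20])
print(cum.round(1))
```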
Table 3.8 Co-variance matrix for human serum data
V1    V2    V3    V4    V5    V6    V7    V8    V9    V10   V11
0.160 0.146 0.138 0.115 0.083 0.092 0.082 0.075 0.052 0.053 0.056
0.146 0.443 0.385 0.221 0.291 0.322 0.307 0.249 0.087 0.091 0.074
0.138 0.385 0.545 0.350 0.423 0.436 0.414 0.309 0.132 0.122 0.111
0.115 0.221 0.350 0.460 0.440 0.462 0.468 0.348 0.184 0.162 0.142
0.083 0.291 0.423 0.440 1.096 1.199 1.193 0.963 0.512 0.477 0.396
0.092 0.322 0.436 0.462 1.199 1.517 1.574 1.304 0.739 0.691 0.588
0.082 0.307 0.414 0.468 1.193 1.574 1.796 1.547 0.895 0.862 0.734
0.075 0.249 0.309 0.348 0.963 1.304 1.547 1.549 0.977 0.962 0.859
0.052 0.087 0.132 0.184 0.512 0.739 0.895 0.977 0.897 0.899 0.876
0.053 0.091 0.122 0.162 0.477 0.691 0.862 0.962 0.899 1.039 1.042
0.056 0.074 0.111 0.142 0.396 0.588 0.734 0.859 0.876 1.042 1.146
Table 3.9 shows the eigenvalues and the cumulative percentage of
variance for the eigenvalues from principal component analysis of the human serum
data. Among them, the first three PCs, having the maximum variances, are taken
for our analysis. One should choose the components (variables) having the largest
variance in the most significant eigenvectors. The first four eigenvalues and their
respective percentages of variance, as given in the table, account for 93
percent of the total variation in the independent variables. Similarly, the first
three eigenvalues account for 90 percent of the total variation in the
independent variables. The remaining eigenvalues have much less variance,
and therefore these variables are no doubt interrelated with the rest of the
important variables. Co-variances of eigenvectors are calculated for all
variables in order to find the eigenvalues of the principal components.
Table 3.9 Eigenvalues and their percentage of variance for human serum data

Principal Component   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
 1    7.72484      66.199     66.199
 2    1.9893       17.048     83.247
 3    0.807397      6.9191    90.166
 4    0.30341       2.6001    92.766
 5    0.236446      2.0263    94.792
 6    0.1978        1.6951    96.487
 7    0.116029      0.99433   97.482
 8    0.0880234     0.75433   98.236
 9    0.0743113     0.63682   98.873
10    0.0647379     0.55478   99.248
11    0.0389406     0.33371   99.762
12    0.027799      0.23823   99.999
Choosing the number of PCs is also an important issue. There are
12 principal components for the human serum data, of which only the components
having the maximum total variance are to be chosen. This can be done with the
help of a scree plot. In Figure 3.11, the number of components to be chosen is
three, determined from the point at which the curve's steepness changes sharply.
Hence the first three components can be used to capture most of the total variance.
The scree graph approach to deciding the number of PCs is rather ad hoc and
subjective.
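Because the scree graph approach is subjective, a complementary numeric rule is often used: keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. A sketch of this rule, applied to the human serum eigenvalues from Table 3.9 with an assumed 90% threshold; the function name is illustrative.

```python
import numpy as np

def n_components_for(eigenvalues, threshold=90.0):
    """Smallest number of components whose cumulative variance >= threshold (%)."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cum = 100.0 * np.cumsum(ev) / ev.sum()
    # First index where the cumulative percentage reaches the threshold.
    return int(np.searchsorted(cum, threshold) + 1)

# Eigenvalues of the human serum data (Table 3.9).
eigs = [7.72484, 1.9893, 0.807397, 0.30341, 0.236446, 0.1978,
        0.116029, 0.0880234, 0.0743113, 0.0647379, 0.0389406, 0.027799]
print(n_components_for(eigs, threshold=90.0))  # -> 3
```

This agrees with the scree plot: the first three components cover just over 90% of the total variance.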
Figure 3.11 Scree plot for human serum data to determine number of
components
The reduced representation of the human serum data, shown in Table 3.10,
is used to cluster patterns in the dataset. The number of clusters
produced by the Hybrid Clustering Technique after PCA is performed is considered
optimum.
Table 3.10 Reduced representations of human serum data
15MIN 30MIN 1HR   2HR   4HR   6HR   8HR   12HR  16HR  20HR  24HR  28HR
0.160 0.146 0.138 0.116 0.083 0.092 0.082 0.076 0.052 0.053 0.057 0.081
0.146 0.444 0.385 0.221 0.291 0.322 0.307 0.249 0.087 0.091 0.074 0.094
0.138 0.385 0.545 0.350 0.423 0.436 0.414 0.309 0.132 0.122 0.111 0.129
0.115 0.221 0.350 0.460 0.440 0.462 0.468 0.348 0.185 0.162 0.142 0.178
0.083 0.291 0.423 0.440 1.097 1.199 1.193 0.963 0.512 0.477 0.396 0.341
0.092 0.322 0.436 0.462 1.199 1.517 1.574 1.304 0.739 0.691 0.588 0.535
0.082 0.307 0.414 0.468 1.193 1.574 1.796 1.547 0.895 0.862 0.735 0.680
0.076 0.250 0.310 0.349 0.963 1.304 1.547 1.550 0.978 0.962 0.860 0.734
0.052 0.087 0.132 0.184 0.512 0.740 0.895 0.978 0.898 0.899 0.876 0.7381
0.053 0.091 0.122 0.162 0.477 0.691 0.862 0.963 0.899 1.039 1.042 0.850
0.056 0.074 0.111 0.142 0.396 0.588 0.734 0.859 0.876 1.042 1.146 0.910
0.081 0.094 0.134 0.178 0.341 0.535 0.680 0.734 0.738 0.850 0.910 1.047
3.5.2 Yeast Data
This dataset contains 201 rows and 12 columns. The rows
represent genes and the columns are the dimensions of each gene. It is
obvious that the complexity of analysis increases if the dimensionality of the
dataset is high. Those dimensions need to be reduced in order to make the dataset
amenable to cluster analysis. The significance of reducing the dimensions of
microarray data is discussed in sections 2.7 and 2.7.2.
PCA calculates eigenvectors and eigenvalues relevant to the
data using the covariance matrix given in Table 3.11.
Table 3.11 Co-variance matrix for yeast dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0.090 0.051 0.041 0.045 0.027 0.029 0.051 0.053 0.013 0.014 0.007 0.027
0.051 0.204 0.096 0.027 0.032 0.039 0.043 0.033 -0.019 -0.038 -0.036 -0.019
0.041 0.096 0.105 0.030 0.027 0.025 0.030 0.038 -0.002 -0.010 -0.011 0.009
0.045 0.027 0.030 0.083 0.039 0.033 0.038 0.035 0.012 0.017 0.011 0.023
0.027 0.032 0.027 0.039 0.162 0.119 0.080 0.045 -0.013 -0.003 -0.018 -0.030
0.029 0.039 0.025 0.033 0.119 0.199 0.156 0.100 0.040 0.042 0.022 0.042
0.051 0.043 0.030 0.038 0.080 0.156 0.239 0.174 0.072 0.095 0.077 0.102
0.053 0.033 0.038 0.035 0.045 0.100 0.174 0.283 0.156 0.172 0.153 0.130
0.013 -0.019 -0.002 0.012 -0.013 0.040 0.072 0.156 0.243 0.222 0.207 0.181
0.014 -0.038 -0.010 0.017 -0.003 0.042 0.095 0.172 0.222 0.303 0.279 0.207
0.007 -0.036 -0.001 0.011 -0.018 0.022 0.077 0.153 0.207 0.279 0.318 0.203
0.027 -0.019 0.009 0.023 -0.030 0.042 0.102 0.130 0.181 0.207 0.203 0.316
An eigenvector would be represented as an expression profile that
is most representative of the data, and eigenvalues can be thought of as a
quantitative assessment of how much a component represents the data. The
higher the eigenvalue of a component, the more representative it is of the
data.
Eigenvalues and the cumulative percentage of variance for the
eigenvalues are calculated on the yeast data by applying PCA, as shown in
Table 3.12. Among them, the first three PCs, having the maximum variances, are
considered in the analysis.
Table 3.12 Eigenvalues and their percentage of variance for yeast data

Principal Component   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
1 1.16049 45.529 45.529
2 0.522087 20.483 66.012
3 0.230884 9.0582 75.070
4 0.1486 5.83 80.900
5 0.126627 4.9679 85.860
6 0.0984549 3.8627 89.730
7 0.076368 2.9961 91.959
8 0.0539829 2.1179 94.076
9 0.0385623 1.5129 95.589
10 0.0366699 1.4387 97.027
11 0.031587 1.2392 98.296
12 0.0245814 0.9644 99.983
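The quantities tabulated above come from an eigen-decomposition of the covariance matrix. A minimal sketch follows, using illustrative random data rather than the actual yeast measurements.

```python
import numpy as np

# Illustrative data standing in for the yeast expression matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 12))
cov = np.cov(A, rowvar=False)

# eigh is appropriate because a covariance matrix is symmetric.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]            # sort descending, as in Table 3.12
eigvecs = eigvecs[:, ::-1]

pct = 100.0 * eigvals / eigvals.sum()   # percentage of variance
cum = np.cumsum(pct)                    # cumulative percentage

print(round(cum[-1], 3))  # 100.0: all components together explain everything
```

The cumulative column necessarily ends at (approximately) 100%, which is the property exploited when deciding how many components to retain.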
Another difficult task is choosing the number of PCs. Table 3.12 contains
12 principal components, and it is required to select the components that
account for the maximum share of the total variance in the independent
variables. Of the 12 principal components on the yeast data, the first four
carry the maximum total variance. There is a sharp change of steepness in
Figure 3.12 at the value three on the x axis, which indicates the number of
components to be chosen.
Figure 3.12 Scree plot for yeast data to determine the number of
components
Table 3.13 shows a reduced representation of the yeast dataset that can
be used for further analysis, such as clustering similar patterns in the
dataset. The Hybrid Clustering Technique (HCT), discussed in Chapter 4, is
used to cluster the data patterns. The dimensions are reduced from 12 to 4
after PCA is applied, yet there is no difference in the final results, as
expected. Therefore, the PCA-applied data are more efficient, and applying
clustering techniques to these data produces optimum results.
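The reduced representation itself is obtained by projecting the centred data onto the leading eigenvectors. The sketch below uses illustrative data; the choice k = 3 is an assumption following the scree plot discussion above.

```python
import numpy as np

# Illustrative gene-by-condition matrix, not the actual yeast data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 12))

Xc = X - X.mean(axis=0)                  # centre each condition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # eigenvectors in descending order
eigvecs = eigvecs[:, order]

k = 3                                    # e.g. chosen from the scree plot
X_reduced = Xc @ eigvecs[:, :k]          # project onto the first k PCs

print(X_reduced.shape)  # (100, 3): 12 dimensions reduced to 3
```

Each gene keeps one coordinate per retained component, so the clustering step that follows operates on far fewer dimensions.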
Table 3.13 Reduced representation of yeast data
15 MIN 30 MIN 1 HR 2 HR 3 HR 4 HR 5 HR 6 HR 7 HR 8 HR 9 HR 10 HR
0.085 0.048 0.038 0.038 0.025 0.029 0.050 0.053 0.012 0.017 0.011 0.030
0.048 0.202 0.092 0.027 0.032 0.042 0.045 0.034 -0.03 -0.04 -0.04 -0.02
0.038 0.092 0.104 0.027 0.028 0.026 0.030 0.039 -0.01 -0.09 -0.01 0.006
0.082 0.028 0.027 0.081 0.036 0.033 0.037 0.032 0.010 0.014 0.011 0.021
0.025 0.031 0.027 0.037 0.162 0.125 0.083 0.048 -0.01 -0.00 -0.01 -0.03
0.029 0.042 0.025 0.033 0.125 0.206 0.146 0.107 0.039 0.043 0.025 0.044
0.050 0.045 0.030 0.037 0.083 0.164 0.243 0.183 0.072 0.099 0.078 0.105
0.053 0.034 0.039 0.032 0.048 0.107 0.183 0.292 0.159 0.179 0.159 0.134
0.012 -0.02 -0.01 0.010 -0.01 0.040 0.072 0.16 0.243 0.224 0.205 0.181
0.018 -0.04 -0.01 0.013 -0.00 0.042 0.098 0.179 0.224 0.306 0.284 0.211
0.010 -0.03 -0.01 0.011 -0.01 0.025 0.079 0.158 0.205 0.284 0.329 0.204
0.029 -0.02 0.00 0.022 -0.03 0.044 0.105 0.134 0.181 0.211 0.204 0.319
3.5.3 Lung Cancer Data
The lung cancer gene expression dataset is smaller than the datasets used
in the previous sections. It consists of 20 rows and 4 columns, where the
rows represent genes and the columns are the dimensions of each gene. PCA
calculates the eigenvectors and eigenvalues of the data from its covariance
matrix, given in Table 3.15.
Table 3.14 Eigenvalues and their percentage of variance of cancer data

Principal Components   Eigenvalue   Percentage of Variance   Cumulative Percentage of Variance
         1             5.15584           63.142                      63.142
         2             1.72315           21.103                      84.245
         3             0.834011          10.214                      94.459
         4             0.45248            5.5414                    100.00
91
Table 3.14 shows the eigenvalues and the cumulative percentage of variance
from principal component analysis of the cancer dataset. Among them, the
first two PCs, which have the maximum variances, are taken for analysis.
Table 3.15 Co-variance matrix for lung cancer dataset

        V1        V2        V3        V4
V1    2.2651    1.1119    0.82628   0.5424
V2    1.1119    1.8782    1.3949    1.1108
V3    0.82628   1.3949    1.8904    1.2022
V4    0.5424    1.1108    1.2022    2.1318
A scree plot is drawn to choose the PCs that account for the maximum share
of the total variance in the independent variables, as shown in Figure 3.13.
The point at which the knee bends down towards the x axis is two; this value
is taken as the appropriate number of PCs that covers the total variance.
Figure 3.13 Scree plot for lung cancer data to determine the number of
components
Table 3.16 shows a reduced representation of the cancer dataset that can
be used for further analysis, such as clustering similar patterns in the
dataset. The Hybrid Clustering Technique is applied to this reduced data. It
is demonstrated that the PCA-applied dataset produces optimum results after
applying the Hybrid Clustering Technique.
Table 3.16 Reduced representation of lung cancer data
2 HR 3 HR 4 HR 5 HR
2.265 1.111 0.826 0.542
1.112 1.878 1.394 1.110
0.826 1.394 1.890 1.202
0.542 1.110 1.202 2.132
3.5.4 Blood Cancer Data
This dataset contains 1023 genes as rows and 25 dimensions as columns. The
dimensions have to be reduced in order to make the dataset fit for cluster
analysis. As discussed in Section 2.7.1, a large number of dimensions may
lead to inaccuracy and may not comply with the requirements of many
clustering techniques.
Eigenvectors can be thought of as the preferential directions of a dataset,
or in other words, the main patterns in the data. PCA can be applied to two
profiles, one on genes and another on conditions; in this research it is
restricted to PCA on genes only. For PCA on genes, an eigenvector can be
interpreted as an expression profile that is most representative of the data,
and an eigenvalue can be regarded as a quantitative assessment of how much
its component represents the data. The higher the eigenvalue of a component,
the more representative it is of the dataset.
Eigenvalues can also be seen as the level of explained variance, expressed
as a percentage of the total variance. By themselves, eigenvalues are not
informative; the percentage of variance explained depends on how well all
the components summarize the data. In theory, the sum of all components
explains 100% of the variability in the data.
Table 3.17 shows the eigenvalues and the cumulative percentage of variance
from principal component analysis of the blood cancer data. Among them, the
first three PCs, which have the maximum variances, are taken for this
analysis. One should choose the components (variables) having the largest
variance in the most significant eigenvectors. The first four eigenvalues
and their respective percentages of variance, as given in the table, provide
about 92 percent of the total variation in the independent variables.
Similarly, the first three eigenvalues account for about 90 percent of the
total variation in the independent variables.
Table 3.17 Eigenvalues and their percentage of variance of blood cancer data

Principal Components   Eigenvalue     Percentage of Variance   Cumulative Percentage of Variance
         1             5.17517E+06         84.376                      84.376
         2             267685               4.3643                     88.740
         3             105617               1.722                      90.462
         4             96301.9              1.5701                     92.032
         5             70243.6              1.1452                     93.177
         6             54264.5              0.88472                    94.062
         7             49205.1              0.80224                    94.864
         8             44421.6              0.72425                    95.588
         9             43432.4              0.70812                    96.296
        10             35126.6              0.5727                     96.868
        11             28881.5              0.47088                    97.339
        12             24576.4              0.40069                    97.749
        13             20498.1              0.3342                     98.074
        14             17639.1              0.28759                    98.362
        15             15249.1              0.24862                    98.610
        16             13004                0.21202                    98.822
        17             11499.7              0.18749                    99.010
        18             9769.19              0.15928                    99.169
        19             9512.29              0.15509                    99.324
        20             9275.67              0.15123                    99.475
        21             8586.16              0.13999                    99.615
        22             6981.69              0.11383                    99.729
        23             6355.08              0.10361                    99.832
        24             5796.82              0.094511                   99.927
        25             4402.97              0.071786                   99.999
Seven of the eigenvalues have very little variance, and therefore the
corresponding variables are no doubt interrelated with the rest of the
important variables. The covariance matrix is calculated over all variables
in order to find the eigenvalues of the principal components.
Choosing the number of PCs is also an important issue. There are 25
principal components on the blood cancer data, of which only the components
having the maximum total variance are to be chosen. This can be done with
the help of a scree plot. In Figure 3.14, the number of components to be
chosen is three, read from the point at which the curve becomes steep. Hence
the first three components are used to capture most of the total variance.
The scree graph approach to deciding the number of PCs is, however, rather
ad hoc and subjective.
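A common, less subjective alternative is to keep the smallest number of components whose cumulative percentage of variance crosses a chosen threshold. The sketch below, written for this illustration, applies that rule to the leading percentages of variance from Table 3.17.

```python
# Leading percentages of variance from Table 3.17 (blood cancer data).
pct = [84.376, 4.3643, 1.722, 1.5701, 1.1452, 0.88472,
       0.80224, 0.72425, 0.70812, 0.5727, 0.47088]

def n_components(pct, threshold):
    """Smallest number of components whose cumulative variance
    percentage reaches the given threshold."""
    total = 0.0
    for i, p in enumerate(pct, start=1):
        total += p
        if total >= threshold:
            return i
    return len(pct)

print(n_components(pct, 90.0))  # 3: matches the scree plot choice above
```

With a 90% threshold this rule also selects three components, agreeing with the scree plot reading while being reproducible.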
Figure 3.14 Scree plot for blood cancer data to determine number of
components
The number of clusters produced by the Hybrid Clustering Technique after
performing PCA is considered optimum. As discussed in the previous sections,
the efficacy of the new clustering technique is increased after reducing the
dimensions of the original dataset. Table 3.18 shows the percentage of
reduction in dimensions for the four datasets, ranging from 40% to 66%.
Table 3.18 Original dimensions versus reduced dimensions
Dataset         Original dimensions   Reduced dimensions   Percentage of reduction
Human serum             12                     8                   66 %
Yeast                   12                     8                   66 %
Lung cancer              4                     2                   50 %
Blood Cancer            25                    10                   40 %
3.6 CONCLUSION
Inconsistent data such as outliers need to be removed before clustering to
improve the quality of the results. It is found that the algorithmic method
called distance-based outlier detection is reliable for gene expression
data, whereas the graphical method known as the box plot is suitable only
for small datasets. Dimensionality reduction is carried out using Principal
Component Analysis on four datasets, and the efficiency of the new hybrid
clustering algorithm is significantly improved due to the considerable
reduction in dimensions, as discussed in the following chapter.
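The distance-based outlier detection mentioned above can be sketched as follows. This is a minimal illustration on synthetic data; the particular variant (flagging points whose distance to their k-th nearest neighbour is unusually large) is an assumption, not necessarily the exact procedure used in this chapter.

```python
import numpy as np

# 50 ordinary points plus one injected far-away point (index 50).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(50, 4)),
               [[10.0, 10.0, 10.0, 10.0]]])

def kth_nn_distance(X, k=3):
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero distance to the point itself

scores = kth_nn_distance(X, k=3)
# Flag points whose k-th NN distance is far above the typical value.
outliers = np.where(scores > scores.mean() + 3 * scores.std())[0]
print(outliers)  # flags the injected point at index 50
```

On gene expression data the same idea applies with genes as points, which is what makes the method scale beyond what a box plot can visualise.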