Data Mining for Analyzing the Impact of Environmental Stress on … · 2017-11-16 · Data Mining...
Data Mining for Analyzing the Impact of Environmental Stress on Plants – A Case Study Using OSMID
Dr. Richard S. Segall
Arkansas State University, Department of Economics and Decision Sciences, College of Business, State University, AR 72467-0239
Dr. Sarath A. Nonis
Arkansas State University, Department of Marketing and Management, College of Business, State University, AR 72467-0239
Abstract
This paper first provides a brief background on the basic concepts and development of data
mining and how it relates to data warehousing. Data mining and data warehousing are relatively
new and rapidly expanding areas of information systems for which new courses and curricula are
being created.
A brief background on the economics of plant biotech is provided. The plant data used in this
paper from the Osmotic Stress Microarray Information Database (OSMID) are considered to be
representative of those that could be used for biotech application such as the manufacture of
plant-made-pharmaceuticals (PMP) and genetically modified (GM) foods. Data mining of
selected plant data from the OSMID data warehouse is performed with investigations of both
economic and environmental factors. Conclusions and future directions of the research are
discussed.
Keywords: Data Mining, Data Warehousing, Plant-Made-Pharmaceuticals, Microarray Databases, Genetically Modified Foods
1. Background
1.1 What is Data Mining?
Data mining is sometimes called “data or knowledge discovery” and is the process of
automating information discovery. Data mining is the process of analyzing data from different
perspectives and summarizing it into useful information. Although data mining is a relatively
new term, the basic concept is not. Companies have used powerful computers to sift through
volumes of supermarket scanner data and analyze market research reports for years.
Bigus (1996) defined data mining as the efficient discovery of valuable non-obvious
information from a large collection of data, and centers on the automated discovery of new facts
and relationships in data.
The data mining process is the core of the knowledge discovery process. The Acxiom
Working Paper by Segall (2003) and the text by Han and Kamber (2001) both show the relationships
of data mining to data cleaning and integration of databases into data warehouses, and the
subsequent appropriate selection of both task-relevant data and mining techniques to yield
pattern evaluations for new knowledge.
Data mining and knowledge discovery involve looking in the data for such factors as
associations, sequences, clusters, forecasts (including model fitting), and patterns that could be
represented according to classification rules or trees. The specific models that have been used for
data mining include statistical analysis of data, neural networks, expert systems, fuzzy logic,
multidimensional analysis, data visualization, and decision trees.
Data mining is a relatively new area that uses statistical, artificial intelligence, and related
techniques to "mine" through large volumes of data and provide knowledge without users
having to ask specific questions. The purpose of data mining is to "tell me something interesting,
even though I don't know what questions to ask, and also tell me what may happen."
It is essential to recognize that data mining is superior to traditional statistical methods
in several respects. For example, data mining is capable of providing visual patterns for large
databases or data warehouses. It is also essential to recognize that one of the novelties of this
research is that data mining has not previously been applied in the context of plant-made
pharmaceuticals (PMP), the application area represented by the data warehouse used in this paper.
For a more complete background on recent literature in both data mining and data
warehousing, a detailed discussion of how data mining relates to model building, and the
structure for data mining of a data warehouse, the reader is referred to the previous Acxiom Working
Papers of Segall (2003) and Fish and Segall (2002), which will also appear as the journal articles
Segall (2004) and Fish and Segall (2004), respectively.
The reader is also referred to discussions of the use of data mining functions for
algorithms applied to medical databases in work by the lead author (Segall
1984, 1988, 2002). These data mining functions include modeling via linear and nonlinear
regression, and curve fitting to models. Data mining functions have also been used for databases
obtained from applications to models for learning rules of neural networks, as shown in additional
work by the lead author (Segall 1995, 1996, 2001, 2003, 2004).
It is hoped that readers of this paper will become interested in, and motivated to
investigate, other biotech applications of data mining for their individual teaching and
research needs. One should, however, also be aware that the use of incomplete data and of
inaccurate estimates for missing data could adversely affect the results of data mining.
1.2 What is Data Warehousing?
A data warehouse, as the name implies, is a data store for a large amount of corporate data.
Data warehousing opens new possibilities in terms of decision support systems. Analysts cannot
make good decisions unless they have all of the available data. A good corporate data warehouse
makes that data readily available. In addition, it makes possible a whole new class of computing
applications as described above and now known as data mining.
A view of the multi-tiered architecture for data warehouses was presented as a figure in the
Acxiom Working Paper by Segall (2002) to show the relationship between data sources,
data warehouse, data marts, OLAP, and tools used on a data warehouse such as analysis, query,
reports, and data mining. A data mart, as shown in Segall (2002), is a specialized system that
brings together the data needed for a department or set of related applications.
Anyone who utilizes data mining should also be aware of the importance of the structure of a
data warehouse and the available methods of working with a data warehouse. SAS (2003)
discussed data warehouse solutions for pharmaceutical enterprises.
1.3 The Economic Importance of Plant Biotechnology
Plants in all regions of the planet are subject to stresses from the environment throughout
their lifecycles. For the most part, these stresses are either benign or seasonal, and are within the
tolerance levels of plants. Some environmental stresses are actually beneficial, since they act as
natural mechanisms for stimulating evolution. Stresses form an important part of the design tool
chest of Nature, forcing organisms to react and reorient, or be replaced.
2. Description of Plant Databases used for Data Mining
The plant data used in this paper are considered to be representative of those that could be
used for plant biotech analysis. Specifically, the database used for the data mining in
this paper is the Osmotic Stress Microarray Information Database (OSMID), which contains the
results of approximately 100 microarray experiments performed at the University of Arizona as
part of a National Science Foundation (NSF) funded project named “The Functional
Genomics of Plant Stress,” whose data constitute a data warehouse.
The selection of corn as the crop to be examined is based on the following three
observations from the current state of the plant biotech industry (Monsanto (2002)):
1. Corn in the US is one of the most researched products in the food and feed system, and its
genetic as well as agronomic properties are well documented;
2. Corn is a safe and stable medium for genetic expression;
3. Corn has been shown to express and accumulate high levels of monoclonal antibodies (proteins)
not achieved in other plants.
The OSMID database is available for public access on the web, and it contains
information about the more than 20,000 ESTs (expressed sequence tags) that were used to
produce these arrays. These 20,000 ESTs could be considered components of a data warehouse
of plant microarray databases that could be subjected to data mining. The plants represented in
the OSMID database include rice, barley, maize (corn), ice plant, and Arabidopsis. Specifically,
the OSMID microarray database contains 4,000 ESTs each for maize, ice plant, and rice,
2,000 ESTs for barley, and 9,000 ESTs for Arabidopsis.
According to the web page of the OSMID, the Stress Genomics Consortium as funded by
NSF utilizes a variety of techniques to investigate the responses of plants and certain microbes to
environmental factors of stress such as drought, chilling, and salinity. Hence the data provided in
the OSMID database that could be used in the data mining include the variables for
treatments with the environmental factors of salt, cold, and drought.
The OSMID allows users to search for a gene of interest by name, id or by DNA or protein
sequence. This microarray database is normalized by a uniform method of local iterative linear
regression that minimizes the effect of spatial variation.
Microarray databases assembled by Wang et al. (2003) are available on the OSMID website for
maize (corn). A microarray database for corn is also provided in Wang et al. (2003) as log(2)
normalized mean net signal pixel intensities. In this log(2) normalization, the local mean
background is subtracted from each spot prior to its normalization and log transformation.
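The background subtraction and log transformation described above can be sketched as follows. This is a minimal illustration, not the OSMID pipeline: it omits the intermediate normalization step, the spot readings are made up, the `floor` guard is a hypothetical addition, and taking the ratio as cy5 over cy3 is an assumption that the paper does not state.

```python
import math

def log2_net_intensity(signal, background, floor=1.0):
    """Subtract the local background from a spot's mean signal, then take log base 2.

    `floor` is a hypothetical guard: net intensities below it are clamped
    to it so that the logarithm is always defined.
    """
    net = max(signal - background, floor)
    return math.log2(net)

def log2_ratio(cy5_signal, cy5_background, cy3_signal, cy3_background):
    """log2(cy5 net / cy3 net), computed as a difference of log2 net intensities."""
    return (log2_net_intensity(cy5_signal, cy5_background)
            - log2_net_intensity(cy3_signal, cy3_background))

# Made-up spot readings, for illustration only.
cy5 = log2_net_intensity(signal=1124.0, background=100.0)  # net intensity 1024
cy3 = log2_net_intensity(signal=356.0, background=100.0)   # net intensity 256
ratio = log2_ratio(1124.0, 100.0, 356.0, 100.0)
print(cy5, cy3, ratio)
```

On a log(2) scale a ratio of 2 corresponds to a fourfold difference in net intensity between the two channels, which is why ratio data occupy a much narrower numeric range than raw intensities.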
3. Data Mining Applied to Representative Plant Biotechnology
3.1 Data mining of microarray databases
Data mining can identify patterns under selected conditions using techniques such as clustering,
as shown in this paper. However, microarray databases contain so much data that one cannot
know in advance whether any patterns would appear in the data for the variables of
interest to the investigator.
Data mining in this paper is used for the factor of salinity only and is one representative
example of other possible factors that could have been used such as drought and temperature as
previously mentioned.
3.2 Data mining of representative ingredient of corn using normalized data
The data mining performed for the selected plant ingredient of corn is representative of data
mining of other plant biotech databases that could be used for either biotechnology analysis or
manufacture of biopharmaceuticals using plants. The databases selected for the data mining
presented are those representing the intensity of the ESTs (expressed sequence tags) for
corn. Separate databases for corn for the factors are created by log(2) ratio transformation, and
data mining for these is also performed and contrasted with the normalized databases for the ingredient
of corn.
The software used for the data mining results presented in this paper is SAS Enterprise
Miner using the cluster analysis module. Figure 1 presents six pie charts (a) thru (f) that
represent some of the results of the cluster analysis obtained by data mining. Each of these pie
charts has seven (7) slices, indicating that the data mining resulted in seven clusters. In each
of these charts the darkest shade occurs with slice number 7, corresponding to the
maximum distance from the cluster seed, and the lightest shade occurs with slices numbered 2, 3,
and 6, corresponding to the least distance from the cluster seed. Each chart uses a different set
of the measures of frequency, standard deviation, or radius to determine the slice, height, and color
illustrated.
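To make the per-cluster measures concrete, the sketch below clusters a handful of made-up two-dimensional points and reports, for each cluster, its frequency, a root-mean-square distance of members from the seed, and its radius, taken here to be the maximum distance from the cluster seed. It uses plain k-means as a stand-in; the exact algorithm and the exact definition of standard deviation used by SAS Enterprise Miner are not given in this paper, so both are assumptions.

```python
import math

def kmeans(points, seeds, iterations=20):
    """Plain k-means (Lloyd's algorithm) starting from the given seeds."""
    seeds = [list(s) for s in seeds]

    def nearest(p):
        return min(range(len(seeds)), key=lambda k: math.dist(p, seeds[k]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        for k in range(len(seeds)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous seed
                seeds[k] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final seeds.
    return [nearest(p) for p in points], seeds

def cluster_summary(points, labels, seeds):
    """Frequency, RMS distance from the seed, and radius for each non-empty cluster."""
    summary = {}
    for k, seed in enumerate(seeds):
        d = [math.dist(p, seed) for p, lab in zip(points, labels) if lab == k]
        if not d:
            continue
        summary[k] = {
            "frequency": len(d),
            "rms_distance": math.sqrt(sum(x * x for x in d) / len(d)),
            "radius": max(d),  # maximum distance from the cluster seed
        }
    return summary

# Made-up data: two well-separated groups of points.
points = [(0, 0), (0, 2), (2, 0), (2, 2),
          (10, 10), (10, 12), (12, 10), (12, 12), (11, 11)]
labels, seeds = kmeans(points, seeds=[(0, 0), (12, 12)])
summary = cluster_summary(points, labels, seeds)
for k, stats in summary.items():
    print(k, stats)
```

Any of the three measures could then drive the slice width, height, or color of a chart, which is how the six panels of Figure 1 differ from one another.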
Table 1 lists the twenty-five (25) different variations or levels of the environmental factor of
salinity and their importance, measurement, data types, and labels as used in subsequent data
mining figures. Figure 2 shows the normalized mean values for each of the clusters for each of
the 25 variations of the salinity factor. Note that the environmental factor of 72-hour salt treatment
for cy5 has the smallest normalized mean and the 24-hour salt treatment for cy3 has the largest
normalized mean. Figure 3 shows the cluster proximities of the seven clusters. From Figure
3 it is evident that the centers of clusters 4 and 7, and those of clusters 1 and 3, are close to each other.
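Cluster proximity of this kind can be quantified by the pairwise distances between cluster centers. The sketch below is an illustration only: the center coordinates and the letter labels are made up and are not taken from Figure 3, and Euclidean distance is an assumed choice of metric.

```python
import math
from itertools import combinations

def closest_center_pairs(centers, top=2):
    """Return the `top` closest pairs of cluster centers as (distance, id_a, id_b)."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(sorted(centers.items()), 2):
        pairs.append((math.dist(a, b), id_a, id_b))
    return sorted(pairs)[:top]

# Hypothetical cluster centers in two dimensions.
centers = {
    "A": (0.0, 0.0),
    "B": (0.0, 1.0),
    "C": (5.0, 5.0),
    "D": (5.0, 7.0),
    "E": (20.0, 0.0),
}
closest = closest_center_pairs(centers)
print(closest)
```

Pairs at the top of this list are the ones that would appear nearly overlapping in a proximity plot, and they are natural candidates for merging if fewer clusters are desired.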
Figure 4 (a) thru (c) presents bar charts that illustrate the frequency for the indicated
environmental factors of 1-hour salt of cy3, 6-hour control of cy3, and 24-hour control of cy3,
respectively, for corn. Each of the components (a) thru (c) of Figure 4 indicates that cluster 5 had
the highest frequencies of the selected environmental factor for the corn plant, which could be used
as an ingredient for plant-made-pharmaceuticals.
Table 2 indicates statistics for each of the seven clusters and each of the twenty-five (25)
environmental factors. Figure 5 indicates that the cubic clustering criterion increases
exponentially once the number of clusters exceeds seven (7).
Figure 6 provides a decision tree obtained from the data mining with branches defined by
environmental factors. Each node of the decision tree of Figure 6 has frequency counts and
percentages for each of the seven clusters having the indicated environmental factor between the
limits indicated on each level of the decision tree. The left portion of this decision tree indicates
that cluster number five (5) had the greatest frequency of occurrence of the indicated
environmental stress factor. On the right side of the decision tree of Figure 6, however, other
clusters had the greatest frequency of occurrence of the indicated environmental stress factor.
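The content of a single decision tree node can be sketched as a threshold split on one environmental factor followed by a tally of cluster membership on each side. The example is a simplified illustration: the field name `salt_1h`, the threshold, and the rows are hypothetical, and the rule SAS Enterprise Miner uses to choose split points is not reproduced here.

```python
from collections import Counter

def node_counts(rows, factor, threshold):
    """Split rows on `factor < threshold` and tally cluster membership on each side.

    Returns {side: {cluster: (count, percent_of_side)}}.
    """
    sides = {"left": Counter(), "right": Counter()}
    for row in rows:
        side = "left" if row[factor] < threshold else "right"
        sides[side][row["cluster"]] += 1
    report = {}
    for side, counts in sides.items():
        total = sum(counts.values())
        report[side] = {c: (n, 100.0 * n / total) for c, n in sorted(counts.items())}
    return report

# Made-up rows: one factor value and the cluster assigned to each record.
rows = [
    {"salt_1h": 0.1, "cluster": 5},
    {"salt_1h": 0.2, "cluster": 5},
    {"salt_1h": 0.3, "cluster": 2},
    {"salt_1h": 0.4, "cluster": 5},
    {"salt_1h": 0.8, "cluster": 1},
    {"salt_1h": 0.9, "cluster": 3},
]
report = node_counts(rows, factor="salt_1h", threshold=0.5)
print(report)
```

Reading the counts and percentages on each side shows which cluster dominates a branch, which is the comparison made between the left and right portions of the tree.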
3.3 Data mining of representative ingredient of corn using log(2) normalized data
The data mining performed on the previous data set of corn in section 3.2 is repeated for the
same ingredient of corn with the data transformed using a log(2) ratio as provided by Wang
et al. (2003) on the OSMID website. The clusters that are obtained for this log(2) ratio data set for
the ingredient of corn are shown in Figure 7 (a) thru 7 (c) as obtained by data mining
using SAS Enterprise Miner. Each of the Figures 7 (a) thru 7 (c) shows thirty-six (36) slices, as
compared to the seven (7) slices of Figures 1 (a) thru 1 (f), and their respective frequencies as
generated by using data mining for clustering. The units of measurement for each set of the
thirty-six (36) slices are determined by using the indicated units of measurement for slice, height
and color.
Figure 8 indicates the existence of no clusters within the maximum normalized mean of 0.03
of the untransformed data of Figure 2. Figure 8 illustrates the existence of only six (6) clusters
for all of the environmental factors within minimum and maximum normalized bounds of (0.25,
0.45). Six (6) transformed environmental factors, as listed in Table 3, gave the only meaningful
information, as indicated by the “importance” factor values in Table 3, in contrast to twenty-five (25)
for the untransformed data.
Figure 9 shows the cluster proximities of the clusters created by the transformed data. There
are forty (40) clusters in Figure 9 compared to thirty-six (36) slices in the pie charts of Figures 7
(a) thru 7 (c) because of the larger scale of dimensions in Figure 9. Figure 9 indicates that the proximity
of the centers of the clusters is much less than that of the untransformed data.
Figure 10(a) thru 10(f) presents bar charts that illustrate the frequency for the six (6) indicated
environmental factors as labeled in Table 3. Each of these components (a) thru (f) of Figure 10
indicates that clusters 5, 11, 12 and 13 had the highest frequencies of the selected environmental
factor for the corn plant.
Table 4 indicates statistics for each of the forty (40) clusters and each of the six (6)
transformed environmental factors. Figure 11 indicates that the cubic clustering criterion decreases
exponentially, rather than increasing as in Figure 5 for the untransformed data, once the number of
clusters exceeds seven (7).
Figure 12 provides a decision tree for the transformed data as obtained from the data mining
with eleven (11) levels of branches defined by the log(2) transformed environmental factors
versus six (6) levels of the untransformed data of Figure 6. Unfortunately the numerical values of
the log(2) transformed factors were not readable in the output of SAS Enterprise Miner of Figure
12.
4. Conclusions and Future Directions
This paper has provided some illustrations of the usefulness of data mining for the
assessment of environmental stress factors on plants. The implications of transformation of the
data are also illustrated. This application of data mining could also be applied to
the analysis of data for plants to be used as possible ingredients for the manufacture of
plant-made-pharmaceuticals (PMP), as discussed below as a future direction of this research.
According to Dow (2003), many new pharmaceuticals based on recombinant proteins will
receive regulatory approval from the United States Food and Drug Administration (FDA) in the
next few years. Dow (2003) claims that growing therapeutic proteins in plants is a new way to
produce medicines, and the unique cellular machinery of a plant can enable production of certain
novel therapeutic proteins that fermentation cannot produce. Dow (2003) also claims that
growing pharmaceuticals in plants will give drug development companies an economic
alternative to scaling back development and production. The economic issues of using plants for
plant-made-pharmaceuticals also include faster access to new medicines for
patients whose lives may be enhanced or saved by these discoveries.
Additional economic issues for the manufacture of plant-made-pharmaceuticals,
according to Dow (2003), include the exploration of plant-based production to overcome capacity
limitations, the production of complex therapeutic proteins, and the reality of the commercial
potential and implications of the marketing of plant-made-pharmaceuticals. Among
the conclusions of this paper are the implications of applying data mining techniques to
predict patterns in the data that help answer some of these questions in the area of
assessment of environmental stress factors on plants, and the potential usefulness of these
techniques and results in the new area of economic and environmental issues of
plant-made-pharmaceuticals.
Additionally, the future directions of this research are the application of data mining
techniques to more of the data as it is made available, and the use of other data mining techniques such as
predictive modeling, as available in SAS Enterprise Miner.
Acknowledgements
The authors wish to acknowledge the funding provided by a block grant from the Arkansas
Biosciences Institute (ABI) as administered by Arkansas State University (ASU) to encourage
development of a focus area in Biosciences Institute Social and Economic and Regulatory
Studies (BISERS).
References
AgBiotechNet (2002), Public Awareness, Risk Assessment, Company Information. http://www.agbiotechnet.com
Arkhipenkov, S. and Golubev D. (2002), Oracle Express OLAP, Charles River Media, Hingham, MA.
Berson, A. and Smith, S. J. (1997), Data Warehousing, Data Mining, & OLAP, McGraw-Hill Publishers, New York.
Bigus, J. P. (1996), Data Mining with Neural Networks, McGraw-Hill Publishers.
Bristol-Myers Squibb (2002), “IT and pharmaceutical data: Finding needles in haystacks,” October 23, 2002, www.cioinsight.com/print_article/0,3668,a=32840,00.asp
Chassy, B. (2002), Food Safety Evaluation of Crops Produced Through Biotechnology, Journal of the American College of Nutrition, Vol. 21, No. 90003, 166S-173S.
Computer Sciences Corporation (1998), “Data warehouse designed for pharmaceutical executives,” June 2, 1998, www.csc.com/newsandevents/news/1028.shtml
Dow Plant-Based Pharmaceuticals (2003), http://www.dow.com/plantbio/
DSstar (1999), “Pharmaceutical data-mining firm opens in Santa Fe, NM,” v. 3, n. 7, November 23, 1999, http://www.tgc.com/dsstar/99/1123/991123.html
Elands, J. (2001), IDBS says its suite of data management tools is ready for the genomics software market, Bioinform, volume 5, number 37, October 1, 2001.
Fish, K. E. and Segall, R. S. (2004), A visual analysis of learning rule effects and variable importance for neural networks employed in data mining operations, to appear in Kybernetes: International Journal of Systems and Cybernetics, volume 33. (Also appears as Acxiom Working Paper Series WP-02-03.)
Glasser, P. (2002), Leaders, Bristol-Myers Squibb CIO on IT and Pharmaceutical Data, Ziff Davis CioInsight- Strategies for IT Business Leaders, www.cioinsight.com
Han, J. and Kamber, M. (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA.
Hanemann, W. (1994), Valuing the Environment Through Contingent Valuation, The Journal of Economic Perspectives, 8 (4), 19-43.
Heinrich, L. (2003), “Starting the data warehouse from a data model,” Applied Data Resource Management White Paper, www.adrm.com/4_infoart.htm
Hy, Heikki, (2003), “Application example: Data mining of pharmaceutical data,” www.hut.fi/~hhyotyni/latex/Final/node61.html
IDBS (2001), “IDBS unveil new pharmaceutical data warehousing strategy at LabAutomation,” January 28, 2001, www.idbs.com/news/press_room/press_release.asp?release_date=01_28_2001
James, C. (2001), Global Review of Commercialized Transgenic Crops: 2001, International Service for the Acquisition of Agri-biotech Applications. No. 24, pp. iv.
Knightbridge Company (2001), “Pharmaceutical and medical products,” www.knightsbridge.com/pharmaceutical.html
Monsanto (2002), Plant-Made Pharmaceuticals: A New Way to Make Medicine.
Murtagh, F. and Hirtle, S. (1997), “On-Line software for clustering and multivariate analysis,” www.pitt.edu/~csna/software.html
Nash, K. S. (2002), “New Rx for pharmaceutical data,” Baseline, Ziff Davis Media Inc., June 12, 2002, www.baselinemag.com/print_article/0,3668,a=28045,00.asp
PEW Initiative on Food and Biotechnology (2002), On the Pharm: GM Plants and the Future of Medicine. AgBiotechBUZZ, 2(7), July 29, 2002.
Reichert, J. M. (2000), “New biopharmaceutical in the USA: trends in development and marketing approvals 1995-1999,” TIBTECH, September 2000, v. 18, pp. 364-369.
SAS (2003), “SAS solutions for pharmaceutical enterprises,” www.sas.com/industry/pharma
Segall, R. S. (1984), Models of Area Wide Medical Delivery, Ph.D. Dissertation, University of Massachusetts at Amherst.
Segall, R. S. (1988), Mathematical modeling for the capacity planning of market oriented systems: with an application to real health data, Applied Mathematical Modelling, v. 12, n. 4, 366-378.
Segall, R. S. (1995), Some mathematical and computer modeling of neural networks, Applied Mathematical Modelling, v. 19, 386-399.
Segall, R. S. (1996), Comparing learning rules of neural networks using computer graphics, Proceedings of the Twenty-seventh Annual Conference of the Southwest Decision Sciences Institute, San Antonio, TX, March 6-9, 1996.
Segall, R. S. (2001), Some Applications of Data Mining and Data Warehousing to a Medical Database and Neural Networks, Proceedings of the 32nd Conference of the Southwest Decision Sciences Institute, New Orleans, LA, March 30-April 2, 2001.
Segall, R. S. (2002), Some Applications of RightPoint DataCruncher for Data Mining of Data Warehouses, Proceedings of the 33rd Conference of the Southwest Decision Sciences Institute, St. Louis, MO, March 4-8, 2002.
Segall, R. S. (2003), Incorporating Data Mining and Computer Graphics for Modeling of Neural Networks, Acxiom Data Engineering Laboratory Working Paper Series, ADEL-WP-03-02, Publication in Collaboration with University of Arkansas at Little Rock (UALR) Donaghey Cyber College, 33 pages, March 2003.
Segall, R. S. (2004), Incorporating data mining and computer graphics for modeling of neural networks, to appear in Kybernetes: International Journal of Systems and Cybernetics, volume 33.
Silico Research (2001), “Data warehousing is becoming one of the core drug discovery technologies,” www.bioportfolio.com/silico/dw.htm
University of British Columbia (2002), Computer Science Department, Database Systems Laboratory, “Data Mining – Introduction,” www.cs.ubc.ca/nest/dbsl/mining.html
Wang et al. (2003), “A maize QTL for silk maysin levels contains duplicated Myb-homologous genes which jointly regulate flavone biosynthesis,” Journal of Plant Molecular Biology, v. 52, n. 1, pages 1-15, May 2003.
List of Figures
Figure 1: Clusters for data replacement train for corn microarray database
(a) slice: standard deviation, height: standard deviation, color: radius
(b) slice: radius, height: standard deviation, color: radius
(c) slice: radius, height: standard deviation, color: radius
(d) slice: radius, height: frequency, color: radius
(e) slice: standard deviation, height: frequency, color: radius
(f) slice: radius, height: radius, color: standard deviation
Figure 2: Clusters for environmental factors for corn microarray database
Figure 3: Cluster proximities of corn microarray database
Figure 4: Frequency bar chart of corn microarray for environment factor
(a) 1hour salt cy3, (b) 6 hour control cy3, (c) 24 hour control cy5
Figure 5: Cubic clustering criterion for corn microarray database
Figure 6: Decision tree obtained by data mining of corn microarray database
Figure 7: Clusters for data replacement train for log(2) transformed corn data
(a) slice: standard deviation, height: frequency, color: radius
(b) slice: standard deviation, height: radius, color: radius
(c) slice: radius, height: radius, color: radius
Figure 8: Clusters for environmental factors for log(2) transformed corn data
Figure 9: Cluster proximities of log(2) corn microarray data
Figure 10: Frequency bar chart of log(2) transformed corn data for environment factor of
controlled salt for: (a) 6 hour, (b) 3 hour, (c) 1 hour, (d) 24 hour, (e) 72 hour, (f) 12 hour.
Figure 11: Cubic clustering criterion for log(2) transformed corn data
Figure 12: Decision tree obtained by data mining of log(2) transformed corn data
List of Tables
Table 1: Environmental factors for corn microarray database
Table 2: Statistics for clusters for environment factors for corn
Table 3: Environmental factors for log(2) ratio transformation of corn microarray
database.
Table 4: Statistics for clusters for environment factors for log(2) transformed corn data
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Table 1
Table 2
Table 3
Table 4