Data Mining for Analyzing the Impact of Environmental Stress on … · 2017-11-16 · Data Mining...

Data Mining for Analyzing the Impact of Environmental Stress on Plants – A Case Study Using OSMID

Dr. Richard S. Segall
Arkansas State University, Department of Economics and Decision Sciences, College of Business, State University, AR 72467-0239

Dr. Sarath A. Nonis
Arkansas State University, Department of Marketing and Management, College of Business, State University, AR 72467-0239

Abstract

This paper first provides a brief background on the basic concepts and development of data

mining and how it relates to data warehousing. Data mining and data warehousing are relatively

new and rapidly expanding areas of information systems for which new courses and curricula are

being created.

A brief background on the economics of plant biotech is provided. The plant data used in this

paper from the Osmotic Stress Microarray Information Database (OSMID) are considered to be

representative of those that could be used for biotech application such as the manufacture of

plant-made-pharmaceuticals (PMP) and genetically modified (GM) foods. Data mining of

selected plant data from the OSMID data warehouse is performed with investigations of both

economic and environmental factors. Conclusions and future directions of the research are

discussed.

Keywords: Data Mining, Data Warehousing, Plant-Made-Pharmaceuticals, Microarray Databases, Genetically Modified Foods

1. Background

1.1 What is Data Mining?

Data mining is sometimes called “data or knowledge discovery” and is the process of

automating information discovery. Data mining is the process of analyzing data from different

perspectives and summarizing it into useful information. Although data mining is a relatively

new term, the basic concept is not. Companies have used powerful computers to sift through

volumes of supermarket scanner data and analyze market research reports for years.

Bigus (1996) defined data mining as the efficient discovery of valuable non-obvious

information from a large collection of data, and centers on the automated discovery of new facts

and relationships in data.

The data mining process is the core of the knowledge discovery process. The Acxiom

Working Paper by Segall (2003) and text by Han and Kamber (2001) both show the relationships

of data mining to data cleaning and integration of databases into data warehouses, and the

subsequent appropriate selection of both task-relevant data and mining techniques to yield

pattern evaluations for new knowledge.

Data mining and knowledge discovery involve looking in the data for such factors as

associations, sequences, clusters, forecasting including model fitting, and patterns that could be

represented according to classification rules or trees. The specific models that have been used for

data mining include statistical analysis of data, neural networks, expert systems, fuzzy logic,

multidimensional analysis, data visualization, and decision trees.

Data mining is a relatively new area using statistical, artificial intelligence and related

techniques to "mine" through large volumes of data and provides knowledge without users

having to ask specific questions. The purpose of data mining is to "tell me something interesting,

even though I don't know what questions to ask, and also tell me what may happen."

It is essential to recognize that data mining is far superior to traditional methods of statistics

in many ways. For example, data mining is capable of providing visual patterns for large

databases or data warehouses. It is also essential to recognize that one of the novelties of this

research is that data mining has not been applied in the context of PMP (plant-made

pharmaceuticals), which constitute the data warehouse used in this paper.

For a more complete background on recent literature in both data mining and data

warehousing, a detailed discussion on how data mining relates to model building, and the

structure for data mining of a data warehouse, the reader is referred to previous Acxiom Working

Papers of Segall (2003) and Fish and Segall (2002) which will also appear as journal articles of

Segall (2004) and Fish and Segall (2004) respectively.

The reader is also referred to discussions of the use of data mining functions for

algorithms applied to medical databases in work by the lead author (Segall,

1984, 1988, 2002). These data mining functions include modeling via linear and nonlinear

regression, and curve fitting to models. Data mining functions have also been used for databases

obtained from applications to models for learning rules of neural networks, as shown in additional

work of the lead author (Segall, 1995, 1996, 2001, 2003, 2004).

It is hoped that the reader of this paper will become interested and motivated to

investigate other biotech applications of data mining for their individual teaching and

research needs. One should also be aware, however, that the use of incomplete data and of

inaccurate estimates for missing data could adversely affect the results of data mining.

1.2 What is Data Warehousing?

A data warehouse, as the name implies, is a data store for a large amount of corporate data.

Data warehousing opens new possibilities in terms of decision support systems. Analysts cannot

make good decisions unless they have all of the available data. A good corporate data warehouse

makes that data readily available. In addition, it makes possible a whole new class of computing

applications as described above and now known as data mining.

A view of the multi-tiered architecture for data warehouses was presented as a figure in

Acxiom Working Paper by Segall (2002) for presenting the relationship between data sources,

data warehouse, data marts, OLAP, and tools used on a data warehouse such as analysis, query,

reports, and data mining. A data mart as shown in Segall (2002) is a specialized system that

brings together the data needed for a department or set of related applications.

Anyone who utilizes data mining should also be aware of the importance of the structure of a

data warehouse and the available methods of working with a data warehouse. SAS (2003)

discussed data warehouse solutions for pharmaceutical enterprises.

1.3 The Economic Importance of Plant Biotechnology

Plants in all regions of the planet are subject to stresses from the environment throughout

their lifecycles. For the most part, these stresses are either benign or seasonal, and are within the

tolerance levels of plants. Some environmental stresses are actually beneficial, since they act as

natural mechanisms for stimulating evolution. Stresses form an important part of the design tool

chest of Nature, forcing organisms to react and reorient, or be replaced.

2. Description of Plant Databases used for Data Mining

The plant data used in this paper are considered to be representative of those that could be

used for plant biotech analysis. Specifically the database that is to be used in the data mining for

this paper is the Osmotic Stress Microarray Information Database (OSMID) that contains the

results of approximately 100 microarray experiments performed at the University of Arizona as

part of a National Science Foundation (NSF) funded project named “The Functional

Genomics of Plant Stress” whose data constitutes a data warehouse.

The selection of corn as the crop to be examined is based on the following three

observations from the current state of the plant biotech industry (Monsanto (2002)):

1. Corn in the US is one of the most researched products in the food and feed system, and its

genetic as well as agronomic properties are well documented;

2. Corn is a safe and stable medium for genetic expression;

3. Corn has been shown to express and accumulate high levels of monoclonal antibodies (proteins)

not achieved in other plants.

The OSMID database is available for public access on the web, and the OSMID contains

information about the more than 20,000 ESTs (Expressed Sequence Tags) that were used to

produce these arrays. These 20,000 ESTs could be considered as components of a data warehouse

of plant microarray databases that could be subjected to data mining. The plants represented in

the OSMID database include rice, barley, maize or corn, ice plant and Arabidopsis. Specifically,

the OSMID microarray database contains 4,000 ESTs for maize or corn, ice plant and rice, and

2,000 ESTs for barley, and 9,000 ESTs for Arabidopsis.

According to the web page of the OSMID, the Stress Genomics Consortium as funded by

NSF utilizes a variety of techniques to investigate the responses of plants and certain microbes to

environmental factors of stress such as drought, chilling, and salinity. Hence the data provided in

the OSMID database that could be used in the data mining include that for the variables of

treatments with environmental factors of salt, cold, and drought.

The OSMID allows users to search for a gene of interest by name, id or by DNA or protein

sequence. This microarray database is normalized by a uniform method of local iterative linear

regression that minimizes the effect of spatial variation.

Microarray databases assembled by Wang et al. (2003) are available on the OSMID website for

corn (maize). A microarray database for corn is also provided in Wang et al. (2003) as log(2)

normalized mean net signal pixel intensities. This log(2) normalization means that the local mean

background has been subtracted from each spot prior to its normalization and log transformation.
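
The order of operations described above (subtract the local mean background from each spot, then take the base-2 logarithm) can be sketched as follows. This is only an illustration under stated assumptions: the intensity values are hypothetical, the function name log2_net_signal is introduced here, and the floor parameter is an assumed guard for non-positive net signals, since the source does not say how such spots were handled. The regression-based normalization step used by OSMID is not reproduced.

```python
import math

def log2_net_signal(spot_mean, background_mean, floor=1.0):
    """Subtract the local background from a spot, then take log base 2.

    `floor` is an assumption of this sketch: it keeps the logarithm
    defined when the net signal is zero or negative.
    """
    net = spot_mean - background_mean
    return math.log2(max(net, floor))

# Hypothetical (spot mean, local background mean) pixel intensities.
spots = [(1124.0, 100.0), (356.0, 100.0), (90.0, 100.0)]
values = [log2_net_signal(s, b) for s, b in spots]
print(values)  # [10.0, 8.0, 0.0]
```

The third spot is dimmer than its background, so its net signal is clamped to the floor and maps to 0.0 on the log scale.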

3. Data Mining Applied to Representative Plant Biotechnology

3.1 Data mining of microarray databases

Data mining can identify patterns upon selected conditions using techniques such as clustering

as shown in this paper. However, microarray databases contain so much data that one cannot

know in advance whether any patterns in the data will appear once the investigator has selected

the variables of interest.

Data mining in this paper is used for the factor of salinity only and is one representative

example of other possible factors that could have been used such as drought and temperature as

previously mentioned.

3.2 Data mining of representative ingredient of corn using normalized data

This data mining performed for the selected plant ingredient of corn is representative for data

mining of other plant biotech databases that could be used for either biotechnology analysis or

manufacture of biopharmaceuticals using plants. The databases selected for the data mining

presented are those representing the intensity of the ESTs (Expressed Sequence Tags) for

corn. Separate databases for corn for the factors are created by log(2) transformation ratios, and

data mining for these is also performed and contrasted with the normal databases for the ingredient

of corn.
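
A log(2) ratio of the kind mentioned above can be sketched as follows. The choice of cy5 over cy3 as numerator and denominator is an assumption made here for illustration (the text does not state which channel is the reference), the function name log2_ratio is introduced for this sketch, and the intensity pairs are hypothetical.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(cy5 / cy3); both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# Hypothetical (cy5, cy3) intensity pairs for three spots.
pairs = [(800.0, 200.0), (150.0, 150.0), (50.0, 400.0)]
ratios = [log2_ratio(cy5, cy3) for cy5, cy3 in pairs]
print(ratios)  # [2.0, 0.0, -3.0]
```

On this scale a four-fold increase and a four-fold decrease are symmetric around zero (+2 and -2), which is one common reason for working with log ratios rather than raw ratios.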

The software used for the data mining results presented in this paper is SAS Enterprise

Miner using the cluster analysis module. Figure 1 presents six pie charts (a) thru (f) that

represent some of the results of the cluster analysis obtained by data mining. Each of these pie

charts illustrates seven (7) slices indicating that the data mining resulted in seven clusters. Each

of these charts shows that the darkest shade occurs with slice number 7, corresponding to the

maximum distance from cluster seed, and the lightest shade occurs with slices numbered as 2, 3

and 6 corresponding to the least distance from the cluster seed. Each of these uses a different set

of measures of frequency, standard deviation, or radius for the slice, height and color

determination illustrated.
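
The three measures used in the pie charts (frequency, standard deviation, and radius as distance from the cluster seed) can be illustrated with a small seed-based clustering sketch. This is not the SAS Enterprise Miner implementation: it is a plain k-means loop written for illustration, it uses two clusters rather than seven for brevity, the data points are hypothetical, and rms_distance is a simplified stand-in for the standard deviation measure that the software reports.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, seeds):
    """Index of the seed closest to `point`."""
    return min(range(len(seeds)), key=lambda k: dist(point, seeds[k]))

def kmeans(points, seeds, iterations=20):
    """Plain k-means: assign points to seeds, then move seeds to means."""
    seeds = [list(s) for s in seeds]
    for _ in range(iterations):
        labels = [nearest(p, seeds) for p in points]
        for k in range(len(seeds)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                seeds[k] = [sum(c) / len(members) for c in zip(*members)]
    # Final assignment against the final seed positions.
    labels = [nearest(p, seeds) for p in points]
    return seeds, labels

def cluster_summary(points, seeds, labels):
    """Frequency, RMS distance and radius (max distance) per cluster."""
    rows = []
    for k, seed in enumerate(seeds):
        d = [dist(p, seed) for p, lab in zip(points, labels) if lab == k]
        rows.append({
            "cluster": k + 1,
            "frequency": len(d),
            "rms_distance": math.sqrt(sum(x * x for x in d) / len(d)) if d else 0.0,
            "radius": max(d) if d else 0.0,
        })
    return rows

# Hypothetical two-dimensional observations and initial seeds.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
seeds, labels = kmeans(points, seeds=[(0.0, 0.0), (10.0, 0.0)])
summary = cluster_summary(points, seeds, labels)
for row in summary:
    print(row)
```

In this toy data the second cluster has the larger radius (2.0 versus 1.0), which corresponds to the darker slice in a chart colored by radius.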

Table 1 lists the twenty-five (25) different variations or levels of the environmental factor of

salinity and their importance, measurement, data types, and labels as used in subsequent data

mining figures. Figure 2 shows the normalized mean values for each of the clusters for each of

the 25 variations of the salinity factor. Note that the environmental factor of 72-hour salt treatment

for cy5 has the smallest normalized mean and 24-hour salt treatment for cy3 has the largest

normalized mean. Figure 3 shows the cluster proximities of the seven clusters. From Figure 3

it is evident that the centers of clusters 4 and 7, and of clusters 1 and 3, are close to each other.
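
One simple way to read cluster proximities is as the pairwise distances between cluster centers. The sketch below assumes Euclidean distance and uses hypothetical center coordinates; the proximity plot produced by SAS Enterprise Miner may be constructed differently, so this shows only the underlying idea of finding which centers lie closest together.

```python
import math
from itertools import combinations

def center_distances(centers):
    """Euclidean distance between every pair of cluster centers.

    Keys are (i, j) pairs of 1-based cluster numbers with i < j.
    """
    out = {}
    for (i, a), (j, b) in combinations(enumerate(centers, start=1), 2):
        out[(i, j)] = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return out

# Hypothetical centers for three clusters.
centers = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
distances = center_distances(centers)
closest = min(distances, key=distances.get)
print(closest, distances[closest])  # (1, 3) 1.0
```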

Figure 4 (a) thru (c) presents bar charts that illustrate the frequency for the indicated

environmental factors of 1-hour salt of cy3, 6 hour control of cy3, and 24-hour control of cy3

respectively, for corn. Each of the components (a) thru (c) of Figure 4 indicates that cluster 5 had

the highest frequencies of the selected environmental factor for the corn plant, which could be used

as an ingredient for plant-made-pharmaceuticals.

Table 2 indicates statistics for each of the seven clusters and each of the twenty-five (25)

environmental factors. Figure 5 indicates that the cubic clustering criterion increases

exponentially once the number of clusters exceeds seven (7).

Figure 6 provides a decision tree obtained from the data mining with branches defined by

environmental factors. Each node of the decision tree of Figure 6 has frequency counts and

percentages for each of the seven clusters having the indicated environmental factor between the

limits indicated on each level of the decision tree. The left portion of this decision tree indicates

that cluster number five (5) had the greatest frequency of occurrence of the indicated

environment stress factor. The right side of the decision tree of Figure 6 however had other

clusters having the greatest frequency of occurrence of the indicated environment stress factor.
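
The content of a single decision tree node, as described above, amounts to counting cluster memberships among the observations whose factor value falls within the node's limits. The sketch below shows one threshold split with hypothetical factor values, cluster labels, and threshold; the function names are introduced here, and the way SAS Enterprise Miner chooses its split points is not reproduced.

```python
from collections import Counter

def split_counts(values, clusters, threshold):
    """Count cluster labels on each side of the test `value < threshold`."""
    left = Counter(c for v, c in zip(values, clusters) if v < threshold)
    right = Counter(c for v, c in zip(values, clusters) if v >= threshold)
    return left, right

def percentages(counts):
    """Convert frequency counts into percentages of the node total."""
    total = sum(counts.values())
    return {c: 100.0 * n / total for c, n in counts.items()}

# Hypothetical factor values and cluster assignments for six observations.
values = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
clusters = [5, 5, 5, 2, 5, 3]
left, right = split_counts(values, clusters, threshold=0.5)
print(dict(left), percentages(left))  # {5: 3} {5: 100.0}
print(dict(right))                    # {2: 1, 5: 1, 3: 1}
```

Here the left branch is dominated by cluster 5 while the right branch is mixed, mirroring the contrast between the left and right portions of the tree described in the text.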

3.3 Data mining of representative ingredient of corn using log(2) normalized data

The data mining performed on the previous data set of corn in Section 3.2 is repeated for the

same ingredient of corn with the data transformed using a log(2) ratio as provided by Wang

et al. (2003) on the OSMID website. The clusters that are obtained for this log(2) ratio data set for

the ingredient of corn are shown in Figures 7 (a) thru 7 (c) as obtained by data mining

using SAS Enterprise Miner. Each of the Figures 7 (a) thru 7 (c) shows thirty-six (36) slices as

compared to seven (7) slices of Figures 1 (a) thru 1 (f), and their respective frequencies as

generated by using data mining for clustering. The units of measurement for each set of the

thirty-six (36) slices are determined by using the indicated units of measurement for slice, height

and color.

Figure 8 indicates the existence of no clusters within the maximum normalized mean of 0.03

of the untransformed data of Figure 2. Figure 8 illustrates the existence of only six (6) clusters

for all of the environmental factors within minimum and maximum normalized bounds of (0.25,

0.45). Six (6) transformed environmental factors, as listed in Table 3, gave the only meaningful

information as indicated by the “importance” factor values in Table 3, in contrast to twenty-five (25)

for the untransformed data.

Figure 9 shows the cluster proximities of the clusters created by the transformed data. There

are forty (40) clusters in Figure 9 compared to thirty-six (36) slices in the pie charts of Figures 7

(a) thru 7 (c) because of the larger scale of dimensions in Figure 9. Figure 9 indicates that the proximity

of the centers of the clusters is much less than that of the untransformed data.

Figure 10(a) thru 10(f) presents bar charts that illustrate the frequency for the six (6) indicated

environmental factors as labeled in Table 3. Each of these components (a) thru (f) of Figure 10

indicates that clusters 5, 11, 12 and 13 had the highest frequencies of the selected environmental

factor for the plant of corn.

Table 4 indicates statistics for each of the forty (40) clusters and each of the six (6)

transformed environmental factors. Figure 11 indicates that the cubic clustering criterion decreases

exponentially rather than increases as in Figure 5 for the untransformed data once the number of

clusters exceeds seven (7).

Figure 12 provides a decision tree for the transformed data as obtained from the data mining

with eleven (11) levels of branches defined by the log(2) transformed environmental factors

versus six (6) levels of the untransformed data of Figure 6. Unfortunately the numerical values of

the log(2) transformed factors were not readable in the output of SAS Enterprise Miner of Figure

12.

4. Conclusions and Future Directions

This paper has provided some illustrations of the usefulness of data mining to the

assessment of environmental stress factors on plants. The implications of transformation of the

data are also illustrated. This application of data mining could also be extended to

the analysis of data for plants that are possible ingredients for the manufacture of

plant-made-pharmaceuticals (PMP), as discussed below as a future direction of this research.

According to Dow (2003), many new pharmaceuticals based on recombinant proteins will

receive regulatory approval from the United States Food and Drug Administration (FDA) in the

next few years. Dow (2003) claims that growing therapeutic proteins in plants is a new way to

produce medicines, and the unique cellular machinery of a plant can enable production of certain

novel therapeutic proteins that fermentation cannot produce. Dow (2003) also claims that

growing pharmaceuticals in plants will give drug development companies an economic

alternative to scaling back development and production. The economic issues of using plants for

plant-made-pharmaceuticals also include faster access to new medicines for

patients whose lives may be enhanced or saved by these discoveries.

Additional economic issues for the manufacture of plant-made-pharmaceuticals,

according to Dow (2003), include the exploration of plant-based production to overcome capacity

limitations, the production of complex therapeutic proteins, and the reality of the commercial

potential and implications of the marketing of plant-made-pharmaceuticals. One of

the conclusions of this paper is that data mining techniques can be applied to

identify patterns in the data that help answer some of the questions in the area of

assessment of environmental stress factors on plants, and that these

techniques and results are potentially useful in the new area of economic and environmental

issues of plant-made-pharmaceuticals.

Additionally, the future directions of this research are the application of data mining

techniques to more of the data as it is made available, and the use of other data mining techniques such as

predictive modeling, as available in SAS Enterprise Miner.

Acknowledgements

The authors wish to acknowledge the funding provided by a block grant from the Arkansas

Biosciences Institute (ABI) as administered by Arkansas State University (ASU) to encourage

development of a focus area in Biosciences Institute Social and Economic and Regulatory

Studies (BISERS).

References

AgBiotechNet (2002), Public Awareness, Risk Assessment, Company Information, http://www.agbiotechnet.com

Arkhipenkov, S. and Golubev, D. (2002), Oracle Express OLAP, Charles River Media, Hingham, MA.

Berson, A. and Smith, S. J. (1997), Data Warehousing, Data Mining, & OLAP, McGraw-Hill Publishers, New York.

Bigus, J. P. (1996), Data Mining with Neural Networks, McGraw-Hill Publishers.

Bristol-Myers Squibb (2002), “IT and pharmaceutical data: Finding needles in haystacks,” October 23, 2002, www.cioinsight.com/print_article/0,3668,a=32840,00.asp

Chassy, B. (2002), Food Safety Evaluation of Crops Produced Through Biotechnology, Journal of the American College of Nutrition, Vol. 21, No. 90003, 166S-173S.

Computer Sciences Corporation (1998), “Data warehouse designed for pharmaceutical executives,” June 2, 1998, www.csc.com/newsandevents/news/1028.shtml

Dow Plant-Based Pharmaceuticals (2003), http://www.dow.com/plantbio/

DSstar (1999), “Pharmaceutical data-mining firm opens in Santa Fe, NM,” v. 3, n. 7, November 23, 1999, http://www.tgc.com/dsstar/99/1123/991123.html

Elands, J. (2001), IDBS says its suite of data management tools is ready for the genomics software market, Bioinform, volume 5, number 37, October 1, 2001.

Fish, K. E. and Segall, R. S. (2004), A visual analysis of learning rule effects and variable importance for neural networks employed in data mining operations, to appear in Kybernetes: International Journal of Systems and Cybernetics, volume 33 (also appears as Acxiom Working Paper Series WP-02-03).

Glasser, P. (2002), Leaders, Bristol-Myers Squibb CIO on IT and Pharmaceutical Data, Ziff Davis CioInsight- Strategies for IT Business Leaders, www.cioinsight.com

Han, J. and Kamber, M. (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA.

Hanemann, W. (1994), Valuing the Environment Through Contingent Valuation, The Journal of Economic Perspectives, 8 (4), 19-43.

Heinrich, L. (2003), “Starting the data warehouse from a data model,” Applied Data Resource Management White Paper, www.adrm.com/4_infoart.htm

Hy, Heikki (2003), “Application example: Data mining of pharmaceutical data,” www.hut.fi/~hhyotyni/latex/Final/node61.html

IDBS (2001), “IDBS unveil new pharmaceutical data warehousing strategy at LabAutomation,” January 28, 2001, www.idbs.com/news/press_room/press_release.asp?release_date=01_28_2001

James, C. (2001), Global Review of Commercialized Transgenic Crops: 2001, International Service for the Acquisition of Agri-biotech Applications, No. 24, pp. iv.

Knightbridge Company (2001), “Pharmaceutical and medical products,” www.knightsbridge.com/pharmaceutical.html

Monsanto (2002), Plant-Made Pharmaceuticals: A New Way to Make Medicine.

Murtagh, F. and Hirtle, S. (1997), “On-Line software for clustering and multivariate analysis,” www.pitt.edu/~csna/software.html

Nash, K. S. (2002), “New Rx for pharmaceutical data,” Baseline, Ziff Davis Media Inc., June 12, 2002, www.baselinemag.com/print_article/0,3668,a=28045,00.asp

PEW Initiative on Food and Biotechnology (2002), On the Pharm: GM Plants and the Future of Medicine, AgBiotechBUZZ, 2(7), July 29, 2002.

Reichert, J. M. (2000), “New biopharmaceutical in the USA: trends in development and marketing approvals 1995-1999,” TIBTECH, September 2000, v. 18, pp. 364-369.

SAS (2003), “SAS solutions for pharmaceutical enterprises,” www.sas.com/industry/pharma

Segall, R. S. (1984), Models of Area Wide Medical Delivery, Ph.D. Dissertation, University of Massachusetts at Amherst.

Segall, R. S. (1988), Mathematical modeling for the capacity planning of market oriented systems: with an application to real health data, Applied Mathematical Modelling, v. 12, n. 4, 366-378.

Segall, R. S. (1995), Some mathematical and computer modeling of neural networks, Applied Mathematical Modelling, v. 19, 386-399.

Segall, R. S. (1996), Comparing learning rules of neural networks using computer graphics, Proceedings of the Twenty-seventh Annual Conference of the Southwest Decision Sciences Institute, San Antonio, TX, March 6-9, 1996.

Segall, R. S. (2001), Some Applications of Data Mining and Data Warehousing to a Medical Database and Neural Networks, Proceedings of the 32nd Conference of the Southwest Decision Sciences Institute, New Orleans, LA, March 30-April 2, 2001.

Segall, R. S. (2002), Some Applications of RightPoint DataCruncher for Data Mining of Data Warehouses, Proceedings of the 33rd Conference of the Southwest Decision Sciences Institute, St. Louis, MO, March 4-8, 2002.

Segall, R. S. (2003), Incorporating Data Mining and Computer Graphics for Modeling of Neural Networks, Acxiom Data Engineering Laboratory Working Paper Series, ADEL-WP-03-02, Publication in Collaboration with University of Arkansas at Little Rock (UALR) Donaghey Cyber College, 33 pages, March 2003.

Segall, R. S. (2004), Incorporating data mining and computer graphics for modeling of neural networks, to appear in Kybernetes: International Journal of Systems and Cybernetics, volume 33.

Silico Research (2001), “Data warehousing is becoming one of the core drug discovery technologies,” www.bioportfolio.com/silico/dw.htm

University of British Columbia (2002), Computer Science Department, Database Systems Laboratory, “Data Mining – Introduction,” www.cs.ubc.ca/nest/dbsl/mining.html

Wang et al. (2003), “A maize QTL for silk maysin levels contains duplicated Myb-homologous genes which jointly regulate flavone biosynthesis,” Journal of Plant Molecular Biology, v. 52, n. 1, pages 1-15, May 2003.

List of Figures

Figure 1: Clusters for data replacement train for corn microarray database

(a) slice: standard deviation, height: standard deviation, color: radius

(b) slice: radius, height: standard deviation, color: radius

(c) slice: radius, height: standard deviation, color: radius

(d) slice: radius, height: frequency, color: radius

(e) slice: standard deviation, height: frequency, color: radius

(f) slice: radius, height: radius, color: standard deviation

Figure 2: Clusters for environmental factors for corn microarray database

Figure 3: Cluster proximities of corn microarray database

Figure 4: Frequency bar chart of corn microarray for environment factor

(a) 1hour salt cy3, (b) 6 hour control cy3, (c) 24 hour control cy5

Figure 5: Cubic clustering criterion for corn microarray database

Figure 6: Decision tree obtained by data mining of corn microarray database

Figure 7: Clusters for data replacement train for log(2) transformed corn data

(a) slice: standard deviation, height: frequency, color: radius

(b) slice: standard deviation, height: radius, color: radius

(c) slice: radius, height: radius, color: radius

Figure 8: Clusters for environmental factors for log(2) transformed corn data

Figure 9: Cluster proximities of log(2) corn microarray data

Figure 10: Frequency bar chart of log(2) transformed corn data for environment factor of

controlled salt for: (a) 6 hour, (b) 3 hour, (c) 1 hour, (d) 24 hour, (e) 72 hour, (f) 12 hour.

Figure 11: Cubic clustering criterion for log(2) transformed corn data

Figure 12: Decision tree obtained by data mining of log(2) transformed corn data

List of Tables

Table 1: Environmental factors for corn microarray database

Table 2: Statistics for clusters for environment factors for corn

Table 3: Environmental factors for log(2) ratio transformation of corn microarray

database.

Table 4: Statistics for clusters for environment factors for log(2) transformed corn data

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Table 1

Table 2

Table 3

Table 4
