Data Mining a Review Pramod


Nonvolatile: The data are read only, not updated or changed by users.

3. KNOWLEDGE DISCOVERY IN DATABASES (KDD)

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [3]. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers. The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, to lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some post-processing.

An important notion, called interestingness (for example, see [4] and [5]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models. Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold; this is by no means an attempt to define knowledge in the philosophical or even the popular view. In fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.
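The idea of an explicit interestingness function with a user-chosen threshold can be sketched as follows. The component scores, weights, and threshold below are illustrative assumptions, not part of the definition above:

```python
# Sketch of an explicit interestingness function: a pattern counts as
# "knowledge" only if its combined score exceeds a user-chosen threshold.
# The component weights and the threshold are illustrative assumptions.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    """Combine the four component scores (each in [0, 1]) into one value."""
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

def is_knowledge(pattern_scores, threshold=0.6):
    """A pattern is treated as knowledge if it exceeds the threshold."""
    return interestingness(*pattern_scores) > threshold
```

As the text notes, nothing here is universal: a different user would choose different weights and a different threshold, and a KDD system may instead express the same preference implicitly by ranking patterns.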

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.

    3.1 The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. A practical view of the KDD process is given in [6], which emphasizes the interactive nature of the process. Here, we broadly outline its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting the method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different from models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than in its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice.

Figure 1. An Overview of the Steps That Compose the KDD Process [3].
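The basic flow of steps can also be sketched as a simple pipeline of functions. The stage names below are hypothetical placeholders, not a standard API, and a real KDD system would iterate and loop between stages rather than run them once in order:

```python
# Illustrative sketch of the KDD steps as a linear pipeline. Each stage
# is a deliberately trivial placeholder, just enough to make the flow
# runnable; the names are assumptions for illustration only.

def kdd_pipeline(raw_data, mine):
    data = select_target(raw_data)   # step 2: create the target data set
    data = clean(data)               # step 3: cleaning and preprocessing
    data = reduce_dims(data)         # step 4: reduction and projection
    patterns = mine(data)            # steps 5-7: method choice and mining
    return interpret(patterns)       # steps 8-9: interpret and act

# Minimal placeholder stages:
def select_target(rows):  return [r for r in rows if r is not None]
def clean(rows):          return [r for r in rows if isinstance(r, (int, float))]
def reduce_dims(rows):    return rows                 # identity here
def interpret(patterns):  return sorted(patterns)
```

For example, `kdd_pipeline([3, None, "x", 1, 2], mine=lambda d: set(d))` drops the missing and malformed records before the "mining" stage ever sees them, which mirrors the paper's point that the steps preceding data mining largely determine its success.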

4. DATA MINING (DM)

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

There are other definitions:

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [7].

Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases [8].

Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [9]. There are also many other terms, appearing in some articles and documents, carrying a similar or slightly different meaning, such as knowledge mining from databases, knowledge extraction, data archaeology, data dredging, data analysis, and so on.

Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity for major revenues. The discovered knowledge can be applied to information management, query processing, decision making, process control, and many other applications. Researchers in many different fields, including database systems, knowledge-base systems, artificial intelligence, machine learning, knowledge acquisition, statistics, spatial databases, and data visualization, have shown great interest in data mining. Furthermore, several emerging applications in information-providing services, such as on-line services and the World Wide Web, also call for various data mining techniques to better understand user behavior, improve the service provided, and increase business opportunities.

    4.1 An Overview of Data Mining Techniques

Since data mining poses many challenging research issues, direct applications of methods and techniques developed in related studies in machine learning, statistics, and database systems cannot solve these problems. It is necessary to perform dedicated studies to invent new data mining methods or develop integrated techniques for efficient and effective data mining. In this sense, data mining itself has formed an independent new field.

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks, which describe the general properties of the existing data, and predictive data mining tasks, which attempt to make predictions based on inference from the available data. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. For example, one may want to characterize the Our Video Store customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization.

Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account is lower than 5. The techniques used for data discrimination are very similar to the techniques used for data characterization, with the exception that data discrimination results include comparative measures.

Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the Our Video Store manager to know what movies are often rented together, or if there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P => Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, game) ^ Age(X, 13-19) => Buys(X, pop) [s=2%, c=55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
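The support and confidence measures described above can be computed directly from a set of transactions. The tiny transaction table below is invented for illustration, loosely following the Our Video Store example:

```python
# Support: fraction of transactions containing every item in the itemset.
# Confidence: conditional probability of the consequent given the
# antecedent. The transaction data is made up for illustration.

def support(transactions, itemset):
    """P(itemset): fraction of transactions containing all its items."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent), computed from the two supports."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [
    {"game", "pop"}, {"game", "pop"}, {"game"}, {"drama"}, {"drama", "pop"},
]
```

Here the rule game => pop has support 2/5 (two of five transactions contain both items) and confidence 2/3 (of the three transactions containing a game rental, two also contain pop), matching the [s, c] notation above.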

Classification: Classification analysis is the organization of data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, which is then used to classify new objects. For example, after starting a credit policy, the Our Video Store managers could analyze the customers' behavior with respect to their credit, and accordingly label the customers who received credit with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
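The train-then-classify workflow can be sketched minimally, using a one-nearest-neighbour rule as a stand-in for the learning algorithm. The features (say, income and number of late payments) and the training examples are invented; only the "safe"/"risky"/"very risky" labels follow the example above:

```python
# Sketch of supervised classification: learn a model from labelled
# training objects, then use it to classify new objects. 1-NN is an
# assumed, deliberately simple learning algorithm.

def train(examples):
    """The 'model' of one-nearest-neighbour is just the stored training set."""
    return list(examples)

def classify(model, x):
    """Assign x the label of its closest training object."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda ex: dist(ex[0], x))
    return label

training_set = [
    ((90, 0), "safe"), ((60, 2), "risky"), ((20, 8), "very risky"),
]
model = train(training_set)
```

A new applicant with features `(85, 1)` would be labelled "safe" by this model; accepting or rejecting the credit request then reduces to reading off the predicted label.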

Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification: once a classification model is built from a training set, the class label of an object can be foreseen from the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecasting of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
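Forecasting a numerical value from past values can be sketched with a least-squares trend line. This is an assumed, deliberately simple stand-in for a real forecasting model, shown only to make the "use past values to estimate future values" idea concrete:

```python
# Fit y = a + b*t to past values observed at t = 0, 1, 2, ...,
# then extrapolate the line to forecast a future value.

def fit_trend(values):
    """Least-squares slope b and intercept a for the observed series."""
    n = len(values)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(values) / n
    b = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, values))
         / sum((t - t_mean) ** 2 for t in ts))
    a = y_mean - b * t_mean
    return a, b

def forecast(values, steps_ahead=1):
    """Extrapolate the fitted trend `steps_ahead` points past the series."""
    a, b = fit_trend(values)
    return a + b * (len(values) - 1 + steps_ahead)
```

On a perfectly linear series such as 1, 2, 3, 4 the fitted line is exact and the one-step forecast is 5; on real time-related data the line only approximates the increase/decrease trend.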

Clustering: Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
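The intra-class/inter-class principle can be sketched with a tiny k-means on one-dimensional data: points are repeatedly assigned to their nearest centroid, pulling each class tight around its mean. The initial centroids and fixed iteration count are arbitrary illustrative choices:

```python
# Minimal k-means sketch for 1-D points. No class labels are given;
# the algorithm discovers the groups itself (unsupervised).

def kmeans_1d(points, centroids, rounds=10):
    clusters = [[] for _ in centroids]
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for p in points:                      # assign to nearest centroid
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]  # recompute means
                     for j, c in enumerate(clusters)]
    return centroids, clusters
```

For example, the points 1, 2, 3, 10, 11, 12 started from centroids 0 and 20 settle into the two obvious groups with centroids 2.0 and 11.0, without any label ever being supplied.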

Outlier analysis: Outliers are data elements that cannot be grouped into a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.

Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

5. THE ISSUES IN DATA MINING

While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are addressed below:

Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user-behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

User interface issues: Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data. The major issues related to user interfaces and visualization are screen real estate, information rendering, and interaction. Interactivity with the data and data mining results is crucial, since it provides a means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs, the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, the control and handling of noise in data, etc., are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand.

Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset, although concerns such as completeness and the choice of samples may then arise. Other topics in the issue of performance are incremental updating and parallel programming.
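Mining on a sample rather than the whole dataset can be sketched as follows; the counting "miner" is a hypothetical stand-in for a real mining algorithm, and the fixed seed is an illustrative choice for reproducibility:

```python
# Run a mining function on a random sample instead of the full dataset,
# trading completeness for speed. random.sample draws without
# replacement; the "miner" here is a trivial stand-in.
import random
from collections import Counter

def mine_on_sample(dataset, sample_size, miner, seed=0):
    random.seed(seed)  # reproducible sample choice for this sketch
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    return miner(sample)

def most_frequent_item(rows):
    """Stand-in miner: report the single most frequent item."""
    return Counter(rows).most_common(1)[0][0]
```

The completeness concern raised above shows up directly: a small sample may miss rare items entirely, so the sample size (and how the sample is drawn) determines which patterns can still be found.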

Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources; different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic.

6. CONCLUSION

In this paper we briefly reviewed the data warehouse, Knowledge Discovery in Databases (KDD), the KDD process, data mining, and issues related to data mining. This review should help researchers focus on the various issues of data mining. In future work, we will review various classification algorithms and the significance of the evolutionary computing (genetic programming) approach in designing efficient classification algorithms for data mining.

7. REFERENCES

[1] B. A. Devlin and P. T. Murphy. An Architecture for a Business and Information System. IBM Systems Journal, 27(1), 1988.

[2] W. H. Inmon. Building the Data Warehouse, 2nd edition. Wiley Computer Publishing, 1996.

[3] U. Fayyad et al. Knowledge Discovery and Data Mining: Towards a Unifying Framework. 1996.

[4] Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. 1996. Selecting and Reporting What Is Interesting: The KEFIR Application to Healthcare Data. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.


[5] Silberschatz, A., and Tuzhilin, A. 1995. On Subjective Measures of Interestingness in Knowledge Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 275-281. Menlo Park, Calif.: American Association for Artificial Intelligence.

[6] Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, 37-58, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

[7] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.

[8] Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Upper Saddle River, NJ, 1998.

[9] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[10] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.

[11] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[12] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus. Knowledge Discovery in Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[14] T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.

[15] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
