Data Mining: A Review
Pramod Vishwakarma
Nonvolatile: The data are read only, not updated or changed by users.
3. KNOWLEDGE DISCOVERY IN DATABASES (KDD)
KDD is the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in
data [3].
By nontrivial, we mean that some search or inference is
involved; that is, it is not a straightforward computation of
predefined quantities like computing the average value of a
set of numbers. The discovered patterns should be valid on
new data with some degree of certainty. We also want
patterns to be novel (at least to the system and preferably to
the user) and potentially useful, that is, lead to some benefit
to the user or task. Finally, the patterns should be understandable, if not immediately then after some post-processing.
An important notion, called interestingness (for example, see [4] and [5]), is usually taken as an overall measure of
pattern value, combining validity, novelty, usefulness, and
simplicity. Interestingness functions can be defined
explicitly or can be manifested implicitly through an
ordering placed by the KDD system on the discovered
patterns or models. Given these notions, we can consider a
pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define
knowledge in the philosophical or even the popular view.
As a matter of fact, knowledge in this definition is purely
user oriented and domain specific and is determined by
whatever functions and thresholds the user chooses.
Data mining is a step in the KDD process that consists of
applying data analysis and discovery algorithms that, under
acceptable computational efficiency limitations, produce a
particular enumeration of patterns (or models) over the data.
3.1 The KDD Process
The KDD process is interactive and iterative, involving
numerous steps with many decisions made by the user. A
practical view of the KDD process is given in [6] that emphasizes the interactive nature of the process. Here, we
broadly outline some of its basic steps:
First is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.
Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).
Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different from models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than in its predictive capabilities).
Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.
Eighth is interpreting the mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.
Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating it into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
The KDD process can involve significant iteration and can
contain loops between any two steps. The basic flow of
steps (although not the potential multitude of iterations and
loops) is illustrated in figure 1. Most previous work on
KDD has focused on step 7, the data mining. However, the
other steps are as important (and probably more so) for the
successful application of KDD in practice.
Figure 1. An Overview of the Steps That Compose the
KDD Process [3].
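As a rough illustration only, the basic flow of steps can be sketched as a chain of small functions over a toy data set. Every name below (`select_target`, `clean`, `mine`) and all of the records are hypothetical placeholders, not part of any standard library or of the process definition in [3]; real KDD systems iterate between the steps rather than running them once.

```python
# A minimal, hypothetical sketch of the KDD process as a function pipeline.
# Each function stands in for the much richer activity described in the text.

def select_target(records, fields):
    """Step 2: focus on a subset of variables."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Step 3: drop records with missing fields (one simple strategy)."""
    return [r for r in records if all(v is not None for v in r.values())]

def mine(records, min_rentals):
    """Step 7: a trivial 'pattern': customers renting above a threshold."""
    return [r for r in records if r["rentals"] > min_rentals]

raw = [
    {"name": "Ann", "age": 34, "rentals": 42},
    {"name": "Bob", "age": None, "rentals": 12},
    {"name": "Cy", "age": 19, "rentals": 35},
]

target = select_target(raw, ["age", "rentals"])
pattern = mine(clean(target), min_rentals=30)
print(pattern)  # the 'discovered' subset
```

Interpretation and acting on the result (steps 8 and 9) remain with the user, which is why the text stresses the interactive nature of the process.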
4. DATA MINING (DM)
Data mining is the process of discovering meaningful new
correlations, patterns and trends by sifting through large
amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques.
There are other definitions:
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data owner [7].
Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern
recognition, statistics, databases, and visualization to
address the issue of information extraction from large databases [8].
Data mining, which is also referred to as knowledge
discovery in databases, means a process of nontrivial
extraction of implicit, previously unknown and potentially
useful information (such as knowledge rules, constraints,
regularities) from data in databases [9]. There are also many other terms, appearing in some articles and documents,
carrying a similar or slightly different meaning, such as
knowledge mining from databases, knowledge extraction,
data archaeology, data dredging, data analysis, and so on.
Mining information and knowledge from large databases
has been recognized by many researchers as a key research
topic in database systems and machine learning and by
many industrial companies as an important area with an
opportunity of major revenues. The discovered knowledge
can be applied to information management, query
processing, decision making, process control, and many other applications. Researchers in many different fields, including database systems, knowledge-base systems, artificial intelligence, machine learning, knowledge
acquisition, statistics, spatial databases, and data
visualization, have shown great interest in data mining.
Furthermore, several emerging applications in information
providing services, such as on-line services and the World Wide Web, also call for various data mining techniques to better understand user behavior, to improve the service provided,
and to increase the business opportunities.
4.1 An Overview of Data Mining Techniques
Since data mining poses many challenging research issues, direct applications of methods and techniques developed in related studies in machine learning, statistics, and database systems cannot solve these problems. It is necessary to
perform dedicated studies to invent new data mining
methods or develop integrated techniques for efficient and
effective data mining. In this sense, data mining itself has
formed an independent new field.
The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and
predictive data mining tasks that attempt to do predictions
based on inference on available data. The data mining
functionalities and the variety of knowledge they discover are briefly presented in the following list:
Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules.
For example, one may want to characterize the Our Video Store customers who regularly rent more than 30
movies a year. With concept hierarchies on the
attributes describing the target class, the attribute-oriented induction method can be used, for example, to
carry out data summarization.
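A crude sketch of what such a summarization might compute is shown below; the customer records, attribute names, and the ">30 rentals" threshold are all invented for illustration and are not attribute-oriented induction proper, which additionally climbs concept hierarchies:

```python
from statistics import mean

# Hypothetical customer records for a video store.
customers = [
    {"age": 25, "city": "Pune", "rentals": 40},
    {"age": 31, "city": "Pune", "rentals": 36},
    {"age": 45, "city": "Delhi", "rentals": 8},
    {"age": 28, "city": "Pune", "rentals": 33},
]

# Target class: customers who rent more than 30 movies a year.
target = [c for c in customers if c["rentals"] > 30]

# A simple characteristic summary of the target class.
summary = {
    "count": len(target),
    "avg_age": mean(c["age"] for c in target),
    "cities": sorted({c["city"] for c in target}),
}
print(summary)
```

The resulting summary (count, average age, cities) plays the role of a characteristic description of the frequent-renter class.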
Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between
two classes referred to as the target class and the
contrasting class. For example, one may want to
compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental count is lower than 5. The
techniques used for data discrimination are very similar
to the techniques used for data characterization with the
exception that data discrimination results include
comparative measures.
Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a
threshold called support, identifies the frequent item
sets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction when another item appears, is used to
pinpoint association rules. Association analysis is
commonly used for market basket analysis. For
example, it could be useful for the Our Video Store
manager to know what movies are often rented together
or if there is a relationship between renting a certain
type of movies and buying popcorn or pop. The
discovered association rules are of the form P ⇒ Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule:
RentType(X, game) ∧ Age(X, 13–19) ⇒ Buys(X, pop) [s=2%, c=55%]
would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are
renting a game and buying a pop, and that there is a
certainty of 55% that teenage customers who rent a
game also buy pop.
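The support and confidence definitions above translate directly into code. The following sketch computes them for a candidate rule {game} ⇒ {pop} over an invented toy transaction database:

```python
# Computing support and confidence for a candidate rule P => Q,
# following the definitions in the text. Transactions are invented.

transactions = [
    {"game", "pop"},
    {"game", "popcorn"},
    {"game", "pop", "popcorn"},
    {"drama"},
    {"game"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(p, q):
    """Conditional probability that Q appears when P is present."""
    return support(p | q) / support(p)

# Rule: {game} => {pop}
s = support({"game", "pop"})      # P and Q appear together
c = confidence({"game"}, {"pop"})
print(s, c)
```

A rule is kept only if s and c clear the user-chosen support and confidence thresholds; frequent-itemset algorithms such as Apriori make this search tractable on real databases.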
Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, classification uses given
class labels to order the objects in the data collection.
Classification approaches normally use a training set
where all objects are already associated with known class labels. The classification algorithm learns from
the training set and builds a model. The model is used
to classify new objects. For example, after starting a
credit policy, the Our Video Store managers could
analyze the customers' behaviors vis-à-vis their credit,
and label accordingly the customers who received
credits with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject
credit requests in the future.
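The text does not fix a particular algorithm, so as one minimal stand-in for "learn from a labelled training set, then classify new objects", the sketch below uses a 1-nearest-neighbour rule; the features (payment delay, balance owed) and training examples are invented:

```python
# A 1-nearest-neighbour classifier as a minimal illustration of
# supervised classification. Features and labels are hypothetical.

# Training set: (payment_delay_days, balance_owed) -> credit label
train = [
    ((0, 0), "safe"),
    ((2, 10), "safe"),
    ((15, 200), "risky"),
    ((40, 900), "very risky"),
]

def classify(x):
    """Label a new object by its closest training example (squared distance)."""
    def dist2(a):
        return sum((ai - xi) ** 2 for ai, xi in zip(a, x))
    return min(train, key=lambda pair: dist2(pair[0]))[1]

print(classify((1, 5)))     # falls near the "safe" examples
print(classify((35, 850)))  # falls near the "very risky" example
```

Real systems more often build an explicit model (a decision tree, a rule set) from the training data, but the train-then-classify workflow is the same.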
Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major
types of predictions: one can either try to predict some
unavailable data values or pending trends, or predict a
class label for some data. The latter is tied to
classification. Once a classification model is built based
on a training set, the class label of an object can be
foreseen based on the attribute values of the object and
the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The main idea is to use a large number of past values
to consider probable future values.
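For the numeric case, one of the simplest realizations of "use past values to consider probable future values" is to fit a least-squares line to the series and extrapolate one step ahead; the series below is invented:

```python
# Numeric prediction sketch: fit a least-squares line to past values
# and extrapolate the next one. The series is hypothetical.

past = [10.0, 12.0, 14.0, 16.0, 18.0]  # e.g. monthly rentals

n = len(past)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(past) / n

num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, past))
den = sum((x - x_mean) ** 2 for x in xs)
slope = num / den
intercept = y_mean - slope * x_mean

forecast = slope * n + intercept  # predicted value at the next time step
print(forecast)
```

Real forecasting methods (moving averages, ARIMA, regression trees) are more sophisticated, but they share this structure: estimate a model from history, then read off a future value.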
Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike
classification, in clustering, class labels are unknown
and it is up to the clustering algorithm to discover
acceptable classes. Clustering is also called
unsupervised classification, because the classification is
not dictated by given class labels. There are many
clustering approaches, all based on the principle of maximizing the similarity between objects in the same
class (intra-class similarity) and minimizing the
similarity between objects of different classes (inter-class similarity).
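K-means is one common approach built on exactly this principle; the toy sketch below alternates assignment and centre-update steps on invented one-dimensional points (k = 2, naive initialisation):

```python
# A tiny k-means sketch (k = 2) on one-dimensional points, illustrating
# "discover classes without given labels". Data are invented.

points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.5]
centers = [points[0], points[-1]]  # naive initialisation

for _ in range(10):  # a few fixed iterations suffice here
    # Assignment step: attach each point to its nearest centre.
    clusters = [[], []]
    for p in points:
        idx = min((abs(p - c), j) for j, c in enumerate(centers))[1]
        clusters[idx].append(p)
    # Update step: move each centre to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))
```

Each iteration decreases the within-cluster (intra-class) spread, which is the formal counterpart of "maximizing intra-class similarity".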
Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to
identify. While outliers can be considered noise and
discarded in some applications, they can reveal
important knowledge in other domains, and thus can be very significant and their analysis valuable.
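One elementary way to flag such exceptions, among many, is a z-score test: mark values lying more than a few standard deviations from the mean. The data and the 2-sigma threshold below are invented:

```python
from statistics import mean, pstdev

# Flag outliers as values more than two standard deviations from the
# mean (a simple z-score test). Data and threshold are hypothetical.

values = [10, 11, 9, 10, 12, 11, 60]
mu, sigma = mean(values), pstdev(values)

outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)
```

Whether a flagged value is noise to discard or a surprise worth investigating (say, a fraudulent transaction) depends entirely on the domain, as the text notes.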
Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data. Evolution analysis models evolutionary trends in data, which allows
characterizing, comparing, classifying or clustering of
time related data. Deviation analysis, on the other hand,
considers differences between measured values and
expected values, and attempts to find the cause of the deviations from the anticipated values.
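The deviation side of this can be sketched as a straightforward comparison of measured against expected values over time; the forecasts, measurements, and threshold below are invented:

```python
# Deviation analysis sketch: compare measured values against expected
# values over time and flag large deviations. Data are hypothetical.

expected = [100, 105, 110, 115, 120]  # e.g. forecast sales per month
measured = [98, 104, 140, 114, 121]

threshold = 10
deviations = [m - e for m, e in zip(measured, expected)]
flagged = [(i, d) for i, d in enumerate(deviations) if abs(d) > threshold]
print(flagged)  # (time index, deviation) pairs worth investigating
```

Finding the cause of each flagged deviation, the harder part, remains an analyst's task.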
5. THE ISSUES IN DATA MINING
While data mining is still in its infancy, it is becoming a trend and is ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are discussed below:
Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling,
user behavior understanding, correlating personal data
with other information, etc., large amounts of sensitive
and private information about individuals or companies
is gathered and stored. This becomes controversial
given the confidential nature of some of this data and
the potential illegal access to the information.
Moreover, data mining could disclose new implicit
knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of
data mining. Due to the value of data, databases of all
sorts of content are regularly sold, and because of the
competitive advantage that can be attained from
implicit knowledge discovered, some important
information could be withheld, while other information
could be widely distributed and used without control.
User interface issues: Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated
by the ability to see data in an appropriate visual
presentation. There are many visualization ideas and proposals for effective graphical presentation of data. The major issues related to user interfaces and visualization
are screen real-estate, information rendering, and
interaction. Interactivity with the data and data mining
results is crucial since it provides means for the user to
focus and refine the mining tasks, as well as to picture
the discovered knowledge from different angles and at
different conceptual levels.
Mining methodology issues: These issues pertain to the data mining approaches applied and their
limitations. Topics such as versatility of the mining
approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs, the assessment of the knowledge discovered, the exploitation of background knowledge and metadata,
the control and handling of noise in data, etc. are all
examples that can dictate mining methodology choices.
For instance, it is often desirable to have different data mining methods available since different approaches
may perform differently depending upon the data at
hand.
Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is
dealing with today. Terabyte sizes are common. This
raises the issues of scalability and efficiency of the data
mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead
of the whole dataset. However, concerns such as
completeness and choice of samples may arise. Other
topics in the issue of performance are incremental
updating, and parallel programming.
Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical,
like the data glut problem. We certainly have an excess
of data since we already have more data than we can
handle and we are still collecting data at an even higher rate. If the spread of database management systems has
helped increase the gathering of information, the advent
of data mining is certainly encouraging more data
harvesting. The current practice is to collect as much
data as possible now and process it, or try to process it,
later. Regarding the practical issues related to data
sources, there is the subject of heterogeneous databases
and the focus on diverse complex data types. We are storing different types of data in a variety of
repositories. It is difficult to expect a data mining
system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct
algorithms and methodologies. Currently, there is a
focus on relational databases and data warehouses, but
other approaches need to be pioneered for other specific
complex data types. A versatile data mining tool, for all
sorts of data, may not be realistic.
6. CONCLUSION
In this paper we briefly reviewed the Data Warehouse,
Knowledge Discovery in Databases (KDD), KDD Process,
Data Mining, and issues related to Data Mining. This review would be helpful to researchers focusing on the various issues of data mining. In future work, we will review the
various classification algorithms and the significance of the evolutionary computing (genetic programming) approach in designing efficient classification algorithms for data mining.
7. REFERENCES
[1] B. A. Devlin and P. T. Murphy. An Architecture for a Business and Information System. IBM Systems Journal, 27(1), 1988.
[2] W. H. Inmon. Building the Data Warehouse, 2nd edition. Wiley Computer Publishing, 1996.
[3] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Knowledge Discovery and Data Mining: Towards a Unifying Framework. 1996.
[4] Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. 1996. Selecting and Reporting What Is Interesting: The KEfiR Application to Healthcare Data. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
[5] Silberschatz, A., and Tuzhilin, A. 1995. On Subjective Measures of Interestingness in Knowledge Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 275–281. Menlo Park, Calif.: American Association for Artificial Intelligence.
[6] Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, 37–58, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.
[7] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.
[8] Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Upper Saddle River, NJ, 1998.
[9] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[10] M. S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Trans. Knowledge and Data Engineering, 8:866–883, 1996.
[11] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[12] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge Discovery in Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[14] T. Imielinski and H. Mannila. A Database Perspective on Knowledge Discovery. Communications of the ACM, 39:58–64, 1996.
[15] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1–35. AAAI/MIT Press, 1996.
[16] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.