Data Mining a Review Pramod


Nonvolatile: The data are read only, not updated or changed by users.

3. KNOWLEDGE DISCOVERY IN DATABASES (KDD)

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [3]. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers. The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, to lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some post-processing.

An important notion, called interestingness (for example, see [4] and [5]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models. Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold; this is by no means an attempt to define knowledge in the philosophical or even the popular view. In fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.
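The idea of an explicit interestingness function with a user-chosen threshold can be sketched as follows. The component scores, weights, and threshold below are illustrative assumptions, not part of the definition above:

```python
# Sketch of an explicit interestingness function: a pattern counts as
# "knowledge" only if its combined score exceeds a user-chosen threshold.
# The component weights and the threshold are illustrative assumptions.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    """Combine the four component scores (each in [0, 1]) into one value."""
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

def is_knowledge(pattern_scores, threshold=0.6):
    """A pattern is treated as knowledge if it exceeds the threshold."""
    return interestingness(*pattern_scores) > threshold
```

As the text notes, nothing here is universal: a different user would choose different weights and a different threshold, and a KDD system may instead express the same preference implicitly by ranking patterns.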

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.

    3.1 The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. A practical view of the KDD process is given in [6], which emphasizes the interactive nature of the process. Here, we broadly outline its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting the method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different from models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than in its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice.

Figure 1. An Overview of the Steps That Compose the KDD Process [3].
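The basic flow of steps can also be sketched as a simple pipeline of functions. The stage names below are hypothetical placeholders, not a standard API, and a real KDD system would iterate and loop between stages rather than run them once in order:

```python
# Illustrative sketch of the KDD steps as a linear pipeline. Each stage
# is a deliberately trivial placeholder, just enough to make the flow
# runnable; the names are assumptions for illustration only.

def kdd_pipeline(raw_data, mine):
    data = select_target(raw_data)   # step 2: create the target data set
    data = clean(data)               # step 3: cleaning and preprocessing
    data = reduce_dims(data)         # step 4: reduction and projection
    patterns = mine(data)            # steps 5-7: method choice and mining
    return interpret(patterns)       # steps 8-9: interpret and act

# Minimal placeholder stages:
def select_target(rows):  return [r for r in rows if r is not None]
def clean(rows):          return [r for r in rows if isinstance(r, (int, float))]
def reduce_dims(rows):    return rows                 # identity here
def interpret(patterns):  return sorted(patterns)
```

For example, `kdd_pipeline([3, None, "x", 1, 2], mine=lambda d: set(d))` drops the missing and malformed records before the "mining" stage ever sees them, which mirrors the paper's point that the steps preceding data mining largely determine its success.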

4. DATA MINING (DM)

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

There are other definitions:

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [7].

Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases [8].

Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [9]. There are also many other terms, appearing in some articles and documents, carrying a similar or slightly different meaning, such as knowledge mining from databases, knowledge extraction, data archaeology, data dredging, data analysis, and so on.

Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity for major revenues. The discovered knowledge can be applied to information management, query processing, decision making, process control, and many other applications. Researchers in many different fields, including database systems, knowledge-base systems, artificial intelligence, machine learning, knowledge acquisition, statistics, spatial databases, and data visualization, have shown great interest in data mining. Furthermore, several emerging applications in information-providing services, such as on-line services and the World Wide Web, also call for various data mining techniques to better understand user behavior, improve the service provided, and increase business opportunities.

    4.1 An Overview of Data Mining Techniques

Since data mining poses many challenging research issues, direct applications of methods and techniques developed in related studies in machine learning, statistics, and database systems cannot solve these problems. It is necessary to perform dedicated studies to invent new data mining methods or develop integrated techniques for efficient and effective data mining. In this sense, data mining itself has formed an independent new field.

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks, which describe the general properties of the existing data, and predictive data mining tasks, which attempt to make predictions based on inference from the available data. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. For example, one may want to characterize the Our Video Store customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization.

Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account is lower than 5. The techniques used for data discrimination are very similar to the techniques used for data characterization, with the exception that data discrimination results include comparative measures.

Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the Our Video Store manager to know what movies are often rented together, or if there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P => Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, game) ^ Age(X, 13-19) => Buys(X, pop) [s=2%, c=55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
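The support and confidence measures described above can be computed directly from a set of transactions. The tiny transaction table below is invented for illustration, loosely following the Our Video Store example:

```python
# Support: fraction of transactions containing every item in the itemset.
# Confidence: conditional probability of the consequent given the
# antecedent. The transaction data is made up for illustration.

def support(transactions, itemset):
    """P(itemset): fraction of transactions containing all its items."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent), computed from the two supports."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [
    {"game", "pop"}, {"game", "pop"}, {"game"}, {"drama"}, {"drama", "pop"},
]
```

Here the rule game => pop has support 2/5 (two of five transactions contain both items) and confidence 2/3 (of the three transactions containing a game rental, two also contain pop), matching the [s, c] notation above.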

Classification: Classification analysis is the organization of data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, which is then used to classify new objects. For example, after starting a credit policy, the Our Video Store managers could analyze the customers' behavior with respect to their credit, and accordingly label the customers who received credit with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
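The train-then-classify workflow can be sketched minimally, using a one-nearest-neighbour rule as a stand-in for the learning algorithm. The features (say, income and number of late payments) and the training examples are invented; only the "safe"/"risky"/"very risky" labels follow the example above:

```python
# Sketch of supervised classification: learn a model from labelled
# training objects, then use it to classify new objects. 1-NN is an
# assumed, deliberately simple learning algorithm.

def train(examples):
    """The 'model' of one-nearest-neighbour is just the stored training set."""
    return list(examples)

def classify(model, x):
    """Assign x the label of its closest training object."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda ex: dist(ex[0], x))
    return label

training_set = [
    ((90, 0), "safe"), ((60, 2), "risky"), ((20, 8), "very risky"),
]
model = train(training_set)
```

A new applicant with features `(85, 1)` would be labelled "safe" by this model; accepting or rejecting the credit request then reduces to reading off the predicted label.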

Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification: once a classification model is built from a training set, the class label of an object can be foreseen from the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecasting of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
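Forecasting a numerical value from past values can be sketched with a least-squares trend line. This is an assumed, deliberately simple stand-in for a real forecasting model, shown only to make the "use past values to estimate future values" idea concrete:

```python
# Fit y = a + b*t to past values observed at t = 0, 1, 2, ...,
# then extrapolate the line to forecast a future value.

def fit_trend(values):
    """Least-squares slope b and intercept a for the observed series."""
    n = len(values)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(values) / n
    b = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, values))
         / sum((t - t_mean) ** 2 for t in ts))
    a = y_mean - b * t_mean
    return a, b

def forecast(values, steps_ahead=1):
    """Extrapolate the fitted trend `steps_ahead` points past the series."""
    a, b = fit_trend(values)
    return a + b * (len(values) - 1 + steps_ahead)
```

On a perfectly linear series such as 1, 2, 3, 4 the fitted line is exact and the one-step forecast is 5; on real time-related data the line only approximates the increase/decrease trend.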

Clustering: Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
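The intra-class/inter-class principle can be sketched with a tiny k-means on one-dimensional data: points are repeatedly assigned to their nearest centroid, pulling each class tight around its mean. The initial centroids and fixed iteration count are arbitrary illustrative choices:

```python
# Minimal k-means sketch for 1-D points. No class labels are given;
# the algorithm discovers the groups itself (unsupervised).

def kmeans_1d(points, centroids, rounds=10):
    clusters = [[] for _ in centroids]
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for p in points:                      # assign to nearest centroid
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]  # recompute means
                     for j, c in enumerate(clusters)]
    return centroids, clusters
```

For example, the points 1, 2, 3, 10, 11, 12 started from centroids 0 and 20 settle into the two obvious groups with centroids 2.0 and 11.0, without any label ever being supplied.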

Outlier analysis: Outliers are data elements that cannot be grouped into a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.

Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

5. THE ISSUES IN DATA MINING

While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are addressed below:

Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user-behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

User interface issues: Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data. The major issues related to user interfaces and visualization are screen real estate, information rendering, and interaction. Interactivity with the data and data mining results is crucial, since it provides a means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs, the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, the control and handling of noise in data, etc., are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand.

Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset, although concerns such as completeness and the choice of samples may then arise. Other topics in the issue of performance are incremental updating and parallel programming.
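Mining on a sample rather than the whole dataset can be sketched as follows; the counting "miner" is a hypothetical stand-in for a real mining algorithm, and the fixed seed is an illustrative choice for reproducibility:

```python
# Run a mining function on a random sample instead of the full dataset,
# trading completeness for speed. random.sample draws without
# replacement; the "miner" here is a trivial stand-in.
import random
from collections import Counter

def mine_on_sample(dataset, sample_size, miner, seed=0):
    random.seed(seed)  # reproducible sample choice for this sketch
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    return miner(sample)

def most_frequent_item(rows):
    """Stand-in miner: report the single most frequent item."""
    return Counter(rows).most_common(1)[0][0]
```

The completeness concern raised above shows up directly: a small sample may miss rare items entirely, so the sample size (and how the sample is drawn) determines which patterns can still be found.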

Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources; different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic.

6. CONCLUSION

In this paper we briefly reviewed the data warehouse, Knowledge Discovery in Databases (KDD), the KDD process, data mining, and issues related to data mining. This review should help researchers focus on the various issues of data mining. In future work, we will review various classification algorithms and the significance of the evolutionary computing (genetic programming) approach in designing efficient classification algorithms for data mining.

7. REFERENCES

[1] B. A. Devlin and P. T. Murphy. An Architecture for a Business and Information System. IBM Systems Journal, 27(1), 1988.

[2] W. H. Inmon. Building the Data Warehouse, 2nd edition. Wiley Computer Publishing, 1996.

[3] U. Fayyad et al. Knowledge Discovery and Data Mining: Towards a Unifying Framework. 1996.

[4] Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. 1996. Selecting and Reporting What Is Interesting: The KEFIR Application to Healthcare Data. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.


[5] Silberschatz, A., and Tuzhilin, A. 1995. On Subjective Measures of Interestingness in Knowledge Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 275-281. Menlo Park, Calif.: American Association for Artificial Intelligence.

[6] Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, 37-58, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park, Calif.: AAAI Press.

[7] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.

[8] Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Upper Saddle River, NJ, 1998.

[9] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[10] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.

[11] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[12] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus. Knowledge Discovery in Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[14] T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.

[15] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
