DDB_presentation5Data Mining Overview


Transcript of DDB_presentation5Data Mining Overview

  • 8/17/2019 DDB_presentation5Data Mining Overview


    Data Mining Overview

    “Data Mining is data analysis in order to discover hidden correlations (patterns, rules) in huge data sets”

    “Data Mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions”


    Data Mining versus KDD

    • Knowledge Discovery in Databases (KDD) involves the extraction of implicit, previously unknown and potentially useful information from data.

    • Data Mining is the use of algorithms to extract the information and patterns derived by the KDD process.


    THE KDD PROCESS

    • Selection: This first step obtains the data from various databases, files, and nonelectronic sources.

    • Preprocessing: Incorrect data is corrected or removed; missing data must be supplied or predicted.

    • Transformation: Data from different sources is converted into a common format for processing. Some data is encoded or transformed into more usable formats. Data reduction might be applied to shrink the data to be analysed.

    • Data Mining: Applying algorithms to the transformed data to generate the desired results.

    • Interpretation/Evaluation: Visualising the results by using different GUI strategies and interpreting them.
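The five KDD steps can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration under stated assumptions: the records, the function names, the mean-based fill for missing values and the income threshold are all hypothetical and do not come from the slides.

```python
from statistics import mean

# Hypothetical raw records from two separate sources (assumed data).
source_a = [{"id": 1, "income": "52000"}, {"id": 2, "income": None}]
source_b = [{"id": 3, "income": "-1"}, {"id": 4, "income": "61000"}]


def select(*sources):
    """Selection: gather the records from every source into one list."""
    return [dict(rec) for src in sources for rec in src]


def preprocess(records):
    """Preprocessing: remove incorrect values, then supply missing ones."""
    valid = [r for r in records
             if r["income"] is None or float(r["income"]) >= 0]
    known = [float(r["income"]) for r in valid if r["income"] is not None]
    fill = mean(known)  # assumption: predict a missing value by the mean
    return [{**r, "income": fill if r["income"] is None else float(r["income"])}
            for r in valid]


def transform(records):
    """Transformation: re-encode income in thousands."""
    return [{**r, "income_k": r["income"] / 1000} for r in records]


def mine(records, threshold_k=55.0):
    """Data mining: a trivial stand-in algorithm that flags high incomes."""
    return {r["id"]: r["income_k"] > threshold_k for r in records}


def interpret(result):
    """Interpretation/evaluation: present the result in readable form."""
    return [f"record {rid}: {'high' if flag else 'low'} income"
            for rid, flag in sorted(result.items())]


records = transform(preprocess(select(source_a, source_b)))
result = mine(records)
print("\n".join(interpret(result)))
```

In a real project each stage would be far larger, and the mining step would be one of the tasks described in the following slides rather than a fixed threshold.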


    Enabling factors for data mining

    Data availability

    • Increased amount of electronically stored data

    • Increased processing power

    • Increased data storage ability

    • Increased data gathering ability (networks, extraction tools)

    • Increased number of data warehouses

    Business conditions

    • Increased need to compete effectively

    • Increased awareness of the need to know customers


    Data Mining Models and Tasks


    Classification

    • Classification maps data into predefined groups or classes. Classification algorithms require the classes to be defined based on data attribute values. They often describe these classes by looking at the characteristics of data already known to belong to the classes.

    • Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes.

    • Example: Determining whether to approve a bank loan application.
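As a rough illustration of classifying by similarity to predefined classes, the sketch below uses a nearest-centroid rule, which is just one simple choice among many classification algorithms. The applicant attributes, the training records and the `classify` helper are hypothetical.

```python
from math import dist
from statistics import mean

# Hypothetical applications already known to belong to each class:
# (annual income in thousands, existing debt in thousands).
training = {
    "approve": [(80, 10), (95, 5), (70, 15)],
    "reject": [(30, 40), (25, 35), (40, 50)],
}

# Describe each predefined class by the mean of its known members.
centroids = {
    label: tuple(mean(axis) for axis in zip(*points))
    for label, points in training.items()
}


def classify(applicant):
    """Assign the applicant to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(applicant, centroids[label]))


print(classify((85, 12)))
print(classify((28, 45)))
```

The classes are fixed before any new application arrives; only the assignment of a new record to one of them is computed.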


    Regression

    • Regression is used to map a data item to a real-valued prediction variable. In actuality, regression involves the learning of the function that does this mapping.

    • Regression assumes that the target data fit into some known type of function (e.g., linear) and then determines the best function of this type that models the given data.

    • Example: Eva wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on its current value and several past values. She uses a simple linear regression formula to predict this value by fitting past behaviour to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.
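A simple linear regression of the kind described in the example can be sketched with the ordinary least-squares formulas for slope and intercept. The savings figures and the `fit_line` helper below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical savings balance (in thousands) at the end of years 1 to 5.
years = [1, 2, 3, 4, 5]
savings = [10.0, 12.0, 14.5, 16.0, 18.5]

slope, intercept = fit_line(years, savings)

# Use the fitted function to predict a value at a future point in time.
predicted_year_10 = slope * 10 + intercept
print(f"growth per year: {slope:.2f}, predicted at year 10: {predicted_year_10:.2f}")
```

The prediction is only as good as the assumption that past behaviour is linear, which is exactly the assumption the second bullet describes.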


    Time Series Analysis

    • With time series analysis, the value of an attribute is examined as it varies over time. The values are obtained at evenly spaced time points (daily, weekly, hourly, etc.).

    • A time series plot is used to visualise the time series.

    • Example: Eva is trying to determine whether to purchase stocks from Companies X, Y or Z. For a period of one month she charts the daily stock price for these companies. Using this information she decides to purchase stocks from X, because it is less volatile while overall showing a slightly larger relative amount of growth than either of the other stocks.
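The comparison in the stock example can be approximated numerically. The sketch below assumes that volatility is measured as the standard deviation of daily returns and growth as the relative change over the period; both definitions, and the price series themselves, are hypothetical choices rather than anything stated on the slide.

```python
from statistics import pstdev

# Hypothetical daily closing prices at evenly spaced time points.
prices = {
    "X": [100, 101, 102, 103, 105, 106],
    "Y": [100, 110, 95, 108, 97, 104],
    "Z": [100, 99, 101, 100, 102, 103],
}


def daily_returns(series):
    """Relative change from each day to the next."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


def volatility(series):
    """Standard deviation of the daily returns (assumed volatility measure)."""
    return pstdev(daily_returns(series))


def relative_growth(series):
    """Relative change between the first and the last observation."""
    return (series[-1] - series[0]) / series[0]


summary = {name: (volatility(s), relative_growth(s)) for name, s in prices.items()}
least_volatile = min(summary, key=lambda name: summary[name][0])
highest_growth = max(summary, key=lambda name: summary[name][1])
print(least_volatile, highest_growth)
```

With these made-up series, X comes out both least volatile and fastest growing, mirroring the decision in the example.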


    Prediction

    • Many real-world data mining applications can be seen as predicting future data states based on past and current data. Prediction can be viewed as a type of classification (with the difference that it is classifying a future state rather than a current state).

    • Although future values may be predicted using time series analysis or regression techniques, other approaches may be used as well.

    • Example: Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, and so on. Then the water level at a potential flooding point in the river can be predicted based on data collected by the sensors upriver from this point. The prediction must be made with respect […]


    Clustering

    • Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. It can be thought of as partitioning the data into groups that might or might not be disjoint.

    • The clustering is usually accomplished by determining the similarity among the data on predefined attributes.

    • Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.

    Example:
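To make the contrast with classification concrete, the following sketch groups values by similarity on a single attribute using k-means, one common clustering algorithm (the slide does not name a particular method). The ages, the starting centroids and the `k_means_1d` helper are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on one numeric attribute, starting from given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Similarity here is simply closeness on the chosen attribute.
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical customer ages; no class labels are supplied.
ages = [21, 23, 25, 44, 46, 48, 70, 72]
clusters, centroids = k_means_1d(ages, centroids=[20.0, 45.0, 80.0])
print(clusters)
print(centroids)
```

The algorithm returns groups but no names for them; deciding that the three groups mean something like "young", "middle-aged" and "senior" is the interpretation step left to the domain expert.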


    Summarisation

    • Summarisation maps data into subsets with associated simple descriptions.

    • Summarisation is also called characterisation or generalisation. It extracts or derives representative information about the data set.

    • This may be accomplished by actually retrieving portions of the data. Alternatively, summary-type information (such as the mean of some numeric attribute) can be derived from the data.

    • Example: One of the many criteria used to compare universities by the US News and World Report is the average score.
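Deriving summary-type information is often a one-line aggregation per subset. The sketch below computes a few simple descriptions for each university; the names and score values are hypothetical.

```python
from statistics import mean

# Hypothetical admission-test scores, grouped by university.
scores = {
    "University A": [1200, 1350, 1280],
    "University B": [1100, 1150, 1210],
}

# Map each subset of the data to a simple description of it.
summary = {
    name: {"count": len(vals), "min": min(vals), "max": max(vals), "mean": mean(vals)}
    for name, vals in scores.items()
}

for name, stats in summary.items():
    print(f"{name}: mean score {stats['mean']:.1f} over {stats['count']} students")
```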


    Association Rules

    • Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data.

    • An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together.

    • Example: A grocery store is trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. […] of the time that bread is sold, jelly is also sold. Based on this, the retailer decides to place some jelly at the end of the aisle where the bread is placed and decides not to have the jelly on sale at the same time.
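A statement of the form "x% of the time bread is sold, jelly is also sold" corresponds to the confidence of the rule bread → jelly, usually reported together with its support. The sketch below computes both from a handful of hypothetical baskets; the figures it prints are not the ones from the slide.

```python
# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "jelly", "eggs"},
]


def count(itemset):
    """Number of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets)


def support(itemset):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also has the consequent."""
    return count(antecedent | consequent) / count(antecedent)


print(f"support(bread, jelly)    = {support({'bread', 'jelly'}):.2f}")
print(f"confidence(bread->jelly) = {confidence({'bread'}, {'jelly'}):.2f}")
```

A rule is normally kept only if both values exceed thresholds chosen by the analyst.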


    Sequence Discovery

    • Sequential analysis or sequence discovery is used to determine sequential patterns in data.

    • These patterns are similar to associations that are found in the data, but they are based on time.

    • Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order.

    • Example: The webmaster at XYZ Corp periodically analyses the Web log data to determine how users of XYZ's Web pages access them.
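In the spirit of the Web log example, a very small form of sequence discovery counts how often one page is visited before another within a session. The session data, page names and minimum count below are hypothetical, and real sequence-mining algorithms handle longer patterns than pairs.

```python
from collections import Counter

# Hypothetical page sequences reconstructed from a Web log, one per session.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "search", "products", "cart"],
    ["home", "products", "support"],
]


def ordered_pairs(session):
    """All (earlier, later) page pairs in a session, each counted once."""
    return {(a, b)
            for i, a in enumerate(session)
            for b in session[i + 1:]
            if a != b}


pair_counts = Counter(pair for s in sessions for pair in ordered_pairs(s))

# Keep the ordered patterns seen in at least min_count sessions.
min_count = 2
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(frequent)
```

Order matters here: ("home", "cart") is counted, but ("cart", "home") never is, which is what separates this from a market basket count.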


    Association Rules

     


    Applications of Data Mining


    Applications of Data Mining

    Marketing

    • analysis of customer behaviour based on buying patterns

    • determination of marketing strategies including advertising, store location, and targeted mailing

    • segmentation of customers, stores, or products

    • design of catalogs, store layouts, and advertising campaigns

    Finance

    • analysis of creditworthiness of clients

    • segmentation of accounts receivable

    • performance analysis of financial investments like stocks, bonds and mutual funds

    • evaluation of financing options

    • fraud detection


    Applications of Data Mining

    Manufacturing

    • optimisation of resources like machines, manpower and materials

    • optimal design of manufacturing processes, shop-floor layouts and product design, such as for products tailored according to customers' requirements


    End