Chap1 Intro x

  • 8/12/2019 Chap1 Intro x

    COMP4331: Introduction to Data Mining

    James Kwok

    Department of Computer Science and Engineering

    Hong Kong University of Science and Technology

    Fall 2013

    COMP4331: Introduction

    Why Data Mining?

    We are experiencing an explosive growth of data

    5 exabytes (1 exabyte = 10^18 bytes) of data were created by humans up to 2003 (the prefixes run kilo, mega, giga, tera, peta, exa, zetta); today this amount of information is created in two days

    in 2012, the digital universe of data had expanded to 2.72 zettabytes

    it is predicted to double every two years

    IBM estimates that 2.5 exabytes of data are created every day

    Major Sources of Abundant Data

    society: news, social networking (Facebook, YouTube)

    business: web, e-commerce, transactions, stocks

    science: sensors, bioinformatics, scientific simulations

    Example

    Facebook has 955 million monthly active accounts using 70 languages, 140 billion photos uploaded, and 125 billion friend connections; every day, 30 billion pieces of content and 2.7 billion likes and comments are posted

    every minute, 48 hours of video are uploaded to YouTube, and every day 4 billion views are performed

    1 billion Tweets are sent every 72 hours by more than 140 million active users on Twitter

    571 new websites are created every minute of the day

    every day 10 billion text messages are sent

    by the year 2020, 50 billion devices will be connected to networks and the internet

    Why Data Mining?

    We are drowning in data, but starving for knowledge

    while data size and complexity rapidly increase, the number of data analysts remains relatively small

    Example

    Within the next decade, the amount of information will increase 50-fold; however, the number of information technology specialists who keep up with all that data will increase only 1.5-fold

    traditional techniques are simply inapplicable

    we need to find efficient ways to analyze the vast quantities of raw data to extract knowledge

    Some Motivating Examples

    Business: A book store wishes to make recommendations to customers based on other customers' previous purchases

    Science: A bioinformatics lab wishes to find DNA similarities among different organisms

    Society: Either a company (for marketing purposes) or a lab (for research purposes) wishes to identify the most influential users in a social network

    What is Data Mining?...

    Is everything Data Mining?

    simple search or query processing should not be confused with data mining

    What is Not Data Mining?

    look up a phone number in a phone directory

    query a web search engine for information about Amazon

    What is Data Mining?

    identify the most prevalent names in certain US locations (e.g., O'Brien, O'Rourke, and O'Reilly in the Boston area)

    group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com, etc.)

    What is Data Mining?...

    Data mining is a confluence of several disciplines

    Visualization

    Data Mining

    Machine Learning

    Algorithms

    Database Technology

    Statistics

    Pattern Recognition

    Other Disciplines

    Data Mining Architecture

    Figure: Architecture of a typical data mining system

    Data sources: Databases, Data Warehouse, World Wide Web, Other Info Repositories

    Data Preprocessing (data cleaning, integration, selection, transformation, etc.)

    Database or Data Warehouse Server

    Data Mining Engine

    Pattern Evaluation

    User Interface

    Knowledge Base

    On What Kind of Data?

    Various data repositories

    relational data

    data warehouses

    transactional data

    graph data

    sequence data

    time series

    spatial data

    text & multimedia data

    Relational Data

    a relational database consists of a set of tables, each of which consists of a set of attributes (or columns or fields) and contains a large set of records (or tuples or rows)

    Example

    Employee

    EID    Name      Address                      Position  Salary
    0023   A. Smith  122 Lake Ave., Chicago, IL   Manager   $200,000
    ...    ...       ...                          ...       ...

    Branch

    BID   Name         Address
    005   City Square  356 Michigan Ave., Chicago, IL
    ...   ...          ...

    Works_At

    EID    BID
    0023   005
    ...    ...
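    The three example tables can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the column types and the join query are assumptions made for illustration, and only a subset of the columns shown on the slide is used.

```python
import sqlite3

# In-memory database mirroring the Employee, Branch and Works_At example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (EID TEXT PRIMARY KEY, Name TEXT, Address TEXT, Position TEXT);
    CREATE TABLE Branch   (BID TEXT PRIMARY KEY, Name TEXT, Address TEXT);
    CREATE TABLE Works_At (EID TEXT REFERENCES Employee(EID),
                           BID TEXT REFERENCES Branch(BID));
    INSERT INTO Employee VALUES ('0023', 'A. Smith', '122 Lake Ave., Chicago, IL', 'Manager');
    INSERT INTO Branch   VALUES ('005', 'City Square', '356 Michigan Ave., Chicago, IL');
    INSERT INTO Works_At VALUES ('0023', '005');
""")

# Join the tables through Works_At to find where each employee works.
rows = conn.execute("""
    SELECT e.Name, e.Position, b.Name
    FROM Employee AS e
    JOIN Works_At AS w ON w.EID = e.EID
    JOIN Branch   AS b ON b.BID = w.BID
""").fetchall()
print(rows)  # [('A. Smith', 'Manager', 'City Square')]
```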

    Data Warehouse

    a data repository of information collected from different sources and stored under a unified schema; it usually resides at a single site

    the stored data provide information from a historical perspective and they are usually summarized

    the physical structure is typically a multidimensional data cube

    Example

    Figure: Data cube

    Dimensions: Time, Product type, Branch

    Each cell: number of items of a certain product type sold during a specific time interval, at a specific branch
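    A data cube of this kind can be imitated with a dictionary keyed by (time, product type, branch). The sketch below is an illustration only: the sales records, the "Lake View" branch and the roll_up helper are made up and are not part of the lecture.

```python
from collections import defaultdict

# Hypothetical sales facts: (time interval, product type, branch, items sold).
sales = [
    ("Q1", "TV", "City Square", 10),
    ("Q1", "TV", "City Square", 5),
    ("Q1", "Phone", "City Square", 7),
    ("Q2", "TV", "Lake View", 3),
]

# Each cube cell is keyed by (time, product type, branch).
cube = defaultdict(int)
for time, product, branch, count in sales:
    cube[(time, product, branch)] += count

def roll_up(cube, keep):
    """Summarize the cube, keeping only the dimensions whose indices are in `keep`."""
    result = defaultdict(int)
    for key, count in cube.items():
        result[tuple(key[i] for i in keep)] += count
    return dict(result)

print(cube[("Q1", "TV", "City Square")])  # 15
print(roll_up(cube, keep=(1,)))           # {('TV',): 18, ('Phone',): 7}
```

    Rolling up over a subset of the dimensions is what makes the stored data "summarized": the same cells answer questions per product type, per branch, or per time interval.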

    Transactional Data

    a special type of relational data, where every record is a transaction and involves a set of items

    Example (a set of transactions)

    TID   Items
    1     Bread, Butter, Milk, Cereal
    2     Beer, Coke
    3     Bread, Diaper, Milk, Cereal
    4     Beer, Diaper
    5     Coke, Biscuits, Milk

    Sequence Data

    ordered sequences of events with or without a concrete notion of time

    Example (social network)

    Genomic Sequence Data

    Time Series Data

    a special type of sequence data, where the values or events are obtained through repeated measurements over time (e.g., hourly, daily, weekly)

    Example (Apple vs. Google Stock)

    Spatial Data

    contain geographical attributes (such as spatial coordinates or areas)

    Example (Road Network of North America (NA))

    Text & Multimedia Data

    text databases contain word descriptions for objects

    multimedia databases store image, audio, and video data

    Example

    Document Term Vectors

             'Team'   'Coach'   'Timeout'
    Doc#1    3        0         0
    Doc#2    0        7         0

    Image
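    A document term vector simply counts how often each vocabulary term occurs in a document. The sketch below assumes whitespace tokenization and a made-up sentence; term_vector is a hypothetical helper, not something defined in the lecture.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text` (case-insensitive)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "timeout"]
doc = "the coach told the team that the team needs a new coach"  # made-up text
print(term_vector(doc, vocabulary))  # [2, 2, 0]
```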

    Data Mining Functionalities

    Major data mining tasks

    Classification and regression

    Cluster analysis

    Association analysis

    Classification and Regression

    Classification

    we have a set of records called the training set

    each record contains various attributes, among which there is a categorical (i.e., discrete) attribute referred to as the class

    Example

    Regression

    classification predicts categorical attribute values; regression predicts numerical attribute values
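    To make the contrast concrete, here is a minimal regression sketch: it fits a straight line to numerical training data by ordinary least squares and predicts a numerical value for an unseen input. The data points are made up, and a straight line is only one of many possible regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up training data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

def predict(x):
    """Predict a numerical value for a new input."""
    return slope * x + intercept

print(predict(5))  # 11.0
```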

    Classification and Regression...

    How to predict the value of a new (i.e., previously unseen) record?

    what about an old record?

    we explore the training set and devise a function called a model

    the model takes as input a set of attribute values, and returns a value for the class attribute

    we then predict the class of the new record based on the model
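    The train-then-predict workflow can be sketched with a very simple model, a nearest-centroid classifier. This particular classifier, the attribute choice (age, income) and the records are assumptions made for illustration; the lecture does not prescribe a specific model here.

```python
from collections import defaultdict

def train(training_set):
    """Build a model: the mean attribute vector (centroid) of each class."""
    sums, counts = {}, defaultdict(int)
    for attributes, label in training_set:
        if label not in sums:
            sums[label] = [0.0] * len(attributes)
        sums[label] = [s + a for s, a in zip(sums[label], attributes)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(model, attributes):
    """Return the class whose centroid is closest to the record."""
    def sq_dist(centroid):
        return sum((a - c) ** 2 for a, c in zip(attributes, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# Made-up records: (age, income in thousands) -> class label.
training_set = [((25, 30), "don't buy"), ((30, 35), "don't buy"),
                ((45, 90), "buy"), ((50, 100), "buy")]
model = train(training_set)
print(predict(model, (48, 95)))  # buy
```

    Note that the model is built once from the training set and can then be applied to any number of new records.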

    Classification and Regression

    Example (Direct Marketing)

    suppose that we run a consumer electronics store

    we wish to reduce the cost of mailing by targeting only the set of consumers that are likely to buy a new product

    we collect a dataset of consumers that bought a similar product introduced before, as well as various demographic, lifestyle, and other information about them (training set)

    this {buy, don't buy} decision forms the class attribute

    we use the above information to devise a classifier model

    when reviewing the potential mail recipients, we predict whether they are likely to buy the new product based on the model

    Clustering

    Given a set of objects, each having a set of attributes, and a similarity measure among them, find clusters (i.e., groups) such that

    objects in one cluster are more similar to one another

    objects in separate clusters are less similar to one another

    unlike classification, clustering analyzes objects without consulting a known class label
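    One common way to form such groups is k-means, sketched below. It is named here only as an example technique: the slide does not commit to a particular algorithm, and the points, the initial centroids and the use of squared Euclidean distance as the (dis)similarity measure are all assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up 2-D objects
clusters, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)  # [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
```

    No class labels appear anywhere in the input: the groups emerge from the distances alone.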

    Clustering...

    Example (image segmentation)

    Clustering...

    Example

    Clustering...

    Example (content-based image retrieval)

    Outlier Detection

    an outlier is an object that is far away from any cluster

    clustering can be used to detect outliers

    Example (Fraud Detection)

    collect old transactions of a credit card holder

    cluster the transactions based on the location and/or the amount of money spent

    detect whether an incoming transaction is considerably dissimilar to all clusters
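    The last step of the fraud detection example can be sketched as a distance test against the cluster centroids. The centroids, the attribute encoding and the threshold below are hypothetical; in practice the attributes would normally be scaled so that no single one dominates the distance.

```python
import math

def is_outlier(transaction, centroids, threshold):
    """Flag a transaction whose distance to every cluster centroid exceeds `threshold`."""
    return all(math.dist(transaction, c) > threshold for c in centroids)

# Hypothetical centroids of past transactions: (distance from home in km, amount spent).
centroids = [(2.0, 40.0), (15.0, 120.0)]

print(is_outlier((3.0, 45.0), centroids, threshold=50.0))       # False
print(is_outlier((9000.0, 2500.0), centroids, threshold=50.0))  # True
```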

    Association Analysis

    Example

    TID   Items
    1     Bread, Butter, Milk, Cereal
    2     Beer, Coke
    3     Bread, Diaper, Milk, Cereal
    4     Beer, Diaper
    5     Coke, Biscuits, Milk

    A supermarket wishes to find the products that most frequently co-occur in the customer transactions, in order to strategize effective promotions

    Goal

    Given a transactional database, find the sets of objects that frequently appear within the same transactions

    also called frequent pattern mining
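    For the five example transactions, the goal can be illustrated by brute-force counting of item combinations. This is a naive sketch rather than an efficient mining algorithm; the min_support threshold and the frequent_itemsets helper are assumptions introduced for illustration.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Butter", "Milk", "Cereal"},
    {"Beer", "Coke"},
    {"Bread", "Diaper", "Milk", "Cereal"},
    {"Beer", "Diaper"},
    {"Coke", "Biscuits", "Milk"},
]

def frequent_itemsets(transactions, size, min_support):
    """Count every itemset of the given size; keep those in >= min_support transactions."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

print(frequent_itemsets(transactions, size=2, min_support=2))
# {('Bread', 'Cereal'): 2, ('Bread', 'Milk'): 2, ('Cereal', 'Milk'): 2}
```

    Enumerating all combinations grows quickly with the number of items, which is why dedicated frequent pattern mining algorithms exist.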

    Emerging Data Mining Functionalities

    Social network analysis

    mainly motivated by the rapid proliferation of social networking (Facebook, YouTube, etc.)

    Example (possible tasks)

    discover social communities using similarity metrics

    model the strength of a social link based on the interaction between the users

    identify the most influential users in the social network (for viral marketing purposes)
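    Two of these tasks can be sketched on a toy interaction log. The user names and interactions are made up, link strength is taken to be a plain interaction count, and the number of distinct contacts is used as a crude stand-in for influence; real influence measures are more involved.

```python
from collections import Counter

# Hypothetical interaction log: (user, other user) pairs, one per interaction.
interactions = [("ann", "bob"), ("ann", "bob"), ("ann", "cat"),
                ("bob", "cat"), ("ann", "dan")]

# Link strength: number of interactions between a pair, ignoring direction.
strength = Counter(frozenset(pair) for pair in interactions)

# Degree: number of distinct users each user is linked to.
neighbors = {}
for u, v in interactions:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)
most_connected = max(neighbors, key=lambda user: len(neighbors[user]))

print(strength[frozenset(("ann", "bob"))])  # 2
print(most_connected)                        # ann
```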

    Summary

    We explained what data mining is and why it is important

    We described the architecture of a typical data mining system

    We outlined various repositories used for data mining

    We gave an overview of the major data mining tasks

    Figure: the architecture of a typical data mining system (data sources, Data Preprocessing, Database or Data Warehouse Server, Data Mining Engine, Pattern Evaluation, User Interface, Knowledge Base), with parts marked "In this lecture", "In this lecture (only task overview)", and "In the next lecture"