Chap1 Intro x

8/12/2019 Chap1 Intro x


COMP4331: Introduction to Data Mining

James Kwok

Department of Computer Science and Engineering

Hong Kong University of Science and Technology

Fall 2013

COMP4331: Introduction



Why Data Mining?

We are experiencing an explosive growth of data

5 exabytes (1 exabyte = 10^18 bytes) of data were created by humans until 2003 (unit prefixes: kilo, mega, giga, tera, peta, exa, zetta); today this amount of information is created in two days

in 2012, the digital world of data expanded to 2.72 zettabytes

it is predicted to double every two years

IBM indicates that 2.5 exabytes of data are created every day




Major Sources of Abundant Data

society: news, social networking (Facebook, YouTube)

business: web, e-commerce, transactions, stocks

science: sensors, bioinformatics, scientific simulations

Example

Facebook has 955 million monthly active accounts using 70 languages, 140 billion uploaded photos, and 125 billion friend connections; every day, 30 billion pieces of content and 2.7 billion likes and comments are posted

every minute, 48 hours of video are uploaded to YouTube, and every day 4 billion views are performed

1 billion Tweets are posted every 72 hours by more than 140 million active users on Twitter

571 new websites are created every minute of the day

every day 10 billion text messages are sent

by the year 2020, 50 billion devices will be connected to networks and the internet



Why Data Mining?

We are drowning in data, but starving for knowledge

while data size and complexity rapidly increase, the number of data analysts remains relatively small

Example

Within the next decade, the amount of information will increase 50-fold; however, the number of information technology specialists who keep up with all that data will increase only 1.5-fold

traditional techniques are simply inapplicable at this scale

we need to find efficient ways to analyze the vast quantities of raw data to extract knowledge




Some Motivating Examples

Business: A book store wishes to make recommendations to customers based on other customers' previous purchases

Science: A bioinformatics lab wishes to find DNA similarities among different organisms

Society: Either a company (for marketing purposes) or a lab (for research purposes) wishes to identify the most influential users in a social network




What is Data Mining?...

Is everything Data Mining?

simple search or query processing should not be confused with data mining

What is Not Data Mining?

look up a phone number in a phone directory

query a web search engine for information about Amazon

What is Data Mining?

identify the most prevalent names in certain US locations (e.g., O'Brien, O'Rourke, and O'Reilly in the Boston area)

group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com, etc.)




What is Data Mining?...

Data mining is a confluence of several disciplines

Figure: Data Mining drawing on Visualization, Machine Learning, Algorithms, Database Technology, Statistics, Pattern Recognition, and Other Disciplines




Data Mining Architecture

Figure: Architecture of a typical data mining system

Data sources: Databases, Data Warehouse, World Wide Web, Other Info Repositories

Data Preprocessing: data cleaning, integration, selection, transformation, etc.

Database or Data Warehouse Server

Data Mining Engine

Pattern Evaluation

User Interface

Knowledge Base



On What Kind of Data?

Various data repositories

relational data

data warehouses

transactional data

graph data

sequence data

time series

spatial data

text & multimedia data



Relational Data

a relational database consists of a set of tables, each of which consists of a set of attributes (or columns or fields) and contains a large set of records (or tuples or rows)

Example

Employee

EID    Name       Address                      Position
0023   A. Smith   122 Lake Ave., Chicago, IL   Manager
...    ...        ...                          ...

Branch

BID    Name          Address
005    City Square   356 Michigan Ave., Chicago, IL
...    ...           ...

Works_At

EID    BID    Salary
0023   005    200,000$
...    ...    ...
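As a rough illustration of how the tables in this example relate to one another, the Python sketch below follows the shared keys (EID, BID) from an employee to a branch. Storing each table as a list of dicts and the function name `branches_of` are assumptions made for this sketch; only the sample rows come from the example.

```python
# Sample rows copied from the Employee, Branch and Works_At example.
# Representing tables as lists of dicts is an assumption of this sketch.
employee = [
    {"EID": "0023", "Name": "A. Smith",
     "Address": "122 Lake Ave., Chicago, IL", "Position": "Manager"},
]
branch = [
    {"BID": "005", "Name": "City Square",
     "Address": "356 Michigan Ave., Chicago, IL"},
]
works_at = [
    {"EID": "0023", "BID": "005"},
]


def branches_of(employee_name):
    """Return the names of the branches where the named employee works."""
    # Step 1: look up the employee's key(s) in Employee.
    eids = {e["EID"] for e in employee if e["Name"] == employee_name}
    # Step 2: follow Works_At from employee keys to branch keys.
    bids = {w["BID"] for w in works_at if w["EID"] in eids}
    # Step 3: resolve the branch keys to branch names in Branch.
    return sorted(b["Name"] for b in branch if b["BID"] in bids)


print(branches_of("A. Smith"))
```

A relational database would express the same lookup as a join over the three tables; the sketch only makes the key-following explicit.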



Data Warehouse

a data repository of information collected from different sources and stored under a unified schema; it usually resides at a single site

the stored data provide information from a historical perspective and are usually summarized

the physical structure is typically a multidimensional data cube

Example

Figure: Data cube with dimensions Time, Product type, and Branch; each cell holds the number of items of a certain product type sold during a specific time interval, at a specific branch
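The data cube can be sketched in a few lines of Python, assuming each cell is keyed by a (time, product type, branch) triple. The sales records, the dimension values, and the helper name `roll_up` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (time interval, product type, branch, items sold).
# The values are made up for illustration only.
sales = [
    ("Q1", "TV", "Chicago", 10),
    ("Q1", "TV", "Chicago", 5),
    ("Q1", "Phone", "Chicago", 7),
    ("Q2", "TV", "Boston", 3),
    ("Q2", "Phone", "Chicago", 4),
]

# Build the cube: one cell per (time, product type, branch) combination.
cube = defaultdict(int)
for time, product, branch, count in sales:
    cube[(time, product, branch)] += count


def roll_up(cube, keep):
    """Summarize the cube over the dimensions not listed in `keep`.

    `keep` holds dimension positions: 0 = time, 1 = product type, 2 = branch.
    """
    summary = defaultdict(int)
    for cell, count in cube.items():
        summary[tuple(cell[i] for i in keep)] += count
    return dict(summary)


print(cube[("Q1", "TV", "Chicago")])   # items in one cell
print(roll_up(cube, keep=[1]))          # totals per product type
```

Summing away dimensions in this manner is one way to obtain the summarized, historical view that a warehouse provides.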



Transactional Data

a special type of relational data, where every record is a transaction and involves a set of items

Example (a set of transactions)

TID   Items
1     Bread, Butter, Milk, Cereal
2     Beer, Coke
3     Bread, Diaper, Milk, Cereal
4     Beer, Diaper
5     Coke, Biscuits, Milk




Sequence Data

ordered sequences of events with or without a concrete notion of time

Example (social network)

Genomic Sequence Data



Time Series Data

a special type of sequence data, where the values or events are obtained over repeated measurements of time (e.g., hourly, daily, weekly)

Example (Apple vs. Google Stock)



Spatial Data

contain geographical attributes (such as spatial coordinates or areas)

Example (Road Network of North America (NA))



Text & Multimedia Data

text databases contain word descriptions for objects

multimedia databases store image, audio, and video data

Example

Document Term Vectors

        'Team'   'Coach'   'Timeout'
Doc#1   3        0         0
Doc#2   0        7         0

Image
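A document term vector simply counts how often each vocabulary term occurs in a document. The sketch below shows one way to compute such vectors in Python; the vocabulary is taken from the example, while the two document texts and the function name `term_vector` are made up for illustration.

```python
from collections import Counter

# Vocabulary taken from the example; the two document texts are made up.
vocabulary = ["team", "coach", "timeout"]
documents = {
    "Doc#1": "The team won. Team spirit carried the team.",
    "Doc#2": "The coach praised the other coach.",
}


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text`."""
    # Lowercase and strip basic punctuation before counting words.
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


vectors = {name: term_vector(text, vocabulary)
           for name, text in documents.items()}
print(vectors)
```

Once documents are vectors of counts, they can be compared or grouped like any other numerical records.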



Data Mining Functionalities

Major data mining tasks

Classification and regression

Cluster analysis

Association analysis


Classification and Regression




Classification

we have a set of records called the training set

each record contains various attributes, among which there is a categorical (i.e., discrete) attribute referred to as the class

Example

Regression

classification predicts categorical attribute values; regression predicts numerical attribute values
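To make the regression case concrete, here is a minimal sketch that predicts a numerical value from a single numerical attribute. It assumes an ordinary least-squares line, which is one common choice and not a method named on the slides; the attribute (floor area), the target (price), and all values are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: floor area (attribute) and price (numerical target).
areas = [50, 60, 80, 100]
prices = [150, 180, 240, 300]

slope, intercept = fit_line(areas, prices)

# Predict the numerical target for a new, previously unseen record.
predicted = slope * 70 + intercept
print(predicted)
```

The output is a number on a continuous scale, whereas a classifier would return one of a fixed set of class values.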





Classification and Regression...

How to predict the value of a new (i.e., previously unseen) record?

what about an old record?

we explore the training set and devise a function called a model

the model takes as input a set of attribute values, and returns a value for the class attribute

we then predict the class of the new record based on the model






Example (Direct Marketing)

suppose that we run a consumer electronics store

we wish to reduce the cost of mailing by targeting only the set of consumers that are likely to buy a new product

we collect a dataset of consumers that bought a similar product introduced before, as well as various demographic, lifestyle, and other information about them (training set)

this {buy, don't buy} decision forms the class attribute

we use the above information to devise a classifier model

when reviewing the potential mail recipients, we predict if they are likely to buy the new product based on the model
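The train-then-predict workflow of this example can be sketched as follows. The sketch assumes a nearest-neighbour classifier, chosen only because it is short; the slides do not prescribe a particular model. The two attributes (age, yearly income in thousands) and all records are hypothetical.

```python
import math

# Hypothetical training set: (age, yearly income in thousands) -> class.
# The attributes and values are made up for illustration.
training_set = [
    ((25, 30), "don't buy"),
    ((30, 35), "don't buy"),
    ((45, 90), "buy"),
    ((50, 110), "buy"),
]


def train(records):
    """'Training' a nearest-neighbour model just stores the labelled records."""
    return list(records)


def predict(model, attributes):
    """Return the class of the stored record closest to `attributes`."""
    nearest = min(model, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]


model = train(training_set)
print(predict(model, (48, 100)))   # a new, previously unseen consumer
print(predict(model, (25, 30)))    # an old record from the training set
```

With this particular model an old record is always assigned its own stored class, which is why performance on the training set alone says little about how well new records will be predicted.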



Clustering

Given a set of objects, each having a set of attributes, and a similarity measure among them, find clusters (i.e., groups) such that

objects in one cluster are more similar to one another

objects in separate clusters are less similar to one another

unlike classification, clustering analyzes objects without consulting a known class label
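One possible way to find such clusters is a k-means-style procedure, sketched below. The slides do not name an algorithm, so this is an assumption; Euclidean distance stands in for the similarity measure (a smaller distance means more similar), the first k points are used as initial centres, and the 2-D points are hypothetical.

```python
import math


def k_means(points, k, iterations=10):
    """Group `points` into `k` clusters using Euclidean distance."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return clusters, centroids


# Hypothetical 2-D objects forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = k_means(points, k=2)
print(clusters)
```

No class label is consulted anywhere in the procedure: the groups emerge from the distances alone.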


Clustering...



Example (image segmentation)


Clustering...



Example


Clustering...



Example (content-based image retrieval)


Outlier Detection



an outlier is an object that is far away from any cluster

clustering can be used to detect outliers

Example (Fraud Detection)

collect old transactions of a credit card holder

cluster the transactions based on the location and/or the amount of money spent

detect whether an incoming transaction is considerably dissimilar to all clusters
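The last step of the fraud-detection example can be sketched as a distance check against cluster centres. The centres, the two attributes (distance from home in km, amount spent), and the threshold are hypothetical; how "considerably dissimilar" is defined is an assumption of this sketch.

```python
import math

# Hypothetical cluster centres of a card holder's past transactions,
# described by (distance from home in km, amount spent).
centroids = [(2.0, 15.0), (10.0, 80.0)]


def is_outlier(transaction, centroids, threshold):
    """Flag a transaction farther than `threshold` from every cluster centre."""
    nearest = min(math.dist(transaction, c) for c in centroids)
    return nearest > threshold


print(is_outlier((3.0, 20.0), centroids, threshold=30.0))      # near a cluster
print(is_outlier((900.0, 2500.0), centroids, threshold=30.0))  # far from all
```

In practice the attributes would likely need to be rescaled first, since kilometres and currency amounts are not directly comparable.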


Association Analysis



Example

TID   Items
1     Bread, Butter, Milk, Cereal
2     Beer, Coke
3     Bread, Diaper, Milk, Cereal
4     Beer, Diaper
5     Coke, Biscuits, Milk

A supermarket wishes to find the products that most frequently co-occur in the customer transactions, in order to strategize effective promotions

Goal

Given a transactional database, find the sets of objects that frequently appear within the same transactions

also called frequent pattern mining
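For the five transactions in this example, the goal can be sketched by brute-force counting of item combinations. This is a naive illustration, not an efficient frequent pattern mining algorithm; the function name and the `min_count` parameter are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the example.
transactions = [
    {"Bread", "Butter", "Milk", "Cereal"},
    {"Beer", "Coke"},
    {"Bread", "Diaper", "Milk", "Cereal"},
    {"Beer", "Diaper"},
    {"Coke", "Biscuits", "Milk"},
]


def frequent_itemsets(transactions, size, min_count):
    """Return itemsets of `size` items occurring in at least `min_count` transactions."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every itemset one canonical tuple form.
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


print(frequent_itemsets(transactions, size=2, min_count=2))
print(frequent_itemsets(transactions, size=3, min_count=2))
```

On this data, Bread, Cereal, and Milk are the only items that appear together in more than one transaction. Enumerating every combination becomes infeasible for large item sets, which is what dedicated algorithms address.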


Emerging Data Mining Functionalities



Social network analysis

mainly motivated by the rapid proliferation of social networking (Facebook, YouTube, etc.)

Example (possible tasks)

discover social communities using similarity metrics

model the strength of a social link based on the interaction between the users

identify the most influential users in the social network (for viral marketing purposes)
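As a very rough sketch of the last task, the number of direct connections of a user can serve as a crude proxy for influence. Treating connection count as influence is an assumption made here for illustration, not a definition from the slides; the user names and links are hypothetical.

```python
from collections import defaultdict

# Hypothetical friendship links (undirected); the user names are made up.
links = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("bob", "cat"), ("dan", "eve"),
]


def degree(links):
    """Count the number of direct connections of every user."""
    counts = defaultdict(int)
    for u, v in links:
        counts[u] += 1
        counts[v] += 1
    return dict(counts)


def most_connected(links):
    """Return the user with the most connections (ties broken alphabetically)."""
    counts = degree(links)
    return min(counts, key=lambda user: (-counts[user], user))


print(most_connected(links))
```

More refined notions of influence would also take into account link strength and how information spreads through the network.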


Summary



We explained what data mining is and why it is important

We described the architecture of a typical data mining system

We outlined various repositories used for data mining

We overviewed the major data mining tasks


Summary


Figure: the architecture of a typical data mining system (data sources, data preprocessing, database or data warehouse server, data mining engine, pattern evaluation, user interface, knowledge base), with parts marked as covered in this lecture, in this lecture (only task overview), or in the next lecture

