Chap1 Intro x
8/12/2019 Chap1 Intro x
COMP4331: Introduction to Data Mining
James Kwok
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Fall 2013
COMP4331: Introduction
-
Why Data Mining?
We are experiencing an explosive growth of data
5 exabytes (1 exabyte = 10^18 bytes) of data were created by humans until 2003 (kilo, mega, giga, tera, peta, exa, zetta); today this amount of information is created in two days
in 2012, the digital world of data expanded to 2.72 zettabytes
it is predicted to double every two years
IBM indicates that every day 2.5 exabytes of data is created
-
Major Sources of Abundant Data
society: news, social networking (Facebook, YouTube)
business: web, e-commerce, transactions, stocks
science: sensors, bioinformatics, scientific simulations
Example
Facebook has 955 million monthly active accounts using 70 languages, 140 billion photos uploaded, and 125 billion friend connections; every day, 30 billion pieces of content and 2.7 billion likes and comments are posted
every minute, 48 hours of video are uploaded to YouTube, and every day 4 billion views are performed
1 billion Tweets are posted every 72 hours by more than 140 million active users on Twitter
571 new websites are created every minute of the day
every day 10 billion text messages are sent
by the year 2020, 50 billion devices will be connected to networks and the internet
-
Why Data Mining?
We are drowning in data, but starving for knowledge
while data size and complexity rapidly increase, the number of data analysts remains relatively small
Example
Within the next decade, the amount of information will increase by a factor of 50; however, the number of information technology specialists who keep up with all that data will increase by a factor of 1.5
traditional techniques are simply inapplicable
we need to find efficient ways to analyze the vast quantities of raw data to extract knowledge
-
Some Motivating Examples
Business: A book store wishes to make recommendations to customers based on other customers' previous purchases
Science: A bioinformatics lab wishes to find DNA similarities among different organisms
Society: Either a company (for marketing purposes) or a lab (for research purposes) wishes to identify the most influential users in a social network
-
What is Data Mining?...
Is everything Data Mining?
simple search or query processing should not be confused with data mining
What is Not Data Mining?
look up a phone number in a phone directory
query a web search engine for information about Amazon
What is Data Mining?
identify the most prevalent names in certain US locations (e.g., O'Brien, O'Rourke, and O'Reilly in the Boston area)
group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com, etc.)
-
What is Data Mining?...
Data mining is a confluence of several disciplines
Data Mining draws on: Visualization, Machine Learning, Algorithms, Database Technology, Statistics, Pattern Recognition, Other Disciplines
-
Data Mining Architecture
Data sources: Databases, Data Warehouse, World Wide Web, Other Info Repositories
Data Preprocessing: data cleaning, integration, selection, transformation, etc.
Database or Data Warehouse Server
Data Mining Engine
Pattern Evaluation
User Interface
Knowledge Base
Architecture of a typical data mining system
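The flow through the architecture can be pictured as a chain of stages. The following is only a toy sketch, assuming each stage is a plain Python function; the function names, the sample records, and the `min_count` parameter are hypothetical and not part of the lecture.

```python
def preprocess(records):
    """Data preprocessing: drop missing values and normalize the rest."""
    return [r.strip().lower() for r in records if r and r.strip()]

def mine(records):
    """Data mining engine (toy): count how often each value occurs."""
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {p: n for p, n in patterns.items() if n >= min_count}

# Hypothetical raw data with missing and inconsistently formatted entries.
raw = ["Milk", "milk ", None, "Bread", "", "MILK"]
patterns = evaluate(mine(preprocess(raw)), min_count=2)
print(patterns)
```

A real system would put a database or data warehouse server between preprocessing and mining, and a user interface on top of pattern evaluation.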
-
On What Kind of Data?
Various data repositories
relational data
data warehouses
transactional data
graph data
sequence data
time series
spatial data
text & multimedia data
-
Relational Data
a relational database consists of a set of tables, each of which consists of a set of attributes (or columns or fields) and contains a large set of records (or tuples or rows)
Example
Employee
EID  | Name     | Address                    | Position
0023 | A. Smith | 122 Lake Ave., Chicago, IL | Manager
...  | ...      | ...                        | ...

Branch
BID | Name        | Address
005 | City Square | 356 Michigan Ave., Chicago, IL
... | ...         | ...

Works_At
EID  | BID | Salary
0023 | 005 | $200,000
...  | ... | ...
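The Employee, Branch, and Works_At tables are linked through the shared EID and BID keys. A minimal sketch of such a lookup, assuming Python's built-in `sqlite3` module and an in-memory database; the column types are assumptions, only the single row visible in the example is loaded, and the Salary column is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Column types are assumed; the example only shows column names.
cur.execute("CREATE TABLE Employee (EID TEXT PRIMARY KEY, Name TEXT, Address TEXT, Position TEXT)")
cur.execute("CREATE TABLE Branch (BID TEXT PRIMARY KEY, Name TEXT, Address TEXT)")
cur.execute("CREATE TABLE Works_At (EID TEXT, BID TEXT)")

cur.execute("INSERT INTO Employee VALUES ('0023', 'A. Smith', '122 Lake Ave., Chicago, IL', 'Manager')")
cur.execute("INSERT INTO Branch VALUES ('005', 'City Square', '356 Michigan Ave., Chicago, IL')")
cur.execute("INSERT INTO Works_At VALUES ('0023', '005')")

# Join the three tables to find who works at which branch.
rows = cur.execute(
    "SELECT e.Name, e.Position, b.Name "
    "FROM Employee e "
    "JOIN Works_At w ON e.EID = w.EID "
    "JOIN Branch b ON b.BID = w.BID"
).fetchall()
print(rows)
```

A query like this is plain query processing; data mining would start from such tables and look for patterns that are not stored explicitly.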
-
Data Warehouse
a data repository of information collected from different sources, stored under a unified schema, and usually residing at a single site
the stored data provide information from a historical perspective and are usually summarized
the physical structure is typically a multidimensional data cube
Example
Data cube with dimensions Time, Product type, and Branch
each cell: number of items of a certain product type sold during a specific time interval, at a specific branch
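A data cube of this kind can be pictured as a mapping from (time, product type, branch) to a count. The sketch below assumes a plain Python dictionary as the cube; the sales records and the `roll_up` helper are hypothetical and only illustrate how summarized views are derived.

```python
from collections import defaultdict

# Hypothetical sales records: (time interval, product type, branch, items sold).
sales = [
    ("2013-Q1", "TV", "Chicago", 120),
    ("2013-Q1", "TV", "Boston", 80),
    ("2013-Q1", "Phone", "Chicago", 200),
    ("2013-Q2", "TV", "Chicago", 150),
    ("2013-Q2", "Phone", "Chicago", 50),
]

# Each cell of the cube is addressed by its three dimension values.
cube = defaultdict(int)
for time, product, branch, count in sales:
    cube[(time, product, branch)] += count

def roll_up(cube, keep):
    """Sum out every dimension whose index is not listed in `keep`."""
    result = defaultdict(int)
    for key, count in cube.items():
        result[tuple(key[i] for i in keep)] += count
    return dict(result)

by_product = roll_up(cube, keep=(1,))  # keep only the product-type dimension
print(by_product)
```

Rolling up along Time and Branch gives the total per product type; keeping other index combinations gives other summaries of the same cube.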
-
Transactional Data
a special type of relational data, where every record is a transaction and involves a set of items
Example (a set of transactions)
TID | Items
1   | Bread, Butter, Milk, Cereal
2   | Beer, Coke
3   | Bread, Diaper, Milk, Cereal
4   | Beer, Diaper
5   | Coke, Biscuits, Milk
-
Sequence Data
ordered sequences of events with or without a concrete notion of time
Example (social network)
Genomic Sequence Data
-
Time Series Data
a special type of sequence data, where the values or events are obtained over repeated measurements of time (e.g., hourly, daily, weekly)
Example (Apple vs. Google Stock)
-
Spatial Data
contain geographical attributes (such as spatial coordinates or areas)
Example (Road Network of North America (NA))
-
Text & Multimedia Data
text databases contain word descriptions for objects
multimedia databases store image, audio, and video data
Example
Document Term Vectors: counts of the terms 'Team', 'Coach', and 'Timeout' in Doc#1 and Doc#2
Image
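A document term vector records how often each vocabulary term occurs in a document. A minimal sketch, assuming whitespace tokenization and the three terms from the example as the vocabulary; the two document texts are invented for illustration and are not the documents behind the original table.

```python
from collections import Counter

vocabulary = ["team", "coach", "timeout"]

# Hypothetical documents, invented for illustration only.
documents = {
    "Doc#1": "team wins as team rallies behind team",
    "Doc#2": "coach calls timeout",
}

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vectors = {name: term_vector(text, vocabulary) for name, text in documents.items()}
print(vectors)
```

Once documents are vectors, they can be compared, clustered, or classified like any other numerical records.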
-
Data Mining Functionalities
Major data mining tasks
Classification and regression
Cluster analysis
Association analysis
-
Classification and Regression
Classification
we have a set of records called the training set
each record contains various attributes, among which there is a categorical (i.e., discrete) attribute referred to as the class
Example
Regression
classification predicts categorical attribute values; regression predicts numerical attribute values
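To make the regression side concrete, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a numerical value. The choice of a linear model, the `fit_line` helper, and the data points are assumptions for illustration; the lecture does not prescribe a specific regression method.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: one numerical attribute x and a numerical target y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the target for an unseen x = 5
print(slope, intercept, prediction)
```

A classifier would instead return one of a fixed set of class labels for the same kind of input.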
-
Classification and Regression...
How to predict the value of a new (i.e., previously unseen) record?
what about an old record?
we explore the training set and devise a function called the model
the model takes as input a set of attribute values, and returns a value for the class attribute
we then predict the class of the new record based on the model
-
Classification and Regression
Example (Direct Marketing)
suppose that we run a consumer electronics store
we wish to reduce the cost of mailing by targeting only the set of consumers that are likely to buy a new product
we collect a dataset of consumers that bought a similar product introduced before, as well as various demographic, lifestyle, and other information about them (training set)
this {buy, don't buy} decision forms the class attribute
we use the above information to devise a classifier model
when reviewing the potential mail recipients, we predict if they are likely to buy the new product based on the model
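The steps above (collect a training set, devise a model, predict for new records) can be sketched with a very simple classifier. This example assumes a 1-nearest-neighbor rule, which is only one of many possible models; the two attributes (age in decades, income in tens of thousands) and all records are hypothetical.

```python
import math

def train_nearest_neighbor(training_set):
    """Return a model: a function from attribute values to a class label."""
    def model(attributes):
        # Predict the class of the most similar training record.
        _, nearest_class = min(
            training_set,
            key=lambda record: math.dist(record[0], attributes),
        )
        return nearest_class
    return model

# Hypothetical training set: ((age in decades, income in 10k), class).
training_set = [
    ((2.5, 3.0), "don't buy"),
    ((3.0, 3.5), "don't buy"),
    ((4.5, 8.0), "buy"),
    ((5.0, 9.0), "buy"),
]

model = train_nearest_neighbor(training_set)
prediction = model((4.8, 8.5))  # a potential mail recipient not in the training set
print(prediction)
```

Only recipients predicted as "buy" would receive the mailing, which is how the model reduces mailing cost.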
-
Clustering
Given a set of objects, each having a set of attributes, and a similarity measure among them, find clusters (i.e., groups) such that
objects in one cluster are more similar to one another
objects in separate clusters are less similar to one another
unlike classification, clustering analyzes objects without consulting a known class label
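One common way to find such clusters is k-means, shown here as a minimal sketch. It assumes Euclidean distance as the (dis)similarity measure, a fixed number of iterations, and caller-supplied initial centroids; the points are hypothetical, and the lecture does not single out k-means at this stage.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Each new centroid is the coordinate-wise mean of its cluster (kept as-is if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D objects forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

No class labels are used anywhere: the grouping comes only from the distances between objects.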
-
Clustering...
Example (image segmentation)
-
Clustering...
Example
-
Clustering...
Example (content-based image retrieval)
-
Outlier Detection
an outlier is an object that is far away from any cluster
clustering can be used to detect outliers
Example (Fraud Detection)
collect old transactions of a credit card holder
cluster the transactions based on the location and/or the amount of money spent
detect whether an incoming transaction is considerably dissimilar to all clusters
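The last step of the fraud detection example can be sketched as a distance check against cluster centroids. This assumes the clusters of old transactions have already been summarized by their centroids and that "considerably dissimilar" means farther than a fixed threshold; the centroid values, the units, and the threshold are all hypothetical.

```python
import math

def is_outlier(transaction, centroids, threshold):
    """Flag a transaction that is farther than `threshold` from every cluster centroid."""
    return all(math.dist(transaction, c) > threshold for c in centroids)

# Hypothetical centroids of past transactions: (amount in $100, distance from home in 100 km).
centroids = [(0.5, 0.1), (3.0, 0.2)]

usual = (0.6, 0.15)      # small purchase close to home
suspicious = (40.0, 90.0)  # large purchase very far from home

print(is_outlier(usual, centroids, threshold=1.0))
print(is_outlier(suspicious, centroids, threshold=1.0))
```

In practice the threshold would depend on how spread out each cluster is, rather than being one global constant.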
-
Association Analysis
Example
TID | Items
1   | Bread, Butter, Milk, Cereal
2   | Beer, Coke
3   | Bread, Diaper, Milk, Cereal
4   | Beer, Diaper
5   | Coke, Biscuits, Milk
A supermarket wishes to find the products that most frequently co-occur in the customer transactions, in order to strategize effective promotions
Goal
Given a transactional database, find the sets of objects that frequently appear within the same transactions
also called frequent pattern mining
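For the five example transactions, frequent itemsets can be found by direct counting. The sketch below is a brute-force illustration, not an efficient frequent pattern mining algorithm; the `frequent_itemsets` helper and the minimum support of 2 transactions are assumptions.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the example.
transactions = [
    {"Bread", "Butter", "Milk", "Cereal"},
    {"Beer", "Coke"},
    {"Bread", "Diaper", "Milk", "Cereal"},
    {"Beer", "Diaper"},
    {"Coke", "Biscuits", "Milk"},
]

def frequent_itemsets(transactions, size, min_support):
    """Count every itemset of the given size and keep those in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions:
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=2)
triples = frequent_itemsets(transactions, size=3, min_support=2)
print(pairs)
print(triples)
```

With this threshold, Bread, Milk, and Cereal are the only items that repeatedly appear together, so they would be candidates for a joint promotion.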
-
Emerging Data Mining Functionalities
Social network analysis
mainly motivated by the rapid proliferation of social networking (Facebook, YouTube, etc.)
Example (possible tasks)
discover social communities using similarity metrics
model the strength of a social link based on the interaction between the users
identify the most influential users in the social network (for viral marketing purposes)
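As a rough illustration of the last task, influence can be approximated by how many direct connections a user has. This sketch assumes degree (number of links) as the influence measure, which is only the simplest of several possible choices; the user names and links are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected friendship links.
links = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("ann", "dan"),
    ("bob", "cat"),
    ("dan", "eve"),
]

# Degree of a user = number of links the user takes part in.
degree = defaultdict(int)
for u, v in links:
    degree[u] += 1
    degree[v] += 1

most_influential = max(degree, key=degree.get)
print(most_influential, degree[most_influential])
```

Weighting each link by the amount of interaction between the two users would combine this with the link-strength task listed above.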
-
Summary
We explained what data mining is and why it is important
We described the architecture of a typical data mining system
We outlined various repositories used for data mining
We overviewed the major data mining tasks
-
Summary
Architecture of a typical data mining system (Databases, Data Warehouse, World Wide Web, Other Info Repositories; Data Preprocessing; Database or Data Warehouse Server; Data Mining Engine; Pattern Evaluation; User Interface; Knowledge Base), with components marked as covered "In this lecture", "In this lecture (only task overview)", or "In the next lecture"