Dwdm Intro
-
Upload
battula-sudheer-kumar-naidu -
Transcript of Dwdm Intro
-
8/2/2019 Dwdm Intro
1/103
Outline
Background: Content of human mind, Sample data mining problems, Why data mining?
Definition, KDD process, System architecture
Data Visualization
Data warehousing
Course outline
Overview of Data mining tasks
Top 10 data mining algorithms
Issues in data mining
Data mining applications
Summary
-
Data, Information, Knowledge, and Wisdom
by Gene Bellinger, Durval Castro, Anthony Mills
According to Russell Ackoff, the content of the human mind can be classified into five categories: Data, Information, Knowledge, Understanding and Wisdom.
Data: Symbols
Data represents a fact or statement of an event without relation to other things.
Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. It does not have meaning of itself. In computer parlance, a spreadsheet generally starts out by holding data.
Ex: It is raining.
-
Content of Human Mind
Information: Data that are processed to be useful; provides answers to who, what, where, and when questions.
Information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be.
In computer parlance, a relational database makes information from the data stored within it.
Information embodies the understanding of a relationship of some sort, possibly cause and effect.
Ex: The temperature dropped 15 degrees and then it started raining.
-
Content of Human Mind
Knowledge: application of data and information; answers how questions.
Knowledge is the appropriate collection of information, such that its intent is to be useful. Knowledge is a deterministic process. When someone "memorizes" information (as less-aspiring test-bound students often do), then they have amassed knowledge.
Ex: If the humidity is very high and the temperature drops suddenly, the atmosphere is often unlikely to be able to hold the moisture, so it rains.
-
Content of Human Mind
Understanding: appreciation of why
It is the process by which one can take knowledge and synthesize new knowledge from the previously held knowledge.
The difference between understanding and knowledge is the difference between "learning" and "memorizing".
People who have understanding can undertake useful actions because they can synthesize new knowledge, or in some cases, at least new information, from what is previously known (and understood).
That is, understanding can build upon currently held information, knowledge and understanding itself.
In computer parlance, AI systems possess understanding in the sense that they are able to synthesize new knowledge from previously stored information and knowledge.
-
Content of Human Mind
Wisdom: evaluated understanding
It is the process by which we also discern, or judge, between right and wrong, good and bad. I personally believe that computers do not have, and will never have, the ability to possess wisdom.
Ex: It rains because it rains. And this encompasses an understanding of all the interactions that happen between raining, evaporation, air currents, temperature gradients, changes, and raining.
-
Sample data mining problem #1
I manage a supermarket (restaurant, video store, book store) and my cash register (or web site) pumps transactions into my DB.
Can you help me visualize my sales?
Can you profile my customers?
Tell me something interesting.
I do not know statistics, and I do not want to hire statisticians.
-
Sample data mining problem #2
I am an astronomer and I have a sky survey: 3 terabytes of data, 2 billion objects.
Can you help to recognize the objects?
Most of my data is beyond my reach.
Can you find new/unusual items in my data?
Can you help me with basic manipulation, so I can focus on basic science?
I know my data and statistics, but that is not enough.
-
About Data mining
Look up a few records: SQL
Populate standard report: SQL
Create a new report: OLAP/mining
Data mining:
Optimize business process
Locate a new problem
Understand something new
Answer a tough question
-
Evolution of Database Technology
Before 1960s: Primitive file processing
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
-
Why Data Mining ?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized
society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, ...
Science: Remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets
-
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
-
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
-
Mining Large Data Sets - Motivation
There is often information hidden in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995-1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
-
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
-
What Is Data Mining?
Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their inside stories:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs
-
What is (not) Data Mining?
What is not Data Mining?
Look up a phone number in a phone directory
Query a Web search engine for information about "Amazon"
What is Data Mining?
Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
-
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation
-
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of application
Data cleaning: to remove noise and inconsistent data
Data integration: multiple data sources can be combined
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, association, classification, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
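The steps above can be read as a small pipeline. The following is only a minimal sketch under stated assumptions: the toy records, the field names, and the `min_count` threshold are invented for illustration and are not part of the course material, and "mining" is reduced to simple counting.

```python
from collections import Counter

# Hypothetical raw records from two sources (assumed data, for illustration only).
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c1", "item": "Milk "}, {"customer": "c3", "item": "bread"}]


def clean(records):
    """Data cleaning: drop records with a missing item and normalize the text."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["item"] is not None
    ]


def integrate(*sources):
    """Data integration: combine several sources into one list of records."""
    return [r for source in sources for r in source]


def select(records, field):
    """Data selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]


def mine(values):
    """Data mining (here only summarization): count how often each value occurs."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that reach an assumed interestingness threshold."""
    return {p: n for p, n in patterns.items() if n >= min_count}


records = integrate(clean(store_a), clean(store_b))
items = select(records, "item")
frequent = evaluate(mine(items), min_count=2)
print(frequent)
```

Each function corresponds to one stage, so a real system could swap the counting step for an association, classification, or clustering algorithm without touching the other stages.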
-
Architecture: Typical Data Mining System
Layers, from top to bottom:
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Database or Data Warehouse Server
Data cleaning, integration, and selection
Database, Data Warehouse, World-Wide Web, Other Info Repositories
Knowledge-Base (guides the data mining process)
-
Components of data mining system
Database, data warehouse, World Wide Web or other information repository: Data cleaning and data integration techniques are performed on this data.
Database and data warehouse server: Responsible for fetching the relevant data, based on the user's data mining request.
Knowledge-base: Domain knowledge which is used to guide the data mining process. Attribute levels, semantics, user beliefs, pattern interestingness, thresholds, metadata.
Data mining engine: Set of functional modules for tasks such as characterization, summarization, association, classification, clustering, outlier extraction.
Pattern evaluation: Employs interestingness measures. Push pattern evaluation as deep into the mining process as possible so that the search can be optimized.
User interface: communication between users and the data mining system.
-
Data Visualization
One Picture May Be Worth 1000 Words!
Visual Data Mining
Visualization of data
Visualization of data mining results
Visualization of data mining processes
Interactive data mining: visual classification
One melody may be worth 1000 words too!
Audio data mining: turn data into music and melody!
Uses audio signals to indicate the patterns of data or the features of data mining results
-
Visualization of data mining results in SAS Enterprise Miner: scatter plots
-
Visualization of association rules in MineSet 3.0
-
Visualization of a decision tree in MineSet 3.0
-
Visualization of Data Mining Processes by Clementine
-
Interactive Visual Mining by Perception-Based Classification (PBC)
-
Visualization on NTT i-Townpage
-
Traversal Diagram
-
Visitor Success Path
-
Day/Night Success Path
-
Data Mining and Business Intelligence
Layers, listed from top (highest potential to support business decisions) to bottom:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
-
Data Mining: Confluence of Multiple Disciplines
Data mining draws on: Database Technology, Statistics, Machine Learning, Visualization, Information Science, Other Disciplines
Other disciplines: pattern recognition, image processing, signal processing, spatial or temporal data analysis.
-
Regarding this course
Emphasis is on efficient and scalable data mining techniques.
Algorithms must be highly scalable to handle data volumes such as terabytes
Scalability: Running time should grow approximately linearly in proportion to the size of data, given the available resources such as main memory and disk space.
Using the proposed techniques, interesting knowledge, regularities or high-level information can be extracted from the databases and viewed or browsed from different angles.
Efficiency: Without compromising quality
-
Why Not Traditional Data Analysis? (statistics, ...)
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
Scalability: Running time should grow approximately linearly in proportion to the size of data.
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
-
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
-
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
-
Data Mining Functionalities
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper --> Beer [0.5%, 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown or missing numerical values
-
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera --> large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
-
What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
The process of constructing and using data warehouses
-
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
-
Data Warehouse: Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems
Operational database: current value data
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain a time element
-
Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: initial loading of data and access of data
-
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration: a query-driven approach
Build wrappers/mediators on top of heterogeneous databases
When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
-
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
-
OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
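The contrast between the two access patterns can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the `sales` table, its columns, and its four rows are assumptions made up for the example, and a single in-memory SQLite database stands in for what would normally be two separately tuned systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
# Hypothetical rows, for illustration only.
rows = [
    (1, "north", 2018, 100.0),
    (2, "north", 2019, 150.0),
    (3, "south", 2018, 80.0),
    (4, "south", 2019, 120.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# OLTP-style access: a short read/write transaction that touches one record
# through its primary key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (2,))
one = conn.execute("SELECT amount FROM sales WHERE id = ?", (2,)).fetchone()[0]

# OLAP-style access: a read-only query that scans many records and
# consolidates them along a dimension (here: region).
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)

print(one)
print(totals)
conn.close()
```

The first query reads or writes a handful of rows by key, while the second has to visit every row of the table, which is why the two workloads benefit from different indexing and storage choices.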
-
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
missing data: Decision support requires historical data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis directly on relational databases
-
Objectives
Data mining is the process of extracting interesting and useful information/knowledge from large databases or data warehouses.
The course covers
the concepts and techniques of data mining such as association rules, clustering, and classification.
the basic concepts, architecture and general implementations of data warehousing technology
-
Course topics
Introduction (3 hrs): Definition, KDD framework, Issues in data mining.
Association Rules (9 hrs): Problem definition, Frequent item-set generation, Apriori and FP-growth algorithms, Evaluation of Association patterns.
Clustering (9 hrs): Overview, Types of Data, K-means, Agglomerative clustering, Clustering algorithms (DBSCAN, BIRCH, CURE, ROCK, CHAMELEON).
Classification (9 hrs): Overview, Decision tree induction, Over-fitting and under-fitting, Scalable decision tree algorithms, Bayesian Classification, Regression-based Prediction methods.
Data preprocessing (6 hrs): Data summarization, Data cleaning, Data integration and transformation, Data reduction, Data discretization and Concept hierarchy.
Data warehousing (9 hrs): Multidimensional data model, Data warehousing architecture, Data cube computation and OLAP technology.
-
Text Books
Research papers: In this course, about 25 research papers will be covered. Students can refer to the following books for the details of some research papers and other background information.
Text books:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second edition, 2006, Elsevier Inc.
Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, 2006, Pearson Education.
Reference books:
Papers from the proceedings of the conferences and journals related to data mining and data warehousing.
-
LAB WORK
Several data mining tasks related to data preprocessing, association rules, clustering and classification will be given.
-
Outcome
After completing the course, the students will
be able to appreciate the importance of extracting useful knowledge from large amounts of data to improve the performance of a business/organization.
get enough exposure to investigate new/improved data mining methods.
understand the basics of data warehousing technology and its links to data mining.
be able to play the role of a Data Miner in an organization.
-
GRADING
MidSem1: 15 %; MidSemII: 15 %; EndSem: 30%;
Research Paper Quiz: 10 % Project/Lab: 30 %
-
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007
-
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
ICML, CVPR, NIPS
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
-
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology CD ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.
-
Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan
Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction,
Springer-Verlag, 2001
B. Liu, Web Data Mining, Springer 2006.
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed., 2005
-
Data Mining Tasks
Prediction Methods Use some variables to predict unknown or future
values of other variables.
Description Methods Find human-interpretable patterns that describe
the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
-
Data Mining Tasks...
Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Classification [Predictive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
-
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection:
Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
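How strongly the five transactions back each rule can be checked with the two standard association-rule measures, support and confidence. The slide itself does not state these values; the sketch below simply computes them from the table, and the helper names `count`, `support` and `confidence` are chosen here for illustration.

```python
# The five transactions from the table above, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

Milk appears in four transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} has support 3/5 and confidence 3/4; {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.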
-
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
Let the rule discovered be {Bagels, ...} --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
-
Association Rule Discovery: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
A classic rule:
If a customer buys diaper and milk, then he is very likely to buy beer.
So, don't be surprised if you find six-packs stacked next to diapers!
-
Association Rule Discovery: Application 3
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
-
Sequential Pattern Discovery: Definition
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
(A B) (C) (D E)
-
Sequential Pattern Discovery: Examples
In telecommunications alarm logs:
  (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences:
  Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
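Checking whether one object's timeline supports a pattern like those above can be sketched as follows. This is a simplified illustration, not a mining algorithm: the function name and the customer timeline are hypothetical, and the timing constraints mentioned in the definition are ignored.

```python
def contains_pattern(timeline, pattern):
    """Return True if the pattern's elements occur in order in the timeline.

    timeline and pattern are lists of sets; each pattern element must be a
    subset of a distinct timeline element, preserving order. Timing
    constraints (gaps, windows) are ignored in this sketch.
    """
    position = 0
    for element in pattern:
        # Skip event sets that do not contain this pattern element.
        while position < len(timeline) and not element <= timeline[position]:
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next pattern element must match a later event set
    return True


# Hypothetical purchase history of one bookstore customer.
timeline = [
    {"Intro_To_Visual_C"},
    {"C++_Primer", "Coffee_Mug"},
    {"Perl_for_dummies", "Tcl_Tk"},
]
antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
full_rule = antecedent + [{"Perl_for_dummies", "Tcl_Tk"}]

matches_antecedent = contains_pattern(timeline, antecedent)
matches_rule = contains_pattern(timeline, full_rule)
out_of_order = contains_pattern(
    timeline, [{"C++_Primer"}, {"Intro_To_Visual_C"}]
)
```

Counting how many timelines match the antecedent and how many also match the full rule gives an estimate of how reliable the rule is.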
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  Data points in one cluster are more similar to one another.
  Data points in separate clusters are less similar to one another.
Similarity Measures:
  Euclidean Distance if attributes are continuous.
  Other Problem-specific Measures.
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
  Intracluster distances are minimized
  Intercluster distances are maximized
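The two criteria can be made concrete with a small Python sketch. It assumes the points are already assigned to clusters and only measures the grouping; the helper names and the 3-D points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intracluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    pairs = [pair for c in clusters for pair in combinations(c, 2)]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def mean_intercluster_distance(clusters):
    """Average distance over all pairs of points from different clusters."""
    pairs = [
        (p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    ]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


# Hypothetical 3-D points, already grouped into two clusters.
clusters = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0)],
]

intra = mean_intracluster_distance(clusters)
inter = mean_intercluster_distance(clusters)
```

A good clustering of this data yields a small `intra` value and a much larger `inter` value.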
Clustering: Application 1
Market Segmentation:
  Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  Approach:
    Collect different attributes of customers based on their geographical and lifestyle related information.
    Find clusters of similar customers.
    Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
  Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
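One way to build a similarity measure from term frequencies is sketched below. Cosine similarity is an assumed choice here (the slides do not name a specific measure), and the three short documents are made up for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical documents.
sports_1 = term_frequencies("the team won the match")
sports_2 = term_frequencies("the team lost the match")
finance = term_frequencies("stock prices fell sharply")

similar = cosine_similarity(sports_1, sports_2)
dissimilar = cosine_similarity(sports_1, finance)
```

In practice very common words such as "the" would be filtered out first, so that they do not dominate the measure.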
Illustrating Document Clustering
  Clustering Points: 3204 articles of the Los Angeles Times.
  Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
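The table can be summarized with a little arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative.

```python
# (total articles, correctly placed) per category, as listed in the table.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total = sum(t for t, _ in results.values())
correct = sum(c for _, c in results.values())

# Fraction of articles placed in the right cluster, per category and overall.
per_category = {name: c / t for name, (t, c) in results.items()}
overall = correct / total
weakest = min(per_category, key=per_category.get)
```

The category totals add up to the 3204 articles, of which 2257 (about 70%) are correctly placed; National is by far the weakest category.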
Clustering of S&P 500 Stock Data
Discovered Clusters and their Industry Groups:
  Cluster 1 (Industry Group: Technology1-DOWN):
    Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  Cluster 2 (Industry Group: Technology2-DOWN):
    Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  Cluster 3 (Industry Group: Financial-DOWN):
    Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (Industry Group: Oil-UP):
    Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Observe stock movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.
Classification: Definition
Given a collection of records (training set)
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification Example
Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Figure: the Training Set is used to learn a classifier (the model), which is then applied to the Test Set.
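To make the idea of a model concrete, the sketch below hand-codes one small decision tree that happens to fit the ten training records. It is a hypothetical model written for illustration, not the output of a learning algorithm, and the 80K income threshold is an assumption chosen to separate the training labels.

```python
def classify(refund, marital_status, taxable_income_k):
    """A hand-written decision tree; one hypothetical model among many."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or divorced, no refund: split on income (assumed 80K threshold).
    return "Yes" if taxable_income_k >= 80 else "No"


# Training set: (Refund, Marital Status, Taxable Income in K, Cheat).
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test set: the class (Cheat) is unknown and must be predicted.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

training_accuracy = sum(
    classify(r, m, inc) == label for r, m, inc, label in training_set
) / len(training_set)
predictions = [classify(r, m, inc) for r, m, inc in test_set]
```

Fitting the training set perfectly does not guarantee accuracy on unseen records, which is why a labeled test set is held out for validation.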
Classification: Application 1
Direct Marketing
  Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  Approach:
    Use the data for a similar product introduced before.
    We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      Type of business, where they stay, how much they earn, etc.
    Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2
Fraud Detection
  Goal: Predict fraudulent cases in credit card transactions.
  Approach:
    Use credit card transactions and the information on the account-holder as attributes.
      When does a customer buy, what does he buy, how often he pays on time, etc.
    Label past transactions as fraud or fair transactions. This forms the class attribute.
    Learn a model for the class of the transactions.
    Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
  Goal: To predict whether a customer is likely to be lost to a competitor.
  Approach:
    Use detailed record of transactions with each of the past and present customers, to find attributes.
      How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
    Label the customers as loyal or disloyal.
    Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4
Sky Survey Cataloging
  Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    3000 images with 23,040 x 23,040 pixels per image.
  Approach:
    Segment the image.
    Measure image attributes (features): 40 of them per object.
    Model the class based on these features.
  Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Class: Stages of Formation (Early, Intermediate, Late)
Attributes: Image features, characteristics of light waves received, etc.
Data Size:
  72 million stars, 20 million galaxies
  Object Catalog: 9 GB
  Image Database: 150 GB
Courtesy: http://aps.umn.edu
Regression
  Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
  Extensively studied in the statistics and neural network fields.
  Examples:
    Predicting sales amounts of a new product based on advertising expenditure.
    Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
    Time series prediction of stock market indices.
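For the linear case with a single input variable, the model can be fitted by ordinary least squares. The sketch below is a minimal illustration; the function name and the advertising and sales figures are invented, and the data are deliberately constructed to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]  # constructed as sales = 2 * advertising + 1

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen expenditure
```

Real data would scatter around the fitted line, and the size of that scatter indicates how much the prediction can be trusted.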
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
  Credit Card Fraud Detection
  Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
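A very simple way to flag deviations is to compare each observation with the mean and standard deviation of the data. The sketch below assumes this rule purely for illustration (real fraud and intrusion detectors are considerably more elaborate); the function name, threshold, and transaction amounts are hypothetical.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]


# Hypothetical credit card transaction amounts on one account.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]

anomalies = find_anomalies(amounts)
```

Here only the 500 transaction is flagged; the outlier itself inflates the mean and standard deviation, which is one reason more robust methods are preferred in practice.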
Assignment 1: Identify a problem from your own experience that you think would be amenable to data mining. Describe:
  (i) What the data is.
  (ii) What type of benefit you might hope to get from data mining.
  (iii) What type of data mining (classification, clustering, etc.) you think would be relevant.
For each, illustrate with an example, e.g., if you think clustering is relevant, describe what you think a likely cluster might contain and what the real-world meaning would be.
Submit two pages of 11 point single-spaced typeset text (leave 0.5 inch margins). Write your roll number and name.
Last Date: 14-08-08 (5PM)
References: Introductory chapters of any data mining book or any data mining paper and the PPTs of first two classes.
Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
Classification
  #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
  #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
Statistical Learning
  #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
Association Analysis
  #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
The 18 Identified Candidates (II)
Link Mining
  #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
  #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
Clustering
  #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
  #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
Bagging and Boosting
  #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
The 18 Identified Candidates (III)
Sequential Patterns
  #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
  #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
Integrated Mining
  #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
Rough Sets
  #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
Graph Mining
  #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.
Top-10 Algorithms Finally Selected at ICDM '06
#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)
Challenges of Data Mining
  Scalability
  Dimensionality
  Complex and Heterogeneous Data
  Data Quality
  Data Ownership and Distribution
  Privacy Preservation
  Streaming Data
Major Issues in Data Mining
Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing one: knowledge fusion
User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
  Domain-specific data mining & invisible data mining
  Protection of data security, integrity, and privacy
DM applications: Market Analysis and Management
Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
  Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
  Associations/co-relations between product sales
  Prediction based on the association information
DM applications: Market Analysis and Management
Customer profiling
  Data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
  Identifying the best products for different customers
  Use prediction to find what factors will attract new customers
Provides summary information
  Various multidimensional summary reports
  Statistical summary information (data central tendency and variation)
DM applications: Corporate Analysis and Risk Management
Finance planning and asset evaluation
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
  summarize and compare the resources and spending
Competition:
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market
DM applications: Fraud Detection and Management
Applications
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
  auto insurance: detect a group of people who stage accidents to collect on insurance
  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  medical insurance: detect professional patients and ring of doctors and ring of references
DM applications: Fraud Detection and Management
Detecting inappropriate medical treatment
  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
Detecting telephone fraud
  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
  Analysts estimate that 38% of retail shrink is due to dishonest employees.
Other Applications of data mining
Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
Summary
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining systems and architectures
Data warehousing
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Major issues in data mining