Dwdm Intro


  • 8/2/2019 Dwdm Intro


    Outline

    Background: content of human mind, sample data mining problems, why data mining?
    Definition, KDD process, system architecture
    Data visualization
    Data warehousing
    Course outline
    Overview of data mining tasks
    Top 10 data mining algorithms
    Issues in data mining
    Data mining applications
    Summary



    Data, Information, Knowledge, and Wisdom
    by Gene Bellinger, Durval Castro, Anthony Mills

    According to Russell Ackoff, the content of the human mind can be classified into five categories: data, information, knowledge, understanding, and wisdom.

    Data: symbols

    Data represents a fact or statement of an event without relation to other things.

    Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. It does not have meaning of itself. In computer parlance, a spreadsheet generally starts out by holding data.

    Ex: It is raining.



    Content of Human Mind

    Information: data that are processed to be useful; provides answers to who, what, where, and when questions.

    Information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it.

    Information embodies the understanding of a relationship of some sort, possibly cause and effect.

    Ex: The temperature dropped 15 degrees and then it started raining.



    Content of Human Mind

    Knowledge: application of data and information; answers how questions.

    Knowledge is the appropriate collection of information, such that its intent is to be useful. Knowledge is a deterministic process. When someone "memorizes" information (as less-aspiring test-bound students often do), then they have amassed knowledge.

    Ex: If the humidity is very high and the temperature drops suddenly, the atmosphere is often unlikely to be able to hold the moisture, so it rains.



    Content of Human Mind

    Understanding: appreciation of why

    It is the process by which one can take knowledge and synthesize new knowledge from the previously held knowledge. The difference between understanding and knowledge is the difference between "learning" and "memorizing".

    People who have understanding can undertake useful actions because they can synthesize new knowledge, or in some cases at least new information, from what is previously known (and understood). That is, understanding can build upon currently held information, knowledge, and understanding itself.

    In computer parlance, AI systems possess understanding in the sense that they are able to synthesize new knowledge from previously stored information and knowledge.



    Content of Human Mind

    Wisdom: evaluated understanding

    It is the process by which we also discern, or judge, between right and wrong, good and bad. I personally believe that computers do not have, and will never have, the ability to possess wisdom.

    Ex: It rains because it rains. And this encompasses an understanding of all the interactions that happen between raining, evaporation, air currents, temperature gradients, changes, and raining.



    Sample data mining problem # 1

    I manage a supermarket (restaurant, video store, book store) and my cash register (or web site) pumps transactions into my DB.
      Can you help me visualize my sales?
      Can you profile my customers?
      Tell me something interesting.
    I do not know statistics, and I do not want to hire




    Sample data mining problem #2

    I am an astronomer and I have a sky survey: 3 terabytes of data, 2 billion objects.
      Can you help me recognize the objects? Most of my data is beyond my reach.
      Can you find new/unusual items in my data?
      Can you help me with basic manipulation, so I can focus on basic science?
    I know my data and statistics, but that is not enough.



    About Data mining

    Look up a few records: SQL
    Populate a standard report: SQL
    Create a new report: OLAP/mining

    Data mining
      Optimize a business process
      Locate a new problem
      Understand something new
      Answer a tough question



    Evolution of Database Technology

    Before 1960s: Primitive file processing

    1960s: Data collection, database creation, IMS and network DBMS

    1970s: Relational data model, relational DBMS implementation

    1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

    Application-oriented DBMS (spatial, scientific, engineering, etc.)

    1990s: Data mining, data warehousing, multimedia databases, and Web databases

    2000s: Stream data management and mining

    Data mining and its applications

    Web technology (XML, data integration) and global information systems



    Why Data Mining ?

    The Explosive Growth of Data: from terabytes to petabytes

    Data collection and data availability

    Automated data collection tools, database systems, Web, computerized


    Major sources of abundant data

    Business: Web, e-commerce, transactions, stocks,

    Science: Remote sensing, bioinformatics, scientific simulation,

    Society and everyone: news, digital cameras, YouTube

    We are drowning in data, but starving for knowledge!

    "Necessity is the mother of invention": data mining, the automated analysis of massive data sets



    Why Mine Data? Commercial Viewpoint

    Lots of data is being collected and warehoused
      Web data, e-commerce
      Purchases at department/grocery stores
      Bank/credit card transactions

    Computers have become cheaper and more powerful

    Competitive pressure is strong
      Provide better, customized services for an edge (e.g., in Customer Relationship Management)



    Why Mine Data? Scientific Viewpoint

    Data collected and stored at enormous speeds (GB/hour)
      remote sensors on a satellite
      telescopes scanning the skies
      microarrays generating gene expression data
      scientific simulations generating terabytes of data

    Traditional techniques infeasible for raw data

    Data mining may help scientists
      in classifying and segmenting data
      in hypothesis formation



    Mining Large Data Sets - Motivation

    There is often information hidden in the data that is not readily evident.

    Human analysts may take weeks to discover useful information.

    Much of the data is never analyzed at all.










    [Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999.]

    From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications



    Evolution of Sciences

    Before 1600, empirical science

    1600-1950s, theoretical science

    Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

    1950s-1990s, computational science

    Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).

    Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

    1990-now, data science

    The flood of data from new scientific instruments and simulations

    The ability to economically store and manage petabytes of data online

    The Internet and computing Grid that makes all these archives universally accessible

    Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

    Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002




    What Is Data Mining?

    Data mining (knowledge discovery in databases):
      Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

    Alternative names and their inside stories:
      Data mining: a misnomer?
      Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

    What is not data mining?
      (Deductive) query processing
      Expert systems or small ML/statistical programs



    What is (not) Data Mining?

    What is Data Mining?
      Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
      Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

    What is not Data Mining?
      Look up a phone number in a phone directory
      Query a Web search engine for information about "Amazon"



    Data Mining: A KDD Process

    Data mining: the core of the knowledge discovery process.

    [Figure: KDD pipeline. Data cleaning and data integration --> data warehouse --> task-relevant data --> data mining --> pattern evaluation]



    Steps of a KDD Process

    Learning the application domain: relevant prior knowledge and goals of the application
    Data cleaning: remove noise and inconsistent data
    Data integration: multiple data sources can be combined
    Creating a target data set: data selection
    Data cleaning and preprocessing (may take 60% of the effort!)
    Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
    Choosing functions of data mining: summarization, association, classification, clustering
    Choosing the mining algorithm(s)
    Data mining: search for patterns of interest
    Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
    Use of discovered knowledge
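
    The steps above can be read as a chain of small transformations. The following is only a toy sketch in Python, not a real KDD system: the two data sources, the function names, and the minimum count are all made up for illustration, and the "mining" step is plain counting (summarization).

```python
from collections import Counter

# Two hypothetical sources with noisy records (None marks a missing value).
store_a = [("alice", "milk"), ("bob", None), ("alice", "bread")]
store_b = [("carol", "milk"), ("alice", "milk"), ("bob", "beer")]

def clean(records):
    """Data cleaning: drop records with missing fields."""
    return [r for r in records if None not in r]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the task-relevant attribute (the item)."""
    return [item for _customer, item in records]

def mine(items):
    """Data mining (here just summarization): count item occurrences."""
    return Counter(items)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {item: n for item, n in patterns.items() if n >= min_count}

result = evaluate(mine(select(integrate(clean(store_a), clean(store_b)))))
print(result)
```

    Each function corresponds to one named step, so a different mining function could be swapped in without touching the cleaning or selection code.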



    Architecture: Typical Data Mining System

    [Figure: layered architecture. Layers from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, integration, and selection; database, data warehouse, and other info repositories]



    Components of data mining system

    Database, data warehouse, World Wide Web, or other information repository: data cleaning and data integration techniques are performed on this data.

    Database and data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.

    Knowledge base: domain knowledge used to guide the data mining process, such as attribute levels, semantics, user beliefs, pattern interestingness, thresholds, and metadata.

    Data mining engine: set of functional modules for tasks such as characterization, summarization, association, classification, clustering, and outlier extraction.

    Pattern evaluation: employs interestingness measures. Push pattern evaluation as deep as possible into the mining process so that the search can be optimized.

    User interface: communication between users and the data mining system.





    Data Visualization: One Picture May Be Worth 1000 Words!

    Visual data mining
      Visualization of data
      Visualization of data mining results
      Visualization of data mining processes
      Interactive data mining: visual classification

    One melody may be worth 1000 words too!
      Audio data mining: turn data into music and melody!
      Uses audio signals to indicate the patterns of data or the features of data mining results



    Visualization of data mining results in SAS Enterprise Miner: scatter plots



    Visualization of association rules in MineSet 3.0



    Visualization of a decision tree in MineSet 3.0



    Visualization of Data Mining Processes by Clementine



    Interactive Visual Mining by Perception-Based Classification (PBC)



    Visualization on NTT i-Townpage



    Traversal Diagram



    Visitor Success Path



    Day/Night Success Path



    Data Mining and Business Intelligence

    [Figure: pyramid of layers. The potential to support business decisions increases toward the top, where the end user sits. Layers from top to bottom:]
      Data presentation: visualization techniques
      Data mining: information discovery
      Data exploration: statistical analysis, querying and reporting
      Data warehouses / data marts
      Data sources: paper, files, information providers, database systems, OLTP



    Data Mining: Confluence of Multiple Disciplines

    [Figure: data mining at the center, surrounded by contributing disciplines, including database technology and statistics]

    Other disciplines: pattern recognition, image processing, signal processing, spatial or temporal data analysis.



    Regarding this course

    Emphasis is on efficient and scalable data mining techniques.

    Algorithms must be highly scalable to handle data volumes such as terabytes.

    Scalability: running time should grow approximately linearly in proportion to the size of the data, given the available resources such as main memory and disk space.

    Using the proposed techniques, interesting knowledge, regularities, or high-level information can be extracted from the databases and viewed or browsed from different angles.

    Efficiency: without compromising quality



    Why Not Traditional Data Analysis (Statistics, etc.)?

    Tremendous amount of data

    Algorithms must be highly scalable to handle data volumes such as terabytes

    Scalability: running time should grow approximately linearly in proportion to the size of the data.

    High-dimensionality of data

    Micro-array may have tens of thousands of dimensions

    High complexity of data

    Data streams and sensor data

    Time-series data, temporal data, sequence data

    Structured data, graphs, social networks and multi-linked data

    Heterogeneous databases and legacy databases

    Spatial, spatiotemporal, multimedia, text and Web data

    Software programs, scientific simulations

    New and sophisticated applications



    Multi-Dimensional View of Data Mining

    Data to be mined

    Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW

    Knowledge to be mined

    Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

    Multiple/integrated functions and mining at multiple levels

    Techniques utilized

    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

    Applications adapted

    Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.



    Data Mining: On What Kinds of Data?

    Database-oriented data sets and applications

    Relational database, data warehouse, transactional database

    Advanced data sets and advanced applications

    Data streams and sensor data

    Time-series data, temporal data, sequence data (incl. bio-sequences)

    Structured data, graphs, social networks and multi-linked data

    Object-relational databases

    Heterogeneous databases and legacy databases

    Spatial data and spatiotemporal data

    Multimedia database

    Text databases

    The World-Wide Web



    Data Mining Functionalities

    Multidimensional concept description: characterization and discrimination
      Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

    Frequent patterns, association, correlation vs. causality
      Diaper --> Beer [support 0.5%, confidence 75%] (correlation or causality?)

    Classification and prediction
      Construct models (functions) that describe and distinguish classes or concepts for future prediction
      E.g., classify countries based on climate, or classify cars based on gas mileage
      Predict some unknown or missing numerical values



    Data Mining Functionalities (2)

    Cluster analysis
      Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
      Maximizing intra-class similarity and minimizing inter-class similarity

    Outlier analysis
      Outlier: data object that does not comply with the general behavior of the data
      Noise or exception? Useful in fraud detection and rare events analysis

    Trend and evolution analysis
      Trend and deviation: e.g., regression analysis
      Sequential pattern mining: e.g., digital camera --> large SD memory
      Periodicity analysis
      Similarity-based analysis

    Other pattern-directed or statistical analyses
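
    To make the outlier definition on this slide concrete, here is a small Python sketch that flags values far from the bulk of the data. It uses the modified z-score based on the median absolute deviation, which is one common heuristic and not a method named in these slides; the transaction amounts and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median; nothing can be flagged this way
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [12.0, 15.5, 14.0, 13.2, 16.1, 980.0, 14.8]
print(mad_outliers(amounts))
```

    Whether a flagged value is noise or a genuine exception (for example, fraud) still has to be decided by the analyst, which is the question the slide raises.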




    What is Data Warehouse?

    Defined in many different ways, but not rigorously.
      A decision support database that is maintained separately from the organization's operational database
      Supports information processing by providing a solid platform of consolidated, historical data for analysis

    "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

    Data warehousing:

    The process of constructing and using data warehouses



    Data Warehouse: Subject-Oriented

    Organized around major subjects, such as customer, product, sales

    Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

    Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process



    Data Warehouse: Time Variant

    The time horizon for the data warehouse is significantly longer than that of operational systems
      Operational database: current value data
      Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

    Every key structure in the data warehouse
      Contains an element of time, explicitly or implicitly
      But the key of operational data may or may not contain a time element



    Data Warehouse: Nonvolatile

    A physically separate store of data transformed from the operational environment

    Operational update of data does not occur in the data warehouse environment
      Does not require transaction processing, recovery, and concurrency control mechanisms
      Requires only two operations in data accessing: initial loading of data and access of data



    Data Warehouse vs. Heterogeneous DBMS

    Traditional heterogeneous DB integration: a query-driven approach
      Build wrappers/mediators on top of heterogeneous databases
      When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
      Complex information filtering; queries compete for resources

    Data warehouse: update-driven, high performance
      Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis



    Data Warehouse vs. Operational DBMS

    OLTP (on-line transaction processing)
      Major task of traditional relational DBMS
      Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

    OLAP (on-line analytical processing)
      Major task of data warehouse system
      Data analysis and decision making

    Distinct features (OLTP vs. OLAP):
      User and system orientation: customer vs. market
      Data contents: current, detailed vs. historical, consolidated
      Database design: ER + application vs. star + subject
      View: current, local vs. evolutionary, integrated
      Access patterns: update vs. read-only but complex queries




                        OLTP                                  OLAP
    users               clerk, IT professional                knowledge worker
    function            day-to-day operations                 decision support
    DB design           application-oriented                  subject-oriented
    data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                        flat relational, isolated             integrated, consolidated
    usage               repetitive                            ad-hoc
    access              read/write,                           lots of scans
                        index/hash on prim. key
    unit of work        short, simple transaction             complex query
    # records accessed  tens                                  millions
    # users             thousands                             hundreds
    DB size             100MB-GB                              100GB-TB
    metric              transaction throughput                query throughput, response
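
    The contrast between the two workloads can be illustrated with a tiny in-memory database. This is only a sketch using Python's standard sqlite3 module: the `sales` table, its columns, and the rows are invented for the example, and a real warehouse would be a separate system rather than the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
rows = [
    (1, "north", 2018, 120.0),
    (2, "north", 2019, 80.0),
    (3, "south", 2018, 200.0),
    (4, "south", 2019, 50.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# OLTP-style: short read/write transaction touching one record via its primary key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (2,))
one_row = conn.execute("SELECT amount FROM sales WHERE id = ?", (2,)).fetchone()

# OLAP-style: read-only query that scans the table and consolidates by region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(one_row)
print(totals)
```

    The first query touches a single record through the primary key index, while the second reads every row to produce a summary, which matches the "access" and "# records accessed" rows of the comparison.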



    Why Separate Data Warehouse?

    High performance for both systems
      DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
      Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

    Different functions and different data:
      Missing data: decision support requires historical data which operational DBs do not typically maintain
      Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
      Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

    Note: There are more and more systems which perform OLAP analysis directly on relational databases





    Data mining is the process of extracting interesting and useful information/knowledge from large databases or data warehouses.

    The course covers
      the concepts and techniques of data mining, such as association rules, clustering, and classification
      the basic concepts, architecture, and general implementations of data warehousing technology



    Course topics
      Introduction (3 hrs): Definition, KDD framework, Issues in data mining.
      Association Rules (9 hrs): Problem definition, Frequent item-set generation, Apriori and FP-growth algorithms, Evaluation of association patterns.
      Clustering (9 hrs): Overview, Types of data, K-means, Agglomerative clustering, Clustering algorithms (DBSCAN, BIRCH, CURE, ROCK, CHAMELEON).
      Classification (9 hrs): Overview, Decision tree induction, Over-fitting and under-fitting, Scalable decision tree algorithms, Bayesian classification, Regression-based prediction methods.
      Data preprocessing (6 hrs): Data summarization, Data cleaning, Data integration and transformation, Data reduction, Data discretization and concept hierarchy.
      Data warehousing (9 hrs): Multidimensional data model, Data warehousing architecture, Data cube computation and OLAP technology.



    Text Books

    Research papers:
      In this course, about 25 research papers will be covered. Students can refer to the following books for the details of some research papers and other background information.

    Text books:
      Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second edition, 2006, Elsevier Inc.
      Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, 2006, Pearson Education.

    Reference books:
      Papers from the proceedings of the conferences and journals related to data mining and data warehousing.




    Several data mining tasks related to data preprocessing, association rules, clustering, and classification will be given.




    After completing the course, the students will
      be able to appreciate the importance of extracting useful knowledge from large amounts of data to improve the performance of a business/organization
      get enough exposure to investigate new/improved data mining methods
      understand the basics of data warehousing technology and its links to data mining
      be able to play the role of a data miner in an organization




    MidSem I: 15%; MidSem II: 15%; EndSem: 30%; Research Paper Quiz: 10%; Project/Lab: 30%



    A Brief History of Data Mining Society

    1989 IJCAI Workshop on Knowledge Discovery in Databases

    Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

    1991-1994 Workshops on Knowledge Discovery in Databases

    Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

    1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)

    Journal of Data Mining and Knowledge Discovery (1997)

    ACM SIGKDD conferences since 1998 and SIGKDD Explorations

    More conferences on data mining: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

    ACM Transactions on KDD starting in 2007



    Conferences and Journals on Data Mining

    KDD conferences
      ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
      SIAM Data Mining Conf. (SDM)
      (IEEE) Int. Conf. on Data Mining (ICDM)
      Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
      Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

    Other related conferences
      ACM SIGMOD

    Journals
      Data Mining and Knowledge Discovery (DAMI or DMKD)
      IEEE Trans. on Knowledge and Data Eng. (TKDE)
      KDD Explorations
      ACM Trans. on KDD



    Where to Find References? DBLP, CiteSeer, Google

    Data mining and KDD (SIGKDD: CDROM)
      Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
      Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD

    Database systems (SIGMOD: ACM SIGMOD Anthology CD ROM)
      Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
      Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.

    AI & Machine Learning
      Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
      Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.

    Web and IR
      Conferences: SIGIR, WWW, CIKM, etc.
      Journals: WWW: Internet and Web Information Systems

    Statistics
      Conferences: Joint Stat. Meeting, etc.
      Journals: Annals of Statistics, etc.

    Visualization
      Conference proceedings: CHI, ACM-SIGGraph, etc.
      Journals: IEEE Trans. Visualization and Computer Graphics, etc.



    Reference Books

    S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
    R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
    T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
    U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
    U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
    J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
    D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
    T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
    B. Liu. Web Data Mining. Springer, 2006
    T. M. Mitchell. Machine Learning. McGraw Hill, 1997
    G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
    P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005
    S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
    I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005





    Data Mining Tasks

    Prediction methods
      Use some variables to predict unknown or future values of other variables.

    Description methods
      Find human-interpretable patterns that describe the data.

    From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996



    Data Mining Tasks...

    Association Rule Discovery [Descriptive]
    Clustering [Descriptive]
    Classification [Predictive]
    Sequential Pattern Discovery [Descriptive]
    Regression [Predictive]
    Deviation Detection [Predictive]



    Association Rule Discovery: Definition

    Given a set of records, each of which contains some number of items from a given collection:
      Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

    Rules discovered:
      {Milk} --> {Coke}
      {Diaper, Milk} --> {Beer}
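
    How strongly the five transactions above back each discovered rule can be checked by counting. The Python sketch below computes support and confidence, the two standard measures for association rules; the function names are chosen for this example, and the slide itself does not state thresholds for accepting a rule.

```python
# Transactions copied from the table above (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

    Milk appears in four of the five transactions and Milk with Coke in three, so {Milk} --> {Coke} has support 3/5 and confidence 3/4. {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.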



    Association Rule Discovery: Application 1

    Marketing and Sales Promotion:
      Let the rule discovered be {Bagels, ...} --> {Potato Chips}
      Potato Chips as consequent => can be used to determine what should be done to boost its sales.
      Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
      Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

  • 8/2/2019 Dwdm Intro


    Association Rule Discovery: Application 2

    Supermarket shelf management.

    Goal: To identify items that are bought together by sufficiently many customers.

    Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

    A classic rule: If a customer buys diapers and milk, then he is very likely to buy beer.

    So, don't be surprised if you find six-packs stacked next to diapers!


  • 8/2/2019 Dwdm Intro


    Association Rule Discovery: Application 3

    Inventory Management:

    Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.

    Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


  • 8/2/2019 Dwdm Intro


    Sequential Pattern Discovery: Definition

    Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

    Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

    (A B) (C) (D E)
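
    To make the notion of a pattern such as (A B) (C) (D E) concrete, the sketch below checks whether one object's timeline contains a pattern in order. It is a simplified illustration: the timing constraints mentioned above are ignored, and the function name and the example timelines are hypothetical.

```python
def contains_pattern(timeline, pattern):
    """Return True if every element of `pattern` (a set of events) is
    contained in some event set of `timeline`, in the same order.
    Timing constraints between elements are NOT checked in this sketch."""
    position = 0
    for element in pattern:
        # Scan forward for the earliest event set that includes this element.
        while position < len(timeline) and not element <= timeline[position]:
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next element must occur strictly later
    return True


pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

# Hypothetical timelines: each entry is the set of events at one time step.
timeline_ok = [{"A", "B", "F"}, {"G"}, {"C"}, {"D", "E"}]
timeline_bad = [{"C"}, {"A", "B"}, {"D", "E"}]

print(contains_pattern(timeline_ok, pattern))   # True
print(contains_pattern(timeline_bad, pattern))  # False
```

    Counting how many timelines contain a pattern is the basic step behind deciding whether the pattern is frequent enough to form a rule.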

  • 8/2/2019 Dwdm Intro


    Sequential Pattern Discovery: Examples

    In telecommunications alarm logs,
      (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)

    In point-of-sale transaction sequences,
      Computer Bookstore:
        (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
      Athletic Apparel Store:
        (Shoes) (Racket, Racketball) --> (Sports_Jacket)

  • 8/2/2019 Dwdm Intro


    Clustering Definition

    Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
      Data points in one cluster are more similar to one another.
      Data points in separate clusters are less similar to one another.

    Similarity Measures:
      Euclidean Distance if attributes are continuous.
      Other Problem-specific Measures.
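
    The definition above does not prescribe an algorithm. As one possible illustration, the sketch below uses k-means (which appears later in this deck's algorithm list) with Euclidean distance. The sample points and the starting centroids are made up for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(clusters[0])  # the three points near (1, 1)
print(clusters[1])  # the three points near (8, 8)
```

    Points within a resulting cluster are close to each other, while the two centroids end up far apart, which matches the two conditions in the definition.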

  • 8/2/2019 Dwdm Intro


    Illustrating Clustering

    Euclidean Distance Based Clustering in 3-D space.

    Intracluster distances are minimized

    Intercluster distances are maximized

  • 8/2/2019 Dwdm Intro


    Clustering: Application 1

    Market Segmentation:

    Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

    Approach:
      Collect different attributes of customers based on their geographical and lifestyle related information.
      Find clusters of similar customers.
      Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

  • 8/2/2019 Dwdm Intro


    Clustering: Application 2

    Document Clustering:

    Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

    Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

    Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
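
    The slide leaves the exact similarity measure open. One widely used choice, assumed here for illustration only, is cosine similarity between term-frequency vectors. The helper names and the three short example documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_sports_1 = term_frequencies("team wins final game")
doc_sports_2 = term_frequencies("team loses final game")
doc_finance = term_frequencies("stock market falls sharply")

# Two of the documents share three terms; the third shares none with them.
print(cosine_similarity(doc_sports_1, doc_sports_2))
print(cosine_similarity(doc_sports_1, doc_finance))  # 0.0
```

    A clustering algorithm can then group documents whose pairwise similarity is high.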

  • 8/2/2019 Dwdm Intro


    Illustrating Document Clustering

    Clustering Points: 3204 articles of the Los Angeles Times.

    Similarity Measure: How many words are common in these documents (after some word filtering).

    Category        Total Articles   Correctly Placed
    Financial       555              364
    Foreign         341              260
    National        273              36
    Metro           943              746
    Sports          738              573
    Entertainment   354              278


  • 8/2/2019 Dwdm Intro


    Clustering of S&P 500 Stock Data

    Observe stock movements every day.

    Clustering points: Stock-{UP/DOWN}

    Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day. We used association rules to quantify a similarity measure.

    Discovered Clusters and Industry Group:

    Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

    3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN (Industry Group: Financial-DOWN)

  • 8/2/2019 Dwdm Intro


    Classification: Definition

    Given a collection of records (training set)
      Each record contains a set of attributes; one of the attributes is the class.

    Find a model for the class attribute as a function of the values of other attributes.

    Goal: previously unseen records should be assigned a class as accurately as possible.
      A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

  • 8/2/2019 Dwdm Intro


    Classification Example

    Training set:

    Tid   Refund   Marital Status   Taxable Income   Cheat
    1     Yes      Single           125K             No
    2     No       Married          100K             No
    3     No       Single           70K              No
    4     Yes      Married          120K             No
    5     No       Divorced         95K              Yes
    6     No       Married          60K              No
    7     Yes      Divorced         220K             No
    8     No       Single           85K              Yes
    9     No       Married          75K              No
    10    No       Single           90K              Yes

    Test set:

    Refund   Marital Status   Taxable Income   Cheat
    No       Single           75K              ?
    Yes      Married          50K              ?
    No       Married          150K             ?
    Yes      Divorced         90K              ?
    No       Single           40K              ?
    No       Married          80K              ?

    The training set is used to learn a model, which is then applied to the test set.
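
    The sketch below shows what applying a model to these records could look like. The decision rule in `predict` is hand-written for illustration and merely happens to fit the ten training records; it is not learned by an algorithm and is not claimed to be the model from the slide's figure. Because the test records have unknown class ("?"), accuracy is measured only on the training set here.

```python
# (refund, marital_status, taxable_income_in_K, cheat) from the training table.
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabelled records from the test table.
test = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k):
    """Hypothetical hand-written rule: only unmarried people without a
    refund and with income of at least 80K are predicted to cheat."""
    if refund == "Yes" or marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"


def accuracy(model, labelled_records):
    """Fraction of labelled records whose class the model predicts correctly."""
    correct = sum(model(r, m, i) == label for r, m, i, label in labelled_records)
    return correct / len(labelled_records)


print(accuracy(predict, training))          # 1.0
print([predict(*record) for record in test])
```

    In practice the accuracy reported for a classifier should come from labelled records that were held out of training, as the definition slide describes.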


  • 8/2/2019 Dwdm Intro


    Classification: Application 1

    Direct Marketing

    Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

    Approach:
      Use the data for a similar product introduced before.
      We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
      Collect various demographic, lifestyle, and company-interaction related information about all such customers.
        Type of business, where they stay, how much they earn, etc.
      Use this information as input attributes to learn a classifier model.

    From [Berry & Linoff] Data Mining Techniques, 1997

  • 8/2/2019 Dwdm Intro


    Classification: Application 2

    Fraud Detection

    Goal: Predict fraudulent cases in credit card transactions.

    Approach:
      Use credit card transactions and the information on the account holder as attributes.
        When does a customer buy, what does he buy, how often does he pay on time, etc.
      Label past transactions as fraud or fair transactions. This forms the class attribute.
      Learn a model for the class of the transactions.
      Use this model to detect fraud by observing credit card transactions on an account.

  • 8/2/2019 Dwdm Intro


    Classification: Application 3

    Customer Attrition/Churn:

    Goal: To predict whether a customer is likely to be lost to a competitor.

    Approach:
      Use detailed record of transactions with each of the past and present customers, to find attributes.
        How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
      Label the customers as loyal or disloyal.
      Find a model for loyalty.

    From [Berry & Linoff] Data Mining Techniques, 1997

  • 8/2/2019 Dwdm Intro


    Classification: Application 4

    Sky Survey Cataloging

    Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
      3000 images with 23,040 x 23,040 pixels per image.

    Approach:
      Segment the image.
      Measure image attributes (features) - 40 of them per object.
      Model the class based on these features.
      Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

    From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

  • 8/2/2019 Dwdm Intro


    Classifying Galaxies




    Data Size:
      72 million stars, 20 million galaxies
      Object Catalog: 9 GB
      Image Database: 150 GB

    Class: Stages of Formation

    Attributes: Image features, characteristics of light waves received, etc.

    Courtesy: http://aps.umn.edu

  • 8/2/2019 Dwdm Intro


    Regression

    Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

    Greatly studied in statistics, neural network fields.

    Examples:
      Predicting sales amounts of new product based on advertising expenditure.
      Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
      Time series prediction of stock market indices.
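
    For the linear case with a single input variable, the dependency can be fitted by ordinary least squares. The sketch below is one such fit; the advertising and sales figures are invented for the example, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)  # 2.0 1.0 11.0
```

    Real data would not lie exactly on a line, so the fitted line would then minimize the squared prediction error rather than reproduce every point.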

    Deviation/Anomaly Detection

  • 8/2/2019 Dwdm Intro



    Detect significant deviations from normal behavior

    Applications:
      Credit Card Fraud Detection
      Network Intrusion Detection

    Typical network traffic at University level may reach over 100 million connections per day
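
    A very simple way to flag deviations from normal behavior, shown here only as an illustration, is to mark values that lie more than a chosen number of standard deviations from the mean. The threshold, the function name, and the sample counts are assumptions, and production fraud or intrusion detectors are considerably more elaborate.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold`
    population standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical daily connection counts (in millions) with one unusual day.
daily_connections = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_outliers(daily_connections))  # [50]
```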

    First Assignment

  • 8/2/2019 Dwdm Intro


    Assignment 1: Identify a problem from your own experience that you think would be amenable to data mining. Describe:

    (i) What the data is.
    (ii) What type of benefit you might hope to get from data mining.
    (iii) What type of data mining (classification, clustering, etc.) you think would be relevant.

    For each, illustrate with an example, e.g., if you think clustering is relevant, describe what you think a likely cluster might contain and what the real-world meaning would be.

    Submit two pages of 11-point single-spaced typeset text (leave 0.5 inch margins). Write your roll number and name.

    Last Date: 14-08-08 (5PM)

    References: Introductory chapters of any data mining book or any data mining paper and the PPTs of the first two classes.



    Top-10 Most Popular DM Algorithms:

  • 8/2/2019 Dwdm Intro


    18 Identified Candidates (I)


    Classification
      #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
      #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
      #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
      #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.

    Statistical Learning
      #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
      #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.

    Association Analysis
      #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
      #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.


  • 8/2/2019 Dwdm Intro


    The 18 Identified Candidates (II)

    Link Mining
      #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
      #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.

    Clustering
      #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
      #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.

    Bagging and Boosting
      #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.


  • 8/2/2019 Dwdm Intro


    The 18 Identified Candidates (III)

    Sequential Patterns
      #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
      #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.

    Integrated Mining
      #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.

    Rough Sets
      #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992

    Graph Mining
      #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

    Top-10 Algorithms Finally Selected at ICDM'06

  • 8/2/2019 Dwdm Intro


    #1: C4.5 (61 votes)
    #2: K-Means (60 votes)
    #3: SVM (58 votes)
    #4: Apriori (52 votes)
    #5: EM (48 votes)
    #6: PageRank (46 votes)
    #7: AdaBoost (45 votes)
    #7: kNN (45 votes)
    #7: Naive Bayes (45 votes)
    #10: CART (34 votes)



  • 8/2/2019 Dwdm Intro


    Challenges of Data Mining
      Scalability
      Dimensionality
      Complex and Heterogeneous Data
      Data Quality
      Data Ownership and Distribution
      Privacy Preservation
      Streaming Data

    Major Issues in Data Mining

  • 8/2/2019 Dwdm Intro


    Mining methodology
      Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
      Performance: efficiency, effectiveness, and scalability
      Pattern evaluation: the interestingness problem
      Incorporation of background knowledge
      Handling noise and incomplete data
      Parallel, distributed and incremental mining methods
      Integration of the discovered knowledge with existing one: knowledge fusion

    User interaction
      Data mining query languages and ad-hoc mining
      Expression and visualization of data mining results
      Interactive mining of knowledge at multiple levels of abstraction

    Applications and social impacts
      Domain-specific data mining & invisible data mining
      Protection of data security, integrity, and privacy



  • 8/2/2019 Dwdm Intro


    DM applications: Market Analysis and Management

    Where are the data sources for analysis?
      Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

    Target marketing
      Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.

    Determine customer purchasing patterns over time
      Conversion of single to a joint bank account: marriage, etc.

    Cross-market analysis
      Associations/correlations between product sales
      Prediction based on the association information

    DM applications: Market Analysis and Management.

  • 8/2/2019 Dwdm Intro


    Customer profiling
      data mining can tell you what types of customers buy what products (clustering or classification)

    Identifying customer requirements
      identifying the best products for different customers
      use prediction to find what factors will attract new customers

    Provides summary information
      various multidimensional summary reports
      statistical summary information (data central tendency and variation)

    DM applications: Corporate Analysis and Risk

  • 8/2/2019 Dwdm Intro



    Finance planning and asset evaluation
      cash flow analysis and prediction
      contingent claim analysis to evaluate assets
      cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

    Resource planning
      summarize and compare the resources and spending

    Competition
      monitor competitors and market directions
      group customers into classes and a class-based pricing procedure
      set pricing strategy in a highly competitive market


  • 8/2/2019 Dwdm Intro


    DM applications: Fraud Detection and Management


    widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

    Approach
      use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

    Examples
      auto insurance: detect a group of people who stage accidents to collect on insurance
      money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
      medical insurance: detect professional patients and ring of doctors and ring of references

    DM applications: Fraud Detection and Management

  • 8/2/2019 Dwdm Intro


    Detecting inappropriate medical treatment
      Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

    Detecting telephone fraud
      Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
      British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

    Retail
      Analysts estimate that 38% of retail shrink is due to dishonest



  • 8/2/2019 Dwdm Intro


    Other Applications of data mining

    Sports
      IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat

    Astronomy
      JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

    Internet Web Surf-Aid
      IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.


  • 8/2/2019 Dwdm Intro



    Data mining: Discovering interesting patterns from large amounts of data

    A natural evolution of database technology, in great demand, with wide applications

    A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

    Mining can be performed in a variety of information repositories

    Data mining systems and architectures

    Data warehousing

    Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

    Major issues in data mining