Data Mining Lecture 2

Data Mining

Lecture 2

Course Syllabus

Course topics:
• Introduction (Week1-Week2)
– What is Data Mining?
– Data Collection and Data Management Fundamentals
– The Essentials of Learning
– The Emerging Needs for Different Data Analysis Perspectives
• Data Management and Data Collection Techniques for Data Mining Applications (Week3-Week4)
– Data Warehouses: Gathering Raw Data from Relational Databases and Transforming It into Information
– Information Extraction and Data Processing Techniques
– Data Marts: The need for building highly specialized data storages for data mining applications

Week 2- Data vs. Knowledge

• Data:
– raw
– atomic
– (mostly!) operational
• Information:
– processed
– re-organized
– grouped
• Knowledge:
– patterns, models, findings ‘behind’ Information
• Wisdom:
– perfect orchestration of Knowledge

[Figure: hierarchy rising from Data (Operation) through Information (Analytic) and Knowledge to Wisdom]

“Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”
T. S. Eliot

Week 2- Evolution of Database and Information Systems

• 1960s (focus on efficient data collection): data collection, database creation, IMS and network DBMS
• 1970s (focus on structured data collection): relational data model, relational DBMS implementation
• 1980s (focus on information extraction): RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s – 2000s (focus on knowledge extraction and modeling): Data Mining, Data Warehousing, Multi Dimensional Databases

Week 2- Data Collection and Data Management Fundamentals – What is Data Warehouse

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process”
William H. Inmon

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

• data cleaning

• data integration

• data consolidation

Week 2- Data Collection and Data Management Fundamentals – What is OLAP

• object-oriented methodology comes in
• entities (cubes)
• attributes (dimensions)

• Multi Dimensional Database Modeling
– star schema
– snowflake schema
– fact constellation schema
• fact vs dimension
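To make the fact vs dimension distinction concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table names, columns, and rows (dim_time, dim_item, fact_sales) are hypothetical and chosen only for illustration; they do not come from the lecture or the text book.

```python
import sqlite3


def build_star_schema() -> sqlite3.Connection:
    """Create a tiny in-memory star schema: one fact table, two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension tables hold descriptive attributes (hypothetical columns).
        CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
        CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        -- The fact table holds numeric measures plus one foreign key per dimension.
        CREATE TABLE fact_sales (
            time_key INTEGER REFERENCES dim_time(time_key),
            item_key INTEGER REFERENCES dim_item(item_key),
            units_sold INTEGER,
            revenue REAL
        );
        """
    )
    # Made-up rows, just enough to run one aggregate query.
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                     [(1, 2023, "Q1"), (2, 2023, "Q2")])
    conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                     [(1, "laptop", "computers"), (2, "phone", "phones")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                     [(1, 1, 2, 2000.0), (1, 2, 5, 2500.0), (2, 1, 1, 1000.0)])
    return conn


def revenue_by_category(conn: sqlite3.Connection) -> dict:
    """Join the fact table to one dimension and sum a measure per category."""
    rows = conn.execute(
        """
        SELECT i.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_item AS i ON f.item_key = i.item_key
        GROUP BY i.category
        ORDER BY i.category
        """
    ).fetchall()
    return dict(rows)
```

Calling revenue_by_category(build_star_schema()) sums the revenue measure per item category. A snowflake schema would further normalize a dimension (for example, moving category into its own table), and a fact constellation would add a second fact table that shares these dimensions.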

Week 2- Data Collection and Data Management Fundamentals – OLAP Operations

• roll-up
• drill-down
• slice
• dice
• pivot (rotation)
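The five operations named above can be sketched on a toy cube in plain Python. The cube contents, the dimension names (city, country, quarter, item), and the function names are assumptions made for this illustration, not part of the lecture. A real OLAP engine would work on pre-aggregated multidimensional storage rather than a list of dictionaries.

```python
from collections import defaultdict

# Hypothetical cube cells: four dimensions and one measure ("sales").
DIMS = ("city", "country", "quarter", "item")
ROWS = [
    ("Vancouver", "Canada", "Q1", "phone", 10),
    ("Vancouver", "Canada", "Q2", "phone", 20),
    ("Toronto", "Canada", "Q1", "laptop", 30),
    ("Chicago", "USA", "Q1", "phone", 40),
    ("Chicago", "USA", "Q2", "laptop", 50),
]
CUBE = [dict(zip(DIMS + ("sales",), row)) for row in ROWS]


def aggregate(cells, dims):
    """Sum sales grouped by the given dimensions.

    Grouping by a coarser level (country) is a roll-up; grouping by a
    finer level (country, city) is the corresponding drill-down.
    """
    totals = defaultdict(int)
    for cell in cells:
        totals[tuple(cell[d] for d in dims)] += cell["sales"]
    return dict(totals)


def slice_cube(cells, dim, value):
    """Slice: fix a single dimension to one value."""
    return [c for c in cells if c[dim] == value]


def dice_cube(cells, allowed):
    """Dice: restrict two or more dimensions to sets of allowed values."""
    return [c for c in cells if all(c[d] in vals for d, vals in allowed.items())]


def pivot(cells, row_dim, col_dim):
    """Pivot: lay totals out as rows x columns; swapping the arguments rotates the view."""
    table = defaultdict(dict)
    for (row, col), total in aggregate(cells, (row_dim, col_dim)).items():
        table[row][col] = total
    return dict(table)
```

For example, aggregate(CUBE, ("country",)) rolls city-level cells up to countries, while aggregate(CUBE, ("country", "city")) drills back down to cities.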

Week 2- Data Collection and Data Management Fundamentals – What is Data Mart ?

Data warehouse: collects information about subjects that span the entire organization; its scope is enterprise-wide.

Which modeling schema? The fact constellation schema is commonly used, since it can model multiple, interrelated subjects.

Data mart: a department subset of the data warehouse that focuses on selected subjects; its scope is department-wide.

Which modeling schema? The star or snowflake schema is commonly used, since both are geared toward modeling single subjects.

Week2-OLAP vs Data Mining

On-Line Analytical Processing (OLAP) provides the ability to pose statistical and summary queries interactively (traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries).

Advantages relative to data mining:
• Can obtain a wider variety of results
• Generally faster to obtain results

Disadvantages relative to data mining:
• User must “ask the right question”
• Generally used to determine high-level statistical summaries, rather than specific relationships among instances

Week2-Reporting vs Data Mining

Reporting
• Last month’s sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy

Data Mining
• What characteristics do customers that lapse their policy have in common and how do they differ from customers who renew their policy?
• Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
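The contrast between the two lists can be sketched in a few lines of Python. The customer records and attribute names below are invented for illustration. The "mining" step is a deliberately crude stand-in (comparing lapse rates per attribute value), not a real data mining algorithm. The point is only that the report answers a question fixed in advance, while the mining-style search asks which attribute matters.

```python
from collections import Counter

# Hypothetical policy holders; every value here is made up for the example.
CUSTOMERS = [
    {"id": 1, "age_bracket": "18-30", "pays_by": "card", "lapsed": True},
    {"id": 2, "age_bracket": "18-30", "pays_by": "debit", "lapsed": True},
    {"id": 3, "age_bracket": "31-50", "pays_by": "debit", "lapsed": False},
    {"id": 4, "age_bracket": "31-50", "pays_by": "card", "lapsed": True},
    {"id": 5, "age_bracket": "51+", "pays_by": "debit", "lapsed": False},
    {"id": 6, "age_bracket": "51+", "pays_by": "debit", "lapsed": False},
]


def report_lapsed(customers):
    """Reporting: list the customers who lapsed (the question is fixed in advance)."""
    return [c["id"] for c in customers if c["lapsed"]]


def lapse_rate_by_value(customers, attribute):
    """Fraction of lapsed customers for each value of one attribute."""
    totals, lapsed = Counter(), Counter()
    for c in customers:
        totals[c[attribute]] += 1
        lapsed[c[attribute]] += c["lapsed"]  # True counts as 1, False as 0
    return {value: lapsed[value] / totals[value] for value in totals}


def most_discriminating_attribute(customers, attributes):
    """Mining-style search: which attribute best separates lapsers from renewers?

    Scores each attribute by the gap between its highest and lowest lapse rate.
    """
    def spread(attribute):
        rates = lapse_rate_by_value(customers, attribute).values()
        return max(rates) - min(rates)

    return max(attributes, key=spread)
```

Here report_lapsed returns a fixed list, whereas most_discriminating_attribute(CUSTOMERS, ["age_bracket", "pays_by"]) searches over the candidate attributes to suggest where lapsing and renewing customers differ.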

Week2- Data to Knowledge Pyramid

[Figure: pyramid; the potential to support business decisions increases toward the top]

Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA

Week 2- Data Mining Perspective to Knowledge Discovery

adapted from: U. Fayyad, et al. (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

[Figure: knowledge discovery process]
Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge
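The stages named in this figure (selection, preprocessing, data mining, interpretation/evaluation) can be sketched as a small pipeline. Everything below is an assumption made for illustration: the records, the choice of co-occurring item pairs as the "pattern", and the support threshold. It is one possible instance of the flow, not the method described by Fayyad et al.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data with a missing value and inconsistent capitalisation.
RAW = [
    {"customer": "a", "region": "north", "basket": ["milk", "bread", None]},
    {"customer": "b", "region": "north", "basket": ["Milk", "bread"]},
    {"customer": "c", "region": "south", "basket": ["beer"]},
    {"customer": "d", "region": "north", "basket": ["milk", "bread", "eggs"]},
]


def select(records, region):
    """Selection: keep only the target data relevant to the question."""
    return [r for r in records if r["region"] == region]


def preprocess(records):
    """Preprocessing: drop missing items, normalise case, remove duplicates."""
    return [sorted({item.lower() for item in r["basket"] if item}) for r in records]


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(patterns, n_baskets, min_support):
    """Interpretation/evaluation: keep only pairs with high enough support."""
    return {pair: n / n_baskets
            for pair, n in patterns.items()
            if n / n_baskets >= min_support}


def discover(records, region, min_support):
    """Run the four stages in order: data -> target data -> patterns -> knowledge."""
    baskets = preprocess(select(records, region))
    return evaluate(mine(baskets), len(baskets), min_support)
```

Each function takes the output of the previous stage, which mirrors the arrows in the figure. In practice the process is iterative, and results from evaluation often send the analyst back to an earlier stage.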

Week2- Data Mining Process Flow

[Figure: data mining process flow. Elements of the diagram: Background Knowledge, Goals for Learning, Knowledge Base, Database(s), Plan for Learning, Generate and Test Hypotheses, Discover Knowledge, Determine Knowledge Relevancy, Evolve Knowledge/Data, Discovery Algorithms, Visualization and Human Computer Interaction]

“In order to discover anything, you must be looking for something”
Laws of Serendipity

Week2-Simplified view of Data Mining Process Flow

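[Figure: architecture of a data mining system]

Components, from the data sources up to the user:
• Databases and Data Warehouse (fed through data cleaning & data integration, and filtering)
• Database or data warehouse server
• Data mining engine
• Pattern evaluation
• Graphical user interface

Also shown: Knowledge-base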

Week 2- Extended Perspective on Data Mining Process Flow

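[Figure: four-layer architecture; a mining query enters at the user interface and a mining result is returned]

• Layer 1, Data Repository: Databases and Data Warehouse (data cleaning, data integration, filtering)
• Layer 2, MDDB: MDDB and Meta Data, reached through the Database API (Filtering & Integration)
• Layer 3, OLAP/OLAM: OLAP Engine and OLAM Engine, reached through the Data Cube API
• Layer 4, User Interface: reached through the User GUI API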

Week 2- Essentials of Learning

Learning?
• can we formalize it?
• is it just a chemical activation?
• is it memorization?
• is it continuous node connecting/disconnecting on a dynamically changing brain network topology?

The Artificial Intelligence View:
• Learning is central to human knowledge and intelligence, and essential for building intelligent machines.
• Years of effort in AI have shown that trying to build intelligent computers by programming all the rules cannot be done; automatic learning is crucial. For example, we humans are not born with the ability to understand language; we learn it. It makes sense to try to have computers learn language instead of trying to program it all in.

The Software Engineering View:
• Machine Learning allows us to program computers by example, which can be easier than writing code the traditional way.

The Stats View:
• Machine Learning is the marriage of computer science and statistics.
• Computational techniques are applied to statistical problems. Machine Learning has been applied to a vast number of problems in many contexts, beyond the typical statistics problems. Machine Learning is often designed with different considerations than statistics (e.g., speed is often more important than accuracy).
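As a small illustration of "programming by example" from the Software Engineering view, the sketch below contrasts a rule whose threshold a person typed into the code with a rule whose threshold is chosen from labelled examples. The readings, labels, and the hand-picked value 38.0 are hypothetical. The learner is a one-feature threshold search (a decision stump), picked here only because it is short.

```python
def hand_coded_rule(reading: float) -> bool:
    """Traditional programming: a person writes the threshold into the code."""
    return reading >= 38.0  # hypothetical hand-picked value


def learn_threshold(examples):
    """Programming by example: choose the threshold with the fewest mistakes.

    `examples` is a list of (reading, label) pairs; a reading is predicted
    positive when it is greater than or equal to the threshold.
    """
    candidates = sorted({reading for reading, _ in examples})

    def errors(threshold):
        return sum((reading >= threshold) != label for reading, label in examples)

    return min(candidates, key=errors)


def make_rule(threshold: float):
    """Turn a learned threshold into a callable rule."""
    return lambda reading: reading >= threshold


# Hypothetical labelled examples.
EXAMPLES = [(36.5, False), (37.0, False), (37.4, False), (38.2, True), (39.0, True)]
```

Given different examples, learn_threshold would return a different rule without any change to the code, which is the sense in which the examples take over the role of the program.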

Week 2-End

• Please check the web site for Learning Theory and its Essentials: http://www.infed.org/biblio/b-learn.htm

• Read: Course Text Book, Chapter 3