Data Mining
Lecture 2
Course Syllabus
• Course topics:
• Introduction (Week 1-Week 2)
– What is Data Mining?
– Data Collection and Data Management Fundamentals
– The Essentials of Learning
– The Emerging Needs for Different Data Analysis Perspectives
• Data Management and Data Collection Techniques for Data Mining Applications (Week 3-Week 4)
– Data Warehouses: Gathering Raw Data from Relational Databases and Transforming It into Information
– Information Extraction and Data Processing Techniques
– Data Marts: The need for building highly specialized data stores for data mining applications
Week 2- Data vs. Knowledge
• Data:
– raw
– atomic
– (mostly!) operational
• Information:
– processed
– re-organized
– grouped
• Knowledge:
– patterns, models, findings ‘behind’ Information
• Wisdom:
– perfect orchestration of Knowledge
[Figure: levels labeled Data (Operation), Information (Analytic), Knowledge, Wisdom]
“Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”
T. S. Eliot
Week 2- Evolution of Database and Information Systems
• 1960s (focus on efficient data collection): data collection, database creation, IMS and network DBMS
• 1970s (focus on structured data collection): relational data model, relational DBMS implementation
• 1980s (focus on information extraction): RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s–2000s (focus on knowledge extraction and modeling): data mining, data warehousing, multidimensional databases
Week 2- Data Collection and Data Management Fundamentals – What is a Data Warehouse?
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process”
William H. Inmon
Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
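A minimal sketch of what "consistency in naming conventions and encoding structures" can mean in practice. The two sources, their field names (`cust_id`, `CustomerNo`, `sex`, `gender`) and their code tables are assumptions invented for illustration, not taken from any real system.

```python
# Two hypothetical operational sources that describe customers differently.
crm_rows = [{"cust_id": 1, "sex": "M"}, {"cust_id": 2, "sex": "F"}]
billing_rows = [{"CustomerNo": "0003", "gender": 0}, {"CustomerNo": "0001", "gender": 1}]

# Assumed encoding tables: each source's own codes mapped to one warehouse code.
CRM_SEX = {"M": "male", "F": "female"}
BILLING_GENDER = {1: "male", 0: "female"}


def from_crm(row):
    """Rename and recode a CRM row into the warehouse convention."""
    return {"customer_id": int(row["cust_id"]), "sex": CRM_SEX[row["sex"]]}


def from_billing(row):
    """Rename and recode a billing row into the warehouse convention."""
    return {"customer_id": int(row["CustomerNo"]), "sex": BILLING_GENDER[row["gender"]]}


def integrate(crm, billing):
    """Merge both sources, keeping one record per customer_id."""
    merged = {}
    for record in [from_crm(r) for r in crm] + [from_billing(r) for r in billing]:
        merged.setdefault(record["customer_id"], record)  # first source wins
    return [merged[key] for key in sorted(merged)]


warehouse_customers = integrate(crm_rows, billing_rows)
for record in warehouse_customers:
    print(record)
```

Customer 1 appears in both sources but ends up as a single warehouse record with one agreed name and one agreed code for each attribute.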
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
• data cleaning
• data integration
• data consolidation
• object-oriented methodology comes in
• entities (cubes)
• attributes (dimensions)
Week 2- Data Collection and Data Management Fundamentals – What is OLAP
taken from the Text Book
• Multi Dimensional Database Modeling
– star schema
– snowflake schema
– fact constellation schema
• fact vs dimension
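To make the fact vs dimension distinction concrete, here is a minimal star-schema sketch in plain Python. The table names, keys, and values (`fact_sales`, `dim_product`, `dim_time`, `amount`) are hypothetical and chosen only for illustration.

```python
# Dimension tables: descriptive attributes, keyed by a surrogate id.
dim_product = {1: {"name": "laptop", "category": "computers"},
               2: {"name": "phone", "category": "mobile"}}
dim_time = {10: {"quarter": "Q1", "year": 2003},
            11: {"quarter": "Q2", "year": 2003}}

# Fact table: one row per sale, holding foreign keys plus numeric measures.
fact_sales = [
    {"product_id": 1, "time_id": 10, "units": 2, "amount": 2000.0},
    {"product_id": 2, "time_id": 10, "units": 5, "amount": 1500.0},
    {"product_id": 1, "time_id": 11, "units": 1, "amount": 1000.0},
]


def sales_by(dimension, key, attribute):
    """Sum the `amount` measure grouped by one attribute of one dimension."""
    totals = {}
    for fact in fact_sales:
        label = dimension[fact[key]][attribute]  # follow the foreign key
        totals[label] = totals.get(label, 0.0) + fact["amount"]
    return totals


amount_by_category = sales_by(dim_product, "product_id", "category")
amount_by_quarter = sales_by(dim_time, "time_id", "quarter")
print(amount_by_category)
print(amount_by_quarter)
```

The fact table sits in the middle and only stores keys and measures; every descriptive label comes from a dimension table. In a snowflake schema the dimension tables would themselves be split further (for example, `category` moved to its own table).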
Week 2- Data Collection and Data Management Fundamentals – OLAP Operations
taken from the Text Book
• roll-up
• drill-down
• slice
• dice
• pivot (rotation)
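The operations listed above can be sketched on a tiny cube held in a dictionary. The cities, quarters, items, and the city-to-country hierarchy are made-up assumptions; drill-down is simply the move from the rolled-up result back to the finer-grained `cube`.

```python
from collections import defaultdict

# A tiny hypothetical cube: (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "phone"): 10,
    ("Toronto", "Q2", "phone"): 12,
    ("Toronto", "Q1", "laptop"): 4,
    ("Vancouver", "Q1", "phone"): 7,
    ("Vancouver", "Q2", "laptop"): 3,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada"}  # assumed hierarchy


def roll_up(cells):
    """Climb the location hierarchy: aggregate city cells up to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(out)


def slice_(cells, quarter):
    """Fix one value on one dimension (here: quarter) and drop that dimension."""
    return {(c, i): u for (c, q, i), u in cells.items() if q == quarter}


def dice(cells, cities, items):
    """Select a sub-cube by restricting two or more dimensions."""
    return {k: u for k, u in cells.items() if k[0] in cities and k[2] in items}


def pivot(cells_2d):
    """Rotate a 2-D view by swapping its two axes."""
    return {(b, a): u for (a, b), u in cells_2d.items()}


country_level = roll_up(cube)
q1_slice = slice_(cube, "Q1")
toronto_phones = dice(cube, {"Toronto"}, {"phone"})
print(country_level)
print(q1_slice)
print(pivot(q1_slice))
print(toronto_phones)
```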
Week 2- Data Collection and Data Management Fundamentals – What is a Data Mart?
• Data warehouse: collects information about subjects that span the entire organization; its scope is enterprise-wide.
– Which modeling schema? The fact constellation schema is commonly used, since it can model multiple, interrelated subjects.
• Data mart: a department subset of the data warehouse that focuses on selected subjects; its scope is department-wide.
– Which modeling schema? The star or snowflake schema is commonly used, since both are geared toward modeling single subjects.
Week 2- OLAP vs Data Mining
On-Line Analytical Processing (OLAP) provides the ability to pose statistical and summary queries interactively (traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries).
Advantages relative to data mining:
• Can obtain a wider variety of results
• Generally faster to obtain results
Disadvantages relative to data mining:
• User must “ask the right question”
• Generally used to determine high-level statistical summaries, rather than specific relationships among instances
Week 2- Reporting vs Data Mining
Reporting
• Last month's sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy
Data Mining
• What characteristics do customers who lapse their policy have in common, and how do they differ from customers who renew their policy?
• Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
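The contrast can be sketched in a few lines. The customer records and attribute names below are invented, and the "mining" step is a deliberately simple stand-in (a lapse rate per attribute value), not a real data mining algorithm.

```python
customers = [
    {"age_bracket": "18-30", "sex": "F", "pays_monthly": True, "lapsed": True},
    {"age_bracket": "18-30", "sex": "M", "pays_monthly": True, "lapsed": True},
    {"age_bracket": "31-50", "sex": "F", "pays_monthly": False, "lapsed": False},
    {"age_bracket": "31-50", "sex": "M", "pays_monthly": True, "lapsed": True},
    {"age_bracket": "51+", "sex": "F", "pays_monthly": False, "lapsed": False},
    {"age_bracket": "51+", "sex": "M", "pays_monthly": False, "lapsed": False},
]


def report_lapsed(rows):
    """Reporting: answer a fixed, pre-defined question (who lapsed?)."""
    return [i for i, row in enumerate(rows) if row["lapsed"]]


def lapse_rate_by_value(rows, target="lapsed"):
    """Mining-style scan: for every attribute value, measure the lapse rate."""
    stats = {}
    for row in rows:
        for attribute, value in row.items():
            if attribute == target:
                continue
            hits, total = stats.get((attribute, value), (0, 0))
            stats[(attribute, value)] = (hits + int(row[target]), total + 1)
    rates = {key: hits / total for key, (hits, total) in stats.items()}
    # Highest lapse rate first; ties broken by attribute name for determinism.
    return sorted(rates.items(), key=lambda kv: (-kv[1], str(kv[0])))


lapsed_ids = report_lapsed(customers)
ranked = lapse_rate_by_value(customers)
print(lapsed_ids)
print(ranked[:3])
```

The report returns exactly what was asked for. The scan instead looks across all attributes and surfaces which characteristics the lapsed customers share, without the analyst naming the attribute in advance.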
Week 2- Data to Knowledge Pyramid
[Figure: pyramid; potential to support business decisions increases toward the top]
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Figure labels, top to bottom as extracted:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration
• OLAP, MDA
• Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Week 2- Data Mining Perspective to Knowledge Discovery
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
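The stages named in this figure (selection, preprocessing, data mining, interpretation/evaluation) can be sketched as a chain of small functions. The purchase records, the pair-counting "mining" step, and the `min_support` threshold of 0.5 are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical raw data: (customer, item, quantity); None marks a missing value.
raw_data = [
    ("ann", "bread", 1), ("ann", "butter", 1), ("bob", "bread", 2),
    ("bob", "butter", None), ("cat", "bread", 1), ("cat", "milk", 1),
    ("dan", "Bread ", 1), ("dan", "butter", 1),
]


def select(rows):
    """Selection: keep only the fields relevant to the task (customer, item)."""
    return [(customer, item) for customer, item, _qty in rows]


def preprocess(rows):
    """Preprocessing: normalise item spelling and drop duplicate pairs."""
    return sorted({(customer, item.strip().lower()) for customer, item in rows})


def mine(rows):
    """Data mining: count how often each pair of items shares a customer."""
    baskets = {}
    for customer, item in rows:
        baskets.setdefault(customer, set()).add(item)
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs, len(baskets)


def evaluate(pairs, n_baskets, min_support=0.5):
    """Interpretation/evaluation: keep patterns above a support threshold."""
    return {pair: count / n_baskets for pair, count in pairs.items()
            if count / n_baskets >= min_support}


target_data = select(raw_data)
preprocessed = preprocess(target_data)
patterns, n = mine(preprocessed)
knowledge = evaluate(patterns, n)
print(knowledge)
```

Each intermediate variable corresponds to one box in the flow: `target_data`, `preprocessed`, `patterns`, and finally `knowledge`.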
Week 2- Data Mining Process Flow
[Figure: process flow with components Background Knowledge, Goals for Learning, Knowledge Base, Database(s), Plan for Learning, Discover Knowledge, Determine Knowledge Relevancy, Evolve Knowledge/Data, Generate and Test Hypotheses, Visualization and Human Computer Interaction, Discovery Algorithms]
“In order to discover anything, you must be looking for something”
Laws of Serendipity
Week 2- Simplified view of Data Mining Process Flow
[Figure: layered architecture, bottom to top]
• Databases and Data Warehouse (data cleaning & data integration, filtering)
• Database or data warehouse server
• Data mining engine
• Pattern evaluation
• Graphical user interface
• Knowledge-base
Week 2- Extended Perspective on Data Mining Process Flow
[Figure: four-layer architecture, top to bottom]
• Layer 4, User Interface: mining query in, mining result out
• User GUI API
• Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine
• Data Cube API
• Layer 2, MDDB: MDDB, Meta Data
• Database API; Filtering & Integration
• Layer 1, Data Repository: Databases, Data Warehouse; data cleaning, data integration, filtering
Week 2- Essentials of Learning
Learning?
• Can we formalize it?
• Is it just a chemical activation?
• Is it memorization?
• Is it continuous connecting/disconnecting of nodes on a dynamically changing brain network topology?
Week 2- Essentials of Learning
The Artificial Intelligence View:
• Learning is central to human knowledge and intelligence, and essential for building intelligent machines.
• Years of effort in AI have shown that trying to build intelligent computers by programming all the rules cannot be done; automatic learning is crucial. For example, we humans are not born with the ability to understand language; we learn it. It makes sense to try to have computers learn language instead of trying to program it all in.
Week 2- Essentials of Learning
The Software Engineering View:
• Machine Learning allows us to program computers by example, which can be easier than writing code the traditional way.
The Stats View:
• Machine Learning is the marriage of computer science and statistics.
• Computational techniques are applied to statistical problems. Machine Learning has been applied to a vast number of problems in many contexts, beyond the typical statistics problems. Machine Learning is often designed with different considerations than statistics (e.g., speed is often more important than accuracy).
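"Programming by example" can be illustrated with a very small sketch: instead of a human choosing the constant in a rule, the cut-off is chosen from labeled examples. The spam-by-message-length data and all names below are made up, and the single-threshold learner is only one simple possibility.

```python
# Labeled examples: (message length in characters, is_spam). Entirely made up.
examples = [(12, False), (20, False), (35, False), (80, True), (95, True), (150, True)]


def hand_coded_rule(length):
    """Traditional programming: a human picks the rule and the constant."""
    return length > 100


def learn_threshold(data):
    """Programming by example: choose the cut-off with the fewest training errors."""
    values = sorted(length for length, _ in data)
    # Candidate cut-offs are midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        return sum((length > cut) != label for length, label in data)

    return min(candidates, key=errors)


threshold = learn_threshold(examples)


def learned_rule(length):
    """The rule produced from the examples rather than written by hand."""
    return length > threshold


print(threshold)
print([learned_rule(length) for length, _ in examples])
```

On these examples the hand-picked constant misclassifies two of the spam messages, while the learned cut-off separates the training data. If the examples change, only the data needs updating, not the code.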
Week 2- End
• Please check the web site for Learning Theory and its Essentials: http://www.infed.org/biblio/b-learn.htm
• Read: Course Text Book Chapter 3