Advanced Data Mining: Introduction


http://delab.csd.auth.gr/courses/c_dm_pms


Material Covered

• Chapter 1 from Ullman’s book.
• Many slides are from the “Data Mining: Concepts and Techniques” book.


Why Data Mining?

• The Explosive Growth of Data: from terabytes to petabytes
  – Data collection and data availability
    • Automated data collection tools, database systems, Web, computerized society
  – Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: Remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets


from “Data Mining: Concepts and Techniques”

Evolution of Sciences

• Before 1600, empirical science

• 1600-1950s, theoretical science

– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

• 1950s-1990s, computational science

– Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)

– Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

• 1990-now, data science

– The flood of data from new scientific instruments and simulations

– The ability to economically store and manage petabytes of data online

– The Internet and computing Grid that make all these archives universally accessible

– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

• Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002


What Is Data Mining?

• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  – Data mining: a misnomer?
• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Negative examples:
  – Simple search and query processing
  – (Deductive) expert systems

from “Data Mining: Concepts and Techniques”


Knowledge Discovery (KDD) Process

• This is a view from typical database systems and data warehousing communities

• Data mining plays an essential role in the knowledge discovery process

Figure (stages of the process, in order): Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

from “Data Mining: Concepts and Techniques”
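The stages named on this slide can be sketched end to end on toy data. This is only an illustration under stated assumptions: the two "stores", their records, the function names, and the support threshold are all invented here, and the "mining" step is deliberately trivial (a frequency count) so that the flow between stages stays visible.

```python
from collections import Counter

# Two hypothetical source "databases" of purchase records (invented toy data).
store_a = [
    {"customer": "c1", "item": "milk"},
    {"customer": "c1", "item": "bread"},
    {"customer": "c2", "item": "milk"},
    {"customer": None, "item": "beer"},    # missing customer id
]
store_b = [
    {"customer": "c3", "item": "Milk "},   # inconsistent formatting
    {"customer": "c3", "item": "bread"},
    {"customer": "c4", "item": ""},        # missing item
]


def clean(records):
    """Data cleaning: drop incomplete records and normalize item names."""
    cleaned = []
    for r in records:
        if r["customer"] and r["item"].strip():
            cleaned.append(
                {"customer": r["customer"], "item": r["item"].strip().lower()}
            )
    return cleaned


def integrate(*sources):
    """Data integration: merge several cleaned sources into one 'warehouse'."""
    return [r for source in sources for r in source]


def select(warehouse, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in warehouse]


def mine(values):
    """Data mining (deliberately trivial): count how often each value occurs."""
    return Counter(values)


def evaluate(patterns, min_support):
    """Pattern evaluation: keep only patterns that meet a support threshold."""
    return {k: v for k, v in patterns.items() if v >= min_support}


warehouse = integrate(clean(store_a), clean(store_b))
task_relevant = select(warehouse, "item")
patterns = mine(task_relevant)
interesting = evaluate(patterns, min_support=2)
print(interesting)  # {'milk': 3, 'bread': 2}
```

In a real setting each stage would be far heavier (and the mining step would be an actual algorithm), but the order of the stages is the point of the sketch.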


Data Mining in Business Intelligence

Figure (pyramid; potential to support business decisions increases toward the top). Layers, from top to bottom:

• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

from “Data Mining: Concepts and Techniques”


Directions in modeling

• Pattern extraction, model discovery
• Statistical modeling
  – E.g., decide that the data comes from a Gaussian distribution, then estimate its parameters μ and σ.
• Machine learning
  – Train an algorithm, then apply to new data.
• Results of Complex Queries (computational approaches)
  – E.g., summarization of the importance of a webpage in the form of a “pagerank” value.
  – E.g., prominent feature extraction, such as frequent itemsets and similar items.
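Two of the directions above are small enough to sketch directly. The first function assumes the data is Gaussian and computes the maximum-likelihood estimates of μ and σ; the second counts frequent item pairs by brute force. The function names, sample values, and baskets are assumptions made up for illustration, and the pair counting is a naive version, not an optimized frequent-itemset algorithm.

```python
import math
from collections import Counter
from itertools import combinations


def fit_gaussian(samples):
    """Maximum-likelihood estimates of the mean and standard deviation."""
    n = len(samples)
    mu = sum(samples) / n
    variance = sum((x - mu) ** 2 for x in samples) / n  # MLE divides by n, not n - 1
    return mu, math.sqrt(variance)


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical order; set() ignores repeats.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(samples)
print(mu, sigma)  # 5.0 2.0

baskets = [
    ["milk", "bread"],
    ["milk", "bread", "beer"],
    ["beer", "chips"],
    ["milk", "beer"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)  # {('bread', 'milk'): 2, ('beer', 'milk'): 2}
```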


Multi-Dimensional View of Data Mining

• Knowledge to be mined (or: Data mining functions)
  – Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  – Descriptive vs. predictive data mining
  – Multiple/integrated functions and mining at multiple levels
• Data to be mined
  – Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
• Techniques utilized
  – Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
• Applications adapted
  – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

from “Data Mining: Concepts and Techniques”


Meaningfulness of patterns

• A big data-mining risk is that you will “discover” patterns that are meaningless.

• Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find meaningless patterns
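A rough way to see the principle at work is to search pure noise for "patterns". The sketch below draws a random target and many random features, all independent by construction, and counts how many features correlate strongly with the target anyway. The sample size, number of features, threshold, and seed are arbitrary choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def count_spurious(n_samples, n_features, threshold, seed=0):
    """Count random features whose |correlation| with a random target exceeds threshold."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]  # independent of target
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


# Same small data set (20 samples), but many more places to look.
few = count_spurious(n_samples=20, n_features=10, threshold=0.5)
many = count_spurious(n_samples=20, n_features=2000, threshold=0.5)
print(few, many)
```

Every hit is meaningless, since nothing in the data is related to anything else. The number of hits grows with the number of features examined while the amount of data stays fixed, which is the situation the principle warns about.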


Rhine Paradox

• Joseph Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception

• He devised an experiment where subjects were asked to guess 10 hidden cards – red or blue

• He discovered that almost 1 in 1000 had ESP – they were able to get all 10 right!

• He told these people they had ESP and called them in for another test of the same type

• Alas, he discovered that almost all of them had lost their ESP

• What did he conclude?

• He concluded that you shouldn’t tell people they have ESP; it causes them to lose it!
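The arithmetic behind the paradox can be checked in a few lines, assuming each guess is an independent fair choice between red and blue. The number of subjects below is a hypothetical figure picked for illustration; the slide does not say how many people were tested.

```python
from fractions import Fraction


def p_all_correct(n_cards, n_colors=2):
    """Probability of guessing every card correctly by pure chance."""
    return Fraction(1, n_colors) ** n_cards


p = p_all_correct(10)
print(p)  # 1/1024, i.e. "almost 1 in 1000"

subjects = 10_000  # hypothetical number of people tested
expected_lucky = subjects * p
print(float(expected_lucky))  # 9.765625 people expected to get all 10 right by luck

# A lucky guesser has the same 1/1024 chance of repeating the feat in the retest,
# so almost all of them are expected to "lose" their ESP.
expected_repeat = expected_lucky * p
print(float(expected_repeat) < 0.01)  # True
```

Chance alone therefore accounts for both observations: roughly 1 in 1000 perfect scores in the first round, and almost none in the second.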


Major Challenges in Data Mining

• Efficiency and scalability of data mining algorithms

• Parallel, distributed, stream, and incremental mining methods

• Handling high-dimensionality

• Handling noise, uncertainty, and incompleteness of data

• Incorporation of constraints, expert knowledge, and background knowledge in data mining

• Pattern evaluation and knowledge integration

• Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks

• Application-oriented and domain-specific data mining

• Invisible data mining (embedded in other functional modules)

• Protection of security, integrity, and privacy in data mining

from “Data Mining: Concepts and Techniques”

KDnuggets polls 1-3 (poll charts, not reproduced in this transcript)


Things Useful to Know

• Probability
• Linear Algebra basics
• Hash functions
• Indices
• Secondary storage
• Power laws


Big Data

Sizes:
• Tiny: 0s
• Small: 1000s (fitting in memory)
• Medium: 1000000 (may not fit in memory)
• Large: 1000000000
• Huge: 1000000000000 ++

From Graefe’s “New algorithms for join and grouping operations”, 2011.
