Knowledge discovery thru data mining

Knowledge Discoverythrough

Data Mining

C. DevakumarIndian Council of Agricultural Research

New Delhi-110 [email protected]

Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Classification of data mining systems

Major issues in data mining

Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web,

computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”—Data mining—Automated analysis

of massive data sets

Evolution of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining, data warehousing, multimedia databases, and Web

databases 2000s

Stream data management and mining Data mining and its applications Web technology (XML, data integration) and global information systems

What is Data Mining?

Many DefinitionsNon-trivial extraction of implicit, previously

unknown and potentially useful information from data

Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is not Data Mining? Look up phone number in phone directory

•Query a Web search engine for information

Origins of Data MiningDraws ideas from machine learning/AI, pattern

recognition, statistics, and database systemsTraditional Techniques

may be unsuitable due to Enormity of dataHigh dimensionality

of dataHeterogeneous,

distributed nature of data

Machine Learning/Pattern

Recognition

Statistics/AI

Data Mining

Database systems

+ =Data

Interestingnesscriteria

Hiddenpatterns

+ =Data


Hiddenpatterns

Type of Patterns

+ =Data


Hiddenpatterns

Type of data Type of Interestingness criteria

Type of DataTabular (Ex: Transaction data)

RelationalMulti-dimensional

Spatial (Ex: Remote sensing data)Temporal (Ex: Log information)

Streaming (Ex: multimedia, network traffic)Spatio-temporal (Ex: GIS)

Tree (Ex: XML data)Graphs (Ex: WWW, BioMolecular data)Sequence (Ex: DNA, activity logs) Text, Multimedia …

Type of InterestingnessFrequencyRarityCorrelation Length of occurrence (for sequence and temporal

data)

Consistency Repeating / periodicity “Abnormal” behavior Other patterns of interestingness…

Statistics:

ConceptualModel

(Hypothesis)

StatisticalReasoning

“Proof”(Validation of Hypothesis)

Data mining:

MiningAlgorithmBased on InterestingnessData

Pattern (model, rule, hypothesis)discovery

Explores Your Data

Finds Patterns

Performs Prediction

s

Predictive Analysis

Presentation

Exploration Discovery

Passive

Interactive

Proactive

Role of Software

Business Insight

Canned reporting

Ad-hoc reporting

OLAP

Data mining

Data Mining Tasks

Prediction MethodsUse some variables to predict unknown or

future values of other variables.

Description MethodsFind human-interpretable patterns that

describe the data.

Data Mining Tasks...

Classification [Predictive]Regression [Predictive]Clustering [Descriptive]Association Rule Discovery [Descriptive]Semi-supervised Learning

Semi-supervised ClusteringSemi-supervised Classification

Challenges of Data MiningScalabilityDimensionalityComplex and Heterogeneous DataData QualityData Ownership and DistributionPrivacy PreservationStreaming Data

Data Mining and Business Intelligence Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

Decision

MakingData Presentation

Visualization Techniques

Data MiningInformation Discovery

Data ExplorationStatistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses

Data SourcesPaper, Files, Web documents, Scientific experiments, Database Systems

Data Mining: Confluence of Multiple Disciplines

Data Mining

Database Technology Statistics

MachineLearning

PatternRecognition

AlgorithmOther

Disciplines

Visualization

Architecture: Typical Data Mining System

data cleaning, integration, and selection

Database or Data Warehouse Server

Data Mining Engine

Pattern Evaluation

Graphical User Interface

Knowledge-Base

Database Data Warehouse

World-WideWeb

Other InfoRepositories

Multi-Dimensional View of Data Mining Data to be mined

Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW

Knowledge to be minedCharacterization, discrimination, association, classification,

clustering, trend/deviation, outlier analysis, etc.Multiple/integrated functions and mining at multiple levels

Techniques utilizedDatabase-oriented, data warehouse (OLAP), machine

learning, statistics, visualization, etc. Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification SchemesGeneral functionality

Descriptive data mining

Predictive data mining

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be

discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted

Data Mining FunctionalitiesMultidimensional concept description: Characterization and

discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Frequent patterns, association, correlation vs. causality

Classification and prediction

Construct models (functions) that describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars

based on (gas mileage)

Predict some unknown or missing numerical values

Data Mining Functionalities (2)Cluster analysis

Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Maximizing intra-class similarity & minimizing interclass similarityOutlier analysis

Outlier: Data object that does not comply with the general behavior of the data

Noise or exception? Useful in fraud detection, rare events analysisTrend and evolution analysis

Trend and deviation: e.g., regression analysisSequential pattern mining: e.g., digital camera large SD memoryPeriodicity analysisSimilarity-based analysis

Other pattern-directed or statistical analyses

Major Issues in Data Mining Mining methodology

Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web

Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge

fusion User interaction

Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts Domain-specific data mining & invisible data mining Protection of data security, integrity, and privacy

Bioinformatics, Computational Biology, Data Mining

Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems.

Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.Genomes (viruses, bacteria, fungi, plants, insects,…)Proteins and ProteomesBiological SequencesMolecular Function and Structure

Data Mining is searching for knowledge in dataKnowledge mining from databasesKnowledge extractionData/pattern analysisData dredgingKnowledge Discovery in Databases (KDD)

28

Data production at the levels of molecules, cells, organs, organisms, populations

Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …

Prediction of Molecular Function and Structure

Computational biology: synthesis (simulations) and analysis (machine learning)

29

Problems in Bioinformatics Domain

Tokenization, which splits a text document into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spacesFiltering methods remove words like articles, conjunctions, prepositions, etc. Lemmatization methods try to map verb forms to the infinite tense and nouns to their singular form. Stemming methods attempt to build the basic forms of words, for example, by stripping the plural 's' from nouns, the 'ing' from verbs, or other affixes. Additional linguistic preprocessingN-grams individualization, which is n-word generic sequences that do not necessarily correspond to an idiomatic use; Anaphora resolution, which can identify relationships among a linguistic expression (anaphora) and its preceding phrase, thus, determining the corresponding reference; Part-of-speech tagging (POS) determines the part of speech tag, noun, verb, adjective, etc. for each term; Text chunking aims at grouping adjacent words in a sentence; Word Sense Disambiguation (WSD) tries to resolve the ambiguity in the meaning of single words or phrases; Parsing produces a full parse tree of a sentence (subject, object, etc.).

Castellano, M. et al. A bioinformatics knowledge discovery in text application for gridComputing BMC Bioinformatics 2009, 10(Suppl 6):S23

BIOINFORMATICS ARCHITECTUREThe Layer Architecture consisting of GATE 4.0 Toolkit for Text Mining, a Middleware solution written by Java API, the grid infrastructure middleware, and a physical layer that consists of a Gnu/Linux Operating System.The integrated development environment, GATE was used for the text mining process. GATE operated on a collection of scientific publications in full text available on MedLine/Pubmed (in pdf format) using the process of Text Mining

Technology Platform

What is New in SQL Server 2008?Data Mining EnhancementsIn addition to other new aspects of SQL

Server:Enhanced Mining Structures

Easier to prepare and test your models Models allow for cross-validation Filtering

Algorithm Updates Improved Time Series algorithm combining best of

ARIMA and ARTXP “What-If” analysis

Microsoft Data Mining FrameworkSupplements CRISP-DM

Microsoft DM CompetitorsSAS, largest market

share of DM, specialised product for traditional experts

SPSS (Clementine), strength in statistical analysis

IBM (Intelligent Miner) tied to DB2, interoperates with Microsoft through PMML

Oracle (10g), supports Java APIs

Angoss (KnowledgeSTUDIO), result visualisation, works with SQL Server

KXEN, supports OLAP and Excel

Data mining: Discovering interesting patterns from large amounts of data

A natural evolution of database technology, in great demand, with wide applications

Includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Mining can be performed in a variety of information repositories

Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

Data mining systems and architectures

Major issues in data mining; Let’s mine for valuable gems of knowledge in our databases!

Knowledge discovery thru data mining

Education

Transcript of Knowledge discovery thru data mining