Knowledge Discovery through Data Mining
C. Devakumar
Indian Council of Agricultural Research
New Delhi-110 [email protected]
Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Major issues in data mining
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability: automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining, data mining and its applications, Web technology (XML, data integration) and global information systems
What is Data Mining?
Many definitions:
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is not Data Mining?
Looking up a phone number in a phone directory
Querying a Web search engine for information
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: Data Mining, drawing on Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
Data + Interestingness criteria = Hidden patterns
Type of data + Type of interestingness criteria = Type of patterns
Type of Data
Tabular (Ex: Transaction data)
Relational
Multi-dimensional
Spatial (Ex: Remote sensing data)
Temporal (Ex: Log information)
Streaming (Ex: multimedia, network traffic)
Spatio-temporal (Ex: GIS)
Tree (Ex: XML data)
Graphs (Ex: WWW, BioMolecular data)
Sequence (Ex: DNA, activity logs)
Text, Multimedia …
Type of Interestingness
Frequency
Rarity
Correlation
Length of occurrence (for sequence and temporal data)
Consistency
Repeating / periodicity
“Abnormal” behavior
Other patterns of interestingness…
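To make the frequency and rarity criteria concrete, here is a minimal sketch that computes how often each item occurs in a handful of transactions and then flags items as frequent or rare. The transactions, the thresholds (0.6 and 0.25), and the helper name item_support are assumptions invented for this illustration; they do not come from the slides.

```python
from collections import Counter

def item_support(transactions):
    """Return the fraction of transactions that contain each item."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))  # count an item at most once per transaction
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

# Hypothetical tabular (transaction) data
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "saffron"],
    ["bread", "butter"],
]

support = item_support(transactions)

# Interestingness by frequency: items present in at least 60% of transactions
frequent = sorted(i for i, s in support.items() if s >= 0.6)
# Interestingness by rarity: items present in at most 25% of transactions
rare = sorted(i for i, s in support.items() if s <= 0.25)

print("frequent:", frequent)
print("rare:", rare)
```

The same data yields different "hidden patterns" depending on which criterion is applied, which is the point of the Data + Interestingness criteria = Hidden patterns equation.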
Statistics: conceptual model (hypothesis), statistical reasoning, “proof” (validation of hypothesis)
Data mining: data, mining algorithm based on interestingness, pattern (model, rule, hypothesis) discovery
Data mining explores your data, finds patterns, and performs predictions
[Figure labels: Predictive Analysis, Presentation, Exploration, Discovery; Passive, Interactive, Proactive; Role of Software; Business Insight; Canned reporting, Ad-hoc reporting, OLAP, Data mining]
Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables.
Description methods: find human-interpretable patterns that describe the data.
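The difference between the two kinds of task can be sketched on one small data set. The example below is only an illustration: the fertilizer and yield figures are invented, and ordinary least squares with a single predictor is just one simple choice of prediction method, not one named in the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: fertilizer dose (kg/ha) and crop yield (t/ha)
dose = [0, 20, 40, 60, 80]
crop_yield = [2.0, 2.5, 3.0, 3.5, 4.0]

# Description: summarize what is already in the data
summary = {"min": min(crop_yield), "max": max(crop_yield), "mean": mean(crop_yield)}

# Prediction: use one variable (dose) to estimate an unseen value of another (yield)
a, b = fit_line(dose, crop_yield)
predicted = a + b * 100  # estimated yield at a dose that was never observed

print(summary)
print(round(predicted, 2))
```

The summary only restates the observed data, while the fitted line is used to estimate a value outside it.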
Data Mining Tasks...
Classification [Predictive]
Regression [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Semi-supervised Learning
Semi-supervised Clustering
Semi-supervised Classification
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Data Mining and Business Intelligence
Layers, listed from highest to lowest potential to support business decisions:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles: End User, Business Analyst, Data Analyst, DBA
Data Mining: Confluence of Multiple Disciplines
Data Mining draws on: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, Other Disciplines
Architecture: Typical Data Mining System
Data cleaning, integration, and selection
Database or Data Warehouse Server
Data Mining Engine
Pattern Evaluation
Graphical User Interface
Knowledge-Base
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
Multi-Dimensional View of Data Mining
Data to be mined: Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined: Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels
Techniques utilized: Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted: Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining Functionalities
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown or missing numerical values
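The "classify cars based on gas mileage" example can be sketched with a very simple model. The code below uses a nearest-centroid rule, chosen here only because it is short; the slides do not name a particular classifier, and the mileage values, the labels "low" and "high", and the function names are all hypothetical.

```python
from statistics import mean

def train_centroids(samples):
    """samples: list of (value, label). Return the mean value per class label."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def classify(value, centroids):
    """Assign the label whose class mean is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Hypothetical training data: (gas mileage in km per litre, class label)
cars = [(8, "low"), (9, "low"), (10, "low"), (18, "high"), (20, "high"), (22, "high")]

# The "model" is one number per class, learned from labelled examples
centroids = train_centroids(cars)

# Future prediction: label cars that were not in the training data
label_a = classify(12, centroids)
label_b = classify(17, centroids)
print(label_a, label_b)
```

The model describes the classes (their mean mileage) and distinguishes between them for new cars, matching the two roles the slide gives to classification models.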
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera, then large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
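For cluster analysis, where no class labels are given, a minimal sketch is shown below. It uses Lloyd's k-means algorithm in one dimension, which is one common way to group similar values together; the slides do not prescribe it. The house prices, the starting centroids, and the function name kmeans_1d are assumptions for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the grouping has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical house prices (in lakh rupees), with no class labels attached
prices = [30, 32, 35, 80, 85, 90]

# Deliberately poor starting centroids, to show that the iterations repair them
centroids, clusters = kmeans_1d(prices, centroids=[30, 32])
print(clusters)
```

Each iteration pulls the members of a group closer to their own centroid, which is the sense in which similarity within a class is increased and similarity between classes is reduced.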
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining and invisible data mining
Protection of data security, integrity, and privacy
Bioinformatics, Computational Biology, Data Mining
Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems.
Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.
Genomes (viruses, bacteria, fungi, plants, insects, …)
Proteins and Proteomes
Biological Sequences
Molecular Function and Structure
Data Mining is searching for knowledge in data. Also known as:
Knowledge mining from databases
Knowledge extraction
Data/pattern analysis
Data dredging
Knowledge Discovery in Databases (KDD)
Data production at the levels of molecules, cells, organs, organisms, populations
Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …
Prediction of Molecular Function and Structure
Computational biology: synthesis (simulations) and analysis (machine learning)
Problems in Bioinformatics Domain
Text preprocessing steps:
Tokenization splits a text document into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spaces.
Filtering methods remove words like articles, conjunctions, prepositions, etc.
Lemmatization methods try to map verb forms to the infinitive and nouns to their singular form.
Stemming methods attempt to build the basic forms of words, for example, by stripping the plural 's' from nouns, the 'ing' from verbs, or other affixes.
Additional linguistic preprocessing:
N-grams individualization identifies n-word generic sequences that do not necessarily correspond to an idiomatic use.
Anaphora resolution identifies the relationship between a linguistic expression (anaphora) and its preceding phrase, thus determining the corresponding reference.
Part-of-speech (POS) tagging determines the part-of-speech tag (noun, verb, adjective, etc.) for each term.
Text chunking aims at grouping adjacent words in a sentence.
Word Sense Disambiguation (WSD) tries to resolve the ambiguity in the meaning of single words or phrases.
Parsing produces a full parse tree of a sentence (subject, object, etc.).
Castellano, M. et al. A bioinformatics knowledge discovery in text application for grid computing. BMC Bioinformatics 2009, 10(Suppl 6):S23
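The first few preprocessing steps can be sketched in a few lines. This is not the GATE pipeline from the cited paper, only a toy illustration: the stop-word list is a tiny made-up sample, naive_stem is a crude suffix stripper rather than a real stemming algorithm, the tokenizer also drops digits, and the sample sentence is invented.

```python
import re

# Tiny illustrative stop-word list; real systems use much larger ones
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are"}

def tokenize(text):
    """Replace runs of non-letter characters with a space, lower-case, and split."""
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()

def remove_stop_words(tokens):
    """Filtering: drop articles, conjunctions, prepositions, and similar words."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Crude suffix stripping: drop 'ing' or a plural 's'. Not a real stemmer."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def ngrams(tokens, n):
    """All sequences of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The proteins are binding to the receptors, and the genes are expressed."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]
bigrams = ngrams(stems, 2)

print(stems)
print(bigrams)
```

Steps such as lemmatization, POS tagging, anaphora resolution, and parsing need linguistic resources and are left out of this sketch.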
BIOINFORMATICS ARCHITECTURE
The layer architecture consists of the GATE 4.0 toolkit for text mining, a middleware solution written with the Java API, the grid infrastructure middleware, and a physical layer that consists of a GNU/Linux operating system.
The integrated development environment GATE was used for the text mining process. GATE operated on a collection of full-text scientific publications available on MedLine/PubMed (in PDF format).
Technology Platform
What is New in SQL Server 2008?
Data Mining Enhancements
In addition to other new aspects of SQL Server:
Enhanced Mining Structures
Easier to prepare and test your models
Models allow for cross-validation
Filtering
Algorithm Updates
Improved Time Series algorithm combining best of ARIMA and ARTXP
“What-If” analysis
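As background for the cross-validation item above, the general idea of k-fold cross-validation can be sketched as follows. This is a generic illustration of how data is split into folds, not SQL Server's API or its implementation; the function name k_fold_indices and the sample sizes are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Hypothetical case: 10 rows split into 3 folds
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row is used for testing exactly once and for training in the remaining folds, so a model's accuracy can be averaged over k separate train/test runs.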
Microsoft Data Mining Framework
Supplements CRISP-DM
Microsoft DM Competitors
SAS, largest market share of DM, specialised product for traditional experts
SPSS (Clementine), strength in statistical analysis
IBM (Intelligent Miner) tied to DB2, interoperates with Microsoft through PMML
Oracle (10g), supports Java APIs
Angoss (KnowledgeSTUDIO), result visualisation, works with SQL Server
KXEN, supports OLAP and Excel
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
Includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
Let’s mine for valuable gems of knowledge in our databases!