Knowledge Discovery in Databases T1: introduction
P. Berka, 2019 1/22
Knowledge Discovery in Databases
(Information Harvesting, Data Archeology, Data
Mining, Knowledge Distillery, ...)
Non-trivial process of identifying valid, novel,
potentially useful and ultimately understandable
patterns from data (Fayyad et al., 1996)
Data mining involves the use of sophisticated data
analysis tools to discover previously unknown, valid
patterns and relationships in large data sets
(Adriaans, Zantinge, 1999)
Analysis of observational data sets to find
unsuspected relationships and summarize data in
novel ways that are both understandable and useful
to the data owner (Hand, Mannila, Smyth, 2001)
Data mining is the process of analyzing hidden
patterns of data from different perspectives and
categorizing them into useful information
(techopedia.org, 2011)
Three sources: databases (query languages, OLAP),
statistics (data analysis), artificial intelligence
(machine learning)
Related concepts: data science, business intelligence
KDD Tasks
(Klosgen, Zytkow, 1997)
classification/prediction: the task is to
find knowledge applicable to automatically
process new examples
description: the task is to find dominant
structure or relationships
search for "nuggets": the task is to find
partial, novel and surprising knowledge
(Chapman et al., 2000)
data description and summarization:
concise description of characteristics of
the data, typically in elementary and
aggregated form
segmentation: separation of the data into
interesting and meaningful subgroups or
classes
concept description: understandable
description of concepts or classes to gain
insight
classification: build classification models
(sometimes called classifiers) which assign
the correct class label to previously unseen
and unlabeled objects
prediction: similar to classification, but the
target attribute (class) is not a qualitative
discrete attribute but a continuous one.
Prediction also often deals with
time-dependent concepts
dependency analysis: describe significant
dependencies (or associations) between data
items or events
deviation detection: focuses on discovering
the most significant changes in the data
from previously measured or normative
values
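Deviation detection as described above can be illustrated with a minimal sketch. It assumes that "normative values" are given as a list of earlier measurements and that a deviation is any new value more than a chosen number of standard deviations from their mean. The data, the function name find_deviations and the threshold of 3 are illustrative assumptions, not part of the original slides.

```python
from statistics import mean, stdev

def find_deviations(baseline, current, threshold=3.0):
    """Return (index, value, z-score) for values far from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-scores are undefined")
    flagged = []
    for i, value in enumerate(current):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, value, round(z, 2)))
    return flagged

# Hypothetical daily energy consumption (kWh): earlier measurements vs. new ones.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
current = [101, 99, 140, 100]

deviations = find_deviations(baseline, current)
print(deviations)  # [(2, 140, 20.0)]
```

A z-score is only one possible deviation measure; it is used here because it needs nothing beyond the standard library.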
Application areas of KDD
Segmentation and classification (clients of a bank
or insurance company),
Credit Risk Assessment,
Fraud detection,
Prediction of stock market prices,
Prediction of energy consumption,
Intrusion detection,
Churn Analysis (telco services providers, internet
providers),
Microarray data analysis (molecular biology),
Targeted marketing,
Medical diagnosis,
Market Basket Analysis.
Application areas (KDnuggets Poll)
CRM/Consumer analytics, 16.8%
Finance, 15.2%
Banking, 14.1%
Health care, 13.2%
Fraud Detection, 13.0%
https://www.kdnuggets.com/2018/04/poll-analytics-data-science-ml-applied-2017.html
Market basket analysis: data exploration
Market basket analysis: dependency analysis
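Dependency analysis on market basket data is commonly expressed as association rules with support and confidence. The following is a minimal sketch under that assumption, restricted to rules between two single items. The baskets, the function names and both thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions (one set of purchased items per basket).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Find rules lhs -> rhs between single items that pass both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, baskets)
        if s < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            conf = s / support({lhs}, baskets)
            if conf >= min_confidence:
                rules.append((lhs, rhs, round(s, 2), round(conf, 2)))
    return rules

rules = pair_rules(baskets)
for lhs, rhs, s, conf in rules:
    print(f"{lhs} -> {rhs}: support={s}, confidence={conf}")
```

On this toy data only the pair bread/butter is frequent enough, giving the rules bread -> butter (confidence 0.75) and butter -> bread (confidence 1.0).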
Market basket analysis: classification
KDD Standards
1. Methodologies
(Marban et al., 2009)
5A Developed in the mid-1990s by SPSS. The name is an
acronym for the steps performed:
Assess – assess the requirements of the project,
Access – access the available data,
Analyze – perform the analyses,
Act – turn knowledge into actions,
Automate – deploy the models in an automatic way.
SEMMA Developed in the mid-1990s by SAS:
Sample the data by creating one or more data
tables,
Explore the data by searching for relationships,
trends or anomalies,
Modify the data by creating, selecting, and
transforming the variables,
Model the relationships between input and output
variables by using various data mining techniques,
Assess the quality of the models.
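The five SEMMA steps can be traced on a toy example. This is a rough sketch in plain Python and not SAS code: the data set, the 7/3 split, the min-max rescaling and the midpoint-threshold "model" are all assumptions chosen to keep the example short.

```python
import random

# Hypothetical data: (income in thousands, loan decision "A" or "n").
rows = [(12, "n"), (15, "n"), (18, "n"), (22, "n"), (25, "A"),
        (30, "A"), (34, "A"), (40, "A"), (45, "A"), (50, "A")]

# Sample: create two data tables (training and test) by a reproducible shuffle.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
train, test = shuffled[:7], shuffled[7:]

# Explore: compare the average income of the two classes in the training table.
def class_mean(table, label):
    values = [x for x, y in table if y == label]
    return sum(values) / len(values)

class_means = {label: class_mean(train, label) for label in ("A", "n")}

# Modify: rescale income to [0, 1] using the range seen in the training table.
lo = min(x for x, _ in train)
hi = max(x for x, _ in train)

def scale(x):
    return (x - lo) / (hi - lo)

# Model: predict "A" when the scaled income exceeds the midpoint
# between the two scaled class means.
threshold = (scale(class_means["A"]) + scale(class_means["n"])) / 2

def classify(x):
    return "A" if scale(x) > threshold else "n"

# Assess: accuracy of the model on the held-out test table.
accuracy = sum(classify(x) == y for x, y in test) / len(test)
print(f"threshold={threshold:.2f}, test accuracy={accuracy:.2f}")
```

With only four "n" rows and three rows held out, the training table always contains both classes, so the class means are always defined.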
CRISP-DM Currently a de-facto standard supported by most
data mining systems
2. Standards to describe models
Predictive Model Markup Language (PMML)
XML-based standard developed by the Data Mining
Group (www.dmg.org) for describing data,
data transformations and the resulting models. Main parts
of a PMML document:
Header
Data Dictionary
Data Transformations
Model
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header copyright="P.B." description="An example decision tree model."/>
  <!-- Data dictionary: the five categorical fields and their allowed values -->
  <DataDictionary numberOfFields="5">
    <DataField name="income" optype="categorical" dataType="string">
      <Value value="low"/>
      <Value value="high"/>
    </DataField>
    <DataField name="account" optype="categorical" dataType="string">
      <Value value="low"/>
      <Value value="medium"/>
      <Value value="high"/>
    </DataField>
    <DataField name="sex" optype="categorical" dataType="string">
      <Value value="male"/>
      <Value value="female"/>
    </DataField>
    <DataField name="unemployed" optype="categorical" dataType="string">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
    <DataField name="loan" optype="categorical" dataType="string">
      <Value value="A"/>
      <Value value="n"/>
    </DataField>
  </DataDictionary>
  <!-- Decision tree that predicts the value of the loan field -->
  <TreeModel modelName="loan approval decision tree" functionName="classification">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="account"/>
      <MiningField name="sex"/>
      <MiningField name="unemployed"/>
      <MiningField name="loan" usageType="predicted"/>
    </MiningSchema>
    <!-- Root node: its predicate is always true -->
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>
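How an application might consume such a tree can be sketched with Python's standard xml.etree.ElementTree module. The sketch makes several simplifying assumptions: it works on a shortened, namespace-free TreeModel fragment (a full PMML document declares a namespace, so element names would have to be namespace-qualified), it supports only the True predicate and SimplePredicate with operator="equal", and when no child node matches it returns the score of the last matching node rather than a null prediction. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free fragment of a PMML tree (illustrative only).
PMML_TREE = """
<TreeModel modelName="loan approval decision tree" functionName="classification">
  <Node score="A">
    <True/>
    <Node score="A">
      <SimplePredicate field="income" operator="equal" value="high"/>
    </Node>
    <Node score="n">
      <SimplePredicate field="income" operator="equal" value="low"/>
      <Node score="A">
        <SimplePredicate field="account" operator="equal" value="high"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

def matches(node, record):
    """Evaluate the predicate of a node against a record (dict of field values)."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None or pred.get("operator") != "equal":
        return False  # other predicate types are not handled in this sketch
    return record.get(pred.get("field")) == pred.get("value")

def descend(node, record):
    """Follow the first matching child; stop when no child matches."""
    for child in node.findall("Node"):
        if matches(child, record):
            return descend(child, record)
    return node.get("score")

def predict(xml_text, record):
    root = ET.fromstring(xml_text.strip()).find("Node")
    if root is None or not matches(root, record):
        return None
    return descend(root, record)

print(predict(PMML_TREE, {"income": "high"}))                    # A
print(predict(PMML_TREE, {"income": "low", "account": "high"}))  # A
print(predict(PMML_TREE, {"income": "low", "account": "low"}))   # n
```

In practice a PMML-aware scoring engine would be used instead of hand-written traversal; the sketch only shows that the nested Node elements encode the decision path.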
3. Programming standards (API)
SQL/MM Data Mining
Standard interface for accessing data
mining algorithms from relational databases
OLE DB for Data Mining
API developed by Microsoft
Java Data Mining
Example of a model definition in OLE DB for Data Mining:
CREATE MINING MODEL CreditRisk
(
    CustomerId long key,
    Income text discrete,
    Account text discrete,
    Sex text discrete,
    Unemployed boolean discrete,
    Loan text discrete predict
)
USING Microsoft_Decision_Trees
Data Mining Systems
cover the whole KDD process (from data
preprocessing to model evaluation),
offer more data mining algorithms (than single-purpose
machine learning systems),
focus on visualization (both in how the system
is used and in how data and results are presented
and interpreted).
Types of systems:
Data mining suites - stand-alone tools that
implement a number of data mining and data
pre-processing algorithms (commercial or
free/open-source)
Programming tools -
Cloud solutions – Software as a Service
SAS Enterprise Miner
IBM SPSS Modeler (Clementine)
KDD tools - usage 2017, 2018
https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html
KDD tools - current trends
1. Automation of the whole KDD process to
make the analyses available to non-expert
users
OptiML in BigML
RapidMiner: Turbo Prep, Auto Model
Weka: Auto-WEKA
Full automation – DataRobot
2. Extending cloud platforms of key IT
companies
Azure Machine Learning Studio,
Google Cloud Platform,
Machine Learning on AWS
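One ingredient of such automation is trying several candidate models or parameter settings and keeping the one that validates best. The toy sketch below illustrates only that idea and makes no claim about how OptiML, Auto Model, Auto-WEKA or DataRobot work internally. It assumes a one-dimensional k-nearest-neighbour classifier, leave-one-out validation and hypothetical data.

```python
from collections import Counter

# Hypothetical labeled data: (feature value, class label).
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.5, "b"),
        (3.0, "a"), (3.5, "b"), (4.0, "b"), (4.5, "b")]

def knn_predict(train, x, k):
    """Majority class among the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each row from all the other rows."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == y
    return hits / len(data)

def auto_select(data, candidates=(1, 3, 5)):
    """Evaluate every candidate k and return the best one with all scores."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

best_k, scores = auto_select(data)
print(f"selected k={best_k}, leave-one-out accuracies: {scores}")
```

Real systems search over many algorithm families and preprocessing options, but the select-by-validation loop has the same shape.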