Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“...
Transcript of Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“...
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 1/14
Databases
Flat file
client transaction
account
Relational database
Querying: QBE vs. SQL
date name surname street city account no. payment balance
980103 Jan Novak Dlouha 5 Praha 1 9945371 100.00 100.00
980105 Jan Novak Dlouha 5 Praha 1 9945371 1500.00 1600.00
980106 Jan Novak Dlouha 5 Praha 1 9945371 -1550.00 50.00
980106 Karel Nemec Podolska 4 Praha 2 24867134 3000.00 6000.00
980107 Karel Nemec Podolska 4 Praha 2 24867134 -4000.00 2000.00
980108 Jan Novak Dlouha 5 Praha 1 9945371 -150.00 -100.00
980111 Karel Nemec Podolska 4 Praha 2 24867134 5000.00 7000.00
. . .
id_client
name
surname
street
city
. . .
id_transaction
id_account
date
payment
balance
. . .
id_account
id_client
. . .
SELECT client.name, client.surname, client.street, client.city, account.
id_client, transaction.balance
FROM client, account, transaction
WHERE client.id_client = account.id_account;
AND transaction.id_account = account.id_account;
AND transaction.balance < 100;
GROUP BY client.city
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 2/14
Decision support using databases
1. Executive Information Systems
managerial information systems designed for
fast access to information
user friendly interface but less flexible
2. On-Line Analytical Processing
multidimensional concept of data storage and
manipulation (DATA CUBE),
intuitive data manipulation,
work with data collected from heterogeneous
data sources – data conversion is necessary,
use of analytical methods – statistical
summaries, what-if analysis,
Client/Server architecture,
support for multi-user access,
OLAP results stored separately from the
source data,
dynamic manipulation with sparse matrices,
missing values processing,
unlimited number of dimensions and
aggregation levels.
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 3/14
Database structure
basketID date product city sales
000001 10.1. screws Praha 241
000001 10.1. nuts Praha 61
000001 10.1. screws Brno 17
000001 10.1. nails Brno 42
000002 10.2. screws Praha 92
000002 10.2. nails Praha 27
000002 10.2. screws Kladno 35
Table SALES
Praha Brno Kladno
screws nuts nails screws nuts nails screws nuts nails
10.1. 241 61 17 42
10.2. 92 27 35
Sparse matrix
product date region
city sales
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 4/14
Data manipulation:
slice and dice
roll up vs. drill down
Microsoft Data Analyzer
sales
aggregation
cities
aggregation
regions
aggregation
products
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 5/14
Implementation:
hypercube
multicube
„true“ OLAP vs. ROLAP
MOLAP vs. ROLAP
physical implementation of the system:
star schema,
snowflake schema.
SQL engine
OLAP engine
summarized
data
granular data
User interface
ROLAP MOLAP
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 6/14
dimension store fact table dimension time
dimension product
Star
dimension store
fact table region
fact table district
fact table store
Snowflake
STORE KEY
PRODUCT KEY
PERIOD KEY
cena
počet
STORE KEY
data o prodejně
město
ID okresu
data o okresu
ID regionu
data o regionu
úroveň (level)
PRODUCT KEY
data o produktu
značka
výrobce
úroveň (level)
PERIOD KEY
data o období
rok
čtvrtletí
měsíc
den
STORE KEY
PRODUCT KEY
PERIOD KEY
cena
počet
STORE KEY
data o prodejně
město
ID okresu
data o okresu
ID regionu
data o regionu
úroveň (level) ID okresu
PRODUCT KEY
PERIOD KEY
cena
počet
ID regionu
PRODUCT KEY
PERIOD KEY
cena
počet
ID okresu
data o okresu
ID regionu
ID regionu
data o regionu
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 7/14
OLAP functionality „reachable“ by classic
technologies
Microsoft Access
Microsoft Excel
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 8/14
3. Data warehouse subject oriented,
integrated,
time variant,
nonvolatile data repository used for decision
support
ddddd Data
současná detailní
data
starší detailní data
m
e
t
a
d
a
t
a
středně
sumarizovaná data
silně
sumarizovaná data
1. vrstva
Data
Mart
Data Warehouse 2. vrstva
3. vrstva
produkční databáze
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 9/14
4. Business Intelligence
computerized tools and techniques used to
collect, integrate, analyze, interpret and
present (business) data and information. 4 main
components:
data warehouse
business analytics (querying, reporting,
statistical analyses, data mining)
business performance management
user interface (presentations)
Major Components of BI (Turban a kol., 2007)
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 10/14
Databases meet Data Mining
1. Query languages for KDD
Mine Rule (Boulicaut, 1998) - association rules
MSQL (Imielinski, Virmani, 1999) – association
rules, records
MSQL - rules
MSQL - exceptions
MINE RULE Priklad AS
SELECT DISTINCT 1..n produkt AS BODY, 1..1 produkt AS HEAD,
SUPPORT, CONFIDENCE
FROM Prodej
WHERE BODY.město = HEAD.město
AND BODY.datum = HEAD.datum
EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.5
Emp(Id,Age,Sex,Salary,Position,Car)
GetRules (Emp)
into R
where support > 0.1 and confidence > 0.9
SelectRules (R)
where body has {Age=*), (Sex=*)}
and body is {(Car=*)}
Select *
from Emp
where violates all (GetRules (Emp)
where body is {(Age=*)}
and head is {(Salary=*)}
and confidence > 0.3)
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 11/14
DMQL (Han et al., 1996) – different rules
DMQL – association rules
DMQL – classification rules
DMQL – discrimination rules
2. API standards
SQL/MM Data Mining
OLE DB for Data Mining
Find classification rules for computer_science_students
according to average_grading
related to birth_place, address
from student
where major = „computer_science“ and birth_place = „Canada“
Find association rules
related to average_grading, birth_place, address
from student
where major = „computer_science“ and birth_place = „Canada“
with support threshold = 0.05
with confidence threshold = 0.7
Find discriminant rule
for cs_grads with status = „graduate“
in contrast to cs_undergrads with status = „undergraduate“
related to average_grading, birth_place, address
from student
where major = „computer_science“ and birth_place = „Canada“
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 12/14
3. Extending databse systems with data
mining („in-database“ data mining)
MicroSoft SQL Server 2005
- Decision trees, association rules, naive bayes
classifier, neuronal nets, text mining, sequence
clustering, time series
- uses OLE DB DM and
PMML to describe
tasks and models) and
BI Development
studio as interface
Oracle Data Mining
tools for classification, regression, anomaly
detection, association mining, clustering, feature
extraction, attribute ranking
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 13/14
New Trend – Big Data
Big Data are data that cannot be handled using
standard data base systems and standard data
analysis tools.
Big Data sources: sensors, surveillance systems,
mobile phones, GPS devices, RFID readers,
social networks, computer networks, web logs,
scientific data …
Big Data Characteristics (4 V’s):
Volume: the size of Big Data goes beyond
standard data storage and manipulation
techniques
Velocity: Big Data is often available in real
time
Variety: Big Data contains not only
structured data (e.g. in tabular or relational
form) but also texts, images, audio or video
Veracity: the quality and reliability of Big
Data can vary
Knowledge Discovery in Databases T2: databases
P. Berka, 2018 14/14
Challenges for collecting, storing and
manipulating Big Data
New forms of data storage: file systems
(e.g. Google File System), NoSQL databases
(data that cannot be expressed using
tabular relations of relational databases)
New forms of computation: parallel
computing, distributed computing, grid
computing, cloud computing
Example solution: Apache Hadoop
Hadoop distributed file system
Map/Reduce: divide and conquer approach to
break-down intractable problem into
tractable sub-problems