Data Processing over very Large Relational Databases
Data Processing over Very Large Databases
Ing. Ľuboš Takáč
Supervisor: doc. Ing. Michal Zábovský, PhD.
Faculty of Management Science and Informatics
University of Žilina
Large Databases
• VLDB (very large databases)
• Relational Databases with hundreds of tables and millions of rows
The Problem
• How to understand a relational database model well enough to find information in it.
• Orientation in a large RDB is difficult because of the complexity of the RDB model.
• Modification and further development of the RDB.
Existing approaches
• Database metrics
• Database visualization
• Database to ontology mapping and examination of ontology
Database Metrics
• A database metric is a function that assigns a numeric value to an object from the database.
• Examples of table metrics
– DRT(T) – depth of relational tree
– TS(T) – table size
– RD(T) – referential degree
– …
• Rankings – combinations of metrics with different weights.
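A minimal sketch of how a table metric and a weighted ranking might be computed. The toy schema, the reading of RD(T) as the number of outgoing foreign keys, the use of row counts for TS(T), and the max-normalised weighted sum are all assumptions made for illustration; the slides do not define them.

```python
from collections import defaultdict

# Hypothetical toy schema: one (referencing_table, referenced_table) pair per foreign key.
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]
# Hypothetical row counts, used here as the table size TS(T).
ROW_COUNTS = {"customers": 100, "orders": 500, "order_items": 2000, "products": 50}


def referential_degree(foreign_keys):
    """Count outgoing foreign keys per table (an assumed reading of RD(T))."""
    rd = defaultdict(int)
    for source, _target in foreign_keys:
        rd[source] += 1
    return dict(rd)


def ranking(metrics, weights):
    """Combine metrics into one score per table.

    Each metric is divided by its maximum so that metrics on different
    scales are comparable, then multiplied by its weight and summed.
    """
    tables = set().union(*(values.keys() for values in metrics.values()))
    scores = {}
    for table in tables:
        total = 0.0
        for name, values in metrics.items():
            peak = max(values.values()) or 1  # avoid division by zero
            total += weights[name] * values.get(table, 0) / peak
        scores[table] = total
    return scores


def example_scores():
    metrics = {"RD": referential_degree(FOREIGN_KEYS), "TS": ROW_COUNTS}
    return ranking(metrics, {"RD": 0.5, "TS": 0.5})
```

Changing the weights shifts the ranking towards structurally important tables (high RD) or towards large tables (high TS).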
RDB Visualization
• Database schema visualization.
• The standard ER diagram is insufficient for a large RDB model.
SchemaBall
• Visualization of large or complex RDB schemas.
• Using RDB metrics and rankings.
• We implemented and enhanced this solution.
SchemaBall
Visualization of RDB schema graph
• Vertex- and edge-weighted graph based on RDB metrics.
• Using Gephi for visualization
– automatically generated layout
– interactive visualization (selection and examination of nodes and edges)
– using graph algorithms
Analysis of the RDB Graph
• Three approaches
– graph of the RDB model (vertex – table, edge – foreign key relation)
– alternative (vertex – table, edge – foreign key relation for each tuple)
– graph of tuples (vertex – tuple, edge – foreign key relation between tuples)
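The three graph constructions can be sketched on a toy example as follows. The table names, the tuple identifiers, and the helper names are hypothetical; the sketch only illustrates how the vertex degrees differ between the three views.

```python
from collections import Counter

# Hypothetical schema-level foreign keys: (referencing_table, referenced_table).
FOREIGN_KEYS = [("orders", "customers")]

# Hypothetical tuple-level references: ((table, primary_key), (table, primary_key)).
TUPLE_REFERENCES = [
    (("orders", 1), ("customers", 1)),
    (("orders", 2), ("customers", 1)),
    (("orders", 3), ("customers", 2)),
]


def build_graphs(foreign_keys, tuple_references):
    """Return the edge lists of the three graphs described above."""
    # 1) Graph of the RDB model: tables as vertices, one edge per foreign key.
    model_edges = list(foreign_keys)
    # 2) Alternative: tables as vertices, one edge per referencing tuple.
    per_tuple_edges = [(src[0], dst[0]) for src, dst in tuple_references]
    # 3) Graph of tuples: tuples as vertices, one edge per tuple-level reference.
    tuple_edges = list(tuple_references)
    return model_edges, per_tuple_edges, tuple_edges


def vertex_degrees(edges):
    """Count how many edge endpoints touch each vertex."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees
```

In the second view a table's degree grows with the number of referencing rows, which is why its degree axis can reach millions on a real database.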
Analysis of the RDB Graph – first approach
[Figure: distribution function of vertex degree; x-axis: vertex degree (1 to 29), y-axis: probability (0.00 to 0.16)]
Analysis of the RDB Graph – second approach
[Figure: distribution function of vertex degree; x-axis: vertex degree (0 to 1.35E+07), y-axis: probability (0 to 1)]
Analysis of the RDB Graph – third approach
[Figure: distribution function of vertex degree; x-axis: vertex degree, y-axis: count]
Analysis of the RDB Graph – Scale-free networks
• A connected graph whose vertex degrees follow a Yule-Simon distribution.
• Degree distribution P(k) ~ k^(−γ), with the exponent γ usually between 2 and 3
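One rough way to check whether a degree distribution looks scale-free is to estimate the exponent of a power law P(k) ~ k^(−γ) from the observed degrees. The sketch below uses a plain least-squares fit on the log-log histogram; this is an illustrative shortcut rather than the procedure used in the talk, and maximum-likelihood estimators are generally more reliable on real data.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each observed degree k to its relative frequency P(k)."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: count / total for k, count in sorted(counts.items())}


def fit_power_law_exponent(distribution):
    """Estimate gamma as the negated least-squares slope of log P(k) vs. log k."""
    points = [
        (math.log(k), math.log(p))
        for k, p in distribution.items()
        if k > 0 and p > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


def synthetic_degrees():
    """Hypothetical sample in which the count of degree k is exactly 3600 / k^2."""
    degrees = []
    for k in range(1, 7):
        degrees.extend([k] * (3600 // (k * k)))
    return degrees
```

Applied to `synthetic_degrees()`, the fit recovers an exponent of 2, the lower end of the typical range.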
Visualization of RDB schema network
Analysis of the RDB Graph – Conclusion
• The RDB model is scale-free.
• To understand an RDB, you must first understand its centers (there are only a few of them).
• The very useful metric NR(T) – number of references – was validated by analyzing the RDB graph.
• We created 2 new metrics based on the three approaches mentioned.
A Method for Analyzing Large RDB
• Find the components of the schema graph (tables = vertices, FKs = edges).
• Examine each component in order, starting with the largest
– If you get an isolated table, it is very probably an archive; try to verify this or find another purpose for it.
– Otherwise, visualize the component via an ER diagram, SchemaBall, or a graph using table metrics.
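The first two steps of the method can be sketched as a connected-components search over the schema graph. The function name and the toy table names are hypothetical; foreign keys are treated as undirected edges.

```python
from collections import defaultdict, deque


def connected_components(tables, foreign_keys):
    """Return the components of the schema graph, largest first.

    tables: iterable of table names (vertices).
    foreign_keys: iterable of (referencing_table, referenced_table) pairs,
    treated as undirected edges.
    """
    adjacency = defaultdict(set)
    for a, b in foreign_keys:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for table in tables:
        if table in seen:
            continue
        # Breadth-first search collects every table reachable from this one.
        seen.add(table)
        queue = deque([table])
        component = []
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


def isolated_tables(components):
    """Single-table components: the candidates for archive tables."""
    return [component[0] for component in components if len(component) == 1]
```

Components with more than one table are then passed on to the visualization step, while the isolated tables are checked individually.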
Practical Example
• Unknown complex RDB
– 332 tables
– 2339 attributes
– 192 foreign keys
– Size 2.4 GB
All tables
Archive Tables
• Each isolated table is an archive table, following the “_A” naming convention.
Component A
Component B
RDBAnalyzer
• supports all RDB systems that support JDBC, easily scalable, online connection
• features
– online visualization of large RDB schemas
– finding the components of the graph
– schema graph creation, visualization and export (Gephi)
– transformation of an RDB into a tuple graph
– metrics charts, parallel coordinates visualization
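As an illustration of schema graph creation from a live connection, the sketch below reads tables and foreign keys from database metadata. RDBAnalyzer itself works over JDBC; Python's built-in sqlite3 module and SQLite's `PRAGMA foreign_key_list` are used here purely as a stand-in, and the demo schema is hypothetical.

```python
import sqlite3


def read_schema_graph(connection):
    """Return (tables, edges) where each edge is (referencing, referenced)."""
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    edges = []
    for table in tables:
        # Column 2 of each foreign_key_list row is the referenced table.
        for fk in connection.execute(f'PRAGMA foreign_key_list("{table}")'):
            edges.append((table, fk[2]))
    return tables, edges


def demo():
    """Build a small in-memory database and extract its schema graph."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id)
        );
        CREATE TABLE orders_A (id INTEGER PRIMARY KEY);
        """
    )
    try:
        return read_schema_graph(connection)
    finally:
        connection.close()
```

The resulting vertex and edge lists are the input for component search, metric computation, or export to a graph tool.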
RDBAnalyzer
RDB to Ontology Mapping
– better understanding and searching for information without knowledge of RDB model, data mining from RDB
– can be used by web search engines to search in RDBs
– lets people who do not understand RDB technology (laymen) get information from an RDB
– a method for merging multiple databases (ontology merging)
– interactive searching for information (Protégé)
RDB Schema NORTHWIND (ER-Diagram)
OntoGraf (Protégé)
How to find information in Ontologies
• using query language (SPARQL)
• interactive (e.g. Protégé)
– using OntoGraf combined with text searching
– exploring entities and individuals
Disadvantages & Problems of Mapping RDBs to Ontologies
• Difficult to keep the data up to date (static & dynamic ontology creation).
• Aggregate queries are very slow.
• Existing tools cannot cope with large RDBs (or large ontologies).
Conclusion & Scientific Contribution
• Design and creation of a method for orientation in, understanding of, and finding information in large or unknown relational databases (RDBAnalyzer supports these principles).
• Detection of RDB graph characteristics (scale-free network) and use of this knowledge to create 2 new metrics and validate 1 existing metric.
• Design and creation of a method for finding information in ontologies generated from an RDB.
Thank you for your attention!