Chapter 25: Distributed Databases
![Page 1: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/1.jpg)
Chapter 25: Distributed Databases
![Page 2: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/2.jpg)
Definitions
• Distributed Database – a collection of multiple logically interrelated databases distributed over a computer network
• Distributed Database Management System – a software system that manages a distributed database while making the distribution transparent to the user.
![Page 3: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/3.jpg)
Motivations for Distributed DBs
• No centralized point of failure.
• Local Autonomy.
• There’s a whole lot of data out there to store.
• Replication of Data for Disaster Recovery and High Availability (think RAID on a network).
• High-throughput query processing (either inter-query or intra-query parallelism) and dynamic load-balancing.
• Poor people can’t afford supercomputers.
![Page 4: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/4.jpg)
Drawbacks of DDBs:
• Security
• Increased complexity of Database Design
• Increased complexity of Software
• Data integrity and resolution of concurrent operations.
• Cost (But if you’re big enough to need one, you probably can afford one?)
![Page 5: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/5.jpg)
Transparency:
Transparency of Data:
– Location Transparency – A command works the same no matter where in the system it is issued.
– Naming Transparency – We can refer to data by the same name, from anywhere in the system, with no further specification.
– Replication Transparency – Hides multiple copies of data from the user.
– Fragmentation Transparency – Hides the fact that data is fragmented (i.e., different sections of correlated data may be in different locations).
![Page 6: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/6.jpg)
Two Fundamental Patterns for Fragmenting Data
• Horizontal – Store whole tuples on different machines.
– Nice because we can use standard relational algebra statements to define a restriction (selection) on a relation that creates these:
σ "newyork" (City)
σ "chicago" (City)
(Do we need to know all possible values for City in order to fully specify a fragmentation?)
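The selection-based fragments above can be sketched in a few lines of Python. This is only an illustration: the `accounts` rows, the field names, and the catch-all `"other"` fragment are assumptions made for the example, not something the slides specify. The catch-all fragment is one possible answer to the question above, since it keeps the fragmentation complete without enumerating every city.

```python
# Hypothetical rows; field names are illustrative only.
accounts = [
    {"acct": 1, "city": "newyork", "balance": 100},
    {"acct": 2, "city": "chicago", "balance": 250},
    {"acct": 3, "city": "boston", "balance": 75},
]


def select(rows, predicate):
    """Relational selection: keep the whole tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


# One horizontal fragment per named city, each defined by a selection.
fragments = {
    "newyork": select(accounts, lambda r: r["city"] == "newyork"),
    "chicago": select(accounts, lambda r: r["city"] == "chicago"),
}

# Assumed catch-all fragment: every tuple whose city has no fragment of its own.
named = set(fragments)
fragments["other"] = select(accounts, lambda r: r["city"] not in named)

# The union of all fragments should reconstruct the original relation.
reconstructed = sorted(
    (row for frag in fragments.values() for row in frag),
    key=lambda r: r["acct"],
)
```

Each tuple lands in exactly one fragment here, so the fragments are disjoint and their union is the full relation.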
![Page 7: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/7.jpg)
• Vertical – Store different fields of the same tuples on different machines.
– Use the projection operator to declare these:
π Acct #, Branch, Client Name (Account)
π Acct #, Balance (Account)
(Notice this requires redundant storage of at least one primary key per tuple)
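A minimal Python sketch of the two projections above, assuming a hypothetical `accounts` relation keyed by `acct` (the sample rows and helper names are invented for illustration). It shows why the key must be stored in both fragments: it is the only way to join them back together.

```python
# Hypothetical Account rows; values are illustrative only.
accounts = [
    {"acct": 1, "branch": "Downtown", "client": "Ada", "balance": 100},
    {"acct": 2, "branch": "Uptown", "client": "Grace", "balance": 250},
]


def project(rows, fields):
    """Relational projection: keep only the listed fields of each tuple."""
    return [{f: row[f] for f in fields} for row in rows]


# Two vertical fragments; both repeat the primary key "acct".
profile_fragment = project(accounts, ["acct", "branch", "client"])
balance_fragment = project(accounts, ["acct", "balance"])


def join_on_key(left, right, key):
    """Rebuild full tuples by matching the shared key in both fragments."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**l_row, **right_by_key[l_row[key]]}
        for l_row in left
        if l_row[key] in right_by_key
    ]


rebuilt = join_on_key(profile_fragment, balance_fragment, "acct")
```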
![Page 8: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/8.jpg)
Redundant / Non-Redundant Allocations:
• Full Replication (Completely Redundant)
– Good read time, good recoverability
– Requires more coordination for multiple writers on same data, hogs disk space
• No Replication (Non-Redundant)
– Easier to coordinate multiple writers, multiple readers. But no backup in case of disaster.
• Partial Replication
– Trade-off between the above two options.
![Page 9: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/9.jpg)
Global Directory
• Global Centralized (Why have a DDBMS at all if you’re going to do this?)
• Dispersed or no Global Directory
• Completely Replicated
• Local-Master Directory
– Each node has its own catalog of data
– Each node has a directory to all of its data that is replicated elsewhere.
![Page 10: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/10.jpg)
Each database in a distributed database is distinct from all other databases in the system and has its own global database name.
![Page 11: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/11.jpg)
Name Resolution in Oracle8
• Every data object in every schema in every database has a unique identifying name:
– SELECT * FROM [email protected]_auto.com;
• A remote query is a query that selects information from one or more remote tables, all of which reside at the same remote node. For example:
– SELECT * FROM [email protected]_auto.com;
![Page 12: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/12.jpg)
Remote and Distributed SQL Statements in Oracle8
• A remote update is an update that modifies data in one or more tables, all of which are located at the same remote node. For example:
– UPDATE [email protected]_auto.com SET loc = 'NEW YORK' WHERE deptno = 10;
• A distributed query retrieves information from two or more nodes. For example:
– SELECT ename, dname FROM scott.emp e, [email protected]_auto.com d WHERE e.deptno = d.deptno;
![Page 13: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/13.jpg)
• A distributed update modifies data on two or more nodes. A distributed update is possible using a PL/SQL subprogram unit, such as a procedure or trigger, that includes two or more remote updates that access data on different nodes. For example:
BEGIN
  UPDATE [email protected]_auto.com
    SET loc = 'NEW YORK' WHERE deptno = 10;
  UPDATE scott.emp
    SET deptno = 11 WHERE deptno = 10;
END;
![Page 14: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/14.jpg)
2-Phase Commit Process
• Easy to trigger with the COMMIT statement.
• The Recoverer (RECO) background process on each server involved in the transaction coordinates to resolve any in-doubt transactions.
• All RECOs either commit or roll back the change in a consistent manner.
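The all-or-nothing outcome described above can be illustrated with a toy two-phase commit coordinator in Python. This is a generic sketch of the protocol, not Oracle's implementation: the `Participant` class, its `can_commit` flag, and the state names are invented for the example, and it leaves out the failure and recovery handling that RECO provides.

```python
class Participant:
    """A hypothetical node taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumed flag standing in for local checks
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes only if the local changes can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Commit everywhere only if every participant votes yes in phase 1."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                   # phase 2: global commit
            p.commit()
        return "committed"
    for p in participants:                       # phase 2: global rollback
        p.rollback()
    return "rolled_back"


ok_nodes = [Participant("local"), Participant("remote")]
ok_result = two_phase_commit(ok_nodes)

bad_nodes = [Participant("local"), Participant("remote", can_commit=False)]
bad_result = two_phase_commit(bad_nodes)
```

A single "no" vote in phase 1 is enough to roll back every node, which is what keeps the nodes consistent with each other.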
![Page 15: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/15.jpg)
Chapter 28: Data Warehousing and OLAP
![Page 16: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/16.jpg)
• Data Warehousing: “a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions”
• Decision Support Systems or Executive Information Systems
• Online Analytical Processing (OLAP): analysis of complex data from a data warehouse
![Page 17: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/17.jpg)
Data Warehouses
• Optimized for providing general information about large data sets instead of explicit information about individual data records
• Multidimensional Matrices called Data Cubes (or hypercubes)
• Efficient storage, data marts, distributed DW, federated DW
![Page 18: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/18.jpg)
Some steps in Data Acquisition
1. Data is extracted (from multiple heterogeneous sources)
2. Data must be formatted
3. Data must be cleaned (the most involved step). Data can be backflushed to its source after cleaning.
4. Data must be converted from its source (relational, OO, hierarchical) to the DW’s multidimensional scheme.
5. The data must actually be loaded.
![Page 19: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/19.jpg)
Basic Operations
• Pivot (rotate)
• Roll-Up (grouping)
• Drill-Down (subdivision)
• Slice and dice: Perform projection operations on dimensions.
• Sort (data, by some criteria)
• Select Data (by value or range)
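A few of these operations can be sketched over a tiny fact list in Python. The `facts` rows, the dimension names, and the helper functions are all assumptions made for illustration; a real OLAP server would work on a stored cube rather than a list of tuples.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
facts = [
    ("east", "widget", "Q1", 10),
    ("east", "widget", "Q2", 15),
    ("east", "gadget", "Q1", 7),
    ("west", "widget", "Q1", 4),
    ("west", "gadget", "Q2", 9),
]
DIMS = {"region": 0, "product": 1, "quarter": 2}  # column position of each dimension


def roll_up(rows, keep):
    """Group by the dimensions in `keep` and sum sales over all the others."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in keep)
        totals[key] += row[3]
    return dict(totals)


def select_by_value(rows, dim, value):
    """Select data: keep only the rows with the given value on one dimension."""
    return [row for row in rows if row[DIMS[dim]] == value]


by_region = roll_up(facts, ["region"])                     # roll-up (coarser grouping)
by_region_product = roll_up(facts, ["region", "product"])  # drill-down (finer grouping)
by_product_region = roll_up(facts, ["product", "region"])  # pivot (axes swapped)
q1_by_region = roll_up(select_by_value(facts, "quarter", "Q1"), ["region"])
```

Drill-down is simply the same aggregation with one more dimension kept, and the pivot holds the same totals with the key order rotated.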
![Page 20: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/20.jpg)
Chunk-Offset Compression
Only stores the addresses and data for valid cells in each chunk, in an (offset, cellValue) format.
Heum-Geun Kang and Chin-Wan Chung, Exploiting versions for on-line data warehouse maintenance in MOLAP servers, VLDB, 2002.
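The (offset, cellValue) format can be illustrated with a short Python sketch. It assumes a chunk is already flattened into a one-dimensional list and that empty cells are marked with `None`; both are choices made for the example, and the function names are hypothetical rather than taken from the cited paper.

```python
def compress_chunk(chunk, empty=None):
    """Keep only the valid cells of a chunk as (offset, cellValue) pairs."""
    return [
        (offset, value)
        for offset, value in enumerate(chunk)
        if value is not empty
    ]


def decompress_chunk(pairs, size, empty=None):
    """Rebuild the full chunk, filling every unlisted offset with `empty`."""
    chunk = [empty] * size
    for offset, value in pairs:
        chunk[offset] = value
    return chunk


# A sparse chunk of 8 cells in which only two cells hold data.
chunk = [None, None, 42, None, 7, None, None, None]
pairs = compress_chunk(chunk)
restored = decompress_chunk(pairs, len(chunk))
```

The sparser the chunk, the larger the saving: here 8 cells shrink to 2 pairs.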
![Page 21: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/21.jpg)
Bitmap Indexing
0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0
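The labels of the grid above did not survive, so the following Python sketch only illustrates the general idea under an assumption: each distinct column value gets one bit vector, and bit i is set when row i holds that value. The `cities` column and the function names are invented for the example, and Python integers stand in for the bit vectors.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_matching(bitmap):
    """Decode a bitmap back into the list of row numbers whose bit is set."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical column of a table, one entry per row.
cities = ["newyork", "chicago", "newyork", "boston", "chicago"]
index = build_bitmap_index(cities)

# A predicate such as "city is newyork OR chicago" becomes a bitwise OR.
either = index["newyork"] | index["chicago"]
matching_rows = rows_matching(either)
```

Combining predicates with bitwise AND and OR is what makes bitmap indexes attractive for the low-cardinality columns common in warehouses.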
![Page 22: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/22.jpg)
Multidimensional Schema
• Components:
– Dimension Tables – tuples of attributes of the dimension
– Fact Table – Holds tuples that correspond to recorded facts.
• Patterns:
– Star Schema – A single table for each dimension.
– Snowflake Schema – obtained by normalizing a star schema, creating a new hierarchy of multiple dimensional tables
– Fact Constellation – A set of fact tables that share some dimension tables
![Page 23: Chapter 25: Distributed Databases](https://reader033.fdocuments.in/reader033/viewer/2022061522/5681555b550346895dc323ee/html5/thumbnails/23.jpg)
Data Warehousing vs. Materialized Views
• DWs exist as persistent storage instead of being materialized on-demand.
• DWs are multidimensional, not relational. Views of a relational database are relational.
• DWs can be indexed to optimize performance. Views are dependent on the structure of the underlying database.
• DWs contain compositions of data collected from multiple data sources. Views are derived from a single database.