My seminar on distributed dbms


DISTRIBUTED DATABASE MANAGEMENT SYSTEM

SEMINAR

On

“DISTRIBUTED DATABASE MANAGEMENT SYSTEM”

Submitted by

Name: - Patel Vinaykumar Dineshchandra

Class: - B.C.A (Sem-6)

Seat No: - 1732

Submitted to

LAXMI INSTITUTE OF COMMERCE & COMPUTER APPLICATIONS SARIGAM (BCA)



SEMINAR REPORT

As a Partial Requirement

For the Degree of

Bachelor of Computer Applications

(B.C.A)

Academic Year: 2015-16

Submitted by:

Patel Vinaykumar Dineshchandra

Guided by: Internal: Miss Rucha Nage




PREFACE

It is an exciting moment for me to present this seminar report. Proper care was taken while preparing the report so that it is easy to read and understand. During the preparation of this seminar report, information technology concepts were applied.

This seminar is part of the third-year study, the final step towards the completion of the BCA course.

This documentation describes the system's functions in an understandable manner. The seminar report consists of different sections, such as the specification of the technology, a study of it, its different functions, and its features, that will help the reader understand the technology in brief.




ACKNOWLEDGEMENT

I, a student of Laxmi Institute of Commerce & Computer Applications, Sarigam (B.C.A), feel great satisfaction and pleasure in presenting this seminar.

I have great pleasure in acknowledging the help given by various individuals throughout the seminar work. This project is itself an acknowledgement of the inspiration, drive, and technical assistance contributed by many individuals.

I express my sincere and heartfelt gratitude to Dr. Keyur Nayak, Director of the Department of Computer Applications, for being helpful and co-operative during this period.

I also express my deep gratitude to the faculty member Miss Rucha Nage, Subject Faculty and Internal Guide, for her valuable guidance, good suggestions, and help in the completion of this seminar.

I extend my sincere thanks to all the faculty members for providing useful and necessary help. Without the support of every one of them, this seminar could not have been completed.

Sincerely,

Patel Vinaykumar Dineshchandra

(T.Y.B.C.A)




DISTRIBUTED DATABASE

MANAGEMENT SYSTEM

INDEX





SR.NO. DESCRIPTION

1. ABSTRACT

2. INTRODUCTION

3. DEFINITION

4. TYPES

5. FUNCTIONS

6. ADVANTAGES

7. DISADVANTAGES

ABSTRACT




The purpose of this paper is to present an introduction to distributed databases through two main parts. In the first part, we present a study of the fundamentals of distributed database systems (DDBS).

We discuss issues related to the motivations for distributed database systems, their architecture, design, performance, concurrency control, and so on.

The topics of this research include query optimization, distribution optimization, fragmentation optimization, and join optimization on the Internet.

We include examples and results to demonstrate the topics we are presenting.

INTRODUCTION




In today’s world of universal dependence on information systems, all sorts of people need access to companies’ databases. In addition to a company’s own employees, these include the company’s customers, potential customers, suppliers, and vendors of all types. It is possible for a company to have all of its databases concentrated at one mainframe computer site with worldwide access to this site provided by telecommunications networks, including the Internet.

Although the management of such a centralized system and its databases can be controlled in a well-contained manner, which can be advantageous, it poses some problems as well. For example, if the single site goes down, then everyone is blocked from accessing the databases until the site comes back up again. Also, the communications costs from the many remote PCs and terminals to the central site can be high.

One solution to such problems, and an alternative design to the centralized database concept, is known as ‘Distributed Database’. The idea is that instead of having one, centralized database, we are going to spread the data out among the cities on the distributed network, each of which has its own computer and data storage facilities. All of this distributed data is still considered to be a single logical database.

When a person or process anywhere on the distributed network queries the database, it is not necessary to know where on the network the data being sought is located. The user just issues the query, and the result is returned. This feature is known as ‘Location Transparency’. This can become rather complex very quickly, and it must be managed by sophisticated software known as a ‘Distributed Database Management System’ or ‘Distributed DBMS’.
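As a rough illustration of location transparency, the following minimal Python sketch routes a query through a catalog that records which site holds each table. The `Site` and `DistributedCatalog` classes, the site names, and the table contents are all hypothetical and chosen only for this example; a real Distributed DBMS does far more than this lookup.

```python
class Site:
    """A hypothetical site holding some tables in local storage."""

    def __init__(self, name):
        self.name = name
        self.tables = {}  # table name -> list of row dicts

    def select(self, table, predicate):
        # Runs the query against this site's local data only.
        return [row for row in self.tables[table] if predicate(row)]


class DistributedCatalog:
    """Maps each table to the site that stores it (an assumed design)."""

    def __init__(self):
        self.location = {}  # table name -> Site

    def register(self, table, site, rows):
        site.tables[table] = list(rows)
        self.location[table] = site

    def query(self, table, predicate):
        # The caller names only the table; the catalog finds the site.
        site = self.location[table]
        return site.select(table, predicate)


mumbai = Site("mumbai")
delhi = Site("delhi")
catalog = DistributedCatalog()
catalog.register("customers", mumbai,
                 [{"id": 1, "city": "Vapi"}, {"id": 2, "city": "Surat"}])
catalog.register("orders", delhi, [{"id": 10, "customer_id": 1}])

# The user issues the query without saying where "customers" is stored.
result = catalog.query("customers", lambda row: row["city"] == "Vapi")
```

The caller never mentions `mumbai` or `delhi` when querying, which is the point of the feature: moving a table to another site would change only the catalog entry, not the query.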

DEFINITION




A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.

A distributed database management system (DDBMS) is the software that manages the DDB, and provides an access mechanism that makes this distribution transparent to the user.

A distributed database system (DDBS) is the integration of a distributed database and a distributed DBMS.

This integration is achieved by merging database and networking technologies.

Alternatively, it can be described as a system that runs on a collection of machines that do not have shared memory, yet looks to the user like a single machine.

There are two types of distributed DBMS:

1. Homogeneous Distributed DBMS

2. Heterogeneous Distributed DBMS

1. Homogeneous Distributed DBMS: -




All sites of the database system have an identical setup, i.e., the same database system software. The underlying operating systems may differ. For example, all sites run Oracle, or all run DB2, or all run Sybase or some other database system, while the underlying operating systems can be a mixture of Linux, Windows, UNIX, etc. The clients thus have to use identical client software.

2. Heterogeneous Distributed DBMS: -

Federated: Each site may run a different database system, but data access is managed through a single conceptual schema. This implies that the degree of local autonomy is minimal. Each site must adhere to a centralized access policy. There may be a global schema.

Multidatabase: There is no single conceptual global schema. For data access, a schema is constructed dynamically as needed by the application software.

Architecture of a DDBMS




Each computer (site) in a distributed system may contain a Transaction Manager (TM) and a Data Manager (DM); a Transaction Coordinator (TC) may also be present. The TM is responsible for the transactions received by the computer. The DM manages database access on the local computer. When a transaction arrives at the TM, the TM divides it into sub-transactions, which are transmitted to those DMs containing the data needed by the transaction. (In some cases the TC is responsible for this.)

The TM collects the data returned in the sub-transactions' responses and produces the final result.

Any TM can communicate with all DMs and vice versa.

NOTE: The DMs cannot transmit data to other DMs and the same applies to TMs, except in certain cases where it is convenient to transfer the total responsibility of a Transaction from one TM to another (i.e. if a Transaction runs as a local Transaction on another computer.)
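The division of work between a TM and the DMs can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `DataManager` and `TransactionManager` classes are hypothetical, a "transaction" is reduced to a list of keys to read, and there is no TC, no locking, and no failure handling.

```python
class DataManager:
    """Manages access to the data stored at one site."""

    def __init__(self, site, data):
        self.site = site
        self.data = dict(data)

    def execute(self, keys):
        # A sub-transaction: read the requested keys that live here.
        return {k: self.data[k] for k in keys}


class TransactionManager:
    """Splits a transaction into sub-transactions, one per DM."""

    def __init__(self, data_managers):
        self.data_managers = data_managers

    def run(self, keys):
        # Group the requested keys by the DM that holds them.
        plan = {}
        for key in keys:
            for dm in self.data_managers:
                if key in dm.data:
                    plan.setdefault(dm.site, (dm, []))[1].append(key)
                    break
            else:
                raise KeyError(f"no site holds {key!r}")
        # Send each sub-transaction to its DM and merge the responses.
        result = {}
        for dm, sub_keys in plan.values():
            result.update(dm.execute(sub_keys))
        return result


dm_a = DataManager("A", {"x": 1, "y": 2})
dm_b = DataManager("B", {"z": 3})
tm = TransactionManager([dm_a, dm_b])
total = tm.run(["x", "z"])  # one sub-transaction goes to A, one to B
```

Note that the DMs in this sketch never talk to each other; all communication passes through the TM, in line with the note above.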

CHARACTERISTICS OF DISTRIBUTED DBMS




A Distributed DBMS developed by a single vendor may contain:

1. Data Independence

2. Concurrency Control

3. Replication facilities

4. Recovery facilities

5. Co-ordinated Data Dictionary

The first four are discussed in detail below.

Data Independence: -

- A database system normally contains a lot of data in addition to users’ data. For example, it stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to modify or update a set of metadata once it is stored in the database. But as a DBMS expands, it needs to change over time to satisfy the requirements of the users. If all of the data were interdependent, this would become a tedious and highly complex job.

- Metadata itself follows a layered architecture, so that when we change data at one layer, it does not affect the data at another layer. The layers are independent of, but mapped to, each other.

a. Logical Data Independence: - Logical data is data about the database, that is, it stores information about how data is managed inside. An example is a table (relation) stored in the database together with all the constraints applied to that relation. Logical data independence is a mechanism that decouples this logical description from the actual data stored on the disk. If we make changes to the table format, it should not change the data residing on the disk.

b. Physical Data Independence: - All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data independence is the ability to change the physical data without impacting the schema or logical data. For example, if we want to change or upgrade the storage system itself, say by replacing hard disks with SSDs, it should not have any impact on the logical data or schemas.

Concurrency Control: -




- Concurrency control is a database management system (DBMS) concept that is used to address conflicts with the simultaneous accessing or altering of data that can occur with a multi-user system. It ensures that Database transactions are performed concurrently without violating the data integrity of the respective databases. Thus concurrency control is an essential element for correctness in any system where two or more database transactions, executed with time overlap, can access the same data.

- Concurrency control protocols can be broadly divided into two categories:

a. Lock-based protocols

b. Timestamp-based protocols

a. Lock-based protocols: - A lock is a mechanism that tells the DBMS whether a particular data item is being used by any transaction for reading or writing. Since the two types of operation, read and write, are different in nature, the locks for read and write operations may behave differently.

Consider read operations performed by different transactions on the same data item. The value of the data item, if constant, can be read by any number of transactions at any given time. If a transaction is reading the content of a sharable data item, then any number of other transactions can be allowed to read the content of the same data item.

Write operation is something different. When a transaction writes some value into a data item, the content of that data item remains in an inconsistent state, starting from the moment when the writing operation begins up to the moment the writing operation is over.

Therefore, if any transaction is writing into a sharable data item, no other transaction will be allowed to read or write that same data item.

Database systems equipped with lock-based protocols use a mechanism by which no transaction can read or write data until it acquires an appropriate lock on it.

Locks are of two kinds:

1) Binary Locks: - A lock on a data item can be in two states; it is either locked or unlocked.

2) Shared/Exclusive Locks: -

(1) Shared Lock: A transaction may acquire a shared lock on a data item in order to read its content. The lock is shared in the sense that any other transaction can acquire a shared lock on that same data item for reading.

(2) Exclusive Lock: A transaction may acquire an exclusive lock on a data item in order to both read and write it. The lock is exclusive in the sense that no other transaction can acquire any kind of lock (either shared or exclusive) on that same data item.
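The compatibility rules for shared and exclusive locks can be sketched in a few lines of Python. The `LockManager` class below is a hypothetical, single-process illustration: it only decides whether a request is compatible with the locks already held, and it does not queue or block waiting transactions as a real DBMS would.

```python
class LockManager:
    """Grants shared ("S") and exclusive ("X") locks on data items."""

    def __init__(self):
        self.holders = {}  # item -> {transaction id: "S" or "X"}

    def acquire(self, txn, item, mode):
        held = self.holders.setdefault(item, {})
        others = {t: m for t, m in held.items() if t != txn}
        if mode == "S":
            # A shared lock is compatible with other shared locks only.
            compatible = all(m == "S" for m in others.values())
        else:
            # An exclusive lock is compatible with no other lock.
            compatible = not others
        if compatible:
            held[txn] = mode
        return compatible

    def release(self, txn, item):
        self.holders.get(item, {}).pop(txn, None)


lm = LockManager()
r1 = lm.acquire("T1", "Q", "S")      # first reader
r2 = lm.acquire("T2", "Q", "S")      # second reader shares the lock
w3 = lm.acquire("T3", "Q", "X")      # writer is refused while readers hold Q
lm.release("T1", "Q")
lm.release("T2", "Q")
w3_retry = lm.acquire("T3", "Q", "X")  # writer succeeds once Q is free
r1_retry = lm.acquire("T1", "Q", "S")  # reader is refused while T3 writes
```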

There are four types of lock protocols available; two of them are described here:

1) Simplistic Lock Protocol: - Simplistic lock-based protocols allow transactions to obtain a lock on every object before a 'write' operation is performed. Transactions may unlock the data item after completing the 'write' operation.

2) Pre-claiming Lock Protocol: - Pre-claiming protocols evaluate their operations and create a list of data items on which they need locks. Before initiating execution, the transaction requests all the locks it needs from the system. If all the locks are granted, the transaction executes and releases all the locks when all its operations are over. If not all the locks are granted, the transaction rolls back and waits until all the locks are granted.
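The pre-claiming idea, taking every lock before doing any work, can be sketched as below. The function name `run_preclaiming` and the plain dictionary used as a lock table are assumptions made for this illustration; the sketch returns `False` instead of actually waiting, and it assumes the list of needed items contains no duplicates.

```python
def run_preclaiming(txn, needed_items, locked_items, work):
    """Run `work` only if every needed item can be locked up front.

    `locked_items` is a dict mapping item -> owning transaction id.
    Returns True if the transaction ran, False if it must wait.
    """
    # Step 1: check that all locks are available before starting.
    if any(item in locked_items for item in needed_items):
        return False  # not all locks granted: do not start, retry later
    # Step 2: take every lock at once.
    for item in needed_items:
        locked_items[item] = txn
    try:
        work()
    finally:
        # Step 3: release all locks when the operations are over.
        for item in needed_items:
            del locked_items[item]
    return True


locks = {"B": "T9"}  # item B is already locked by another transaction
log = []
ran_t1 = run_preclaiming("T1", ["A", "B"], locks, lambda: log.append("T1"))
ran_t2 = run_preclaiming("T2", ["A", "C"], locks, lambda: log.append("T2"))
```

Here `T1` never starts because one of its items is unavailable, while `T2` obtains both of its locks, runs, and releases them.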

b. Timestamp-based Protocols: - A timestamp is a tag that can be attached to any transaction or any data item, which denotes a specific time at which the transaction or data item was activated in any way. This protocol uses either the system time or a logical counter as a timestamp. Every transaction has a timestamp associated with it, and the ordering is determined by the age of the transaction.

The timestamp of a data item can be of the following two types:

(1) W-timestamp (Q): This means the latest time when the data item Q has been written into.

(2) R-timestamp (Q): This means the latest time when the data item Q has been read from.

How should timestamps be used?

For Read operations: If a transaction T1 issues a read(X) operation,

If TS(T1) < W-timestamp(X): operation rejected.

If TS(T1) >= W-timestamp(X): operation executed, and the data item's timestamps are updated.

For Write operations: If a transaction T1 issues a write(X) operation,

If TS(T1) < R-timestamp(X): operation rejected.

If TS(T1) < W-timestamp(X): operation rejected and T1 rolled back.

Otherwise: operation executed.
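The read and write rules above can be written out as a small Python sketch. The `DataItem` class and the `read` and `write` functions are hypothetical names for this example, transaction timestamps are plain integers, and a rejected operation is signalled by a return value rather than by an actual rollback.

```python
class DataItem:
    def __init__(self):
        self.r_ts = 0  # R-timestamp(Q): latest time Q was read
        self.w_ts = 0  # W-timestamp(Q): latest time Q was written
        self.value = None


def read(ts, item):
    """Return (ok, value); reject if a younger transaction already wrote."""
    if ts < item.w_ts:
        return False, None
    item.r_ts = max(item.r_ts, ts)
    return True, item.value


def write(ts, item, value):
    """Return True if the write is applied, False if it is rejected."""
    if ts < item.r_ts or ts < item.w_ts:
        return False  # the transaction would be rolled back
    item.w_ts = ts
    item.value = value
    return True


x = DataItem()
w5 = write(5, x, "a")   # accepted: nothing newer has touched x
r3 = read(3, x)         # rejected: x was written at time 5
r7 = read(7, x)         # accepted: R-timestamp(x) becomes 7
w6 = write(6, x, "b")   # rejected: x was already read at time 7
w8 = write(8, x, "c")   # accepted
```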

Replication Facilities: -




- Replication is useful in improving the availability of data by copying data to multiple sites.

- Either a relation or a fragment can be replicated at one or more sites.

- Fully redundant databases are those in which every site contains a copy of the entire database.

- Depending on the availability and redundancy required, there are three types of replication:

a. Full replication.

b. No replication.

c. Partial replication.

Full replication: -

The most extreme case is replication of the whole database at every site in the distributed system.

This can improve availability remarkably because the system can continue to operate as long as at least one site is up.

It also improves performance for retrieval of global queries as the result can be obtained locally at any client.

Disadvantage: Slows the update process as a single update must be performed at different databases to keep the copies consistent.

No replication : -

The other extreme from full replication involves having no replication—that is, each fragment is stored at exactly one site.

In this case, all fragments must be disjoint, except for the repetition of primary keys among vertical (or mixed) fragments.

This is also called ‘Non-redundant allocation.’
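To make the remark about primary keys concrete, the sketch below splits a small relation into two vertical fragments and rebuilds it. The `employees` rows, the column names, and both helper functions are invented for this example; the only point is that the key column is the one piece of data repeated in every vertical fragment.

```python
employees = [
    {"emp_id": 1, "name": "Asha", "salary": 30000},
    {"emp_id": 2, "name": "Ravi", "salary": 42000},
]


def vertical_fragment(rows, key, columns):
    """Project each row onto `columns`, always keeping the primary key."""
    return [{key: row[key], **{c: row[c] for c in columns}} for row in rows]


def reconstruct(frag_a, frag_b, key):
    """Rebuild the original rows by joining two fragments on the key."""
    by_key = {row[key]: row for row in frag_b}
    return [{**row, **by_key[row[key]]} for row in frag_a]


names_at_site_1 = vertical_fragment(employees, "emp_id", ["name"])
pay_at_site_2 = vertical_fragment(employees, "emp_id", ["salary"])
rebuilt = reconstruct(names_at_site_1, pay_at_site_2, "emp_id")
```

Apart from `emp_id`, the two fragments share no columns, so each non-key value is stored at exactly one site.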

Partial Replication: -

Here some fragments of the database may be replicated whereas others may not.

The number of copies of each fragment can range from one up to the total number of sites in the distributed system.

For example, mobile workers such as sales forces and financial planners carry partially replicated databases on their laptops and synchronize periodically with the server databases.




A description of the replication of fragments is sometimes called a replication schema.

Each fragment—or each copy of a fragment—must be assigned to a particular site in the distributed system. This process is called data distribution (or data allocation).

The choice of sites and the degree of replication depend on the performance and availability goals of the system and on the types and frequencies of transactions submitted at each site.

For example, if high availability is required, transactions can be submitted at any site, and most transactions are retrieval only, a fully replicated database is a good choice.

However, if certain transactions that access particular parts of the database are mostly submitted at a particular site, the corresponding set of fragments can be allocated at that site only.

Data that is accessed at multiple sites can be replicated at those sites. If many updates are performed, it may be useful to limit replication.
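The trade-off between availability for reads and cost for updates can be seen in a short sketch. The `ReplicatedFragment` class is hypothetical, and it assumes one particular policy (read from any available copy, write to all copies) purely for illustration; other replication policies exist.

```python
class ReplicatedFragment:
    """One fragment copied to several sites (hypothetical structure)."""

    def __init__(self, sites):
        # Each site starts with its own empty copy of the fragment.
        self.copies = {site: {} for site in sites}
        self.up = set(sites)

    def read(self, key):
        # Any available copy can answer a read.
        for site in self.copies:
            if site in self.up:
                return self.copies[site].get(key)
        raise RuntimeError("no site holding this fragment is up")

    def write(self, key, value):
        # Assumed policy: every copy must be reachable and updated.
        if self.up != set(self.copies):
            raise RuntimeError("write-all needs every copy reachable")
        for copy in self.copies.values():
            copy[key] = value
        return len(self.copies)  # number of copies that were updated


full = ReplicatedFragment(["S1", "S2", "S3"])  # replicated at every site
single = ReplicatedFragment(["S1"])            # no replication
writes_full = full.write("k", 1)
writes_single = single.write("k", 1)

full.up.discard("S1")           # site S1 fails
still_readable = full.read("k")  # served by another copy
```

One update costs three writes on the replicated fragment and one on the unreplicated fragment, while only the replicated fragment stays readable after a site failure.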




Recovery Facilities: -

- Recovery protocols bring failed nodes back online.

- The effectiveness of the recovery protocol affects the availability of the database.

- The following recovery methods are available:

1 Salvation Program: - A post-crash process that tries to restore the DB to a valid state. No recovery data is used.

2 Incremental Dumping: - Copies updated files to archival storage. Performed either after transaction completion or at regular intervals.

3 Audit Trail: - Keeps track of a sequence of actions. Useful for restoring the DB to its pre-crash state.

4 Differential Files: - A separate file records the updates requested for records in a main file.

5 Backup/Current Version: - The current version of the DB is stored in the currently existing files with present values.

6 Multiple Copies: - Multiple identical copies of the DB files are maintained.

7 Careful Replacement: - The update is performed on a copy. The original is deleted upon commit, so the original copy is still available after a crash during the update.
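Careful replacement can be illustrated at the level of a single file. The sketch below is one possible reading of the method, not a description of any particular DBMS: the function name `careful_replace`, the JSON format, and the file name are assumptions, and the "commit" is modelled with Python's `os.replace`, which swaps the new copy in place of the original.

```python
import json
import os
import tempfile


def careful_replace(path, new_data):
    """Write `new_data` to a copy, then swap it in place of the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # The update is performed on a temporary copy in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(new_data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # "Commit": the new copy replaces the original in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # A failure before the swap leaves the original untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "accounts.json")
careful_replace(db_path, {"balance": 100})
careful_replace(db_path, {"balance": 80})
with open(db_path) as f:
    current = json.load(f)
```

If the process crashes while the temporary copy is being written, the file at `db_path` still holds the previous version.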

- Dealing with Recovery: -

(1) Lower time to recover.

(2) Reduce the amount of recovery data to be transferred from active nodes.

(3) Log-based and version-based recovery support.

(4) Support for the amnesia phenomenon.




Harbor

• Recovery technique for “updatable warehouse” like systems.

• Queries active remote nodes.

• Timestamps determine which tuples to copy or update.

• Allows non-DBA transactions while recovering.

• Lower runtime overhead.

• Performance comparable to ARIES.

• Does not require stable log.

• Exploits replication to support recovery.

• Exploits historical queries.

• Supports recovery in warehouse-like systems that require fine-granularity insertions and updates.

• Uses versioning and “time travel.”

• Replicas are kept consistent up to some historical point using checkpointing.

• Replication need not be physically identical, but must logically represent the same data.

• Provides K-safety, i.e. tolerates K simultaneous site failures.

• Augments the tuples with Insert- and Delete-Time to provide versioning.

• 3 Stage Algorithm

- Restore to last checkpoint

- Update with Historical Queries

- Update to current time
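The versioning idea behind the points above can be sketched with tuples that carry an insert time and a delete time. This is a loose illustration and not Harbor's actual implementation: the field names, the `visible_at` and `recover` functions, and the sample data are assumptions, and the second and third stages are collapsed into a single pass over one live replica with no locking.

```python
def visible_at(tuples, t):
    """Return the values of tuples that existed at time `t`."""
    return [
        row["value"]
        for row in tuples
        if row["insert_time"] <= t
        and (row["delete_time"] is None or row["delete_time"] > t)
    ]


def recover(checkpoint_rows, checkpoint_time, live_replica):
    """Bring a crashed replica up to date from a live one."""
    # Stage 1: restore to the last checkpoint.
    recovered = [dict(row) for row in checkpoint_rows]
    # Stages 2 and 3 (merged here): apply later changes from the live copy.
    live_by_id = {row["id"]: row for row in live_replica}
    for row in recovered:
        # Copy deletions that happened after the checkpoint.
        row["delete_time"] = live_by_id[row["id"]]["delete_time"]
    known = {row["id"] for row in recovered}
    for row in live_replica:
        # Copy tuples inserted after the checkpoint.
        if row["insert_time"] > checkpoint_time and row["id"] not in known:
            recovered.append(dict(row))
    return recovered


live = [
    {"id": 1, "value": "a", "insert_time": 1, "delete_time": None},
    {"id": 2, "value": "b", "insert_time": 2, "delete_time": 6},
    {"id": 3, "value": "c", "insert_time": 7, "delete_time": None},
]
checkpoint = [  # the crashed replica's state as of time 5
    {"id": 1, "value": "a", "insert_time": 1, "delete_time": None},
    {"id": 2, "value": "b", "insert_time": 2, "delete_time": None},
]
recovered = recover(checkpoint, 5, live)
past = visible_at(recovered, 5)  # a "time travel" query
now = visible_at(recovered, 8)
```

Because deleted tuples are kept and only stamped with a delete time, the recovered replica can answer both the historical query and the current one.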




FUNCTIONS OF DISTRIBUTED DBMS

A DDBMS governs the storage and processing of logically related data over interconnected computer systems in which both data and processing functions are distributed among several sites. A DBMS must have at least the following functions to be classified as distributed:

- Application interface to interact with the end user, application programs, and other DBMSs within the distributed database.

- Validation to analyse data requests for syntax correctness.

- Transformation to decompose complex requests into atomic data request components.

- Query optimization to find the best access strategy. (Which database fragments must be accessed by the query, and how must data updates, if any, be synchronized?)

- Mapping to determine the data location of local and remote fragments.

- I/O interface to read or write data from or to permanent local storage.

- Formatting to prepare the data for presentation to the end user or to an application program.

- Security to provide data privacy at both local and remote databases.

- Backup and recovery to ensure the availability and recoverability of the database in case of a failure.

- DB administration features for the database administrator.

- Concurrency control to manage simultaneous data access and to ensure data consistency across database fragments in the DDBMS.

- Transaction management to ensure that the data moves from one consistent state to another. This activity includes the synchronization of local and remote transactions as well as transactions across multiple distributed segments.

ADVANTAGES OF DISTRIBUTED DBMS




1. Data are located near the greatest demand site.

The data in a distributed database system are dispersed to match business requirements, which reduces the cost of data access.

2. Faster data access.

End users often work with only a locally stored subset of the company’s data.

3. Faster data processing.

A distributed database system spreads out the system's workload by processing data at several sites.

4. Growth facilitation.

New sites can be added to the network without affecting the operations of other sites.

5. Improved communications.

Because local sites are smaller and located closer to customers, local sites foster better communication among departments and between customers and company staff.

6. Reduced operating costs.

It is more cost-effective to add workstations to a network than to update a mainframe system. Development work is done more cheaply and more quickly on low-cost PCs than on mainframes.

7. User-friendly interface.

PCs and workstations are usually equipped with an easy-to-use graphical user interface (GUI). The GUI simplifies training and use for end users.

8. Less danger of a single-point failure.

When one of the computers fails, the workload is picked up by other workstations. Data are also distributed at multiple sites.

9. Processor independence.

The end user is able to access any available copy of the data, and an end user's request is processed by any processor at the data location.

DISADVANTAGES OF DISTRIBUTED DBMS




1. Complexity of management and control.

Applications must recognize data location, and they must be able to stitch together data from various sites. Database administrators must have the ability to coordinate database activities to prevent database degradation due to data anomalies.

2. Technological difficulty.

Data integrity, transaction management, concurrency control, security, backup, recovery, query optimization, access path selection, and so on, must all be addressed and resolved.

3. Security.

The probability of security lapses increases when data are located at multiple sites. The responsibility of data management will be shared by different people at several sites.

4. Lack of standards.

There are no standard communication protocols at the database level. (Although TCP/IP is the de facto standard at the network level, there is no standard at the application level.) For example, different database vendors employ different—and often incompatible—techniques to manage the distribution of data and processing in a DDBMS environment.

5. Increased storage and infrastructure requirements.

Multiple copies of data are required at different sites, thus requiring additional disk storage space.

6. Increased training cost.

Training costs are generally higher in a distributed model than they would be in a centralized model, sometimes even to the extent of offsetting operational and hardware savings.

7. Costs.

Distributed databases require duplicated infrastructure to operate (physical location, environment, personnel, software, licensing, etc.).

