Database Systems


Transcript of Database Systems

Page 1: Database Systems

Data vs. Information

• Data:

– Raw facts; data constitute the building blocks of information

– Unprocessed information.

Raw data must be properly formatted for storage, processing and presentation.

• Information:

- Result of processing raw data to reveal its meaning.

Accurate, relevant, and timely information is key to good decision making

Good decision making is the key to organizational survival in a global environment.

Data Management: Discipline that focuses on the proper generation, storage and retrieval of data.

Data Mgt. is a core activity for any business, govt. agency, service organization or charity.

Why store the data as raw facts?

Historical Roots:

Files and File Systems

Although file systems as a way of managing data are now largely obsolete, there are several good reasons for studying them in detail:

- understanding file system characteristics makes database design easier to understand

- awareness of the problems with file systems helps prevent similar problems in a DBMS

- knowledge of file systems is helpful if you plan to convert an obsolete file system to a DBMS.

In the recent past, a manager of almost any small org. was (and sometimes still is) able to keep track of the necessary data by using a manual file system. Such a file system was traditionally composed of a collection of file folders, each properly tagged and kept in a filing cabinet. Unfortunately, report generation from a manual file system can be slow and cumbersome.

Files and File Systems:

Data: Raw facts, e.g. telephone number, birth date, customer name and year-to-date (YTD) sales value. Data have little meaning unless they have been organized in some logical manner. The smallest piece of data that can be recognized by the computer is a single character, such as the letter A, the number 5 or a symbol such as /. A single character requires 1 byte of computer storage.

Field: A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.

Record: A logically connected set of one or more fields that describes a person, place or thing, e.g. the fields that constitute a record for a customer: name, address, phone number, date of birth.

File: A collection of related records, e.g. a file might contain data about the vendors of the ROBCOR Company, or a file might contain records for the students currently enrolled at UEL.

Historical Roots: Files and File Systems

A simple file system where each department has multiple programs to directly access the data –

Note: No separation as will be seen in a DBMS

Page 2: Database Systems

As the number of files increased, a small file system evolved. Each file in the system used its own application programs to store, retrieve and modify data, and each file was owned by the individual or the dept. that commissioned its creation.

As the file system grew, the demand for the data processing (DP) specialist's programming skills grew even faster, and the DP specialist was authorized to hire additional programmers. The size of the file system also required a larger, more complex computer. The new computer and the additional programming staff caused the DP specialist to spend less time programming and more time managing technical and human resources. Therefore the DP specialist's job evolved into that of a Data Processing (DP) Manager who supervised a DP dept.

File-based System: A collection of application programs that perform services for the end-users

such as the production of reports. Each program defines and manages its own data.

Problems with File System Data Mgt.

- Data Redundancy: multiple file locations/copies could lead to Update, Insert and Delete

anomalies.

- Structural Dependencies/Data Dependence

· Access to a file depends on its structure

· Making changes to an existing file structure is difficult

· File structure changes require modifications in all programs that use data in that file

· Different programming languages have different file structures

· Modifications are likely to produce errors, requiring additional time to “debug” the program

· Programs written in a third-generation language (3GL): examples of 3GLs are COBOL, BASIC and FORTRAN

· The programmer must specify both the task and how it is done.

· Modern databases use a fourth-generation language (4GL), which allows users to specify what must be done without saying how it must be done. 4GLs are used for data retrieval (such as query-by-example and report generator tools) and can work with different DBMSs. The need to write a 3GL program to produce even the simplest report makes ad hoc queries impossible.
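To make the 3GL/4GL contrast concrete, here is a minimal SQL sketch (SQL being the most common 4GL for data retrieval); the CUSTOMER table and its columns are hypothetical and not taken from these notes:

    -- Declarative 4GL request: state WHAT is wanted, not HOW to retrieve it
    SELECT CUS_NAME, CUS_PHONE
    FROM   CUSTOMER
    WHERE  CUS_BALANCE > 100
    ORDER BY CUS_NAME;

In a 3GL such as COBOL the programmer would instead have to open the file, loop over its records, test each balance and format the output explicitly.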

Page 3: Database Systems

- Security features such as effective password protection, the ability to lock out parts of files or

parts of the system itself and other measures designed to safeguard data confidentiality are difficult

to program and therefore often omitted in a file system envt.

To Summarize the Limitations of File System Data Mgt.

Requires extensive programming.

There are no ad hoc query capabilities.

System administration can be complex and difficult.

Difficult to make changes to existing structures.

Security features are likely to be inadequate.

Limitations of File-based Approach

Separation and Isolation of data: when data is isolated in separate files, it is more difficult to

access data that should be available.

Duplication of Data: Owing to the decentralised approach taken by each dept., the file-based approach encouraged, if not necessitated, the uncontrolled duplication of data.

Duplication is wasteful; it costs time and money to enter data more than once.

It takes up additional storage space, again with associated costs.

Duplication can lead to loss of data integrity.

Data Dependence: the physical structure and storage of data files and records are defined in the application code. This means that changes to an existing structure are difficult to make.

Incompatible file formats

Fixed queries / Proliferation of application programs: File-based systems are very dependent upon the application developer, who has to write any queries or reports that are required.

No provision for security or integrity

Recovery in the event of hardware/software failure was limited or non-existent

Access to the files was restricted to one user at a time – there was no provision for shared access by staff in the same dept.

Introducing DB and DBMS

• DB (Database)

– shared, integrated computer structure that stores:

• End user data (raw facts)

• Metadata (data about data), through which the end-user data are integrated and managed.

Page 4: Database Systems

Metadata provide a description of the data characteristics and the set of relationships that link the data found within the database. A database resembles a very well-organized electronic filing cabinet in which powerful software (the DBMS) helps manage the cabinet's contents.

• DBMS (database management system):

– Collection of programs that manages database structure and controls access to the data

– Possible to share data among multiple applications or users

– Makes data management more efficient and effective.

Roles:

DBMS serves as the intermediary between the user and the DB. The DBMS receives all application requests and translates them into the complex operations required to fulfil those requests.

DBMS hides much of the DB's internal complexity from the application programs and users.

DBMS uses System Catalog

• A detailed data dictionary that provides access to the system tables (metadata) which describe the database.

• Typically stores:

– names, types, and sizes of data items

– constraints on the data

– names of authorized users

– data items accessible by a user and the type of access

– usage statistics (ref. optimisation)
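In most SQL DBMSs the catalog can itself be queried. A minimal sketch using the standard INFORMATION_SCHEMA views (the STUDENT table name is purely illustrative, and the exact catalog views available vary by product):

    -- List the columns, data types and nullability recorded for one table
    SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  TABLE_NAME = 'STUDENT';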

Role & Advantages of the DBMS

· End users have better access to more and better-managed data

Promotes integrated view of organization's operations

· Minimized Data Inconsistency– Probability of data inconsistency is greatly reduced

· Improved Data Access– Possible to produce quick answers to ad hoc queries

Particularly important when compared to previous historical DBMSs.

· Improved Data Sharing: Creates an envt. in which end users have better access to more and better-managed data; such access makes it possible for end users to respond quickly to changes in their envt.

· Better Data Integration: wider access to well-managed data promotes an integrated view of the organisation's operations and a clearer view of the big picture.

· Data Inconsistency exists when different versions of the same data appear in different

places.

· Increased end-user Productivity: Availability of data, combined with tools that transform data into usable information, empowers end users to make quick, informed decisions that can be the difference between success and failure in the global economy.

First, the DBMS enables data in the DB to be shared among multiple applications or users.

Second, it integrates the many different users' views of the data into a single, all-encompassing data repository. Because data are the crucial raw material from which information is derived, you must have a good way of managing such data.

Page 5: Database Systems

Role & Advantages of the DBMS

DBMS serves as intermediary between the user/applications and the DB, compare this to the

previous file based systems.

Database System: refers to an org. of components that define and regulate the collection, storage,

mgt and use of data within a DB envt.

Types of Databases

Can be classified by users:

· Single User: supports only one user at a time. If user A is using the DB, user B and C must

wait until user A is done.

· Multiuser: supports multiple users at the same time. When a multiuser DB supports a relatively small number of users (usually fewer than 50) or a specific dept. within an organization, it is called a Workgroup Database. When the DB is used by the entire organization and supports many users across many depts., it is called an Enterprise DB.

Can be classified by location:

• Centralized: Supports data located at a single site

• Distributed: Supports data distributed across several sites

Can be classified by use:

• Transactional (or production): OLTP

– Supports a company's day-to-day operations

• Data Warehouse: OLAP

– Stores data used to generate information required to make tactical or strategic decisions

– Often used to store historical data

– Structure is quite different

History of Database Systems

• First-generation

– Hierarchical and Network

• Second generation

– Relational

• Third generation

Page 6: Database Systems

– Object-Oriented

– Object-Relational

DBMS FUNCTIONS

DBMS performs several important functions that guarantee the integrity and consistency of the data

in the DB.

These include:

1. Data Dictionary Mgt.: Stores definitions of the data elements and their relationships (metadata) in a data dictionary. The DBMS uses the data dictionary to look up the required data component structures and relationships, thus relieving you from having to code such complex relationships in each program.

2. Data Storage Mgt: DBMS creates and manages the complex structures required for data storage, thus relieving you of the difficult task of defining and programming the physical data characteristics.

3. Data Transformation and Presentation: It transforms entered data to conform to the required data structures. The DBMS relieves you of the chore of making a distinction between the logical data format and the physical data format.

4. Security Mgt. DBMS creates a security system that enforces user security and data privacy.

Security rules determine which users can access the DB, which data items each user can

access and which data operations (read, add, delete or modify) the user can perform.

5. Multiuser Access Control: To provide data integrity and data consistency.

6. Backup and Recovery Mgt: To ensure data safety and integrity. It allows DBA to perform

routine and special backup and recovery procedures.

7. Data Integrity Mgt: Promotes and enforces integrity rules, thus minimising data

redundancy and maximising data consistency.

8. DB Access lang. and Application Programming Interfaces: provides data access through

a query lang.

9. DB Communication Interfaces: Current-generation DBMSs accept end-user requests via multiple, different network envts, e.g. a DBMS might provide access to the DB via the internet through the use of web browsers such as Mozilla Firefox or Internet Explorer.

Data Models

Hierarchical Model e.g. IMS (IBM's Information Management System)

It was developed in the 1960s to manage large amounts of data for complex manufacturing projects, e.g. the Apollo rocket that landed on the moon in 1969. Its basic logical structure is represented by an upside-down 'tree'.

The HM contains levels, or segments. A segment is the equivalent of a file system's record type. The root segment is the parent of the level 1 segments, which in turn are the parents of the level 2 segments, etc. Other segments below are children of the segment above.

Page 7: Database Systems

In short, the HM depicts a set of one-to-many (1:*) relationships between a parent and its child segments. (Each parent can have many children, but each child has only one parent.)

Limitations:

i. Complex to implement

ii. Difficult to manage

iii. Lack structural independence

Hierarchical Structure diag

Network Model (NM) e.g. IDMS (Integrated Database Management System)

The NM was created to represent complex data relationships more effectively than the HM, to improve DB performance and to impose a DB standard. The lack of a DB standard was troublesome to programmers and application designers because it made designs and applications less portable.

The Conference on Data Systems Languages (CODASYL) created the Database Task Group (DBTG), which defined the following crucial components:

Network Schema: the conceptual org. of the entire DB as viewed by the DB administrator. It includes a definition of the DB name, the record type for each record and the components that make up those records.

Network Subschema: defines the portion of the DB 'seen' by the application programs that actually produce the desired information from the data contained within the DB. The existence of subschema definitions allows all DB programs to simply invoke the subschema required to access the appropriate DB file(s).

Data Mgt Lang. (DML): defines the environment in which data can be managed. To produce the desired standardisation for each of the three components, the DBTG specified three distinct DML components:

- A schema data definition lang. (DDL): enables the DBA to define the schema components.

- Subschema DDL: allows application programs to define the DB components that will be used by the application.

Page 8: Database Systems

- Data Manipulation Lang: to work with the data in the DB.

The NM allows a record to have more than one parent. In the NM, a relationship is called a SET. Each set is composed of at least two record types: an owner record that is equivalent to the hierarchical model's parent, and a member record that is equivalent to the hierarchical model's child. A set represents a 1:* relationship between owner and member.

Network Model diag

The Relational Model e.g Oracle, DB2

RM is implemented through a Sophisticated Relational Database Mgt System (RDBMS). RDBMS

performs same basic functions provided by the hierarchical and network DBMS system.

An important advantage of the RDBMS is its ease of use: the RDBMS manages all of the physical details, while the user sees the relational DB as a collection of tables in which data are stored, and can manipulate and query the data in a way that seems intuitive and logical.

Each table is a matrix consisting of a series of row/column intersections. Tables, also called Relations, are related to each other through the sharing of a field which is common to both entities.

- A relational diagram is a representation of the relational DB's entities, the attributes within those entities and the relationships between those entities.

- A relational table stores a collection of related entities; therefore, a relational DB table resembles a file. The crucial difference between a table and a file is that:

A table yields complete data and structural independence because it is a purely logical structure.

- A reason for the relational DB model's rise to dominance is its powerful and flexible query lang.

An RDBMS uses SQL (a 4GL) to translate user queries into instructions for retrieving the requested data.

Object-Oriented Model (OOM) (check appendix G)

Page 9: Database Systems

In the object-oriented data model (OODM), both data and their relationships are contained in a single structure known as an OBJECT.

Like the relational model's entity, an object is described by its factual content. But quite unlike an entity, an object includes information about the relationships between the facts within the object, as well as info about its relationships with other objects. Therefore, the facts within the object are given greater meaning. The OODM is said to be a semantic data model because semantic indicates meaning.

OO data model is based on the following components:

i. An object is an abstraction of a real-world entity i.e object may be considered equivalent to

an ER model‟s entity. An object represents only one individual occurrence of an entity.

ii. Attributes describe the properties of an object.

iii. Objects that share similar characteristics are grouped in classes. A class is a collection of similar objects with a shared structure (attributes) and behaviour (methods). A class resembles the ER model's entity set. However, a class is different from an entity set in that it contains a set of procedures known as Methods. A class's methods represent real-world actions, such as finding a selected Person's name or changing a Person's name.

iv. Classes are organised in a class hierarchy. The class hierarchy resembles an upside-down

tree in which each class has only one parent e.g CUSTOMER class and EMPLOYEE class

share a parent PERSON class.

v. Inheritance is the ability of an object within the class hierarchy to inherit the attributes and

method of classes above it, for example- 2 classes, CUSTOMER and EMPLOYEE can be

created as subclasses from the class PERSON. In this case, CUSTOMER and EMPLOYEE

will inherit all attributes and methods from PERSON.

Entity Relationship Model

Complex design activities require conceptual simplicity to yield successful results. Although the relational model was a vast improvement over the hierarchical and network models, it still lacked the features that would make it an effective database design tool. Because it is easier to examine structures graphically than to describe them in text, database designers prefer to use a graphical tool in which entities and their relationships are pictured. Thus, the entity relationship (ER) model, or ERM, has become a widely accepted standard for data modelling.

One of the more recent versions of Peter Chen's notation is known as the Crow's Foot Model. The Crow's Foot notation was originally invented by Gordon Everest. In Crow's Foot notation, graphical symbols are used instead of the simple notation, such as 'n' to indicate 'many', used by Chen. The label 'Crow's Foot' is derived from the three-pronged symbol used to represent the 'many' side of the relationship. Although there is a general shift towards the use of UML, many organisations today still use the Crow's Foot notation. This is particularly true in legacy systems which are running on obsolete hardware and software but are vital to the organisation. It is therefore important that you are familiar with both Chen's and Crow's Foot modelling notations.

More recently, the class diagram component of the Unified Modelling Language (UML) has been used to produce entity relationship models. Although class diagrams were developed as a part of the larger UML object-oriented design method, the notation is emerging as the industry data modelling standard.

Page 10: Database Systems

The ERM uses ERDs to represent the conceptual database as viewed by the end user. The ERM's main components are entities, relationships and attributes. The ERD also includes connectivity and cardinality notations. An ERD can also show relationship strength, relationship participation (optional or mandatory), and the degree of relationship (unary, binary, ternary, etc.).

ERDs may be based on many different ERMs.

The Object Oriented Model

Objects that share similar characteristics are grouped in classes. Classes are organized in a class

hierarchy and contain attributes and methods. Inheritance is the ability of an object within the class

hierarchy to inherit the attributes and methods of classes above it.

An OODBMS will use pointers to link objects together. Is this a backwards step?

The OO data model represents an object as a box; all of the object's attributes and relationships to other objects are included within the object box. The object representation of the INVOICE includes all related objects within the same object box. Note that the connectivities (1:1 and 1:*) indicate the relationships of the related objects to the invoice.

The ER model uses 3 separate entities and 2 relationships to represent an invoice transaction. As

customers can buy more than one item at a time, each invoice references one or more lines, one

item per line.

The Relational Model

• Developed by Codd (IBM) in 1970

• Considered ingenious but impractical in 1970

• Conceptually simple

• Computers at the time lacked the power to implement the relational model

• Today, microcomputers can run sophisticated relational database software.

Advantages of DBMS

• Control of data redundancy

• Data consistency

• More information from the same amount of data

• Sharing of data

• Improved data integrity

• Improved security

• Enforcement of standards

• Economy of scale

• Balance conflicting requirements

• Improved data accessibility and responsiveness

Page 11: Database Systems

• Increased productivity

• Improved maintenance through data independence

• Increased concurrency

• Improved backup and recovery services

Disadvantages of DBMS

• Cost of DBMS

– Additional hardware costs

– Cost of conversion

• Complexity

– Size

• Maintenance / Performance?

• Higher dependency / impact of a failure

Degrees of Data Abstraction

ANSI (American National Standards Institute)

SPARC (Standard Planning and Requirements Committee) defined a framework for data modelling

based on degrees of data abstraction

ANSI-SPARC Three Level Architecture

External Model

A specific representation of an external view is known as an External Schema. The use of external views representing subsets of the DB has some important advantages:

It makes it easy to identify the specific data required to support each business unit's operations.

It makes the designer's job easy by providing feedback about the model's adequacy.

Page 12: Database Systems

It helps to ensure security constraints in the DB design.

It makes application program development much simpler.

Conceptual Model

It represents a global view of the entire DB. It is a representation of data as viewed by the entire

organisation. It integrates all ext. views (entities, relationships, constraints and processes) into a

single global view of the entire data in the enterprise known as Conceptual Schema. Most widely

used conceptual model is the ER model. The ER model is used to graphically represent the

conceptual schema.

Advantage:

It provides a relatively easily understood macro-level view of the data envt.

It is independent of both software and hardware.

Software Independence means the model does not depend on the DBMS software used to implement the model.

Hardware Independence means the model does not depend on the hardware used in the

implementation of the model.

Internal Model

Once a specific DBMS has been selected, the internal model maps the conceptual model to the

DBMS. The internal model is the representation of the DB as „seen‟ by the DBMS. An Internal

Schema depicts a specific representation of an internal model, using the DB constructs supported by

the chosen DB. Internal model depends on specific DB software.

When you can change the internal model without affecting the conceptual model, you have Logical Independence. However, the I.M. is also hardware-independent because it is unaffected by the choice of the computer on which the software is installed.

Physical Model

It operates at the lowest level of abstraction, describing the way data are saved on storage media

such as disks or tapes. P.M. requires the definition of both the physical storage devices and the

(physical) access methods required to reach the data within those storage devices, making it both

software and hardware dependent.

- When you can change the physical model without affecting the internal model, you have Physical

Independence. Therefore, a change in storage devices/methods and even a change in OS will not

affect the internal model.

• External Level

– Users' view of the database.

– Describes that part of database that is relevant to a particular user.

• Conceptual Level

– Community view of the database.

– Describes what data is stored in database and relationships among the data.

• Internal Level

– Physical representation of the database on the computer.

– Describes how the data is stored in the database.

The Importance of Data Models

• Data models

– Relatively simple representations, usually graphical, of complex real-world data structures

– Facilitate interaction among the designer, the applications programmer, and the end user.

Page 13: Database Systems

A data model is a (relatively) simple abstraction of a complex real-world data envt. DB designers use data models to communicate with applications programmers and end users. The basic data-modeling components are entities, attributes, relationships and constraints.

Business rules are used to identify and define the basic modeling components within a specific real-world environment.

Database Modelling

Alternate Notations

• Crow's Foot Notation

• Chen Notation

• UML – Unified Modelling Language. Review the coverage provided on the Book's CD in appendix E, regarding these alternative notations.

Data Model Basic Building Blocks

Entity Relationship Diagrams (ERD)

ERD represents the conceptual DB as viewed by the end user. ERD depicts the DB‟s main

components:

• Entity - anything about which data is to be collected and stored

• Attribute - a characteristic of an entity

• Relationship - describes an association among entities

– One-to-many (1:m) relationship

– Many-to-many (m:n) relationship

– One-to-one (1:1) relationship

• Constraint - a restriction placed on the data

Entity Relationship Diagrams

An Entity

• A thing of independent existence on which you may wish to hold data

• Example: an Employee, a Department

An entity is an object of interest to the end user. The word entity in the ERM corresponds to a table, not to a row, in the relational envt. The ERM refers to a specific table row as an entity instance or entity occurrence. In UML notation, an entity is represented by a box that is subdivided into three parts:

- the top part is used to name the entity – a noun, usually written in capital letters.

- the middle part is used to name and describe the attributes.

- the bottom part is used to list the methods. Methods are used only when designing object-relational/object-oriented DB models.

The two terms ER Model / ER Diagram are often used interchangeably to refer to the same thing – a graphical representation of a database. To be more precise, you would refer to the specific notation being used as the model, e.g. Chen or Crow's Foot, describing what type of symbols are used, whereas an actual example of that notation being used in practice would be called an ER diagram. So the Model is the notation specification; the diagram is an actual drawing.

Relationships: an association between entities.

Entity types may bear relationship to one another

• Example: Employee works in a Department

• Recording: which Dept an Emp is in

The relationship could be

Works in

Existence Dependence: an entity is said to be existence-dependent if it can exist in the database only when it is associated with another related entity occurrence. In implementation terms, an entity is

Page 14: Database Systems

existence-dependent if it has a mandatory foreign key – that is, a foreign key attribute that cannot be null.

Relationship Strength: this concept is based on how the primary key of a related entity is defined.

Weak (Non-identifying) Relationship: exists if the primary key of the related entity does not contain a primary key component of the parent entity, e.g. COURSE (CRS_CODE, DEPT_CODE, CRS_CREDIT) and CLASS (CLASS_CODE, CRS_CODE, CLASS_SECTION).

Strong (Identifying) Relationship: exists when the primary key of the related entity contains a primary key component of the parent entity, e.g. COURSE (CRS_CODE, DEPT_CODE, CRS_CREDIT) and CLASS (CRS_CODE, CLASS_SECTION, CLASS_TIME, ROOM_CODE).
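The distinction shows up directly in the DDL. A minimal SQL sketch of the two CLASS designs above (the data types are assumed, COURSE is assumed to already exist with primary key CRS_CODE, and only one of the two variants would be used in a real schema):

    -- Weak (non-identifying): CLASS has its own single-attribute primary key;
    -- CRS_CODE appears only as a foreign key
    CREATE TABLE CLASS (
        CLASS_CODE    VARCHAR(10) PRIMARY KEY,
        CRS_CODE      VARCHAR(10) NOT NULL REFERENCES COURSE (CRS_CODE),
        CLASS_SECTION VARCHAR(5)
    );

    -- Strong (identifying): the parent key CRS_CODE is part of the CLASS primary key
    CREATE TABLE CLASS (
        CRS_CODE      VARCHAR(10) REFERENCES COURSE (CRS_CODE),
        CLASS_SECTION VARCHAR(5),
        CLASS_TIME    VARCHAR(20),
        ROOM_CODE     VARCHAR(10),
        PRIMARY KEY (CRS_CODE, CLASS_SECTION)
    );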

Weak Entities

A weak entity is one that meets two conditions:

1. It is existence-dependent; it cannot exist without the entity with which it has a relationship.

2. It has a primary key that is partially or totally derived from the parent entity in the relationship.

Attributes: are characteristics of entities, for example the STUDENT entity includes the attributes STU_LNAME, STU_FNAME and STU_INITIAL.

Domains: attributes have a domain; a domain is the attribute's set of possible values.

Relationship Degree

A relationship degree indicates the number of entities or participants associated with a relationship.

Unary Relationships: exists when an association is maintained within a single entity.

Binary Relationships: exist when two entities are associated in a relationship.

Ternary Relationship: exists when three entities are associated.

Page 15: Database Systems

Recursive Relationship: is one in which a relationship can exist between occurrences of the same entity set. (Naturally, such a condition is found within a unary relationship.)

Composite Entity (Bridge entity):

This is composed of the primary keys of each of the entities to be connected. An example is converting a *:* relationship into two 1:* relationships, as sketched below.
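A minimal SQL sketch of such a bridge, assuming hypothetical STUDENT and CLASS tables with primary keys STU_NUM and CLASS_CODE (none of these names come from the notes above):

    -- ENROLL bridges the *:* relationship between STUDENT and CLASS;
    -- its primary key is composed of the two parent primary keys
    CREATE TABLE ENROLL (
        STU_NUM      INTEGER     REFERENCES STUDENT (STU_NUM),
        CLASS_CODE   VARCHAR(10) REFERENCES CLASS (CLASS_CODE),
        ENROLL_GRADE CHAR(1),
        PRIMARY KEY (STU_NUM, CLASS_CODE)
    );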

Composite and Simple Attributes.

Composite attribute is an attribute that can be further subdivided to yield additional attributes e.g

ADDRESS, can be subdivided into street, city, state and postcode.

Simple attribute is an attribute that cannot be subdivided e.g age, sex and marital status.

Single-Valued Attributes: attribute that can have only a single value e.g person can have only one

social security number.

Multivalued Attributes: have many values e.g person may have several university degrees.

Derived attribute: an attribute whose value is calculated from other attributes, e.g. an employee's age, EMP_AGE, may be found by computing the integer value of the difference between the current date and EMP_DOB.

Page 16: Database Systems

Properties of an entity we want to record

• Example: Employee number, name

• The attributes could be

EMP_NO, EMP_NAME

Relation Types

• Relation between two entities Emp and Dept

• More than one relation between entities

Lecturer and Student

Teaches - Personal Tutor

• Relationship with itself

Called Recursive

Part made up of parts

Degree and cardinality are two important properties of the relational model.

The word relation, also known as a Dataset in Microsoft Access, is based on the mathematical set theory from which Codd derived his model. Since the relational model uses attribute values to establish relationships among tables, many database users incorrectly assume that the term relation refers to such relationships. Many then incorrectly conclude that only the relational model permits the use of relationships.

A Relation Schema is a textual representation of the DB tables, where each table is described by its name followed by the list of its attributes in parentheses, e.g. LECTURER (EMP_NUM, LECTURER_OFFICE, LECTURER_EXTENSION).

Rows sometimes referred to as Records

Columns are sometimes labeled as Fields.

Tables occasionally labeled as Files.

A DB table is a logical rather than a physical concept, whereas the file, the record and the field describe physical concepts.
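For comparison, the LECTURER relation schema above might be implemented with DDL along the following lines (a minimal sketch; the data types are assumed, not specified in these notes):

    CREATE TABLE LECTURER (
        EMP_NUM            INTEGER PRIMARY KEY,   -- identifier / primary key
        LECTURER_OFFICE    VARCHAR(20),
        LECTURER_EXTENSION VARCHAR(10)
    );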

Properties of a Relation

1. A table is perceived as 2-dimensional structure composed of rows and columns.

2. Each table row (tuple) represents a single entity occurrence within the entity set and must be

distinct. Duplicate rows are not allowed in a relation.

3. Each table column represents an attribute and each column has a distinct name

4. Each cell/column/row intersection in a relation should contain only single data value.

5. All values in a column must conform to the same data format.

6. Each column has a specific range of values known as the Attribute Domain.

7. The order of the rows and columns is immaterial to the DBMS.

8. Each table must have an attribute or a combination of attributes that uniquely identifies each

row.

Cardinality of Relationship

• Determines the number of occurrences from one entity to another.

• Example: for each Dept there are a number of Employees that work in it.

Cardinality is used to express the maximum number of entity occurrences associated with one occurrence of the related entity.

Participation determines whether all occurrences of an entity participate in the relationship or not.

Page 17: Database Systems

Three Types of Cardinality

• One-to-Many

Dept – Emp

• Many-to-Many

Student – Courses

Must be resolved into 1:m

• One-to-One

Husband – Wife (UK Law)

Optionality / Participation

Identifies the minimum cardinality of relationship between entities

0 - May be related

1 - Must be related

Developing an ER Diagram

The process of database design is an iterative rather than a linear or sequential process. An iterative

process is thus one based on repetition of processes and procedures.

1. Create a detailed narrative of the organization's description of operations.

2. Identify the business rules based on the descriptions of operations.

3. Identify Entities

4. Work out Relationships

5. Develop an initial ERD

6. Work out Cardinality/ Optionality

7. Identify the primary and foreign keys

8. Identify Attributes

9. Revise and Review the ERD

Types of Keys

• Primary Key - The attribute which uniquely identifies each entity occurrence

• Candidate Key - one of a number of possible attributes which could be used as the key field

• Composite Key - when more than one attribute is required to identify each occurrence

Composite Primary Key: a primary key composed of more than one attribute.

• Foreign Key - when an entity has a Key attribute from another entity stored in it

Superkey – an attribute ( or combination of attributes) that uniquely identifies each row in a table.

Candidate – A superkey that does not contain a subset of attributes that is itself a superkey.

Primary key – A candidate key selected to uniquely identify all other attribute values in any given

row. It cannot contain null entries.

Identifiers (Primary Keys): the ERM uses identifiers to uniquely identify each entity instance. Identifiers are underlined in the ERD; key attributes are also underlined when writing the relational schema.

Secondary Key – An attribute (or combination of attributes) used strictly for data retrieval purposes.

Page 18: Database Systems

Foreign Key – An attribute (or combination of attributes) in one table whose values must either match the primary key in another table or be null.

The basic UML ERD

The basic Crow’s foot ERD

Example Problem 1

• A college library holds books for its members to borrow. Each book may be written by more than

one author. Any one author may have written several books. If no copies of a wanted book are

currently in stock, a member may make a reservation for the title until it is available. If books are

not returned on time a fine is imposed and if the fine is not paid the member is barred from loaning

any other books until the fine is paid.

ER Diag One

Page 19: Database Systems

Example Problem 2

Page 20: Database Systems

A local authority wishes to keep a database of all its schools and the school children that are

attending each school. The system should also be able to record teachers available to be employed

at a school, and be able to show which teachers teach which children. Each school has one head teacher whose responsibility it is to manage their individual school; this should also be modelled.

Example Problem 3

A university runs many courses. Each course consists of many modules, each module can

contribute to many courses. Students can attend a number of modules but first they must possess the

right qualifications to be accepted on a particular course. Each course requires a set of qualifications

at particular grades to allow students to be accepted; for example, the Science course requires at least two 'A' levels, one of which must be Mathematics at grade 'B' or above. There is the normal teaching student/lecturer relationship, but you will also have to record personal tutor assignments.

Review Questions ch1

Discuss each of the following: Data, Field, Record, File

What is data redundancy and which characteristic of the file system can lead to it?

Discuss the lack of data independence in file systems.

What is a DBMS and what are its functions?

What is structural independence and why is it important?

Explain the diff betw data and information.

What is the role of DBMS and what are its advantages

List and describe the diff types of databases.

What re the main components of a database system?

What is metadata?

Explain why database design is important.

What re the potential costs of implementing a database system?

Review Questions ch2

1. Discuss the importance of data modelling.

2. What is a business rule, and what is its purpose in data modelling?

3. How would you translate business rules into data model components?

5. What three languages were adopted by the DBTG to standardize the basic network data model,

and why was such standardisation important to users and designers?

6. Describe the basic features of the relational data model and discuss their importance to the end

user and the designer.

7. Explain how the entity relationship (ER) model helped produce a more structured relational

database design envt.

9. Why is an object said to have greater semantic content than an entity?

10. What is the difference between an object and a class in the object-oriented data model

(OODM)?

12. What is an ERDM, and what role does it play in the modern (production) database envt?

14. What is a relationship, and what three types of relationships exist?

15. Give an example of each of the three types of relationships.

16. What is a table and what role does it play in the relational model?

17. What is a relational diagram? Give an example.

18. What is logical independence?

19. What is physical independence?

20. What is connectivity? Draw ERDs to illustrate connectivity.

Page 21: Database Systems

Review Questions ch.3

1. What is the difference between a database and a table?

2. What does a database expert mean when he/she says that a database displays both entity

integrity and referential integrity?

3. Why are entity integrity and referential integrity important in a database?

Review Questions ch5

1. What two conditions must be met before an entity can be classified as a weak entity? Give an

example of a weak entity.

2. What is a strong (or identifying) relationship?

4. What is a composite entity and when is it used?

6. What is a recursive relationship? Give an example.

7. How would you graphically identify each of the following ERM components in a UML model:

I. An Entity

II. The multiplicity (0:*)

8. Discuss difference between a composite key and a composite attribute. How would each be

indicated in an ERD?

9. What two courses of action are available to a designer when he or she encounters a multivalued

attribute?

10. What is a derived attribute? Give example.

11. How is a composite entity represented in an ERD and what is its function? Illustrate using the

UML notation.

14. What three (often conflicting) database requirements must be addressed in database design?

15. Briefly, but precisely, explain the diff betw single-valued attributes and simple attributes. Give

an example of each.

16. What are multivalued attributes and how can they be handled within the database design?

Enhanced Entity Relationship (EER) Modelling ( Extended Entity Relationship Model)

This is the result of adding more semantic constructs to the original entity relationship (ER) model.

Examples of the additional concepts in EER models are:

Specialisation/Generalization

Super Class/Sub Class

Aggregation

Composition

In modelling terms, an entity supertype is a generic entity type that is related to one or more entity subtypes, where the entity supertype contains the common characteristics and the entity subtypes contain the unique characteristics of each entity subtype.

Specialization Hierarchy

Entity supertypes and subtypes are organised in a specialization hierarchy. The specialization

hierarchy depicts the arrangement of higher-level entity supertypes (parent entities) and lower-level

entity subtypes (child entities).

In UML notation subtypes are called Subclasses and supertypes are known as Superclasses

Specialization and Generalization

Page 22: Database Systems

Specialization is the top-down process of identifying lower-level, more specific entity subtypes

from a higher-level entity supertype. Specialization is based on grouping unique characteristics and

relationships of the subtypes.

Generalization is the bottom-up process of identifying a higher-level, more generic entity supertype

from lower-level entity subtypes. Generalization is based on grouping common characteristics and

relationships of the subtypes.

Superclass – An entity type that includes one or more distinct sub-groupings of its occurrences

therefore a generalization.

Subclass - A distinct sub-grouping of occurrences of an entity type therefore a specialization.

Attribute Inheritance: An entity in a subclass represents the same 'real world' object as in the superclass, and may possess subclass-specific attributes as well as those associated with the superclass.

Composite and Aggregation

Aggregation is whereby a larger entity can be composed of smaller entities, e.g. a University is composed of Departments.

A special case of aggregation is known as Composition. This is a much stronger relationship than

aggregation, since when the parent entity instance is deleted, all child entity instances are

automatically deleted.

An Aggregation construct is used when an entity is composed of a collection of other entities, but

the entities are independent of each other.

A Composition construct is used when two entities are associated in an aggregation association

with a strong identifying relationship. That is, deleting the parent deletes the children instances.

Normalization of Database Tables

This is a process for evaluating and correcting table structures to minimise data redundancies,

thereby reducing the likelihood of data anomalies.

Normalization works through a series of stages called Normal Forms, i.e. first normal form (1NF), second normal form (2NF) and third normal form (3NF). From a structural point of view, 2NF is better than 1NF and 3NF is better than 2NF. For most business database design purposes, 3NF is as high as we need to go in the normalization process. The highest level of normalization is not always the most desirable; most business designs use 3NF as the ideal normal form.

A table is in 1NF when all key attributes are defined and when all remaining attributes are dependent on the primary key. However, a table in 1NF can still contain both partial and transitive dependencies. (A partial dependency is one in which an attribute is functionally dependent on only a part of a multiattribute primary key. A transitive dependency is one in which one non-key attribute is functionally dependent on another non-key attribute.) A table with a single-attribute primary key cannot exhibit partial dependencies.

A table is in 2NF when it is in 1NF and contains no partial dependencies. Therefore, a 1NF table is automatically in 2NF when its primary key is based on only a single attribute. A table in 2NF may still contain transitive dependencies.

A table is in 3NF when it is in 2NF and contains no transitive dependencies. When a table has only

a single candidate key, a 3NF table is automatically in BCNF (Boyce-Codd Normal Form).
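A small worked example (the table and attribute names are hypothetical, chosen only to illustrate the steps): suppose a 1NF table INV_LINE (INV_NUM, PROD_CODE, INV_DATE, PROD_DESCRIPTION, LINE_QTY) with primary key (INV_NUM, PROD_CODE). INV_NUM → INV_DATE and PROD_CODE → PROD_DESCRIPTION are partial dependencies, so reaching 2NF splits the table into INVOICE (INV_NUM, INV_DATE), PRODUCT (PROD_CODE, PROD_DESCRIPTION) and LINE (INV_NUM, PROD_CODE, LINE_QTY). If INVOICE also carried CUS_NUM and CUS_NAME, then CUS_NUM → CUS_NAME would be a transitive dependency, and reaching 3NF would move CUS_NAME out into CUSTOMER (CUS_NUM, CUS_NAME), leaving CUS_NUM in INVOICE as a foreign key.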

Page 23: Database Systems

Normalization Process

Checking ER model using functional dependency

Result - Removes any data duplication problems

Saves excess storage space

Removes insertion, update and deletion anomalies.

Functional Dependency A → B

B is functionally dependent on A.

If we know A then we can find B.

Studno → Studname

Review Questions

1. What is an entity supertype and why is it used?

2. What kinds of data would you store in an entity subtype?

3. What is a specialization hierarchy?

Review Questions

1. What is normalization?

2. When is a table in 1NF?

3. When is a table in 2NF?

4. When is a table in 3NF?

5. When is a table in BCNF?

7. What is a partial dependency? With what normal form is it associated?

8. What three data anomalies are likely to be the result of data redundancy? How can such

anomalies be eliminated?

9. Define and discuss the concept of transitive dependency.

11. Why is a table whose primary key consists of a single attribute automatically in 2NF when it is in 1NF?

Page 24: Database Systems

Relational Algebra and SQL

Relational DB Roots

Relational algebra and relational calculus are the mathematical basis for 'relational databases'.

Proposed by E.F. Codd in 1971 as the basis for defining the relational model.

Relational algebra

– Procedural, describes operations

Relational calculus

– Non-procedural / Declarative

Relational Algebra is a collection of formal operations acting on relations which produce new relations as a result. The algebra is based on predicate logic and set theory and is regarded as a procedural lang. Relational algebra defines a theoretical way of manipulating table contents through a number of relational operators.

Set Theory

Page 25: Database Systems

Relational Algebra Operators

• UNION

• INTERSECT

• DIFFERENCE

• SELECT (Restrict)

• PROJECT

• CARTESIAN PRODUCT

• DIVISION

• JOIN

The SELECT operator, denoted by σθ, is formally defined as σθ(R) or σ<criterion>(RELATION), where σθ(R) is the set of specified tuples of the relation R and θ is the predicate (or criterion) used to extract the required tuples.

The PROJECT operator returns all values for selected attributes and is formally defined as Πa1...an(R) or Π<list of attributes>(RELATION).
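In SQL terms (a minimal sketch, using a hypothetical SUPPLIER table with columns SNAME and CITY), the algebra's SELECT corresponds to the WHERE clause and PROJECT to the SELECT column list:

    -- σ CITY = 'London' (SUPPLIER): keep only the matching rows
    SELECT * FROM SUPPLIER WHERE CITY = 'London';

    -- Π SNAME, CITY (SUPPLIER): keep only the chosen columns
    -- (DISTINCT mirrors the algebra, which removes duplicate tuples)
    SELECT DISTINCT SNAME, CITY FROM SUPPLIER;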

Relational Operators

• Union R U S

– builds a relation consisting of all tuples appearing in either or both of two specified

relations.

• Intersection R ∩ S

– Builds a relation consisting of all tuples appearing in both of two specified relations.

• Difference (complement) R - S

– Builds a relation consisting of all tuples appearing in the first and not the second of two

specified relations.

Union

Page 26: Database Systems

Intersection

Difference
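The three set operators above map directly onto SQL set operators. A minimal sketch, assuming two union-compatible tables R and S (the names are illustrative; DIFFERENCE is spelled EXCEPT in standard SQL and MINUS in some products):

    SELECT * FROM R
    UNION          -- replace with INTERSECT or EXCEPT for the other two operators
    SELECT * FROM S;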

Page 27: Database Systems

• Select (Restrict) σa(R) – extracts specified tuples from a specified relation.

• Project Πa,b(R) – extracts specified attributes from a specified relation.

• Cartesian product R x S – builds a relation from two specified relations consisting of all possible concatenated pairs of tuples, one from each of the two specified relations.

Cartesian Product Example

The Cartesian product of R1(a1, a2, ..., an) with cardinality i and R2(b1, b2, ..., bm) with cardinality j is a relation R3 with degree k = n + m, cardinality i * j and attributes (a1, a2, ..., an, b1, b2, ..., bm). This can be denoted R3 = R1 x R2.

• Division R / S

– Takes two relations, one binary and one unary, and builds a relation consisting of all

values of one attribute of the binary relation that match (in the other attribute) all values in the

unary relation.

• Join R ⋈ S

– Builds a relation from two specified relations consisting of all possible concatenated pairs

of

Page 28: Database Systems

tuples, one from each of the two specified relations, such that in each pair the two tuples satisfy

some specified condition.

The DIVISION of two relations R1(a1, a2, ..., an) with cardinality i and R2(b1, b2, ..., bm) with cardinality j, where the attributes of R2 are a subset of those of R1, is a relation R3 with degree k = n - m and cardinality at most i ÷ j.

The JOIN of two relations R1(a1, a2, ..., an) and R2(b1, b2, ..., bm) is a relation R3 with degree k = n + m and attributes (a1, a2, ..., an, b1, b2, ..., bm) that satisfy a specified join condition.

Division

See page 129

Equijoin Example

i. Compute R1 x R2: this first performs a Cartesian product to form all possible combinations of the rows of R1 and R2.

ii. Restrict the Cartesian product to only those rows where the values in certain columns match. See page 131

Page 29: Database Systems

Secondary Algebraic Operators

Intersection    R ∩ S = R − (R − S)

Division        R ÷ S = ΠA(R) − ΠA((ΠA(R) × S) − R), where A is the set of attributes of R not in S

θ-Join          R ⋈θ S = σθ(R × S)

Equijoin        R ⋈R.a=S.b S = σR.a=S.b(R × S)

Natural Join    R ⋈ S = ΠA(σR.a=S.a(R × S)), where the selection equates each common attribute and the projection removes the duplicate copies

Semijoin        R ⋉ S = ΠA(R ⋈ S), where A is the set of attributes of R

Example Tables

S1:
S#   SNAME   CITY
S1   Smith   London
S4   Clark   London

S2:
S#   SNAME   CITY
S1   Smith   London
S2   Jones   Paris

P:
P#   PNAME   WEIGHT
P1   Bolt    10
P2   Nut     15
P3   Screw   15

SP:
S#   P#   QTY
S1   P1   10
S1   P2   5
S4   P2   7
S2   P3   8

Union S1 ∪ S2

Produce a table consisting of the rows in either S1 or S2:

S#   SNAME   CITY
S1   Smith   London
S4   Clark   London
S2   Jones   Paris

Intersection S1 ∩ S2

Produce a table consisting of the rows in both S1 and S2:

S#   SNAME   CITY
S1   Smith   London

Difference S1 – S2

Produce a table consisting of the rows in S1 and not in S2:

S#   SNAME   CITY
S4   Clark   London

Page 30: Database Systems

Restriction σ CITY='London' S2

Extract the rows from a table that meet a specific criterion:

S#   SNAME   CITY
S1   Smith   London

Project Π PNAME P

Extract the values of specified columns from a table:

PNAME
Bolt
Nut
Screw

Cartesian product S1 x P

Produce a table of all combinations of rows from two other tables:

S#   SNAME   CITY     P#   PNAME   WEIGHT
S1   Smith   London   P1   Bolt    10
S1   Smith   London   P2   Nut     15
S1   Smith   London   P3   Screw   15
S4   Clark   London   P1   Bolt    10
S4   Clark   London   P2   Nut     15
S4   Clark   London   P3   Screw   15

Divide P / S

Produce a new table by selecting a column from the rows in P that match every row in S:

P:
PARTNAME   S#
Bolt       1
Nut        1
Screw      1
Washer     1
Bolt       2
Screw      2
Washer     2
Bolt       3
Nut        3
Washer     3

Page 31: Database Systems

S:
S#
1
2
3

Result – which parts does every supplier supply?

PARTNAME
Bolt
Washer

Natural Join – you must select only the rows in which the common attribute values match. You could also do a right outer join or a left outer join to select the rows that have no matching values in the other related table.

An Inner Join – one in which only rows that meet a given criterion are selected.

Outer Join – returns the matching rows as well as the rows with unmatched attribute values for one of the tables being joined.

Natural Join S1 ⋈ SP

Produce a table from two tables on matching columns:

S#   SNAME   CITY     P#   QTY
S1   Smith   London   P1   10
S1   Smith   London   P2   5
S4   Clark   London   P2   7

Read pg 132 - 136

Consider: Get supplier numbers and cities for suppliers who supply part P2.

Algebra

Join relation S on S# with SP on S#

Restrict the results of that join to tuples with P# = P2

Project the result of that restriction on S# and City

Calculus

Get S# and city for suppliers such that there exists a shipment SP with the same S# value and with

P# value P2.
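The same request expressed in SQL (a minimal sketch, assuming a supplier relation S and shipment relation SP as in the algebra steps above; because of the # character, some DBMSs would require these column names to be written as quoted identifiers):

    SELECT S.S#, S.CITY
    FROM   S JOIN SP ON S.S# = SP.S#
    WHERE  SP.P# = 'P2';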

SQL – Structured Query Language

SQL is a non-procedural lang.

Data Manipulation Language (DML)

Data Definition Language (DDL)

Page 32: Database Systems

Data Control Language (DCL)

Embedded and Dynamic SQL

Security

Transaction Management

C/S execution and remote DB access

Types of Operations

• Data Definition Language DDL

– Define the underlying DB structure

– SQL includes commands to create DB objects such as tables, indexes and views, e.g. CREATE TABLE, NOT NULL, UNIQUE, PK, FK, CREATE INDEX, CREATE VIEW, ALTER TABLE, DROP TABLE, DROP INDEX, DROP VIEW

Data Definition Language

Create / Amend / Drop a table

Specify integrity checks

Build indexes

Create virtual Views of a table

• Data Manipulation Language

– Retrieving and updating the data

– Includes commands to INSERT, UPDATE, DELETE and retrieve data within the DB

tables. E.g INSERT, SELECT, WHERE, GROUP BY, HAVING, ORDER BY,

UPDATE, DELETE, COMMIT, ROLLBACK

Data Manipulation Language

Query the DB to show selected data

Insert, delete and update table rows

Control transactions - commit / rollback
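A minimal sketch of these DML commands against the SP (shipments) table used in the relational algebra examples (the values are illustrative, and the # in the column names may need quoting in some products):

    INSERT INTO SP (S#, P#, QTY) VALUES ('S2', 'P1', 20);   -- add a new shipment row
    UPDATE SP SET QTY = 25 WHERE S# = 'S2' AND P# = 'P1';   -- change it
    DELETE FROM SP WHERE S# = 'S2' AND P# = 'P1';           -- remove it
    COMMIT;   -- make the changes permanent (ROLLBACK would undo them instead)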

• Data Control Language

- Control Access rights to parts of the DB

• GRANT to allow specified users to perform specified tasks.

• DENY to disallow specified users from performing specified tasks.

• REVOKE to cancel previously granted or denied permissions.

• UPDATE to allow a user to update records

• READ to allow a user only to view the data, not to edit the database

• DELETE allows a user to delete records in a Database
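A minimal sketch of the core DCL commands (clerk1 is a hypothetical user, and the exact privilege names and the availability of DENY vary between DBMS products):

    GRANT SELECT, UPDATE ON SP TO clerk1;    -- allow clerk1 to query and change SP
    REVOKE UPDATE ON SP FROM clerk1;         -- withdraw the update permission again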

Reading the Syntax

UPPER CASE = reserved words

lower case = user defined words

Vertical bar | = a choice i.e. asc|desc

Curly braces { } = choice from list

Square brackets [ ] = optional element

Dots … = optional repeating items

Syntax of SQL

SELECT [ALL | DISTINCT] {[table.]* | expression [alias], ...}

FROM table [alias] [, table [alias]] ...

[WHERE condition]

[GROUP BY expression [, expression] ...

Page 33: Database Systems

[HAVING condition]]

[ORDER BY {expression | position}[ASC|DESC]]

[{UNION | INTERSECT | MINUS} query]

Purpose of the Commands

SELECT Specifies which columns to appear

FROM Specifies which table/s to be used

WHERE Applies restriction to the retrieval

GROUP BY Groups rows with the same column value

HAVING Adds restriction to the groups retrieved

ORDER BY Specifies the order of the output
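A worked example of these clauses, using the SP table from the natural join example (S# written as SNO): total quantity shipped per supplier, keeping only suppliers whose total exceeds 10 and listing the largest totals first. With the sample SP data, only S1 (total 15) qualifies.

SELECT   SNO, SUM(QTY) AS TOTAL_QTY
FROM     SP
GROUP BY SNO
HAVING   SUM(QTY) > 10
ORDER BY TOTAL_QTY DESC;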

A DB schema is a group of DB objects – such as tables and indexes – that are related to each other:
CREATE SCHEMA AUTHORIZATION {creator};

Creating Table Structures:

CREATE TABLE tablename (
    column1 datatype [constraint],
    column2 datatype [constraint],
    PRIMARY KEY (column1),
    FOREIGN KEY (column2) REFERENCES othertablename (column)
    [, CONSTRAINT constraintname ...] );

Foreign Key Constraint definition ensures that:

You cannot delete a row in the referenced (parent) table while at least one row in the referencing (child) table still points to it.

On the other hand, if a change is made to the referenced key value, that change must be reflected automatically in the rows that reference it.

Not Null constraint: is used to ensure that a column does not accept nulls.

Unique constraint: is used to ensure that all values in a column are unique.

Default Constraint is used to assign a value to an attribute when a new row is added to a table.

Check Constraint: if the condition specified for the attribute is met (that is, the condition is true), the data are accepted for that attribute.
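A hypothetical VENDOR / PRODUCT pair (not part of the original notes) illustrating the constraint types above in one place:

CREATE TABLE VENDOR (
    V_CODE   INTEGER      PRIMARY KEY,
    V_NAME   VARCHAR(35)  NOT NULL,
    V_PHONE  CHAR(12)     UNIQUE );

CREATE TABLE PRODUCT (
    P_CODE   CHAR(10)     PRIMARY KEY,
    P_DESC   VARCHAR(35)  NOT NULL,
    P_PRICE  NUMERIC(8,2) DEFAULT 0.00,          -- default constraint
    P_QTY    INTEGER      CHECK (P_QTY >= 0),    -- check constraint
    V_CODE   INTEGER,
    FOREIGN KEY (V_CODE) REFERENCES VENDOR (V_CODE) );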

The COMMIT and ROLLBACK commands are used to ensure DB update integrity in transaction mgt.

The EXISTS special operator: EXISTS can be used wherever there is a requirement to execute a

command based on the result of another query.
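For example (a sketch using the supplier tables from earlier, columns written as SNO/SNAME), suppliers that have at least one shipment recorded in SP:

SELECT SNAME
FROM   S1
WHERE  EXISTS (SELECT *
               FROM   SP
               WHERE  SP.SNO = S1.SNO);   -- correlated subquery: true if a matching shipment exists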

A VIEW is a virtual table based on a SELECT query. The query can contain columns, computed

columns, aliases and aggregate functions from one or more tables

CREATE VIEW viewname AS SELECT query;
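A small sketch over the hypothetical PRODUCT table defined above, exposing only low-stock rows:

CREATE VIEW LOW_STOCK AS
    SELECT P_CODE, P_DESC, P_QTY
    FROM   PRODUCT
    WHERE  P_QTY < 10;

SELECT * FROM LOW_STOCK;   -- a view is queried like an ordinary table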

Embedded SQL refers to the use of SQL statements within an application programming lang. e.g

COBOL, C++, ASP, Java and .NET. The language in which the SQL statements are embedded is called the host language. Embedded SQL is still the most common approach to maintaining procedural capabilities in DBMS-based applications.

Get remaining note from the slide pg 12 - 18


Review Questions

1. What are the main operations of relational algebra?

2. What is the Cartesian product? Illustrate your answer with an example.

3. What is the diff betw PROJECTION and SELECTION?

4. Explain the diff betw natural join and outer join?

DBMS Optimization

Database Performance- Tuning Concepts

- The goal of database performance tuning is to execute queries as fast as possible.

- Database performance tuning refers to a set of activities and procedures designed to reduce the response time of the DB system, i.e. to try to ensure that an end-user query is processed by the DBMS in the minimum amount of time.

- The performance of a typical DBMS is constrained by 3 main factors:

i. CPU Processing Power

ii. Available primary Memory (RAM)

iii. Input/Output (Hard disk and network) throughput.

System Resource            Client                                   Server
Hardware  CPU              Fastest possible                         Multiple processors, fastest possible
                                                                    (e.g. quad-core Intel 2.66GHz)
          RAM              Maximum possible                         Maximum possible (e.g. 64GB)
          Hard disk        Fast IDE hard disk with sufficient       Multiple high-speed, high-capacity
                           free hard disk space                     disks (e.g. 750GB)
          Network          High-speed connection                    High-speed connection
Software  Operating system Fine-tuned for best client               Fine-tuned for best server
                           application performance                  application performance
          Network          Fine-tuned for best throughput           Fine-tuned for best throughput
          Application      Optimize SQL in the client application   Optimize the DBMS for best performance

The system performs best when its hardware and software resources are optimized. Fine-tuning the

performance of a system requires a holistic approach, i.e all factors must be checked to ensure that

each one operates at its optimum level and has sufficient resources to minimize the occurrence of

bottlenecks.

Note: Good DB performance starts with good DB design. No amount of fine-tuning will make a

poorly designed DB perform as well as a well-designed DB.


Performance Tuning: Client and Server

- On the client side, the objective is to generate a SQL query that returns the correct answer in the least amount of time, using the minimum amount of resources at the server end. The activities required to achieve that goal are commonly referred to as SQL performance tuning.

- On the server side, the DBMS environment must be properly configured to respond to clients' requests in the fastest way possible, while making optimum use of existing resources. The activities required to achieve that goal are commonly referred to as DBMS performance tuning.

DBMS Architecture

It is represented by the processes and structures (in memory and in permanent storage) used to

manage a DB.

[DBMS architecture diagram]

DBMS Architecture Component and Functions

- All data in DB are stored in DATA FILES.

A data file can contain rows from one single table, or it can contain rows from many diff tables. The DBA determines the initial size of the data files that make up the DB.

Data files can automatically expand in predefined increments known as extents. For example, if more space is required, the DBA can define that each new extent will be in 10KB or 10MB increments.

Data files are generally grouped into file groups or table spaces. A table space or file group is a logical grouping of several data files that store data with similar characteristics.

- The DBMS retrieves data from permanent storage and places it in RAM (the data cache).

- The SQL cache or procedure cache is a shared, reserved memory area that stores the most recently executed SQL statements or PL/SQL procedures, including triggers and functions.


- Data Cache or Buffer Cache is a shared, reserved memory area that stores the most recently

accessed data blocks in RAM.

-To move data from permanent storage (data files) to the RAM (data cache), the DBMS issues I/O

requests and waits for the replies. An input/output request is a low-level (read or write) data access

operation to/from computer devices. The purpose of the I/O operation is to move data to and from

diff computer component or devices.

- Working with data in the data cache is many times faster than working with data in data files because the DBMS does not have to wait for the hard disk to retrieve the data.

- Majority of performance-tuning activities focus on minimising number of I/O operations.

Processes are:

Listener: listens for clients' requests and hands the processing of the SQL requests to other DBMS processes.

User: DBMS creates a user process to manage each client session

Scheduler: schedules the concurrent execution of SQL requests.

Lock Manager: manages all locks placed on DB objects

Optimizer: analyses SQL queries and finds the most efficient way to access the data.

Database Statistics: refers to a number of measurements about DB objects, such as tables and indexes, and about available resources, such as the number of processors used, processor speed and temporary space available. Those statistics give a snapshot of DB characteristics.

Reasons for DBMS Optimiser

DBMS prevents direct access to DB

Optimiser is part of the DBMS

Optimiser processes user requests

Removes the need for knowledge of the data format – hence data independence

Reference to data dictionary

- Therefore increased productivity

- Provides Ad-hoc query processing.

Query Processing

DBMS processes queries in 3 phases:

Parsing: DBMS parses the SQL query and chooses the most efficient

access/execution plan.

Execution: DBMS executes the SQL query using the chosen execution plan.

Fetching: DBMS fetches the data and sends the result set back to the client.

The SQL parsing activities are performed by the query optimiser.

The Query Optimiser – analyses the SQL query and finds the most efficient way to access the data.

Parsing a SQL query requires several steps:

Interpretation

- Syntax Check: validate SQL statement

- Validation: confirms existence (table/Attribute)

- Translation: into relational algebra

- Relational Algebra optimisation

- Strategy Selection – execute plan

- Code generation: executable code


Accessing (I/O disk access): read data from the physical data files and generate the result set.

Processing Time: (CPU Computation) – process data (cpu)

Query Optimisation:

Is the central activity during the parsing phase of query processing. In this phase, the DBMS must choose which indexes to use, how to perform join operations, which table to use first, and so on.

Indexes facilitate searching, sorting and using aggregate functions and even join operations. The

improvement in data access speed occurs because an index is an ordered set of values that contain

the index key and pointers.

An Optimizer is used to work out how to retrieve the data in the most efficient way from a database.

Types of Optimisers

Heuristic (Rule-based): uses a set of preset rules and points to determine the best approach to

execute a query.

-15 rules, ranked in order of efficiency, particular access path for a table only chosen if

statement contains a predicate or other construct that makes that access path available.

-Score assigned to each execution strategy using these rankings and strategy with best

(lowest) selected.

The Rule Based (heuristic) optimizer – uses a set of rules to quickly choose between alternate

options to retrieve the data. It has the advantage of quickly arriving at a solution with a low

overhead in terms of processing, but the disadvantage of possibly not arriving at the most optimal

solution.

Cost Estimation (Cost based): uses sophisticated algorithm based on statistics about the objects

being accessed to determine the best approach to execute a query. The optimiser process adds up

the processing costs, the I/O costs and the resource cost (RAM and temporary space) to come up

with the total cost of a given execution plan.

The Cost Based optimizer – uses statistics which the DBA instructs to be gathered from the

database tables and based on these values it estimates the expected amount of disk I/O and CPU

usage required for alternate solutions. It subsequently chooses the solution with the lowest cost and

executes it. It has the advantage of being more likely to arrive at an optimal solution, but the

disadvantage of taking more time with a higher overhead in terms of processing requirements.

Cost-based + hints:

The Cost Based optimizer (with Hints) – is the same as the Cost Based optimizer with the additional

facility of allowing the DBA to supply Hints to the optimizer, which instructs it to carry out certain

access methods and therefore eliminates the need for the optimizer to consider a number of

alternative strategies. It has the advantage of giving control to the DBA who may well know what

would be the best access method based on the current database data, plus the ability to quickly

compare alternate execution plans, but it has the disadvantage of taking us back to „hard coding‟

where the instructions on retrieving data are written into the application code. This could lead to the

need to rewrite application code in the future when the situation changes.

Optimiser hints are special instructions for the optimiser that are embedded inside the SQL

command text.
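For instance, in Oracle's SQL dialect a hint is written as a specially formatted comment immediately after the SELECT keyword; the index name below is hypothetical:

SELECT /*+ INDEX(S S1_CITY_IX) */ SNO, SNAME   -- ask the optimiser to use index S1_CITY_IX on S1(CITY)
FROM   S1 S
WHERE  CITY = 'London';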

Query Execution Plan (QEP)


SELECT ENAME
FROM   EMP E, DEPT D
WHERE  E.DEPTNO = D.DEPTNO
AND    DNAME = 'RESEARCH';

OPTION 1: JOIN – SELECT – PROJECT
OPTION 2: SELECT – JOIN – PROJECT

To calculate the QEP and compare the cost of both strategies: [diagram]

Cost-based:

Make use of statistics in data dictionary

No of rows

No of blocks

No of occurrences

Largest/smallest value

Then calculates the 'cost' of alternate solutions to the query

Statistics

Cost-based Optimiser: depends on statistics for all tables, clusters and indexes accessed by the query.

It is the users' responsibility to generate these statistics and keep them current.

The DBMS_STATS package can be used to generate and manage statistics and histograms. Some DBMSs also provide auto-update and auto-create statistics options in their initialization parameters.

Gathering Statistics: use ANALYZE –

ANALYZE TABLE TA COMPUTE STATISTICS;      (exact: every row is read)

ANALYZE TABLE TA ESTIMATE STATISTICS;     (estimate: based on a % selection / sample of rows)

ANALYZE INDEX TA_PK ESTIMATE STATISTICS;

Accessing Statistics:

View USER_TABLES

View USER_TAB_COLUMNS

View USER_INDEXES
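A hedged sketch of the DBMS_STATS route, assuming an Oracle environment (the schema and table names are illustrative):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT', tabname => 'TA');
END;
/

SELECT TABLE_NAME, NUM_ROWS, BLOCKS, LAST_ANALYZED
FROM   USER_TABLES                 -- data dictionary view holding the gathered statistics
WHERE  TABLE_NAME = 'TA';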

Check the pros and cons of each type of optimiser.

Exercise WK4 solution …….


Review Questions

1. What is SQL performance tuning?

2. What is database performance tuning?

3. What is the focus of most performance-tuning activities and why does that focus exist?

4. What are database statistics, and why are they important?

5. How are DB statistics obtained?

6. What DB statistics measurements are typical of tables, indexes and resources?

7. How is the processing of SQL DDL statements (such as CREATE TABLE) different from the

processing required by DML statements?

8. In simple terms, the DBMS processes queries in three phases. What are those phases and

what is accomplished in each phase?

9. If indexes are so important, why not index every column in every table?

10. What is the difference between a rule-based optimizer and a cost-based optimiser?

11. What are optimizer hints, and how are they used?

12. What recommendations would you make for managing the data files in a DBMS with many

tables and indexes?

Production System


DB is a carefully designed and constructed repository of facts. The fact repository is a part of a

larger whole known as an Information System.

An Information System provides for data collection, storage and retrieval. It also facilitates the

transformation of data into info. and the mgt of both data and information. Complete information

System is composed of people, hardware, software, the DB, application programs and procedures.

System Analysis is the process that establishes the need for and the scope of an info.system. The

process of creating an info syst. is known as System Development.

A successful database design must reflect the information system of which the database is a part. A

successful info system is developed within a framework known as the Systems Development Life

Cycle (SDLC). Applications transform data into the form that is the basis for decision making. The most

successful DB is subject to frequent evaluation and revision within a framework known as the DB

Life Cycle (DBLC)

Database Design Strategies: Top-down vs Bottom-up and Centralized vs decentralized.

The information Systems

Applications

- Transform data into the form that is the basis for decision making

- Usually produce the following: formal report, tabulations, graphic displays

- Every application is composed of 2 parts:

Data and Code by which data are transformed into information.

The performance of an information system depends on a triad of factors:

- DB design and implementation

- Application design and implementation

- Administrative procedures

The term DB Development: describes the process of DB design and implementation. The primary

objective in DB design is to create complete, normalized, non-redundant and fully integrated

conceptual, logical and Physical DB models.

System Development Life Cycle (SDLC)


The SDLC is an iterative rather than a Sequential process.

SDLC divided into five phases:

Planning: such an assessment should answer some important questions:

Should the existing system be continued?

Should the existing system be modified?

Should the existing system be replaced?

The feasibility study must address the following:

- The technical aspects of hardware and software requirements

- The system cost.

Analysis: problems defined during the planning phase are examined in greater detail during

analysis phase.

Addressing questions are:

What are the requirements of the current systems end users?

Do those requirements fit into the overall info requirements?

The analysis phase of the SDLC is, in effect, a thorough audit of user requirements. The existing

hardware and software system are also studied in order to give a better understanding of the

system's functional areas, actual and potential problems and opportunities. The analysis phase also includes

the creation of a logical system design. The logical design must specify the appropriate conceptual

data model, inputs, processes and expected output requirements. When creating logical design,

the designer might use tools such as data flow diagrams (DFDs), hierarchical Input Process Output (HIPO) diagrams and ER diagrams.

Defining the logical system also yields functional description of the systems components (modules)

for each process within the DB envt.

Detailed Systems Design: complete the design of the system‟s processes. The design includes all

necessary technical specifications for the screens, menus, reports and other devices that might be

used to help make the system a more efficient information generator.


Implementation: the hardware, DBMS software and application programs are installed and the DB

design is implemented. During the initial stages of the implementation phase, the system enters a cycle of coding,

testing and debugging until it is ready to be delivered. The DB contents may be loaded interactively

or in batch mode, using a variety of methods and devices:

- Customised user programs

- DB interface program

- Conversion program that import the data from a different file structure using batch

program, a DB utility or both.

The system is subjected to exhaustive testing until it is ready for use. After testing is concluded, the

final documentation is reviewed and printed and end users are trained.

Maintenance: as soon as the system is operational end users begin to request changes in it. These

changes generate system maintenance:

- Corrective maintenance in response to system errors.

- Adaptive maintenance due to changes in the business envt.

- Perfective maintenance to enhance the system.

The DB Life Cycle (DBLC): it contains 6 phases

Database Design Strategies

Two classical approaches to DB design:

Top-down Design:

- Identify data sets

- Defines data elements for each of these sets.

This process involves the identification of different entity types and the definition of each entity‟s

attributes.

Bottom-up Design

- Identifies data elements (items)


- Groups them together in data sets

i.e it first defines attributes, then groups them to form entities.

The selection of a primary emphasis on top-down or bottom-up procedures often depends on the

scope of the problem or personal preferences. The 2 methodologies are complementary rather than

mutually exclusive.

Top-down vs Bottom-up Design Sequencing

Even when a primarily top-down approach is selected, the normalization process that revises existing table structures is (inevitably) a bottom-up technique. Conversely, ER modelling constitutes a top-down process even when the selection of attributes and entities can be described as bottom-up. Because both the ER model and normalization techniques form the basis for most designs, the top-down vs bottom-up debate may rest on a distinction rather than a real difference.

Production System continues

• Use estimate to refresh statistics

• Use declarative & procedural integrity

• Use stored PL/SQL procedures

– already compiled

– shared pool cache

• System configuration

System Configuration

• Size & configuration of the DB caches

–Number/size of data, buffer cache

–Size of shared pool

• SQL, PL/SQL, Triggers,

• Data Dictionary

– Log buffers

Options for the DBA

Table structure

• Heap

• Hash

• ISAM

• BTree


The main difference between the table structures is as follows:

The Heap table has no indexing ability built into it, so if left as a heap it would require a secondary index if it was large and speedy access was required. The others have indexing ability built into them, but the Hash and ISAM structures would degrade over time if lots of modifications were made to them – the additional data simply being added to the end as a heap in overflow pages – as opposed to the B-tree, which is dynamic and grows as data is added.

Data Structures

Heap

• No key columns

• Queries, other than appends, scan every page

• Rows are appended at the end

• 1 main page, all others are overflow

• Duplicate rows are allowed

Do Use when:

• Inserting a lot of new rows

• Bulk loading data

• Table is only 1-2 pages

• Queries always scan entire table

Do Not Use when:

• You need fast access to 1 or a small subset of rows

• Tables are large

• You need to make sure a key is unique

Hash

Do Use when:

• Data is retrieved based on the exact value of a key

Do Not Use when:

• You need pattern matching, range searches

• You need to scan the table entirely

• You need to use a partial key for retrieval

ISAM

Do Use when:

• Queries involve pattern matching and range scans

• Table is growing slowly

• Key is large

• Table is small enough to modify frequently

Do Not Use when:

• Only doing exact matches

• Table is large and is growing rapidly

Btree

• Index is dynamic

• Access data sorted by key

• Overflow does not occur if there are no duplicate keys

• Reuses deleted space on associated data pages

Do Use when


• Need pattern matching or range searches on the key

• Table is growing fast

• Using sequential key applications

• Table is too big to modify

• Joining entire tables to each other

Do Not Use when:

• Table is static or growing slowly

• Key is large

Creating Indexes:

Key fields

Foreign keys

Access fields
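A small sketch, reusing the hypothetical PRODUCT table from the SQL notes earlier:

CREATE INDEX PRODUCT_VCODE_IX ON PRODUCT (V_CODE);   -- foreign key / join column
CREATE INDEX PRODUCT_DESC_IX  ON PRODUCT (P_DESC);   -- frequently searched access field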

Disk Layout

Multiple Disks

Location of tables/index

Log file

DBMS components

Disk striping

Other factors

• CPU

• Disk access speed

• Operating system

• Available memory

– swapping to disk

• Network performance

De-normalisation

• Including children with parents

• Storing most recent child with parent

• Hard-coding static data

• Storing running totals

• Use system assigned keys

• Combining reference of code tables

• Creating extract tables

Centralized vs Decentralized Design

Two general approaches (bottom-up and top-down) to DB design can be influenced by factors such as the scope and size of the system, the company's mgt style and the company's structure (centralised or decentralised).

Centralized Design is productive when the data component is composed of a relatively small number of objects and procedures. Centralised design suits relatively simple and/or small databases and can be carried out successfully by a single person (the DBA). The company operations and the scope of the problem are sufficiently limited to allow even a single designer to define the problem, create the conceptual design, and verify the conceptual design against the user views.

Decentralized Design: this might be used when the data component of the system has a

considerable number of entities and complex relations on which very complex operations are

performed. Decentralised design is likely to be employed when the problem itself is spread across several operational sites and each element is a subset of the entire data set.

A carefully selected team of DB designers is employed to tackle a complex DB project. Within the

decentralised design framework, the DB design task is divided into several modules. Each design

group creates a conceptual data model corresponding to the subset being modelled. Each conceptual

model is then verified individually against user views, processes and constraints for each of the

modules. After the verification process has been completed, all modules are integrated into one

conceptual model.

Naturally, after the subsets have been aggregated into a larger conceptual model, the lead designer

must verify that the combined conceptual model is still able to support all of the required

transactions.


Database Design

Conceptual, Logical and Physical Database Design.

Conceptual DB Design is where we create the conceptual representation of the DB by producing a

data model which identifies the relevant entities and relationship within our system.

Logical DB Design is where we design relations based on each entity and define integrity rules to

ensure there is no redundant relationship within our DB.

Physical DB Design is where the physical DB is implemented in the target DBMS. In this stage we

have to consider how each relation is stored and how data is accessed.

Three Stages of DB Design


Selecting a suitable file organisation is important for fast data retrieval and efficient use of storage

space. 3 most common types of file organisation are:

Heap Files: which contain randomly ordered records.

Indexed Sequential Files: which are sorted on one or more fields using indexes.

Hashed Files: in which a hashing algorithm is used to determine the address of each record

based upon the value of the primary key.

Within a DBMS, indexes are often stored in a data structure known as a B-tree, which allows fast data

retrieval. Two other kinds of indexes are Bitmap Indexes and Join Indexes. These are often used on

multi-dimensional data held in data warehouses.

Indexes are crucial in speeding up data access. Indexes facilitate searching, sorting and using

aggregate functions and even join operations. The improvement in data access speed occurs because

an index is an ordered set of values that contains the index key and pointers.

Data Sparsity refers to the number of different values a column could possibly have. Indexes are

recommended in highly sparse columns used in search conditions.

Concurrency and Recovery

A transaction is any action that reads from and/or writes to a DB. A transaction may consist of a

simple SELECT statement to generate a list of table contents. Other statements are UPDATE,

INSERT, or combinations of SELECT, UPDATE & INSERT statement.

A transaction is a logical unit of work that must be entirely completed or entirely aborted, no

intermediate states are acceptable. All of the SQL statements in the transaction must be completed

successfully. If any of the SQL statements fail, the entire transaction is rolled back to the original

DB state that existed before the transaction started.

A successful transaction changes the DB from one consistent state to another. A consistent DB

State is one in which all data integrity constraints are satisfied.

To ensure consistency of the DB, every transaction must begin with the DB in a known consistent

State. If the DB is not in a consistent state, the transaction will yield an inconsistent DB that

violates its integrity and business rules. All transactions are controlled and executed by the DBMS

to guarantee DB integrity. Most real-world DB transactions are formed by two or more DB requests; a DB request is the equivalent of a single SQL statement in an application program or transaction.

Terms to know

Transaction: logical unit of work

Consistent State: DB reflecting true position

Concurrent: at the same time.

Sequence: Read disk block, update data, rewrite disk

Serializability: Ensures that concurrent execution of several transactions yields consistent results.

Transaction properties


All transactions must display atomicity, consistency, isolation, durability and serializability (the ACIDS test).

Atomicity: requires that all operations (SQL requests) of a transaction be completed; if not,

the transaction is aborted.

Consistency: indicates the permanence of the DB's consistent state; when a transaction is completed, the DB reaches a consistent state.

Isolation: means that the data used during the execution of a transaction cannot be used by a

2nd transaction until the 1st one is completed.

Durability: ensures that once transaction changes are done (committed), they cannot be

undone or lost even in the event of a system failure.

Serializability: ensures that the concurrent execution of several transactions yields consistent results. This is important in multi-user and distributed databases, where multiple transactions are likely to be executed concurrently. Naturally, if only a single transaction is executed, serializability is not an issue.

The Transaction Log

DBMS uses a transaction log to keep track of all transactions that update the DB. The information

stored in this log is used by the DBMS for a recovery requirement triggered by a ROLLBACK

statement.

Log with Deferred Updates

- Transaction recorded in log file

- Updates are not written to the DB

- Log entries are used to update the DB

In the event of a failure…..

- Any transactions not completed are ignored

- Any transactions committed are redone

- Checkpoint used to limit amount of rework.

Log with Immediate Updates

- Writes to DB as well as the log file

- Transaction record contains old and new value

- Once log record written DB can be updated.

In the event of failure…..

Transactions not completed are undone – old values.

Updates take place in reverse order

Transactions committed are redone – new values.

Concurrency Control

The coordination of the simultaneous execution of transaction in a multi-user DB system is known

as Concurrency Control. The objective of concurrency control is to ensure the serializability of

transaction in a multi-user DB envt. Concurrency control is important because the simultaneous


execution of transactions over a shared DB can create several data integrity and consistency

problems. Both disk I/O and CPU are used.

3 main problems are

Lost Updates

Uncommitted Data

Inconsistent Retrievals.

Uncommitted Data occurs when 2 transactions are executed concurrently and the 1st transaction is

rolled back after the second transaction has already accessed the uncommitted data – thus violating

the isolation property of transactions.

Inconsistent Retrievals: occur when a transaction calculates some summary (aggregates) functions

over a set of data while other transactions are updating the data. The problem is that the transaction

might read some data before they are changed and other data after they are changed, thereby

yielding inconsistent results.

Lost Updates: occur when the 1st transaction T1 has not yet been committed when the 2nd transaction T2 is executed. Therefore T2 still operates on the initial value read by T1, and T1's update is overwritten (lost).
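A minimal sketch of the lost-update anomaly, assuming a hypothetical ACCOUNT table (ACC_NO, BALANCE) with a starting balance of 100 and no locking in place:

-- T1 (session 1):
SELECT BALANCE FROM ACCOUNT WHERE ACC_NO = 1;        -- T1 reads 100
-- T2 (session 2):
SELECT BALANCE FROM ACCOUNT WHERE ACC_NO = 1;        -- T2 also reads 100
-- T1:
UPDATE ACCOUNT SET BALANCE = 120 WHERE ACC_NO = 1;   -- T1 adds 20 and writes 120
COMMIT;
-- T2 (still working from the value 100 it read earlier):
UPDATE ACCOUNT SET BALANCE = 70 WHERE ACC_NO = 1;    -- T2 subtracts 30 and writes 70
COMMIT;                                              -- T1's +20 is lost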

The Scheduler:

Is responsible for establishing the order in which the concurrent transaction operations are executed.

The transaction execution order is critical and ensures DB integrity in multi-user DB systems.

Locking, Time-stamping and Optimistic methods are used by the scheduler to ensure the

serializability of transactions.

Serializability of Schedules is guaranteed through the use of 2-phase locking. The 2-phase locking

schema has a growing phase in which the transaction acquires all of the locks that it needs without

unlocking any data and a shrinking phase in which the transaction releases all of the locks without

acquiring new locks.

Serializability:

Serial execution means performing transactions one after another

If 2 transactions are only reading a variable, they do not conflict and order is not important.

If 2 transactions operate on separate variables, they do not conflict and order is not important.

Only when a transaction writes to a variable and another either reads or writes to the same variable,

then order is important. Serializability is making sure that, when it counts, transactions operate in

order.

Lock Granularity:

It indicates the level of lock use. Locking can take place at the following levels: database, table,

page, row, or even field (attribute)

Database Level Lock: the entire DB is locked, thus preventing the use of any tables in the DB by

transaction T2 while transaction T1 is being executed. This level of locking is good for batch

processes, but it is unsuitable for online multi-user DBMS.

Note that transaction T1 and T2 cannot access the same DB concurrently even when they use diff

tables.

Table Level Lock:


The entire table is locked, preventing access to any row by transaction T2 while transaction T1 is

using the table. If a transaction requires access to several tables, each table may be locked. However

2 transactions can access the same DB as long as they access diff tables.

Page Level Lock

DBMS locks an entire disk page. A disk page or page is the equivalent of a disk block which can be

described as a directly addressable section of a disk. A page has a fixed size.

Row Level Lock:

It is much less restrictive than the locks discussed above. DBMS allows concurrent transactions to

access diff rows of the same table, even when the rows are locked on the same page.

Field Level Lock:

It allows concurrent transaction to access the same row as long as they require the use of diff fields

(attributes) within that row.

Lock Types:

Shared/Exclusive Locks: an exclusive lock exists when access is reserved specifically for the

transaction that locked the object.

Read (shared) Lock: allows the reading but not updating of a data item, allowing multiple

accesses.

Write (Exclusive) – allows exclusive update of a data item.

A shared Lock is issued when a transaction wants to read data from the DB and no exclusive lock

is held on that data item.

An Exclusive Lock is issued when a transaction wants to update (Write) a data item and no locks

are currently held on that data item by any other transaction.
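A hedged sketch of requesting locks explicitly; the exact syntax varies by DBMS (the forms below are Oracle-style), and ACCOUNT is the hypothetical table used above:

SELECT BALANCE
FROM   ACCOUNT
WHERE  ACC_NO = 1
FOR UPDATE;                          -- exclusive row-level lock, held until COMMIT or ROLLBACK

LOCK TABLE ACCOUNT IN SHARE MODE;    -- explicit table-level shared (read) lock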

Two-Phase Locking:

Defines how transactions acquire and relinquish locks. It guarantees serializability, but it does not

prevent deadlocks. The two phases are:

1. Growing Phase: transaction acquires all required locks without unlocking any data. Once

all locks have been acquired the transaction is in its locked point.

2. Shrinking Phase: transaction releases all locks and cannot obtain any new lock.

The two-phase locking protocol is governed by the following rules:

· 2 transactions cannot have conflicting locks.

· No unlock operation can precede a lock operation in the same transaction.

· No data are affected until all locks are obtained i.e until the transaction is in its locked point.

Deadlocks:

A deadlock occurs when 2 transactions wait for each other to unlock data.

Three Basic Techniques to Control Deadlocks:

Deadlock Prevention: a transaction requesting a new lock is aborted when there is the

possibility that a deadlock can occur. If the transaction is aborted, all changes made by this

transaction are rolled back and all locks obtained by the transaction are released. (statically

make deadlock structurally impossible )


Deadlock Detection: the DBMS periodically tests the DB for deadlocks. If a deadlock is found, one of the transactions (the 'victim') is aborted (rolled back and restarted) and the other transaction continues. (Let deadlocks occur, detect them and try to recover.)

Deadlock Avoidance: The transaction must obtain all of the locks it needs before it can be

executed. (avoid deadlocks by allocating resources carefully)

Concurrency Control with Time-Stamping Methods:

Time-stamping: the time-stamping approach to scheduling concurrent transactions assigns a global, unique time stamp to each transaction.

Time stamps must have two properties: uniqueness and monotonicity.

Uniqueness ensures that no equal time stamp values can exist.

Monotonicity ensures that time stamp values always increase.

All DB operations (read and write) within the same transaction must have the same time stamp. The DBMS executes conflicting operations in time stamp order, thereby ensuring serializability of the transactions. If 2 transactions conflict, one is stopped, rolled back, rescheduled and assigned a new time stamp value. No locks are used, so no deadlock can occur.

Disadvantage of the time stamping approach is that each value stored in the DB requires 2

additional time stamp fields.

Concurrency Control With Optimistic Methods:

The optimistic approach is based on the assumption that the majority of DB operations do not conflict. The optimistic approach does not require locking or time stamping techniques. Instead, a transaction is executed without restrictions until it is committed. Each transaction moves through 2 or 3 phases, which are READ, VALIDATION and WRITE.

- Some envts may have relatively few conflicts between transactions.

- Locking would be an inefficient overhead.

- Eliminate this by optimistic technique

- Assume there will be no problems

- Before committing, a check is done

- If conflict occurred transaction is rolled back.

Database Recovery:

DB recovery restores a DB from a given state, usually inconsistent, to a previously consistent state.

Need for Recovery:

Physical disasters – fire, flood

Sabotage – internal

Carelessness – unintentional

Disk Malfunctions – headcrash, unreadable tracks

System crashes – hardware

System software errors – termination of DBMS

Application software errors – logical errors.

Recovery Techniques: are based on the atomic transaction property; all portions of the transaction

must be treated as a single, logical unit of work in which all operations are applied and completed to

produce a consistent DB.

- Technique to restore DB to a consistent state.


- Transactions not completed – rolled back

- To record transaction using a log file.

- Contains – transaction and checkpoint records

- Checkpoint record – lists current transactions.

Four Important Concepts that Affect the Recovery Process:

Write-ahead-log protocol: ensures that transaction logs are always written before any DB data are actually updated.

Redundant Transaction Log: most DBMS keep several copies of the transaction log to

ensure that a physical disk failure will not impair the DBMS ability to recover data.

Database Buffers: buffer is a temporary storage area in primary memory used to speed up

disk operations.

Database Checkpoints: is an operation in which the DBMS writes all of its updated buffers

to disk. Checkpoint operation is also registered in the transaction log.

When the recovery procedure uses deferred write (deferred update), the transaction operations do not immediately update the physical DB; instead, only the transaction log is updated. The recovery process for all started and committed transactions (before the failure) follows these steps:

Identify the last checkpoint in the transaction log.

For a transaction that started and committed before the last checkpoint, nothing

needs to be done because the data are already saved.

For a transaction that performed a commit operation after the last checkpoint, the

DBMS uses the transaction log records to redo, the transaction and to update the DB,

using the „after‟ values in the transaction log.

For any transaction that had a Rollback operation after the last checkpoint or that

was left active before the failure occurred, nothing needs to be done because the DB was

never updated.

When the recovery procedure uses write-through (immediate update), the DB is immediately updated by transaction operations during the transaction's execution, even before the transaction reaches its commit point.

Deadlocks in Distributed Systems

Deadlocks in distributed systems are similar to deadlocks in single processor systems, only worse

- They are harder to avoid, prevent or even detect.

- They are hard to cure when tracked down because all relevant information is scattered over

many machines.

Distributed Deadlock Detection

Since preventing and avoiding deadlocks is difficult, researchers have worked on detecting the occurrence of deadlocks in distributed systems.

The presence of atomic transactions in some distributed systems makes a major conceptual

difference.

When a deadlock is detected in a conventional system, we kill one or more processes to break the

deadlock.

When deadlock is detected in a system based on atomic transaction, it is resolved by aborting one or

more transactions. But transactions have been designed to withstand being aborted. When a

transaction is aborted, the system is first restored to the state it had before the transaction began, at

which point the transaction can start again. With a bit of luck, it will succeed the second time. Thus

the difference is that the consequences of killing off a process are much less severe when

transactions are used.


1. Centralised Deadlock Detection

We use a centralised deadlock detection algorithm and try to imitate the nondistributed algorithm.

Each machine maintains the resource graph for its own processes and resources.

A centralised coordinator maintains the resource graph for the entire system.

In updating the coordinator‟s graph, messages have to be passed.

- Method 1: whenever an arc is added or deleted from the resource graph, a message

has to be sent to the coordinator.

- Method 2: periodically, every process can send a list of arcs added and deleted since

previous update.

- Method 3: coordinator asks for information when it needs it.

One possible way to prevent false deadlocks is to use Lamport's algorithm to provide global timing for the distributed system.

When the coordinator gets a message that leads it to suspect a deadlock:

It sends everybody a message saying: "I just received a message with timestamp T which leads to deadlock. If anyone has a message for me with an earlier timestamp, please send it immediately."

When every machine has replied, positively or negatively, the coordinator will see whether the deadlock has really occurred or not.

2. The Chandy-Misra-Haas algorithm:

Processes are allowed to request multiple resources at once – the growing phase of a

transaction can be speeded up.

The consequence of this change is that a process may now wait on two or more resources at the same

time.

When a process has to wait for some resources, a probe message is generated and sent to the

process holding the resources. The message consists of three numbers: the process being

blocked, the process sending the message and the process receiving the message.

When the message arrives, the recipient checks to see if it itself is waiting for any processes. If so, the message is updated, keeping the first number unchanged and replacing the second and third fields with the corresponding process numbers.

The message is then sent to the process holding the needed resources.

If a message goes all the way around and comes back to the original sender (the process that initiated the probe), a cycle exists and the system is deadlocked.

Review Questions

I. Explain the following statement: a transaction is a logical unit of work.

II. What is a consistent database state, and how is it achieved?

III. The DBMS does not guarantee that the semantic meaning of the transaction truly represents

the real-world event. What are the possible consequences of that limitation? Give example.

IV. List and discuss the four transaction properties.

V. What is transaction log, and what is its function?

VI. What is a scheduler, what does it do, and why is its activity important to concurrency control?

VII. What is a lock and how, in general, does it work?

VIII. What is concurrency control and what are its objectives?

IX. What is an exclusive lock and under what circumstances is it granted?

X. What is a deadlock, and how can it be avoided? Discuss several deadlock avoidance strategies.

11. What three levels of backup may be used in DB recovery mgt? Briefly describe what each of those three backup levels does.


Database Security Issues

Types of Security

Legal and ethical issues regarding the right to access certain information. Some

information may be deemed to be private and cannot be accessed legally by unauthorized

persons.

Policy Issues at the governmental, institutional or corporate level as to what kinds of info

should not be made publicly available – for example, credit ratings and personal medical records.

System-related issues: such as the system level at which various security functions should

be enforced – for example, whether a security function should be handled at the physical

hardware level, the operating system level or the DBMS level.

The need to identify multiple security levels and to categorize the data and users based on

these classifications for example, top secret, secret, confidential and unclassified. The

security policy of the organisation with respect to permitting access to various

classifications of data must be enforced.

Threats to Databases: these result in the loss or degradation of some or all of the following commonly accepted security goals: integrity, availability and confidentiality.

Loss of Integrity: Database Integrity refers to the requirement that information be protected

from improper modification. Modification of data includes creation, insertion, modification,

changing the status of data and deletion. Integrity is lost if unauthorised changes are made to

the data by either intentional or accidental acts. If the loss of the system or data integrity is

not corrected, continued use of the contaminated system or corrupted data could result in

inaccuracy, fraud or erroneous decisions.

Loss of Availability: Database availability refers to making objects available to a human

user or a program to which they have a legitimate right.

Loss of Confidentiality: Database confidentiality refers to the protection of data from

unauthorized disclosure. Unauthorized, unanticipated or unintentional disclosure could

result in loss of public confidence, embarrassment or legal action against the organisation.

Control Measures

Four main control measures that are used to provide security of data in databases:

· Access control

· Inference control

· Flow control

· Data encryption

Access Control: the security mechanism of a DBMS must include provisions for restricting access

to the database system as a whole. This function is called Access control and is handled by creating

user accounts and passwords to control the login process by the DBMS.


Inference Control: Statistical databases are used to provide statistical information or summaries of

values based on various criteria e.g database for population statistics. Statistical database users e.g

govt. statisticians or market research firms are allowed to access the database to retrieve statistical

information about population but not to access the detailed confidential information about specific

individuals. Statistical database security ensures that information about individuals cannot be

accessed. It is sometimes possible to deduce or infer certain facts concerning individuals from

queries that involve only summary statistics on groups; consequently, this must not be permitted

either. The corresponding control measures are called Inference Control.

Flow Control: it prevents information from flowing in such a way that it reaches unauthorized

users. Channels that are pathways for information to flow implicitly in ways that violate the security

policy of an organisation are called Covert Channels.

Data Encryption: is used to protect sensitive data, such as credit card numbers, that are transmitted via some type of communications network. The data are encoded using a coding algorithm. An unauthorized user who accesses encoded data will have difficulty deciphering it, but authorized users are given decoding or decrypting algorithms (or keys) to decipher the data.

A DBMS typically includes a database security and authorization subsystem that is responsible

for security of portions of a database against unauthorized access.

Two Types of database Security Mechanism:

i) Discretionary Security Mechanisms: these are used to grant privileges to users, including the

capability to access specific data files, records or fields in a specified mode (such as read, insert,

delete or update)

ii) Mandatory Security Mechanism: used to enforce multilevel security by classifying the data

and users into various security classes/levels and then implementing the appropriate security policy

of the organisation. E.g. a typical security policy is to permit users at a certain classification level to see only the data items classified at the user's own or lower classification levels. An extension of this

is Role-based Security, which enforces policies and privileges based on concepts of roles.

Database Security and the DBA

DBA is the central authority for managing a database system. The DBA‟s responsibilities include

granting privileges to users who need to use the system and classifying users and data in accordance

with the policy of the organisation. The DBA has a DBA account in DBMS, sometimes called a

System or Superuser Account, which provides powerful capabilities that are not made available to

regular database accounts and users. DBA-privileged commands include commands for granting

and revoking privileges to individual accounts or user groups and for performing the following

types of actions:

Account Creation: this action creates a new account and password for a user or

group of users to enable access to the DBMS.

Privilege Granting: this action permits the DBA to grant certain privileges to

certain accounts.

Privilege Revocation: this action permits the DBA to revoke certain privileges that

were previously given to certain accounts.

Security Level Assignment: this action consists of assigning user accounts to the

appropriate security classification level.


The DBA is responsible for the overall security of the database system. The first action above is used to control access to the DBMS as a whole, whereas the second and third actions are used to control discretionary database authorization, and the fourth action is used to control mandatory authorization.

Access Protection, User Accounts and Database Audits

DBA will create a new account number and password for the user if there is a legitimate need to

access the database. The user must log in to the DBMS by entering the account number and

password whenever database access is needed.

It's straightforward to keep track of all database users and their accounts and passwords by creating an encrypted table or file with two fields: Account Number and Password. This table can be easily

maintained by the DBMS.

The database system must also keep track of all operations on the database that are applied by a

certain user throughout each login session, which consists of the sequence of database interactions

that a user performs from the time of logging in to the time of logging off.

To keep a record of all updates applied to the database and of the particular user who applied each update, we can modify the system log, which includes an entry for each operation applied to the

database that may be required for recovery from a transaction failure or system crash. If any

tampering with the database is suspected, a database audit is performed, which consists of

reviewing the log to examine all access and operations applied to the database during a certain time

period. When an illegal or unauthorized operation is found, the DBA can determine the account

number used to perform the operation. Database audits are particularly important for sensitive

databases that are updated by many transactions and users such as a banking database that is

updated by many bank tellers. A database log that is used mainly for security purposes is sometimes

called an Audit Trail.

Discretionary Access Control based on Granting and Revoking Privileges

The typical method of enforcing discretionary access control in a database system is based on the

granting and revoking of privileges.

Types of Discretionary Privileges:

The Account Level: at this level, the DBA specifies the particular privileges that each account

holds independently of the relations in the database.

The privileges at the account level apply to the capabilities provided to the account itself and can

include the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base

relations; the CREATE VIEW privilege; the ALTER privilege, to apply schema changes such as

adding or removing attributes from relations; the DROP privilege, to delete relations or views; the

MODIFY privilege, to insert, delete, or update tuples; and the SELECT privilege, to retrieve

information from the database by using a SELECT query.

The Relation (or Table) Level: at this level, the DBA can control the privileges to access each

individual relation or view in the database.

The second level of privileges applies to the relation level, whether they are base relations or

virtual (view) relations.

The granting and revoking of privileges generally follow an authorization model for discretionary

privileges known as the Access Matrix model, where the rows of a matrix M represent subjects

(users, accounts, programs) and the columns represent objects (relations, records, columns, views,


operations). Each position M(i, j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j.

To control the granting and revoking of relation privileges, each relation R in a database is assigned

an owner account which is typically the account that was used when the relation was created in the

first place. The owner of a relation is given all privileges on that relation. In SQL2, the DBA can

assign an owner to a whole schema by creating the schema and associating the appropriate

authorization identifier with that schema using the CREATE SCHEMA command. The owner

account holder can pass privileges on any of the owned relation to other users by granting privileges

to their accounts.

In SQL the following types of privileges can be granted on each individual relation R:

SELECT (retrieval or read) privilege on R: gives the account retrieval privilege. In SQL

this gives the account the privilege to use the SELECT statement to retrieve tuples from R

MODIFY privilege on R: this gives the account the capability to modify tuples of R. In SQL this privilege is divided into UPDATE, DELETE and INSERT privileges to apply the corresponding SQL command to R. Additionally, both the INSERT and UPDATE privileges can specify that only certain attributes of R can be updated by the account.

REFERENCES privilege on R: this gives the account the capability to reference relation R

when specifying integrity constraints. This privilege can also be restricted to specific

attributes of R.

Notice that to create a view, the account must have SELECT privilege on all relations involved in

the view definition.

Specifying Privileges using Views

The mechanism of views is an important discretionary authorization mechanism in its own right.

For example, if the owner A of a relation R wants another account B to be able to retrieve only

some fields of R, then A can create a view V of R that includes only those attributes and then grant

SELECT on V to B. The same applies to limiting B to retrieving only certain tuples of R; a view V'

can be created by defining the view by means of a query that selects only those tuples from R that A

wants to allow B to access.
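A sketch of this idea, assuming a hypothetical EMPLOYEE table owned by A and an account B:

CREATE VIEW EMP_PUBLIC AS
    SELECT EMP_NO, EMP_NAME, DEPT_NO   -- salary and other sensitive columns are excluded
    FROM   EMPLOYEE;

GRANT SELECT ON EMP_PUBLIC TO B;       -- B can query the view but not the base table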

Revoking Privileges:

In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a

relation may want to grant the SELECT privilege to a user for a specific task and then revoke that

privilege once the task is completed. Hence, a mechanism for revoking privileges is needed. In

SQL, a REVOKE command is included for the purpose of canceling privileges.

Propagation of privileges using the GRANT OPTION

Whenever the owner A of a relation R grants a privilege on R to another account B, privilege can be

given to B with or without the GRANT OPTION. If the GRANT OPTION is given, this means that

B can also grant that privilege on R to other accounts. Suppose that B is given the GRANT

OPTION by A and that B then grants the privilege on R to a third account C, also with GRANT

OPTION. In this way, privileges on R can propagate to other accounts without the knowledge of

the owner of R. If the owner account A now revokes the privilege granted to B, all the privileges

that B propagated based on that privilege should automatically be revoked by the system.


It is possible for a user to receive a certain privilege from two or more sources, e.g. A4 may receive a certain UPDATE R privilege from both A2 and A3. In such a case, if A2 revokes this privilege from A4, A4 will still continue to have the privilege by virtue of having been granted it from A3. If A3 later revokes the privilege from A4, A4 totally loses the privilege. Hence a DBMS that allows

propagation of privilege must keep track of how all the privileges were granted so that revoking of

privileges can be done correctly and completely.

Specifying Limits on Propagation of Privileges

Techniques to limit the propagation of privileges have been developed, although they have not yet

been implemented in most DBMSs and are not a part of SQL.

Limiting horizontal propagation to an integer number i means that an account B given the

GRANT OPTION can grant the privilege to at most i other accounts.

Vertical propagation is more complicated; it limits the depth of the granting of privileges.

Granting a privilege with a vertical propagation of zero is equivalent to granting the privilege with

no GRANT OPTION. If account A grants a privilege to account B with the vertical propagation set

to an integer number j>0, this means that the account B has the GRANT OPTION on that privilege,

but B can grant the privilege to other accounts only with a vertical propagation less than j.

Mandatory Access Control and Role-Based Access Control for Multilevel Security

The discretionary access control techniques of granting and revoking privileges on relations have traditionally been the main security mechanism for relational database systems.

This is an all-or-nothing method: a user either has or does not have a certain privilege.

In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms. It is important to note that most commercial DBMSs currently provide mechanisms only for discretionary access control.

Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U),

where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U

The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O).

Two restrictions are enforced on data access based on the subject/object classifications:

1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is

known as the Simple Security Property.

2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the Star Property (or *-property).

The first restriction is intuitive and enforces the obvious rule that no subject can read an object whose security classification is higher than the subject's security clearance.

The second restriction is less intuitive. It prohibits a subject from writing an object at a lower security classification than the subject's security clearance.

Violation of this rule would allow information to flow from higher to lower classifications which

violates a basic tenet of multilevel security.


To incorporate multilevel security notions into the relational database model, it is common to

consider attribute values and tuples as data objects. Hence, each attribute A is associated with a

classification attribute C in the schema, and each attribute value in a tuple is associated with a

corresponding security classification. In addition, in some models, a tuple classification attribute

TC is added to the relation attributes to provide a classification for each tuple as a whole. Hence, a

multilevel relation schema R with n attributes would be represented as

R(A1,C1,A2,C2, …, An,Cn,TC)

where each Ci represents the classification attribute associated with attribute Ai.

The value of the TC attribute in each tuple t – which is the highest of all attribute classification

values within t – provides a general classification for the tuple itself, whereas each Ci provides a

finer security classification for each attribute value within the tuple.

The apparent key of a multilevel relation is the set of attributes that would have formed the

primary key in a regular (single-level) relation.
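As an illustration only (the EMPLOYEE relation, its attributes, and the CHAR(2) encoding of the classes U, C, S and TS are hypothetical), such a schema could be declared as:

    -- Each data attribute Ai is paired with a classification attribute Ci;
    -- TC holds the classification of the tuple as a whole.
    CREATE TABLE EMPLOYEE_ML (
        Name     VARCHAR(40),    -- A1
        C_Name   CHAR(2),        -- C1: classification of Name
        Salary   DECIMAL(10,2),  -- A2
        C_Salary CHAR(2),        -- C2: classification of Salary
        TC       CHAR(2)         -- tuple classification (highest of C1, C2)
    );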

A multilevel relation will appear to contain different data to subjects (users) with different clearance

levels. In some cases, it is possible to store a single tuple in the relation at a higher classification

level and produce the corresponding tuples at a lower-level classification through a process known

as Filtering.

In other cases, it is necessary to store two or more tuples at different classification levels with the

same value for the apparent key. This leads to the concept of Polyinstantiation where several

tuples can have the same apparent key value but have different attribute values for users at different

classification levels.

In general, the entity integrity rule for multilevel relations states that all attributes that are

members of the apparent key must not be null and must have the same security classification within

each individual tuple.

In addition, all other attribute values in the tuple must have a security classification greater than or

equal to that of the apparent key. This constraint ensures that a user can see the key if the user is

permitted to see any part of the tuple at all. Other integrity rules, called Null Integrity and

Interinstance Integrity, informally ensure that if a tuple value at some security level can be

filtered from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the

multilevel relation.

Comparing Discretionary Access Control and Mandatory Access Control

Discretionary Access Control (DAC) policies are characterized by a high degree of

flexibility, which makes them suitable for a large variety of application domains.

The main drawback of DAC models is their vulnerability to malicious attacks, such as

Trojan horses embedded in application programs.

By contrast, mandatory policies ensure a high degree of protection; in effect, they prevent any illegal flow of information.

Mandatory policies have the drawback of being too rigid and they are only applicable in

limited environments.


In many practical situations, discretionary policies are preferred because they offer a better

trade-off between security and applicability.

Role-Based Access Control

Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for

managing and enforcing security in large-scale enterprise-wide systems. Its basic notion is that

permissions are associated with roles, and users are assigned to appropriate roles. Roles can be

created using the CREATE ROLE and DESTROY ROLE commands. The GRANT and REVOKE

commands discussed under DAC can then be used to assign and revoke privileges from roles.
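A minimal sketch of this style of administration (the payroll_clerk role, the EMPLOYEE relation, and account A1 are hypothetical; most SQL dialects spell role removal DROP ROLE rather than DESTROY ROLE):

    -- Collect the privileges needed for a job function into a role
    CREATE ROLE payroll_clerk;
    GRANT SELECT, UPDATE (Salary) ON EMPLOYEE TO payroll_clerk;

    -- Assign the role to a user; revoking it later removes the whole bundle
    GRANT payroll_clerk TO A1;
    REVOKE payroll_clerk FROM A1;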

RBAC appears to be a viable alternative to traditional discretionary and mandatory access controls;

it ensures that only authorized users are given access to certain data or resources. A role hierarchy in RBAC is a natural way to organize roles to reflect the organization's lines of authority and responsibility. Another important consideration in RBAC systems is the possible temporal constraints that may exist on roles, such as the time and duration of role activations and the timed triggering of a role by the activation of another role. The RBAC model is a highly desirable goal for addressing the key security requirements of Web-based applications.

RBAC models have several desirable features such as flexibility, policy neutrality, better support

for security management and administration and other aspects that make them attractive candidates

for developing secure Web-based applications. RBAC models can represent traditional DAC and

MAC policies as well as user-defined or organization-specific policies.

The RBAC model provides a natural mechanism for addressing the security issues related to the

execution of tasks and workflows. Easier deployment over the internet has been another reason for

the success of RBAC models.

Access Control Policies for E-commerce and the Web

E-Commerce environments require elaborate policies that go beyond traditional DBMSs.

In conventional database environments, access control is usually performed using a set of

authorizations stated by security officers or users according to some security policies. Such

a simple paradigm is not well suited for a dynamic environment like e-commerce.

– In an e-commerce environment the resources to be protected are not only traditional

data but also knowledge and experience. Such peculiarities call for more flexibility

in specifying access control policies.

– The access control mechanism should be flexible enough to support a wide spectrum

of heterogeneous protection objects.

A second related requirement is the support for content-based access-control. Content-based

access control allows one to express access control policies that take the protection object

content into account. In order to support content-based access control, access control

policies must allow inclusion of conditions based on the object content.

A third requirement is related to the heterogeneity of subjects, which requires access control policies based on user characteristics and specifications rather than on specific and individual characteristics such as user IDs.

A credential is a set of properties concerning a user that are relevant for security purposes.

It is believed that the XML language can play a key role in access control for e-commerce applications, because XML is becoming the common representation language for document interchange over the Web and is also becoming the language of e-commerce.

Statistical Database Security


Statistical databases are used mainly to produce statistics on various populations.

The database may contain confidential data on individuals, which should be protected from

user access.

Users are permitted to retrieve statistical information on the populations, such as averages,

sums, counts, maximums, minimums, and standard deviations.

A population is a set of tuples of a relation (table) that satisfy some selection condition.

Statistical queries involve applying statistical functions to a population of tuples.

For example, we may want to retrieve the number of individuals in a population or the

average income in the population. However, statistical users are not allowed to retrieve

individual data, such as the income of a specific person. Statistical database security

techniques must prohibit the retrieval of individual data.

This can be achieved by prohibiting queries that retrieve attribute values and by allowing

only queries that involve statistical aggregate functions such as COUNT, SUM, MIN, MAX,

AVERAGE, and STANDARD DEVIATION. Such queries are sometimes called Statistical

Queries.

It is the DBMS's responsibility to ensure the confidentiality of information about individuals, while still providing useful statistical summaries of data about those individuals to users.

Provision of privacy protection of users in a statistical database is paramount.

In some cases it is possible to infer the values of individual tuples from a sequence of statistical queries. This is particularly true when the conditions result in a population consisting of a small number of tuples.
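For example (a sketch; the PERSON relation and its attributes are hypothetical, and AVERAGE corresponds to the SQL function AVG), compare a permitted statistical query with one whose narrow population could leak individual data:

    -- Permitted: an aggregate over a reasonably large population
    SELECT COUNT(*), AVG(Income)
    FROM PERSON
    WHERE City = 'Lagos';

    -- Risky: the condition may isolate a single individual, so the
    -- "aggregate" effectively reveals that person's income
    SELECT AVG(Income)
    FROM PERSON
    WHERE City = 'Lagos' AND Profession = 'Judge' AND BirthYear = 1960;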

Flow Control

Flow control regulates the distribution or flow of information among accessible objects. A

flow between object X and object Y occurs when a program reads values from X and writes

values into Y.

Flow controls check that information contained in some objects does not flow explicitly or

implicitly into less protected objects.

A flow policy specifies the channels along which information is allowed to move. The

simplest flow policy specifies just two classes of information: confidential (C) and

nonconfidential (N), and allows all flows except those from class C to class N. This policy

can solve the confinement problem that arises when a service program handles data such as

customer information, some of which may be confidential.

Flow controls can be enforced by an extended access control mechanism, which involves

assigning a security class (usually called the clearance) to each running program.

A flow control mechanism must verify that only authorized flows, both explicit and implicit, are executed. A set of rules must be satisfied to ensure secure information flow.

Covert Channels

A covert channel allows a transfer of information that violates the security policy. It allows

information to pass from a higher classification level to a lower classification level through

improper means.

Covert channels can be classified into two broad categories: timing channels and storage channels.

In a Timing Channel the information is conveyed by the timing of events or processes whereas

Storage Channels do not require any temporal synchronization, in that information is conveyed by

accessing system information or what is otherwise inaccessible to the user.


Encryption and Public key Infrastructures

Encryption is a means of maintaining secure data in an insecure environment. Encryption consists

of applying an encryption algorithm to data using some prespecified encryption key. The resulting

data has to be decrypted using a decryption key to recover the original data.

The Data and Advanced Encryption Standards

The Data Encryption Standard (DES) is a system developed by the U.S. government for use by the general public. It has been widely accepted as a cryptographic standard both in the United States and

abroad. DES can provide end-to-end encryption on the channel between sender A and receiver B.

The DES algorithm is a careful and complex combination of two of the fundamental building

blocks of encryption: substitution and permutation (transposition)

Public Key Encryption

The two keys used for public key encryption are referred to as the public key and the private key.

Invariably, the private key is kept secret, but it is referred to as a private key rather than a secret

key (the key used in conventional encryption) to avoid confusion with conventional encryption.

Public key encryption refers to a type of cipher architecture, known as public key cryptography, that utilizes two keys (a key pair) to encrypt and decrypt data. One of the two keys is a public key,

which anyone can use to encrypt a message for the owner of that key. The encrypted message is

sent and the recipient uses his or her private key to decrypt it. This is the basis of public key

encryption.

Other encryption technologies that use a single shared key to both encrypt and decrypt data rely on

both parties deciding on a key ahead of time without other parties finding out what that key is. This

type of encryption technology is called symmetric encryption, while public key encryption is

known as asymmetric encryption.

The public key of the pair is made public for others to use, whereas the private key is known only to

its owner.

Public key encryption scheme or infrastructure has six ingredients:

i. Plaintext: the data or readable message that is fed into the algorithm as input.

ii. Encryption algorithm: performs various transformations on the plaintext.

iii. Public key and iv. Private key: a pair of keys selected so that if one is used for encryption, the other is used for decryption.

v. Ciphertext: the scrambled message produced as output. It depends on the plaintext and the key; for a given message, two different keys will produce two different ciphertexts.

vi. Decryption algorithm: accepts the ciphertext and the matching key and produces the original plaintext.

A "key" is simply a small bit of text code that triggers the associated algorithm to encode or

decode text. In public key encryption, a key pair is generated using an encryption program and the

pair is associated with a name or email address. The public key can then be made public by posting

it to a key server, a computer that hosts a database of public keys.

Public key encryption can also be used for secure storage of data files. In this case, your public key

is used to encrypt files while your private key decrypts them.


User Authentication: is a way of identifying the user and verifying that the user is allowed to

access some restricted data or application. This can be achieved through the use of passwords and

access rights.

Methods of attacking a distributed system

- Eavesdropping: the act of surreptitiously listening to a private conversation.

- Masquerading

- Message tampering

- Replaying

- Denial of service: A denial-of-service attack (DoS attack) or distributed denial-of-

service attack (DDoS attack) is an attempt to make a computer resource unavailable to its

intended users. Although the means to carry out, motives for, and targets of a DoS attack

may vary, it generally consists of the concerted efforts of a person or persons to prevent an

Internet site or service from functioning efficiently or at all, temporarily or indefinitely.

Perpetrators of DoS attacks typically target sites or services hosted on high-profile web

servers such as banks, credit card payment gateways, and even root nameservers

- Phishing: "phishers" use electronic communications that look as if they came from

legitimate banks or other companies to persuade people to divulge sensitive information,

including passwords and credit card numbers.

Why Cryptography is necessary in a Distributed System

Supporting the facilities of a distributed system, such as resource distribution, requires the use of an underlying message

passing system. Such systems are, in turn, reliant on the use of a physical transmission network, upon which the

messages may physically be communicated between hosts.

Physical networks and, therefore, the basic message passing systems built over them are vulnerable to attack. For

example, hosts may easily attach to the network and listen in on the messages (or 'conversations') being held. If the

transmissions are in a readily understandable form, the eavesdroppers may be able to pick out units of information, in

effect stealing their information content.

Aside from the theft of user data, which may be in itself of great value, there may also be system information being

passed around as messages. Eavesdroppers from both inside and outside the system may attempt to steal this system

information as a means of either breaching internal access constraints, or to aid in the attack of other parts of the

system. Two possibly worse scenarios may exist where the attacking system may modify or insert fake transmissions on

the network. Accepting faked or modified messages as valid could lead a system into chaos.

Without adequate protection techniques, Distributed Systems are extremely vulnerable to the standard types of attack

outlined above. The encryption techniques discussed in the remainder of this section aim to provide the missing

protection by transforming a message into a form where if it were intercepted in transit, the contents of the original

message could not be explicitly discovered. Such encrypted messages, when they reach their intended recipients,

however, are capable of being transformed back into the original message.

There are two main frameworks in which this goal may be achieved: Secret Key Encryption Systems and Public Key Encryption Systems.

Secret Key Encryption Systems

Secret key encryption uses a single key to both encrypt and decrypt messages. As such it must be present at both the

source and destination of transmission to allow the message to be transmitted securely and recovered upon receipt at the

correct destination. The key must be kept secret by all parties involved in the communication. If the key fell into the

hands of an attacker, they would then be able to intercept and decrypt messages, thus thwarting the attempt to attain

secure communications by this method of encryption.

Secret key algorithms like DES assert that even though it is theoretically possible to derive the secret key from the

encrypted message alone, the quantities of computation involved in doing so make any attempts infeasible with current

computing hardware. The Kerberos architecture is a system based on the use of secret key encryption.

Public Key Encryption

Public key systems use a pair of keys, each of which can decrypt the messages encrypted by the other. Provided one of

these keys is kept secret (the private key), any communication encrypted using the corresponding public key can be

considered secure as the only person able to decrypt it holds the corresponding private key.


The algorithmic properties of the encryption and decryption processes make it infeasible to derive a private key from a

public key, an encrypted message, or a combination of both. RSA is an example of a public key algorithm for

encryption and decryption. It can be used within a protocol framework to ensure that communication is secure and

authentic.

Data Privacy through Encryption

There are two aspects to determining the level of privacy that can be attained through the Kerberos and RSA systems.

To begin with, there is an analysis of the security of the two systems from an algorithmic view. The questions raised at

this stage aim to consider exactly how hard it is to derive a private or secret key from encrypted text or public keys.

Currently, one of the main secret key algorithms is DES, although two other more recent algorithms, RC2 and RC4

have also arisen. The size (i.e. length) of the keys employed is considered a useful metric when considering the strength of a cryptosystem, because longer keys generally make encrypted text more difficult to decrypt without the appropriate key.

The DES algorithm has a fixed key length of 56 bits. This key size was long considered strong enough to withstand attacks, but the algorithm's fixed-size nature constrains it as hardware and theoretical advances are made. The RC2 and RC4 algorithms also have bounded maximum key sizes that limit their usefulness similarly.

A major problem associated with secret key systems, however, is their need for a secure channel within which keys can

be propagated. In Kerberos, every client needs to be made aware of its secret key before it can begin communication.

To do so without giving away the key to any eavesdroppers requires a secure channel. In practice, maintaining a

channel that is completely secure is very difficult and often impractical.

A second aspect to privacy concerns how much inferential information can be obtained through the system. For

example, how much information is it possible to deduce without explicitly decrypting actual messages. One particularly

disastrous situation would be if it were possible to derive the secret or private keys without mounting attacks on public

keys or encrypted messages.

In Kerberos, there is a danger that an attacker can watch a client progress through the authentication protocol.

Such information may be enough to mount an attack on the client by jamming the network at strategic points in the

protocol. Denial of service like this may be very serious in a time critical system.

In pure algorithmic terms, RSA is strong. It has the ability to support much longer key lengths than DES and similar algorithms. Key

length is also only limited by technology, and so the algorithm can keep step with increasing technology and become

stronger by being able to support longer key lengths.

Unlike secret key systems, the private keys of any public key system need never be transmitted. Provided local security

is strong, the overall strength of the algorithm gains from the fact that the private key never leaves the client.

RSA is susceptible to information leakage, however, and some recent theoretic work outlined an attack plan that could

infer the private key of a client based on some leaked, incidental information. Overall however, the RSA authentication

protocol is not as verbose as the Kerberos equivalent. Having fewer interaction stages limits the bandwidth of any

channel through which information may escape. A verbose protocol like Kerberos's simply gives an eavesdropper more

opportunity to listen and possibly defines a larger and more identifiable pattern of interaction to listen for.

Distributed systems require the ability to communicate securely with other computers in the

network. To accomplish this, most systems use key management schemes that require prior

knowledge of public keys associated with critical nodes. In large, dynamic, anonymous systems,

this key sharing method is not viable. Scribe is a method for efficient key management inside a

distributed system that uses Identity Based Encryption (IBE). Public resources in a network are

addressable by unique identifiers. Using this identifier as a public key, other entities are able to

securely access that resource. This paper evaluates key distribution schemes inside Scribe and

provides recommendations for practical implementation to allow for secure, efficient, authenticated

communication inside a distributed system.


Parallel and Distributed Databases

In parallel systems, two main types of multiprocessor system architectures are commonplace:

Shared memory (tightly coupled) architecture. Multiple processors share secondary

(disk) storage and also share primary memory.

Shared disk (loosely coupled) architecture. Multiple processors share secondary (disk)

storage but each has their own primary memory.

These architectures enable processors to communicate without the overhead of exchanging messages over a network. Database mgt systems developed using the above types of architectures are termed parallel database mgt systems rather than DDBMSs, since they utilize parallel processor

technology. Another type of multiprocessor architecture is called shared nothing architecture. In

this architecture, every processor has its own primary and secondary (disk) memory, no common

memory exists and the processors communicate over a high-speed interconnection network.

BENEFITS OF A PARALLEL DBMS

Improves response time.

Interquery parallelism: it is possible to process a number of transactions in parallel with each other.

Improves throughput.

Intraquery parallelism: it is possible to process 'sub-tasks' of a transaction in parallel with each other.

How to Measure the Benefits

Speed-Up

As you multiply resources by a certain factor, the time taken to execute a transaction should be

reduced by the same factor:

10 seconds to scan a DB of 10,000 records using 1 CPU

1 second to scan a DB of 10,000 records using 10 CPUs

Scale-up.

As you multiply resources the size of a task that can be executed in a given time should be

increased by the same factor.

1 second to scan a DB of 1,000 records using 1 CPU

1 second to scan a DB of 10,000 records using 10 CPUs

Characteristics of Parallel DBMSs

CPUs will be co-located: same machine or same building (tightly coupled).

Biggest problem: interference – contention for memory access and bandwidth (shared architectures only).

The Evolution of Distributed Database Management Systems

Distributed database management system (DDBMS) governs storage and processing of logically

related data over interconnected computer systems in which both data and processing functions are

distributed among several sites. To understand how and why the DDBMS is different from DBMS,

it is useful to briefly examine the changes in the database environment that set the stage for the

development of the DDBMS. Corporations implemented centralized database mgt systems to meet

their structured information needs. The structured information needs are well served by centralized

systems.

Basically, the use of a centralized database required that corporate data be stored in a single central

site, usually a mainframe or midrange computer. Database mgt systems based on the relational

model could provide the environment in which unstructured information needs would be met by

employing ad hoc queries. End users would be given the ability to access data when needed.

Social and technological changes that affected DB development and design:

Business operations became more decentralized geographically.

Competition increased at the global level.

Customer demands and market needs favoured a decentralised mgt style.

Rapid technological change created low-cost microcomputers with mainframe-like power.

The large number of applications based on DBMSs and the need to protect investments in

centralised DBMS software made the notion of data sharing attractive.

Those factors created a dynamic business envt in which companies had to respond quickly to

competitive and technological pressures. Two database requirements became obvious:

Rapid ad hoc data access became crucial in the quick-response decision-making envt.

The decentralisation of mgt structures based on the decentralisation of business units made decentralised multiple-access and multiple-location databases a necessity.

However, the way those factors were addressed was strongly influenced by:

The growing acceptance of the internet

The increased focus on data analysis that led to data mining and data warehousing.

The decentralized DB is especially desirable because centralised DB mgt is subject to problems

such as:

Performance degradation due to a growing number of remote locations over greater

distances.

High costs associated with maintaining and operating large central database systems.

Reliability problems created by dependence on a central site.

The dynamic business environment and the centralized database's shortcomings spawned a demand for applications based on data access from different sources at multiple locations. Such a multiple-source/multiple-location database envt is managed by a distributed DB mgt system (DDBMS).


DDBMS Advantages and Disadvantages

Advantages:

I. Data are located near the greatest demand site. The data in a distributed DB system are

dispersed to match business requirements.

II. Faster data access: end users often work with only a locally stored subset of the company's

data.

III. Faster data processing: spreads out the system's workload by processing data at several

sites.

IV. Growth facilitation: new site can be added to the network without affecting the operations

of other sites.

V. Improved communications: local sites are smaller and located closer to customers, local

sites foster better communication among departments and between customers and company

staff.

VI. Reduced operating costs: development work is done more cheaply and more quickly on

low-cost PCs than on mainframes.

VII. User-friendly interface: the GUI simplifies use and training for end users.

VIII. Less danger of a single-point failure.

IX. Processor independence: the end user is able to access any available copy of the data, and an end user's request is processed by any processor at the data location.

Disadvantages:

1. Complexity of mgt and control: applications must recognise data location and they must

be able to stitch together data from diff sites. DBA must have the ability to coordinate DB

activities to prevent DB degradation due to data anomalies.

2. Security: the probability of security lapses increases when data are located at multiple sites.

3. Lack of standards: there are no standard communication protocols at the DB level.

4. Increased storage requirements: multiple copies of data are required at diff sites, thus requiring additional disk storage space.

5. Increased training cost: generally higher in a distributed model than they would be in a

centralised model.

Distributed Processing and Distributed Databases

In distributed processing, a DB's logical processing is shared among two or more physically independent sites that are connected through a network.


A distributed Database, on the other hand, stores a logically related DB over two or more

physically independent sites. In contrast, the distributed processing system uses only a single-site

DB but shares the processing chores among several sites. In a distributed database system, a DB is composed of several parts known as Database Fragments. Consider, for example, the following distributed DB envt.

The DB is divided into three database fragments (E1, E2 and E3) located at diff sites. The

computers are connected through a network system. The users Alan, Betty and Hernando do not need to know the name or location of each fragment in order to access the DB. As you examine and contrast the distributed processing and distributed DB environments, you should keep in mind that:

Distributed processing does not require a distributed DB, but a distributed DB requires

distributed processing.

Distributed processing may be based on a single DB located on a single computer.

Both distributed processing and distributed DBs require a network to connect all

components.

Characteristics of DDBMS

i. Application interface to interact with the end user.

ii. Validation to analyse data requests.

iii. Transformation to determine which data request components are distributed

and which are local.

iv. Query optimisation to find the best access strategy.

v. Mapping to determine the data location of local and remote fragments.


vi. I/O interface to read or write data from or to permanent local storage.

vii. Formatting to prepare the data for presentation to the end user

viii. Security to provide data privacy at both local and remote DBs

ix. Concurrency control to manage simultaneous data access and to ensure data

consistency across DB fragments in the DBMS.

DDBMS Components

Computer workstations (sites or nodes) that form the network system.

Network hardware and software components that reside in each workstation.

Communications media that carry the data from one workstation to another.

Transaction processor (TP), which is the software component found in each computer that requests data. The TP receives and processes the application's data requests. The TP is also known as the Application Processor (AP) or the Transaction Manager (TM).

The Data Processor (DP), which is the software component residing on each computer that

stores and retrieves data located at the site. Also known as Data Manager (DM). A data

processor may even be a centralised DBMS.

Levels of Data and Process Distribution.

Current DB systems can be classified on the basis of how process distribution and data distribution

are supported. For example, a DBMS may store data in a single site (centralised DB) or in multiple

sites (Distributed DB) and may support data processing at a single site or at multiple sites. The table

below uses a simple matrix to classify DB systems according to data and process distribution.

Database System: levels of data and process distribution

                          Single-site Data                    Multiple-site Data
Single-site process       Host DBMS (mainframe)               Not applicable (requires multiple processes)
Multiple-site process     File server; Client/server DBMS     Fully distributed Client/server DDBMS
                          (LAN DBMS)


Single-site Processing, Single-site Data (SPSD): all processing is done on a single CPU or host

computer and all data are stored on the host computer's local disk. Processing cannot be done on the end user's side of the system. The functions of the TP and the DP are embedded within the DBMS

located on a single computer. All data storage and data processing are handled by a single CPU.

Multiple-site Processing, Single-site Data (MPSD):

Multiple processes run on diff computers sharing a single data repository. The MPSD scenario requires a network file server running conventional applications that are accessed through a LAN.

Note

- The TP on each workstation acts only as a redirector to route all network data requests to

the file server.

- The end user sees the file server as just another hard disk.

- The end user must make a direct reference to the file server in order to access remote

data. All record- and file-locking activity is done at the end-user location.

- All data selection, search and update functions take place at the workstation.

Multiple-site Processing, Multiple-site Data (MPMD):


This describes a fully distributed database management system with support for multiple data

processors and transaction processors at multiple sites. Classified as either homogeneous or

heterogeneous

Homogeneous DDBMSs – Integrate only one type of centralized DBMS over a network.

Heterogeneous DDBMSs– Integrate different types of centralized DBMSs over a network

Fully heterogeneous DDBMS– Support different DBMSs that may even support different data

models (relational, hierarchical, or network) running under different computer systems, such as

mainframes and microcomputers.

Distributed Database Transparency Features:

These features have the common property of allowing the end user to feel like the DB's only user. The user believes that (s)he is working with a centralised DBMS; all complexities of a distributed DB are hidden or transparent to the user. The features are:

Distribution Transparency: which allows a distributed DB to be treated as a single

logical DB.

Transaction Transparency: which allows a transaction to update data at several

network sites.

Failure Transparency: which ensures that the system will continue to operate in the

event of a node failure.

Performance Transparency: which allows the system to perform as if it were a

centralised DBMS. It also ensures that the system will find the most cost-effective path to

access remote data.

Heterogeneity Transparency: which allows the integration of several diff local

DBMSs under a common or global schema.

Distribution Transparency: Allows management of physically dispersed database as though it

were a centralized database. Three levels of distribution transparency are recognized:


Fragmentation transparency: the highest level of transparency. The end user or programmer does not need to know that a DB is partitioned.

Location transparency: exists when the end user or programmer must specify the DB

fragment names but does not need to specify where those fragments are located.

Local mapping transparency: exists when the end user or programmer must specify both the fragment names and their locations, as illustrated in the sketch below.
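As a sketch of the three levels (the EMPLOYEE table, the fragments E1 and E2, and the date condition are hypothetical; the exact syntax for naming sites is DDBMS-specific), the same request for employees born before 1980 might be written as:

    -- Fragmentation transparency: the user queries the logical table only
    SELECT * FROM EMPLOYEE WHERE Emp_DOB < DATE '1980-01-01';

    -- Location transparency: the fragment names must be given,
    -- but not the sites at which the fragments are stored
    SELECT * FROM E1 WHERE Emp_DOB < DATE '1980-01-01'
    UNION ALL
    SELECT * FROM E2 WHERE Emp_DOB < DATE '1980-01-01';

    -- Local mapping transparency: the query must additionally name the site
    -- holding each fragment; the node qualifier syntax varies by DDBMS.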

Transaction Transparency: Ensures database transactions will maintain the distributed database's integrity and consistency. Transaction transparency ensures that the transactions are completed only

when all DB sites involved in the transaction complete their part of the transaction.

Distributed Requests and Distributed Transactions

• Distributed transaction

– Can update or request data from several different remote sites on network

• Remote request

– Lets single SQL statement access data to be processed by single remote database

processor

• Remote transaction

– Accesses data at single remote site

• Distributed transaction – Allows transaction to reference several different (local or remote) DP

sites

• Distributed request – Lets single SQL statement reference data located at several different local or

remote DP sites. Because each request (SQL statement) can access data from more than one local or

remote DP site, a transaction can access several sites.

Distributed Concurrency Control

Concurrency control becomes especially important in the distributed envt because multisite,

multiple-process operations are much more likely to create data inconsistencies and deadlocked

transactions than are single-site systems. For example, the TP component of a DBMS must ensure

that all parts of the transaction are completed at all sites before a final COMMIT is used to record

the transaction.

Performance Transparency and Query Optimization

Because all data reside at a single site in a centralised DB, the DBMS must evaluate every data

request and find the most efficient way to access the local data. In contrast, the DDBMS makes it

possible to partition a DB into several fragments, thereby rendering the query translation more

complicated because the DDBMS must decide which fragment of the DB to access. The objective

of a query optimization routine is to minimise the total cost associated with the execution of a

request.

One of the most important characteristics of query optimisation in distributed DB system is that it

must provide distribution transparency as well as Replica transparency. Replica Transparency

refers to the DDBMS's ability to hide the existence of multiple copies of data from the user.

Operation modes can be classified as manual or automatic. Automatic query Optimization

means that the DDBMS finds the most cost-effective access path without user intervention. Manual

query optimisation requires that the optimisation be selected and scheduled by the end user or

programmer.


Query optimisation algorithms can also be classified as:

Static query optimisation: takes place at the compilation time. It creates the plan necessary to

access the DB. When the program is executed, the DBMS uses that plan to access the DB.

Dynamic query optimisation: takes place at execution time. The DB access strategy is defined when the program is executed. Although dynamic query optimisation is flexible, its cost is measured by run-time processing overhead. The best strategy is determined every time the query is executed; this could happen several times in the same program.

Distributed Database Design

• Data fragmentation – deals with how to partition the database into fragments

• Data replication – deals with which fragments to replicate

• Data allocation – deals with where to locate those fragments and replicas.

Data Fragmentation

• Breaks single object into two or more segments or fragments

• Each fragment can be stored at any site over a computer network

• Information about data fragmentation is stored in the distributed data catalog (DDC), from which it is accessed by the TP to process user requests.

Three Types of data fragmentation Strategies

Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples (rows).

Each fragment is stored at a different node, and each fragment has unique rows. However, the unique rows all have the same attributes (columns). In short, each fragment represents the

equivalent of a SELECT statement, with the WHERE clause on a single attribute.

Vertical fragmentation: Division of a relation into attribute (column) subsets. Each subset (fragment) is stored on a diff node, and each fragment has unique columns – with the exception of the key column, which is common to all fragments. This is the equivalent of the PROJECT statement.

Mixed fragmentation: Combination of horizontal and vertical strategies. In other words, a table may be divided into several horizontal subsets (rows), each one having a subset of the attributes (columns). A SQL sketch of the horizontal and vertical strategies follows this list.
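A sketch of the first two strategies (the CUSTOMER table, its columns, and the fragment names are hypothetical; CREATE TABLE ... AS SELECT is supported by many, though not all, SQL dialects):

    -- Horizontal fragment: a subset of rows, selected by a WHERE condition
    CREATE TABLE CUSTOMER_E1 AS
        SELECT * FROM CUSTOMER WHERE Region = 'East';

    -- Vertical fragment: a subset of columns, always carrying the key
    CREATE TABLE CUSTOMER_V1 AS
        SELECT Cus_Num, Cus_Name, Cus_Phone FROM CUSTOMER;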


Data Replication

This refers to the storage of data copies at multiple sites served by a computer network.

Fragment copies can be stored at several sites to serve specific information requirements

– Can enhance data availability and response time

– Can help to reduce communication and total query costs.

Replicated data are subject to the mutual consistency rule. The mutual consistency rule requires

that all copies of data fragments be identical. Therefore to maintain data consistency among the

replicas, the DDBMS must ensure that a DB update is performed at all sites where replicas exist.

Three Replication scenarios exist: a DB can be:

Fully replicated database - Stores multiple copies of each database fragment at multiple

sites. It can be impractical due to the amount of overhead it imposes on the system.

Partially replicated database - Stores multiple copies of some database fragments at

multiple sites. Most DDBMSs are able to handle the partially replicated database well.

Unreplicated database: Stores each database fragment at single site. No duplicate database

fragments.

Several factors influence the decision to use data replication:

• Database size

• Usage frequency

• Costs


Data Allocation

It describes the process of deciding where to locate data. The data allocation strategies are as follows:

With Centralized data allocation, the entire database is stored at one site

With Partitioned data allocation, the database is divided into several disjoint parts

(fragments) and stored at several sites.

With Replicated data allocation, Copies of one or more database fragments are stored at

several sites.

Data distribution over a computer network is achieved through data partitioning, data replication, or a combination of both. Data allocation is closely related to the way a database is divided or

fragmented.

Data allocation algorithms take into consideration a variety of factors, including:

Performance and data availability goals.

Size, number of rows and number of relations that an entity maintains with other

entities.

Types of transactions to be applied to the DB and the attributes accessed by each of

those transactions.

Client/Server vs. DDBMS

Client/server architecture refers to the way in which computers interact to form a system. The architecture features a user of resources, or client, and a provider of resources, or server. The client/server architecture can be used to implement a DBMS in which the client is the TP and the server is the DP.

Client/server advantages

Less expensive than alternate minicomputer or mainframe solutions

Allows the end user to use the microcomputer's GUI, thereby improving functionality and simplicity

More people in job market have PC skills than mainframe skills

PC is well established in workplace.

Numerous data analysis and query tools exist to facilitate interaction with DBMSs available

in PC market

There is a considerable cost advantage to offloading applications development from

mainframe to powerful PCs.

Client/server disadvantages

Creates a more complex environment in which different platforms (LANs, operating systems, etc.) are often difficult to manage.

An increase in the number of users and processing sites often paves the way for security problems.

The C/S envt makes it possible to spread data access to a much wider circle of users. Such an envt increases the demand for people with broad knowledge of computers and software applications. The burden of training increases the cost of maintaining the environment.

C. J. Date’s Twelve Commandments for Distributed Databases

1. Local site independence. Each local site can act as an independent, autonomous, centralized

DBMS. Each site is responsible for security, concurrency control, backup and recovery.


2. Central site independence. No site in the network relies on a central site or any other site.

All sites have the same capabilities.

3. Failure independence. The system is not affected by node failures.

4. Location transparency. The user does not need to know the location of the data in order to

retrieve those data

5. Fragmentation transparency. The user sees only one logical DB. Data fragmentation is

transparent to the user. The user does not need to know the name of the DB fragments in

order to retrieve them.

6. Replication transparency. The user sees only one logical DB. The DDBMS transparently

selects the DB fragment to access.

7. Distributed query processing. A distributed query may be executed at several different DP

sites.

8. Distributed transaction processing. A transaction may update data at several different sites.

The transaction is transparently executed at several diff DP sites.

9. Hardware independence. The system must run on any hardware platform.

10. Operating system independence. The system must run on any OS software platform.

11. Network independence. The system must run on any network platform.

12. Database independence. The system must support any vendor's DB product.

Two-phase commit protocol

Two-phase commit is a standard protocol in distributed transactions for achieving ACID properties. Each transaction has a coordinator who initiates and coordinates the transaction.

In the two-phase commit the coordinator sends a prepare message to all participants (nodes) and waits for

their answers. The coordinator then sends their answers to all other sites. Every participant waits for these

answers from the coordinator before committing to or aborting the transaction. If committing, the

coordinator records this into a log and sends a commit message to all participants. If for any reason a

participant aborts the process, the coordinator sends a rollback message and the transaction is undone using

the log file created earlier. The advantage of this is that all participants reach a decision consistently, yet independently.
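Some DBMSs expose the participant side of this protocol in SQL. As a sketch (the ACCOUNT table and the transaction identifier are hypothetical), PostgreSQL's PREPARE TRANSACTION and COMMIT PREPARED statements correspond to the two phases:

    BEGIN;
    UPDATE ACCOUNT SET Balance = Balance - 100 WHERE Acc_Num = 1;
    -- Phase 1: vote to commit; the work is made durable but left pending
    PREPARE TRANSACTION 'txn_42';

    -- Phase 2: issued once the coordinator has heard from every participant
    COMMIT PREPARED 'txn_42';
    -- or, if any participant voted to abort:
    -- ROLLBACK PREPARED 'txn_42';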

However, the two-phase commit protocol also has limitations in that it is a blocking protocol. For example,

participants will block resource processes while waiting for a message from the coordinator. If for any

reason this fails, the participant will continue to wait and may never resolve its transaction. Therefore the

resource could be blocked indefinitely. On the other hand, a coordinator will also block resources while

waiting for replies from participants. In this case, a coordinator can also block indefinitely if no

acknowledgement is received from the participant. Despite this limitation, many systems still use the two-phase commit protocol.

Three-phase commit protocol

An alternative to the two-phase commit protocol used by many database systems is the three-phase commit.

Dale Skeen describes the three-phase commit as a non-blocking protocol. He then goes on to say that it was

developed to avoid the failures that occur in two-phase commit transactions.

As with the two-phase commit, the three-phase also has a coordinator who initiates and coordinates the

transaction. However, the three-phase protocol introduces a third phase called the pre-commit. The aim of

this is to ‘remove the uncertainty period for participants that have committed and are waiting for the global

abort or commit message from the coordinator’. When receiving a pre-commit message, participants know

that all others have voted to commit. If a pre-commit message has not been received the participant will abort and release any blocked resources.

Review Questions

1. Describe the evolution from centralized DBMS to distributed DBMS.

2. List and discuss some of the factors that influenced the evolution of the DDBMS.

3. What are the advantages and disadvantages of a DDBMS?


4. Explain the difference between a distributed DB and distributed processing.

5. What is a fully distributed DB mgt system?

6. What are the components of a DDBMS?

7. Explain the transparency features of a DDBMS.

8. Define and explain the different types of distribution transparency.

9. Explain the need for the two-phase commit protocol. Then describe the two phases.

10. What is the objective of the query optimisation function?

11. To which transparency feature are the query optimisation functions related?

12. What are the different types of query optimisation algorithms?

13. Describe the three data fragmentation strategies. Give some examples.

14. What is data replication and what are the three replication strategies?

15. Explain the difference between file server and client/server architecture.

Data Warehouse

The Need for Data Analysis

Managers must be able to track daily transactions to evaluate how the business is

performing.

By tapping into operational database, management can develop strategies to meet

organizational goals.

Data analysis can provide information about short-term tactical evaluations and strategies.

Given the many and varied competitive pressures, managers are always looking for a competitive

advantage through product development and maintenance, service, market positioning, sales

promotion. In addition, the modern business climate requires managers to approach increasingly

complex problems that involve a rapidly growing number of internal and external variables.

Different managerial levels require different decision support needs. Managers require detailed

information designed to help them make decisions in a complex data and analysis environment. To

support such decision making, information systems (IS) departments have created decision support systems (DSSs).

Decision support System

• Decision support is methodology (or series of methodologies) designed to extract

information from data and to use such information as a basis for decision making

• Decision support system (DSS)


– Arrangement of computerized tools used to assist managerial decision making within

business

– Usually requires extensive data “massaging” to produce information

– Used at all levels within organization

– Often tailored to focus on specific business areas or problems such as finance,

insurance, banking and sales.

– The DSS is interactive and provides ad hoc query tools to retrieve data and to display

data in different formats.

Keep in mind that managers must initiate the decision support process by asking the appropriate

questions. The DSS exists to support the manager; it does not replace the mgt function.

• DSS is composed of following four main components:

Data store component: Basically a DSS database. The data store contains two main types of data: business data and business model data. The business data are extracted from

the operational DB and from external data sources. The external data source provides data

that cannot be found within the company. The business models are generated by special

algorithms that model the business to identify and enhance the understanding of business

situation and problems.

Data extraction and data filtering component: Used to extract and validate data taken

from operational database and external data sources. For example, to determine the relative

market share by selected product line, the DSS requires data about competitors' products.

Such data can be located in external DBs provided by the industry groups or by companies

that market the data. This component extracts the data, filters the extracted data to select the

relevant records, and packages the data in the right format to be added to the DSS data store

component.

End-user query tool: Used by data analyst to create queries that access database.

Depending on the DSS implementation, the query tool accesses either the operational DB or

more commonly, the DSS DB. This tool advises the user on which data to select and how to

build a reliable business data model.

End-user presentation tool: Used to organize and present data. This also helps the end user

select the most appropriate presentation format such as summary report or mixed graphs.

Although the DSS is used at strategic and tactical managerial levels within the organization, its

effectiveness depends on the quality of data gathered at the operational level.


Operational Data vs. Decision Support Data

• Operational Data

– Mostly stored in relational database in which the structures (tables) tend to be highly

normalized.

– Operational data storage is optimized to support transactions representing daily

operations.

• DSS Data

– Give tactical and strategic business meaning to operational data.

– Differs from operational data in following three main areas:

Timespan: operational data covers a short time frame.

Granularity (level of aggregation) DSS data must be presented at different

levels of aggregation from highly summarized to near atomic.

Dimensionality: operational data focus on representing individual

transactions rather than on the effects of the transactions over time.

Difference Between Operational and DSS Data

Data currency: operational data cover current operations in real time; DSS data are historic snapshots of company data with a time component (week/month/year).

Granularity: operational data are atomic, detailed data; DSS data are summarized.

Summarization level: low for operational data (some aggregates); high for DSS data (many aggregation levels).

Data model: operational data are highly normalized and stored mostly in relational DBMSs; DSS data are non-normalized, with complex structures, stored in some relational but mostly multidimensional DBMSs.

Transaction type: mostly updates for operational data; mostly queries for DSS data.

Transaction volumes: high update volumes for operational data; periodic loads and summary calculations for DSS data.

Transaction speed: updates are critical for operational data; retrievals are critical for DSS data.

Query activity: low to medium for operational data; high for DSS data.

Query scope: narrow range for operational data; broad range for DSS data.

Query complexity: simple to medium for operational data; very complex for DSS data.

Data volumes: hundreds of megabytes, up to gigabytes, for operational data; hundreds of gigabytes, up to terabytes, for DSS data.
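The contrast in scope, granularity, and dimensionality can be seen in the kinds of queries each environment runs. The following small sketch (invented data, hypothetical table) runs an operational-style lookup of one current transaction and then a DSS-style aggregation across dimensions and time.

    # Contrasting operational-style and DSS-style queries over the same data.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE sales (invoice_no INTEGER, product TEXT, region TEXT,
                            sale_date TEXT, amount REAL);
        INSERT INTO sales VALUES
            (1001, 'Widget', 'North', '2023-11-02', 50.0),
            (1002, 'Widget', 'South', '2024-01-15', 75.0),
            (1003, 'Gadget', 'North', '2024-02-20', 40.0);
    """)

    # Operational: narrow scope, atomic detail, a single current transaction.
    print(db.execute("SELECT * FROM sales WHERE invoice_no = 1002").fetchone())

    # DSS: broad scope, summarized, with an explicit time dimension (year).
    for row in db.execute("""
            SELECT substr(sale_date, 1, 4) AS year, region, product, SUM(amount)
            FROM sales
            GROUP BY year, region, product
            ORDER BY year, region, product"""):
        print(row)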


DSS Database Requirements: A DSS database is a specialized DBMS tailored to provide fast answers to complex queries.

• Four main requirements:

– Database schema

– Data extraction and loading

– End-user analytical interface

– Database size

Database schema

– Must support complex data representations

– Must contain aggregated and summarized data

– Queries must be able to extract multidimensional time slices

Data extraction

– Should allow batch and scheduled data extraction

– Should support different data sources

• Flat files

• Hierarchical, network, and relational databases

• Multiple vendors

Data filtering: Must allow checking for inconsistent data and must support data validation rules.

End-user analytical interface

The DSS DBMS must support advanced data modeling and data presentation tools. Using those tools makes it easy for data analysts to define the nature and extent of business problems.

The end-user analytical interface is one of the most critical DSS DBMS components. When properly implemented, an analytical interface permits the user to navigate through the data to simplify and accelerate the decision-making process.

Database size

In 2005, Wal-Mart had 260 terabytes of data in its data warehouses.

A DSS database typically contains redundant and duplicated data to improve retrieval and to simplify information generation. Therefore, the DBMS must support very large databases (VLDBs).

The Data Warehouse

The acknowledged "father of the data warehouse," Bill Inmon, defines the term as an integrated, subject-oriented, time-variant, nonvolatile collection of data that provides support for decision making.

Integrated: the data warehouse is a centralized, consolidated database that integrates data derived from the entire organization and from multiple sources with diverse formats. Data integration implies that all business entities, data elements, data characteristics, and business metrics are described in the same way throughout the enterprise.

Subject-oriented: data warehouse data are arranged and optimized to provide answers to questions coming from diverse functional areas within a company. Data warehouse data are organized and summarized by topic, such as sales or marketing.

Time-variant: warehouse data represent the flow of data through time. The data warehouse can

even contain projected data generated through statistical and other models. It is also time-variant

in the sense that once data are periodically uploaded to the data warehouse, all time-dependent

aggregations are recomputed.

Non-volatile: once data enter the data warehouse, they are never removed. Because the data in the warehouse represent the company's history, the operational data, representing the near-term history, are always added to it. Data are never deleted and new data are continually added, so the data warehouse is always growing.

Comparison of Data Warehouse and Operational Database Characteristics

Integrated: In an operational database, similar data can have different representations or meanings. The data warehouse provides a unified view of all data elements, with a common definition and representation for all business units.

Subject-oriented: Operational data are stored with a function or process orientation; for example, data may be stored for invoices, payments, and credit amounts. Data warehouse data are stored with a subject orientation that facilitates multiple views of the data and facilitates decision making; for example, sales may be recorded by product, by division, by manager, or by region.

Time-variant: Operational data are recorded as current transactions; for example, the sales data may be the sale of a product on a given date. Data warehouse data are recorded with a historical perspective in mind; therefore, a time dimension is added to facilitate data analysis and various time comparisons.

Non-volatile: In an operational database, data updates are frequent and common; for example, an inventory amount changes with each sale, so the data environment is fluid. Data warehouse data cannot be changed; data are added only periodically from historical systems, and once the data are properly stored, no changes are allowed, so the data environment is relatively static.

In summary, the data warehouse is usually a read-only database optimized for data analysis and query processing. Creating a data warehouse requires time, money, and considerable managerial effort.

Data Warehouse properties


• The warehouse is organized around the major subjects of an enterprise (e.g. customers,

products, and sales) rather than the major application areas (e.g. customer invoicing, stock

control, and order processing). – Subject Oriented

• The data warehouse integrates corporate application-oriented data from different source systems, which often include data that are inconsistent. Such data must be made consistent to present a unified view of the data to the users. – Integrated

• Data in the warehouse is only accurate and valid at some point in time or over some time

interval. Time-variance is also shown in the extended time that the data is held, the

association of time with all data, and the fact that data represents a series of historical

snapshots. – Time Variant

• Data in the warehouse is not updated in real-time but is refreshed from operational systems

on a regular basis. New data is always added as a supplement to the database, rather than a

replacement. – Non-volatile

A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people. Some organizations choose to implement data marts not only because of the lower cost and shorter implementation time, but also because of current technological advances and the inevitable people issues that make data marts attractive.

Data marts can serve as a test vehicle for companies exploring the potential benefits of data warehouses. By migrating gradually from data marts to data warehouses, a specific department's decision support needs can be addressed within a reasonable time frame, as compared with the longer time frame usually required to implement a data warehouse.

The difference between a data mart and a data warehouse is only the size and scope of the problem being solved.

Twelve Rules That Define a Data Warehouse

1. The data warehouse and operational environments are separated.

2. Data warehouse data are integrated.

3. The data warehouse contains historical data over a long time horizon.

4. Data warehouse data are snapshot data captured at a given point in time.

5. Data warehouse data are subject oriented.

6. Data warehouse data are mainly read-only with periodic batch updates from operational data; no online updates are allowed.


7. The data warehouse development life cycle differs from classical systems development: data warehouse development is data-driven, whereas the classical approach is process-driven.

8. The data warehouse contains data with several levels of detail: current detail data, old detail data, lightly summarized data, and highly summarized data.

9. The data warehouse environment is characterized by read-only transactions against very large data sets; the operational environment is characterized by numerous update transactions against a few data entities at a time.

10. The data warehouse environment has a system that traces data sources, transformations, and storage.

11. The data warehouse's metadata are a critical component of this environment; the metadata identify and define all data elements.

12. The data warehouse contains a chargeback mechanism for resource usage that enforces optimal use of the data by end users.

Online Analytical Processing (OLAP) tools create an advanced data analysis environment that supports decision making, business modeling, and operations research.

OLAP systems share four main characteristics:

– Use multidimensional data analysis techniques

– Provide advanced database support

– Provide easy-to-use end-user interfaces

– Support client/server architecture

Multidimensional Data Analysis Techniques

The most distinctive characteristic of modern OLAP tools is their capacity for multidimensional analysis. In multidimensional analysis:

Data are processed and viewed as part of a multidimensional structure.

This type of data analysis is particularly attractive to business decision makers because they

tend to view business data as data that are related to other business data.

Multidimensional data analysis techniques are augmented by the following functions:

– Advanced data presentation functions: 3-D graphics, pivot tables, crosstabs, and so on.

– Advanced data aggregation, consolidation, and classification functions that allow the data analyst to create multiple data aggregation levels and to slice and dice the data.

– Advanced computational functions: business-oriented variables, financial and accounting ratios.

– Advanced data modeling functions: support for what-if scenarios, variable assessment, variable contributions to outcome, linear programming, and other modeling tools.
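As a small illustration of pivoting and slicing and dicing, the sketch below uses pandas, which is an assumption on our part (the text names no tool), and entirely invented data.

    # A brief sketch of multidimensional-style analysis with a pivot table.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "North"],
        "product": ["Widget", "Gadget", "Widget", "Gadget", "Widget"],
        "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
        "amount":  [100, 80, 120, 60, 90],
    })

    # Pivot: regions as rows, quarters as columns, summed amounts in the cells.
    cube = pd.pivot_table(sales, values="amount", index="region",
                          columns="quarter", aggfunc="sum", fill_value=0)
    print(cube)

    # "Slice": restrict one dimension (only Q1); "dice": restrict several at once.
    print(cube["Q1"])                                                     # slice
    print(sales[(sales["region"] == "North") & (sales["quarter"] == "Q1")])  # dice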


Advanced Database Support

To deliver efficient decision support, OLAP tools must have advanced data access features, which include:

– Access to many different kinds of DBMSs, flat files, and internal and external data

sources.

– Access to aggregated data warehouse data as well as to detail data found in

operational databases.

– Advanced data navigation features such as drill-down and roll-up.

– Rapid and consistent query response times

– Ability to map end-user requests to appropriate data source and then to proper data

access language (usually SQL)

– Support for very large databases

Easy-to-Use End-User Interface: Many of the interface features are “borrowed” from previous generations of data analysis tools that are already familiar to end users. This familiarity makes OLAP easily accepted and readily used.

Client Server Architecture


This provides a framework within which new systems can be designed, developed, and implemented. The client/server environment:

– Enables the OLAP system to be divided into several components that define its architecture

– Allows OLAP to meet ease-of-use as well as system flexibility requirements

OLAP ARCHITECTURE: OLAP operational characteristics can be divided into three main modules:

· Graphical user interface (GUI)

· Analytical processing logic

· Data-processing logic

In addition, the OLAP architecture is designed to use both operational and data warehouse data, and it is defined as an "advanced data analysis environment that supports decision making, business modeling, and operations research activities." In most implementations, the data warehouse and OLAP are interrelated and complementary environments.

RELATIONAL OLAP (ROLAP): Provides OLAP functionality by using relational databases and familiar relational query tools to store and analyze multidimensional data.

• ROLAP adds the following extensions to the traditional RDBMS:

– Multidimensional data schema support within the RDBMS

– Data access language and query performance optimized for multidimensional data

– Support for very large databases (VLDBs)
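A multidimensional schema in a relational database is typically organized as a star schema: one fact table joined to several dimension tables. The sketch below (hypothetical names, SQLite only for illustration) shows the kind of star join a ROLAP tool would issue.

    # A minimal star-schema sketch of the kind a ROLAP tool would query.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE fact_sales  (time_id INTEGER, product_id INTEGER, units INTEGER, revenue REAL);

        INSERT INTO dim_time    VALUES (1, 2024, 1), (2, 2024, 2);
        INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware'), (11, 'Gadget', 'Hardware');
        INSERT INTO fact_sales  VALUES (1, 10, 5, 500.0), (1, 11, 2, 240.0), (2, 10, 7, 700.0);
    """)

    # A typical star join: aggregate the facts by attributes of the dimensions.
    for row in db.execute("""
            SELECT t.year, t.month, p.category, SUM(f.revenue)
            FROM fact_sales f
            JOIN dim_time t    ON f.time_id = t.time_id
            JOIN dim_product p ON f.product_id = p.product_id
            GROUP BY t.year, t.month, p.category"""):
        print(row)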

Relational vs. Multidimensional OLAP


Schema: ROLAP uses a star schema, and additional dimensions can be added dynamically; MOLAP uses data cubes, and additional dimensions require re-creation of the data cube.

Database size: ROLAP handles medium to large databases; MOLAP handles small to medium databases.

Architecture: ROLAP is client/server, standards-based, and open; MOLAP is client/server and proprietary.

Access: ROLAP supports ad hoc requests and unlimited dimensions; MOLAP is limited to predefined dimensions.

Resources: ROLAP requires high resources; MOLAP requires very high resources.

Flexibility: ROLAP is high; MOLAP is low.

Scalability: ROLAP is high; MOLAP is low.

Speed: ROLAP is good with small data sets and average for medium to large data sets; MOLAP is faster for small to medium data sets and average for large data sets.

Review Questions

1. What are decision support systems, and what role do they play in the business environment?

2. Explain how the main components of a DSS interact to form a system.

3. What are the most relevant differences between operational and decision support data?

4. What is a data warehouse, and what are its main characteristics?

5. Give three examples of problems likely to be encountered when operational data are integrated into the data warehouse.

While working as a database analyst for a national sales organization, you are asked to be part of its data warehouse project team.

6. Prepare a high-level summary of the main requirements for evaluating DBMS products for data warehousing.

8. Suppose you are selling the data warehouse idea to your users. How would you define multidimensional data analysis for them? How would you explain its advantages to them?

9. Before making a commitment, the data warehousing project group has invited you to provide an OLAP overview. The group's members are particularly concerned about the OLAP client/server architecture requirements and how OLAP will fit the existing environment. Your job is to explain to them the main OLAP client/server components and architectures.

11. The project group is ready to make a final decision, choosing between ROLAP and MOLAP. What should be the basis for this decision? Why?

14. What is OLAP, and what are its main characteristics?

15. Explain ROLAP, and give the reasons you would recommend its use in the relational database environment.


20. Explain some of the most important issues in data warehouse implementation.

Web DBMS

Database System: An Introduction to OODBMS and Web DBMS

PROBLEMS WITH RDBMSs

Poor representation of 'real world' entities.

Semantic overloading.

Poor support for integrity and business constraints.

Homogeneous data structure.

Limited operations.

Difficulty handling recursive queries.

Impedance mismatch.

Difficulty with 'long transactions'.

Object Oriented Database Management Systems (OODBMSs): These are an attempt at

marrying the power of Object Oriented Programming Languages with the persistence and

associated technologies of a DBMS.

Object Oriented Database Management System = features from OOPLs + features from DBMSs

From OOPLs:

· Complex objects
· Object identity
· Methods and messages
· Inheritance
· Polymorphism
· Extensibility
· Computational completeness

From DBMSs:

· Persistence
· Disc management
· Data sharing
· Reliability
· Security
· Ad hoc querying

THE OO DATABASE MANIFESTO

CHARACTERISTICS THAT ‘MUST BE’ SUPPORTED

· Complex objects

· Object Identity

· Encapsulation

· Classes

· Inheritance

· Overriding and late-binding

· Extensibility

· Computational completeness

· Persistence


· Concurrency

· Recovery

· Ad-hoc querying

Requirements and Features

Requirements:

· Transparently add persistence to OO programming languages

· Ability to handle complex data (i.e., multimedia data)

· Ability to handle data complexity (i.e., interrelated data items)

· Add DBMS Features to OO programming languages.

Features:

· The host programming language is also the DML.

· The in-memory and storage models are merged.

· No conversion code between models and languages is needed.
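The features above amount to making the language's own objects persistent without mapping them onto tables. The snippet below is not an OODBMS, only a small standard-library sketch of that idea: an ordinary in-memory object is stored and retrieved as-is, with no conversion code between the object model and a relational storage model.

    # A stdlib sketch of transparent object persistence (illustrative only).
    import shelve
    from dataclasses import dataclass, field

    @dataclass
    class Order:                       # an ordinary in-memory object
        order_no: int
        items: list = field(default_factory=list)

    with shelve.open("orders_demo") as store:                 # persistent object store
        store["order:42"] = Order(42, ["widget", "gadget"])   # store the object as-is

    with shelve.open("orders_demo") as store:
        restored = store["order:42"]                          # comes back as an Order object
        print(restored.order_no, restored.items)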

Data Storage for Web Site

File based systems:

information in separate HTML files

file management problems

information update problems

static web pages, non-interactive

Database based systems:

database accessed from the web

dynamic information handling

data management and data updating through the DBMS

Interconnected networks

TCP/IP (Transmission Control Protocol/ Internet Protocol)

HTTP (HyperText Transfer Protocol)

Internet Database

Web database connectivity allows new innovative services that:


Permit rapid response to competitive pressures by bringing new services and products to

market quickly.

Increase customer satisfaction through creation of Web-based support services.

Yield fast and effective information dissemination through universal access, from across the street or across the globe.

Characteristics and Benefits of Internet Technologies

Hardware and software independence: savings in equipment/software acquisition; ability to run on most existing equipment; platform independence and portability; no need for multiple platform development.

Common and simple user interface: reduced training time and cost; reduced end-user support cost; no need for multiple platform development.

Location independence: global access through the Internet infrastructure; reduced requirements (and costs) for dedicated connections.

Rapid development at manageable costs: availability of multiple development tools; plug-and-play development tools (open standards); more interactive development; reduced development times; relatively inexpensive tools; free client access tools (Web browsers); low entry costs and frequent availability of free Web servers; reduced costs of maintaining private networks; distributed processing and scalability using multiple servers.

Web-to-Database Middleware: Server-Side Extensions

A server-side extension is a program that interacts directly with the Web server to handle specific types of requests. It also makes it possible to retrieve and present the query results, but what is more important is that it provides its services to the Web server in a way that is totally transparent to the client browser. In short, the server-side extension adds significant functionality to the Web server, and therefore to the Internet.

A database server-side extension program is also known as Web-to-database middleware.


The typical interaction works as follows:

1. The client browser sends a page request to the Web server.

2. The Web server receives and validates the request.

3. The Web-to-database middleware reads, validates, and executes the script; in this case, it connects to the database and passes the query using the database connectivity layer.

4. The database server executes the query and passes the result back to the Web-to-database middleware.

5. The Web-to-database middleware compiles the result set, dynamically generates an HTML-formatted page that includes the data retrieved from the database, and sends it to the Web server.

6. The Web server returns the just-created HTML page, which now includes the query result, to the client browser.

7. The client browser displays the page on the local computer.

The interaction between the Web server and the Web-to-database middleware is crucial to the

development of a successful internet database implementation. Therefore, the middleware must be

well integrated with the other internet services and the components that are involved in its use.
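To make that flow tangible, here is a compressed, self-contained sketch: a toy web layer receives the request, the "middleware" part queries the database, builds an HTML page from the result set, and returns it to the browser. The table, data, and use of the standard-library WSGI server in place of a real Web server are all assumptions for illustration.

    # A toy Web-to-database middleware: query the DB, render HTML, return it.
    import sqlite3
    from wsgiref.simple_server import make_server

    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.executescript("""
        CREATE TABLE products (name TEXT, price REAL);
        INSERT INTO products VALUES ('Widget', 9.99), ('Gadget', 19.99);
    """)

    def app(environ, start_response):
        # "Middleware" step: run the query and format the result set as HTML.
        rows = db.execute("SELECT name, price FROM products").fetchall()
        body = "<html><body><ul>" + "".join(
            f"<li>{name}: {price}</li>" for name, price in rows) + "</ul></body></html>"
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body.encode("utf-8")]

    if __name__ == "__main__":
        make_server("localhost", 8000, app).serve_forever()   # browse http://localhost:8000/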

Web Server Interfaces:

A Web server interface defines how a Web server communicates with external programs.

• Two well-defined Web server interfaces:

– Common Gateway Interface (CGI)

– Application programming interface (API)


The Common Gateway Interface (CGI) uses script files that perform specific functions based on the client's parameters that are passed to the Web server. The script file is a small program containing commands written in a programming language. The script file's contents can be used to connect to the database and to retrieve data from it, using the parameters passed by the Web server.

A script is a series of instructions executed in interpreter mode. The script is a plain text file that is

not compiled like COBOL, C++, or Java. Scripts are normally used in Web application

development environments.
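A minimal CGI script looks like the sketch below: the Web server passes the client's parameters through environment variables (such as QUERY_STRING), runs the script, and returns whatever the script prints to the browser. The parameter name used here is invented for illustration.

    #!/usr/bin/env python3
    # A minimal CGI-style script: read the query string, emit headers, emit HTML.
    import os
    from urllib.parse import parse_qs

    params = parse_qs(os.environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]

    print("Content-Type: text/html")     # HTTP header
    print()                              # blank line ends the headers
    print(f"<html><body><h1>Hello, {name}!</h1></body></html>")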

An application programming interface (API) is a newer Web server interface standard that is more efficient and faster than a CGI script. APIs are more efficient because they are implemented as shared code or as dynamic-link libraries (DLLs). APIs are faster than CGI scripts because the code resides in memory, so there is no need to run an external program for each request. Because APIs share the same memory space as the Web server, however, an API error can bring down the server. The other disadvantage is that APIs are specific to the Web server and to the operating system.

The Web Browser

This is the application software, such as Microsoft Internet Explorer or Mozilla Firefox, that lets users navigate (browse) the Web. Each time the end user clicks a hyperlink, the browser generates an HTTP GET page request that is sent to the designated Web server using the TCP/IP Internet protocol. The Web browser's job is to interpret the HTML code that it receives from the Web server and to present the different page components in a standard formatted way.

The Web as a Stateless System: "Stateless" means that at any given time, the Web server does not know the status of any of the clients communicating with it. Client and server computers interact in very short “conversations” that follow the request-reply model.

XML Presentation

Extensible Markup Language (XML) is a metalanguage used to represent and manipulate data elements. XML is designed to facilitate the exchange of structured documents, such as orders and invoices, over the Internet. XML provides the semantics that facilitate the exchange, sharing, and manipulation of structured documents across organizational boundaries.

One of the main benefits of XML is that it separates the data structure from its presentation and processing.
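The sketch below builds and serializes a small, hypothetical order document of the kind XML is used for, using only the standard library; the element and attribute names are invented.

    # Building a small structured XML document (a hypothetical order).
    import xml.etree.ElementTree as ET

    order = ET.Element("order", id="1001")
    customer = ET.SubElement(order, "customer")
    customer.text = "ROBCOR Company"
    item = ET.SubElement(order, "item", sku="W-17", qty="3")
    item.text = "Widget"

    print(ET.tostring(order, encoding="unicode"))
    # <order id="1001"><customer>ROBCOR Company</customer><item sku="W-17" qty="3">Widget</item></order>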


Data Storage for Web Sites

• HTTP provides multiple transactions between clients and the server

• Based on the request-response paradigm (a small sketch follows this list):

- Connection (from client to Web server)

- Request (message to Web server)

- Response (the requested information, returned as an HTML file)

- Close (connection to Web server closed)

• The Web is used as an interface platform that provides access to one or more databases

• This raises the question of database connectivity

• Open architecture approaches allow interoperability:

- (Distributed) Component Object Model (Microsoft DCOM/COM)

- CORBA (Common Object Request Broker Architecture)

- Java/RMI (Remote Method Invocation)
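The request-response cycle referenced above can be seen end to end in a few lines; the host used here is just an example.

    # The HTTP request-response cycle: connect, request, response, close.
    import http.client

    conn = http.client.HTTPConnection("example.com", 80)   # Connection
    conn.request("GET", "/")                                # Request
    response = conn.getresponse()                           # Response
    print(response.status, response.reason)
    print(response.read()[:200])                            # first bytes of the HTML
    conn.close()                                            # Close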

DBMS Architecture

• Integration of web with database application

• Vendor/ product independent connectivity

• Interface independent of proprietary Web browsers

• Capability to use all the features of the DBMS

• Access to corporate data in a secure manner


Two-tier client-server architecture:

- User interface / transaction logic

- Database application / data storage


Three-tier client-server architecture maps suitably to the Web environment

- First tier: Web browser, “thin” client

- Second tier: Application server, Web server

- Third tier: DBMS server, DBMS

[Figure: DBMS-Web architecture (three-tier client-server): user interface; transaction/application logic; DBMS]


• N-tier client-server architecture (Internet Computing Model):

- Web browser, (thin client)

- Web server

- Application server,

- DBMS server, DBMS

[Figures: DBMS-Web architecture as a three-tier client-server model and as an N-tier client-server (Internet Computing) model]


Integrating the Web and DBMS

Integration between the Web server and the application server: Web requests received by the Web server invoke transactions on the application server, through either of:

- CGI (Common Gateway Interface)

- Non-CGI gateways

• CGI (Common Gateway Interface) transfers information between a Web server and a CGI program

- CGI programs (scripts) run on either the Web server or the application server

- Scripts can be written in VBScript or Perl

- CGI is Web server independent and scripting language independent



• Non-CGI gateways: proprietary interfaces, specific to a vendor's Web server

- Netscape API (Sun Microsystems)

- ASP (Active Server Pages), for Microsoft Internet Information Server

• Integration between the application server and the DBMS

- Applications on the application server connect to and interact with the database

- Connections between the application server and the databases are provided by an API (application programming interface)

Standard APIs:

• ODBC (Open Database Connectivity) connects application programs to DBMS.

• JDBC (Java Database Connectivity) connects Java applications to DBMS

• ODBC (Open Database Connectivity)

- A standard API providing a common interface for accessing SQL databases

- The DBMS vendor provides a set of library functions for database access

- The functions are invoked by application software

- They execute SQL statements and return rows of data as the result of a data search

- ODBC is the de facto industry standard

ODBC is Microsoft's implementation of a superset of the SQL Access Group Call Level Interface (CLI) standard for database access. ODBC is probably the most widely supported database connectivity interface. ODBC allows any Windows application to access a relational data source using SQL via a standard application programming interface (API). Microsoft also developed two other data access interfaces: Data Access Objects (DAO) and Remote Data Objects (RDO).

[Figure: Integrating the Web and the DBMS - the CGI (Common Gateway Interface) environment]


DAO is an object-oriented API used to access MS Access, MS FoxPro, and dBase databases (using the Jet data engine) from Visual Basic programs.

RDO is a higher-level object-oriented application interface used to access remote database servers.

The basic ODBC architecture has three main components:

A higher-level ODBC API through which application programs access ODBC functionality.

A driver manager that is in charge of managing all database connections.

An ODBC driver that communicates directly with the DBMS.

Defining a data source is the first step in using ODBC. To define a data source, you must create a data source name (DSN) for the data source. To create a DSN you need to provide:

An ODBC driver

A DSN name

The ODBC driver parameters
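Once a DSN has been configured, an application can open a connection through it and issue SQL. The sketch below assumes the third-party pyodbc package (not mentioned in the text); the DSN name, credentials, and table are hypothetical.

    # Accessing a database through an ODBC data source name (illustrative).
    import pyodbc

    conn = pyodbc.connect("DSN=SalesDSN;UID=report_user;PWD=secret")
    cursor = conn.cursor()
    cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)
    conn.close()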

JDBC (Java Database Connectivity)

- Modelled after ODBC; a standard API

- Provides access to DBMSs from Java application programs

- Machine-independent architecture

- Direct mapping of RDBMS tables to Java classes

- SQL statements are used as string variables in Java methods (embedded SQL)

Review Questions

1. What is the difference between DAO and RDO?

2. What are the three basic components of the ODBC architecture?

3. What steps are required to create an ODBC data source name?

4. What are Web server interfaces used for? Give some examples.

5. What does this statement mean: "The Web is a stateless system"? What implications does a stateless system have for database application developers?

6. What is a Web application server, and how does it work from a database perspective?

7. What are scripts, and what is their function? (Think in terms of database application development.)

8. What is XML, and why is it important?