
    MCA Sem. IV

MC0077 Advanced Database Systems

    Assignment Set -1

    Q. 1. Describe the Following.

    A. Dimensional Model

    The dimensional model is a specialized adaptation of the relational model used to represent data

    in data warehouses in a way that data can be easily summarized using OLAP queries. In the

    dimensional model, a database consists of a single large table of facts that are described using

    dimensions and measures. A dimension provides the context of a fact (such as who participated,

    when and where it happened, and its type) and is used in queries to group related facts together.

    Dimensions tend to be discrete and are often hierarchical; for example, the location might

    include the building, state, and country. A measure is a quantity describing the fact, such as

revenue. It's important that measures can be meaningfully aggregated; for example, the revenue

    from different locations can be added together.

    In an OLAP query, dimensions are chosen and the facts are grouped and added together to create

    a summary.

    The dimensional model is often implemented on top of the relational model using a star schema,

    consisting of one table containing the facts and surrounding tables containing the dimensions.

    Particularly complicated dimensions might be represented using multiple tables, resulting in a

    snowflake schema.


    A data warehouse can contain multiple star schemas that share dimension tables, allowing them

    to be used together. Coming up with a standard set of dimensions is an important part of

    dimensional modeling.
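As a minimal sketch of this idea (all table and column names here are illustrative, not from the course text), a star schema and an OLAP-style roll-up query might look like:

CREATE TABLE date_dim  (date_id  INTEGER PRIMARY KEY, day DATE, month INTEGER, year INTEGER);
CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, building VARCHAR(30), state VARCHAR(20), country VARCHAR(20));

CREATE TABLE sales_fact (
  date_id  INTEGER REFERENCES date_dim,   -- dimension: when it happened
  store_id INTEGER REFERENCES store_dim,  -- dimension: where it happened
  revenue  DECIMAL(10,2)                  -- measure: additive quantity
);

-- An OLAP-style summary: choose dimensions, group, and aggregate the measure.
SELECT d.year, s.country, SUM(f.revenue) AS total_revenue
FROM sales_fact f
JOIN date_dim  d ON f.date_id  = d.date_id
JOIN store_dim s ON f.store_id = s.store_id
GROUP BY d.year, s.country;

A snowflake schema would further normalize store_dim, e.g. into separate building, state, and country tables.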

    B. Object Database Model

    In recent years, the object-oriented paradigm has been applied to database technology, creating a

    new programming model known as object databases. These databases attempt to bring the

    database world and the application programming world closer together, in particular by ensuring

    that the database uses the same type system as the application program. This aims to avoid the

    overhead (sometimes referred to as the impedance mismatch) of converting information between

    its representation in the database (for example as rows in tables) and its representation in the

    application program (typically as objects). At the same time, object databases attempt to

    introduce the key ideas of object programming, such as encapsulation and polymorphism, into

    the world of databases.

A variety of ways have been tried for storing objects in a database. Some products have

    approached the problem from the application programming end, by making the objects

    manipulated by the program persistent. This also typically requires the addition of some kind of

query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database

    end, by defining an object-oriented data model for the database, and defining a database

    programming language that allows full programming capabilities as well as traditional query

    facilities.

    Object databases suffered because of a lack of standardization: although standards were defined

    by ODMG, they were never implemented well enough to ensure interoperability between

    products. Nevertheless, object databases have been used successfully in many applications:

    usually specialized applications such as engineering databases or molecular biology databases

    rather than mainstream commercial data processing. However, object database ideas were picked

    up by the relational vendors and influenced extensions made to these products and indeed to the

SQL language.


    C. Post Relational Database Model.

    Several products have been identified as post-relational because the data model incorporates

relations but is not constrained by the Information Principle, which requires that all information be

    represented by data values in relations. Products using a post-relational data model typically

    employ a model that actually pre-dates the relational model. These might be identified as a

    directed graph with trees on the nodes.

    Post-relational databases could be considered a sub-set of object databases as there is no need for

    object-relational mapping when using a post-relational data model. In spite of many attacks on

this class of data models, with designations of being hierarchical or legacy, the post-relational database industry continues to grow as a multi-billion dollar industry, even if the growth stays

    below the relational database radar.

    Examples of models that could be classified as post-relational are PICK aka MultiValue, and

    MUMPS, aka M.

Q. 2. Explain the concept of a query. How does a Query Optimizer work?

    Ans. The aim of query processing is to find information in one or more databases and deliver it

    to the user quickly and efficiently. Traditional techniques work well for databases with standard,

    single-site relational structures, but databases containing more complex and diverse types of data

    demand new query processing and optimization techniques. Most real-world data is not well

structured. Today's databases typically contain much non-structured data such as text, images,

video, and audio, often distributed across computer networks. In this complex milieu (typified by the World Wide Web), efficient and accurate query processing becomes quite challenging.

    Principles of Database Query Processing for Advanced Applications teaches the basic concepts

    and techniques of query processing and optimization for a variety of data forms and database

    systems, whether structured or unstructured.


    Query Optimizer

    The Query Optimizer is the component of a database management system that attempts to

    determine the most efficient way to execute a query. The optimizer considers the possible query

    plans (discussed below) for a given input query, and attempts to determine which of those plans

    will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each

    possible query plan, and choose the plan with the least cost. Costs are used to estimate the

    runtime cost of evaluating the query, in terms of the number of I/O operations required, the CPU

    requirements, and other factors.

    Query plan

A Query Plan (or Query Execution Plan) is a set of steps used to access information in a SQL

    relational database management system. This is a specific case of the relational model concept of

    access plans. Since SQL is declarative, there are typically a large number of alternative ways to

    execute a given query, with widely varying performance. When a query is submitted to the

    database, the query optimizer evaluates some of the different, correct possible plans for

    executing the query and returns what it considers the best alternative. Because query optimizers

    are imperfect, database users and administrators sometimes need to manually examine and tune

    the plans produced by the optimizer to get better performance.

    The set of query plans examined is formed by examining the possible access paths (e.g. index

    scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loops). The

    search space can become quite large depending on the complexity of the SQL query.

    The query optimizer cannot be accessed directly by users. Instead, once queries are submitted to

the database server and parsed by the parser, they are then passed to the query optimizer, where

    optimization occurs.

    Implementation

    Most query optimizers represent query plans as a tree of "plan nodes". A plan node encapsulates

    a single operation that is required to execute the query. The nodes are arranged as a tree, in


    which intermediate results flow from the bottom of the tree to the top. Each node has zero or

more child nodes; these are nodes whose output is fed as input to the parent node. For example,

    a join node will have two child nodes, which represent the two join operands, whereas a sort

    node would have a single child node (the input to be sorted). The leaves of the tree are nodes

    which produce results by scanning the disk, for example by performing an index scan or a

    sequential scan.
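As an illustrative sketch (PostgreSQL-style EXPLAIN syntax; the plan in the comments is hypothetical, and output formats differ between systems), the plan tree chosen by the optimizer for a query over the example relations used later in this document can be inspected like this:

EXPLAIN
SELECT c.cname, b.assets
FROM customer c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname
WHERE c.ccity = 'Port Chester';

-- A possible (hypothetical) plan tree, read bottom-up:
--   Hash Join (d.bname = b.bname)
--     -> Hash Join (d.cname = c.cname)
--          -> Seq Scan on customer c (Filter: ccity = 'Port Chester')
--          -> Seq Scan on deposit d
--     -> Seq Scan on branch b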

Q. 3. Explain the following with respect to Heuristics of Query Optimization:

    A. Equivalence of Expressions

    The first step in selecting a query-processing strategy is to find a relational algebra expression

    that is equivalent to the given query and is efficient to execute.

We'll use the following relations as examples:

    Customer-scheme = (cname, street, ccity)

Deposit-scheme = (bname, account#, cname, balance)

    Branch-scheme = (bname, assets, bcity)

    B. Selection Operation

1. Consider the query to find the assets and branch-names of all banks who have depositors living

    in Port Chester. In relational algebra, this is

Π bname, assets (σ ccity = "Port Chester" (customer ⋈ deposit ⋈ branch))

    - This expression constructs a huge relation,


customer ⋈ deposit ⋈ branch, of which we are only interested in a few tuples.

    - We also are only interested in two attributes of this relation.

- We can see that we only want tuples for which ccity = "Port Chester".

    - Thus we can rewrite our query as:

Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

    - This should considerably reduce the size of the intermediate relation.

    2. Suggested Rule for Optimization:

    - Perform select operations as early as possible.

    - If our original query was restricted further to customers with a balance over $1000, the

    selection cannot be done directly to the customer relation above.

- The new relational algebra query is

Π bname, assets (σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit ⋈ branch))

- The selection cannot be applied to customer, as balance is an attribute of deposit. We can still rewrite as

Π bname, assets ((σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit)) ⋈ branch)

- If we look further at the subquery (the selection over customer ⋈ deposit), we can split the selection predicate in two:

σ ccity = "Port Chester" (σ balance > 1000 (customer ⋈ deposit))


- This rewriting gives us a chance to use our "perform selections early" rule again.

- We can now rewrite our subquery as:

σ ccity = "Port Chester" (customer) ⋈ σ balance > 1000 (deposit)

3. Second Transformational Rule:

- Replace expressions of the form σ P1 ∧ P2 (e) by σ P1 (σ P2 (e)), where P1 and P2 are predicates and e is a relational algebra expression.

- Generally,

σ P1 (σ P2 (e)) = σ P2 (σ P1 (e)) = σ P1 ∧ P2 (e)
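For reference, here is a hedged SQL rendering of the running example (column names follow the relation schemes above); a heuristic optimizer applies the rules just described internally, evaluating both selections before the joins:

SELECT b.bname, b.assets
FROM customer c, deposit d, branch b
WHERE c.cname = d.cname          -- natural-join condition customer-deposit
  AND d.bname = b.bname          -- natural-join condition deposit-branch
  AND c.ccity = 'Port Chester'   -- selection pushed down to customer
  AND d.balance > 1000;          -- selection pushed down to deposit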

C. Projection Operation

1. Like selection, projection reduces the size of relations.

It is advantageous to apply projections early. Consider this form of our example query:

Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

(σ ccity = "Port Chester" (customer)) ⋈ deposit

we obtain a relation whose scheme is (cname, ccity, bname, account#, balance).


    3. We can eliminate several attributes from this scheme. The only ones we need to retain are

    those that

    - appear in the result of the query or

    - are needed to process subsequent operations.

    4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate

    result, and thus its size.

5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

Π bname, assets ((Π bname ((σ ccity = "Port Chester" (customer)) ⋈ deposit)) ⋈ branch)

6. Note that there is no advantage in doing an early projection on a relation before it is needed for

    some other operation:

    - We would access every block for the relation to remove attributes.

    - Then we access every block of the reduced-size relation when it is actually needed.

    - We do more work in total, rather than less!

D. Natural Join Operation

    Another way to reduce the size of temporary results is to choose an optimal ordering of the join

    operations.

Natural join is associative:

(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

    Although these expressions are equivalent, the costs of computing them may differ.

Look again at our expression:

Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

We see that we can compute deposit ⋈ branch first and then join with the first part. However, deposit ⋈ branch is likely to be a large relation as it contains one tuple for every account.

The other part, σ ccity = "Port Chester" (customer), is probably a small relation (comparatively).

So, if we compute σ ccity = "Port Chester" (customer) ⋈ deposit first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester.

This temporary relation is much smaller than deposit ⋈ branch.

Natural join is commutative:

r1 ⋈ r2 = r2 ⋈ r1

Thus we could rewrite our relational algebra expression as:

Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ branch) ⋈ deposit)

But there are no common attributes between customer and branch, so this is a Cartesian

    product.

    Lots of tuples!

    If a user entered this expression, we would want to use the associativity and commutativity of

    natural join to transform this into the more efficient expression we have derived earlier (join with

deposit first, then with branch).

Q. 4. There are a number of historical, organizational, and technological reasons that explain the lack of an all-encompassing data management system. Discuss a few of them with

    appropriate examples.

    Ans. Models of Failures

    Failures can be classified as

    1) Transaction Failures

    a) Error in transaction due to incorrect data input.

    b) Present or potential deadlock.

    c) Abort of transactions due to non-availability of resources or deadlock.

2) Site Failures: From the recovery point of view, a failure has to be judged from the viewpoint of

    loss of memory. So failures can be classified as

a) Failure with Loss of Volatile Storage: In these failures, the content of main memory is lost; however, all the information recorded on disks is not affected by the failure. Typical

    failures of this kind are system crashes.


    b) Media Failures (Failures with loss of Nonvolatile Storage): In these failures the content of

    disk storage is lost. Failures of this type can be reduced by replicating the information on several

    disks having independent failure modes.

Stable storage is the most resilient storage medium available in the system. It is implemented by replicating the same information on several disks with (i) independent failure modes, and (ii) using the so-called careful replacement strategy: at every update operation, first one copy of the

    information is updated, then the correctness of the update is verified, and finally the second copy

    is updated.

    3) Communication Failures: There are two basic types of possible communication errors: lost

    messages and partitions.

    When a site X does not receive an acknowledgment of a message from a site Y within a

    predefined time interval, X is uncertain about the following things:

    i) Did a failure occur at all, or is the system simply slow?

    ii) If a failure occurred, was it a communication failure, or a crash of site Y?

iii) Has the message been delivered at Y or not? (As the communication failure or the crash can happen before or after the delivery of the message.)

Figure: Network Partition

    Thus all failures can be regrouped as

    i) Failure of a site

    ii) Loss of message(s), with or without site failures but no partitions.

    iii) Network Partition: Dealing with network partitions is a harder problem than dealing with site

    crashes or lost messages.


    Q.5 Describe the Structural Semantic Data Model (SSM) with relevant examples.

    Ans. The Structural Semantic Model, SSM, first described in Nordbotten (1993a & b), is an

extension and graphic simplification of the EER modeling tool first presented in the 1989 edition of Elmasri & Navathe (2003). SSM was developed as a teaching tool and has been and can

    continue to be modified to include new modeling concepts. A particular requirement today is the

    inclusion of concepts and syntax symbols for modeling multimedia objects.

SSM Concepts

    The current version of SSM belongs to the class of Semantic Data Model types extended with

concepts for specification of user-defined data types and functions, UDT and UDF. It supports the modeling concepts defined and compared below. The following diagram shows the concepts and graphic syntax of SSM, which include:

    Data Modeling Concepts


    1. Three types of entity specifications: base (root), subclass, and weak

    2. Four types of inter-entity relationships: n-ary associative, and 3 types of classification

    hierarchies,


    3. Four attribute types: atomic, multi-valued, composite, and derived,

4. Domain type specifications in the graphic model, including: standard data types, binary large objects (blob, text, image, ...), user-defined types (UDT) and functions (UDF),

    5. Cardinality specifications for entity to relationship-type connections and for multi-valued

    attribute types and

    6. Data value constraints.
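To make items 3 and 4 of this list concrete, here is a hedged sketch in SQL:1999/2003-style DDL (the Employee table and its columns are invented for illustration, and support for ROW, MULTISET and generated columns varies between systems):

CREATE TABLE Employee (
  Id     INTEGER PRIMARY KEY,          -- atomic attribute
  Name   ROW (First VARCHAR(20),
              Last  VARCHAR(20)),      -- composite attribute
  Phones VARCHAR(15) MULTISET,         -- multi-valued attribute
  Photo  BLOB,                         -- large-object domain type
  Salary DECIMAL(9,2),
  Tax    DECIMAL(9,2) GENERATED ALWAYS
           AS (Salary * 0.30)          -- derived attribute
);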

    Q-6. Describe the following with respect to Fuzzy querying to relational databases:

    A. Proposed Model

    The easiest way of introducing fuzziness in the database model is to use classical

relational databases and formulate a front end that allows fuzzy querying of the database. A limitation imposed on the system is that, because we are not extending the


    database model nor are we defining a new model in any way, the underlying database

    model is crisp and hence the fuzziness can only be incorporated in the query.

To incorporate fuzziness we introduce fuzzy sets (linguistic terms) on the attribute domains (linguistic variables); e.g., on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD. These are defined as follows:

Figure: Fuzzy sets YOUNG, MIDDLE and OLD on the Age domain

    For this we take the example of a student database which has a table STUDENTS with

    the following attributes:

    A snapshot of the data existing in the database


    B. Meta knowledge

    At the level of meta knowledge we need to add only a single table, LABELS with the

    following structure:

    Meta Knowledge

    This table is used to store the information of all the fuzzy sets defined on all the attribute

    domains. A description of each column in this table is as follows:

    Label: This is the primary key of this table and stores the linguistic term associated with

    the fuzzy set.

    Column_Name: Stores the linguistic variable associated with the given linguistic term.

Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set.
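As a hedged sketch (assuming trapezoidal membership functions, which four breakpoints like these typically describe; the numeric ranges are invented), the LABELS table and its rows for the Age domain might be:

CREATE TABLE LABELS (
  Label       VARCHAR(20) PRIMARY KEY,  -- linguistic term, e.g. YOUNG
  Column_Name VARCHAR(30),              -- linguistic variable, e.g. AGE
  Alpha       DECIMAL(10,2),            -- the four breakpoints of the
  Beta        DECIMAL(10,2),            -- (assumed) trapezoidal
  Gamma       DECIMAL(10,2),            -- membership function
  Delta       DECIMAL(10,2)
);

INSERT INTO LABELS VALUES ('YOUNG',  'AGE',  0,  0, 20, 30);
INSERT INTO LABELS VALUES ('MIDDLE', 'AGE', 20, 30, 40, 50);
INSERT INTO LABELS VALUES ('OLD',    'AGE', 40, 50, 99, 99);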

    C. Implementation

    The main issue in the implementation of this system is the parsing of the input fuzzy

    query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the

INSERT query will not change and need not be parsed; it can therefore be presented to the database as it is. During parsing, the query is divided into the following parts:

    1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.


2. Result Attributes: The attributes that are to be displayed; used only in the case of the

    SELECT query.

    3. Source Tables: The tables on which the query is to be applied.

    4. Conditions: The conditions that have to be specified before the operation is performed.

    It is further sub-divided into Query Attributes (i.e. the attributes on which the query is to

be applied) and the linguistic term. If the condition is not fuzzy, i.e. it does not contain a linguistic term, then it need not be subdivided. A sketch of the resulting translation is shown below.
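As an illustrative sketch (the STUDENTS attributes and the YOUNG range are assumptions carried over from the LABELS example above), a fuzzy query and its crisp translation might look like:

-- Fuzzy query as entered by the user (hypothetical STUDENTS attributes):
--   SELECT Name FROM STUDENTS WHERE Age = YOUNG;

-- The front end finds YOUNG in LABELS (Alpha=0, Beta=0, Gamma=20, Delta=30)
-- and rewrites the fuzzy condition into a crisp range over the support of
-- the fuzzy set before submitting it to the underlying database:
SELECT Name
FROM STUDENTS
WHERE Age >= 0 AND Age < 30;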


    Master of Computer Application (MCA) Semester 4

MC0077 Advanced Database Systems - 4 Credits

    (Book ID: B0882)

Assignment Set - 2 (60 Marks)

1. How are costs computed for the execution of a query? Discuss the method of measuring Index Selectivity.

    Ans 1:

Heuristics of Query Optimization

    Equivalence of Expressions

The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We'll use the following relations as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)

    Selection Operation

1. Consider the query to find the assets and branch-names of all banks who have depositors living in Port Chester. In relational algebra, this is

Π bname, assets (σ ccity = "Port Chester" (customer ⋈ deposit ⋈ branch))

o This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are only interested in a few tuples.
o We also are only interested in two attributes of this relation.
o We can see that we only want tuples for which ccity = "Port Chester".
o Thus we can rewrite our query as:

Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

o This should considerably reduce the size of the intermediate relation.

    2. Suggested Rule for Optimization:

    o Perform select operations as early as possible.


o If our original query was restricted further to customers with a balance over $1000, the selection cannot be done directly to the customer relation above.
o The new relational algebra query is

Π bname, assets (σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit ⋈ branch))

o The selection cannot be applied to customer, as balance is an attribute of deposit. We can still rewrite as

Π bname, assets ((σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit)) ⋈ branch)

o If we look further at the subquery, we can split the selection predicate in two:

σ ccity = "Port Chester" (σ balance > 1000 (customer ⋈ deposit))

o This rewriting gives us a chance to use our "perform selections early" rule again.
o We can now rewrite our subquery as:

σ ccity = "Port Chester" (customer) ⋈ σ balance > 1000 (deposit)

3. Second Transformational Rule:

o Replace expressions of the form σ P1 ∧ P2 (e) by σ P1 (σ P2 (e)), where P1 and P2 are predicates and e is a relational algebra expression.
o Generally,

σ P1 (σ P2 (e)) = σ P2 (σ P1 (e)) = σ P1 ∧ P2 (e)

    Projection Operation

1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

(σ ccity = "Port Chester" (customer)) ⋈ deposit

we obtain a relation whose scheme is (cname, ccity, bname, account#, balance).

3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
o appear in the result of the query or
o are needed to process subsequent operations.

4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.

5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

Π bname, assets ((Π bname ((σ ccity = "Port Chester" (customer)) ⋈ deposit)) ⋈ branch)

6. Note that there is no advantage in doing an early projection on a relation before it is needed for some other operation:
o We would access every block for the relation to remove attributes.


o Then we access every block of the reduced-size relation when it is actually needed.
o We do more work in total, rather than less!

    Natural Join Operation

Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression

Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

We see that we can compute deposit ⋈ branch first and then join with the first part. However, deposit ⋈ branch is likely to be a large relation as it contains one tuple for every account.

The other part, σ ccity = "Port Chester" (customer), is probably a small relation (comparatively). So, if we compute σ ccity = "Port Chester" (customer) ⋈ deposit first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester.

This temporary relation is much smaller than deposit ⋈ branch. Natural join is commutative:

r1 ⋈ r2 = r2 ⋈ r1

Thus we could rewrite our relational algebra expression as:

Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ branch) ⋈ deposit)

But there are no common attributes between customer and branch, so this is a Cartesian product. Lots of tuples! If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform it into the more efficient expression we derived earlier (join with deposit first, then with branch).

2. Describe the following with respect to SQL3 DB specification:

A) Complex Structures
B) Hierarchical Structures
C) Relationships
D) Large Objects, LOBs
E) Storage of LOBs

    Ans 2:

    (A) Complex structures

1. Create row type Address_t defines the address structure that is used in line 8.

2. Street#, Street, ... are regular SQL2 specifications for atomic attributes.

3. PostCode and Geo-Loc are both defined as having user-defined data types, Pcode and Point respectively. Pcode is typically locally defined as a list or table of valid postal codes, perhaps with the post office name.

4. Create function Age_f defines a function for calculation of an age, as a decimal value, given a start date as the input argument and using a simple algorithm based on the current date. This function is used as the data type in line 9 and will be activated each time the Person.age attribute is retrieved. The function can also be used as a condition clause in a SELECT statement.

5. Create table PERSON initiates specification of the implementation structure for the Person entity-type.

6. Id is defined as the primary key. The not null phrase only controls that some 'not null' value is given. The primary key phrase indicates that the DBMS is to guarantee that the set of values for Id is unique.

7. Name has a data-type, PersName, defined as a row type similar to the one defined in lines 1-3. BirthDate is a date that can be used as the argument for the function Age_f defined in line 4.

8. Address is defined using the row type Address_t, defined in lines 1-3. Picture is defined as a BLOB, or Binary Large Object. Note that there are no functions for content search, manipulation or presentation, which support BLOB data types. These must be defined either by the user as user-defined functions, UDFs, or by the ORDBMS vendor in a supplementary subsystem. In this case, we need functions for image processing.

9. Age is defined as a function, which will be activated each time the attribute is retrieved. This costs processing time (though this algorithm is very simple), but gives a correct value each time the attribute is used.
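The figure being annotated is not reproduced in this transcript; a hedged reconstruction from the commentary above (data types follow the text where given and are otherwise assumptions, and the derived-attribute syntax in the last line follows the text's description rather than any shipping DBMS) might read:

CREATE ROW TYPE Address_t (
  Street#  VARCHAR(10),
  Street   VARCHAR(40),
  PostCode Pcode,     -- user-defined type: valid postal codes
  Geo_Loc  Point      -- user-defined type
);

CREATE FUNCTION Age_f (StartDate DATE) RETURNS DECIMAL(5,2)
  RETURN (CURRENT_DATE - StartDate) / 365.25;  -- simple current-date algorithm

CREATE TABLE PERSON (
  Id        INTEGER NOT NULL PRIMARY KEY,
  Name      PersName,          -- row type like Address_t (definition not shown)
  BirthDate DATE,
  Address   Address_t,
  Picture   BLOB,
  Age       Age_f(BirthDate)   -- derived attribute, per the text's description
);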

    (B) Hierarchical Structures

1. Create table STUDENT initiates specification of the implementation of a subclass entity type.

2. GPA, Level, ... are the attributes for the subclass, here with simple SQL2 data types.

3. under PERSON specifies the table as a subclass of the table PERSON. The DBMS thus knows that when the STUDENT table is requested, all attributes and functions in PERSON are also relevant. An OR-DBMS will store and use the primary key of PERSON as the key for STUDENT, and execute a join operation to retrieve the full set of attributes.

4. Create table COURSE specifies a new table specification, as done for the statements in lines 5 and 10 above.

5. Id, Name, and Level are standard atomic attribute types with SQL2 data types. Id is defined as requiring a unique, non-null value, as specified for PERSON in line 6 above.

6. Note that attributes must have unique names within their tables, but the name may be reused, with different data domains, in different tables. Both Id and Name are such attribute-names, appearing in both PERSON and COURSE, as is Level, used in STUDENT and COURSE.

7. Course.Description is defined as a character large object, CLOB. A CLOB data type has the same defined character-string functions as char, varchar, and long char, and can be compared to these. User_id is defined as Ucode, which is the name of a user-defined data type, presumably a list of acceptable user codes. The DB implementer must define both the data type and the appropriate functions for processing this type.

8. User_Id is also specified as a foreign key which links the Course records to their "user" record, modeled as a category sub-entity-type, through the primary key in the User table.
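Again, the annotated figure is missing; a hedged reconstruction from the commentary (column types and lengths are assumptions) might read:

CREATE TABLE STUDENT (
  GPA   DECIMAL(3,2),
  Level CHAR(2)
) UNDER PERSON;   -- subclass: inherits PERSON's attributes, functions and key

CREATE TABLE COURSE (
  Id          INTEGER NOT NULL PRIMARY KEY,
  Name        VARCHAR(40),
  Level       CHAR(2),                  -- attribute name reused from STUDENT
  Description CLOB,
  User_Id     Ucode REFERENCES "User"   -- UDT + FK to the "user" table
);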

    (C) Relationships

The relationship TakenBy is defined in Figure b. This definition needs only SQL2 specifications. Note that:

{Sid, Cid, Term} form the primary key, PK. Since the key is composite, a separate Primary Key clause is required (as compared with the single-attribute PK specifications for PERSON.Id and COURSE.Id).

The 2 foreign key attributes in the PK must be defined separately. TakenBy.Report is a foreign key to a report entity-type, forming a ternary relationship as modeled in Figure a. The ON DELETE trigger is activated if the Report relation is deleted and assures that the FK link has a valid value, in this case 'null'.
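A hedged reconstruction of Figure b from this description (the Sid/Cid/Term data types are assumptions; the ON DELETE action follows the text):

CREATE TABLE TakenBy (
  Sid    INTEGER NOT NULL REFERENCES STUDENT,
  Cid    INTEGER NOT NULL REFERENCES COURSE,
  Term   CHAR(6) NOT NULL,
  Report INTEGER REFERENCES REPORT
           ON DELETE SET NULL,  -- keeps the FK valid ('null') if the report is deleted
  PRIMARY KEY (Sid, Cid, Term)  -- composite key requires a separate clause
);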


(D) Large Objects, LOBs

The SSM syntax includes data types for potentially very long media types, such as text, image, audio and video, as shown in Figure 6.8. If this model is to be realized in a single database, the DMS will have to have the capability to manage - store, search, retrieve, and manipulate - different media types. Object-relational DBMS vendors claim to be able to do this.


    Figure: Media objects as attributes

SQL3 provides support for storage of Binary Large OBjects, BLOBs. A BLOB is simply a very long bit string, limited in many systems today to 2 or 4GB. Several OR-DBMS vendors differentiate BLOBs into data-types that give more information about the format of the content and provide basic/primitive manipulation functions for these large object, LOB, types. For example, IBM's DB2 has 3 LOB types:

BLOB for long bit strings,
CLOB for long character strings, and
DBCLOB for double-byte character strings.
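As a brief, hedged sketch (table and column names are assumptions; the length modifiers follow DB2-style syntax), a table mixing these differentiated LOB types might be declared as:

CREATE TABLE Document (
  Id       INTEGER PRIMARY KEY,
  Title    VARCHAR(200),
  Keywords CLOB(32K),    -- character LOB: comparable with the (extended) LIKE
  Text     CLOB(10M),    -- DB2-style length modifier
  Image    BLOB(100M)    -- raw bit string: no content functions in SQL3
);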

Oracle data types for large objects are BLOB, CLOB, NCLOB (fixed-width multi-byte CLOB) and BFILE (binary file stored outside the DB). Note that the first 3 are equivalent to the DB2 LOBs, while the last is really not a data-type, but rather a link to an externally stored media object.

SQL3 has no functions for processing, e.g. indexing, the content of a BLOB, and provides only functions to store and retrieve it given an external identifier. For example, if the BLOB is an image, SQL3 does not 'know' how to display it, i.e. it has no functions for image presentation.

DBMS vendors who provide differentiated blob types have also extended the basic SQL string comparison operators so that they will function for LOBs, or at least CLOBs. These operators include the pattern match function "LIKE", which gives a true/false response if the search string is found/not found in the *LOB attribute. Note: "LIKE" is a standard SQL predicate that simply has been extended to search very long data domains.

(E) Storage of LOBs

There are 3 strategies for storing LOBs in an or-DB:

1. Embedded in a column of the defining relation, or


2. Stored in a separate table within the DB, linked from the *LOB column of the defining relation, or

3. Stored on an external (local or geographically distant) medium, again linked from the *LOB column of the defining relation.

Embedded storage in the defining relation closely maps the logical view of the media object to its physical storage. This strategy is best if the other attributes of the table are primarily structural metadata used to specify display characteristics, for example length, language, format.

The problem with embedded storage is that a DMS must transfer at least a whole tuple, more commonly a block of tuples, from storage for processing. If blobs are embedded in the tuples, a great deal of data must be transmitted even if the LOB objects are not part of the query selection criteria or the result. For example, a query retrieving the name and address of persons living in Bergen, Norway, would also retrieve large quantities of image data if the data for the Person.Picture attribute of Figure 8 were stored as an embedded column in the Person table.

Separate table storage gives indirect access via a link in the defining relation and delays retrieval of the LOB until it is to be part of the query result set. Though this gives a two-step retrieval, for example when requesting an image of Joan Nordbotten, it will reduce general or average transfer time for the query processing system. A drawback of this storage strategy is a likely fragmentation of the DB area, as LOBs can be stored 'anywhere'. This will decrease the efficiency of any algorithm searching the content of a larger set of LOBs, for example to find images that are similar to or contain a given image segment. As usual, the storage structure chosen for a DB should be based on an analysis of anticipated user queries.

External storage is useful if the DB data is 'connected' to established media databases, either locally on CD, DVD, ..., or on other computers in a network, as will most likely be the case when sharing media data stored in autonomous applications, such as cooperating museums, libraries, archives, or government agencies. This storage structure eliminates the need for duplication of large quantities of data that are normally offered in read-only mode. The cost is in access time, which may currently be nearly unnoticeable. A good multimedia DMS should support each of these storage strategies.

    3. Explain: A) Data Warehouse Architecture B) Data Storage Methods

    Ans 3:

    A. Data Warehouse Architecture

The term Data Warehouse Architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include Decision Support Systems (DSS), Management Information Systems (MIS), and others.

The Data Warehouse Architecture describes the overall system from various perspectives, such as data, process, and infrastructure, needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall system. The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related.

The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

    B. Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity. Complex information is broken down into its simplest structures (tables), where all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Codd defines 5 increasingly stringent rules of normalization, and typically OLTP systems achieve 3rd-level normalization. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables.

Relational database managers are efficient at managing the relationships between tables, and deliver very fast insert/update performance because only a little bit of data is affected in each relational transaction.

OLTP databases are efficient because they are typically only dealing with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time, the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

Designing the data warehouse data architecture is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides for the most useful and flexible basis for use in reporting and information analysis. However, because of differing focus on specific requirements, there can be alternative methods for designing and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse.

    In the "dimensional" approach, transaction data is partitioned into either ameasured "facts", which are generally numeric data that captures specific values or"dimensions" which contain the reference information that gives each transaction

  • 7/29/2019 SEM 4 MC0077 Advances Database System

    28/38

    its context. As an example, a sales transaction would be broken up into facts suchas the number of products ordered, and the price paid, and dimensions such asdate, customer, product, geographical location and salesperson.

The main advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

In the "normalized" approach, the data in the warehouse is stored following database normalization rules, with tables grouped together by subject areas. The main advantage of this approach is that it is quite straightforward to add new information into the database; the primary disadvantage is that, because of the number of tables involved, it can be rather slow to produce information and reports.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around the business transactions, such as customer enrollment, sales and trades.

4. Discuss how the process of retrieving text data differs from the process of retrieving an image.

Ans 4:

    Text Retrieval Using SQL3/TextRetrieval

SQL3 supports storage of multimedia data, such as text documents, in an or-database using the blob/clob data types. However, the standard SQL3 specification does not include support for such media content processing functions as indexing or searching using elements of the media content. For example, SQL3's support for a query to retrieve documents about famous Norwegian artists is limited to using a serial search of all documents using the pattern match operator 'LIKE'. Queries using this operator are likely to miss relevant Web sites, such as those dedicated to the composer.

Seekers of information from text-based documents commonly use 'free text' queries, i.e. queries that consist of a set of selection terms, as illustrated above. Depending on the underlying query processing system, the input can vary from a single search term to a longer document. This is a 'normal' input format for Information Retrieval, IR, systems, such as the web search engines, but not for systems based on SQL.

Therefore, most of the larger or-dbms vendors (IBM, Oracle, Ingres, Postgres, etc.) have used SQL3's UDT/UDF support to extend their or-dbms with sub-systems for the management of media data. The approach used has been to add on their own or purchased specialized media management systems to the basic or-dbms. Basically, the new (to SQL3) functionality includes:

    Indexing Routines for the various types of media data, as discussed in CH.6, for

    example using:

    o Content terms for text data and


    o Color, shape, and texture features for image data.

    Selection Operators for the SQL3 WHERE clause for specification of selection

    criteria for media retrieval.

    Text Processing Sub-Systems for similarity evaluation and result ranking.

Unfortunately, the result of this 'independent' activity is non-standard or-dbms/mm (multimedia) systems that differ in the functionality included and limit data retrieval from multiple or-dbms system types. For example, unified access to data stored in Oracle and DB2 systems is difficult, both in query formulation and result presentation. Since the syntax of the SQL3 extensions varies between or-dbms/mm implementations, the examples used in the following are given in generic SQL3/TextRetrieval (or sql3/tr) statements.

    Text Document Retrieval

Text-based documents are basically unstructured and can be complex. They can consist of the raw text only, have a tagged structure (such as for html documents), include embedded images, and can have a number of fixed attributes containing the metadata describing aspects of the document. They may also include links to supplementary materials. For example, a news report for an election could include the following components, where n, m, k, and x are the number of occurrences of each component type:

1. Identifier, date, and author(s) of the report,

    2. n* text blocks - (titles, abstract, content text),

    3. m* images - example: image_of_candidate

    4. k* charts, and

    5. x* maps.

Note that the document elements listed in pt. 1 above function as context metadata for the report, while the text itself can function as semantic metadata for both the text (through indexing) and the image materials. The Web document shown in the figure illustrates elements of a semi-structured document. Since an OR-DB can contain text documents such as web pages, SQL3 should be extended with processing operators that support access to each of the element types listed above.

    Retrieval using Context Metadata

In an OR-DB, document descriptors such as Document ID, Date, and Author(s) function as context metadata. The metadata can be implemented as standard atomic attributes and relationships, thus enabling use of standard SQL queries for retrieval of the document(s). For example, an SQL query to find recent articles on database management by Joan Nordbotten could be expressed as:

Select R.*
FROM Person P, Author A, Report R
WHERE P.id = A.Pid AND A.Rid = R.id
AND Name = 'Joan Nordbotten'
AND A.Date > '1999-12-31'
AND Title LIKE '%Database%';

Note that this query assumes that there could be reports on different topics and therefore requires use of a semantic descriptor to select only those documents that indicate that the report has something to do with databases. The Title attribute was used in this query, but other semantic metadata, such as the summary and/or keyword attributes, could also have been chosen - alone or in combination.

Execution optimization of this query will place the LIKE operator 'last', so that its time-consuming serial search of the Report.title attribute will be restricted to those reports that satisfy the Author.name and date conditions. However, as noted previously, no term index functionality for multiple-term attributes has been included in the standard SQL3; thus there is no alternative to the serial search for the LIKE operator.

Information retrieval using the standard SQL exact match operators functions well for the context metadata of all media types and moderately well for the semantic content metadata attributes. The problem is that the user must know the DB structure, the attribute names and the DB values in order to form a query. This will not be the case for Internet searchers.

    Text Retrieval by Semantic Content

Researchers and developers of document collections strongly recommend that the semantic information content of the documents be described using such semantic content metadata attributes as a title, (a list of) subject keywords, and a content description - all multiple-term descriptors. This information can be stored with the document as standard SQL attributes using variable-length character data types. For example, an OR-DB for web-site maintenance could be developed to contain Web documents described using Dublin Core metadata elements. If the DB contained the Web page, it could be retrieved using the following SQL statement based on the semantic metadata and the text itself:

Select * from Document
where (Title LIKE '%Edvard Grieg%'
or Text LIKE '%Edvard Grieg%');

In this case, the document was selected by a match in the title, since Edvard Grieg is not mentioned by full name in the text of the article. However, the following SQL3 query will not return this document, though it is relevant to the intent of the query, unless the phrase Norwegian composer has been defined in the Keywords list:

Select * from Document
where (Title LIKE '%Norwegian composer%'
or Keywords LIKE '%Norwegian composer%'
or Text LIKE '%Norwegian composer%');


The most obvious problems using a standard SQL3 system for text search include:

- Lack of utilization of the document structure.
- Dependency on the serial search of the LIKE operator for the multiple-term semantic metadata attributes and text body.
- The potential mismatch between the user query terms and the terms in the document descriptors.

As noted earlier, SQL3 has no concept of a document or words, and therefore there are no search operators for specification of the placement of search terms in a document (adjacent, near, before, after, ...). Since data retrieval in SQL3 is based on an exact match of the query terms and the DB values, no support is provided for similarity evaluation between the query terms and the document content. Obviously, more powerful operators are needed for text retrieval. Ideally, a query language that supports text search and retrieval by the semantic content of text documents must provide at least the following functionality:

Search Criteria     Example
List of terms       Norwegian, composer, Grieg
Term proximity      Edvard near Grieg
Synonym concepts    about "Norwegian composers"
Similar documents   like this document

To help avoid problems with the use of various term forms, a root extraction function must be available for both document indexing and query pre-processing. Using the above examples, some elements in the root-term table could be:

Root      Term Variations
Norway    Norwegian, Norsk, Norge, ...
Compose   composer, composers, composes, ...
Music     tune, tunes, song, songs, ...

Note that there exist numerous electronic dictionaries, thesauri, taxonomies, and ontologies that can be incorporated into a text query processor.

    SQL3/Text

Information Retrieval Systems (IRS) have been under development since the mid-1950s. They provide search and retrieval functions for text document collections based on document structure, concepts of words, and grammar. It is functionality from these systems that has been added by or-DBMS vendors to support management of multimedia data. The resulting ORDBMS/MM (Multimedia) conforms (to some degree) to the Multimedia Information Retrieval Systems, MIRS, envisioned by Lu (1999).


Basic ORDBMS/MM text retrieval functionality includes generation of multiple types of term indexes, as well as a contains operator with sub-operators for the WHERE clause. The contains operator differs from an exact match query in that it gives a probability for a match - a similarity score - between the query search terms and the documents in the database, rather than a true/false result. This operator can be used with multiple search terms and operators that specify relationships between the search terms, for example: the Boolean operators AND, OR, NOT, and location operators such as adjacent, or within the same sentence or paragraph for text documents, as illustrated in the following table.

Term combination          AND, OR, NOT
Term location             ADJACENT, NEAR, WITHIN, ...
Concept                   ABOUT, SIMILAR
Various other operators   FUZZY, LIKE, ...

Assuming that whole Web pages are stored in an OR-DB attribute Document.text, the following examples will retrieve the document, in addition to other documents containing the search terms.

    1) Select * from Document

    where Text CONTAINS ('Edvard' AND 'Grieg');

    2) Select * from Document

    where Text CONTAINS ('Edvard' ADJACENT 'Grieg');

    3) Select * from Document

    where Text ABOUT ('composers');

In processing the above queries, the SQL3/Text processing system utilizes the term indexes generated for the document set, as well as a thesaurus for query 3. Note that a term location index is required for query 2, while query 1 needs a frequency index if the retrieved documents are to be ranked/ordered by the frequency of the search terms within the documents.

    Image Retrieval

Popular knowledge claims that an image is worth 1000 words. Unfortunately, these 1000 words may differ from one individual to another depending on their perspective and/or knowledge of the image context. For example, Figure 6 gives a familiar demonstration that an image can have multiple, quite different interpretations. Thus, even if a 1000-word image description were available, it is not certain that the image could be retrieved by a user with a different description.


The problem is fundamentally one of communication between an information/image seeker/user and the image retrieval system. Since the user may have differing needs and knowledge about the image collection, an image retrieval system must support various forms for query formulation. In general, image retrieval queries can be classified as:

1. Attribute-Based Queries: which use context and/or structural metadata values to retrieve images, for example:
o Find image number 'x' or
o Find images from the 17th of May (the Norwegian national holiday).

2. Textual Queries: which use a term-based specification of the desired images that can be matched to textual image descriptors, for example:
o Find images of Hawaiian sunsets or
o Find images of President Bush delivering a campaign speech.

3. Visual Queries: which give visual characteristics (color, texture) or an image that can be compared to visual descriptors. Examples include:
o Find images where the dominant color is blue and gold or
o Find images like ...

These query types utilize different image descriptors and require different processing functions. Image descriptors can be classified into:

Metadata Descriptors: those that describe the image, as recommended in the numerous metadata standards, such as Dublin Core, CIDOC/CRM and MPEG-7, from the library, museum and motion picture communities respectively. These metadata can again be classified as:

1. Attribute-based context and structural metadata, such as creator, dates, genre, (source) image type, size, file name, ..., or

2. Text-based semantic metadata, such as title/caption, subject/keyword lists, free-text descriptions and/or the text surrounding embedded images, for example as used in an html document. Note that for embedded images, content indexing can be generated using the nearby text.

5. What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

Ans 5: Features of Distributed vs. Centralized Databases, or Differences between Distributed and Centralized Databases

    Centralized Control vs. Decentralized Control

In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who are responsible for their local databases.

    Data Independence

In central databases, it means that the actual organization of data is transparent to the application programmer. Programs are written with a "conceptual" view of the data (the "conceptual schema"), and the programs are unaffected by the physical organization of the data. In distributed databases, another aspect, distribution transparency, is added to the notion of data independence as used in centralized databases. Distribution transparency means that programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.

    Reduction of Redundancy

In centralized databases, redundancy was reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

    Complex Physical Structures and Efficient Access

In centralized databases, complex access structures such as secondary indexes and inter-file chains are used; all these features provide efficient access to data. In distributed databases, efficient access requires retrieving data from different sites. For this, an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer. Problems faced in the design of an optimizer can be classified into two categories:
a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites.


b) Local optimization consists of deciding how to perform the local database accesses at each site.
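A toy example of the global-optimization decision: given two fragments stored at different sites and per-row transmission costs, the planner picks the site to which shipping data for a join is cheapest. The sites, sizes, and costs below are invented for the illustration.

    # Decide where to execute a join of two remotely stored fragments
    # by minimizing the data that must be shipped between sites.

    fragments = {"EMP": ("site_A", 10_000), "DEPT": ("site_B", 500)}
    transfer_cost = {("site_A", "site_B"): 0.02,   # cost per row shipped
                     ("site_B", "site_A"): 0.02}

    def cheapest_join_site(f1, f2):
        (s1, n1), (s2, n2) = fragments[f1], fragments[f2]
        # Option 1: ship f2 to s1; option 2: ship f1 to s2.
        options = {s1: n2 * transfer_cost[(s2, s1)],
                   s2: n1 * transfer_cost[(s1, s2)]}
        site = min(options, key=options.get)
        return site, options[site]

    print(cheapest_join_site("EMP", "DEPT"))
    # ('site_A', 10.0): ship the small DEPT fragment, not the large EMP one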

    Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two main dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution therefore requires synchronization amongst the transactions, which is much harder to achieve in a distributed system.

    Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problem, as well as two new aspects: (a) security (protection) problems intrinsic to communication networks now also affect the database system, and (b) sites with a high degree of "site autonomy" may feel more protected, because they can enforce their own protection instead of depending on a central database administrator.

    Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory, a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice, the main delays (and costs) will be imposed by the communications network.

Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are the variable processing capabilities and loadings of different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location.
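The cost-versus-currency trade-off in the last sentence can be sketched as a tiny routing rule: each replica carries a communication cost and a staleness, and the router selects the cheapest replica whose staleness the query can tolerate. All figures are illustrative.

    # Pick a replica by cost, subject to the query's freshness requirement.
    replicas = [
        {"site": "local",  "cost": 1,  "staleness": 3600},  # cheap, out of date
        {"site": "remote", "cost": 50, "staleness": 0},     # costly, current
    ]

    def choose_replica(max_staleness):
        ok = [r for r in replicas if r["staleness"] <= max_staleness]
        return min(ok, key=lambda r: r["cost"]) if ok else None

    print(choose_replica(max_staleness=7200)["site"])  # local
    print(choose_replica(max_staleness=60)["site"])    # remote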

    Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information such as fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), all of which are more detailed than in centralized databases.
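As a rough illustration of how much richer such a catalog is than a centralized one, the following hypothetical catalog entry for a single global relation gathers the kinds of information just listed; the structure and field names are invented.

    # One hypothetical catalog entry for a global relation "EMP".
    catalog_entry = {
        "relation": "EMP",
        "fragmentation": {
            "type": "horizontal",
            "fragments": {"EMP1": "dept = 'sales'", "EMP2": "dept <> 'sales'"},
        },
        "allocation": {"EMP1": ["site_A"], "EMP2": ["site_B", "site_C"]},
        "local_names": {"site_A": "emp_sales", "site_B": "emp_rest"},
        "access_methods": {"EMP1": ["btree(emp_id)"]},
        "statistics": {"EMP1": {"rows": 1200}, "EMP2": {"rows": 8800}},
        "constraints": ["PRIMARY KEY (emp_id)"],
    }

    print(catalog_entry["allocation"])  # where each fragment (and replica) lives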

    Relative Advantages of Distributed Databases over Centralized Databases

    Organizational and Economic Reasons


Many organizations are decentralized, and a distributed database approach fits the structure of the organization more naturally. The organizational and economic motivations are amongst the main reasons for the development of distributed databases. For organizations that already have several databases and feel the necessity of global applications, distributed databases are the natural choice.

    Incremental Growth

In a distributed environment, expansion of the system in terms of adding more data, increasing database size, or adding more processors is much easier.

    Reduced Communication Overhead

Many applications are local, and these applications do not incur any communication overhead. Therefore, the maximization of the locality of applications is one of the primary objectives in distributed database design.

Performance Considerations

Data localization reduces contention for CPU and I/O services and simultaneously reduces the access delays involved in wide area networks. Local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Moreover, inter-query and intra-query parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of subqueries that execute in parallel. This contributes to improved performance.

    Reliability and Availability

Reliability is defined as the probability that a system is running (not down) at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate; only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site.

Management of Distributed Data with Different Levels of Transparency

In a distributed database, the following types of transparency are possible:

    Distribution or Network Transparency

This refers to freedom for the user from the operational details of the network. It may be divided into location and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of the data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.

    Replication Transparency


Copies of the data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of copies.

    Fragmentation Transparency

The two main types of fragmentation are horizontal fragmentation, which distributes a relation into sets of tuples (rows), and vertical fragmentation, which distributes a relation into sub-relations where each sub-relation is defined by a subset of the columns of the original relation. A global query issued by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments.
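The two fragmentation styles can be sketched as follows, assuming a relation held as a list of row dictionaries; the data and predicate are invented for the example.

    # Horizontal fragmentation selects subsets of rows by predicate;
    # vertical fragmentation projects subsets of columns (keeping the
    # key so the fragments can be rejoined).

    emp = [
        {"emp_id": 1, "name": "Ann", "dept": "sales", "salary": 50_000},
        {"emp_id": 2, "name": "Bob", "dept": "hr",    "salary": 45_000},
    ]

    # Horizontal fragments: disjoint sets of tuples.
    emp_sales = [r for r in emp if r["dept"] == "sales"]
    emp_other = [r for r in emp if r["dept"] != "sales"]

    # Vertical fragments: sub-relations defined by column subsets.
    emp_ids = [{k: r[k] for k in ("emp_id", "name")} for r in emp]
    emp_pay = [{k: r[k] for k in ("emp_id", "salary")} for r in emp]
    # Joining emp_ids and emp_pay on emp_id reconstructs the original relation.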

6. What are Commit Protocols? Explain how the Two-Phase Commit Protocol responds to the following types of failures:
i) Failure of a Participating Site
ii) Failure of the Coordinator

Ans 6: Commit Protocols:

In distributed database and transaction systems, a distributed commit protocol is required to ensure that the effects of a distributed transaction are atomic, that is, either all the effects of the transaction persist or none persist, whether or not failures occur. Several commit protocols have been proposed in the literature. These are variations of what has become a standard, known as the two-phase commit (2PC) protocol.

    Two-phase commit protocol

In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction (it is a specialized type of consensus protocol). The protocol achieves its goal even in many cases of temporary system failure (involving process, network node, or communication failures, among others), and is thus widely utilized. However, it is not resilient to all possible failure configurations, and in rare cases user (e.g., a system administrator) intervention is needed to remedy an outcome.


To accommodate recovery from failure (automatic in most cases), the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures. Though usually intended to be used infrequently, recovery procedures comprise a substantial portion of the protocol, due to the many possible failure scenarios that must be considered and supported by the protocol.

(i) Failure of a Participating Site:

2PC proceeds in two phases. In the commit-request phase (or voting phase), a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction, and to vote either "Yes": commit (if the participant's local portion of the transaction has executed properly) or "No": abort (if a problem has been detected with the local portion). In the commit phase, based on the voting of the cohorts, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the cohorts. The cohorts then follow with the needed actions (commit or abort) on their local transactional resources (also called recoverable resources, e.g., database data) and their respective portions of the transaction's other output (if applicable).

If a participating site fails before voting, the coordinator's timeout expires while waiting for its vote; the missing vote is treated as a "No" and the transaction is aborted everywhere. If a participant fails after voting "Yes", then on recovery it finds the prepared transaction in its log, asks the coordinator for the outcome, and commits or rolls back accordingly.

(ii) Failure of the Coordinator:

If any cohort votes "No" during the commit-request phase, or the coordinator's timeout expires:
(1) The coordinator sends a rollback message to all the cohorts.
(2) Each cohort undoes the transaction using its undo log, and releases the resources and locks held during the transaction.
(3) Each cohort sends an acknowledgement to the coordinator.
(4) The coordinator undoes the transaction when all acknowledgements have been received.

If the coordinator itself fails after participants have voted "Yes" but before they learn the outcome, those participants are blocked: they can neither commit nor abort unilaterally, and must wait until the coordinator recovers and, using its log, resends the decision. This blocking behaviour is the main weakness of 2PC.
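The control flow described in this answer can be condensed into a short, self-contained sketch. Failure of a participating site is modelled as an exception during the voting phase, which the coordinator converts into a global abort; the class names and failure model are assumptions made for the illustration, not a production 2PC implementation.

    # Minimal 2PC control-flow sketch: voting phase, then decision phase.
    class Participant:
        def __init__(self, name, fail_on_prepare=False):
            self.name, self.fail_on_prepare = name, fail_on_prepare
            self.state = "init"

        def prepare(self):
            if self.fail_on_prepare:
                raise ConnectionError(f"{self.name} unreachable")  # site failure
            self.state = "prepared"  # a real cohort force-writes a log record here
            return "yes"

        def commit(self):
            self.state = "committed"

        def abort(self):
            self.state = "aborted"

    def two_phase_commit(participants):
        # Phase 1: voting. A crashed site or a "no" vote aborts everywhere.
        try:
            votes = [p.prepare() for p in participants]
        except ConnectionError:
            votes = ["no"]
        # Phase 2: decision. Commit only if all voted "yes".
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        for p in participants:
            (p.commit if decision == "commit" else p.abort)()
        return decision

    print(two_phase_commit([Participant("A"), Participant("B")]))        # commit
    print(two_phase_commit([Participant("A"), Participant("B", True)]))  # abort

A coordinator failure is harder to simulate in a few lines, since it leaves prepared participants blocked until the coordinator's log-based recovery, as noted above.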