July 2011

Master of Computer Application (MCA) – Semester 4

MC0077 – Advanced Database Systems

(Book ID: B0882)

Assignment Set – 1

Q: 1. Describe the following:

o Dimensional Model

o Object Database Models

o Post-Relational Database Models

Ans:

Dimensional model

The dimensional model is a specialized adaptation of the relational model used to represent data in data warehouses in a way that data can be easily summarized using OLAP queries. In the dimensional model, a database consists of a single large table of facts that are described using dimensions and measures. A dimension provides the context of a fact (such as who participated, when and where it happened, and its type) and is used in queries to group related facts together. Dimensions tend to be discrete and are often hierarchical; for example, the location might include the building, state, and country. A measure is a quantity describing the fact, such as revenue. It’s important that measures can be meaningfully aggregated – for example, the revenue from different locations can be added together. In an OLAP query, dimensions are chosen and the facts are grouped and added together to create a summary.

The dimensional model is often implemented on top of the relational model using a star schema, consisting of one table containing the facts and surrounding tables containing the dimensions. Particularly complicated dimensions might be represented using multiple tables, resulting in a snowflake schema. A data warehouse can contain multiple star schemas that share dimension tables, allowing them to be used together. Coming up with a standard set of dimensions is an important part of dimensional modeling.
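As a minimal sketch of how a star schema might be declared in SQL (the table and column names here are illustrative assumptions, not taken from the text above), a central fact table carries the measure and foreign keys into the surrounding dimension tables:

-- Hypothetical dimension tables (names and columns are invented for illustration).
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day_date DATE, month_no INTEGER, year_no INTEGER);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, building VARCHAR(50), state VARCHAR(50), country VARCHAR(50));
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));

-- Fact table: one row per sale; revenue is the additive measure.
CREATE TABLE fact_sales (
  date_id     INTEGER REFERENCES dim_date(date_id),
  location_id INTEGER REFERENCES dim_location(location_id),
  product_id  INTEGER REFERENCES dim_product(product_id),
  revenue     NUMERIC(12,2)
);

-- A typical OLAP-style query: choose dimensions, group the facts, aggregate the measure.
SELECT l.country, d.year_no, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_location l ON f.location_id = l.location_id
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY l.country, d.year_no;

A snowflake schema would simply normalize a dimension further, for example splitting dim_location into separate building/state/country tables.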

Object Database Models

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.

A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.

Object databases suffered because of a lack of standardization: although standards were defined by ODMG, they were never implemented well enough to ensure interoperability between products. Nevertheless, object databases have been used successfully in many applications: usually specialized applications such as engineering databases or molecular biology databases rather than mainstream commercial data processing. However, object database ideas were picked up by the relational vendors and influenced extensions made to these products and indeed to the SQL language.

Post-Relational Database Models

Several products have been identified as post-relational because the data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations. Products using a post-relational data model typically employ a model that actually pre-dates the relational model. These might be identified as a directed graph with trees on the nodes. Post-relational databases could be considered a subset of object databases, as there is no need for object-relational mapping when using a post-relational data model. In spite of many attacks on this class of data models, with designations of being hierarchical or legacy, the post-relational database industry continues to grow as a multi-billion dollar industry, even if the growth stays below the relational database radar. Examples of models that could be classified as post-relational are PICK, aka MultiValue, and MUMPS, aka M.

Q: 2. Explain the concept of a Query. How does a Query Optimizer work?

Ans:

The aim of query processing is to find information in one or more databases and deliver it to the user quickly and efficiently. Traditional techniques work well for databases with standard, single-site relational structures, but databases containing more complex and diverse types of data demand new query processing and optimization techniques. Most real-world data is not well structured. Today’s databases typically contain much non-structured data such as text, images, video, and audio, often distributed across computer networks. In this complex milieu (typified by the World Wide Web), efficient and accurate query processing becomes quite challenging. Principles of Database Query Processing for Advanced Applications teaches the basic concepts and techniques of query processing and optimization for a variety of data forms and database systems, whether structured or unstructured.

Query Optimizer

The Query Optimizer is the component of a database management system that attempts to determine the most efficient way to execute a query. The optimizer considers the possible query plans (discussed below) for a given input query, and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each possible query plan, and choose the plan with the least cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O operations required, the CPU requirements, and other factors.
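As an illustration (assuming an Oracle-style system and the EMP table used later in this assignment; the predicate is invented), the plan the optimizer chooses, together with its estimated rows and cost, can be inspected before the query is run:

-- Ask the optimizer to compile a plan without executing the query.
EXPLAIN PLAN FOR
  SELECT * FROM emp WHERE job = 'CLERK';

-- Display the chosen plan: access path (full scan vs. index), estimated cardinality and cost.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

A cost-based optimizer would, for example, prefer an index on job only if the estimated number of matching rows is a small fraction of the table.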


Q: 3. Explain the following with respect to Heuristics of Query Optimizations: A) Equivalence of Expressions B) Selection Operation C) Projection Operation D) Natural Join Operation

Ans:

Heuristics of Query Optimizations

Equivalence of Expressions

The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We’ll use the following relations as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)

Selection Operation

1. Consider the query to find the assets and branch-names of all banks who have depositors living in Port Chester. In relational algebra, this is

Π bname, assets(σ ccity=”Port Chester”(customer ⋈ deposit ⋈ branch))

- This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are only interested in a few tuples.
- We also are only interested in two attributes of this relation.
- We can see that we only want tuples for which ccity = “Port Chester”.
- Thus we can rewrite our query as:

Π bname, assets(σ ccity=”Port Chester”(customer) ⋈ deposit ⋈ branch)

- This should considerably reduce the size of the intermediate relation.

Projection Operation

1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

Π bname, assets((σ ccity=”Port Chester”(customer) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

σ ccity=”Port Chester”(customer) ⋈ deposit

we obtain a relation whose scheme is (cname, ccity, bname, account#, balance)


3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
- appear in the result of the query, or
- are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.
5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

Π bname, assets((Π bname(σ ccity=”Port Chester”(customer) ⋈ deposit)) ⋈ branch)

Note that there is no advantage in doing an early projection on a relation before it is needed for some other operation:
- We would access every block of the relation to remove attributes.
- Then we access every block of the reduced-size relation when it is actually needed.
- We do more work in total, rather than less!

Natural Join Operation

Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression

Π bname, assets(σ ccity=”Port Chester”(customer) ⋈ deposit ⋈ branch)

we see that we can compute deposit ⋈ branch first and then join the result with the first part. However, deposit ⋈ branch is likely to be a large relation as it contains one tuple for every account. The other part, σ ccity=”Port Chester”(customer), is probably a small relation (comparatively).

So, if we compute σ ccity=”Port Chester”(customer) ⋈ deposit first, we get a reasonably small relation.

It has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than deposit ⋈ branch. Natural join is commutative:

r1 ⋈ r2 = r2 ⋈ r1

Thus we could rewrite our relational algebra expression as:

Π bname, assets((σ ccity=”Port Chester”(customer) ⋈ branch) ⋈ deposit)

But there are no common attributes between customer and branch, so this is a Cartesian product. Lots of tuples! If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform this into the more efficient expression we have derived earlier (join with deposit first, then with branch).
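The same optimization can be sketched in SQL (assuming tables customer, deposit and branch with columns following the schemes above). Written either way, a good optimizer will push the selection on ccity down and join the small set of Port Chester customers with deposit before touching branch:

SELECT b.bname, b.assets
FROM customer c
JOIN deposit d ON d.cname = c.cname      -- join the selected (small) customer set with deposit first
JOIN branch b ON b.bname = d.bname       -- then join the result with branch
WHERE c.ccity = 'Port Chester';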


Q: 4. There are a number of historical, organizational, and technological reasons that explain the lack of an all-encompassing data management system. Discuss a few of them with appropriate examples.

Ans:

Most current data management systems, DMS, have been built on the assumption that the data collection, or database, to be administered consists of a single media type – structured tables of "fact" data or unstructured strings of bits representing such media objects as text documents, images, or video. The result is that most DMS’ store and index a specific type of media data and provide a query (data access) language that is specialized for efficient access to and retrieval of this data type.

A further assumption that has frequently been made is that the information requirements of the system users are known and can be used for structuring the data collection and tuning the data management system. It has also been assumed that the users would only infrequently require information/ data from some other type of data management system.

These assumptions have been criticized since the early 1980s by researchers who have pointed out that almost from the point of creation, a database would not (nor could) contain all of the data required by the user community. A number of historical, organizational, and technological reasons explain the lack of an all-encompassing data management system. Among these are:

· The sensible advice – to build small systems with the plan to extend their scope in later implementation phases – allows a core system to be implemented relatively quickly, but has led to a proliferation of relatively small systems.

· Department autonomy has led to construction of department-specific rather than organization-wide systems, again leading to many small, overlapping, and often incompatible systems within an organization.

· The continual evolution of the organization and of its interactions both internally and with its external environment prohibits complete understanding of future information requirements.

· Parallel development of data management systems for particular applications has led to different and incompatible systems for management of tabular/administrative data, text/document data, historical/statistical data, spatial/geographic data, and streamed/audio and visual data.

The result is that only a portion of an organization’s data is administered by any one data management system and most organizations have a multitude of special purpose databases, managed by different, and often incompatible, data management system types. The growing need to retrieve data from multiple databases within an organization, as well as the rapid dissemination of data through the Internet, has given rise to the requirement of providing integrated access to both internal and external data of multiple types.

A major challenge and critical practical and research problem for the information, computer, and communication technology communities is to develop data management systems that can provide efficient access to the data stored in multiple private and public databases. Problems to be resolved include:


1. Interoperability among systems,
2. Incorporation of legacy systems, and
3. Integration of management techniques for structured and unstructured data.

Each of the above problems entails an integration of concepts, methods, techniques and tools from separate research and development communities that have existed in parallel but independently and have had rather minimal interaction. One consequence of this is that overlapping and conflicting terminology exists between these communities. With this definition, no limitations are given as to the type of:

· Data in the collection,
· Model used to structure the collection, or
· Architecture and geographic location of the database.

The focus of this text is on on-line – electronic and web-accessible – databases containing multiple media data, thus restricting our interest/focus to multimedia databases stored on one or more computers (DB servers) and accessible from the Internet. Electronic databases are important since they contain data recording the products and services, as well as the economic history and current status, of the owner organization. They are also a source of information for the organization’s employees and customers/users. However, databases cannot be used effectively unless there exist efficient and secure data management systems (DMS) for the data in the databases.

Q: 5. Describe the Structural Semantic Data Model (SSM) with relevant examples.

Ans:

SSM Concepts

The current version of SSM belongs to the class of Semantic Data Model types extended with concepts for specification of user defined data types and functions, UDT and UDF. It supports the modeling concepts defined in Table 4.4 and compared in Table 4. Figure 4.2 shows the concepts and graphic syntax of SSM, which include:

Data Modeling Concepts


1. Three types of entity specifications: base (root), subclass, and weak,
2. Four types of inter-entity relationships: n-ary associative, and 3 types of classification hierarchies,
3. Four attribute types: atomic, multi-valued, composite, and derived,
4. Domain type specifications in the graphic model, including: standard data types, binary large objects (blob, text, image, …), user-defined types (UDT) and functions (UDF),
5. Cardinality specifications for entity to relationship-type connections and for multi-valued attribute types, and
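A rough SQL:1999-style sketch of a few of these constructs (the type, table and attribute names are invented for illustration, and the exact syntax varies between products):

-- Composite attribute modeled as a user-defined type (UDT).
CREATE TYPE address_t AS (street VARCHAR(60), city VARCHAR(40), country VARCHAR(40));

-- An entity with atomic, composite, multi-valued and large-object attributes.
CREATE TABLE person (
  id      INTEGER PRIMARY KEY,       -- atomic attribute
  name    VARCHAR(80),               -- atomic attribute
  address address_t,                 -- composite attribute (UDT)
  phones  VARCHAR(20) ARRAY[5],      -- multi-valued attribute, cardinality <= 5
  photo   BLOB                       -- binary large object
);

A derived attribute such as age would typically be implemented with a user-defined function (UDF) over a stored birth date rather than stored directly.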


Q: 6. Describe the following with respect to Fuzzy querying to relational databases:

o Proposed Model

o Meta knowledge

o Implementation

Ans:

Fuzzy Querying to Relational Databases

The proposed model

The easiest way of introducing fuzziness in the database model is to use classical relational databases and formulate a front end to it that shall allow fuzzy querying to the database. A limitation imposed on the system is that because we are not extending the database model nor are we defining a new model in any way, the underlying database model is crisp and hence the fuzziness can only be incorporated in the query.

To incorporate fuzziness we introduce fuzzy sets / linguistic terms on the attribute domains / linguistic variables; e.g. on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD. These are defined as shown in the following figure:

Fig. Age

For this we take the example of a student database which has a table STUDENTS with the following attributes:


Fig. A snapshot of the data existing in the database

Meta Knowledge

At the level of meta knowledge we need to add only a single table, LABELS with the following structure:

Fig. Meta Knowledge

This table is used to store the information of all the fuzzy sets defined on all the attribute domains. A description of each column in this table is as follows:

· Label: This is the primary key of this table and stores the linguistic term associated with the fuzzy set.

· Column_Name: Stores the linguistic variable associated with the given linguistic term.

· Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set.
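A possible realization of this meta knowledge (the concrete membership ranges are invented for illustration, assuming Alpha..Delta describe a trapezoidal membership function over AGE):

CREATE TABLE labels (
  label       VARCHAR(20) PRIMARY KEY,   -- linguistic term, e.g. YOUNG
  column_name VARCHAR(30),               -- linguistic variable, e.g. AGE
  alpha NUMBER, beta NUMBER, gamma NUMBER, delta NUMBER   -- range of the fuzzy set
);

INSERT INTO labels VALUES ('YOUNG',  'AGE',  0,  0, 25, 35);
INSERT INTO labels VALUES ('MIDDLE', 'AGE', 25, 35, 45, 55);
INSERT INTO labels VALUES ('OLD',    'AGE', 45, 55, 99, 99);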

Implementation

The main issue in the implementation of this system is the parsing of the input fuzzy query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the INSERT query will not change and need not be parsed; it can therefore be presented to the database as it is. During parsing, the query is divided into the following parts:

1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.

2. Result Attributes: The attributes that are to be displayed; used only in the case of a SELECT query.


3. Source Tables: The tables on which the query is to be applied.

4. Conditions: The conditions that have to be specified before the operation is performed. Each condition is further sub-divided into Query Attributes (i.e. the attributes on which the query is to be applied) and the linguistic term. If the condition is not fuzzy, i.e. it does not contain a linguistic term, then it need not be subdivided.
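As a sketch of what the front end might do with such a query (continuing the hypothetical LABELS rows above; the STUDENTS column names are also assumptions, since the snapshot figure is not reproduced here), a fuzzy condition like AGE = YOUNG is rewritten into a crisp range condition before being submitted:

-- User's fuzzy query (not valid against the crisp database):
--   SELECT name FROM students WHERE age = YOUNG;

-- Query actually submitted after parsing: the support [alpha, delta] of the
-- fuzzy set is looked up in LABELS for label 'YOUNG' on column 'AGE'.
SELECT s.name
FROM students s, labels l
WHERE l.label = 'YOUNG' AND l.column_name = 'AGE'
  AND s.age BETWEEN l.alpha AND l.delta;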


Master of Computer Application (MCA)

MC0077 – Advanced Database Systems

(Book ID: B0882)

Assignment Set – 2

Q: 1. Describe the following with suitable examples:

o Cost Estimation

o Measuring Index Selectivity

Ans:

Cost Estimation

One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the selection factor of predicates in the query. Traditionally, database systems estimate selectivity through fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for estimation of the selectivity of individual predicates. However, many queries have conjunctions of predicates, such as select count(*) from R where R.make='Honda' and R.model='Accord'. Query predicates are often highly correlated (for example, model='Accord' implies make='Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are among the main reasons why query optimizers pick poor query plans. This is one reason why a DBA should regularly update the database statistics, especially after major data loads/unloads.
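A small worked example (the percentages are hypothetical): if the column statistics say that 10% of the rows satisfy make='Honda' and 5% satisfy model='Accord', an optimizer that assumes independence estimates the selectivity of the conjunction as 0.10 × 0.05 = 0.005, i.e. 0.5% of the rows. Because every Accord is a Honda, the true selectivity is 0.05 (5%), ten times larger, so the estimated cardinality feeding the rest of the plan is off by a factor of ten and a poor join order or access path may be chosen.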

Measuring Index Selectivity

Index Selectivity

B*TREE Indexes improve the performance of queries that select a small percentage of rows from a table. As a general guideline, we should create indexes on tables that are often queried for less than 15% of the table’s rows. This value may be higher in situations where all data can be retrieved from an index, or where the indexed columns can be used for joining to other tables.

The ratio of the number of distinct values in the indexed column / columns to the number of records in the table represents the selectivity of an index. The ideal selectivity is 1. Such selectivity can be reached only by unique indexes on NOT NULL columns.

Example with good Selectivity

If a table has 100,000 records and one of its indexed columns has 88,000 distinct values, then the selectivity of this index is 88,000 / 100,000 = 0.88.


Oracle implicitly creates indexes on the columns of all unique and primary keys that you define with integrity constraints. These indexes are the most selective and the most effective in optimizing performance. The selectivity of an index is the percentage of rows in a table having the same value for the indexed column. An index’s selectivity is good if few rows have the same value.

Example with Bad Selectivity

If an index on a table of 100,000 records has only 500 distinct values, then the index’s selectivity is 500 / 100,000 = 0.005, and in this case a query which uses such an index as a restriction will return, on average, 100,000 / 500 = 200 records for each distinct value. It is evident that a full table scan is more efficient than using such an index, since much more I/O is needed to repeatedly scan the index and the table.

Manually measure index selectivity

The ratio of the number of distinct values to the total number of rows is the selectivity of the columns. This method is useful to estimate the selectivity of an index before creating it.

select count (distinct job) "Distinct Values" from emp;

select count(*) "Total Number Rows" from emp;

Selectivity = Distinct Values / Total Number Rows = 5 / 14 = 0.35
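The two counts can also be combined into a single statement (same EMP data as above; the result is the same ratio):

select count(distinct job) / count(*) "Selectivity" from emp;   -- 5 / 14 = 0.35 (approx.)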

Automatically measuring index selectivity

We can determine the selectivity of an index by dividing the number of distinct indexed values by the number of rows in the table.

create index idx_emp_job on emp(job);
analyze table emp compute statistics;

select distinct_keys from user_indexes where table_name = 'EMP' and index_name = 'IDX_EMP_JOB';

select num_rows from user_tables where table_name = 'EMP';


Selectivity = DISTINCT_KEYS / NUM_ROWS = 0.35

Q: 2. Describe the following:

o Statements and Transactions in a Distributed Database

o Heterogeneous Distributed Database Systems

Ans:

Statements and Transactions in a Distributed Database

The following sections introduce the terminology used when discussing statements and transactions in a distributed database environment.

Remote and Distributed Statements

A Remote Query is a query that selects information from one or more remote tables, all of which reside at the same remote node. A Remote Update is an update that modifies data in one or more tables, all of which are located at the same remote node. Note: A remote update may include a sub-query that retrieves data from one or more remote nodes, but because the update is performed at only a single remote node, the statement is classified as a remote update. A Distributed Query retrieves information from two or more nodes. A distributed update modifies data on two or more nodes. A distributed update is possible using a program unit, such as a procedure or a trigger, that includes two or more remote updates that access data on different nodes. Statements in the program unit are sent to the remote nodes, and the execution of the program succeeds or fails as a unit.
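An Oracle-style sketch of the distinction (the database link names sales_ny and sales_ldn and the ORDERS table are hypothetical):

-- Remote query: every referenced table resides at the single remote node sales_ny.
SELECT * FROM orders@sales_ny WHERE amount > 1000;

-- Distributed query: data from two different nodes is combined in one statement.
SELECT n.cust_id, n.amount, l.amount
FROM orders@sales_ny n, orders@sales_ldn l
WHERE n.cust_id = l.cust_id;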

Remote and Distributed Transactions

A Remote Transaction is a transaction that contains one or more remote statements, all of which reference the same remote node. A Distributed Transaction is any transaction that includes one or more statements that, individually or as a group, update data on two or more distinct nodes of a distributed database. If all statements of a transaction reference only a single remote node, the transaction is remote, not distributed.

Heterogeneous Distributed Database

In a distributed database, any application directly connected to a database can issue a SQL statement that accesses remote data in the following ways (for the sake of explanation we have taken Oracle as a base):

· Data in another database is available, no matter what version. Databases at other physical locations are connected through a network and maintain communication.

· Data in a non-compatible database (such as an IBM DB2 database) is available, assuming that the non-compatible database is supported by the application’s gateway architecture (say SQL*Connect in the case of Oracle). One can connect the Oracle and non-Oracle databases with a network and use SQL*Net to maintain communication.


The figure illustrates a heterogeneous distributed database system encompassing different versions of Oracle and non-Oracle databases.

Heterogeneous Distributed Database Systems

When connections from an Oracle node to a remote node (Oracle or non-Oracle) initially are established, the connecting Oracle node records the capabilities of each remote system and the associated gateways. SQL statement execution proceeds. However, in heterogeneous distributed systems, SQL statements issued from an Oracle database to a non-Oracle remote database server are limited by the capabilities of the remote database server and associated gateway. For example, if a remote or distributed query includes an Oracle extended SQL function (for example, an outer join), the function may have to be performed by the local Oracle database. Extended SQL functions in remote updates (for example, an outer join in a sub-query) are not supported by all gateways.

Q: 3. Explain: A) Data Warehouse Architecture B) Data Storage Methods

Ans:

A data warehouse is the main repository of the organization’s historical data, its corporate memory. For example, an organization would use the information that’s stored in its data warehouse to find out what day of the week they sold the most widgets in May 1992, or how employee sick leave the week before the winter break differed between California and New York from 2001-2005. In other words, the data warehouse contains the raw material for management’s decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis (such as data mining) on the information without slowing down the operational systems. The term Data Warehouse Architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include Decision Support Systems (DSS), Management Information Systems (MIS), and others.

The Data Warehouse Architecture describes the overall system from various perspectives such as data, process, and infrastructure needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall system. The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related. The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity. Complex information is broken down into its simplest structures (a table) where all of the individual atomic level elements relate to each other and satisfy the normalization rules. Codd defines five increasingly stringent rules of normalization, and typically OLTP systems achieve third normal form. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables, and this results in very fast insert/update performance because only a little bit of data is affected in each relational transaction.

OLTP databases are efficient because they are typically only dealing with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance by rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many users a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data using incomprehensible coding schemes. All these factors, while improving performance, complicate use by untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from independently designed packages and legacy systems.

Designing the data warehouse data architecture synergy is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level because this provides for the most useful and flexible basis for use in reporting and information analysis. However, because of different focus on specific requirements, there can be alternative methods for design and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. Whilst the dimensional approach is very useful in data mart design, it can result in a rat’s nest of long-term data integration and abstraction complications when used in a data warehouse.

In the "dimensional" approach, transaction data is partitioned into either a measured "facts", which are generally numeric data that captures specific values or "dimensions" which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and salesperson. The main advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add new information into the database – the primary disadvantage of this approach is that because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around the business transactions, such as customer enrollment, sales and trades.

Q: 4. Discuss how the process of retrieving Text Data differs from the process of retrieving an Image.

Ans:

Text-based Information Retrieval Systems

As indicated in Table 3.1, text-based information retrieval systems, or more correctly text document retrieval systems, have as long a development history as systems for management of structured, administrative data. The basic structure for digital documents, illustrated in Figure 3.5, has remained relatively constant – a ‘header’ of descriptive attributes, currently called metadata, is prefixed to the text of each document. The resulting document collection is stored in a Document DB. Note that in Figure 3.5 the attribute body can be replaced by a pointer (or link) to a storage location separate from the metadata.


Basic digital document structure

In comparison to the structured/regular data used by administrative applications, documents are unstructured, consisting of a series of characters that represent words, sentences and paragraphs of unequal length. This requires different techniques for indexing, search and retrieval than that used for structured administrative data. Rather than indexing attribute values separately, a document retrieval system develops a term index similar to the ones found in the back of books, i.e. a list of the terms found in the documents with lists of where each term is located in the document collection. The frequency of term occurrence within a document is assumed to indicate the semantic content of the document.

Search for relevant documents is commonly based on the semantic content of the document, rather than on the descriptive attribute values connected to it. For example, if we assume that the data stored in the attribute Document.Body in Figure 3.3a is the actual text of the document, then the retrieval algorithm, when processing Q2 in Figure 3.3c, searches the term index and selects those documents that contain one or more of the query terms database, management, sql3 and msql. It then sorts the resulting document list according to the frequency of these terms in each document.
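A minimal relational sketch of such a term index (table and column names are invented; real text retrieval systems use specialized inverted-file structures rather than plain tables):

-- One row per occurrence of a term in a document.
CREATE TABLE term_index (
  term     VARCHAR(50),
  doc_id   INTEGER,
  position INTEGER      -- offset of the occurrence within the document
);

-- Rank documents containing any of the query terms by term frequency.
SELECT doc_id, COUNT(*) AS freq
FROM term_index
WHERE term IN ('database', 'management', 'sql3', 'msql')
GROUP BY doc_id
ORDER BY freq DESC;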

There are two principal problems in using term matching for document retrieval:
1. Terms can be ambiguous, having meaning dependent on context, and
2. There is frequently a mismatch between the terms used by the searcher in his/her query and the terms used by the authors in the document collections.
Techniques and tools developed to address these problems and thus improve retrieval quality include:
· Indexing techniques based on word stems,
· Dictionaries, thesauri, and grammatical rules as tools for interpretation of both search terms and documents,
· Similarity and clustering algorithms,
· ‘Mark-up’ languages (adaptations of the editor’s tag set) to indicate areas of the text, such as titles, chapters, … and its layout, that can be used to enhance relevance evaluations, and finally
· Metadata standards for describing the semantic content and context of a document.
None of these techniques or tools is supported by the standard for relational database management systems. However, since there is a need to store text data with regular administrative data, various text management techniques are being added to OR-DBMS systems.


Recently, Baeza-Yates & Ribeiro-Neto, (1999) estimated that 90% of computerized data is in the form of text documents. This data is accessible using the retrieval technology developed for off-line document/information retrieval systems and adapted for the newer Digital Libraries and Web search engines. Due to the expanding quantity of text available on the internet, research and development efforts are (still) focused on improving the indexing and retrieval (similarity) algorithms used.

Image Retrieval Systems

Due to the large storage requirements for images, computer generation of image material, in the form of charts, illustrations and maps, predated the creation of image databases and the need for ad-hoc image retrieval. Development of scanning devices, particularly for medical applications, and digital cameras, as well as the rapidly increasing capacity of computer storage, has led to the creation of large collections of digital image material. Today, many organizations, such as news media, museums and art galleries, as well as police and immigration authorities, maintain large collections of digital images. For example, the New York Public Library has made their digital gallery, with over 480,000 scanned images, available to the Internet public.

Maintaining a large image collection leads necessarily to a need for an effective system for image indexing and retrieval. Image data collections have a structure similar to that used for text document collections, i.e. each digital image is associated with descriptive metadata, an example of which is illustrated in Figure 3.6. While management of the metadata is the same for text and image collections, the techniques needed for direct image comparison are quite different from those used for text documents. Therefore, current image retrieval systems use two quite different approaches for image retrieval (not necessarily within the same system):

Digital image document structure

1. Retrieval based on metadata, generated manually, that describe the content, meaning/interpretation and/or context for each image, and/or

2. Retrieval based on automatically selected, low-level features, such as color and texture distribution and identifiable shapes. This approach is frequently called CBIR, or content-based image retrieval.


Most of the metadata attributes used for digitized images, such as those listed in Figure 3.6, can be stored as either regular structured attributes or text items. Once collected, metadata can be used to retrieve images using either exact match on attribute values or text-search on text descriptive fields. Most image retrieval systems utilize this approach. For example, a Google search for images about Humpback whales listed over 15,000 links to images based on the text – captions, titles, file names – accompanying the images (July 26th 2006).

As noted earlier, images are strings of pixels with no other explicit relationship to the following pixel(s) than their serial position. Unlike text documents, there is no image vocabulary that can be used to index the semantic content. Instead, image pixel analysis routines extract dominant low-level features, such as the distribution of the colors and texture(s) used, and location(s) of identifiable shapes. This data is used to generate a signature for each image that can be indexed and used to match a similar signature generated for a visual query, i.e. a query based on an image example. Unfortunately, using low-level features does not necessarily give a good ’semantic’ result for image retrieval.

Q: 5. What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

Ans:

Differences in Distributed & Centralized Databases

1 Centralized Control vs. Decentralized Control

In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole data, along with "local database administrators", who are responsible for their local databases.

2 Data Independence

In central databases it means that the actual organization of data is transparent to the application programmer. The programs are written with a "conceptual" view of the data (called the "conceptual schema"), and the programs are unaffected by the physical organization of data. In Distributed Databases, another aspect, "distribution transparency", is added to the notion of data independence as used in centralized databases. Distribution transparency means programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.

3 Reduction of Redundancy

In centralized databases redundancy was reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

4 Complex Physical Structures and Efficient Access

In centralized databases complex accessing structures like secondary indexes and inter-file chains are used. All these features provide efficient access to data. In distributed databases efficient access requires accessing data from different sites. For this an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer.

Problems faced in the design of an optimizer can be classified in two categories:

a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites.

b) Local optimization consists of deciding how to perform the local database accesses at each site.

5 Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two main threats to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

6 Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed.

In distributed databases, local administrators face the same problem as well as two new aspects of it: (a) security (protection) problems arise because the communication network is now intrinsic to the database system, and (b) databases with a high degree of "site autonomy" may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

7 Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as also are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context – the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.

8 Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information like fragmentation description, allocation description, mappings to local names, access method description, statistics on the database, protection and integrity constraints (consistency information) which are more detailed as compared to centralized databases.

Relative Advantages of Distributed Databases over Centralized Databases

Organizational and Economic Reasons

Many organizations are decentralized, and a distributed database approach fits the structure of the organization more naturally. The organizational and economic motivations are amongst the main reasons for the development of distributed databases. In organizations already having several databases and feeling the necessity of global applications, distributed databases are the natural choice.

Incremental Growth

In a distributed environment, expansion of the system in terms of adding more data, increasing database size, or adding more processors is much easier.

Reduced Communication Overhead

Many applications are local, and these applications do not have any communication overhead. Therefore, the maximization of the locality of applications is one of the primary objectives in distributed database design.

Performance Considerations

Data localization reduces the contention for CPU and I/O services and simultaneously reduces access delays involved in wide area networks. Local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Moreover, inter-query and intra-query parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of subqueries that execute in parallel. This contributes to improved performance.

Reliability and Availability

Reliability is defined as the probability that a system is running (not down) at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site.

Management of Distributed Data with Different Levels of Transparency

In a distributed database, the following types of transparencies are possible:

Distribution or Network Transparency

This refers to freedom for the user from the operational details of the network. It may be divided into location and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.

Replication Transparency

Copies of the data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of copies.

Fragmentation Transparency

Two main types of fragmentation are Horizontal Fragmentation, which distributes a relation into sets of tuples (rows), and Vertical Fragmentation, which distributes a relation into sub-relations where each sub-relation is defined by a subset of the columns of the original relation. A global query by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments.
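A small sketch of the two fragmentation types (the EMPLOYEE(emp_id, name, salary, site) relation and the site names are invented for illustration; a real system would allocate the fragments to different nodes):

-- Horizontal fragments: subsets of the tuples, here split by site.
CREATE VIEW employee_delhi  AS SELECT * FROM employee WHERE site = 'DELHI';
CREATE VIEW employee_mumbai AS SELECT * FROM employee WHERE site = 'MUMBAI';

-- Vertical fragments: subsets of the columns, each retaining the key emp_id.
CREATE VIEW employee_pay     AS SELECT emp_id, salary FROM employee;
CREATE VIEW employee_profile AS SELECT emp_id, name, site FROM employee;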

Q: 6. How does the process of retrieval of text differ from the retrieval of images? What considerations should be taken care of during information retrieval?

Ans:

Text-based documents are basically unstructured and can be complex. They can consist of the raw text only, have a tagged structure (such as for html documents), include embedded images, and can have a number of fixed attributes containing the metadata describing aspects of the document. They may also include links to supplementary materials. For example, a news report for an election could include the following components, where n, m, k, and x are the number of occurrences of each component type:

1. Identifier, date, and author(s) of the report,

2. n* text blocks – (titles, abstract, content text),

3. m* images – example: image_of_candidate


4. k* charts, and

5. x* maps.

Popular knowledge claims that an image is worth 1000 words. Unfortunately, these 1000 words may differ from one individual to another depending on their perspective and/or knowledge of the image context. For example, Figure 6 gives a familiar demonstration that an image can have multiple, quite different interpretations. Thus, even if a 1000-word image description were available, it is not certain that the image could be retrieved by a user with a different description. The problem is fundamentally one of communication between an information/image seeker/user and the image retrieval system. Since the user may have differing needs and knowledge about the image collection, an image retrieval system must support various forms for query formulation. In general, image retrieval queries can be classified as:

1. Attribute-Based Queries: which use context and/or structural metadata values to retrieve images, for example:

- Find image number ‘x’ or

- Find images from the 17th of May (the Norwegian national holiday day).

2. Textual Queries: which use a term-based specification of the desired images that can be matched to textual image descriptors, for example:

- Find images of Hawaiian sunsets or

- Find images of President Bush delivering a campaign speech

3. Visual Queries: which give visual characteristics (color, texture) or an image that can be compared to visual descriptors. Examples include:

- Find images where the dominant color is blue and gold or

- Find images like <this one>.

These query types utilize different image descriptors and require different processing functions. Image descriptors can be classified into:

· Metadata Descriptors: those that describe the image, as recommended in the numerous metadata standards, such as Dublin Core, CIDOC/CRM and MPEG-7, from the library, museum and motion picture communities respectively.

These metadata can again be classified as:

1. Attribute-based context and structural metadata, such as creator, dates, genre, (source) image type, size, file name, …, or

2. Text-based semantic metadata, such as title/caption, subject/keyword lists, free-text descriptions and/or the text surrounding embedded images, for example as used in a html document. Note that for embedded images, content indexing can be generated using the nearby text.


· Visual descriptors that can be extracted from the image during the storage process by an image retrieval system as recommended and used by the image interpretation community. These descriptors include:

1. Low/pixel level features describing the color, texture, and/or (primitive) shapes within the image.

2. The object set, identified within an image.

Visual descriptors are used to form the basis for one or more image signatures that can be indexed. An image query is analyzed using the same descriptor technique(s) giving a query signature, which is then compared to the image signature(s) to determine similarity between the query specification and the DB image signatures.

For example, assuming that the document collection contains the Web-document, a search for documents containing images that look like Edvard Grieg assumes that the query processor can use:

1. An attribute-based search using the metadata describing the image data.

and/or

2. A text-based search in the titles and text sections of the document collection and extract associated images, and/or

3. A picture of Edvard Grieg, retrieved from the DB or given as input through the query language, to search DB for images containing similar images.

Given an input image as search criteria and a query: “find images similar to this”, the system will characterize the query image according to the same methods used for the DB images and compare the result to the index entries.

· This process works well when the query image is the same type as those in the DB, i.e. photo-photo DB, scanned painting to painting DB, etc.

· It does not work as well when the level of detail differs as when the input is a sketch and the DB contains photos and/or scanned paintings.

Finding images from an image collection depends on the system being able to ‘understand’ the query specifications and match these specifications to the images. Matching each stored image to the query specifications at query request time can be very time consuming. Therefore, researchers and developers of image retrieval systems recommend the use of predefined image descriptors or metadata as the basis for image retrieval. An Image retrieval system needs to be able to utilize each of the descriptor types listed above. Most of these systems have some support for both metadata and visual search, though not necessarily in combination.


Information Retrieval Systems (IRS) have been under development since the mid 1950s. They provide search and retrieval functions for text document collections based on document structure, concepts of words, and grammar. It is functionality from these systems that has been added by OR-DBMS vendors to support management of multimedia data. The resulting ORDBMS/MM (Multimedia) conforms (to some degree) to the Multimedia Information Retrieval Systems, MIRS, envisioned by Lu (1999).

Basic ORDBMS/MM text retrieval functionality includes generation of multiple types of term indexes, as well as a contains operator with sub-operators for the WHERE clause. The contains operator differs from an exact match query in that it gives a probability for a match – a similarity score – between the query search terms and the documents in the database, rather than a true/false result. This operator can be used with multiple search terms and operators that specify relationships between the search terms, for example the Boolean operators AND, OR, NOT and location operators such as: adjacent, within same sentence or paragraph for text documents, as illustrated in the following table.

Operator type             Examples
Term combination          AND, OR, NOT
Term location             ADJACENT, NEAR, WITHIN, …
Concept                   ABOUT, SIMILAR
Various other operators   FUZZY, LIKE, …

Assuming that whole Web pages are stored in an OR-DB attribute Document.text, the following examples will retrieve the document, in addition to other documents containing the search terms.

1) Select * from Document where Text CONTAINS ('Edvard' AND 'Grieg');
2) Select * from Document where Text CONTAINS ('Edvard' ADJACENT 'Grieg');
3) Select * from Document where Text ABOUT ('composers');

In processing the above queries, the SQL3/Text processing system utilizes the term indexes generated for the document set, as well as a thesaurus for query 3. Note that a term location index is required for query 2, while query 1 needs a frequency index if the retrieved documents are to be ranked /ordered by the frequency of the search terms within the documents.