MC0077 – Advanced Database Systems



1. Describe the following:

o Dimensional Model

o Object Database Models

o Post-Relational Database Models

Ans: Dimensional model

The dimensional model is a specialized adaptation of the relational model used to represent data in data warehouses in a way that data can be easily summarized using OLAP queries. In the dimensional model, a database consists of a single large table of facts that are described using dimensions and measures. A dimension provides the context of a fact (such as who participated, when and where it happened, and its type) and is used in queries to group related facts together. Dimensions tend to be discrete and are often hierarchical; for example, the location might include the building, state, and country. A measure is a quantity describing the fact, such as revenue. It’s important that measures can be meaningfully aggregated – for example, the revenue from different locations can be added together.

In an OLAP query, dimensions are chosen and the facts are grouped and added together to create a summary.

The dimensional model is often implemented on top of the relational model using a star schema, consisting of one table containing the facts and surrounding tables containing the dimensions. Particularly complicated dimensions might be represented using multiple tables, resulting in a snowflake schema.

A data warehouse can contain multiple star schemas that share dimension tables, allowing them to be used together. Coming up with a standard set of dimensions is an important part of dimensional modelling.
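To make the star schema concrete, the sketch below declares a minimal fact table with two dimension tables and runs a typical OLAP summary over it. All table and column names are hypothetical, invented for illustration rather than taken from the text:

-- Hypothetical star schema: one fact table referencing two dimension tables.
CREATE TABLE dim_date (
  date_id  INTEGER PRIMARY KEY,
  day      INTEGER,
  month    INTEGER,
  year     INTEGER
);

CREATE TABLE dim_location (
  location_id INTEGER PRIMARY KEY,
  building    VARCHAR(40),
  state       VARCHAR(40),
  country     VARCHAR(40)   -- hierarchical dimension: building -> state -> country
);

CREATE TABLE fact_sales (
  date_id     INTEGER REFERENCES dim_date(date_id),
  location_id INTEGER REFERENCES dim_location(location_id),
  revenue     NUMERIC(12,2) -- the measure: meaningfully summable across facts
);

-- A typical OLAP query: choose dimensions, group the facts, and aggregate.
SELECT d.year, l.country, SUM(f.revenue) AS total_revenue
FROM fact_sales f, dim_date d, dim_location l
WHERE f.date_id = d.date_id
  AND f.location_id = l.location_id
GROUP BY d.year, l.country;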

Object Database Models

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.

A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.


Object databases suffered because of a lack of standardization: although standards were defined by ODMG, they were never implemented well enough to ensure interoperability between products. Nevertheless, object databases have been used successfully in many applications: usually specialized applications such as engineering databases or molecular biology databases rather than mainstream commercial data processing. However, object database ideas were picked up by the relational vendors and influenced extensions made to these products and indeed to the SQL language.

Post-Relational Database Models

Several products have been identified as post-relational because their data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations. Products using a post-relational data model typically employ a model that actually pre-dates the relational model; such a model might be described as a directed graph with trees on the nodes.

Post-relational databases could be considered a sub-set of object databases, as there is no need for object-relational mapping when using a post-relational data model. Despite repeated attacks on this class of data models, which label them hierarchical or legacy, the post-relational database industry continues to grow as a multi-billion dollar industry, even if the growth stays below the relational database radar.

Examples of models that could be classified as post-relational are PICK (also known as MultiValue) and MUMPS (also known as M).

2. Describe the following with respect to Database Management Systems:

o Information & Data Retrieval

o Image Retrieval Systems

o Multiple Media Information Retrieval Systems, MIRS

Ans: Information & Data Retrieval

The terms information and data are often used interchangeably in the data management literature, causing some confusion in the interpretation of the goals of different data management system types. It is important to remember that despite the name of a data management system type, it can only manage data. These data are representations of information. However, historically (since the late 1950s) a distinction has been made between:

· Data Retrieval, as retrieval of ‘facts’, commonly represented as atomic data about some entity of interest, for example a person’s name, and

· Information Retrieval, as the retrieval of documents, commonly text but also visual and audio, that describe objects and/or events of interest.

Both retrieval types match the query specifications to database values. However, while data retrieval only retrieves items that match the query specification exactly, information retrieval systems return items that are deemed (by the retrieval system) to be relevant or similar to the query terms. In the latter case, the information requester must select the items that are actually relevant to his/her request. Quick examples include the request for the balance of a bank account vs. selecting relevant links from a google.com result list.

User requests for data are typically formed as "retrieval-by-content", i.e. the user asks for data related to some desired property or information characteristic. These requests or queries must be specified using one of the query languages supported by the DMS query processing subsystem. A query language is tailored to the data type(s) of the data collection. Figure 3.3 models a multiple media database and illustrates 2 query types:

1. A Data Retrieval Query expressed in SQL, shown in Figure 3.3b, based on attribute-value matches. In this case, a request for titles of documents containing the term "database" and authored by "Joan Nordbotten", and

2. A Document Retrieval Query, shown in Figure 3.3c. In this case, the documents requested should contain the search terms (keywords): database, management, sql3 or msql.

The query in Figure 3.3b is stated in standard SQL2, while the query in Figure 3.3c is an example of a content query and is typical of those used within information retrieval systems.

Note that 2 different query languages are needed to retrieve data from the structured and non-structured (document) data in the DB.

· The SQL query is based on specification of attribute values and requires that the user knows the attribute names used to describe the database, which are stored in the DB schema, while

· The Document query assumes that the system knows the location of Document.Body and is able to perform a keyword search and a similarity evaluation.
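Since Figure 3.3 is not reproduced in this transcript, the following sketch shows what the two query types might look like; all table and column names (other than Document.Body, mentioned above) are assumptions, and CONTAINS stands in for whatever keyword-search operator the retrieval system provides:

-- 1. Data retrieval query (standard SQL2, exact attribute-value matching):
SELECT d.Title
FROM Document d, Author a
WHERE a.Doc_id = d.Doc_id
  AND a.Name = 'Joan Nordbotten'
  AND d.Title LIKE '%database%';

-- 2. Document (content) retrieval query, in information retrieval style;
--    the system ranks documents by similarity to the keywords:
SELECT d.Doc_id
FROM Document d
WHERE CONTAINS(d.Body, 'database management sql3 msql');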

Image Retrieval Systems

Due to the large storage requirements for images, computer generation of image material, in the form of charts, illustrations and maps, predated the creation of image databases and the need for ad-hoc image retrieval. The development of scanning devices, particularly for medical applications, and digital cameras, as well as the rapidly increasing capacity of computer storage, has led to the creation of large collections of digital image material. Today, many organizations, such as news media, museums and art galleries, as well as police and immigration authorities, maintain large collections of digital images. For example, the New York Public Library has made its digital gallery, with over 480,000 scanned images, available to the Internet public.

Maintaining a large image collection leads necessarily to a need for an effective system for image indexing and retrieval. Image data collections have a structure similar to that used for text document collections, i.e. each digital image is associated with descriptive metadata, an example of which is illustrated in Figure 3.6. While management of the metadata is the same for text and image collections, the techniques needed for direct image comparison are quite different from those used for text documents. Therefore, current image retrieval systems use 2 quite different approaches for image retrieval (not necessarily within the same system):

1. Retrieval based on metadata, generated manually, that describe the content, meaning/interpretation and/or context of each image, and/or

2. Retrieval based on automatically selected, low-level features, such as color and texture distribution and identifiable shapes. This approach is frequently called CBIR, or content-based image retrieval.

Most of the metadata attributes used for digitized images, such as those listed in Figure 3.6, can be stored as either regular structured attributes or text items. Once collected, metadata can be used to retrieve images using either exact match on attribute values or text search on descriptive text fields. Most image retrieval systems utilize this approach. For example, a Google search for images about humpback whales listed over 15,000 links to images based on the text – captions, titles, file names – accompanying the images (July 26th, 2006).
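A sketch of the metadata approach in SQL follows; the Image table and its attributes are assumptions modeled on the kind of descriptive metadata discussed above, since Figure 3.6 is not reproduced here:

-- Hypothetical metadata table for an image collection.
CREATE TABLE Image (
  image_id     INTEGER PRIMARY KEY,
  title        VARCHAR(100),
  photographer VARCHAR(60),
  capture_date DATE,
  description  VARCHAR(2000),  -- free-text caption
  picture      BLOB            -- the pixel data itself
);

-- Exact match on a structured attribute combined with text search on the
-- descriptive field:
SELECT image_id, title
FROM Image
WHERE capture_date >= DATE '2006-01-01'
  AND description LIKE '%humpback whale%';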

As noted earlier, images are strings of pixels with no explicit relationship to the following pixel(s) other than their serial position. Unlike text documents, there is no image vocabulary that can be used to index the semantic content. Instead, image pixel analysis routines extract dominant low-level features, such as the distribution of the colors and texture(s) used, and the location(s) of identifiable shapes. This data is used to generate a signature for each image that can be indexed and used to match a similar signature generated for a visual query, i.e. a query based on an image example. Unfortunately, using low-level features does not necessarily give a good 'semantic' result for image retrieval.

Multiple Media Information Retrieval Systems, MIRS

Today, many organizations maintain separate digital collections of text, images, audio, and video data in addition to their basic administrative database systems. Increasingly, these organizations need to integrate their data collections, or at least give seamless access across these collections in order to answer such questions as "What information do we have about this <service/topic>?", for example about a particular kind of medical operation or all of the information from an archeological site.

This gives rise to a requirement for multiple media information retrieval systems, i.e. systems capable of integrating all kinds of media data: tabular/administrative, text, image, spatial, temporal, audio, and/or video data. A Multimedia Information Retrieval System, MIRS, can be defined as:

A system for the management (storage, retrieval and manipulation) of multiple types of media data. In practice, an MIRS is a composite system that can be modeled as shown in Figure 3.7. As indicated in the figure, the principal data retrieval sub-systems, located in the connector 'dots' on the connection lines of Figure 3.7, can be adapted from known technology used in current specialized media retrieval systems. The actual placement of these components within a specific MIRS may vary, depending on the anticipated co-location of the media data.

The major vendors of Object-Relational (O-R) systems, such as IBM's DB2, Informix, and Oracle, have included data management subsystems for such media types as text documents, images, spatial data, audio and video. These 'new' (to SQL) data types have been defined using the user-defined type functionality available in SQL3. Management functions/methods, based on those developed for the media types in specialized systems, have been implemented as user-defined functions. The result is an extension to the SQL3 standard with system-dependent implementations for the management of multimedia data.

The intent of this book is to explore how OR-DBMS technology can be utilized to create generalized MIRS that can support databases containing any combination of media data.

3. Describe the following:

o New Features in SQL3

o Query Optimization

Ans: New Features in SQL3

SQL3 was accepted as the new standard for SQL in 1999, after more than 7 years of debate. Basically, SQL3 includes data definition and management techniques from Object-Oriented DBMS, while maintaining the relational DBMS platform. Based on this merger of concepts and techniques, DBMSs that support SQL3 are called Object-Relational or ORDBMS.

The most central data modeling notions included in SQL3 are illustrated in Figure 5.2 and support specification of:

· Classification hierarchies,

· Embedded structures that support composite attributes,


· Collection data-types (sets, lists/arrays, and multi-sets) that can be used for multi-valued attribute types,

· Large OBject types, LOBs, within the DB, as opposed to requiring external storage, and

· User defined data-types and functions (UDT/UDF) that can be used to define complex structures and derived attribute value calculations, among many other function extensions.

Query formulation in SQL3 remains based in the structured, relational model, though several functional additions have been made to support access to the new structures and data types.

1 Accessing Hierarchical Structures

Hierarchic structures can be used at 2 levels, illustrated in Figure 5.2, for:

1. Distinguishing roles between entity-types and

2. Detailing attribute components.

Figure 5.2: DMS support for complex data-types

A cascaded dot notation has been added to the SQL3 syntax to support specification of access paths within these structures. For example, the following statement selects the names and pictures of students from Bergen, Norway, using the OR DB specification given by the SQL3 declarations in Figure 5.3a.


Figure 5.3: Entity and relationship specification in SQL3

SELECT name, picture FROM Student
WHERE address.city = 'Bergen'
AND address.country = 'Norway';

The SQL3 query processor recognizes that Student is a sub-type of Person and that the attributes name, picture and address are inherited from Person, making it unnecessary for the user to:

· specify the Person table in the FROM clause,


· use the dot notation to specify the parent entity-type Person in the SELECT or WHERE clauses, or

· specify an explicit join between the levels in the entity-type hierarchy, here Student to Person.
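Since Figure 5.3a is not reproduced in this transcript, the following is a minimal sketch, in SQL:1999 syntax, of declarations that would support the query above; all type names, attribute names and sizes are assumptions:

-- A composite address type (embedded structure).
CREATE TYPE address_t AS (
  street  VARCHAR(40),
  city    VARCHAR(30),
  country VARCHAR(30)
) NOT FINAL;

-- Person with a LOB attribute and a collection-valued address attribute.
CREATE TYPE person_t AS (
  name    VARCHAR(40),
  picture BLOB(1M),
  address address_t ARRAY[3]   -- multi-valued attribute; 1st = home address
) NOT FINAL;

-- Student as a sub-type of Person in the classification hierarchy.
CREATE TYPE student_t UNDER person_t AS (
  level INTEGER
) NOT FINAL;

-- Typed tables mirroring the type hierarchy.
CREATE TABLE Person OF person_t;
CREATE TABLE Student OF student_t UNDER Person;

With declarations of this kind, the query above can be run against Student alone, reaching name, picture and address through inheritance.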

2 Accessing Multi-Valued Structures

SQL3 supports multi-valued (MV) attributes using a number of different implementation techniques. Basically, MV attribute structures can be defined as ordered or unordered sets and implemented as lists, arrays or tables, either embedded in the parent table or 'normalized' into a linked table.

In our example in Figure 5.1a, Person.address is a multi-valued complex attribute, defined as a set of addresses. In executing the previous query, the query processor must search each city and country combination for the result. If the query intent is to locate students with a home address in Bergen, Norway, and we assume that the address set has been implemented as an ordered array in which the 1st address is the home address, the query should be specified as:

SELECT name, picture FROM Student
WHERE address[1].city = 'Bergen'
AND address[1].country = 'Norway';

3 Utilizing User Defined Data Types (UDT)

User defined functions can be used in either the SELECT or WHERE clauses, as shown in the following example, again based on the DB specification given in Figure 5.3a.

SELECT Avg(age) FROM Student
WHERE Level > 4
AND age > 22;

In this query age is calculated by the function defined for Person.age. The SQL3 processor must calculate the relevant student.age for each graduate student (assuming that Level represents the number of years of higher education) and then calculate the average age of this group.
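A possible declaration of such a derived-attribute function is sketched below in SQL/PSM style. It assumes Person carries a birth_date column; the actual definition in Figure 5.3a may well differ:

-- Hypothetical user-defined function deriving age from a birth date
-- (approximate: it ignores whether the birthday has occurred this year).
CREATE FUNCTION age(birth_date DATE)
  RETURNS INTEGER
  LANGUAGE SQL
  RETURN EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date);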

4 Accessing Large Objects

SQL3 has added data-types and storage support for unstructured binary and character large objects, BLOB and CLOB respectively, that can be used to store multimedia documents. However, no new query functionality has been added to access the content of these LOB data, though most SQL3 implementations have extended the LIKE operator so that it can also search through CLOB data. Thus, access to BLOB/CLOB data must be based on search conditions in the metadata of formatted columns or on use of the LIKE operator. Some ORDBMS implementations have extended other character string operators to operate on CLOB data, such as:

· LOCATE, which returns the position of the first character or bit string within a LOB that matches the search string and

· Concatenation, substring, and length calculation.


Note that LIKE, concatenation, substring and length are original SQL operators that have been extended to function with LOBs, while LOCATE is a new SQL3 operator. An example of using the LIKE operator, based on the MDB defined in Figure 5.3a, is:

SELECT Description FROM Course
WHERE Description LIKE '%data management%'
OR Description LIKE '%information management%';

Figure 5.4

Note that the LIKE operator does not make use of any index, rather it searches serially through the CLOB for the pattern given in the query specification.

5 Result Presentation

While there are no new presentation operators in SQL3, both complex and derived attributes can be used as presentation criteria in the standard clauses "group by", "having", and "order by". However, large objects (LOBs) cannot be used, since 2 LOBs are unlikely to be identical and have no logical order. SQL3 expands embedded attributes, displaying them in 1 'column' or as multiple rows.

Depending on the ORDBMS implementation, the result set is presented either in total, as the first 'n' rows, or one tuple at a time. If an attribute of a relation in the result set is defined as a large object, LOB, its presentation may fill one or more screens/pages for each tuple.

SQL3, as a relational language using exact-match selection criteria, has no concept of degrees of relevance and thus no support for ranking the tuples in the result set by semantic nearness to the query. Providing this functionality will require user-defined output functions, or specialized document processing subsystems as provided by some OR-DBMS vendors.

Query Optimization

The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness.

The traditional, relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:

1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.

2. Execute join operations to further reduce the result set.

3. Execute operations on media data, since these can be very time consuming.

4. Prepare the result set for presentation.
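Figure 5.5 itself is not reproduced in this transcript; a query of the general shape discussed below might look like the following, where every table name, column name and clause number is an assumption made for illustration:

SELECT S.name, C.title                        -- clause 1: result layout
FROM Student S, TakenBy T, Course C           -- clause 2: source tables
WHERE T.Sid = S.Sid AND T.Cid = C.Cid         -- clause 3: joins
  AND S.age > 22                              -- clause 4: reduces Student rows
  AND C.Description LIKE '%data management%'  -- clause 5: serial CLOB search
  AND C.term = 'Fall'                         -- clause 6: reduces Course rows
  AND T.grade IS NOT NULL                     -- clause 7: reduces TakenBy rows
ORDER BY S.name;                              -- clause 8: presentation order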

Using the example from the query in Figure 5.5, a near optimal execution plan would be to execute the statements in the following order:


1. Clauses 4, 6 and 7 in any order. Each of these statements reduces the number of rows in their respective tables.

2. Clause 3. The join will further reduce the number of course tuples that satisfy the age and time constraints. This will be a reasonably quick operation if:

- There are indexes on TakenBy.Sid and TakenBy.Cid so that an index join can be performed, and

- The Course.Description clob has been stored outside of the Course table and is represented by a link to its location.

3. Clause 5 will now search only course descriptions that meet all other selection criteria. This will still be a time consuming serial search.

4. Finally, clause 8 will order the result set for presentation using the layout specified in clause 1.

4. Describe the following with suitable real-time examples:

o Data Storage Methods

o Data Dredging

Ans: Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows Codd's rules of data normalization in order to ensure absolute data integrity. Complex information is broken down into its simplest structures (tables) where all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Codd defines 5 increasingly stringent rules of normalization, and OLTP systems typically achieve third-level normalization. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables, and this results in very fast insert/update performance because only a little bit of data is affected in each relational transaction.

OLTP databases are efficient because they typically deal only with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many users a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data using incomprehensible coding schemes – all factors that, while improving performance, complicate use by untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from independently designed package and legacy systems.


Designing the data warehouse data architecture is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides the most useful and flexible basis for use in reporting and information analysis. However, because of differing focus on specific requirements, there are alternative methods for designing and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. Whilst the dimensional approach is very useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction complications when used in a data warehouse.

In the "dimensional" approach, transaction data is partitioned into either a measured "facts", which are generally numeric data that captures specific values or "dimensions" which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and salesperson. The main advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add new information into the database – the primary disadvantage of this approach is that because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around business transactions, such as customer enrollment, sales and trades.

Data Dredging

Data Dredging or Data Fishing are terms one may use to criticize someone's data mining efforts when it is felt that the patterns or causal relationships discovered are unfounded. In this case the pattern suffers from overfitting on the training data.

Data Dredging is the scanning of data for any relationships, and then, when one is found, coming up with an interesting explanation. The conclusions may be suspect because data sets with large numbers of variables will, by chance alone, contain some "interesting" relationships. Fred Schwed said:

"There have always been a considerable number of people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."

Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels. Some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so sometimes the line between good statistical practice and data dredging is less than clear. Most data mining efforts are focused on developing highly detailed models of some large data set. Other researchers have described an alternate method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data.

When data sets contain a large set of variables, the level of statistical significance should be adjusted in proportion to the number of patterns that were tested. For example, if we test 100 random patterns, it is expected that one of them will be "interesting" with a statistical significance at the 0.01 level.
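The arithmetic behind this example: with m independent tests each run at significance level α, the expected number of spurious "significant" findings is mα. A standard remedy, not named in the text, is the Bonferroni correction, which tests each pattern at level α/m:

\[
E[\text{false positives}] = m\alpha = 100 \times 0.01 = 1,
\qquad
\alpha_{\text{per test}} = \frac{\alpha}{m} = \frac{0.01}{100} = 0.0001
\]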

Cross Validation is a common approach to evaluating the fitness of a model generated via data mining, where the data is divided into a training subset and a test subset to respectively build and then test the model. Common cross validation techniques include the holdout method, k-fold cross validation, and the leave-one-out method.

5. Describe the following with respect to Fuzzy querying to relational databases:

o Proposed Model

o Meta Knowledge

o Implementation

Ans: The easiest way of introducing fuzziness in the database model is to use a classical relational database and formulate a front end to it that allows fuzzy querying of the database. A limitation imposed on the system is that, because we are neither extending the database model nor defining a new model in any way, the underlying database model is crisp and hence the fuzziness can only be incorporated in the query.

To incorporate fuzziness we introduce fuzzy sets (linguistic terms) on the attribute domains (linguistic variables); e.g. on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD. These are defined as follows:

Fig. 8.4: Age

For this we take the example of a student database which has a table STUDENTS with the following attributes:


Fig. 8.5: A snapshot of the data existing in the database

Meta Knowledge

At the level of meta knowledge we need to add only a single table, LABELS, with the following structure:

Fig. 8.6: Meta Knowledge

This table is used to store the information of all the fuzzy sets defined on all the attribute domains. A description of each column in this table is as follows:

· Label: This is the primary key of this table and stores the linguistic term associated with the fuzzy set.

· Column_Name: Stores the linguistic variable associated with the given linguistic term.

· Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set.
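A sketch of this meta table in SQL follows, with sample rows for the AGE fuzzy sets of Figure 8.4. The four parameters are read here as the corner points of a trapezoidal membership function, and all numeric values are assumptions, since the figure is not reproduced in this transcript:

CREATE TABLE LABELS (
  Label       VARCHAR(20) PRIMARY KEY, -- linguistic term, e.g. YOUNG
  Column_Name VARCHAR(30),             -- linguistic variable, e.g. AGE
  Alpha       NUMERIC,                 -- membership rises from Alpha...
  Beta        NUMERIC,                 -- ...to full membership at Beta,
  Gamma       NUMERIC,                 -- stays 1 until Gamma,
  Delta       NUMERIC                  -- and falls back to 0 at Delta
);

INSERT INTO LABELS VALUES ('YOUNG',  'AGE',  0,  0, 25, 35);
INSERT INTO LABELS VALUES ('MIDDLE', 'AGE', 25, 35, 45, 55);
INSERT INTO LABELS VALUES ('OLD',    'AGE', 45, 55, 99, 99);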

Implementation

The main issue in the implementation of this system is the parsing of the input fuzzy query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the INSERT query will not change and need not be parsed; it can therefore be presented to the database as-is. During parsing, the query is divided into the following parts:

1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.

2. Result Attributes: The attributes that are to be displayed; used only in the case of a SELECT query.

3. Source Tables: The tables on which the query is to be applied.

4. Conditions: The conditions that have to be satisfied before the operation is performed. Each condition is further sub-divided into the Query Attribute (i.e. the attribute on which the query is to be applied) and the linguistic term. If the condition is not fuzzy, i.e. it does not contain a linguistic term, it need not be subdivided.
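As an illustration of how a parsed fuzzy condition might be resolved, consider the hypothetical fuzzy query SELECT Name FROM STUDENTS WHERE AGE = YOUNG. Using the LABELS row for (YOUNG, AGE), the front end could rewrite it into crisp SQL; the sketch below simply retrieves every row with non-zero membership (the support of the fuzzy set):

-- Crisp rewrite of the fuzzy condition AGE = YOUNG (hypothetical schema).
SELECT s.Name
FROM STUDENTS s, LABELS l
WHERE l.Label = 'YOUNG'
  AND l.Column_Name = 'AGE'
  AND s.Age BETWEEN l.Alpha AND l.Delta;

A fuller implementation would also compute each row's membership degree from the Alpha-Delta parameters and rank or threshold the result accordingly.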

6. Describe the Data Replication concepts

Ans: Data Replication

Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system. Changes applied at one site are captured and stored locally before being forwarded and applied at each of the remote locations. Advanced Replication is a fully integrated feature of the Oracle server; it is not a separate server.

Replication uses distributed database technology to share data between multiple sites, but a replicated database and a distributed database are not the same. In a distributed database, data is available at many locations, but a particular table resides at only one location. For example, the employees table resides at only the loc1.world database in a distributed database system that also includes the loc2.world and loc3.world databases. Replication means that the same data is available at multiple locations. For example, the employees table is available at loc1.world, loc2.world, and loc3.world.

Some of the most common reasons for using replication are described as follows:

9.10.1 Availability

Replication improves the availability of applications because it provides alternative data access options. If one site becomes unavailable, users can continue to query, or even update, the data at the remaining locations.

9.10.2 Performance

Replication provides fast, local access to shared data because it balances activity over multiple sites. Some users can access one server while other users access different servers, thereby reducing the load at all servers. Also, users can access data from the replication site that has the lowest access cost, which is typically the site that is geographically closest to them.

9.10.3 Disconnected Computing

A Materialized View is a complete or partial copy (replica) of a target table from a single point in time. Materialized views enable users to work on a subset of a database while disconnected from the central database server. Later, when a connection is established, users can synchronize (refresh) materialized views on demand. When users refresh materialized views, they update the central database with all of their changes, and they receive any changes that may have happened while they were disconnected.

9.10.4 Network Load Reduction

Replication can be used to distribute data over multiple regional locations. Then, applications can access various regional servers instead of accessing one central server. This configuration can reduce network load dramatically.

9.10.5 Mass Deployment

Replication makes it possible to deploy an application, together with a local replica of the data it needs, to a large number of users, for example the laptops of a mobile sales force, with each replica kept consistent with the central database through periodic refreshes.
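As an illustration, in Oracle a materialized view replica of a remote table can be created and refreshed roughly as follows; the object and database link names are hypothetical, and a fast refresh additionally assumes a materialized view log at the master site:

-- Create a local replica of the remote employees table.
CREATE MATERIALIZED VIEW employees_mv
  REFRESH FAST
  AS SELECT * FROM employees@loc1.world;

-- Later, synchronize the replica with the master site on demand
-- (e.g. from SQL*Plus):
EXECUTE DBMS_MVIEW.REFRESH('employees_mv');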