
July 2011
Master of Computer Application (MCA) Semester 4
MC0077 Advanced Database Systems (4 Credits)

    (Book ID: B0882)

    Assignment Set 2

1. Describe the following with suitable examples:
o Cost Estimation
o Measuring Index Selectivity

    Ans: Cost Estimation

One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the selection factor of predicates in the query. Traditionally, database systems estimate selectivity through fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for estimating the selectivity of individual predicates. However, many queries have conjunctions of predicates, such as:

select count(*) from R, S where R.make = 'Honda' and R.model = 'Accord';

Query predicates are often highly correlated (for example, model = 'Accord' implies make = 'Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are among the main reasons why query optimizers pick poor query plans. This is one reason why a DBA should regularly update the database statistics, especially after major data loads/unloads.

The Cardinality of a set is a measure of the "number of elements of the set". There are two approaches to cardinality: one which compares sets directly using bijections and injections, and another which uses cardinal numbers.
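To make the correlation problem concrete, here is a minimal sketch (the table cars and its row proportions are invented for illustration) of how an optimizer's independence assumption underestimates the selectivity of a correlated conjunct:

-- Suppose 10% of the rows in cars have make = 'Honda' and 2%
-- have model = 'Accord'. Assuming independence, the optimizer
-- estimates the combined selectivity as
--   sel(make = 'Honda') * sel(model = 'Accord') = 0.10 * 0.02 = 0.002
-- But model = 'Accord' implies make = 'Honda', so the true
-- selectivity is sel(model = 'Accord') = 0.02, ten times larger.
select count(*)
from   cars
where  make  = 'Honda'
and    model = 'Accord';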

    Measuring Index Selectivity

    Index Selectivity

B*TREE indexes improve the performance of queries that select a small percentage of rows from a table. As a general guideline, we should create indexes on tables that are often queried for less than 15% of the table's rows. This value may be higher in situations where all data can be retrieved from an index, or where the indexed columns can be used for joining to other tables.

The ratio of the number of distinct values in the indexed column(s) to the number of records in the table represents the selectivity of an index. The ideal selectivity is 1. Such selectivity can be reached only by unique indexes on NOT NULL columns.

Example with Good Selectivity


If a table has 100,000 records and one of its indexed columns has 88,000 distinct values, then the selectivity of this index is 88000 / 100000 = 0.88.

Oracle implicitly creates indexes on the columns of all unique and primary keys that you define with integrity constraints. These indexes are the most selective and the most effective in optimizing performance. The selectivity of an index is the percentage of rows in a table having the same value for the indexed column. An index's selectivity is good if few rows have the same value.

    Example with Bad Selectivity

If an index on a table of 100,000 records has only 500 distinct values, then the index's selectivity is 500 / 100000 = 0.005, and in this case a query which uses the limitation of such an index will return 100000 / 500 = 200 records for each distinct value. It is evident that a full table scan is more efficient than using such an index, where much more I/O is needed to repeatedly scan the index and the table.

    How to Measure Index Selectivity?

    Manually measure index selectivity

The ratio of the number of distinct values to the total number of rows is the selectivity of the column. This method is useful for estimating the selectivity of an index before creating it.

select count(distinct job) "Distinct Values" from emp;

    select count(*) "Total Number Rows" from emp;

Selectivity = Distinct Values / Total Number Rows = 5 / 14 ≈ 0.36

    Automatically measuring index selectivity

We can determine the selectivity of an index by dividing the number of distinct indexed values by the number of rows in the table.

create index idx_emp_job on emp(job);
analyze table emp compute statistics;

select distinct_keys from user_indexes
where table_name = 'EMP' and index_name = 'IDX_EMP_JOB';


select num_rows from user_tables
where table_name = 'EMP';

Selectivity = DISTINCT_KEYS / NUM_ROWS = 5 / 14 ≈ 0.36

    Selectivity of each individual Column

Assuming that the table has been analyzed, it is also possible to query USER_TAB_COLUMNS to investigate the selectivity of each column individually.

select column_name, num_distinct
from user_tab_columns
where table_name = 'EMP';
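The two dictionary views can also be combined into a single statement. The following is a small sketch under the same assumptions (statistics have been gathered on EMP); the rounding is just for readability:

-- Per-column selectivity computed directly from the dictionary.
select c.column_name,
       c.num_distinct,
       t.num_rows,
       round(c.num_distinct / t.num_rows, 2) as selectivity
from   user_tab_columns c
       join user_tables t on t.table_name = c.table_name
where  c.table_name = 'EMP';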

2. Describe the following:
o Statements and Transactions in a Distributed Database
o Heterogeneous Distributed Database Systems

    Ans:

Statements and Transactions in a Distributed Database

The following sections introduce the terminology used when discussing statements and transactions in a distributed database environment.

    Remote and Distributed Statements

A Remote Query is a query that selects information from one or more remote tables, all of which reside at the same remote node. A Remote Update is an update that modifies data in one or more tables, all of which are located at the same remote node.


Note: A remote update may include a sub-query that retrieves data from one or more remote nodes, but because the update is performed at only a single remote node, the statement is classified as a remote update. A Distributed Query retrieves information from two or more nodes. A distributed update modifies data on two or more nodes. A distributed update is possible using a program unit, such as a procedure or a trigger, that includes two or more remote updates that access data on different nodes. Statements in the program unit are sent to the remote nodes, and the execution of the program succeeds or fails as a unit.

Remote and Distributed Transactions

A Remote Transaction is a transaction that contains one or more remote statements, all of which reference the same remote node. A Distributed Transaction is any transaction that includes one or more statements that, individually or as a group, update data on two or more distinct nodes of a distributed database. If all statements of a transaction reference only a single remote node, the transaction is remote, not distributed.
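As a hedged illustration of these terms (the database link name sales_db and the table names are invented; Oracle's database-link syntax is used since the discussion takes Oracle as its base):

-- Remote query: every referenced table resides at one remote node.
select * from emp@sales_db;

-- Distributed query: data is combined from two or more nodes
-- (here, the local node and the node behind sales_db).
select e.ename, d.dname
from   emp@sales_db e, dept d
where  e.deptno = d.deptno;

A transaction containing only statements of the first kind would be a remote transaction; one whose statements, individually or together, update data at two or more nodes would be a distributed transaction.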

Heterogeneous Distributed Database Systems

The Oracle distributed database architecture allows the mix of different versions of Oracle along with database products from other companies to create a heterogeneous distributed database system.

The Mechanics of a Heterogeneous Distributed Database

In a distributed database, any application directly connected to a database can issue a SQL statement that accesses remote data in the following ways (for the sake of explanation, we have taken Oracle as a base):

Data in another Oracle database is available, no matter what version. Databases at other physical locations are connected through a network and maintain communication.

Data in a non-compatible database (such as an IBM DB2 database) is available, assuming that the non-compatible database is supported by the application's gateway architecture, say SQL*Connect in the case of Oracle. One can connect the Oracle and non-Oracle databases with a network and use SQL*Net to maintain communication. Figure 9.3 illustrates a heterogeneous distributed database system encompassing different versions of Oracle and non-Oracle databases.


Figure 9.3: Heterogeneous Distributed Database Systems

When connections from an Oracle node to a remote node (Oracle or non-Oracle) are initially established, the connecting Oracle node records the capabilities of each remote system and the associated gateways, and SQL statement execution proceeds. However, in heterogeneous distributed systems, SQL statements issued from an Oracle database to a non-Oracle remote database server are limited by the capabilities of the remote database server and associated gateway. For example, if a remote or distributed query includes an Oracle extended SQL function (for example, an outer join), the function may have to be performed by the local Oracle database. Extended SQL functions in remote updates (for example, an outer join in a sub-query) are not supported by all gateways.

3. Explain:
A) Data Warehouse Architecture
B) Data Storage Methods

    Ans:

    Data Warehouse Architecture

The term Data Warehouse Architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include Decision Support Systems (DSS), Management Information Systems (MIS), and others.

The Data Warehouse Architecture describes the overall system from various perspectives such as data, process, and infrastructure needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall


system. The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related. The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

    Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows Codd's rules of data normalization in order to ensure absolute data integrity. Complex information is broken down into its simplest structures (tables) where all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Codd defines five increasingly stringent rules of normalization, and typically OLTP systems achieve third normal form. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables, which results in very fast insert/update performance because only a little bit of data is affected in each relational transaction.

OLTP databases are efficient because they typically deal only with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many users a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data using incomprehensible coding schemes: all factors that, while improving performance, complicate use by untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from independently designed packages and legacy systems.

Designing the data warehouse data architecture is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides the most useful and flexible basis for use in reporting and information analysis. However, because of different focuses on specific requirements, there can be alternative methods for designing and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. Whilst the dimensional approach is very useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction complications when used in a data warehouse.

    In the "dimensional" approach, transaction data is partitioned into either a measured "facts",which are generally numeric data that captures specific values or "dimensions" which contain thereference information that gives each transaction its context. As an example, a sales transactionwould be broken up into facts such as the number of products ordered, and the price paid, anddimensions such as date, customer, product, geographical location and salesperson. The main


advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.
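To make the dimensional form concrete, here is a minimal star-schema sketch of the sales example above (all table and column names are invented for illustration):

-- One fact table referencing several dimension tables.
create table dim_date     (date_id     number primary key, calendar_date date);
create table dim_customer (customer_id number primary key, customer_name varchar2(100));
create table dim_product  (product_id  number primary key, product_name  varchar2(100));

create table fact_sales (
    date_id     number references dim_date,
    customer_id number references dim_customer,
    product_id  number references dim_product,
    quantity    number,        -- measured fact
    price_paid  number(10,2)   -- measured fact
);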

    The "normalized" approach uses database normalization. In this method, the data in the datawarehouse is stored in third normal form. Tables are then grouped together by subject areas thatreflect the general definition of the data (customer, product, finance, etc.). The main advantageof this approach is that it is quite straightforward to add new information into the database theprimary disadvantage of this approach is that because of the number of tables involved, it can berather slow to produce information and reports. Furthermore, since the segregation of facts anddimensions is not explicit in this type of data model, it is difficult for users to join the requireddata elements into meaningful information without a precise understanding of the data structure.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around business transactions, such as customer enrollment, sales and trades.

4. Discuss how the process of retrieving text data differs from the process of retrieving an image.

    Ans:

    Text-based Information Retrieval Systems

As indicated in Table 3.1, text-based information retrieval systems, or more correctly text document retrieval systems, have as long a development history as systems for management of structured, administrative data. The basic structure for digital documents, illustrated in Figure 3.5, has remained relatively constant: a header of descriptive attributes, currently called metadata, is prefixed to the text of each document. The resulting document collection is stored in a Document DB. Note that in Figure 3.5 the attribute body can be replaced by a pointer (or link) to a storage location separate from the metadata.


    Figure 3.5: Basic digital document structure

In comparison to the structured/regular data used by administrative applications, documents are unstructured, consisting of a series of characters that represent words, sentences and paragraphs of unequal length. This requires different techniques for indexing, search and retrieval than those used for structured administrative data. Rather than indexing attribute values separately, a document retrieval system develops a term index similar to the ones found in the back of books, i.e. a list of the terms found in the documents with lists of where each term is located in the document collection. The frequency of term occurrence within a document is assumed to indicate the semantic content of the document.

Search for relevant documents is commonly based on the semantic content of the document, rather than on the descriptive attribute values connected to it. For example, if we assume that the data stored in the attribute Document.Body in Figure 3.3a is the actual text of the document, then the retrieval algorithm, when processing Q2 in Figure 3.3c, searches the term index and selects those documents that contain one or more of the query terms "database", "management", "sql3" and "msql". It then sorts the resulting document list according to the frequency of these terms in each document.

There are two principal problems in using term matching for document retrieval:

    1. Terms can be ambiguous, having meaning dependent on context, and

2. There is frequently a mismatch between the terms used by the searcher in his/her query and the terms used by the authors in the document collection.

Techniques and tools developed to address these problems, and thus improve retrieval quality, include:

    Indexing techniques based on word stems,

Dictionaries, thesauri, and grammatical rules as tools for interpretation of both search terms and documents,

Similarity and clustering algorithms,


Mark-up languages (adaptations of the editor's tag set) to indicate areas of the text, such as titles and chapters, and its layout, that can be used to enhance relevance evaluations, and finally

    Metadata standards for describing the semantic content and context for a document.

None of these techniques or tools is supported by the standard for relational database management systems. However, since there is a need to store text data with regular administrative data, various text management techniques are being added to OR-DBMS (object-relational DBMS) products.
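As a hedged example of such an extension (Oracle Text is used purely for illustration; the docs table and its columns are invented), an object-relational DBMS can build a term index over a text column and rank results by relevance:

-- Assumed document table: metadata header plus text body.
create table docs (
    doc_id number primary key,
    title  varchar2(200),
    body   clob
);

-- Oracle Text builds a term index over the body column.
create index docs_body_idx on docs(body) indextype is ctxsys.context;

-- Retrieve documents containing any of the query terms and
-- order them by a term-frequency-based relevance score.
select doc_id, title, score(1) as relevance
from   docs
where  contains(body, 'database OR management OR sql3 OR msql', 1) > 0
order  by relevance desc;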

Recently, Baeza-Yates & Ribeiro-Neto (1999) estimated that 90% of computerized data is in the form of text documents. This data is accessible using the retrieval technology developed for off-line document/information retrieval systems and adapted for the newer Digital Libraries and Web search engines. Due to the expanding quantity of text available on the internet, research and development efforts are (still) focused on improving the indexing and retrieval (similarity) algorithms used.

    Image Retrieval Systems

Due to the large storage requirements for images, computer generation of image material, in the form of charts, illustrations and maps, predated the creation of image databases and the need for ad-hoc image retrieval. The development of scanning devices, particularly for medical applications, and digital cameras, as well as the rapidly increasing capacity of computer storage, has led to the creation of large collections of digital image material. Today, many organizations, such as news media, museums and art galleries, as well as police and immigration authorities, maintain large collections of digital images. For example, the New York Public Library has made their digital gallery, with over 480,000 scanned images, available to the Internet public.

Maintaining a large image collection leads necessarily to a need for an effective system for image indexing and retrieval. Image data collections have a similar structure to that used for text document collections, i.e. each digital image is associated with descriptive metadata, an example of which is illustrated in Figure 3.6. While management of the metadata is the same for text and image collections, the techniques needed for direct image comparison are quite different from those used for text documents. Therefore, current image retrieval systems use two quite different approaches for image retrieval (not necessarily within the same system).


    Figure 3.6: Digital image document structure

1. Retrieval based on metadata, generated manually, that describes the content, meaning/interpretation and/or context for each image, and/or

2. Retrieval based on automatically selected, low-level features, such as color and texture distribution and identifiable shapes. This approach is frequently called CBIR, or content-based image retrieval.

Most of the metadata attributes used for digitized images, such as those listed in Figure 3.6, can be stored as either regular structured attributes or text items. Once collected, metadata can be used to retrieve images using either exact match on attribute values or text search on descriptive text fields. Most image retrieval systems utilize this approach. For example, a Google search for images about humpback whales listed over 15,000 links to images based on the text (captions, titles, file names) accompanying the images (July 26th, 2006).
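A minimal sketch of this metadata-based approach, assuming a hypothetical images table modeled on the structure in Figure 3.6:

-- Exact match on a structured attribute combined with a
-- text search on a free-text description field.
select image_id, title
from   images
where  capture_date >= date '2006-01-01'
and    lower(description) like '%humpback whale%';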

As noted earlier, images are strings of pixels with no explicit relationship to the following pixel(s) other than their serial position. Unlike text documents, there is no image vocabulary that can be used to index the semantic content. Instead, image pixel analysis routines extract dominant low-level features, such as the distribution of the colors and texture(s) used, and the location(s) of identifiable shapes. This data is used to generate a signature for each image that can be indexed and used to match a similar signature generated for a visual query, i.e. a query based on an image example. Unfortunately, using low-level features does not necessarily give a good semantic result for image retrieval.

5. What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

    Ans:

Features of Distributed vs. Centralized Databases, or Differences in Distributed & Centralized Databases

    Centralized Control vs. Decentralized Control

In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who have responsibility for their local databases.

    Data Independence

In centralized databases, data independence means the actual organization of data is transparent to the application programmer. Programs are written with a "conceptual" view of the data (called the "conceptual schema"), and the programs are unaffected by the physical organization of data. In distributed databases, another aspect, distribution transparency, is added to the notion of data independence as used in centralized databases. Distribution transparency means programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.


    Reduction of Redundancy

In centralized databases, redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

    Complex Physical Structures and Efficient Access

In centralized databases, complex access structures like secondary indexes and inter-file chains are used; all these features provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer.

Problems faced in the design of an optimizer can be classified into two categories:

a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites.

    b) Local optimization consists of deciding how to perform the local database accesses at each site.

    Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

    Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed.

In distributed databases, local administrators face the same problems as well as two new aspects of the problem: (a) security (protection) problems arise because the communication networks intrinsic to distributed database systems must themselves be protected, and (b) owners of databases with a high degree of "site autonomy" may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

    Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory, a distributed system can handle queries more quickly than a centralized


one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are the variable processing capabilities and loadings of different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context, the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.
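A small sketch of the cost/currency trade-off just described (the object names orders_local_copy and head_office are invented): the same question can be answered cheaply from a possibly stale local replica, or more expensively but more currently from the remote master copy.

-- Cheap but possibly out of date: a local replicated copy.
select sum(amount) from orders_local_copy;

-- Current but network-expensive: the master copy at a remote node.
select sum(amount) from orders@head_office;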

    Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information like fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), which are more detailed than in centralized databases.

6. How does the process of retrieval of text differ from the retrieval of images? What are the considerations that should be taken care of during information retrieval?

    Ans:

The differences between text retrieval and image retrieval are described in the answer to Question 4 above (see Text-based Information Retrieval Systems and Image Retrieval Systems); that discussion applies here unchanged. The considerations that apply during information retrieval are discussed below.

Information Retrieval is the retrieval of documents, commonly text but also visual and audio, that describe objects and/or events of interest.

Both retrieval types match the query specifications to database values. However, while data retrieval only retrieves items that match the query specification exactly, information retrieval systems return items that are deemed (by the retrieval system) to be relevant or similar to the query terms. In the latter case, the information requester must select the items that are actually relevant to his/her request. Quick examples include a request for the balance of a bank account vs. selecting relevant links from a google.com result list.

User requests for data are typically formed as "retrieval-by-content", i.e. the user asks for data related to some desired property or information characteristic. These requests or queries must be specified using one of the query languages supported by the DMS query processing subsystem. A query language is tailored to the data type(s) of the data collection.
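To summarize the data retrieval vs. information retrieval distinction in code (a hedged sketch; the accounts table is invented, and the docs table with its Oracle Text index is the one assumed in the earlier example):

-- Data retrieval: exact match, returning precisely the rows
-- that satisfy the predicate (a bank-balance lookup).
select balance from accounts where account_no = 12345;

-- Information retrieval: ranked similarity, returning items the
-- system deems relevant to the query terms.
select doc_id, score(1) as relevance
from   docs
where  contains(body, 'database AND management', 1) > 0
order  by relevance desc;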