
4/17/07, Tecnológico de Monterrey, SMU CSE 8337

DATA WAREHOUSING & INFORMATION RETRIEVAL

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
PO Box 750122
Dallas, Texas 75275-0122
214-768-3087

[email protected]@engr.smu.edu

The contents of this presentation draw extensively from slides for: Data Mining, Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DW&IR Outline

Introduction
Data Warehousing
Research
Summary

DW&IR Outline

Introduction
  – Data Warehousing Overview
  – Information Retrieval
Data Warehousing
Research
Summary

Data Warehousing

“Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon, http://www.inmondatasystems.com/)
Operational Data: Data used in day to day needs of company.
Informational Data: Supports other functions such as planning and forecasting.
Data mining tools often access data warehouses rather than operational data.

Data Warehouse Variations

Data Mart – Subset of complete data warehouse
Virtual Warehouse – Warehouse implemented as a view of operational data
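
The idea of a virtual warehouse can be sketched with an ordinary SQL view. The example below is only an illustration using Python's built-in sqlite3 module; the table `orders`, the view `sales_by_city`, and all rows are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical operational table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        city TEXT,
        quantity INTEGER,
        unit_price REAL
    );
    INSERT INTO orders VALUES
        (1, 'Dallas', 2, 10.0),
        (2, 'Dallas', 1, 5.0),
        (3, 'Houston', 4, 2.5);
    -- The "virtual warehouse": nothing is copied, the view is
    -- computed from the operational rows each time it is queried.
    CREATE VIEW sales_by_city AS
        SELECT city,
               SUM(quantity) AS total_quantity,
               SUM(quantity * unit_price) AS revenue
        FROM orders
        GROUP BY city;
""")

VIEW_QUERY = "SELECT city, total_quantity, revenue FROM sales_by_city ORDER BY city"
rows = conn.execute(VIEW_QUERY).fetchall()

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'Houston', 1, 2.5)")
updated = conn.execute(VIEW_QUERY).fetchall()
```

Because the view holds no data of its own, it always reflects the current operational state; a materialized warehouse would instead keep its own historical copy.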

Operational vs. Informational

              Operational Data     Data Warehouse
Application   OLTP                 OLAP
Use           Precise Queries      Ad Hoc
Temporal      Snapshot             Historical
Modification  Dynamic              Static
Orientation   Application          Business
Data          Operational Values   Integrated
Size          Gigabits             Terabits
Level         Detailed             Summarized
Access        Often                Less Often
Response      Few Seconds          Minutes
Data Schema   Relational           Star/Snowflake

Information Retrieval

Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query: Find all documents about “data mining”
IR being applied to other unformatted data

DB vs IR

Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles

DB vs IR (cont’d)

Data retrieval
  – which docs contain a set of keywords?
  – well defined semantics
  – a single erroneous object implies failure!
Information retrieval
  – information about a subject or topic
  – semantics is frequently loose
  – small errors are tolerated
IR system:
  – interpret contents of information items
  – generate a ranking which reflects relevance
  – notion of relevance is most important

Information Retrieval (cont’d)

Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
  – Precision = |Relevant and Retrieved| / |Retrieved|
  – Recall = |Relevant and Retrieved| / |Relevant|
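
The two metrics above can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall`, the document ids, and the convention of returning 0.0 when a denominator is empty are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    # Assumption: an empty denominator yields 0.0 instead of an error.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 actually relevant,
# 2 of the relevant ones were found.
p, r = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d2", "d4", "d5"},
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here precision is 2/4 and recall is 2/3: half of what was returned is relevant, and two thirds of what is relevant was returned.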

IR Query Result Measures and Classification

IR Classification

DW&IR Outline

Introduction
Data Warehousing
  – Dimensional Modeling
  – OLAP
  – Decision Support Systems
Research
Summary

Data Transformation for Data Warehouse

ETL – Extract, Transform, Load
Unwanted data must be removed
Convert heterogeneous sources into one common schema
As the operational data is probably a snapshot of the data, multiple snapshots may need to be merged to create the historical view
Summarize data
New derived data
Handle missing and erroneous data
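
The transformation steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the two source layouts, the field names, the snapshot dates, and the policy of dropping rows with a missing amount are all hypothetical and not taken from the slides.

```python
from datetime import date

# Hypothetical snapshots from two operational sources with different layouts.
snapshot_a = [
    {"cust": "C1", "amt": "100.0", "fax": "555-0100"},
    {"cust": "C2", "amt": None, "fax": ""},
]
snapshot_b = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "C3", "amount": 80.0},
]


def transform_a(row, as_of):
    # Map source A to the common schema; "fax" is treated as unwanted data.
    amount = float(row["amt"]) if row["amt"] is not None else None
    return {"customer_id": row["cust"], "amount": amount, "as_of": as_of}


def transform_b(row, as_of):
    # Source B already uses the common field names.
    return {"customer_id": row["customer_id"],
            "amount": float(row["amount"]), "as_of": as_of}


def etl(snap_a, date_a, snap_b, date_b):
    rows = [transform_a(r, date_a) for r in snap_a]
    rows += [transform_b(r, date_b) for r in snap_b]
    # Handle missing data: in this sketch such rows are simply dropped.
    clean = [r for r in rows if r["amount"] is not None]
    # Merge the snapshots into one historical view, ordered by customer and date.
    history = sorted(clean, key=lambda r: (r["customer_id"], r["as_of"]))
    # Summarize: derive a total per customer across all snapshots.
    totals = {}
    for r in history:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
    return history, totals


history, totals = etl(snapshot_a, date(2007, 3, 31), snapshot_b, date(2007, 4, 17))
```

Dropping incomplete rows is the simplest policy; a real load might instead impute values or route such rows to an error table.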

Data Warehouse Creation

Fig 1 [1]

Dimensional Modeling

View data in a hierarchical manner more as business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes; axis for modeling data.
Facts: data stored
Ex: Dimensions – products, locations, date
    Facts – quantity, unit price

Multidimensional Model Example

Fig 2 [1]

Cube View of Data

Fig 4 [1]

Aggregation Hierarchies

Multidimensional Schemas

Star Schema shows facts and dimensions
  – Center of the star has facts shown in fact tables
  – Outside of the facts, each dimension is shown separately in dimension tables
  – Access to fact table from dimension table via join
      SELECT Quantity, Price
      FROM Facts, Location
      WHERE (Facts.LocationID = Location.LocationID) AND
            (Location.City = 'Dallas')
  – View as relations; the problems are the volume of data and indexing
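
The join above can be tried against a tiny star schema. The sketch below uses Python's built-in sqlite3 module; the `Facts` and `Location` tables and the `Quantity`, `Price`, `LocationID`, and `City` columns come from the slide's query, while the `ProductID` and `State` columns, the sample rows, and the added ORDER BY (for a stable result order) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per location.
    CREATE TABLE Location (LocationID INTEGER PRIMARY KEY, City TEXT, State TEXT);
    -- Fact table: measures plus foreign keys into the dimensions.
    CREATE TABLE Facts (
        ProductID INTEGER,
        LocationID INTEGER REFERENCES Location(LocationID),
        Quantity INTEGER,
        Price REAL
    );
""")
# Hypothetical sample rows.
conn.executemany("INSERT INTO Location VALUES (?, ?, ?)",
                 [(1, "Dallas", "TX"), (2, "Austin", "TX")])
conn.executemany("INSERT INTO Facts VALUES (?, ?, ?, ?)",
                 [(10, 1, 3, 9.5), (11, 1, 1, 24.5), (10, 2, 5, 9.5)])

# The slide's query, with the city passed as a parameter.
query = """
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE (Facts.LocationID = Location.LocationID)
      AND (Location.City = ?)
    ORDER BY Quantity
"""
dallas_sales = conn.execute(query, ("Dallas",)).fetchall()
print(dallas_sales)
```

Only the two fact rows whose `LocationID` joins to the Dallas dimension row are returned; on a realistic fact table the cost of this join is where the volume and indexing concerns arise.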

Star Schema

Flattened Star

Normalized Star

Snowflake Schema

OLAP

Online Analytic Processing (OLAP): provides more complex queries than OLTP.
Online Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view
Support ad hoc querying
Require analysis of data
Can be thought of as an extension of some of the basic aggregation functions available in SQL
OLAP tools may be used in DSS systems
Multidimensional view is fundamental

OLAP Implementations

MOLAP (Multidimensional OLAP)
  – Multidimensional Database (MDD)
  – Specialized DBMS and software system capable of supporting the multidimensional data directly
  – Data stored as an n-dimensional array (cube)
  – Indexes used to speed up processing
ROLAP (Relational OLAP)
  – Data stored in a relational database
  – ROLAP server (middleware) creates the multidimensional view for the user
  – Less complex; less efficient
HOLAP (Hybrid OLAP)
  – Not updated frequently – MDD
  – Updated frequently – RDB

OLAP Operations

Single Cell Multiple Cells Slice Dice

Roll Up

Drill Down

OLAP Operations

Simple query – single cell in the cube
Slice – Look at a subcube to get more specific information
Dice – Rotate cube to look at another dimension
Roll Up – Dimension reduction; aggregation
Drill Down
Visualization: These operations allow the OLAP users to actually “see” results of an operation.
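
Several of the operations above can be sketched on a small in-memory cube. This is an illustration only: the cube is a plain dictionary keyed by hypothetical (product, location, date) values, the function names are made up for this example, drill down is modeled as recovering the detailed cells behind one rolled-up cell, and dice is left out.

```python
from collections import defaultdict

# Hypothetical cube: (product, location, date) -> quantity.
DIMS = ("product", "location", "date")
cube = {
    ("pen", "Dallas", "2007-04"): 5,
    ("pen", "Austin", "2007-04"): 3,
    ("ink", "Dallas", "2007-04"): 2,
    ("pen", "Dallas", "2007-05"): 7,
}


def cell(cube, product, location, date):
    """Simple query: one cell of the cube (0 if the cell is empty)."""
    return cube.get((product, location, date), 0)


def slice_(cube, dim, value):
    """Slice: the subcube in which one dimension is fixed to a value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def roll_up(cube, dim):
    """Roll up: aggregate one dimension away (dimension reduction)."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for k, v in cube.items():
        out[k[:i] + k[i + 1:]] += v
    return dict(out)


def drill_down(cube, dim, rolled_key):
    """Drill down: the detailed cells that a rolled-up cell summarizes."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[:i] + k[i + 1:] == rolled_key}


one_cell = cell(cube, "pen", "Austin", "2007-04")
dallas = slice_(cube, "location", "Dallas")
by_product_location = roll_up(cube, "date")
detail = drill_down(cube, "date", ("pen", "Dallas"))
```

Rolling up over `date` sums the two monthly pen sales in Dallas into one cell, and drilling down on that cell returns the two monthly cells again.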

Relationship Between Topics

Decision Support Systems

Tools and computer systems that assist management in decision making
What if types of questions
High level decisions
Data warehouse – data which supports DSS

Data Warehouse Links

OLAP
  – http://www.olapreport.com/
General Data Warehousing
  – http://www.inmoncif.com/home/
  – http://www.datawarehouseconsulting.com/
  – http://www.datawarehousing.com/
  – http://www.dw-institute.com/
DW Products
  – http://www-306.ibm.com/software/data/informix/redbrick/
  – http://www.oracle.com/solutions/business_intelligence/dw_home.html
  – http://www.sas.com/technologies/dw/index.html
  – http://msdn2.microsoft.com/en-us/library/aa545535.aspx
  – http://www.sybase.com/detail?id=1027323
Interesting Articles
  – “Teaching Effective Methodologies to Design a Data Warehouse,” by Behrooz Seyed-Abbassi
    http://isedj.org/isecon/2001/35c/ISECON.2001.Seyed-Abbassi.pdf
  – “An Oracle DBA’s Guide to the OLAP Option,” by Mark Rittman
    http://www.dbazine.com/datawarehouse/dw-articles/rittman1

DW&IR Outline

Introduction
Data Warehousing
Research
  – Bibliomining
  – Hierarchical Multimedia IR
  – Ontology-based OLAP & IR
Summary

Bibliomining [2,3]

Data Warehousing + Data Mining + Libraries
Abstract, cleanse, summarize library data
  – Documents
  – Users (including demographics)
  – Circulation Records (including Web server records)
Privacy of utmost importance
http://www.bibliomining.com/nicholson/biblioprocess.htm [2]
http://bibliomining.com/nicholson/nicholsonbibliointro.html [3]

Hierarchical Multimedia IR [4]

DW Approach to Multimedia IR
  – Allows easier integration of multiple data types
  – Facilitates indexing
  – Facilitates searching
  – Allows data to be stored at many different granularities and dimensions
  – Data aggregation
“data warehouses are not just large databases; they are large, complex environments that integrate many technologies” [p729]
Multimedia starflake schema
  – Denormalized star dimension table
  – Normalized snowflake tables

Starflake

Fig 2 [4]

Hierarchy of Data Cubes

Fig 4 [4]

Ontology-Based OLAP & IR [5]

Combine structured and document data obtained from Web
Global Ontology
  – Includes OLAP dimensions
  – Contains resource metadata
  – RDF based
IR based on
  – Both queries and resources represented as RDF metadata
  – http://www.w3.org/RDF/
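
To make the idea of representing both queries and resources as metadata more concrete, the sketch below stores resource descriptions as (subject, predicate, object) triples and treats a query as a list of triple patterns. It is a simplified illustration with plain Python tuples, not RDF syntax and not the matching method of [5]; every identifier, predicate, and value in it is hypothetical.

```python
# Hypothetical resource metadata as (subject, predicate, object) triples.
resources = {
    ("doc1", "type", "Report"),
    ("doc1", "about", "Dallas"),
    ("doc2", "type", "Report"),
    ("doc2", "about", "Austin"),
    ("cube1", "type", "OLAPCube"),
    ("cube1", "about", "Dallas"),
}


def match(pattern, triple):
    """True if the triple fits the pattern; None in a pattern is a wildcard."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def subjects_matching(patterns, triples):
    """Subjects for which every query pattern matches one of their triples."""
    result = set()
    for subj in {s for s, _, _ in triples}:
        own = [t for t in triples if t[0] == subj]
        if all(any(match(p, t) for t in own) for p in patterns):
            result.add(subj)
    return result


# Query 1: anything (document or cube) about Dallas.
about_dallas = subjects_matching([(None, "about", "Dallas")], resources)

# Query 2: only reports about Dallas.
reports_on_dallas = subjects_matching(
    [(None, "type", "Report"), (None, "about", "Dallas")], resources
)
```

Because structured resources (the cube) and documents share one metadata vocabulary, a single query can return both kinds of resource.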

Ontology OLAP&IR Architecture

Fig 1 [5]

OLAP Dimensions in RDF

Fig 2 [5]

RDF Query

Fig 6 [5]

DW&IR Outline

Introduction
Data Warehousing
Research
Summary

Summary

Information Retrieval is being extended to many different data types
  – Multimedia
  – Data warehouse
Data Warehousing is being extended beyond the basic business domain
Little research in combining DW and IR
“Integrating Unstructured Text into the Structured Environment: The Value Proposition”, by Bill Inmon
  – http://www.inmondatasystems.com/whitepapers/integratingunstructured.pdf

Bibliography

[1] Anne-Muriel Arigon, Anne Tchounikine, and Maryvonne Miquel, “Handling Multiple Points of View in a Multimedia Data Warehouse,” ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, August 2006, pp. 199–218.
[2] S. Nicholson, “The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making,” Information Technology and Libraries, 22(4), 2003.
[3] S. Nicholson, “The Basis for Bibliomining: Frameworks for Bringing Together Usage-Based Data Mining and Bibliometrics through Data Warehousing in Digital Library Services,” Information Processing & Management, 42(3), May 2006, pp. 785–804.
[4] Jane You, Tharam Dillon, James Liu, and Edwige Pissaloux, “On Hierarchical Multimedia Information Retrieval,” Proceedings of the 2001 International Conference on Image Processing, 7–10 Oct 2001, pp. 729–732.
[5] Torsten Priebe and Gunther Pernul, “Ontology-based Integration of OLAP and Information Retrieval,” Proceedings of the 14th International Workshop on Database and Expert Systems Applications, 2003.
