4/17/07, Tecnológico de Monterrey, SMU CSE 8337
DATA WAREHOUSING & INFORMATION RETRIEVAL
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
PO Box 750122
Dallas, Texas 75275-0122
214-768-3087
[email protected]
The contents of this presentation draw extensively from slides for: Data Mining, Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DW&IR Outline
Introduction
Data Warehousing
Research
Summary

DW&IR Outline
Introduction
  – Data Warehousing Overview
  – Information Retrieval
Data Warehousing
Research
Summary

Data Warehousing
“Subject-oriented, integrated, time-variant, nonvolatile” William Inmon
http://www.inmondatasystems.com/
Operational Data: Data used in the day-to-day needs of the company.
Informational Data: Supports other functions such as planning and forecasting.
Data mining tools often access data warehouses rather than operational data.

Data Warehouse Variations
Data Mart – Subset of complete data warehouse
Virtual Warehouse – Warehouse implemented as a view of operational data
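
The virtual warehouse idea can be sketched with an ordinary SQL view. This is a minimal illustration using Python's built-in sqlite3 module; the table `orders`, the view `sales_by_city`, and the sample rows are assumptions made up for the example and do not come from the slides.

```python
import sqlite3

# In-memory database standing in for an operational system (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        city     TEXT,
        amount   REAL
    );
    INSERT INTO orders (city, amount)
    VALUES ('Dallas', 120.0), ('Dallas', 80.0), ('Austin', 50.0);

    -- The "virtual warehouse": no data is copied, the summary is only a view.
    CREATE VIEW sales_by_city AS
        SELECT city, SUM(amount) AS total_amount, COUNT(*) AS num_orders
        FROM orders
        GROUP BY city;
""")


def sales_summary(connection):
    """Read the summarized view, ordered by city for stable output."""
    return connection.execute(
        "SELECT city, total_amount, num_orders FROM sales_by_city ORDER BY city"
    ).fetchall()


before = sales_summary(conn)
# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders (city, amount) VALUES ('Austin', 25.0)")
after = sales_summary(conn)
```

Because the view is computed on demand, it always reflects the current operational data, which is also why a virtual warehouse by itself does not provide the historical, static copy that a full warehouse keeps.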

Operational vs. Informational
               Operational Data      Data Warehouse
Application    OLTP                  OLAP
Use            Precise Queries       Ad Hoc
Temporal       Snapshot              Historical
Modification   Dynamic               Static
Orientation    Application           Business
Data           Operational Values    Integrated
Size           Gigabits              Terabits
Level          Detailed              Summarized
Access         Often                 Less Often
Response       Few Seconds           Minutes
Data Schema    Relational            Star/Snowflake

Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
  Find all documents about “data mining”
IR being applied to other unformatted data

DB vs IR
Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles

DB vs IR (cont’d)
Data retrieval
  – which docs contain a set of keywords?
  – well defined semantics
  – a single erroneous object implies failure!
Information retrieval
  – information about a subject or topic
  – semantics is frequently loose
  – small errors are tolerated
IR system:
  – interpret contents of information items
  – generate a ranking which reflects relevance
  – notion of relevance is most important

Information Retrieval (cont’d)
Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
  – Precision = |Relevant and Retrieved| / |Retrieved|
  – Recall = |Relevant and Retrieved| / |Relevant|
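
The two metrics above can be computed directly from sets of document ids. The sketch below is only illustrative: the similarity measure (fraction of query keywords found in the document), the threshold values, the tiny document collection, and the hand-labeled `relevant` set are all assumptions for the example, not the measure used in any particular IR system.

```python
def similarity(query, document):
    """Fraction of query keywords that occur in the document (illustrative only)."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def retrieve(query, documents, threshold):
    """Return ids of documents that are "close enough" to the query."""
    return {doc_id for doc_id, text in documents.items()
            if similarity(query, text) >= threshold}


def precision_recall(relevant, retrieved):
    """Precision = |rel and ret| / |ret|, Recall = |rel and ret| / |rel|."""
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection; `relevant` is a hand-made judgment for the query.
documents = {
    "d1": "data mining introductory and advanced topics",
    "d2": "mining equipment for coal",
    "d3": "data warehousing and olap",
    "d4": "text mining over library data",
}
relevant = {"d1", "d4"}

loose = retrieve("data mining", documents, threshold=0.5)
strict = retrieve("data mining", documents, threshold=1.0)
loose_scores = precision_recall(relevant, loose)
strict_scores = precision_recall(relevant, strict)
```

With the loose threshold every document matches at least one keyword, so recall is perfect but half of what is retrieved is irrelevant; the strict threshold returns only the two documents containing both keywords.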

IR Query Result Measures and Classification
IR Classification

DW&IR Outline
Introduction
Data Warehousing
  – Dimensional Modeling
  – OLAP
  – Decision Support Systems
Research
Summary

Data Transformation for Data Warehouse
ETL – Extract, Transform, Load
Unwanted data must be removed
Convert heterogeneous sources into one common schema
As the operational data is probably a snapshot of the data, multiple snapshots may need to be merged to create the historical view
Summarize data
New derived data
Handle missing and erroneous data
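
The transformation steps listed above can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the two snapshot layouts, the field names, the choice to drop rows with a missing quantity, and the derived `revenue` field are hypothetical and only show one way the steps might fit together.

```python
from collections import defaultdict

# Two hypothetical operational snapshots with different (heterogeneous) schemas.
store_a_snapshot = {
    "date": "2007-04-16",
    "rows": [
        {"sku": "P1", "qty": 3, "price": 2.5},
        {"sku": "P2", "qty": None, "price": 4.0},  # missing quantity
    ],
}
store_b_snapshot = {
    "date": "2007-04-17",
    "rows": [
        {"product": "P1", "units": "5", "unit_price": "2.50", "cashier": "x"},
    ],
}


def from_store_a(snapshot):
    """Extract store A rows and map them onto the common schema."""
    for row in snapshot["rows"]:
        yield {"date": snapshot["date"], "product": row["sku"],
               "quantity": row["qty"], "unit_price": row["price"]}


def from_store_b(snapshot):
    """Extract store B rows; 'cashier' is treated as unwanted data and dropped."""
    for row in snapshot["rows"]:
        yield {"date": snapshot["date"], "product": row["product"],
               "quantity": int(row["units"]),
               "unit_price": float(row["unit_price"])}


def clean(records):
    """Skip rows with missing values (one possible policy) and add derived data."""
    for record in records:
        if record["quantity"] is None or record["unit_price"] is None:
            continue
        yield {**record, "revenue": record["quantity"] * record["unit_price"]}


def revenue_by_product(warehouse):
    """Summarize the merged history: total revenue per product."""
    totals = defaultdict(float)
    for record in warehouse:
        totals[record["product"]] += record["revenue"]
    return dict(totals)


# Load: merging dated snapshots builds up the historical view.
warehouse = []
warehouse.extend(clean(from_store_a(store_a_snapshot)))
warehouse.extend(clean(from_store_b(store_b_snapshot)))
summary = revenue_by_product(warehouse)
```

Keeping the snapshot date on every loaded record is what turns a sequence of point-in-time extracts into a time-variant history.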

Data Warehouse Creation
Fig 1 [1]

Dimensional Modeling
View data in a hierarchical manner, more as business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes; axis for modeling data.
Facts: data stored
Ex: Dimensions – products, locations, date
    Facts – quantity, unit price

Multidimensional Model Example
Fig 2 [1]

Multidimensional Schemas
Star Schema shows facts and dimensions
  – Center of the star has facts shown in fact tables
  – Outside of the facts, each dimension is shown separately in dimension tables
  – Access to fact table from dimension table via join
      SELECT Quantity, Price
      FROM Facts, Location
      WHERE (Facts.LocationID = Location.LocationID) AND
            (Location.City = 'Dallas')
  – Viewed as relations; the problems are the volume of data and indexing
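
The star-schema join on this slide can be run end to end with Python's built-in sqlite3 module. The table and column names `Facts`, `Location`, `LocationID`, `City`, `Quantity`, and `Price` come from the slide's query; the extra columns (`ProductID`, `State`) and all sample rows are assumptions added so the example has something to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per location.
    CREATE TABLE Location (
        LocationID INTEGER PRIMARY KEY,
        City       TEXT,
        State      TEXT
    );
    -- Fact table at the center of the star; it references the dimension by key.
    CREATE TABLE Facts (
        ProductID  INTEGER,
        LocationID INTEGER REFERENCES Location(LocationID),
        Quantity   INTEGER,
        Price      REAL
    );
    INSERT INTO Location VALUES (1, 'Dallas', 'TX'), (2, 'Austin', 'TX');
    INSERT INTO Facts VALUES (10, 1, 5, 25.0), (11, 1, 2, 100.0), (10, 2, 7, 25.0);
""")

# Same join as on the slide, with the city passed as a bound parameter.
dallas_rows = conn.execute(
    """
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE Facts.LocationID = Location.LocationID
      AND Location.City = ?
    ORDER BY Quantity
    """,
    ("Dallas",),
).fetchall()
```

The `ORDER BY` is added only to make the result order deterministic for the example.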

OLAP
Online Analytic Processing (OLAP): provides more complex queries than OLTP.
OnLine Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view
Support ad hoc querying
Require analysis of data
Can be thought of as an extension of some of the basic aggregation functions available in SQL
OLAP tools may be used in DSS systems
Multidimensional view is fundamental

OLAP Implementations
MOLAP (Multidimensional OLAP)
  – Multidimensional Database (MDD)
  – Specialized DBMS and software system capable of supporting the multidimensional data directly
  – Data stored as an n-dimensional array (cube)
  – Indexes used to speed up processing
ROLAP (Relational OLAP)
  – Data stored in a relational database
  – ROLAP server (middleware) creates the multidimensional view for the user
  – Less complex; less efficient
HOLAP (Hybrid OLAP)
  – Not updated frequently – MDD
  – Updated frequently – RDB
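
The storage difference between MOLAP and ROLAP can be pictured with a toy example. This is a rough sketch under stated assumptions, not how any real OLAP product is implemented: the dimensions, the fact rows, and the helper names `molap_lookup` and `rolap_lookup` are all hypothetical.

```python
# Dimension members (hypothetical).
products = ["P1", "P2"]
cities = ["Dallas", "Austin"]
quarters = ["Q1", "Q2"]

# ROLAP-style storage: facts kept as relational rows
# (product, city, quarter, quantity).
fact_rows = [
    ("P1", "Dallas", "Q1", 5),
    ("P1", "Dallas", "Q2", 3),
    ("P2", "Austin", "Q1", 4),
]

# MOLAP-style storage: a dense 3-dimensional array plus one index per
# dimension that maps a member name to its position along that axis.
position = {
    "product": {name: i for i, name in enumerate(products)},
    "city": {name: i for i, name in enumerate(cities)},
    "quarter": {name: i for i, name in enumerate(quarters)},
}
cube = [[[0 for _ in quarters] for _ in cities] for _ in products]
for product, city, quarter, quantity in fact_rows:
    cube[position["product"][product]][position["city"][city]][
        position["quarter"][quarter]] = quantity


def molap_lookup(product, city, quarter):
    """Direct array access: three index lookups, no scan."""
    return cube[position["product"][product]][position["city"][city]][
        position["quarter"][quarter]]


def rolap_lookup(product, city, quarter):
    """Build the cell on demand by scanning the relational rows."""
    return sum(quantity for p, c, q, quantity in fact_rows
               if (p, c, q) == (product, city, quarter))
```

Both lookups return the same cell value; the array answers by position while the row store has to search, which is one informal way to read the "less efficient" note for ROLAP. The dense array also stores a zero for every empty combination, a cost the row store avoids.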

OLAP Operations
[Figure panels: Single Cell, Multiple Cells, Slice, Dice, Roll Up, Drill Down]

OLAP Operations
Simple query – single cell in the cube
Slice – Look at a subcube to get more specific information
Dice – Rotate cube to look at another dimension
Roll Up – Dimension Reduction; Aggregation
Drill Down
Visualization: These operations allow the OLAP users to actually “see” results of an operation.
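
A few of these operations can be sketched on a tiny cube held in a Python dictionary. The cube contents, the city-to-state hierarchy `STATE_OF`, and the function names are assumptions for illustration; drill down is shown here in its usual sense of moving from an aggregate back to the more detailed cells beneath it.

```python
from collections import defaultdict

# Hypothetical cube: (product, city, month) -> quantity sold.
cube = {
    ("P1", "Dallas", "2007-01"): 5,
    ("P1", "Dallas", "2007-02"): 3,
    ("P1", "Austin", "2007-01"): 2,
    ("P2", "Dallas", "2007-01"): 4,
    ("P2", "Tulsa", "2007-02"): 6,
}

# Hypothetical hierarchy on the location dimension: city -> state.
STATE_OF = {"Dallas": "TX", "Austin": "TX", "Tulsa": "OK"}


def single_cell(cube, product, city, month):
    """Simple query: one cell of the cube (0 if the cell is empty)."""
    return cube.get((product, city, month), 0)


def slice_cube(cube, city):
    """Slice: fix one dimension value and keep the remaining subcube."""
    return {(p, m): qty for (p, c, m), qty in cube.items() if c == city}


def roll_up_to_state(cube):
    """Roll up: aggregate the location dimension from city to state."""
    totals = defaultdict(int)
    for (p, c, m), qty in cube.items():
        totals[(p, STATE_OF[c], m)] += qty
    return dict(totals)


def drill_down(cube, product, state, month):
    """Drill down: expand one state-level cell into its city-level cells."""
    return {c: qty for (p, c, m), qty in cube.items()
            if p == product and m == month and STATE_OF[c] == state}


cell = single_cell(cube, "P1", "Dallas", "2007-02")
dallas = slice_cube(cube, "Dallas")
by_state = roll_up_to_state(cube)
detail = drill_down(cube, "P1", "TX", "2007-01")
```

Rolling up replaces the two Texas cities for product P1 in January with a single state-level total, and drilling down on that total recovers the per-city quantities that produced it.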

Relationship Between Topics

Decision Support Systems
Tools and computer systems that assist management in decision making
What if types of questions
High level decisions
Data warehouse – data which supports DSS

Data Warehouse Links
OLAP
  – http://www.olapreport.com/
General Data Warehousing
  – http://www.inmoncif.com/home/
  – http://www.datawarehouseconsulting.com/
  – http://www.datawarehousing.com/
  – http://www.dw-institute.com/
DW Products
  – http://www-306.ibm.com/software/data/informix/redbrick/
  – http://www.oracle.com/solutions/business_intelligence/dw_home.html
  – http://www.sas.com/technologies/dw/index.html
  – http://msdn2.microsoft.com/en-us/library/aa545535.aspx
  – http://www.sybase.com/detail?id=1027323
Interesting Articles
  – “Teaching Effective Methodologies to Design a Data Warehouse,” by Behrooz Seyed-Abbassi
    http://isedj.org/isecon/2001/35c/ISECON.2001.Seyed-Abbassi.pdf
  – “An Oracle DBA’s Guide to the OLAP Option,” by Mark Rittman
    http://www.dbazine.com/datawarehouse/dw-articles/rittman1

DW&IR Outline
Introduction
Data Warehousing
Research
  – Bibliomining
  – Hierarchical Multimedia IR
  – Ontology-based OLAP & IR
Summary

Bibliomining [2,3]
Data Warehousing + Data Mining + Libraries
Abstract, cleanse, summarize library data
  – Documents
  – Users (including demographics)
  – Circulation Records (including Web server records)
Privacy of utmost importance
http://www.bibliomining.com/nicholson/biblioprocess.htm [2]
http://bibliomining.com/nicholson/nicholsonbibliointro.html [3]

Hierarchical Multimedia IR [4]
DW Approach to Multimedia IR
  – Allows easier integration of multiple data types
  – Facilitates indexing
  – Facilitates searching
  – Allows data to be stored at many different granularities and dimensions
  – Data aggregation
“data warehouses are not just large databases; they are large, complex environments that integrate many technologies” [p729]
Multimedia starflake schema
  – Denormalized star dimension table
  – Normalized snowflake tables

Hierarchy of Data Cubes
Fig 4 [4]

Ontology-Based OLAP & IR [5]
Combine structured and document data obtained from Web
Global Ontology
  – Includes OLAP dimensions
  – Contains resource metadata
  – RDF based
IR based on
  – Both queries and resources represented as RDF metadata
  – http://www.w3.org/RDF/

Ontology OLAP&IR Architecture
Fig 1 [5]

OLAP Dimensions in RDF
Fig 2 [5]

DW&IR Outline
Introduction
Data Warehousing
Research
Summary

Summary
Information Retrieval is being extended to many different data types
  – Multimedia
  – Data warehouse
Data Warehousing is being extended beyond the basic business domain
Little research in combining DW and IR
“Integrating Unstructured Text into the Structured Environment: The Value Proposition,” by Bill Inmon
  – http://www.inmondatasystems.com/whitepapers/integratingunstructured.pdf

Bibliography
[1] Anne-Muriel Arigon, Anne Tchounikine, and Maryvonne Miquel, “Handling Multiple Points of View in a Multimedia Data Warehouse,” ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, August 2006, Pages 199–218.
[2] S. Nicholson, “The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making,” Information Technology and Libraries, 22(4), 2003.
[3] S. Nicholson, “The Basis for Bibliomining: Frameworks for Bringing Together Usage-Based Data Mining and Bibliometrics through Data Warehousing in Digital Library Services,” Information Processing & Management, 42(3), May 2006, pp 785–804.
[4] Jane You, Tharam Dillon, James Liu, Edwige Pissaloux, “On Hierarchical Multimedia Information Retrieval,” Proceedings of the 2001 International Conference on Image Processing, 7-10 Oct 2001, pp 729–732.
[5] Torsten Priebe and Gunther Pernul, “Ontology-based Integration of OLAP and Information Retrieval,” Proceedings of the 14th International Workshop on Database and Expert Systems Applications, 2003.