Fundamentals of Data Warehousing
Fundamentals of Data Warehousing
A Business Analytics Course
University of the Philippines Open University
Dr. Eugene Rex Jalao Dr. Ria Mae Borromeo
Asst. Prof. Mari Anjeli Crisanto Ms. Marie Karen Enrile
Course Writers
COMMISSION ON HIGHER EDUCATION
Welcome! This course is designed to introduce you to the fundamentals of data
warehousing for managers. Data warehousing is used in business intelligence, enabling
managers to make critical decisions based on different business transactions. Managers
of businesses should be able to see opportunities for exploiting data coming from
transactions using data warehousing. This course provides a discussion on how to adopt data
warehousing as an approach for managing data, highlighting the resources needed to roll
out a data warehouse. Before taking this course, you should have completed the
Fundamentals of Business Analytics course (BAFBANA).
This course guide is your road map to BAFWARE. Please read it thoroughly before starting
on the course work. You will need to refer to this course guide from time to time as you go
through the course in the next six weeks.
COURSE OBJECTIVES
At the end of the course, you should be able to:
1. Understand database management systems;
2. Discuss the key concepts of data warehousing;
3. Identify opportunities for data warehousing;
4. Identify resources needed for data warehousing;
5. Write a project charter for a data warehouse project;
6. Communicate data requirements;
7. Describe data inference considerations, interestingness metrics, and complexity
considerations;
8. Understand various techniques used for post-processing of discovered structures
and visualization;
9. Describe formalized means of organizing and storing of documents and other
content in an organization related to the organization’s processes;
10. Identify the advantages and disadvantages of data warehousing; and
11. Develop an awareness of the ethical norms as required under policies and
applicable laws governing confidentiality and non-disclosure of
data/information/documents and proper conduct in the learning process and
application of business analytics.
COURSE OUTLINE
BAFWARE consists of seven modules and runs for six weeks.
MODULE 1. Database Management Systems
A. Database Systems
B. Functions and Components of a Database Management System
C. Databases and Normalization
D. Entity Relationship Diagram and Relational Modeling
E. Case Study
MODULE 2. Data Warehousing
A. Data Warehouses and Data Marts
B. Alternate Data Warehousing Architecture
C. Case Study
MODULE 3. The Kimball Lifecycle
A. Background and Parts of the Kimball Lifecycle
B. Kimball Lifecycle Technology Track
C. Kimball Lifecycle Data Track & Application Track
D. Case Study
MODULE 4. Dimensional Modeling
A. Dimensional Modeling
B. Fact Tables
C. Dimension Tables
D. Case Study
MODULE 5. ETL (Extraction, Transformation, Loading)
A. Overview
B. Case Study
MODULE 6. Post-Processing and Visualization of Data Inside the Data Warehouse
A. Exercises using R
B. Case Study
MODULE 7. Opportunities and Ethics
A. Opportunities for Data Warehousing
B. Ethics in Data Warehousing
C. Privacy Issues
D. Case Study
COURSE MATERIALS
Your learning package for this course consists of:
1. This course guide;
2. Study guides for each module, with lecture notes and learning activity guides;
3. Video lectures; and
4. Additional reading materials in digital form.
All learning resources will be made available for downloading so you can review them as
often as you wish without having to go online.
STUDY SCHEDULE
You will be learning through independent study combined with collaborative learning in the online discussions. Discussions will be asynchronous, meaning you and your classmates may log in and post your contributions to the discussion whenever you are available, not necessarily at the same time as in a chat.
In general, it is up to you to decide how many hours to spend on each module, including the online discussions and other learning activities. Discussion forums, however, will be open only for two weeks each.
You can use the study schedule below as a guide to pace yourself accordingly. However, make sure you note the dates for important activities like quizzes and discussion forums as they will only be open on the dates specified.
Week | Topic/s | Activity
1-3 | Course Overview | 1. Read the course guide. 2. Introduce yourself in the “Self-Introductions” forum.
1-3 | Module 1 | 1. Go through the Study Guide for Module 1 and complete the individual learning activities, including viewing the video lecture titled “Database Management Systems” by Asst. Professor Reginald Neil Recario. 2. The Self-Introductions forum closes. Join Discussion Forum 1.
4 | Module 2 | 1. Go through the Study Guide for Module 2 and complete the individual learning activities, including viewing the video lecture on “Data Warehousing” (0:00-10:08) by Asst. Professor Mari Anjeli Crisanto. 2. Take Quiz 1.
5 | Module 3 | 1. Go through the Study Guide for Module 3 and complete the individual learning activities, including viewing the video lecture on “Data Warehousing Lifecycle and Project Management” by Raymond Lagria. 2. Read "Kimball DW/BI Lifecycle Methodology". 3. Discussion Forum 1 closes. Join Discussion Forum 2.
6-7 | Module 4 | 1. Go through the Study Guide for Module 4 and complete the learning activities. 2. View the video lectures on “Dimensional Modeling”, “Designing Fact Tables”, and “Designing Dimension Tables” by Dr. Eugene Rex Jalao. 3. Take Quiz 2.
8 | | Do Assignment 1: Dimensional Modelling Case Study on the Northwind Database.
9-10 | Module 5 | 1. Go through the Study Guide for Module 5 and complete the learning activities. 2. View “Extraction, Transformation, Loading” by Raymond Lagria for Module 5. 3. Discussion Forum 2 closes.
11 | | Do Assignment 2: ETL Planning: Source to Target Mapping and Data Profiling.
12-13 | Module 6 | 1. Go through the Study Guide for Module 6 and complete the individual learning activities. 2. Read "Comprehensive Guide to Data Visualization in R" and "R-analyst Cheat sheet: Data Visualization in R". 3. View “Data Post-Processing” by Raymond Lagria.
14-15 | Module 7 | 1. Go through the Study Guide for Module 7 and complete the individual learning activities, including viewing the video lecture on “Opportunities and Ethics in Data Warehousing” by Asst. Professor Mari Anjeli Crisanto. 2. Read "Benefits of data warehouses for business" and "9 Disadvantages and Limitations of Data Warehouse". 3. Join Discussion Forum 3.
16 | | Final Exam
COURSE REQUIREMENTS
To earn a certificate of completion for this course, you must do the following:
1. Participate in all online discussions (i.e. the self-introductions forum and the three
class discussions).
2. Take the two online quizzes.
3. Complete the two assignments.
4. Complete a final exam.
ONLINE DISCUSSION FORUMS
BAFWARE will have three online discussions where you can share your thoughts and
learnings with your classmates. Take note that the forums will only be open for two weeks
and will close down once the next discussion forum opens.
Guide questions based on the week’s module will be posted in each discussion forum.
Answer those questions intelligently and don’t forget to cite references you may have used
for research. Make sure that you also reply to at least one or two other answers from your
classmates. However, take care to reply constructively and respectfully, following the
proper rules of netiquette.
ONLINE QUIZZES
Several modules will have accompanying online quizzes where you can test your
knowledge and understanding. Each quiz can be taken at any time within the week for which it is
scheduled. However, take note that a time limit will apply once you begin the quiz. Your
score and feedback will be shown automatically once you are done.
ASSIGNMENTS
Two assignments will be given to allow you to practice what you have learned in
BAFWARE. These assignments will be marked by experts and you will have to get passing
scores for both to get a certificate of completion. Details for each assignment will be
posted in the course site.
GENERAL GUIDELINES
Here are some of our guidelines for this class:
1. Check MyPortal frequently and participate in all the activities.
2. Practice academic integrity. Cheating and plagiarism of any kind will not be
tolerated.
3. Uphold honor and excellence. Simply give your best, your 110% if you can.
4. LEARN. Learn from this course, learn from yourself, and learn from your
classmates.
MODULE 1: DATABASE MANAGEMENT SYSTEMS
Introduction
As we learn about data warehousing and how this can be applied to businesses, it is
important to first understand the basic concepts of database management systems. Data
warehouses utilize information from different database management systems. Managers
need to know key points about database management systems so that they can have a
deeper understanding of how data warehouses work.
Learning Objectives
After working on this module, you should be able to:
1. Explain what database management systems are;
2. Describe the functions and components of database management systems;
3. Perform database normalization;
4. Create a simple entity relationship diagram; and
5. Identify database management systems used in businesses.
1.1. What are database management systems?
In order to understand what database management systems are, we must first understand what a database is. In computer terms, a database is a collection of data. Typically, it is the data of one specific enterprise.
A database is not necessarily always stored in a computer. Records kept in a filing cabinet or in a notebook can also be considered a database. However, such manual methods of storing information are often less efficient than using a computer, and in particular less efficient than using a database management system.
What then are database management systems or DBMSs? A DBMS is a collection of interrelated data plus the software and hardware used to access the data in a useful manner.
Study Question
Do you agree that using a DBMS is more efficient compared to using manual methods of
storing information? Why or why not?
1.2. What are the functions and components of a database management system?
The main functions of a DBMS include the following (among many others):
a. The manipulation of data;
b. The definition of your database;
c. The processing of your data; and
d. The sharing of your data.
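The first three functions listed above can be sketched with a small example. This is only an illustration, assuming SQLite (through Python's built-in sqlite3 module) as the DBMS; the employee table, its columns, and the sample values are all hypothetical. Sharing data among several users is not shown here.

```python
import sqlite3

# An in-memory SQLite database stands in for the DBMS (an assumption for
# illustration; the table, columns, and values below are hypothetical).
conn = sqlite3.connect(":memory:")

# Definition: describe the structure of the database.
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL)"
)

# Manipulation: add records and change an existing one.
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [("Ana", 30000.0), ("Ben", 42000.0), ("Carla", 36000.0)],
)
conn.execute("UPDATE employee SET salary = 45000.0 WHERE name = 'Ben'")

# Processing: ask the DBMS to compute an answer from the stored data.
average_salary = conn.execute("SELECT AVG(salary) FROM employee").fetchone()[0]
print(average_salary)
```

In each step the user states what is wanted and the DBMS decides how to store and retrieve the data.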
Note that a DBMS is only one component of what is known as a database system. A
database system has these four components:
a. Users
b. Database Application
c. DBMS
d. Database
Study Questions
1. How does the DBMS perform the functionalities listed in this module?
2. How do the different components of a database system relate to one another?
1.3. Databases and Normalization
Now that we know what databases and database management systems are, let us take a closer look at the data that goes into a database. Our databases will be made up of records which are in turn made up of fields. You can think of records as the individual items that go into the databases.
In designing databases, it is helpful to first identify what records should be included in the system. Once you have identified these records, you can then start to create a DESIGN for the database.
The design should be NORMALIZED. There are several levels of normalization: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, Boyce-Codd Normal Form (BCNF), 4th Normal Form, and even 5th Normal Form. We normalize in order to reduce data redundancy and improve data integrity. For most purposes, normalizing up to 3rd Normal Form will be enough.
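To make the idea of reducing redundancy concrete, here is a minimal sketch in Python. It assumes a hypothetical flat list of order records in which customer and product details are repeated, and splits it into three tables (customers, products, orders). The field names and values are invented for illustration, and the sketch assumes each order contains exactly one product.

```python
# Unnormalized records: customer and product details repeat on every order.
# All names and values are hypothetical.
flat_orders = [
    {"order_id": 1, "customer": "Ana Cruz", "city": "Manila", "product": "Tea", "price": 120.0},
    {"order_id": 2, "customer": "Ana Cruz", "city": "Manila", "product": "Coffee", "price": 150.0},
    {"order_id": 3, "customer": "Ben Lim", "city": "Cebu", "product": "Tea", "price": 120.0},
]

customers = {}  # customer name -> customer record, stored once
products = {}   # product name -> product record, stored once
orders = []     # each order keeps only keys that point to the other tables

for row in flat_orders:
    customer = customers.setdefault(
        row["customer"],
        {"customer_id": len(customers) + 1, "name": row["customer"], "city": row["city"]},
    )
    product = products.setdefault(
        row["product"],
        {"product_id": len(products) + 1, "name": row["product"], "price": row["price"]},
    )
    orders.append(
        {
            "order_id": row["order_id"],
            "customer_id": customer["customer_id"],
            "product_id": product["product_id"],
        }
    )

print(len(customers), len(products), len(orders))
```

After the split, a change to a customer's city or a product's price is made in one place only, which is the data-integrity benefit that normalization aims for.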
Study Question
Why is it important for managers to know how normalized databases are created?
Activity 1-1
Objective: To create a normalized database design.
Task: Think of a simple database management system an organization might want to
have. Identify at least 10 records that will go into the database. Group these records into
tables based on similar fields. Convert the resulting tables to 3rd Normal Form.
Tools & Resources (Video): Database Management Systems by Asst. Professor Mari
Anjeli Crisanto (10:13-10:38)
1.4. Entity Relationship Diagrams and Relational Modeling
Tables created during database normalization often correspond to ENTITIES, which are
used in relational modeling. Most DBMS packages for microcomputers make use of the
Relational Data Model, which highlights relationships between entities. We use Entity
Relationship Diagrams (ERDs) to design and visualize Relational Data Models.
Entity Relationship Diagrams are composed of:
a. Entities – the representation we use to contain information on one real-world
person, object, place, etc. These are represented by rectangles in an ERD.
b. Attributes – properties that describe entities. They correspond to the fields in
records and are represented by ovals.
c. Relationships – how entities are connected to one another. They are represented by
diamonds.
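One common way to carry an ERD into a relational database is sketched below: each entity becomes a table, each attribute becomes a column, and a relationship becomes a foreign key. The example assumes SQLite through Python's sqlite3 module; the customer and customer_order entities and the "places" relationship between them are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    -- Entity: customer, with attributes name and city
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    );
    -- Entity: customer_order; the relationship "customer places order"
    -- is represented by the customer_id foreign key
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        order_date  TEXT NOT NULL,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    """
)

conn.execute("INSERT INTO customer VALUES (1, 'Ana Cruz', 'Manila')")
conn.execute("INSERT INTO customer_order VALUES (10, '2024-01-15', 1)")

# An order that points to a customer that does not exist breaks the relationship.
try:
    conn.execute("INSERT INTO customer_order VALUES (11, '2024-01-16', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
print(orphan_rejected, order_count)
```

The rejected insert shows how a relationship drawn in the diagram becomes a rule that the DBMS can enforce.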
Activity 1-2
Objective: To create a simple entity relationship diagram.
Task: From the normalized tables in Activity 1-1, create a simple ERD. Add relationships
and attributes that you might have missed out during the initial process.
Tools & Resources: Your output from Activity 1-1 and Database Management Systems by
Asst. Professor Mari Anjeli Crisanto (15:41-22:36)
Study Question
Why is it important for managers to know how entity relationship diagrams are
designed?
1.5. Case Study
Read through the Northwind Business Case Study. The document discusses
implementing Dimensional Modeling, but let us take a few steps back and first try to
envision a Database Management System for the company. What will its DBMS look like?
What information should go there? The document already contains an Entity-Relationship
Model (another representation of the ERD) and this will give you an idea of the data to be
stored in Northwind Business’ DBMS.
Study Questions
1. What businesses or organizations are you involved with? Will they benefit from using
a Database Management System?
2. Identify what DBMS would be useful for this business. If there are many, list down
as many as you can.
3. Take one DBMS from the list. Enumerate and explain the processes connecting the
business and the DBMS.
References
Database Management Systems (Video) by Reginald Neil Recario
Database Management Systems (Video) by Mari Anjeli Crisanto
Case Study: Dimensional Modeling - Northwind Business (Document) by Eugene Rex
Jalao
MODULE 2: DATA WAREHOUSING
Introduction
Imagine a large organization with different departments, each with its own database
system. A business analyst would like to generate reports for decision support. She
approaches each department but has problems with some of them, whose main role is
to handle data transactions, not reports. Those that do give her information provide data
in a number of different formats. Customer names are saved differently, birthdates are in
mm/dd/yy and dd/mm/yy, and so on. Wouldn’t it save the business analyst so much time
and effort if there were a central repository containing the information she needs to
generate her reports, with the data in a standardized format, too?
In this module, we will learn about data warehousing which makes tasks like the above
easier to handle.
Learning Objectives
After working on this module, you should be able to:
1. Discuss the key concepts of data warehousing; and
2. Identify resources needed for data warehousing.
2.1. Data Warehouses and Data Marts
A data warehouse is a physical repository where relational data are specially organized to
provide enterprise-wide, cleansed data in a standardized format.
In our previous module, we have learned what database systems are. In turn, a data
warehouse is a collection of integrated, subject-oriented databases. Each unit of data is
non-volatile and relevant to some moment in time.
Data in data warehouses are often NOT in 3NF; they are DENORMALIZED. Since they are
not normalized, some data may be redundant, and the redundancies make the stored data
larger. However, denormalized data is more useful for DECISION SUPPORT.
This is good since the purpose of a data warehouse is to provide aggregate data for decision
making. You are not that interested in what the data for each table are; you are more
interested in how the company will move forward given that data.
There may be questions or decisions which are specialized for specific people. Thus,
separate entities called DATA MARTS are used to provide specialized and strategic
answers for specific people. This keeps it simple for the users. Small problems are easier
to solve.
A data mart, therefore, is a subset of the data warehouse that supports the requirements
of a particular department or business function.
A data mart is a departmental data warehouse that stores only relevant data. Data marts
can be dependent or independent. A dependent data mart is a subset that is created
directly from a data warehouse. An independent data mart, on the other hand, is a small
data warehouse designed for a strategic business unit or a department.
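As a rough illustration of a dependent data mart, the sketch below builds a small mart directly from a warehouse table by keeping only the rows and columns one department needs. It assumes SQLite through Python's sqlite3 module; the warehouse_sales and marketing_mart tables, their columns, and the sample figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for an enterprise-wide warehouse table (hypothetical data).
conn.execute(
    "CREATE TABLE warehouse_sales ("
    "sale_id INTEGER PRIMARY KEY, department TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales (department, region, amount) VALUES (?, ?, ?)",
    [
        ("Marketing", "Luzon", 500.0),
        ("Marketing", "Visayas", 300.0),
        ("Finance", "Luzon", 900.0),
    ],
)

# Dependent data mart: a subset created directly from the data warehouse,
# holding only the data relevant to one department.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT sale_id, region, amount FROM warehouse_sales "
    "WHERE department = 'Marketing'"
)

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

An independent data mart would instead be loaded straight from the source systems, without a central warehouse table to select from.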
Study Question
How will organizations benefit from data warehouses and data marts?
2.2. Alternate Data Warehousing Architecture
Alternative data warehousing architectures include:
a. Independent Data Marts
b. Data Mart Bus Architecture
c. Hub-and-Spoke Architecture
d. Centralized Data Warehouse
e. Federated Data Warehouse
Study Questions
1. How are the alternative data warehousing architectures different from the usual
architecture?
2. Discuss the advantages and disadvantages of the different alternative data
warehousing architectures.
2.3. Case Study
Let’s go back to the Northwind Business Case Study. Will the company benefit from a
Data Warehouse? We can see from the document that it has already decided to build a
Business Intelligence Data Warehouse (BIWD). The company has done so because it is
interested in analyzing its sales and shipping activities and decisions so that it can improve
its customer order process. In our next modules, we will see how Northwind will shift from
a DBMS design to a BIWD.
Activity 2-1
Objective: To identify resources needed for data warehousing.
Task: Identify a business or organization that might benefit from using data warehouses
and data marts. List down the resources they will need to get these up and running.
Tools & Resources (Video): Data Warehousing (0:00-10:08) by Asst. Professor Mari
Anjeli Crisanto
References
Data Warehousing (Video) by Mari Anjeli Crisanto
Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene
Rex Jalao
https://www.youtube.com/watch?v=zTs5zjSXnvs&t=293s&list=WL&index=20
https://www.youtube.com/watch?v=l74BAViTVns&t=194s&list=WL&index=21
Case Study: Dimensional Modeling - Northwind Business (Document) by Eugene Rex
Jalao
MODULE 3: THE KIMBALL LIFECYCLE
Introduction
We will now learn about a lifecycle used by data warehouse and business intelligence
project teams. This is the Kimball Lifecycle, which was known as the Business Dimensional
Lifecycle before 2008.
Learning Objectives
After working on this module, you should be able to:
1. Enumerate and describe the different stages of the Kimball Lifecycle; and
2. Write a project charter for a data warehouse project.
3.1. Background and Parts of the Kimball Lifecycle
The Kimball Lifecycle focuses on adding business value across the enterprise and
dimensionally structures the data that's delivered to the business. It uses iterations and
increments in a manageable lifecycle to do this.
There have been two main approaches to building data warehouses with data marts.
The first is an approach by Bill Inmon and the second is the approach by Ralph Kimball.
Bill Inmon’s approach works this way:
a. The enterprise data warehouse (EDW) should be in at least 3rd normal form.
b. The data marts, however, should be in dimensional form.
c. The EDW is built using a Big Bang approach.
Meanwhile, here is Kimball’s approach:
a. The EDW is based on a dimensional model design.
b. The focus is on user-friendliness and ease of use.
c. The EDW is developed piece by piece, on a departmental basis.
Kimball’s approach is more practical, more interpretable, easier to implement, and less
costly, based on industry best practices. It involves the following steps:
a. Program/Project Planning and Management
b. Deployment
c. Maintenance
d. Growth
Study Question
What happens during the different steps or stages in the Kimball Lifecycle?
3.2. The Kimball Lifecycle Technology Track
The first stage in the Kimball Lifecycle involves planning. Planning for three streams
happens simultaneously. These streams are:
a. Technology Track
b. Data Track
• Dimensional Modeling
• Physical Design
• ETL (Extraction, Transformation, Loading) Design and Development
c. Application Track
• Business Intelligence Application Design
• Business Intelligence Application Development
The technology track involves technical architectural design and product selection and
installation.
The following processes occur in the technical architectural design:
a. Consideration of business requirements, current technical environment, and
planned strategic technical directions
b. Designing the back room architecture
• Designing the ETL (data staging) environment
• Identifying the DBMS, operating system, and hardware environment
c. Designing the front room architecture
d. Designing the infrastructure and metadata
e. Managing security requirements
As for product selection and installation, the processes are:
a. Evaluation and selection of the following tools:
• Hardware platform
• DBMS
• ETL tool (data staging tool)
• BI tool (end user data access tool)
b. Installation and testing to assure end-to-end integration
c. Training of team
Study Questions
1. Who are the people involved in technical architectural design?
2. Who are those involved in product selection and installation?
3.3. Kimball Lifecycle Data Track
The data track involves dimensional modeling, physical design, and ETL design and
development. We will cover dimensional modeling and the processes involved in Module
4. Meanwhile, we will talk more about ETL in Module 5.
Study Question
Who are the people tasked to do the processes in the data track?
3.4. Kimball Lifecycle Application Track
The application track involves business application design and business application
development.
Business application design involves:
a. Identifying standard analytic and report requirements to meet 80% – 90% of user
needs
b. Planning and assuring ad hoc query and reporting capability
c. Developing report templates for report families
d. Getting user signoff on report templates and committing to them
e. Identifying metrics, metric calculations, and Key Performance Indicators (KPIs)
Meanwhile, business application development ideally involves using a single advanced BI
tool that meets all user needs. Advanced tools provide significant productivity gains for the
application development team. Good BI design enables end users to modify existing
reports and develop ad hoc reports quickly without going to IT.
Study Question
Who makes up the workforce of the application track?
3.5. Case Study
Watch the “Data Warehousing Lifecycle and Project Management” video by Raymond
Lagria at 11:45. The lecture discusses a case study for BigCo and how a project charter
was created for this company.
Activity 3-1
Objective: Write a project charter for a data warehouse project.
Task: Look for examples of project charters for data warehouse projects. Create one
following the steps in the Kimball Lifecycle.
Tools & Resources (Video): “Data Warehousing Lifecycle and Project Management” by
Prof. Raymond Lagria
References
Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene
Rex Jalao
Kimball DW/BI Lifecycle Methodology: http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dw-bi-lifecycle-method/
Data Warehousing Lifecycle and Project Management (Video) by Raymond Lagria
MODULE 4: DIMENSIONAL MODELING
Introduction
We have learned in Module 3 that the Data Track stream in the Kimball Lifecycle
involves dimensional modeling. Dimensional modeling is a logical design technique for
structuring data so that it is intuitive for business users and delivers fast query
performance. We will take a closer look at the process involved here in this module.
Learning Objectives
After working on this module, you should be able to:
1. Explain the concept of dimensional modeling;
2. Discuss fact tables and dimension tables; and
3. Understand the conversion of the E/R model to a dimensional model using
Dimensional Normal Form (DNF) methodology.
4.1. Dimensional Modeling
In Module 1, we have learned about relational modeling. Relational modeling is widely
used in databases nowadays. However, dimensional modeling has two advantages over
relational modeling. These are understandability and performance. The model must be
easily understood by business users while representing the complexities of the business.
It must also have fast response to queries that summarize millions of rows.
Dimensional models also have the following benefits:
1. Predictable, Standard Framework
2. Gracefully Extensible to Accommodate Change
3. Star Join Schema is Symmetrical
4. Has Standard Approaches for Common Modeling Situations
5. Aggregate Management
To design a dimensional model, we must perform the following steps:
1. Establish Naming Conventions
2. Do the Four-Step Dimensional Modeling Process
3. Document the High Level Data Model Diagram
4. Define the Data Sources
5. Document the Detailed Table Designs
6. Develop Detailed Bus Matrix
7. Identify, Track, and Resolve Issues
Let us now dig deeper into dimensional modeling and discuss fact tables and
dimension tables.
4.2. Fact Tables
Let us first determine what makes up a “fact”. Measurements are numeric values called
facts. Examples are sales amount and count of attendance. Dimensions, meanwhile,
describe the “who, what, where, when, why, and how” of the facts. For example,
dimensions for sales amount would be quarter and product, giving sales by quarter and
sales by product.
A dimensional model consists of a fact table containing measurements surrounded by a
halo of dimension tables containing textual context. It is known as a star join and as a
star schema when stored in a relational database.
Fact tables contain the numeric measurements needed to perform
decision analysis and query reporting in the star schema.
Here are some more fact table facts:
1. A fact is a performance measure. For example, "Sales of Product X".
2. Fact values are not known in advance. They are only known when event
measurement occurs.
3. Facts are numeric.
4. The most useful facts are numeric and additive.
Fact tables are usually the largest tables. A single fact table can contain either detail or
summarized data. They are primarily joined to dimension tables through foreign keys.
The business definition of the measurement event that produces the fact table is called
the fact table's grain. Declaring the grain means stating what a single fact table row
represents, by filling in the blank in this statement: “A fact row is created when ____ occurs.”
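The structure described above can be sketched as a very small star schema. This is an illustration only, assuming SQLite through Python's sqlite3 module; the table names (fact_sales, dim_date, dim_product), the columns, and the figures are hypothetical. The grain assumed here is "a fact row is created when a product is sold on a given date".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold the textual context of each measurement.
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        quarter   TEXT
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT
    );
    -- The fact table holds the numeric, additive measurement and
    -- foreign keys that point to the dimension tables.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        sales_amount REAL
    );
    """
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2024-02-10", "Q1"), (2, "2024-05-03", "Q2")],
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)", [(1, "Tea"), (2, "Coffee")]
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 2, 250.0), (2, 1, 175.0)],
)

# Sales by quarter: join the fact table to one dimension and add up the fact.
sales_by_quarter = conn.execute(
    "SELECT d.quarter, SUM(f.sales_amount) "
    "FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key "
    "GROUP BY d.quarter ORDER BY d.quarter"
).fetchall()
print(sales_by_quarter)
```

Swapping dim_date for dim_product in the join would give sales by product from the same fact rows, which is why additive facts are the most useful.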
4.3. Dimension Tables
In a star schema, dimension tables contain classification and aggregation information
about the values in the fact table.
Dimension tables contain the parameters by which the fact table measures are analyzed.
For example, the amount sold is analyzed by day, month, quarter, or year. Or the amount
sold on sunny days vs. rainy days, and so on.
Dimension tables provide the context to the fact table measures they describe. They also
contain descriptors of the business, utilizing business terminology. They have many large
columns, contain textual and discrete data, and are usually smaller than fact tables.
Dimension tables have a single-column surrogate primary key (called the warehouse
dimension key) and are joined to a fact table through a foreign key reference to that
primary key. Dimension tables can contain one or more hierarchies. These hierarchies are
de-normalized into the dimension tables.
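A minimal sketch of these two ideas follows, assuming SQLite through Python's sqlite3 module. The product dimension below has a surrogate key generated by the warehouse, keeps the source system's own code as an ordinary column, and stores the product-to-category hierarchy in the same row. All table names, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,  -- surrogate key assigned by the warehouse
        product_code  TEXT,                 -- key used by the source system
        product_name  TEXT,
        category_name TEXT                  -- hierarchy level de-normalized into the row
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER
    );
    """
)

source_products = [
    ("P-100", "Green Tea", "Beverages"),
    ("P-200", "Espresso", "Beverages"),
    ("P-300", "Rice Cake", "Snacks"),
]

# Assign a surrogate key to each product and remember it by source code.
key_by_code = {}
for code, name, category in source_products:
    cursor = conn.execute(
        "INSERT INTO dim_product (product_code, product_name, category_name) "
        "VALUES (?, ?, ?)",
        (code, name, category),
    )
    key_by_code[code] = cursor.lastrowid

conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [(key_by_code["P-100"], 4), (key_by_code["P-200"], 6), (key_by_code["P-300"], 5)],
)

# Rolling up to the category level needs only one join, because the
# hierarchy lives inside the dimension table.
quantity_by_category = conn.execute(
    "SELECT d.category_name, SUM(f.quantity) "
    "FROM fact_sales f JOIN dim_product d ON d.product_key = f.product_key "
    "GROUP BY d.category_name ORDER BY d.category_name"
).fetchall()
print(quantity_by_category)
```

In a fully normalized design the category would sit in its own table, and the same report would need an extra join.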
Dimension tables can be classified into the following:
1. Date Based
2. Time Based
3. Business Entities
4. Analytical Profiles
5. Correlated Entities
6. Versions of Business Entities
7. Flags and Indicators
8. Degenerate Dimensions
Now how do we generate dimensional models? The Dimensional Normal Form is a
creative and practical approach originated by Mike Schmitz to design Dimension Table
Families. Here, fact tables are highly normalized for maintainability and flexibility.
Dimensions have their hierarchies de-normalized into them for usability and performance.
Its schema is limited to two levels: a single first-level, central, highly normalized
table called the fact table, and multiple second-level tables called dimension tables,
linked to the first-level table primarily in one-to-many relationships.
Study Question
How is the Dimensional Normal Form different from the other normalized forms
discussed in Module 1?
4.4. Case Study
Let’s go back to the Northwind Business Case Study. It is now time to see how their
system translates into a dimensional model. An Excel file is provided for designing and submitting
your solution. From the generated dimensional model, what are the SQL scripts needed
for each of the reports below?
1. What were Northwind’s top selling products? This month? This quarter? YTD? This
month last year? Last YTD?
2. Who are the best customers in terms of sales? How many orders did these best
customers place last month? What was the average order amount? What was the
average number of items per order per customer?
3. How many orders were shipped on time? Late? How late? Who is the top
performing shipping company?
4. How much did Northwind sell by each product category in each time period?
5. Which employee sold the most orders?
Study Question
Why is it important for managers to know how normalized databases are created?
Activity 4-1
Objective: Understand the conversion of the E/R model to a dimensional model using Dimensional Normal Form (DNF) methodology.
Task: Do Case 1: Dimensional Modelling Case Study on the Northwind Database
Tools & Resources: “Dimensional Modeling”, “Designing Fact Tables”, and “Designing
Dimension Tables” by Dr. Eugene Rex Jalao (Videos); “Case Study: Dimensional Modeling
- Northwind Business” by Dr. Eugene Rex Jalao (Document)
References
Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene
Rex Jalao
Designing Fact Tables (Slides) by Eugene Rex Jalao
Introduction to Dimensional Modeling (Slides) by Eugene Rex Jalao
Designing Dimension Tables (Slides) by Eugene Rex Jalao
Case Study: Dimensional Modeling - Northwind Business (Document) by Eugene Rex
Jalao
MODULE 5: EXTRACTION, TRANSFORMATION, LOADING (ETL)
Introduction
We learned in Module 3 that the Data Track in the Kimball Lifecycle involves the ETL
(Extraction, Transformation, Loading) process. In this module, we take a closer look
at that process.
Learning Objectives
After working on this module, you should be able to:
1. Discuss the steps in ETL; and
2. Identify instances where ETL would be necessary in an organization.
5.1. ETL Overview
ETL is mostly done by business analytics people following an information technology track. However, it is useful for managers to know what happens during ETL. The objective of ETL is to get data out of the source and load it into the data warehouse. At its simplest, it is a process of copying data from one database to another: data is extracted from a source database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. The process is usually handled using scripts written in Structured Query Language (SQL), a special-purpose programming language designed for managing data held in a relational database.
In extraction, data is extracted from heterogeneous data sources. Each data source has its own distinct set of characteristics that must be managed and integrated into the ETL system in order to extract data effectively. This is usually done using SQL SELECT statements.
Transformation is the main step where ETL adds value. It changes the data and provides guidance on whether the data can be used for its intended purposes. For example, "Male" is changed to "M" and "Yes" is changed to "1". This is performed in a staging area.
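To make the transformation step concrete, here is a minimal sketch in Python. Only the "Male" to "M" and "Yes" to "1" recodings come from the text above; the "Female" and "No" mappings, the column names, and the sample rows are assumptions made for illustration.

```python
# Hypothetical code mappings. A real project would take these from its
# source-to-target specification instead of hard-coding them.
GENDER_CODES = {"male": "M", "female": "F"}
FLAG_CODES = {"yes": "1", "no": "0"}


def transform_row(row):
    """Return a cleaned copy of one extracted row (a dict)."""
    gender = row["gender"].strip().lower()
    active = row["active"].strip().lower()
    return {
        "customer_id": row["customer_id"],
        # Values with no mapping become None so they can be reviewed later.
        "gender": GENDER_CODES.get(gender),
        "active": FLAG_CODES.get(active),
    }


# Made-up rows standing in for data extracted from a source system.
extracted = [
    {"customer_id": 1, "gender": "Male", "active": "Yes"},
    {"customer_id": 2, "gender": " female", "active": "no"},
    {"customer_id": 3, "gender": "unknown", "active": "Yes"},
]

# The list of cleaned rows plays the role of the staging area.
staged = [transform_row(row) for row in extracted]
```

Returning None for unmapped values, rather than guessing, is one way to surface data quality problems before the data reaches the warehouse.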
Finally, in loading, data is loaded into the data warehouse tables. Here, surrogate keys are created and assigned. This is usually done using SQL INSERT statements.
ETL is often a major failure point in data warehousing because the effort involved in the ETL process is underestimated. Underestimating data quality problems and the work needed to provide contextual history are also prime culprits. The ETL process should therefore not be taken for granted.
ETL is not a one-time event: new data is added to the data warehouse periodically, whether monthly, daily, or hourly. Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be automated, well documented, and easy to change.
Several companies offer strong ETL tools and a fairly complete suite of supplementary tools. There are three general types of source-to-target tools:
1. Code generators - These actually compile ETL code, typically in COBOL, which is used by several large companies that run mainframes.
2. Engine based - These have easy-to-use graphic interfaces and interpreter style programs.
3. Database based - These involve manual coding using SQL statements augmented by scripts.
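The database-based style in item 3 can be sketched as a short script that wraps SQL statements. The example below is a hedged illustration, not a production design: it uses Python's built-in sqlite3 module, a made-up source table named src_products, and hypothetical column names. Only the D_Product table name is taken from the course's case study. It also shows a surrogate key being assigned at load time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Made-up source table standing in for an operational database.
CREATE TABLE src_products (ProductID TEXT, ProductName TEXT, Discontinued TEXT);
INSERT INTO src_products VALUES
    ('P-10', 'Chai', 'No'), ('P-20', 'Tofu', 'Yes');

-- Warehouse dimension: product_key is a surrogate key generated on load.
CREATE TABLE D_Product (
    product_key INTEGER PRIMARY KEY AUTOINCREMENT,
    source_product_id TEXT,
    product_name TEXT,
    discontinued_flag INTEGER
);
""")


def run_etl(connection):
    """Copy products from the source table into the dimension table."""
    # Extract: a SELECT statement against the source.
    rows = connection.execute(
        "SELECT ProductID, ProductName, Discontinued "
        "FROM src_products ORDER BY ProductID"
    ).fetchall()
    # Transform: recode 'Yes'/'No' as 1/0.
    cleaned = [(pid, name, 1 if disc == "Yes" else 0) for pid, name, disc in rows]
    # Load: INSERT statements; the database assigns the surrogate key.
    connection.executemany(
        "INSERT INTO D_Product "
        "(source_product_id, product_name, discontinued_flag) VALUES (?, ?, ?)",
        cleaned,
    )
    connection.commit()


run_etl(conn)
loaded = conn.execute("SELECT * FROM D_Product ORDER BY product_key").fetchall()
```

Keeping the source identifier next to the surrogate key lets later loads match incoming rows to rows already in the warehouse.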
Well known ETL tools are the following:
1. Commercial
   a. Ab Initio
   b. IBM DataStage
   c. Informatica PowerCenter
   d. Microsoft Data Integration Services
   e. Oracle Data Integrator
   f. SAP Business Objects - Data Integrator
   g. SAS Data Integration Studio
2. Open-Source Based
   a. Adeptia Integration Suite
   b. Apatar
   c. CloverETL
   d. Pentaho Data Integration (Kettle)
   e. Talend Open Studio/Integration Suite
   f. R/RStudio
Take note that the "best" tool does not exist. You will have to choose based on your own needs. You should also check first whether the standard tools from the big vendors already meet those needs.
Study Question
Why is it important for managers to know the processes involved in ETL?
Activity 5-1
Objective: To identify instances where ETL would be necessary in an organization.
Task: From Activity 3-1, identify what data would need to undergo ETL. What would their
final forms be?
Tools & Resources (Video): Activity 3-1 and “Extraction, Transformation, and Loading”
by Raymond Lagria
5.2. Case Study
In this case study, we profile the data in the Northwind Database. Specifically, we are
looking at issues with column values and row duplications.
Extract all the data from the Northwind Access Database into an Excel sheet and open it
in MS Excel. Use Excel’s AutoFilter function to complete the data profile table found in
the Data_Profile_Template.xls file.
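The same kind of checks that AutoFilter supports by hand (blank values, distinct values, repeated rows) can also be scripted. The sketch below is only an illustration in Python: the function names are hypothetical, and the sample rows are made up rather than taken from the actual Northwind extract.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, blanks, distinct values, top value."""
    values = [row.get(column) for row in rows]
    filled = [v for v in values if v not in (None, "")]
    counts = Counter(filled)
    return {
        "rows": len(values),
        "blank": len(values) - len(filled),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


def duplicate_rows(rows):
    """Return each fully repeated row with the number of times it appears."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample rows standing in for an extracted customer table.
customers = [
    {"CustomerID": "ALFKI", "Country": "Germany"},
    {"CustomerID": "ANATR", "Country": "Mexico"},
    {"CustomerID": "ANATR", "Country": "Mexico"},
    {"CustomerID": "BERGS", "Country": ""},
]

country_profile = profile_column(customers, "Country")
repeats = duplicate_rows(customers)
```

Here country_profile reports one blank value and two distinct countries, and repeats flags the row that appears twice.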
Also develop the High Level Source-to-Target Map for the Northwind data warehouse.
Use the S2T Map Template.xls file. Develop the following:
1. High Level Source-to-Target Map for all tables
2. Detailed S2T Map for the Product Dimension (D_Product) and Order Transaction
Fact (F_Order_Transaction) table.
References
Introduction to Data Warehousing and Enterprise Data Management (Slides) by Eugene Rex Jalao
Extraction, Transformation, and Loading (Video) by Raymond Lagria
Data Profiling and Source to Target Mapping (Document) by Eugene Rex Jalao
MODULE 6: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE
Introduction
Let us now learn how we can post-process and visualize the data inside the data
warehouse.
Learning Objectives
After working on this module, you should be able to:
1. Understand various techniques used for post-processing of discovered structures
and visualization.
6.1. Exercises using R
First, what is R? R is an integrated suite of software facilities for data manipulation,
calculation and graphical display.
It has an effective data handling and storage facility. It also has a large, coherent,
integrated collection of intermediate tools for data analysis. In addition, it has graphical
facilities for data analysis and display either directly at the computer or on hard copy.
Take note that R is not a database but connects to a DBMS. It is not a spreadsheet view
of data, but it connects to Excel/MS Office.
R is free and open source, though it has a steep learning curve. The RStudio IDE is a
powerful and productive third-party user interface for R. It is free, open source, and
works great on Windows, Mac, and Linux.
Exercises for this session will include the following:
1. Working with dataset Wage
2. Studying, reducing and structuring the dataset
3. Plotting the dataset
4. Introducing a business analytics task for the dataset
5. Working with another dataset
In post-processing, remember that data extracted from a data warehouse, or pieces
of knowledge extracted from an initial data mining task, can be processed further. We
can simplify the data, apply descriptive statistics, do visualization or graphing tasks, or
apply further business analytics tools.
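The course exercises do this in R. Purely to illustrate the idea of simplifying a dataset and then applying descriptive statistics, here is a small sketch in Python using only the standard library. The records are made up and only loosely modeled on a wage dataset; the column names are assumptions.

```python
from statistics import mean, median

# Made-up records; the columns are hypothetical.
records = [
    {"age": 25, "education": "HS Grad", "wage": 70.0},
    {"age": 40, "education": "College Grad", "wage": 120.0},
    {"age": 52, "education": "College Grad", "wage": 140.0},
    {"age": 33, "education": "HS Grad", "wage": 90.0},
]


def summarize(rows, column):
    """Return basic descriptive statistics for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Simplify the data first (keep one group), then describe it.
college = [row for row in records if row["education"] == "College Grad"]

overall = summarize(records, "wage")
college_summary = summarize(college, "wage")
```

Comparing overall with college_summary is the kind of quick check that usually precedes plotting or regression.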
Watch the "Data Post-processing" video by Raymond Lagria to understand
preliminaries, data frames, reading data, subsetting, graphing and plotting, and
regression analysis in R.
Always remember to transform your dataset into your desired format before applying
further data mining techniques.
Study Question
If you were a business manager, what types of visualizations for the data warehouse’s
data would you like to see?
6.2. Case Study
Let us continue to see how post-processing and plotting is done with R in the “Data
Post-processing” Video by Raymond Lagria.
References
https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/
Data Post-Processing (Slides) by Raymond Lagria
Data Post-Processing (Video) by Raymond Lagria
MODULE 7: OPPORTUNITIES AND ETHICS
Introduction
Finally, let us discuss the opportunities and ethics surrounding data warehousing.
Learning Objectives
After working on this module, you should be able to:
1. Identify advantages and disadvantages of data warehousing;
2. Develop an awareness of the ethical norms as required under policies and
applicable laws governing confidentiality and non-disclosure of
data/information/documents and proper conduct in the learning process and
application of business analytics.
7.1. Opportunities for Data Warehousing
Data warehousing, like anything else, has both advantages and disadvantages. Its advantages include:
a. Better decision-making
b. Quick and easy access to data
c. Data quality and consistency
As for its disadvantages, here are the considerations:
a. Maintenance costs can outweigh the benefits
b. Data ownership must be considered
c. Rigidity of data
d. Underestimation of ETL processing time
e. Hidden problems of the source
f. Inability to capture required data
g. Increased demands of the users
h. Long-duration project
i. Complications
Study Questions
Do you think the advantages outweigh the disadvantages? What can be done to
address the disadvantages?
7.2. Ethical Concerns in Data Warehousing
Data warehousing takes information from different databases as well as external sources
and puts them inside a repository which can be accessed by end-users who need decision
support.
Thus, there are ethical issues to consider, especially when some data should be accessible
only at the departmental level or only at certain levels of the organization. Remember,
there is a chance that end-users may have access to information that they should not be
examining. They may be breaking privacy laws without knowing it.
Ethics should also be considered even if data are just used in the testing phase. For
example, while testing the data warehouse, is it alright to move small data sets from source
systems to target systems for testing purposes? It is not actually ethical to do so. While
testing, sometimes users are learning things they shouldn’t know or things they aren’t
allowed to know.
What about external data, or data that has already been made available to the public? Is
it ethical to integrate everything into the data warehouse? The project manager must
decide which information is acceptable to integrate. Although the information is publicly
available, using some of it might raise ethical considerations. The ethics would focus
on how the information is used, and by whom.
Study Question
What other ethical considerations for data warehousing are you aware of?
7.3. Checklist for Ethical Concerns in Data Warehousing
Here is a checklist of items project managers and technology implementers can use to
manage ethical concerns:
• Develop service level agreements with end users that define who has access to
what levels of information
• Have end-users involved in defining the ethical standards of use for the data
that will be delivered.
• Define the bounds around the integration efforts of public data, where it will be
integrated and where it will not – so as to avoid conflicts of interest.
• Do not use “live” or real data for testing purposes – or lock down the test
environment; too often test environments are left wide-open and accessible to
too many individuals.
• Define where, how, and by whom data mining will be used, and restrict the mining
efforts to specific sets of information. Build a notification system to monitor data
mining usage.
• Allow customers to “block” the integration of their own information (this one is
questionable), depending on whether the customer information will be made
available on the web after integration.
• Remember that any efforts made are still subject to governmental laws. What
laws do we have right now concerned with data privacy? Note that future laws
could also be developed and we must be aware of those.
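As one illustration of the checklist item about not using live data for testing, the sketch below replaces identifying values with tokens before rows are copied to a test environment. It is a minimal example under stated assumptions: the function names, the column names, and the sample row are hypothetical. Note that this is pseudonymization, not full anonymization, so the test environment should still be locked down.

```python
import hashlib


def pseudonymize(value, salt):
    """Replace an identifying value with a short, stable token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_for_testing(rows, sensitive_columns, salt):
    """Return copies of the rows with the sensitive columns replaced by tokens."""
    masked = []
    for row in rows:
        copy = dict(row)  # leave the original row untouched
        for column in sensitive_columns:
            if copy.get(column) is not None:
                copy[column] = pseudonymize(str(copy[column]), salt)
        masked.append(copy)
    return masked


# Made-up customer row; ContactName is treated as sensitive here.
live_rows = [{"CustomerID": "C001", "ContactName": "Juan Dela Cruz", "Country": "Philippines"}]
test_rows = mask_for_testing(live_rows, ["ContactName"], salt="keep-this-secret")
```

Because the same input and salt always give the same token, masked rows can still be joined and counted during testing without exposing the original names.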
Activity 7-1
Objective: Develop an awareness of the ethical norms in data warehousing.
Task: Use the checklist for ethical considerations in data warehousing and check
whether the project charter created in Activity 3-1 has any part which could be unethical.
Tools & Resources: Opportunities and Ethics in Data Warehousing (Asst. Professor Mari
Anjeli Crisanto) and Activity 3-1
Task: Think of a simple database management system an organization might want to
have. Identify at least 10 records that will go into the database. Group these records into
tables based on similar fields. Convert the resulting tables to 3rd Normal Form.
Tools & Resources (Video): Database Management Systems by Asst. Professor Mari
Anjeli Crisanto (10:13-10:38)
7.4. Privacy Issues
Are you familiar with the Data Privacy Act? It was implemented “to protect the
fundamental human right of privacy, of communication while ensuring free flow of
information to promote innovation and growth.” (Republic Act No. 10173, Ch. 1, Sec. 2)
The law specifies that consent is needed before the collection of all personal data. The
data subject must also be informed of the extent to which their personal information will
be processed.
This becomes a big consideration when we implement data warehouses because the data
warehouse might access information which a person may have given consent to be
accessible only at a certain level.
Businesses and IT developers must be well aware of laws such as these so that they can
ensure that their database or data warehouses comply with all of the law’s stipulations.
Study Question
Why is data privacy important?
7.5. Case Studies
Let’s take a look at the Northwind Business Case Study one last time. What ethical
considerations are relevant to this company? Are there any privacy issues that it has to
consider when building the Business Intelligence Data Warehouse?
References
http://www.techadvisory.org/2015/03/benefits-of-data-warehouses-for-business/
http://whatisdbms.com/9-disadvantages-and-limitations-of-data-warehouse/
Opportunities and Ethics in Data Warehousing (Asst. Professor Mari Anjeli Crisanto)
http://tdan.com/data-warehousing-ethical-concerns-security-access-and-control/5186
Case Study: Dimensional Modeling - Northwind Business (Dr. Eugene Rex Jalao)
DISCUSSION FORUM TOPICS
DISCUSSION FORUM 1
Database Management Systems or Data Warehousing?
Week Open: Week 2
Week Closes: Week 5
Guide Question: Would your company or organization benefit from a Database
Management System? What about a Data Warehouse? If you were the manager, which
of the two would be the better fit for your company? State your reasons.
DISCUSSION FORUM 2
Module 3 - The Kimball Lifecycle
Week Open: Week 5
Week Closes: Week 10
Guide Question: Which part of the Kimball Lifecycle would you be most involved in?
Discuss why.
DISCUSSION FORUM 3
Module 7 - Opportunities and Ethics
Week Open: Week 14
Week Closes: Week 16
Guide Question: Aside from those discussed in the video and study guide, what other
opportunities in data warehousing are there and what other ethical considerations do you
think should be looked into? Share these with the class.
QUIZZES
QUIZ 1
Topics covered: Database Management Systems, Data Warehousing
Week scheduled: Week 4
1) A database is always stored in a computer.
a) True
b) False
2) A _____ is a collection of interrelated data plus the software and hardware used to
access the data in a useful manner.
a) Database
b) Database Management System
c) Data Warehouse
d) Data Mart
3) A ______ is a physical repository where relational data are specially organized to
provide enterprise-wide, cleansed data in a standardized format.
a) Database
b) Database Management System
c) Data Warehouse
d) Data Mart
4) Which among the following is a function of a DBMS?
a) The manipulation of data
b) The definition of your database
c) The processing of your data
d) All of those mentioned
e) None of those mentioned
5) Which is not part of a database system?
a) Users
b) Database Application
c) DBMS
d) All of those mentioned
e) None of those mentioned
6) These are individual items that go into a database.
a) Entities
b) Attributes
c) Records
d) Fields
7) Big data is not useful for decision support.
a) True
b) False
8) A ____ is a subset of the data warehouse that supports the requirements of a particular
department or business function.
a) Database
b) Database Management System
c) Data Mart
d) Big Data
9) _________ Diagrams are composed of entities, attributes, and relationships.
a) Entity Attribute
b) Entity Relationship
c) Attribute Relationship
d) Entity Attribute Relationship
10) Tables in a DBMS should be normalized.
a) True
b) False
QUIZ 2
Topics covered: The Kimball Lifecycle, Dimensional Modeling
Week scheduled: Week 7
1) Which of the following is not included in Kimball’s approach to building data
warehouses?
a) The enterprise data warehouse (EDW) should be in at least 3rd normal form.
b) The EDW is based on dimensional model design
c) Focus on user-friendliness and ease of use
d) Develop EDW on a departmental basis piece by piece
2) Which of the following is considered to be part of the steps in the Kimball Lifecycle?
a) Program/Project Planning and management
b) Deployment
c) Maintenance
d) Growth
e) All of those mentioned
f) None of those mentioned
3) The ____ track involves dimensional modeling, physical design, and ETL (Extraction,
Transformation, Loading) design and development
a) Planning
b) Technology
c) Data
d) Application
4) The _____ track involves architectural design and product selection and installation.
a) Planning
b) Technology
c) Data
d) Application
5) The ____ track involves identifying standard analytic and report requirements to meet
80% – 90% of user needs.
a) Planning
b) Technology
c) Data
d) Application
6) Dimensional Modeling is a logical design technique for structuring data so that it is
intuitive for business users and delivers moderate query performance.
a) True
b) False
7) Dimensional Modeling’s advantages over Relational Modeling are understandability
and performance.
a) True
b) False
8) Examples of ____ are sales amount and count of attendance.
a) Facts
b) Tables
c) Dimensions
d) Models
9) ____ describe the “who, what, where, when, why, and how”.
a) Facts
b) Tables
c) Dimensions
d) Models
10) Fact tables are usually very small.
a) True
b) False