S4I Nanni 1 Intro to BI and Data Warehousing
Transcript of S4I Nanni 1 Intro to BI and Data Warehousing
![Page 1: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/1.jpg)
Master of Science in Engineering in Computer Science (MSE-CS)
DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE
ANTONIO RUBERTI
(MSE-CS)
Seminars in Software and Services for the Information Society
Umberto Nanni
Introduction to
1Seminars of Software and Services for the Information SocietyUmberto Nanni
Introduction to Datawarehousing
![Page 2: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/2.jpg)
Business Intelligence Architecture
management
management
system
goals results
management
system
rep
ort
ing
OLA
P
da
ta m
inin
g
Datawarehouse
Datamart-1 Datamart-2 Datamart-3
KPI DSS MKT HRCRM …
system
2Seminars of Software and Services for the Information SocietyUmberto Nanni
operational
system
ETL systems
external
data
sources
service
systems
ERP
systems
internet /
extranet
operational system
![Page 3: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/3.jpg)
What is Data Warehousing
Collection of methods, technologies and tools to
assist the “knowledge worker” (manager,
analyst) to conduct data analysis aimed at
supporting decision-making and/or improving
3Seminars of Software and Services for the Information SocietyUmberto Nanni
the management of information assets
![Page 4: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/4.jpg)
What is a Data Warehouse
A data warehouse is a collection of data
• integrated (far beyond the organization)
• consistent (despite the heterogeneous origin)
• focused (an interest area is defined)
4Seminars of Software and Services for the Information SocietyUmberto Nanni
• historical (over a consistent timeframe)
• permanent (never delete your data!)
![Page 5: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/5.jpg)
Purpose of a Data Warehouse
A Data Warehouse helps (allows) you:
• to take decisions
• to identify and interpret phenomena
• to make predictions about the future
5Seminars of Software and Services for the Information SocietyUmberto Nanni
• to make predictions about the future
• to control a complex system
![Page 6: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/6.jpg)
Value and quantity of information
value
BD
competitors
marketing
prices
sales
logistics
value
strategic
information
reports
selected
$$$$
competitors prices
6Seminars of Software and Services for the Information SocietyUmberto Nanni
quantity
primary
information
sources
selected
information
$$$
![Page 7: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/7.jpg)
OLTP & OLAP
OLTP - On-Line Transaction Processing
– realm of (write and / or read) transactions, recovery,
consistencyconsistency
– many, fast and frequent operations
– high level of concurrency
– access to a small amount of data
– on-the-fly data update
OLAP - On-Line Analytical Processing
read only
7Seminars of Software and Services for the Information SocietyUmberto Nanni
– read only
– few operations
– low level of concurrency
– access to huge amounts of data
– historical but essentially static data
![Page 8: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/8.jpg)
Separation between:
Operational Database & Data Warehouse
• different computational load
• different needs:• different needs:
– DB: dynamic data, asynchronous updates
– DW: static data, periodic updates
• integration with business activity:
– DB: supporting operations (focused, timely)
– DW: supporting decisions (descriptive, historical)
8Seminars of Software and Services for the Information SocietyUmberto Nanni
– DW: supporting decisions (descriptive, historical)
• data collection:
– DB: minimal
– DW: maximal
![Page 9: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/9.jpg)
Two issues with different perspectives
• Data redundancy
– OLTP (DB): to avoid, bringing to inconsistency – OLTP (DB): to avoid, bringing to inconsistency
and/or inefficiency on updates
– OLAP (DW): redundancy avoids recomputation
and shorten response time
• Indexing
– OLTP (DB): good when you search – bad when you
9Seminars of Software and Services for the Information SocietyUmberto Nanni
– OLTP (DB): good when you search – bad when you
update... you need some trade-off
– OLAP (DW): the more, the best
![Page 10: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/10.jpg)
Some Data Warehouse Systems
• Oracle 12
• IBM InfoSphere Warehouse• IBM InfoSphere Warehouse
• Microsoft SQL-Server 2012 – Analysis Services
• Sybase IQ
• Hyperion (bought by Oracle)
• Teradata (division of NCR)
10Seminars of Software and Services for the Information SocietyUmberto Nanni
• Teradata (division of NCR)
• Netezza – Cognos (bought by IBM)
• Business Objects (bought by SAP)
![Page 11: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/11.jpg)
A comparison by Gartner
11Seminars of Software and Services for the Information SocietyUmberto Nanni
Donald Feinberg, Mark A. Beyer
Magic Quadrant for Data Warehouse
Database Management Systems
Gartner RAS Core Research Note G00173535, 28 January 2010
![Page 12: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/12.jpg)
Architectures for Datawarehousing: issues
• separating OLTP & OLAP
• scalability
• extensibility
• security
12Seminars of Software and Services for the Information SocietyUmberto Nanni
• security
• administrability
![Page 13: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/13.jpg)
Architecture for Datawarehousing
• determined by design choices
• determined by / determines the choice of a • determined by / determines the choice of a
software system
• determines the cost and makes possible
future integration (quantitative and / or
qualitative)
13Seminars of Software and Services for the Information SocietyUmberto Nanni
qualitative)
• affects the cost of data processing
![Page 14: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/14.jpg)
Data Mart
Collection of data focused on particular user profile or
on particular target analysison particular target analysis
Alternatives:
1. dependent Data Mart: it is a subset and/or an aggregation of
data in the primary DW
→ DM extracted from a DW
2. independent Data Mart: it is a subset and/or an aggregation
14Seminars of Software and Services for the Information SocietyUmberto Nanni
2. independent Data Mart: it is a subset and/or an aggregation
of data in the operational DB
→ DW=Ui(DMi), that is, DW is a set of DM
3. hybrid solution, combining 1, 2
![Page 15: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/15.jpg)
DW architecture: 1 Level
• there is only an operational DW
• virtual DB (no OLTP-OLAP separation)
• data coincident with DB operational
• difficult integration with other sources
data - level 1 middleware
(copy of)
operational
15Seminars of Software and Services for the Information SocietyUmberto Nanni
sources warehouse analysis
operational
DB
external
sources
![Page 16: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/16.jpg)
DW architecture: 2 Levels – dependent DMs
• data sources complemented with external sources
• running on dedicated software platform
• ETL: Extraction, Transformation, Loading• ETL: Extraction, Transformation, Loading
• materialization of the DW
• materialization of Data Marts
oper
BDDataMart
ETL
data - level 1 data - level 2
16Seminars of Software and Services for the Information SocietyUmberto Nanni
BD
ext
BD
sources warehouse analysisfeeding
DWMart
DataMart
![Page 17: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/17.jpg)
DW architecture: 2 Levels – independent DMs
• Data Mart are materialized by feeding
• DW = union of DMs• DW = union of DMs
oper
BDDataMart
ETL
data - level 1 data - level 2
17Seminars of Software and Services for the Information SocietyUmberto Nanni
sources warehouse analysisfeeding
BD
ext
BD
Mart
DataMart
![Page 18: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/18.jpg)
DW architecture: 3 Levels
• a level of "reconciled" data (operational data store) is introduced
• separation into two phases of ETL activities:• separation into two phases of ETL activities:
1. extraction / transformation
2. loading
oper
BDDataMart
ET(L)
data - level 1 data - level 2 data - level 3
18Seminars of Software and Services for the Information SocietyUmberto Nanni
BD
ext
BD
DWMart
DataMart
reconcilied
data
loading
sources warehouse analysisfeeding
![Page 19: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/19.jpg)
ETL: Extraction, Transformation, Loading
extraction
Operational Data, External Data
• extraction
• cleaning - validation - filtering
• transformation
Reconciled Data
19Seminars of Software and Services for the Information SocietyUmberto Nanni
• loading
Data Warehouse
![Page 20: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/20.jpg)
Extraction
• initial extraction:
– targeted at the creation of the DW– targeted at the creation of the DW
• furter extractions:
– static (replaces the whole DW)
– incremental
20Seminars of Software and Services for the Information SocietyUmberto Nanni
– incremental
• log
• timestamp
![Page 21: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/21.jpg)
Cleaning
• changing VALUES
• duplicates• duplicates
• inconsistencies
– domain violation
– functional dependency violation
• null values
• misuse of fields
21Seminars of Software and Services for the Information SocietyUmberto Nanni
• misuse of fields
• spelling
• abbreviations (not homogeneous)
![Page 22: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/22.jpg)
Transformation
• changing FORMATS:
• misalignment of formats
• field overloading
• unhomogeneous coding
22Seminars of Software and Services for the Information SocietyUmberto Nanni
• unhomogeneous coding
![Page 23: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/23.jpg)
Loading
• Refresh:
ex-novo load of the whole DWex-novo load of the whole DW
• Update:
differential updates
23Seminars of Software and Services for the Information SocietyUmberto Nanni
![Page 24: S4I Nanni 1 Intro to BI and Data Warehousing](https://reader034.fdocuments.in/reader034/viewer/2022051801/6281563989b5761c490de424/html5/thumbnails/24.jpg)
Metadata
• internal metadata
– concerning the administration of the DW (i.e., sources, transformations,
schemas, users, etc..)schemas, users, etc..)
• external metadata
– interesting for users (e.g., measurement units, possible combinations)
• STANDARDs
• CWM - Common Warehouse Model (OMG), defined by:
– UML (Unified Modeling Language)
24Seminars of Software and Services for the Information SocietyUmberto Nanni
– UML (Unified Modeling Language)
– XML (eXtensible Markup Language)
– XMI (XML Metadata Interchange)
OMG = Object Management Group: CORBA (Common Object Request Broker Architecture), UML
(Unified Modeling Language), MDA (Model-Driven Architecture)