JBIMS MIM Second Year (2015 – 18)
Data Management | 15-I-131 Mufaddal Nullwala
What is a Database?
A database is a collection of interrelated data, or an organised mechanism to manage, store, and retrieve data.
Properties of a Database:
• Efficient
• Robust
• Stable
Examples:
• Students Information
• Bank Registrar / Book of Accounts
• Employees Master
What is a Database Management System?
It is software used to manage and access a database in an efficient way.
Advantages:
• Gives you data whenever you require it, in a few clicks
• Makes searching for critical information easy
Examples:
• Oracle 11g
• MSSQL
• MySQL
ER Diagram
An ER Diagram is a visual representation of data that describes how data are related to each other.
Components of an E-R Diagram:
• Entity - An entity can be any object, place, person or class.
• Attribute - An attribute describes a property or characteristic of an entity.
• Relationship - A relationship describes relations between entities.
Relationships in an ER Diagram
For a binary relationship set the mapping cardinality must be one of the following types:
One to one
One to many
Many to one
Many to many
Going up in this structure is called generalisation, where entities are clubbed together to represent a more generalised view.
Specialisation is the opposite of generalisation. In specialisation, a group of entities is divided into sub-groups based on their characteristics.
Database Keys:
Keys are used to establish and identify relationships between tables.
Types of Keys:
PRIMARY KEY
• Serves as the row-level addressing mechanism in the relational database model.
• It can be formed through the combination of several columns.
• Indicates uniqueness among records or rows in a table.
FOREIGN KEY
• A column or set of columns within a table that are required to match those of a primary key of a second table.
• It is the primary key of another table; this is the only way join relationships can be established.
Example: In Table A, Parcel No. is the primary key; in Table B it acts as a foreign key.
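A minimal sketch of the parcel example above, using Python's built-in sqlite3 module. The table and column names (`parcel`, `shipment`, `parcel_no`) are hypothetical, chosen to mirror Tables A and B; the point is that the foreign key enforces the one-to-many link between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on

# Table A: "parcel" owns the primary key
conn.execute("""CREATE TABLE parcel (
    parcel_no INTEGER PRIMARY KEY,
    owner     TEXT)""")

# Table B: "shipment" references it as a foreign key; one parcel
# can appear in many shipment rows (a one-to-many relationship)
conn.execute("""CREATE TABLE shipment (
    shipment_id INTEGER PRIMARY KEY,
    parcel_no   INTEGER NOT NULL REFERENCES parcel(parcel_no),
    shipped_on  TEXT)""")

conn.execute("INSERT INTO parcel VALUES (101, 'Asha')")
conn.execute("INSERT INTO shipment VALUES (1, 101, '2017-06-01')")

# The foreign key rejects rows that point at a non-existent parcel
try:
    conn.execute("INSERT INTO shipment VALUES (2, 999, '2017-06-02')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the first shipment row survives; the second is blocked because parcel 999 does not exist in Table A.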
CRUD Operations
• Create new tables and records
• Retrieve records from tables
• Update table definitions and record data
• Delete existing tables and records
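The four CRUD operations can be sketched with sqlite3 in a few lines. The `student` table is a hypothetical example (echoing the Students Information example earlier), not part of any real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Create: a new table and a new record
db.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO student VALUES (1, 'Mufaddal')")

# Retrieve: records from the table
print(db.execute("SELECT name FROM student WHERE roll_no = 1").fetchone())

# Update: record data
db.execute("UPDATE student SET name = 'M. Nullwala' WHERE roll_no = 1")

# Delete: an existing record, then the table itself
db.execute("DELETE FROM student WHERE roll_no = 1")
db.execute("DROP TABLE student")
```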
What is OLTP?
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyse it.
Online Transaction Processing is characterised by a large number of short online transactions:
• INSERT
• UPDATE
• DELETE
OLTP systems are used for order entry, financial transactions, CRM (Customer Relationship Management), retail sales, etc. Such systems have a large number of users who conduct short transactions.
An important attribute of an OLTP system is its ability to maintain concurrency. To avoid single points of failure, OLTP systems are often decentralized.
Why is OLTP Important?
Source of data (operational data)
To control and run fundamental business tasks
Reveals a snapshot of ongoing business process
Short and fast inserts and updates initiated by end users
Typically very fast (Performance Optimised)
Space requirements: can be relatively small if historical data is archived
Database Design Highly Optimised
Operational data is critical to running the business; therefore it is backed up religiously
Design Principles
Application oriented
Used to run Business
Detailed Data
Current Up to Date
Isolated Data
Repetitive Access
Clerical Users
Performance Sensitive
Few records accessed at a time (tens)
Read / Update access
No Data Redundancy
Database Size (100 MB - 100 GB)
Business Cases
E-commerce applications (e.g. Amazon, Flipkart)
ERP Solutions
CRM
SCM
Data Warehouse
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.
DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.
The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.
Data Warehouse continued..
A collection of data that is used primarily in organisational decision making
A decision support database that is maintained separately from the organisation’s operational databases.
A data warehouse is a
• subject-oriented,
• integrated,
• time-variant,
• non-volatile
collection of data in support of management's decision-making process.
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way that they can understand and use in a business context.
[Barry Devlin]
Characteristics of Data Warehouse
Subject oriented: Data are organised based on how the users refer to them.
Integrated: All inconsistencies regarding naming convention and value representations are removed.
Nonvolatile: Data are stored in read-only format and do not change over time.
Time variant: Data are not current but normally time series.
Why Separate Data Warehouse?
Performance
• Operational databases are designed and tuned for known workloads
• Complex OLAP queries would degrade performance, taxing operations
• Special data organisation, access and implementation methods are needed for multidimensional views and queries
Function
• Missing data: decision support requires historical data, which operational databases do not typically maintain
• Data consolidation: Decision support requires consolidation (aggregation, summarisation) of data from many heterogeneous sources: operational databases, external sources.
• Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
The Complete Decision Support System (Source: Franconi)
[Figure: information sources (operational DBs, semistructured sources, etc.) are extracted, transformed, loaded and refreshed into a data warehouse server (Tier 1), which also feeds data marts; OLAP servers (Tier 2, e.g. MOLAP or ROLAP) serve clients/DSS (Tier 3) for analysis, query/reporting and data mining.]
Three-Tier Architecture
Warehouse database server
Almost always a relational DBMS; rarely flat files
OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.
Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
Clients
Query and reporting tools
Analysis tools
Data mining tools (e.g., trend analysis, prediction)
Data Marts
A data mart is a scaled-down version of a data warehouse that focuses on a particular subject area.
A data mart is a subset of an organisational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.
Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organisation.
Usually designed to support the unique business requirements of a specified department or business process
Implemented as the first step in proving the usefulness of the technologies to solve business problems
Eg: Departmental subsets that focus on selected subjects: Marketing data mart: customer, products, sales
Why a Data Mart?
A data mart is the access layer of the data warehouse environment that is used to get data out to the users.
The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart including all the hardware, software and data.
This enables each department to isolate the use, manipulation and development of their data. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.
Organizations build data warehouses and data marts because the information in the database is not organized in a way that makes it readily accessible, requiring queries that are too complicated or resource-consuming.
From the Data Warehouse to Data Marts
[Figure: data flows from the organisationally structured data warehouse (more history, normalised, detailed data) to departmentally structured and then individually structured information (less history, more summarised).]
Characteristics of the Departmental Data Mart
• Small
• Flexible
• Customised by Department
• OLAP
• Source is departmentally structured data warehouse
Metadata
The last, and one of the most important, components of DW environments.
It is information that is kept about the warehouse rather than information kept within the warehouse.
The metadata is simply data about data.
It is important for designing, constructing, retrieving, and controlling the warehouse data.
Types of Metadata
Technical metadata: include where the data come from, how the data were changed, how the data are organised, how the data are stored, who owns the data, who is responsible for the data and how to contact them, who can access the data, and the date of last update.
Business metadata: Include what data are available, where the data are, what the data mean, how to access the data, predefined reports and queries, and how current the data are.
Applications of Data Warehousing

Industry               | Application
Finance                | Credit card analysis
Insurance              | Claims, fraud analysis
Telecommunication      | Call record analysis
Transport              | Logistics management
Consumer goods         | Promotion analysis
Data service providers | Value-added data
Utilities              | Power usage analysis
What is OLAP?
Definition - OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling, thereby giving users the insight and understanding they need for better decision making. Users can pivot, filter, drill down and drill up on data and generate any number of views.
Application - It is the foundation for many kinds of business applications for Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.
An OLAP structure created from the operational data is called an OLAP cube. As Figure shows, the cube holds data more like a 3D spreadsheet rather than a relational database, allowing different views of the data to be quickly displayed
The term OLAP was first introduced by E. F. Codd, who pioneered Relational Database Management Systems (RDBMS). Below are the twelve rules defined by Codd that OLAP technology must support.
Multidimensional conceptual view
Supports EIS (Executive Information System) slice and dice operations and is usually required in financial modeling.
Transparency Is part of an open system that supports heterogeneous data sources. Furthermore, the end user should not be concerned about the details of data access or conversions.
Accessibility Presents the user with a single logical schema of the data. OLAP engines act as middleware, sitting between heterogeneous data sources and an OLAP front-end.
Consistent reporting performance Performance should not degrade as the number of dimensions in the model increases.
Client/server architecture Requires open, modular systems. Not only should the product be client/server, but the server component of an OLAP product should allow various clients to be attached with minimal effort and programming for integration.
Generic dimensionality Not limited to 3-D and not biased toward any particular dimension. A function applied to one dimension should also be able to be applied to another.
Dynamic sparse-matrix handling
Related both to the idea of nulls in relational databases and to the notion of compressing large files, a sparse matrix is one in which not every cell contains data. OLAP systems should accommodate varying storage and data-handling options.
Multiuser support Supports multiple concurrent users, including their individual views or slices of a common database.
Unrestricted cross-dimensional operations
All dimensions are created equal, so all forms of calculation must be allowed across all dimensions, not just the measures dimension.
Intuitive data manipulation Users shouldn't have to use menus or perform complex multiple step operations when an intuitive drag and drop action will do.
Flexible reporting Users should be able to print just what they need, and any changes to the underlying model should be automatically reflected in reports.
Unlimited dimensional and aggregation levels Supports at least 15, and preferably 20, dimensions.
The OLAP Report, one of the most internationally authoritative sources of information on OLAP products and applications, defines OLAP in five keywords: Fast Analysis of Shared Multidimensional Information, or FASMI for short.
Fast
The system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
The system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user.
Shared
The system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the growing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies.
Information
The capacity of various products is measured in terms of how much input data they can handle, not how many gigabytes they take to store it.
OLAP Operations
Roll-Up
Decreases the number of dimensions by aggregating up a hierarchy; removes row headers.
Drill-Down
Increases the number of dimensions for finer detail; adds new row headers.
Slice
• Performs a selection on one dimension of the given cube, resulting in a sub-cube.
• Reduces the dimensionality of the cubes.
• Sets one or more dimensions to specific values and keeps a subset of dimensions for selected values.
Dice
• Define a sub-cube by performing a selection of one or more dimensions.
• Refers to range select condition on one dimension, or to select condition on more than one dimension.
• Reduces the number of member values of one or more dimensions.
Pivot (or rotate)
• Rotates the data axis to view the data from different perspectives.
• Groups data with different dimensions.
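The operations above can be sketched on a toy in-memory "cube" held as a Python dict keyed by dimension values. The dimensions (year, region, product) and figures are invented for illustration.

```python
# A toy sales cube with three dimensions: year, region, product
cube = {
    (2016, "West", "Books"): 120, (2016, "West", "Toys"): 80,
    (2016, "East", "Books"): 100, (2016, "East", "Toys"): 60,
    (2017, "West", "Books"): 150, (2017, "West", "Toys"): 90,
    (2017, "East", "Books"): 110, (2017, "East", "Toys"): 70,
}

# Slice: fix one dimension (year = 2017), giving a 2-D sub-cube
slice_2017 = {(r, p): v for (y, r, p), v in cube.items() if y == 2017}

# Dice: select conditions on more than one dimension
dice = {k: v for k, v in cube.items()
        if k[0] in (2016, 2017) and k[1] == "West" and k[2] == "Books"}

# Roll-up: remove the product dimension by summing over it
rollup = {}
for (y, r, p), v in cube.items():
    rollup[(y, r)] = rollup.get((y, r), 0) + v

print(rollup[(2017, "West")])  # Books + Toys for West in 2017 -> 240
```

Drill-down is the inverse of the roll-up step: going from `rollup` back to `cube` restores the product dimension.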
OLAP Architectures

MOLAP                                              | ROLAP
Information retrieval is fast.                     | Information retrieval is comparatively slow.
Uses sparse arrays to store data sets.             | Uses relational tables.
Best suited for inexperienced users; very easy to use. | Best suited for experienced users.
Maintains a separate database for data cubes.      | May not require space beyond that available in the data warehouse.
DBMS facility is weak.                             | DBMS facility is strong.
Static database.                                   | Dynamic database.
Dimensional Modelling
Dimensional modelling is one of the methods of data modelling that helps us store data in such a way that it is relatively easy to retrieve it from the database.
Different ways of storing data give us different advantages. For example, ER modelling gives us the advantage of storing data in such a way that there is less redundancy. Dimensional modelling, on the other hand, gives us the advantage of storing data in such a fashion that it is easier to retrieve information once the data is stored in the database.
Dimensional Modeling V/S ER Modeling
Dimensional Models are designed for reading, summarising and analysing numeric information, whereas Relational Models are optimised for adding and maintaining data using real-time operational systems.
Dimensional Modeling
It is composed of "fact" and "dimension" tables.
A "fact" is a numeric value that a business wishes to count or sum.
A "dimension" is essentially an entry point for getting at the facts. Dimensions are things of interest to the business.
Dimensional Modeling
Benefits
• Faster Data Retrieval
• Better Understandability
• Extensibility
https://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide
Star schema
The star schema architecture is the simplest data warehouse schema.
It is called a star schema because the diagram resembles a star, with points radiating from a centre.
The centre of the star consists of fact table and the points of the star are the dimension tables.
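A minimal star schema sketch using sqlite3. All table and column names (`fact_sales`, `dim_date`, `dim_product`, `dim_store`) are hypothetical: the fact table sits at the centre with foreign keys pointing out to the dimension tables, plus numeric measures.

```python
import sqlite3

dw = sqlite3.connect(":memory:")

dw.executescript("""
-- Dimension tables: the points of the star
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table at the centre: foreign keys to each dimension plus numeric measures
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")

dw.execute("INSERT INTO dim_date VALUES (1, '2017-06-01', 2017)")
dw.execute("INSERT INTO dim_product VALUES (1, 'Notebook', 'Stationery')")
dw.execute("INSERT INTO dim_store VALUES (1, 'Mumbai')")
dw.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 5, 250.0)")

# A typical star join: total revenue by year and category
row = dw.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""").fetchone()
print(row)  # (2017, 'Stationery', 250.0)
```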
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables, and measures (columns that contain numeric facts). A fact table can contain fact data at a detailed or aggregated level.
A dimension is a structure, usually composed of one or more hierarchies, that categorises data.
http://datawarehouse4u.info/Data-warehouse-schema-architecture-star-schema.html
Snowflake Schema
The snowflake schema architecture is a more complex variation of the star schema used in a data warehouse, because the tables which describe the dimensions are normalised.
ETL Process
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading.
ETL Steps
Initiation
Build reference data
Extract from sources
Validate
Transform
Load into staging tables
Audit reports
Publish
Archive
Clean up
Steps of ETL process
Extracts data from homogeneous or heterogeneous data sources
Transforms the data for storing it in proper format or structure for querying and analysis purpose
Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)
Extraction
Extracting the data from different sources; the data sources can be files (like CSV, JSON, XML), an RDBMS, etc.
This is the first step in the ETL process. It covers data extraction from the source system and makes it accessible for further processing. The main objective of the extraction step is to retrieve all required data from the source system with as few resources as possible. The extraction step should be designed in a way that it does not negatively affect the source system. Most data projects consolidate data from different source systems, and each separate source uses a different format; common data-source formats include RDBMS, XML, and flat files such as CSV and JSON. Thus the extraction process must convert the data into a format suitable for further transformation.
Transformation
Transforming the data; this may involve cleaning, filtering, validating and applying business rules.
In this step, certain rules are applied to the extracted data. The main aim of this step is to load the data into the target database in a cleaned and general format (depending on the organisation's requirements). This is because when data is collected from different sources, each source has its own standards. For example, if we have two different data sources A and B: in source A the date format is dd/mm/yyyy, and in source B it is yyyy-mm-dd.
Transformation continued..
In the transforming step we convert these dates to a general format. The other things that are carried out in this step are:
Cleaning (e.g. “Male” to “M” and “Female” to “F” etc.)
Filtering (e.g. selecting only certain columns to load)
Enriching (e.g. Full name to First Name , Middle Name , Last Name)
Splitting a column into multiple columns and vice versa
Joining together data from multiple sources
In some cases data does not need any transformations and here the data is said to be “rich data” or “direct move” or “pass through” data.
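A sketch of the date-normalisation and cleaning transformations described above, using only the standard library. The format strings and the `GENDER_MAP` values are assumptions matching the slide's examples (sources A and B, "Male" to "M").

```python
from datetime import datetime

# Source A uses dd/mm/yyyy, source B uses yyyy-mm-dd; normalise both
# to one warehouse format (ISO yyyy-mm-dd)
def to_iso(date_str, source_format):
    return datetime.strptime(date_str, source_format).strftime("%Y-%m-%d")

print(to_iso("01/06/2017", "%d/%m/%Y"))  # source A -> '2017-06-01'
print(to_iso("2017-06-01", "%Y-%m-%d"))  # source B, already ISO

# Cleaning: map free-form values to a single code
GENDER_MAP = {"Male": "M", "Female": "F"}
print(GENDER_MAP["Male"])  # 'M'
```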
Loading
Loading: data is loaded into a data warehouse or any other database or application that houses data.
This is the final step in the ETL process. In this step, the transformed data is loaded into the target database. To make the data load efficient, it is common to disable indexes and constraints before loading the data and rebuild them afterwards.
All three steps in the ETL process can run in parallel. Data extraction takes time, so the transformation step is executed simultaneously, preparing data for loading. As soon as some data is ready, it is loaded without waiting for the completion of the previous steps.
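The three steps can be put together in a miniature end-to-end sketch: extract from a CSV source, transform (clean gender codes and normalise dates), and load into a target table. The source data, column names, and the `employee` table are all invented for illustration; a real pipeline would read from files or an operational database.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Extract: read rows from a CSV source (held in a string here so the
# example is self-contained)
source = io.StringIO("name,gender,joined\nAsha,Female,01/06/2017\nRavi,Male,15/07/2017\n")
rows = list(csv.DictReader(source))

# Transform: apply cleaning and format rules
def transform(row):
    return (
        row["name"].strip(),
        {"Male": "M", "Female": "F"}[row["gender"]],
        datetime.strptime(row["joined"], "%d/%m/%Y").strftime("%Y-%m-%d"),
    )

clean = [transform(r) for r in rows]

# Load: write the cleaned rows into the target table
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE employee (name TEXT, gender TEXT, joined TEXT)")
target.executemany("INSERT INTO employee VALUES (?, ?, ?)", clean)

print(target.execute("SELECT * FROM employee").fetchall())
```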
ETL Tools
1. Oracle Warehouse Builder (OWB)
2. SAP Data Services.
3. IBM Infosphere Information Server.
4. SAS Data Management.
5. PowerCenter Informatica.
6. Elixir Repertoire for Data ETL.
7. Data Migrator (IBI)
8. SQL Server Integration Services (SSIS)
OLTP vs OLAP
“Thank you”