UNIVERSITY OF LJUBLJANA
FACULTY OF ECONOMICS
MASTER THESIS
COMPARISON OF SELECTED MASTER DATA MANAGEMENT
ARCHITECTURES
Ljubljana, February 2013 Katerina Atanasovska
TABLE OF CONTENTS
INTRODUCTION .................................................................................................................................. 1
RESEARCH PROBLEM AND PURPOSE OF MASTER THESIS ..................................................................... 1
RESEARCH GOALS ............................................................................................................................... 5
RESEARCH METHODS .......................................................................................................................... 5
1.DEFINITION OF DATA, DATA TYPES, DATA DIMENSIONS AND DATA
INCONSISTENCIES .............................................................................................................................. 6
1.1.DATA TYPES .................................................................................................................................. 6
1.1.1.Analytical data ....................................................................................................................... 7
1.1.2.Transactional data .................................................................................................................. 7
1.1.3.Master data ............................................................................................................................. 8
1.1.4.Metadata ................................................................................................................................. 8
1.2.DATA QUALITY DIMENSIONS ......................................................................................................... 9
1.2.1.Intrinsic data quality ............................................................................................................ 11
1.2.2.Contextual data quality ........................................................................................................ 12
1.2.3.Representational data quality ............................................................................................... 12
1.2.4.Accessibility data quality ..................................................................................................... 13
1.3.DATA INCONSISTENCY ................................................................................................................ 13
1.4.DATA QUALITY IMPROVEMENT ................................................................................................... 15
2.MASTER DATA MANAGEMENT ................................................................................................. 16
2.1.DEFINITION ................................................................................................................................. 16
2.2.GOALS OF MDM ......................................................................................................................... 18
2.3.MDM ACTIVITIES ........................................................................................................................ 20
2.4.BENEFITS FROM MDM ................................................................................................................ 21
3.MASTER DATA MANAGEMENT SOLUTIONS .......................................................................... 22
3.1.HISTORICAL REVIEW OF MDM SOLUTIONS ................................................................................ 22
3.2.FUNCTIONALITIES, CONCEPTS AND ARCHITECTURE ................................................................... 24
3.3.ARCHITECTURE OF MDM DESCRIBED THROUGH SELECTED MDM SOLUTIONS ......................... 29
3.3.1.Microsoft Master Data Services ........................................................................................... 30
3.3.2.SAP Netweaver .................................................................................................................... 33
3.3.3.IBM InfoSphere ................................................................................................................... 37
3.3.4.Oracle MDM Suite ............................................................................................................... 44
4.ANALYSIS OF SELECTED MASTER DATA MANAGEMENT ARCHITECTURES ................ 50
4.1.MDM OF SELECTED ARCHITECTURES AND QUALITY DIMENSIONS ............................................. 50
4.2.COMPARISON OF SELECTED ARCHITECTURES THROUGH THE THREE DIMENSIONAL MODEL ...... 52
4.3.COMPARISON OF SELECTED ARCHITECTURES THROUGH THE FIVE MDM ACTIVITIES ............... 54
5.CASE STUDY OF MDM SOLUTION USED IN STUDIO MODERNA ........................................ 56
5.1.PROBLEMS WITH PRODUCT DATA MANAGEMENT ....................................................................... 56
5.2.CENTRAL PRODUCT REGISTER (CPR)......................................................................................... 57
5.2.1.Product statuses .................................................................................................................... 58
5.2.2.Product data security ............................................................................................................ 59
5.3.BENEFITS OF CPR ....................................................................................................................... 60
5.4.COMPARISON OF CPR AND SELECTED MDM ARCHITECTURES .................................................. 61
5.5.BUILD VS BUY MDM SOLUTION ................................................................................................. 64
CONCLUSION ..................................................................................................................................... 66
LIST OF REFERENCES ...................................................................................................................... 70
LIST OF FIGURES
Figure 1: Definition of master data and the master record ..................................................................... 1
Figure 2: Applications used for MDM ................................................................................................... 4
Figure 3: Enterprise data ........................................................................................................................ 7
Figure 4: List of data attributes ............................................................................................................ 10
Figure 5: List of techniques for solving data inconsistencies .............................................................. 15
Figure 6: Workflow of MDM .............................................................................................................. 16
Figure 7: The data quality activity levels ............................................................................................. 19
Figure 8: MDM Activities ................................................................................................................... 20
Figure 9: Evolution of IBM MDM applications .................................................................................. 24
Figure 10: Dimensions of master data management ............................................................................ 25
Figure 11: Traditional MDM architecture ........................................................................................... 28
Figure 12: MDM architecture with additional published services ...................................................... 28
Figure 13: MDS data model ................................................................................................................ 31
Figure 14: Table types ......................................................................................................................... 34
Figure 15: Key mapping during import and export ............................................................................. 35
Figure 16: Logical model ...................................................................................................................... 39
Figure 17: Domain model and physical model .................................................................................... 39
Figure 18: Physical model ................................................................................................................... 39
Figure 19: Example of field mappings during data import .................................................................. 40
Figure 20: Example of SSN pattern match .......................................................................................... 41
Figure 21: Example of record merge ................................................................................................... 42
Figure 22: List of predefined tables for Customer entity ..................................................................... 45
Figure 23: Example of cross reference between PARTIES and SYS_REFERENCE .......................... 46
Figure 24: Example of data validation workflow ................................................................................ 47
Figure 25: Example of data flow in CPR .............................................................................................. 58
LIST OF TABLES
Table 1: An example estimating the positive impact of customer MDM ............................................ 22
Table 2: Gartner’s Magic Quadrant for Data ....................................................................................... 29
Table 3: MDS repository objects vs. Relational database objects ....................................................... 31
Table 4: Advantages and disadvantages of MDS ................................................................................ 33
Table 5: Advantages and disadvantages of SAP .................................................................................. 36
Table 6: Advantages and disadvantages of IBM InfoSphere ............................................................... 43
Table 7: Advantages and disadvantages of Oracle MDM ................................................................... 48
Table 8: DQ dimension and MDM ...................................................................................................... 50
Table 9: MDM solutions and three-dimensional model ...................................................................... 52
Table 10: MDM overview through four data management phases ...................................................... 55
Table 11: CPR solutions for product data management ....................................................................... 60
Table 12: Comparison of MDM architectures and CPR’s three dimensional model ........................... 61
Table 13: Comparison of MDM architectures and CPR’s MDM Phases ............................................ 61
Table 14: Comparison of MDM architectures and CPR’s time and cost ............................................. 62
INTRODUCTION
Research problem and purpose of master thesis
Most businesses today perform and track their everyday transactions with the help of various information systems. Companies use them to automate their business processes, store their data and make further business decisions based on the end results produced by various applications. The great success of these systems rests not only on the complex processing logic in their backend software, but also on the friendly user interfaces that make such software easy to work with.
The development of new and sophisticated information technologies (IT) in the past decade has resulted in the growth and expansion of numerous business solutions on the market. The benefits of this development are visible in the improved workflows of many companies. However, the side effects of the fast growth of IT created additional headaches for businesses and redirected them back to IT vendors in search of solutions. One of the major problems that users of such applications deal with is the constant growth of “dirty” data in the system.
There are two reasons why IT is responsible for producing bad data:
1. In trying to get closer to their customers, vendors focused on application design and various business scenarios while neglecting data validation and filtering across the whole architecture. This weakened the system's ability to check the content of entered data;
2. The new service-oriented architecture (SOA) allows integration of different applications into one system. Since each application carries its own database, there is a high possibility that the same data is stored in different sources, which automatically produces data redundancy in the system.
Figure 1: Definition of master data and the master record
Source: J. Bracht et al, Smarter Modeling of IBM InfoSphere MDM Solutions, 2012, p. 29
The problem of bad data became most visible and hard to handle when companies started experiencing revenue loss, increased costs, customer complaints, employee frustration, etc.
The statistics below, based on research by Arlbjørn and Haug (2010, p. 294), show the alarming situation companies find themselves in because of poor data quality:
- 88 per cent of all data integration projects either fail completely or significantly over-
run their budgets;
- 75 per cent of organizations have identified costs stemming from dirty data;
- 33 per cent of organizations have delayed or cancelled new IT systems because of
poor data;
- $611bn per year is lost in the US in poorly targeted mailings and staff overheads
alone;
- According to Gartner, bad data is the number one cause of CRM system failure;
- Less than 50 per cent of companies claim to be very confident in the quality of their
data;
- Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data;
- Only 15 per cent of companies are very confident in the quality of external data
supplied to them;
- Customer data typically degenerates at 2 per cent per month or 25 per cent annually;
- Organizations typically overestimate the quality of their data and underestimate the
cost of errors;
- Business processes, customer expectations, source systems and compliance rules are
constantly changing.
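The customer-data decay figure in the list above can be sanity-checked with a few lines of arithmetic. A sketch (the 2 per cent monthly rate is the quoted statistic; the interpretation as simple vs. compounded loss is my assumption):

```python
# Sanity check of the quoted decay statistic: customer data degenerates
# at 2% per month. Simple (non-compounded) loss over 12 months gives
# 24%, close to the quoted 25% annual figure; compounding gives less.
monthly_decay = 0.02

simple_annual = monthly_decay * 12                  # 24% per year
compound_annual = 1 - (1 - monthly_decay) ** 12     # about 21.5% per year

print(f"simple:   {simple_annual:.1%}")
print(f"compound: {compound_annual:.1%}")
```

Either reading supports the qualitative point: left unmanaged, roughly a quarter of customer records go stale within a year.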
Working as a database analyst at Studio Moderna, I have been dealing with examples of bad data every day. Duplicates, misspellings and missing values are some of the irregularities that appear in customer databases. It is very hard to produce statistics and analyses knowing that the numbers contain duplicates, but the huge data load and constant time pressure do not allow you to go through and cleanse what is considered obsolete. In the end, the picture you present for the requested business scenario may be unreliable, not because of incorrect query statements or miscalculations, but because of the content of the data involved in the whole process. It is very frustrating for anyone working on data analysis to spend time hunting for an error, trying to find the reason for mismatching results, only to discover that it is just another misspelled name or missing address.
There are several techniques that help solve problems with bad data, among them data mining, data cleansing, data profiling and data governance. Depending on the tasks they perform, these techniques are divided into four major groups: techniques to clean, consolidate, govern and share data. Today, all of them fall under Master Data Management (MDM), a discipline that brings together any method, technique or technology that deals with data quality improvement.
In much of the literature MDM is defined as software for improving data quality, but Master Data Management covers a much broader area than that. In the more formal definition given by Mauri and Sarka (2011, p. 16), MDM is a set of coordinated processes, policies, tools and technologies used to create and maintain accurate master data. There is no single tool for MDM; anything that is used to solve a data quality issue falls under the category of MDM. For example, running nightly procedures for data cleansing, defining table constraints to check inserted data, or defining table users and permissions can all be considered managing data. Master data is singled out in this discipline because it represents the core data of every enterprise; it needs to be correct and precisely maintained in systems so that the company can work with the lowest possible number of data issues.
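The everyday examples above can be made concrete. A minimal sketch using Python's built-in sqlite3 module (the customer table and its rules are hypothetical, not taken from any product discussed later): a check constraint rejects bad input at insert time, and an update statement stands in for a nightly cleansing procedure:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Table constraints that check inserted data -- one of the routine
# data management measures mentioned above.
con.execute("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL CHECK (length(trim(name)) > 0),
        email TEXT CHECK (email LIKE '%_@_%')
    )
""")

con.execute("INSERT INTO customer (name, email) VALUES ('  Ana ', 'Ana@Example.com')")

# The constraint blocks an obviously bad record at insert time.
try:
    con.execute("INSERT INTO customer (name, email) VALUES ('', 'x@example.com')")
except sqlite3.IntegrityError:
    print("rejected: empty name")

# A simplified 'nightly cleansing procedure': normalize whitespace and case.
con.execute("UPDATE customer SET name = trim(name), email = lower(email)")

print(con.execute("SELECT name, email FROM customer").fetchall())
```

Each of these small measures is a piece of master data management in the broad sense of the definition above, even though no dedicated MDM product is involved.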
In addition to this strategy, vendors have developed sophisticated MDM software solutions in which they implement numerous techniques for improving data quality. These solutions are designed for large, medium and small enterprises. Entire software suites are appropriate for larger companies that work with great amounts of data. Another example where MDM suites are used is companies that have extended their businesses through mergers or acquisitions and are confronting problems of bad data created by introducing new systems into their existing environment. Individual modules of the suites are appropriate for medium and small companies, where certain MDM applications are used for analysis and data cleaning.
There is a significant number of established vendors who offer Master Data Management
products. D&B/Purisma Data Hub, DataFlux MDM, Data Foundations OneData, i2 MDM,
IBM InfoSphere MDM Server, Initiate Systems Master Data Services, Kalido MDM,
Liaison Technologies MDM, Microsoft MDM, Oracle Customer Data Hub, Oracle Hyperion
DRM, Oracle UCM, Orchestra Networks EBX, SAP NetWeaver MDM, Siperian MDM Hub,
Sun MDM Suite/Mural, Teradata MDM, TIBCO CIM, VisionWare MultiVue are just part of
the list of various MDM applications. Considering that MDM is a fairly new technology that has been establishing itself on the market over the past 10 years, it is hard to decide which of the products listed above would be the best solution for a given organization.
Market reviews predict a bright future for MDM vendors. The aggregate MDM market will
grow from US$2.8 billion to US$4 billion over the forecast period (2008-2012), including
revenues from both MDM packaged solutions and implementation services as well as the
billion plus dollars related to data service providers such as Acxiom and Dun & Bradstreet.
The aggregate enterprise MDM market (customer and product hubs, plus systems
implementation services) totaled US$730 million at YE2007 and will reach US$2 billion by
the end of 2012. Software sales are but one portion as MDM systems integration services
reached US$510 million alone during 2007 and are projected to exceed US$1.3 billion per
year by 2012 (Zornes, 2009, p. 3).
Despite these predictions, the majority of companies still favor in-house solutions over packaged MDM software. In 2006, Ventana Research surveyed 515 companies on this matter. They found that only 4% of the surveyed companies had completed their MDM implementation project, 7% were in an ongoing implementation phase and 33% were in progress. Fewer than half of these companies had some kind of packaged software, whereas 20% had their own developed solution. Nearly half were considering implementing some data governance tool, but only 24% were planning to realize that some time in the future (Smith, 2006).
Similar numbers were recently confirmed by Messerschmidt and Stuben (2011, p. 5). They interviewed 49 companies from 12 different countries and eight industries, including small and large businesses. The numbers showed that most of these companies are willing to implement MDM software but are still using their own home-built MDM solutions. Figure 2 shows the answers companies gave regarding the MDM application they use; most answered that they still use in-house development instead of packaged software.
Figure 2: Applications used for MDM
Source: M. Messerschmidt & J. Stuben, Hidden Treasure, 2011, p. 33
From the various statistics presented earlier, it seems that the majority of organizations are looking into implementing some kind of MDM tool, but are still not quite ready for the packaged software available on the market. When an organization has a certain budget to invest in a technological upgrade, it strives to make the best decision money can buy. That decision introduces the problem of this master thesis: the never-ending debate over packaged vs. custom-built solutions. The problem is examined through four architectures of already established vendors, Microsoft, IBM, SAP and Oracle, as well as a case study of a custom-built solution developed for the requirements of Studio Moderna.
The structure of this thesis is divided into two parts. The first part explores the problem of “bad data”, discussed through some general concepts of data, data quality, standards for quality data and possible causes of data inconsistencies. The second part covers the purpose of my thesis, which is an analysis of the data management process implemented in selected MDM software solutions offered on the market, and of how their MDM architecture assists in improving data quality. This analysis is made by researching and comparing different MDM architectures and the way they perform data modeling, validation, import and export of data, and system security. The MDM software solutions compared in this thesis are Microsoft Master Data Services, SAP NetWeaver MDM, IBM InfoSphere and Oracle MDM Suite. Many vendors offer MDM solutions, but I chose these four because they are already known for their database management systems as well as for many business intelligence (BI) tools. In addition to these four products, I included one custom-made solution called the Central Product Register (CPR), developed for product data management in Studio Moderna.
Research goals
The goal of my work is to create a comparison model for products from the selected MDM vendors. This model discusses domain, method of use and implementation style as the main dimensions that characterize each MDM system. I also discuss in more detail some of the techniques used to consolidate, cleanse, govern and share data. The comparison describes how each vendor understands and implements data management in its solution. It also highlights the advantages and disadvantages of each product, and tries to establish whether implementing such a solution will really benefit the business, or whether it is just another fancy application for better organization and viewing of data that does not actually solve the core problem of data quality.
Including a custom-built solution as a case study of an MDM product shows how master data management is understood within a company, and describes the company's attempt to solve the problem without help from an off-the-shelf product. Discussing the company's internal data management introduces another goal of this thesis: to show that users should not rely on MDM software as the only path to quality data. In most cases the problem must be looked for much deeper, not in the data itself but in the sources that produce it, whether user or application. Often the problem lies in a lack of knowledge or experience with business processes and the company's workflows. In such cases MDM products can improve data and solve current issues, but they are not a long-term solution, because the problem exists elsewhere and sooner or later it will produce bad results again. If proven, this finding can be very useful, because it can save users the time and money of purchasing and implementing software that was not the right choice in the first place.
Research methods
There are two research methods used in this thesis:
1. Comparative Analysis;
2. Case study of in-house MDM solution in Studio Moderna.
All of the data and statistics used in this paper are collected from literature and publications, so only secondary data is used. Since this topic compares products from a technical point of view, I chose literature, white papers and technical notes as the most appropriate sources for my thesis. Given the title of my research, the most suitable method of comparison is comparative analysis itself. This method is used when researching the collected materials and creating a general summary of the four different MDM solutions.
The other method used in this research is a case study of the in-house MDM solution built for Studio Moderna's needs. I consider this case study a suitable example because it deals with the problem of data quality and covers data management processes, the same as the four products sold by Microsoft, Oracle, IBM and SAP. It can also extend the discussion of data management in terms of business process changes and workflow restructuring, not just data cleansing and governance, as options for data quality improvement.
In addition to these two methods, I also used unstructured interviews. The following people were interviewed:
- Mr. Bostjan Kos, Information Management Client Technical Professional at IBM,
Slovenia;
- Tadej Zajc, Sales Representative in Oracle, Slovenia and
- Sasa Strah, project lead for Central Product Register solution in Studio Moderna.
These were informal interviews conducted over email, containing questions about the Master Data Management products that the representatives listed above work with, as well as their experience with customers who use their software.
1. DEFINITION OF DATA, DATA TYPES, DATA DIMENSIONS AND DATA INCONSISTENCIES
1.1. Data types
Data is part of our everyday life. Words, numbers, dates and pictures are all examples of data. 'Data' represents a collection of unorganized facts which can be organized into useful information. 'Processing' refers to a group of actions that convert inputs into outputs. The series of operations performed to convert unorganized data into organized information is called data processing, and it includes resources like people, procedures and devices to convert data into information (Minimol & Sarngadharan, 2010, p. 85).
As introduced earlier in this thesis, the main goal of MDM is to improve data quality in organizations; therefore, organizational (enterprise) data is discussed further in this thesis. Enterprise data represents all the inputs that are produced, processed and stored in an enterprise. It can be used in different business scenarios and for different purposes within the company.
For easier management, enterprise data is divided into three categories: analytical, transactional and master data. This grouping is based not on the content or format of the data, but on the different ways the same data is used. There is no strict rule that splits data into one of these three categories; one record can be defined as analytical or transactional depending on the way it is used in a certain business scenario.
For example, sales data can be seen as transactional data representing the daily sales transactions in a company. On the other hand, sales data can also be used for analytical purposes, to present the sales status of an organization for a certain time period. This example puts sales data in two groups, transactional and analytical, depending on the way it is used in a given situation.
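The sales example can be sketched in a few lines: the very same records serve as transactional data, and a summary over them becomes analytical data (the records below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical transactional data: one record per sales event.
transactions = [
    {"date": "2012-01-05", "product": "blender", "amount": 49.90},
    {"date": "2012-01-05", "product": "pillow",  "amount": 19.90},
    {"date": "2012-02-11", "product": "blender", "amount": 49.90},
]

# The same records become analytical data once aggregated,
# e.g. revenue per month for a management report.
revenue_per_month = defaultdict(float)
for t in transactions:
    month = t["date"][:7]                    # "YYYY-MM"
    revenue_per_month[month] += t["amount"]

for month, total in sorted(revenue_per_month.items()):
    print(month, round(total, 2))
```

Nothing about the records themselves changes; only the perspective, detailed events versus a summarized view, determines which category they fall into.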
Figure 3: Enterprise data
Source: An Oracle White Paper on Master Data Management, 2011, p. 4
1.1.1. Analytical data
Analytical data is used to provide a general picture of a company's work. It is the end result of statistics, analyses or other calculations performed over collected inputs. Its main use is to show the business situation in a given time period. It is usually stored in the business intelligence (BI) part of the company's system and is shown in reports, OLAP cubes, graphs, etc. Examples of analytical data are a client demographics overview, yearly profits and losses, or any summary results collected at the global enterprise level. It helps in making major business decisions and often determines the course of a company's progress.
1.1.2. Transactional data
Transactional data represents records that refer to transactions in a system. Transactions are activities related to business events, for example payments, sales, creation of a new account or creation of a new student record; in other words, any change related to an object at a given time. Compared to analytical data, this type is much more detailed, and it tracks and records every new input, update or delete in a system. That is why the amount of transactional data increases every day, proportionally with the growth of the number of transactions.
Even though analytical and transactional data are opposite and completely different categories, they cannot function without one another. It is very hard to review the numerous transaction records created on a daily basis, so analytical data is used to summarize transactional data and provide the final number of daily changes in the system. On the other hand, we can always examine anomalies in analytical data by reviewing each transactional record that went into those numbers.
1.1.3. Master data
Master data is the core data of each enterprise and contains detailed information about its main domains. Since every enterprise is engaged in a different business, it has different domains as well. Examples would be customer, product, location, etc.
Master data can be categorized according to the kinds of questions user will address; three of
the most common questions - “Who?”, “What?,” and “How?” return the most common
domains: party, product, and account. Each of them represents a class of things - for
example, the party domain can represent any kind of person or organization, including
customers, suppliers, employees, citizens, distributors, and organizations. Similarly, the
product domain can represent all kinds of things that companies sell or use - from tangible
consumer goods to service products such as mortgages, telephone services, or insurance
policies. The account domain describes how a party is related to a product or service that the
organization offers. What are the relations of the parties to this account, and who owns the
account? Which accounts are used for which products? What are the terms and conditions
associated with the products and the accounts? And how are products bundled? (Dreibelbis, Hechler, Milman, Oberhofer, van Run & Wolfson, 2008, p. 14).
However, this grouping cannot be taken as a general rule that all companies apply. Depending on its business, rules and logic, each enterprise has its own master data objects defined for the needs of the company.
Based on some of the domains given above, examples of master data would be a customer's date of birth, gender, name and address, or a product's name, SKU, price and supplier. Master data is entered once into the system and changes only on rare occasions. Because the business relies on this information, it is very important to maintain its consistency over time. It is damaging for a company to lose sales records for a customer, but it would be even more critical to lose that customer's personal or contact information. Managing this type of data is discussed later in the thesis.
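A master record for the customer domain, carrying the rarely-changing attributes listed above, might look like the following sketch (field names and values are illustrative, not drawn from any of the discussed products):

```python
from dataclasses import dataclass
from datetime import date

# Illustrative master record: core attributes entered once and
# changed only rarely, in contrast to daily transactional history.
@dataclass
class CustomerMasterRecord:
    customer_id: int
    name: str
    date_of_birth: date
    gender: str
    address: str

record = CustomerMasterRecord(
    customer_id=1001,
    name="Ana Novak",
    date_of_birth=date(1980, 3, 14),
    gender="F",
    address="Dunajska cesta 1, Ljubljana",
)
print(record.name, record.date_of_birth.isoformat())
```

Transactional records (orders, payments) would reference such a record by its identifier, which is exactly why a single consistent copy matters.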
1.1.4. Metadata
Another group of enterprise data worth mentioning is metadata. “Data about data” is the well-known definition of metadata found throughout the literature. However, metadata has a much broader value and meaning for the enterprise, especially when Master Data Management is discussed. Metadata helps the enterprise relate correct information to the appropriate business terms. For example, it helps in differentiating concepts with similar meanings, such as client, customer and buyer.
There are two types of metadata: (1) semantic and (2) syntactic (Sheth, 2003).
(1) Semantic metadata describes contextually relevant or domain-specific information about content (in the right context) based on an industry-specific or enterprise-specific custom metadata model or ontology;
(2) In contrast, syntactic metadata focuses on elements such as size of the document,
location of a document or date of document creation that do not provide a level of
understanding about what the document says or implies.
Another categorization of metadata is based on its type of usage. In this case there are three
broad categories (Berson and Dubov, 2007, p. 129):
(1) Business metadata includes definitions of data files and attributes in business terms. It
may also contain definitions of business rules that apply to these attributes, data
owners and stewards, data quality metrics, and similar information that helps business
users to navigate the “information ocean.”
(2) Technical metadata is created and used by the tools and applications that create,
manage, and use data. Technical metadata typically includes database system names,
table and column names and sizes, data types and allowed values, and structural
information such as primary and foreign key attributes and indices.
(3) Operational metadata contains information that is available in operational systems and
run-time environments. It may contain data file size, date and time of last load,
updates, and backups, names of the operational procedures and scripts that have to be
used to create, update, restore, or otherwise access data, etc.
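The three usage-based categories can be illustrated with a small sketch. All names and values below are hypothetical, chosen only to show what each category typically records for a single business entity.

```python
# Hypothetical metadata for one "Customer" entity, split into the three
# usage-based categories described above.

business_metadata = {
    "entity": "Customer",
    "definition": "A person or organization that purchases our products",
    "data_owner": "Sales department",
    "quality_metric": "completeness of contact data >= 95%",
}

technical_metadata = {
    "table": "CRM.CUSTOMER",
    "columns": {"CUST_ID": "INTEGER", "LAST_NAME": "VARCHAR(50)"},
    "primary_key": ["CUST_ID"],
}

operational_metadata = {
    "row_count": 125000,
    "last_load": "2013-01-15T02:00:00",
    "load_script": "etl_load_customer.sh",
}

# An MDM repository would link all three views under one business term.
customer_metadata = {
    "business": business_metadata,
    "technical": technical_metadata,
    "operational": operational_metadata,
}

print(sorted(customer_metadata))  # ['business', 'operational', 'technical']
```

Keeping the three views linked under one business term is what lets MDM answer both "what does Customer mean?" and "where is it stored, and when was it loaded?" from a single place.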
Based on the definitions and categorizations of metadata, I can conclude that this type of enterprise data supports MDM in two ways:
(1) It contains background information about the context and technical properties of data, which supports more precise data modeling in MDM as well as appropriate mapping of data to master domains;
(2) It sets general data rules for business and technical definitions of data, which supports data standardization, another process in managing master data.
1.2. Data quality dimensions
Companies need to be acquainted with data quality standards so they can easily detect deficiencies in their data. In my opinion, quality in general relates to how much we can expect to gain from something and how reliable or useful it is. With this in mind, data quality shows how much information we can gain from given data and how reliable that information is for us as users.
The classic definition found in the literature describes data quality as "fitness for use", i.e. the extent to which data successfully serves the purposes of its users (e.g. Tayi and Ballou, 1998; Cappiello et al., 2003; Lederman et al., 2003; Watts et al., 2009).
Defining data quality is very subjective, and it is not seen equally by everyone. Some users may consider data very reliable, whereas others may argue that there are still improvements to be made. To reconcile such opposing views, the literature sets common standards for data quality defined through data dimensions. Data dimensions define data quality as a multidimensional concept and help in determining data's "fitness for use". According to Strong and Wang (1996, p. 6), data quality dimensions are a set of data quality attributes that represent a single aspect or construct of data quality.
Attributes are characteristics of data, and the easiest way to define them is by answering simple data-related questions. For example, the question "Which data is duplicated?" points to uniqueness as an attribute, while "What data is incorrect?" implies accuracy, and so on. The figure below lists several questions for determining data attributes.
Figure 4: List of data attributes
Source: R. Hillard, Information-Driven Business: How to Manage Data and Information for Maximum Advantage, 2010, p. 136
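This question-driven view of attributes can be sketched in code: each quality question becomes a small check over the data. The records and value ranges below are hypothetical.

```python
from collections import Counter

# Hypothetical customer records; one duplicated e-mail, one bad birth year.
records = [
    {"id": 1, "email": "ana@example.com", "birth_year": 1985},
    {"id": 2, "email": "ben@example.com", "birth_year": 1978},
    {"id": 3, "email": "ana@example.com", "birth_year": 1985},  # duplicate
    {"id": 4, "email": "cid@example.com", "birth_year": 2199},  # inaccurate
]

def duplicated_values(rows, field):
    """'Which data is duplicated?' -> the uniqueness attribute."""
    counts = Counter(r[field] for r in rows)
    return {value for value, n in counts.items() if n > 1}

def inaccurate_rows(rows, field, lo, hi):
    """'What data is incorrect?' -> the accuracy attribute."""
    return [r["id"] for r in rows if not lo <= r[field] <= hi]

print(duplicated_values(records, "email"))                 # {'ana@example.com'}
print(inaccurate_rows(records, "birth_year", 1900, 2013))  # [4]
```

Each additional question from the figure (missing values, out-of-range values, inconsistent formats) translates into another small check of the same shape.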
There are many attempts in the literature to determine which attributes are most important and best define data quality. For example, Strong and Wang (1996, p. 7) took (1) an intuitive, (2) a theoretical and (3) an empirical approach to find out which data characteristics matter most.
(1) The intuitive approach is based on the researchers' own understanding of an attribute's importance. They choose freely which attributes best define data quality, and in this case the researchers do not ask which data attributes are important to system users;
(2) The theoretical approach, on the other hand, does not rely on the researcher's subjectivity seen in the previous example; it defines data characteristics based on the data deficiencies that can be found in a system. Attributes are derived in reverse: if there is duplicate data in the system, for example, then uniqueness is the missing attribute that is crucial for data quality. Like the previous approach, it does not consider which data attributes are important to users;
(3) In the third, empirical approach, data quality is defined in terms of the data attributes that are important to system users. Even though this approach tries to be as objective as possible by using the general opinion of consumers, the final results can still be diverse and inconsistent because opinions differ from person to person. The difficulty in this approach is setting basic rules against which the dimensions can be compared.
The general conclusion from these approaches is that there are no strict rules or fixed attributes that define data quality. Data quality dimensions are relative to user requirements, and since these requirements often change, the priority and importance of data quality dimensions can change as well.
Wang and Strong (1996, p. 21) used a two-stage survey and a two-phase sorting study to develop a hierarchical framework that consolidates 118 data-quality attributes collected from data consumers into fifteen dimensions, which in turn are grouped into the following four categories, each focusing on a key issue:
1. Intrinsic - What degree of care was taken in the creation and preparation of information?;
2. Contextual - To what degree does the information provided meet the needs of the user?;
3. Representational - What degree of care was taken in the presentation and organization of information for users?;
4. Accessibility - What degree of freedom do users have to use data and to define and/or refine the manner in which information is entered, processed, or presented to them?
1.2.1. Intrinsic data quality
Intrinsic data quality, according to Wang et al (2005, p. 7), implies that information has quality in its own right. Attributes in this category show how truthfully data describes the real objects around us. This group refers to data that comes along with the object and does not change because of business requirements. For example, a person's name is given as it is and does not change due to some business requirement; the same holds for a person's weight, height, or eye color. Such values are intrinsic to the person, and the only anomalies found in this data are NULLs or badly formatted values. Quality in this case is therefore measured by the existence and correctness of the input, not by whether it satisfies business needs. Below is a list of the most commonly used dimensions along with their definitions (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Believability - The extent to which data are accepted or regarded as true, real and
credible;
- Accuracy - The extent to which data are correct, reliable and certified free of error;
- Objectivity - The extent to which data are unbiased (unprejudiced) and impartial;
- Reputation - The extent to which data are trusted or highly regarded in terms of their source or content.
1.2.2. Contextual data quality
Contextual data dimensions highlight the requirement that information quality be considered within the context of the task at hand (Wang et al, 2005, p. 7). As the category's name suggests, these dimensions define how precisely data captures the context of business objects. If I again take a person as an example and the address as a representative data element, I am interested in whether this is the only address that can be assigned to the person and whether the address is currently valid. To improve quality in contextual terms, every business needs to increase the amount of data related to its business objects and update it in a timely manner, to avoid old and obsolete information in the system. The following dimensions are defined in this group (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Value-added - The extent to which data are beneficial and provide advantages from
their use;
- Relevancy - The extent to which data are applicable and helpful for the task at hand;
- Timeliness - The extent to which the age of the data is appropriate for the task at
hand;
- Completeness - The extent to which data are of sufficient depth, breadth, and scope
for the task at hand;
- Appropriate amount of data - The extent to which the quantity and volume of available data is appropriate.
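Two of these contextual dimensions, completeness and timeliness, lend themselves to simple measurement. The sketch below, with hypothetical address records, computes the share of populated values and flags records that are too old for the task at hand.

```python
from datetime import date

# Hypothetical address records for contextual quality checks.
addresses = [
    {"person": "A", "city": "Ljubljana", "updated": date(2012, 11, 1)},
    {"person": "B", "city": None,        "updated": date(2009, 3, 5)},
    {"person": "C", "city": "Maribor",   "updated": date(2012, 12, 20)},
]

def completeness(rows, field):
    """Share of rows in which the field is populated (0.0 - 1.0)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def stale_rows(rows, as_of, max_age_days):
    """Timeliness: rows older than the age allowed for the task at hand."""
    return [r["person"] for r in rows
            if (as_of - r["updated"]).days > max_age_days]

print(round(completeness(addresses, "city"), 2))     # 0.67
print(stale_rows(addresses, date(2013, 1, 1), 365))  # ['B']
```

Note that both measures are parameterized by the task: a different task may tolerate older data (a larger `max_age_days`) or require other fields to be complete, which is exactly what "contextual" means here.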
1.2.3. Representational data quality
Representational data dimensions address the way computer systems store and present information (Wang et al, 2005, p. 8). This category is explored more from a technical than a content perspective. Data quality in this case depends on how well the data model and business logic are integrated in systems. If the database model is well designed, business objects are represented by correct and unique data; otherwise, duplicates, orphan records, and obsolete data accumulate, consuming database space with no particular use. To meet these data dimensions, companies need to focus on the technical planning and development of their information systems. The representational data quality category includes the following dimensions (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Interpretability - The extent to which data are in appropriate language and units and
the data definitions are clear;
- Ease of understanding - The extent to which data are clear without ambiguity and
easily comprehended;
- Representational consistency - The extent to which data are always presented in the
same format and are compatible with previous data;
- Concise representation - The extent to which data are compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point).
1.2.4. Accessibility data quality
Accessibility data quality is another category that defines dimensions from a technical perspective. This multidimensional nature of information quality means that organizations must use multiple measures to fully evaluate whether their data are fit to use for a given purpose by a given consumer at a given time (Wang et al, 2005, p. 8).
The ability of today's systems to serve multiple users at the same time can often cause erroneous data. Duplicates, overwriting of important information, and database changes are some of the risks that systems face in everyday usage. To lower such risks, companies spend considerable time building a security model and limiting access to the system's data. Data dimensions of this type are defined not by the content of data but by the system's security model. There are two known dimensions in the accessibility group (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Accessibility - The extent to which data are available or easily and quickly
retrievable;
- Access security - The extent to which access to data can be restricted and hence kept secure.
1.3. Data inconsistency
Data inconsistencies are irregularities found in data, such as duplicates, misspellings, and undefined values. They are the "bad" data in systems; any data that is obsolete, incorrect, or useless falls into this category.
Bad data can have tangible and intangible effects on a business. According to older research cited by Haapasalo et al (2010, p. 147), incorrect data is estimated to cost the retail business alone $40 billion annually, and at the organizational level costs are approximately 10 percent of revenues. It is said that the decisions a company makes are no better than the data on which they are based, and that better data leads to better business decisions.
Looking at the intangible consequences, data inconsistencies also cause mistrust in existing data. Working with different data versions of the same enterprise object is time-consuming, requires additional work for tracking errors, and causes frustration among employees. Incorrect data cannot give an accurate picture of a business, and it cannot help in making the right business decisions for future progress and success.
Two factors play a major role in producing bad data: the human factor and system design. Human errors occur every day, usually on input or during various calculations. From my personal experience, most of the work I do is data analysis, and a high number of the errors I see are misspellings or data imported into the wrong fields. A database cannot detect whether a customer's last name was entered in the first-name field or vice versa; erroneous data is therefore produced unless it is detected and corrected at the moment of input.
System design is another reason for producing bad data. Wand and Wang (1996, p. 91)
discuss four states of design deficiencies in systems: (1) incomplete, (2) ambiguous, (3)
meaningless and (4) incorrect. These states are based on deficiencies that appear when user
definitions (what users expect to see in the system) are improperly mapped to the system’s
values.
(1) Incomplete state occurs when there is no system value to represent a user definition. This state can lead to inaccurate and incomplete data;
(2) Ambiguous state appears when two or more user definitions are represented by the same value. In this case, precision and accuracy are affected;
(3) Meaningless state produces irrelevant data that cannot be used for any of the requirements. It is an "orphan" value that stays in the system unused. This state may have no immediate effect on data, but in the future it may lead to ambiguity or incorrectness if a new user definition is required and happens to map to that same "orphan" value;
(4) Incorrect state appears when data refers to the wrong user definition, making the data incorrect and unreliable.
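Three of the four states can be detected structurally from the mapping between user definitions and system values; incorrectness cannot, because it requires knowing the true mapping. A minimal sketch with hypothetical customer states:

```python
# Hypothetical user definitions (real-world states) and system values.
user_defs = {"active customer", "former customer", "prospect"}
system_values = {"A", "F", "X"}

# The mapping actually implemented in the system (deliberately flawed):
# "former customer" shares "A" with "active customer", "prospect" is
# unmapped, and "F"/"X" are mapped to by nothing.
mapping = {"active customer": "A", "former customer": "A"}

def incomplete(defs, m):
    """(1) User definitions with no system value."""
    return {d for d in defs if d not in m}

def ambiguous(m):
    """(2) User definitions that share one system value."""
    by_value = {}
    for d, v in m.items():
        by_value.setdefault(v, set()).add(d)
    return {d for ds in by_value.values() if len(ds) > 1 for d in ds}

def meaningless(values, m):
    """(3) System values no user definition maps to."""
    return values - set(m.values())

print(incomplete(user_defs, mapping))                     # {'prospect'}
print(meaningless(system_values, mapping) == {"F", "X"})  # True
```

Checks like these could run as part of data profiling, surfacing design deficiencies before they turn into inaccurate or ambiguous records.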
Data issues can be of a technical or business character. Technical data issues refer to data structure and representation. Examples of such technical errors are (Gryz and Pawluk, 2011, p. 3):
- Different or inconsistent standards in structure, format, or values;
- Missing data, default values;
- Spelling errors;
- Data in wrong fields;
- Buried information in free-form fields.
Business issues, on the other hand, are unique to each organization. They refer to the context of the data and appear as a result of incorrect representation of business terms and relations. For example, a customer's address is entered as home instead of work, or a person is linked to transactions he never made.
It is hard to define a general list of business data inconsistencies that covers the irregularities of all organizations, as was possible for technical issues. The best way to detect and define such business errors is through data analysis, which reveals whether the entered data corresponds to the defined business concepts.
Despite the fact that bad data lowers data quality and produces incorrect information, in many cases it is also useful, because it indicates what changes need to be undertaken to improve the systems. As seen in the previous chapter, another way to explore data quality is through the data deficiencies found in systems. Following this approach, the existence of data inconsistencies can reveal the missing factors for data quality. Data errors are the starting point for solving the problems that produce them. Once these problems are detected, appropriate measures can be applied to fix them and improve data quality in the system.
1.4. Data quality improvement
Data quality improvement is a systematic process that occurs in several phases. It starts with finding the source of the problem, continues with cleansing the errors, and ends with setting data standardization rules that prevent future problems.
Data quality improvement proceeds in the following order (Rivard et al, 2009, p. 62):
(1) data profiling - analyzes data to find inconsistencies, data redundancy and incomplete
information;
(2) data cleansing - corrects, standardizes and verifies data;
(3) data integration - semantically links data; reconciles, merges and associates;
(4) data augmentation - improves data by using internal and external sources, and
removes duplicates;
(5) data monitoring - monitors and checks the integrity of data over time.
Figure 5: List of techniques for solving data inconsistencies
Source: F. Rivard et al., Transverse Information Systems: New Solutions for IS and Business Performance, 2009, p. 62
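The five phases can be sketched as a minimal pipeline over a toy customer list; all field names and cleansing rules below are hypothetical.

```python
# Minimal sketch of the five improvement phases on a toy customer list.

raw = [
    {"name": " Ana Kos ", "email": "ANA@EXAMPLE.COM"},
    {"name": "Ana Kos",   "email": "ana@example.com"},
    {"name": "Ben Zu",    "email": None},
]

def profile(rows):
    """(1) Profiling: count missing e-mail addresses."""
    return {"missing_email": sum(r["email"] is None for r in rows)}

def cleanse(rows):
    """(2) Cleansing: trim names, lower-case e-mails."""
    return [{"name": r["name"].strip(),
             "email": r["email"].lower() if r["email"] else None}
            for r in rows]

def integrate(rows):
    """(3)+(4) Integration/augmentation: merge rows sharing a key,
    which also removes duplicates."""
    merged = {}
    for r in rows:
        merged[(r["name"], r["email"])] = r
    return list(merged.values())

def monitor(rows):
    """(5) Monitoring: re-profile after the run."""
    return profile(rows)

clean = integrate(cleanse(raw))
print(len(clean), monitor(clean))  # 2 {'missing_email': 1}
```

Note the ordering matters: cleansing before integration is what makes the two spellings of the first customer collapse into one record, while monitoring simply repeats profiling to track quality over time.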
Various tools support the DQ improvement stages listed in Figure 5; the leaders among them are Informatica, SAP, IBM, and SAS/DataFlux (Gartner, 2012, p. 2).
The DQ improvement phases are unified in master data management, whose concepts and goals will be discussed in the second part of this thesis.
2. MASTER DATA MANAGEMENT
2.1. Definition
The problem of bad data is well known to every company. No enterprise has perfect, error-free data; therefore, companies constantly try to improve data quality and protect their systems from further data inconsistencies. Earlier I discussed four types of enterprise data: transactional, analytical, master data and metadata. All of these types are equally important in every organization, but the core data that describes an organization's business domains is master data. Therefore, the management of this type of enterprise data will be discussed further in the thesis.
There are a number of data stewards, database administrators, software architects, and business analysts who work with different software platforms, data methods and techniques used for data cleansing and governance. All the people, software and methods involved in solving master data errors are united in a discipline called Master Data Management (MDM).
Master Data Management (MDM) is often defined as a software package for improving data quality. In fact, MDM is much more than a software application for data cleansing: it is a dedicated IT discipline that includes people, software tools, and business rules for managing master data. Different sources in the literature share similar views on what MDM is. For example, Berson and Dubov (2010, p. 79) define MDM as a framework of processes and technologies aimed at creating and maintaining an authoritative, reliable, sustainable, accurate and secure data environment that represents a "single and holistic version of the truth" for master data and its relationships, as well as accepted benchmarks used within an enterprise and across enterprises, spanning a diverse set of applications, lines of business, channels, and user communities.
Loshin (2008, p. 8) defines MDM as a collection of best data management practices that
orchestrate key stakeholders, participants, and business clients in incorporating the business
applications, information management methods, and data management tools to implement the
policies, procedures, services, and infrastructure to support the capture, integration, and
subsequent shared use of accurate, timely, consistent, and complete master data.
Figure 6: Workflow of MDM
Source: D. Loshin, MDM - Paradigms and Architectures, 2008, p. 9
In other words, MDM is developed to improve, maintain and govern company’s master data
following the business rules of that enterprise.
Even though there are several types of enterprise data, MDM's main concern is master data. This fact should not underestimate the significance of the other types, analytical and transactional data, but the choice is made because every company's business processes are designed and developed around master data. Master data holds information about the key objects of every enterprise.
There has always been a need for MDM, but in recent years interest has grown constantly, especially in large and complex companies. Many reasons can be found for this need for new data quality management standards. Examples include: (1) lines of business, (2) mergers and acquisitions and (3) new packaged software (Dreibelbis et al, 2008, p. 6 - 11).
(1) Lines of business - What these reasons have in common is that they bring additional data into the system, which is often a different version of already existing data. Lines of business, for example, create different modules within the same enterprise, and each module functions independently. They work with the same master business domains, but each line of business keeps its own master data for the common enterprise objects. In a sales company, customers can make purchases through different channels such as store, online, or catalogue. If each sales channel represents a different line of business, several versions of the same customer may be created, one for each channel.
(2) Mergers and acquisitions - It is very common nowadays for one company to purchase another, or for two companies to merge their businesses into a large enterprise. In such cases, the master domains from both companies are included in the new business, and the same problem as in the first example can appear. Even though I am taking the example of large businesses that may work with different sets of customers, there may still be a group of people stored in both systems. When the two data stores are merged, duplicate data is automatically created.
(3) Packaged software - As a result of SOA architecture and all the independent platforms on the market, companies often decide to use different applications for different business processes. They can use Enterprise Resource Planning (ERP) software to manage their sales, purchases and stock, or Customer Relationship Management (CRM) software to manage their customers. In both cases, there needs to be some connection between these different applications, so they can "communicate" and share the same data for the key objects of the company. MDM is the link in this case.
Among all the existing ERP, CRM and SCM solutions, the question often arises why companies need another management tool when there are already so many on the market. Why can't the existing management solutions, which were on the market long before MDM appeared, solve the problems just described? The answer lies in the following four factors (Loshin, 2008, p. 13):
(1) Despite the recognition of their expected business value, to some extent many of the
aspects of these earlier projects were technology driven and the technical challenges
often eclipsed the original business need, creating an environment that was
information technology centric. IT-driven projects had characteristics that suggest
impending doom: large budgets, little oversight, long schedules, and few early
business deliverables;
(2) MDM’s focus is not necessarily to create yet another silo consisting of copies of
enterprise data (which would then itself be subject to inconsistency) but rather to
integrate methods for managed access to a consistent, unified view of enterprise data
objects;
(3) These systems are seen as independent applications that address a particular stand-
alone solution, with limited ability to embed the technologies within a set of business
processes guided by policies for data governance, data quality, and information
sharing;
(4) An analytical application’s results are only as good as the organization’s ability both
to take action on discovered knowledge and to measure performance improvements
attributable to those decisions. Most of these early projects did not properly prepare
the organization along these lines.
From all that was stated above, MDM is not a new technology or approach for improving data quality, but rather a standardization of the data management workflow, something that was not formally defined before. There were data stewards and data management methods used in different systems, times and places, but they did not belong to any category, even though they did the same job: data integration, cleansing, governance and sharing. MDM is now that category, and it expands with every new master data management method that is defined. Considering the serious role it has in governing master data, MDM has yet to develop and prove itself as an efficient tool for data quality improvement.
2.2. Goals of MDM
Most of the literature refers to the creation of a single source of truth for master data as the main goal of MDM. Yang (2005, p. 3), for example, states that the main goal of MDM is to allow unrelated applications to share a common pool of synchronized data. According to Berson and Dubov (2007, p. 3), the focus of MDM is to create an integrated, accurate, timely and complete set of data needed to manage and grow the business.
Beyond this "golden record" goal, MDM focuses on lowering cost and complexity through standards, and on supporting business intelligence and information integration (Otto, 2011, p. 2).
Some of the most important goals of MDM include (Mauri and Sarka, 2011, p. 17):
- Unifying or at least harmonizing master data between different transactional or operational systems;
- Maintaining multiple versions of master data for different needs in different operational systems;
- Integrating master data for analytical and CRM systems;
- Maintaining history for analytical systems;
- Capturing information about hierarchies in master data, especially useful for analytical applications;
- Supporting compliance with government regulations (e.g., Sarbanes-Oxley) through auditing and versioning;
- Having a clear CRUD process through a prescribed workflow;
- Maximizing Return On Investment (ROI) through the reuse of master data.
The MDM goals in the list above can be summarized into two main goals: one that strives to cleanse the data, and another that tries to keep the data clean. In other words, the goal of MDM is a two-step process that increases data quality: the first step is to review, organize and cleanse existing data; the second is its maintenance and governance.
As the list by Mauri and Sarka (2011, p. 17) shows, MDM has numerous goals to accomplish; defining MDM merely as the creation of a single version of data for enterprise key objects is therefore a partial explanation that does not cover the whole issue of data quality. That may be the final point to accomplish when MDM is implemented for the first time, but data management does not stop there. Since data changes on a daily basis, a one-time data reorganization and cleansing does not solve the bad data problem, because with each new data load the problem may reappear. What a company needs is a long-term solution, and this is achieved in the second step of MDM goal realization: the constant governance of data quality standards.
MDM is a long-term solution for keeping enterprise data quality at a satisfactory level. Each MDM project should strive to achieve the top data quality level and show proactiveness in managing data. Creating a single source of data and governing it provides the flexibility for an organization to grow and increase its information pool without confronting issues of redundant data.
Figure 7: The data quality activity levels
Source: H. Haapasalo et al, Managing one master data – challenges and preconditions, 2011, p. 158
In order to accomplish the desired goal, MDM should have a business focus instead of a technology focus (Loshin, 2009). MDM's main concern is master data, the type of data that defines the business in each enterprise; technology in this case is just a tool for realizing the management, whereas business standards are the core issues MDM should deal with.
In addition to this, Smith and McKeen (2008) have defined four prerequisites for successful MDM: (1) developing an enterprise information policy, (2) defining business ownership, (3) data governance and (4) the role of IT systems. In this list, only the last prerequisite involves IT; the other three are all business-focused.
2.3. MDM Activities
MDM provides the following activities to accomplish the goals discussed: (1) profile, (2)
consolidate, (3) govern, (4) share and (5) leverage. These five categories contain different
methods, techniques and tools that support activities appropriate for each of the groups (An
Oracle White Paper, 2011, p. 14).
(1) Profile - This first phase of data management examines the current data quality state of all sources. It is essentially a data assessment that checks whether the current data follows predefined rules in the master repository: for example, the completeness of the data, the distribution of values, or the acceptable range of values. Profiling can also be done during data import and data integration tasks;
(2) Consolidate - In this phase, data from different sources is integrated. Depending on the MDM architecture, data is integrated into the master repository, or key references to external applications are created or updated;
(3) Govern - Major changes can happen in this phase because the actual data updates occur here. Deduplication, cleansing, updates and deletions are performed based on the assessment results provided by data profiling;
(4) Share - Once data is cleansed, it is passed on to external systems. Master data synchronization between the master repository and external applications is supported by SOA, an architecture that allows data sharing between different system platforms;
(5) Leverage - This last phase is used for analytical purposes. Enriched master data is a great source for BI tools and gives a complete view of the master business domains.
Figure 8: MDM Activities
Source: An Oracle White Paper, 2011, p. 14
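The consolidate and govern activities ultimately produce a "golden record". Below is a minimal sketch of one common survivorship rule (the newest non-empty value wins), with hypothetical CRM and ERP source records; real MDM tools apply far richer matching and precedence rules.

```python
# Sketch: consolidating two source records into a "golden record" using a
# simple survivorship rule: the most recent non-empty value wins.
# Field names, values and the rule itself are hypothetical.

crm = {"name": "Ana Kos", "phone": "",         "updated": 2012}
erp = {"name": "A. Kos",  "phone": "555-0101", "updated": 2011}

def golden(*sources):
    """Build one consolidated record from several source records."""
    fields = {k for s in sources for k in s if k != "updated"}
    # Prefer the newest source; fall back to older ones for empty values.
    ordered = sorted(sources, key=lambda s: s["updated"], reverse=True)
    record = {}
    for f in sorted(fields):
        record[f] = next((s[f] for s in ordered if s.get(f)), None)
    return record

print(golden(crm, erp))  # {'name': 'Ana Kos', 'phone': '555-0101'}
```

The empty CRM phone number illustrates why survivorship must look past the newest source: a blindly "latest wins" rule would have discarded the only usable phone number.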
Managing master data follows this order; it is logical that the workflow starts with data analysis and ends with data reporting. However, not all phases occur with the same frequency. It would be very expensive and time-consuming for an organization to run column analysis or matching on a daily basis. Sharing of data, on the other hand, may be more frequent, especially if external applications send direct requests for data retrieval.
2.4. Benefits from MDM
A successful MDM solution can bring positive value to an enterprise, providing benefits of both an intangible and a tangible character. Intangible benefits are seen in the following areas: (1) data quality, (2) business processes and (3) users and customers (Dreibelbis, 2008, p. 37).
(1) MDM offers improved data quality, seen through some of the dimensions discussed at the beginning. Better accuracy, consistency and completeness are a few of the dimensions improved by this strategy. In addition, the same version of data is shared across the system and used by various applications;
(2) Business processes and workflows are better organized thanks to correct data. They are not only improved by the data they work with, but are also reorganized to produce and maintain correct data that results in reliable information. This reorganization of business processes also has a predictive nature: it helps detect the most valuable data and trigger new business innovations and more profitable decisions for the future progress of the enterprise;
(3) Users' trust in data is restored, because they can now rely on the same version of data across the information system. Customers are also more satisfied, because most of the delays and irregularities that resulted from wrong data are greatly reduced by MDM.
Tangible benefits are seen in the actual profit that organizations gain after implementing
MDM. An example of such quantitative data is shown in Table 1. The benefits with the highest
amounts in the table are sales, customer loyalty (which again leads to increased sales) and
efficiency (of sales representatives and IT systems). Based on these facts, MDM improves
organizational work from both a business and a technical perspective.
Table 1: An example estimating the positive impact of customer MDM
Source: Building the Business Case for Master Data Management (MDM), 2011, p. 9
MDM is often identified with MDM software. This confusion arises because MDM
applications embody data management processes in the most concrete manner. What follows
is a review of four MDM product lines. The discussion covers their architecture,
processes and the usage of MDM in enterprises.
3. MASTER DATA MANAGEMENT SOLUTIONS
3.1. Historical review of MDM solutions
As seen from the definitions of MDM in the previous chapter, every process or person
involved in data quality improvement is part of MDM. Data mining, cleansing, redefining
business rules, changing application logic: it is all part of managing data. Therefore, I can say
that data management appeared with the first introduction of databases. However, the
standardization of methods and rules has become more widespread in recent years.
The first attempts at managing data are found in CRM and ERP applications. However, the
main problem with these applications was that they managed only their own data and
could not provide a single common source of master data shared between different
solutions. Master Data Management appeared in the late 1990s with the release of Customer
Data Integration (CDI) and Product Information Management (PIM) products on the market.
The development of MDM applications has historically followed two directions: (1) functionality
centric and (2) domain centric.
(1) The first approaches to MDM were made through data warehouses. However, this way of
managing data did not prove efficient. The idea was to centralize data in one place.
But managing does not mean just keeping everything in the same storage; it also
requires some functionality on top, which was missing in the data warehouse
approach.
The second idea for managing data was through enterprise application integration
(EAI). The development of this technology made it possible for different applications
to work together and exchange data. The missing part in this case was the central storage
that would keep the single source of truth. MDM evolved from these two ideas as
common ground that creates and maintains the single source of truth and also shares it
with the various applications in the system;
(2) Because master data represents the key objects of a business, customer and product
are the main domains found in enterprise data. Understandably, MDM started with
customer master data implemented in the well-known customer data integration (CDI)
applications. Customer data models were initially of an account-centric design, which
means that they were designed around the different roles customers can have in the
system (buyer, sales person, administrator etc.). Because the number of such business
models grew with the growth of customer data, it became more difficult for
organizations to maintain several databases for just one type of entity and to
consolidate data from all of them. Therefore, the account-centric model was replaced
with the entity-centric model, which uses one schema design for buyers, sales persons,
administrators and client organizations; they are all covered by the Customer
domain. After the solution for the customer domain, vendors came up with product
information management (PIM), applications that support product master data.
Nowadays, the latest trend is to implement several domain types in one master data
management application, called multi-domain MDM.
The evolution of IBM MDM applications is a great example of these two development
directions (IBM Multiform Master Data Management, 2007). In the development cycle of
IBM MDM applications there are two significant points: the first is the transition from a
data-centric tool to a functional-centric application, and the second is the
transition from a single focused usage style or domain to a multiform application. These two
points are important in MDM evolution, because they represent the culmination of problems
found in the data management tools of the time, which caused the transition from one
approach to another.
The first approach MDM used was through index and reference tools. In this case
there was no significant storage for keeping the master data; only the indexes (IDs,
references) were kept in a single repository. This approach could show the various versions of
data for the master domains, but it lacked the functionality to deal with them and resolve
them. This is the point where the first evolutionary chasm appeared, causing MDM
solutions to develop as applications from that time on.
The second chasm appeared while MDM was being developed as a functional approach with
its own physical storage of data as well as functions to manage the data. Initial MDM
applications were focused either on a unique usage style or on one domain. Such an approach
created difficulties in the exchange of master data between different domains. Given that
enterprises have different lines of business and multiple domains, it was hard to merge and
maintain data from uniform MDM applications. This problem introduced the next step in the
development of MDM applications: the launch of multiform MDM applications. Multiform
MDM applications are functional-centric solutions that support various domains as well as
various usage styles. Several vendors still produce single-domain applications, but their
mission and vision are directed towards multiform applications. An example of MDM
evolution can be seen in the graph below.
Figure 9: Evolution of IBM MDM applications
Source: IBM Multiform Master Data Management: The evolution of MDM applications, 2007, p. 9
MDM solutions grow and develop along with technology. The newest trend, cloud
computing, is also present in this discipline. The focus of MDM nowadays is on
developing multi-domain cloud solutions.
3.2. Functionalities, concepts and architecture
The benefits of Master Data Management are best seen when an MDM solution is implemented
in an enterprise and used to manage its data. System, application and hub are terms that all
refer to an MDM system in general, which is why they are used interchangeably further in the
text.
An MDM system is a solution that creates a single version of master data, maintains master
data through various processes and makes it available to the other legacy applications in the
information system.
Per Gryz and Pawluk (2011, p. 2-3), MDM solutions should offer the following
functionalities:
- Consolidate data locked within the native systems and applications;
- Manage common data and common data processes independently with functionality
for use in business processes;
- Trigger business processes that originate from data change;
- Provide a single understanding of the domain (customer, product, account, location) for
the enterprise.
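The third functionality in the list, triggering business processes from data changes, can be sketched with a simple observer pattern. Everything in this example (class names, fields, the re-verification workflow) is an invented illustration under that assumption, not the API of any MDM product.

```python
# Illustrative sketch: a master record that triggers a business process
# (here, an address re-verification) when one of its fields changes.
# All names are hypothetical.

class MasterRecord:
    def __init__(self, data):
        self._data = dict(data)
        self._listeners = []

    def on_change(self, callback):
        """Register a business process to run on data change."""
        self._listeners.append(callback)

    def update(self, field, value):
        old = self._data.get(field)
        if old != value:
            self._data[field] = value
            for cb in self._listeners:
                cb(field, old, value)

log = []
customer = MasterRecord({"name": "ACME", "address": "Old Street 1"})
# e.g. a change of address could trigger a re-verification workflow
customer.on_change(lambda f, old, new: log.append(f"re-verify {f}: {old} -> {new}"))
customer.update("address", "New Street 5")
print(log)   # ['re-verify address: Old Street 1 -> New Street 5']
```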
The functionalities of MDM are mainly developed to support data unification and are
manifested through import and export of data, business rules, validation and any other method
that assists in data consolidation and transfer. Even though different vendors try to provide
different functionalities so they can lead the MDM area, there are still some basic concepts on
which MDM solutions are built.
The best way to describe an MDM system's data management functionality, working concepts
and architecture is through the three-dimensional model. This model is a shortened
version of the 30-viewpoint framework proposed by Zachman.
The main dimensions that describe MDM systems are: (1) domain, (2) methods of use and (3)
implementation styles. There are three main guidelines that define the scope of the three-
dimensional model (Dreibelbis et al., 2008, p. 12):
(1) Business scope determines the number of domains;
(2) Primary business drivers determine the methods of use;
(3) Data volatility (instability) determines the implementation styles.
Figure 10: Dimensions of master data management
Source: A. Dreibelbis et al, Enterprise Master Data Management, 2008, p. 12
(1) The first dimension, domain, is based on the nature of the business and the type of
master data it works with. Each enterprise has different lines of business which work
with various key objects. The most common domains are customer, product and account.
The Location domain is often added to this list. However, this classification can be
further expanded with new domains, depending on enterprise requirements. The names of
these domains are largely self-explanatory and describe the business objects they
cover. Customer covers people, organizations and all the roles they can have in the
system, for example supplier, buyer, employee, employer etc. Products depend on the
lines of business and cover the various items a company may work with. The Account
domain covers relationships between customers and products. Depending on the type
of business there are different types of accounts as well: checking, savings, student
accounts etc. Based on the number of domains an MDM can support, there are single-
domain MDM solutions as well as multi-domain solutions which work with several
different domains;
(2) The second dimension, methods of use, is defined according to the different purposes of
use each business has. Based on this dimension, MDM systems can belong to three
groups: collaborative, operational and analytical. Collaborative MDM is used to
support complex workflows and data that comes from different sources. The best
example of such usage is the introduction of a new item (product) into the system.
In this case there is a list of people involved in defining the product properties,
approving them and launching the product on the market. Data validations,
integration of different properties as well as triggering approvals for the item are all
supported by collaborative MDM. The main functionalities this style of MDM
solution should have are task management, data validation and integration of
properties from different legacy applications.
Operational MDM acts as an Online Transaction Processing (OLTP) system that
responds to requests from multiple applications and users. However, this type of
MDM supports processes that are predefined by the MDM users, and does not
have the decisive role of collaborative MDM. The operational method of use is
best seen in SOA services as well as basic database operations, where MDM supports
transactions that retrieve, update, create and delete data.
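An operational hub answering such create/retrieve/update/delete requests can be reduced to a toy in-memory class. This is a sketch of the general idea only; the class and method names are my own, not those of any product's service interface.

```python
# A toy operational-MDM hub: an in-memory store answering the four
# transaction types (create, retrieve, update, delete) that external
# applications would normally call as SOA services. Names hypothetical.

class MdmHub:
    def __init__(self):
        self._store = {}
        self._next_id = 1

    def create(self, record):
        rid = self._next_id
        self._next_id += 1
        self._store[rid] = dict(record)   # copy, so callers can't mutate it
        return rid

    def retrieve(self, rid):
        return self._store.get(rid)

    def update(self, rid, **changes):
        self._store[rid].update(changes)

    def delete(self, rid):
        self._store.pop(rid, None)

hub = MdmHub()
rid = hub.create({"name": "ACME", "country": "SI"})
hub.update(rid, country="AT")
print(hub.retrieve(rid))   # {'name': 'ACME', 'country': 'AT'}
hub.delete(rid)
print(hub.retrieve(rid))   # None
```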
Analytical MDM has a completely different method of use: it is about the
intersection between Business Intelligence (BI) and MDM. It is a one-way
communication where data from different systems is sent to the MDM hub for
consolidation and preparation for analytical systems. Since the MDM repository
stores all master data cleansed, organized and managed, it is an excellent source for
OLAP, star schemas for data warehouses, data mining, predictive analysis based on
scoring etc.
However, MDM systems cannot be strictly divided into these three categories.
Depending on the different business processes in each enterprise and the frequency of
their change, MDM systems can often cross over from one type to another.
(3) The last dimension, implementation styles, is based on the different ways data
attributes can be stored in the system. This dimension covers the various architectural
styles of MDM. There are four general implementation styles defined throughout the
literature: external reference, registry, reconciliation engine and transaction hub.
The external reference architecture is the simplest MDM solution. It acts as a system
of reference instead of a system of record, because it does not contain actual data,
only references to data which remains in the legacy systems. The external reference
architecture may be simple and easy to implement, but it lacks control over its data.
All this architecture can provide is a reference to data in a legacy system; any further
functionality is disabled because MDM has no access to the data itself.
The registry style is at a higher architectural level, where the MDM solution is a
limited-size data store that contains only unique identity attributes. This means
that instead of containing all data from several applications in one storage, the MDM
system stores only unique attributes for an object, such as ID, name and description,
and references the other data attributes that remain in the legacy systems. This
implementation style is a step forward in MDM development, because it stores some
basic information and also integrates data from different systems with the help of
references. The disadvantage of this architecture is that MDM still does not have all
data available. Keeping references still does not solve the problem of bad data. Also,
it often cannot retrieve all information due to legacy system failures.
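The registry idea, identity attributes plus references, can be sketched in a few lines. The system names, record IDs and fields below are invented for illustration; a real registry would call out to the legacy systems over the network, which is exactly where the failure risk mentioned above comes in.

```python
# Registry-style sketch: the MDM registry keeps only identity attributes
# plus references into the legacy systems that hold the remaining data.
# All system names, IDs and fields are hypothetical.

legacy_crm = {"C-17": {"phone": "+386 1 234", "segment": "retail"}}
legacy_erp = {"9001": {"credit_limit": 5000}}

registry = {
    1: {"name": "ACME Corp",                       # identity attribute
        "refs": [("crm", "C-17"), ("erp", "9001")]},  # references only
}

def resolve(master_id):
    """Assemble the full view by following references into legacy systems."""
    sources = {"crm": legacy_crm, "erp": legacy_erp}
    entry = registry[master_id]
    full = {"name": entry["name"]}
    for system, key in entry["refs"]:
        full.update(sources[system].get(key, {}))  # empty if the system fails
    return full

print(resolve(1))
# {'name': 'ACME Corp', 'phone': '+386 1 234', 'segment': 'retail', 'credit_limit': 5000}
```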
Reconciliation engine – In this architectural style, data can be exchanged in both
directions: from the MDM database to legacy applications and vice
versa. The MDM system can store the complete set of data attributes for a domain, but
it is not the only place that manages data. Legacy applications still manage their data
and synchronize it with the copy stored in the MDM system. The ongoing matching
and synchronization in the MDM repository keeps the master data up to date. The main
challenge in this architecture is that data is still changed in other systems,
which can often cause unreliable data in the MDM system. With the growth of data
attributes in external sources, it becomes more difficult and complex to keep up with
the synchronization updates.
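One possible reconciliation rule is last-writer-wins: when copies diverge, keep the field value with the newest change timestamp. This is a simplified sketch under that assumption; real reconciliation engines use far richer survivorship rules, and all record shapes here are invented.

```python
# Reconciliation-engine sketch: the MDM copy is periodically reconciled
# with legacy copies. Each field carries a (value, timestamp) pair and
# the most recently changed value wins (an illustrative rule only).

def reconcile(mdm_rec, legacy_recs):
    """Merge field values, keeping the one with the newest timestamp."""
    merged = dict(mdm_rec)
    for rec in legacy_recs:
        for field, (value, ts) in rec.items():
            if field not in merged or ts > merged[field][1]:
                merged[field] = (value, ts)
    return merged

mdm = {"address": ("Old Street 1", 100)}
crm = {"address": ("New Street 5", 250)}   # changed later in the CRM
erp = {"address": ("Old Street 1", 90)}
print(reconcile(mdm, [crm, erp]))
# {'address': ('New Street 5', 250)}
```

The sketch also shows why this style gets harder as attributes multiply: every new field in every legacy source is one more (value, timestamp) pair to track and merge.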
The transaction hub is the most sophisticated architectural style of MDM systems. This
implementation style is the actual system of record for the other applications. The
central data storage is placed in the MDM system, where master data is cleansed,
organized and managed. All the other external systems use the data from the MDM
repository. This architectural concept is the core of master data management and
achieves all the goals of a single version of data. However, the complexity
of this structure brings some difficulties when implementing it among external legacy
systems. There are two major changes that need to be made during the implementation
of this architecture: (1) data needs to be integrated and centralized into one data
storage and (2) the other systems need to be changed to work with the new transaction
hub. The idea of the transaction hub fulfills all requirements for data quality
improvement, but its realization can be impossible in some large enterprises with
complex systems.
The fourth implementation style, the transaction hub, is shown in Figures 11 and 12.
As seen in both figures, data from external processes is imported with an Extract,
Transform and Load (ETL) process and accessed through different user interfaces
(UI).
Figure 11: Traditional MDM architecture
Source: A. Berson and L. Dubov, Master Data Management and Data Governance, 2011, p. 108
Figure 12 shows a more advanced model practiced in the latest solutions, where MDM
architects try to solve the problem of data sharing between the MDM repository and
external applications, including various SOA services. The goal is to make the MDM
solution a metadata-driven SOA platform that provides and consumes services that
allow the enterprise to resolve master entities and relationships and move from
traditional account-centric legacy systems to a new entity-centric model rapidly and
incrementally (Berson and Dubov, 2011, p. 85).
Figure 12: MDM architecture with additional published services
Source: A. Berson and L. Dubov, Master Data Management and Data Governance, 2011, p. 108
As seen from the discussion above, there are different styles of MDM systems based on the
three-dimensional model. Which approach is chosen depends on the business requirements of
the enterprise. In recent years, trying to serve all types of business, vendors have been moving
towards multiform MDM systems: solutions that implement various domains, methods of use
and implementation styles in order to be suitable for every type of business.
3.3. Architecture of MDM described through selected MDM solutions
Despite the great variety of MDM solutions on the market, I chose Microsoft, IBM, Oracle
and SAP because they are well-known and well-established vendors of database software
as well as Business Intelligence (BI) solutions. According to Gartner (2012, p. 2), in 2012
they were placed in the leaders' quadrant for Data Warehouse Database Management Systems
and also for BI platforms. Since MDM systems' main concern is data management and they
are also involved in BI processes, I was curious to find out what these leaders have to offer
on the MDM market.
Table 2: Gartner's Magic Quadrants for Data Warehouse Database Management Systems and
for BI Platforms
Source: Gartner, 2012, p. 2
I will present a short overview of the MDM solutions of the four vendors mentioned above.
The following concepts will be covered for each of them:
History of development;
Data modeling;
Data import and export;
Data validation;
Data security;
Advantages and disadvantages.
3.3.1. Microsoft Master Data Services
Master Data Services (MDS) is a product which Microsoft acquired from Stratature in 2007.
Already a customer of Stratature, Microsoft had been impressed with the rapid time to value
and the ease of customization that Stratature’s +EDM product provided. Microsoft initially
planned to ship its MDM solution as part of SharePoint, because information workers are the
primary consumers of master data. However, because IT plays a significant role in managing
MDM solutions, MDS moved to the Server and Tools division and became a feature of SQL
Server 2008 R2 (Graham and Selhorn, 2011, p. 6). MDS can be installed as an additional
feature of SQL Server 2008 R2 or any newer version.
Data modeling. MDS comes with a blank database, which means that there is no data in
the MDM repository, no tables and no pre-defined data models. There is a metadata model
that comes with every installation, as well as sample models for Product, Customer and
Account, but they serve more as examples than as templates that can be used as a starting
point for developing a data model. The MDS model is based on the relational database model,
just with different terminology. There are four data objects made available to users: entity,
attribute, member and hierarchy, each corresponding to a certain object in the relational data
model (Graham and Selhorn, 2011, p. 56). The table below relates MDS and relational
database objects.
Table 3: MDS repository objects vs. Relational database objects
MDS repository Relational database
Entity Table
Attribute Column
Member Row
Hierarchy Relationship
MDS supports several models in a repository; however, it allows relationships only between
entities from the same model. Hierarchical relationships are supported, and this parent-child
structure allows the grouping of data into collections, hierarchical groups for better
organization and maintenance of data. Figure 13 shows an example of the MDS Model
explorer. As seen in this picture, there are several models: Chart of Accounts, Customer and
Product, each having its objects organized in a tree structure. Hierarchies are supported only
between entities of the Product model, while relationships between objects from Product and
any other model are not allowed.
Figure 13: MDS data model
Source: Bullerwell, Kashel & Kent, Microsoft SQL Server 2008 R2 Master Data Services, 2011, p. 75
Data import is done through an ETL package created in SSIS. There is no feature that
allows a direct connection between SSIS and MDS, so special skills are required to set up
the whole loading process. However, MDS does not rely on Microsoft SSIS alone; it can also
use ETL tools from other vendors such as Informatica or InfoSphere DataStage. In order to
protect data, each repository comes with staging tables that are copies of the tables of the
existing data model objects. Staging tables are used during data load, to store newly imported
data before it enters the production data model.
Data export is done by publishing subscription views on a defined server. Subscription
views are nothing but views over the tables in MDS. Once exported, data from these
views can be queried in SQL Server Management Studio with a simple SELECT statement.
The subscription views can be set up to refresh every night in case a frequent update of the
data is needed. To be used in other systems, they can be exported as flat files and then
imported into different databases. Web services are also available in this system, so data
import and export can be performed programmatically as calls to such services.
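The staging-then-promote pattern and the read-only subscription view can be simulated with plain data structures. This is a conceptual sketch only: the validation rule, field names and function names are my own, and in MDS itself these steps are staging tables, batches and SQL views rather than Python lists.

```python
# Sketch of the staging pattern: rows land in a staging area first,
# only valid rows are promoted into the model, and the "subscription
# view" is a read-only snapshot of the model. Names hypothetical.

staging, model = [], []

def load_to_staging(rows):
    staging.extend(rows)

def promote():
    """Move only rows that pass basic checks from staging into the model."""
    global staging
    for row in staging:
        if row.get("code") and row.get("name"):   # reject incomplete rows
            model.append(row)
    staging = []                                  # staging is emptied

def subscription_view():
    """A read-only snapshot, analogous to querying a subscription view."""
    return [dict(row) for row in model]

load_to_staging([{"code": "P1", "name": "Widget"},
                 {"code": "", "name": "Bad row"}])
promote()
print(subscription_view())   # [{'code': 'P1', 'name': 'Widget'}]
```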
Data validation is done in different areas of MDS. Data management, validation and
cleansing can be done on data import and also using matching techniques or validation rules.
The first line of defense is data import: since data is first imported into staging tables, they
act as a safeguard against bad data. Also, using the same structure of predefined
tables for every model sets a general standard for the organization of each data domain. Data
is additionally checked when loaded from the staging tables into the actual data model.
Batches that run in MDS check the data for compatibility with the MDM model structure and
report any inconsistencies, like NULLs, improper format, length etc.
The main tool for duplicate detection and data cleansing is the matching operator in MDS.
This operator has two values, Match and Does Not Match, and works on a user-defined
similarity level. The similarity level is a decimal number which defines how precisely two
values must agree to be considered a match; the closer the value is to 1, the more exact the
match must be. To prevent the entry of erroneous data, MDS uses validation rules, which are
also logical operators and return an error when wrong data is entered. Examples of such rules
are the detection of NULLs, defining a range of allowed values for an attribute etc.
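A similarity-level match of this kind can be approximated with a generic string-similarity ratio. In this sketch, Python's `difflib` merely stands in for MDS's internal matching algorithm, which is not public; the 0.8 threshold is an arbitrary example of a user-defined similarity level.

```python
# Sketch of a similarity-level match: two values "Match" when their
# similarity ratio reaches a user-defined level between 0 and 1.
# difflib is a stand-in for the actual MDS matching algorithm.
from difflib import SequenceMatcher

def matches(a, b, similarity_level):
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= similarity_level

print(matches("ACME Corp", "Acme Corp.", 0.8))   # True: near-duplicate
print(matches("ACME Corp", "Globex", 0.8))       # False
```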
MDS also helps speed up some workflow processes by sending emails or
notifications to users who are in charge of some action or approval. These notifications are
usually triggered by data changes. This feature is an attempt at the collaborative method of
use, and a step towards making MDS a solution that fully supports all the defined
methods of use.
Data security is supported through different roles assigned to MDS users. There is an admin
role with full permissions in the system, and specific groups of roles with limited
access to data in MDS, such as browse-only or edit.
Another way of securing data is through the creation of versions, which are snapshots of the
data currently in the MDS repository. Versions are used (1) to track changes that
were made in the past, (2) to roll back changes and (3) to track the version of the model that
each external system is using (Graham and Selhorn, 2011, p. 234). Even though versioning is
not linked to roles and permissions, the ability to save data at a certain point in time can save
the company time and work in case of major database failures.
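The snapshot-and-rollback idea behind versioning can be shown in a few lines. This copy-based sketch is only loosely analogous to MDS versions (which are managed inside the repository itself); the repository contents and function names are invented.

```python
# Versioning sketch: save snapshots of the repository and roll back to
# one of them, loosely analogous to MDS versions. Names hypothetical.
import copy

repository = {"P1": {"name": "Widget", "price": 10}}
versions = []

def save_version():
    versions.append(copy.deepcopy(repository))   # deep copy = true snapshot

def rollback(index):
    global repository
    repository = copy.deepcopy(versions[index])

save_version()                      # version 0
repository["P1"]["price"] = 99      # a bad update slips in
rollback(0)                         # restore the snapshot
print(repository["P1"]["price"])    # 10
```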
A new addition in SQL Server 2012 is DQS (Data Quality Services). These services work
on a knowledge base that the organizational data steward maintains. Based on this knowledge
base, DQS cleanses, matches and validates data according to business rules. The
services are integrated with MDS through the MDS Excel Add-in for Matching.
The advantages and disadvantages of MDS are presented in Table 4. Based on the discussed
functionalities and architecture, there are many ways in which MDS can improve the quality
of organizational data. However, due to various limits in its design, the user still needs to use
workarounds and extend the data model to support certain business requirements related to
the enterprise data.
Table 4: Advantages and disadvantages of MDS

Advantages:
- Domain neutral; does not limit the user to specific types of master objects
- Familiar database structure similar to an RDBMS
- Simple interface that does not require IT skills or programming knowledge
- Versioning of data is enabled

Disadvantages:
- No prebuilt data model, which requires time, work and user knowledge to define a data
model for each domain
- No relationships are allowed between different domains
- No support for multi-valued attributes, which requires additional tables and relationships
for their implementation
- Data import and export are done with different tools and require special skills and
knowledge to set up the whole loading environment
Despite the fact that Microsoft has long been in the database management software market
and is one of the leaders according to Gartner's Magic Quadrant for 2012, it did not earn a
place in the Magic Quadrant for MDM applications. MDS is a simple solution for small and
medium enterprises, but due to the limited functionality discussed earlier it cannot support
complex enterprise businesses. The way MDS is designed now, all it can offer is a simple
user interface for data model structure and maintenance in limited database scenarios; it
requires a lot of remodeling and additional functionality to fully develop in the following
areas: data import and export tooling, integration with other external systems, support for
complex decision workflows, thorough data analysis functionality and support for
relationships between different domains.
3.3.2. SAP NetWeaver
SAP NetWeaver MDM is part of the NetWeaver computing platform, consisting of several
core products such as Application Server, Business Intelligence, Enterprise Portal etc. MDM
was introduced into this family of products in 2004, when SAP purchased a small vendor in
the PIM space called A2i (Ferguson, 2004). Because this code was specifically intended for
the product domain, the first release, SAP MDM 5.5, was considered a PIM solution rather
than a general MDM system. In 2008 SAP released an enhanced version of MDM, called SAP
NetWeaver MDM 7.1, and a year later it launched a full MDM suite containing various
applications as well as improved MDM technology to build pre-packaged business scenarios
and integration. The current version of the SAP NetWeaver MDM Suite contains the
following components (Rao, 2011, p. 21):
MDM Import Manager;
MDM Import Server;
MDM Data Manager;
MDM Syndication Server;
MDM Syndicator.
Data modeling. The SAP MDM solution stores its data in a central MDM repository. It is a
complex structure of several table types, so that it can store different kinds of data, from
simple integers to PDF files and pictures.
Figure 14: Table types
Source: L. Heilig et al, SAP NetWeaver™ Master Data Management, 2007, p. 192
The SAP data model resembles a star schema. Master data attributes are stored in main
tables, which are flat tables in most cases. They contain the main data attributes and
references to subtables where additional data attributes of the master objects are stored. The
MDM repository supports various types of data fields. A novelty here is that multi-valued
attributes are supported, which is a plus because it "saves" the repository from the additional
tables that would otherwise be needed to store the values of such attributes. Also,
relationships and hierarchies are implemented in the same way as in the relational model
(Heilig et al., 2007, p. 193).
This MDM system can support any type of domain. In addition to the neutral environment
that SAP offers, there are also several predefined repositories for the Customer,
Employee, Supplier, Material, Business Partner and Product domains, which can be used as a
starting point for data model development and extended depending on the business
requirements of each enterprise.
Data import. The SAP MDM suite has automated the process of loading data into the
repository. Also, importing is done field by field instead of record by record, which
significantly speeds up the process. There are two ways in which data is imported into the
system. Both are done in the Import Manager, but SAP MDM allows either loading the
actual data into its tables or assigning key mapping pairs, used mostly for external systems.
In the first case, SAP supports different data sources such as database servers, XML, text and
Excel files. During import, the preparation of the source data requires more time and work
than the import itself. Data needs to be validated, matched and mapped to the existing fields
in the MDM repository. However, this is done only once, and the whole process can be saved
as an import map and reused on the next import.
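The idea of a reusable import map, a once-defined mapping from source columns to repository fields, can be sketched as a simple dictionary. The column names and the helper function below are invented for illustration and do not reflect the Import Manager's actual format.

```python
# Sketch of a reusable "import map": source columns are mapped once to
# repository fields and the map is re-applied on every later import.
# Column and field names are hypothetical.

import_map = {"CUST_NAME": "name", "CUST_CITY": "city"}   # source -> repository

def apply_import_map(source_rows, mapping):
    """Rename mapped columns and drop everything that is not mapped."""
    return [{mapping[col]: val for col, val in row.items() if col in mapping}
            for row in source_rows]

source = [{"CUST_NAME": "ACME", "CUST_CITY": "Ljubljana", "JUNK": "x"}]
print(apply_import_map(source, import_map))
# [{'name': 'ACME', 'city': 'Ljubljana'}]
```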
Another way of integrating data from different systems is through key mapping. Instead of
loading the actual data from the legacy repositories, key-value pairs are created in the SAP
MDM database. They contain a unique MDM ID that is the same for matching records in
different external systems. In this case, the original record ID is kept in the legacy repository
and the MDM unique ID is the link between the external source and the master data stored in
the central database (SAP NetWeaver Master Data Management (MDM). MDM Import
Manager, 2011, p. 407).
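The key-mapping structure itself is small: one MDM ID per real-world entity, mapped to the record IDs each external system uses. All IDs and system names in this sketch are invented; it only illustrates the lookup in both directions.

```python
# Key-mapping sketch: the MDM repository assigns one MDM ID per entity
# and maps it to the record IDs kept in each external system.
# IDs and system names are invented for illustration.

key_mapping = {
    "MDM-001": {"crm": "CUST-17", "erp": "9001"},
    "MDM-002": {"crm": "CUST-42"},
}

def legacy_ids(mdm_id):
    """Which record IDs in which systems represent this master entity?"""
    return key_mapping.get(mdm_id, {})

def mdm_id_for(system, local_id):
    """Reverse lookup: find the MDM ID for a legacy record."""
    for mdm_id, systems in key_mapping.items():
        if systems.get(system) == local_id:
            return mdm_id
    return None

print(legacy_ids("MDM-001"))      # {'crm': 'CUST-17', 'erp': '9001'}
print(mdm_id_for("erp", "9001"))  # MDM-001
```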
Data export is done in a similar manner to data load. SAP MDM calls this process data
syndication. Exporting can also be done automatically, but in this case users need to create
export maps that define the flow of data between the MDM repository and the destination
items. The final product of this export is an XML schema or flat files that are then imported
into other systems. What users need to be careful of are changes in master data that may
happen during export: in case master data is updated while an export is being executed, the
exported file may contain a mix of old and new data.
Figure 15: Key mapping during import and export
Source: L. Heilig et al, SAP NetWeaver™ Master Data Management, 2007, p. 201
Another way of using MDM data is the key mapping technique discussed earlier for
import. External systems can access master data in SAP using the unique MDM IDs
assigned to each record from the legacy systems (Figure 15) (SAP NetWeaver Master Data
Management (MDM). MDM Syndicator, 2012, p. 15-86).
Data validation This MDM suite validates data on several occasions: during data import,
data export and data management. SAP is built in such a way that any work with data
involves data validation and management. Matching is the core functionality used to check
for duplicates, cleanse them from the repository and prevent their import into the system. In
order to detect duplicates and validate data, SAP has put a lot of thought into this process and
developed it as a complex set of rules and strategies. All processes that fall under matching are
used for data validation and for cleansing wrong values from the database. Transformations,
matching functions, matching rules, strategies and substitutions are some of the features that
are part of MDM matching. Identical records are detected during matching based on
user-defined similarity scores. Other matching rules and functions use logical operators to
determine equality between values. The whole process is record centric, which means that for
each record there is a group of zero or more potential matches. Once matches are found, they
are merged into a single record. An additional advantage for data management is the
architectural structure of the MDM database, which supports various types of tables, fields
and relationships.
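The record-centric scoring described above can be sketched as follows. The similarity function and the 70-point threshold here are simplified stand-ins for illustration, not SAP's actual rule engine.

```python
# Minimal sketch of record-centric matching with user-defined similarity
# scores: for each record, collect the group of zero or more potential
# matches whose score reaches the threshold.
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude 0-100 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def find_matches(record, candidates, threshold=70):
    """Return the group of potential matches for one record."""
    matches = []
    for cand in candidates:
        score = similarity(record["name"], cand["name"])
        if score >= threshold:
            matches.append((cand, round(score)))
    return matches

repo = [{"id": 1, "name": "Acme Corp."},
        {"id": 2, "name": "ACME Corporation"},
        {"id": 3, "name": "Globex Ltd."}]
new = {"name": "Acme Corp"}
print(find_matches(new, repo))   # the two Acme variants score high
```

Once such a group is confirmed, the records in it would be merged into a single master record, as described above.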
Data security By default, MDM servers are not password protected and everyone can access
them. Therefore, there has to be an admin user who creates passwords and restricts user
access to the system. There are two levels of password protection: the server level, which
includes password protection for the various applications in the system, and the repository
level, which covers repository passwords and access. User roles and permissions are stored in
separate tables in the MDM repository. For example, the record for a system user contains a
username, a password and a reference to the privileges table where additional permission
values are stored.
Another way of keeping the central data safe is through copies of the repository, supported
by a master/slave concept. The master is where changes occur, and the slave is an auxiliary
repository that is updated by synchronizing with the master repository. There is another
type of slave repository, called a publication slave, that acts as a backup version of the master
repository. Once data is loaded into a publication slave repository it stays unchanged unless
the repository is loaded into the system again and put online for work. Another way to keep
versions of master data is by duplicating the existing MDM repository. This copy of the data
can be saved on other disks and loaded whenever the user needs it (Heilig et al., 2007, p.
211).
Advantages and disadvantages of SAP are listed in Table 5. Most of the advantages are
related to the various types of data objects supported in the database, as well as the automation
of import and export, which greatly facilitates users' work. The disadvantages refer to the
complex interface and the large amount of preparation work required for automated imports
and exports.
Table 5: Advantages and disadvantages of SAP

Advantages:
Domain neutral;
Offers prebuilt repositories for certain domains;
Supports different types of tables, data types and relationships, multivalued attributes;
Automated import and export;
Effective matching rules that cleanse and prevent duplicates in the repository;
Various IT and business scenarios for MDM implementation and usage;
Security architecture that enables different roles for work with the data.

Disadvantages:
Complex interface;
Requires time for the user to get acquainted with it;
Time-consuming preparation process for import and export;
Inconsistent updates and exports of data can lead to a mix of old and new data in exports;
The key mapping approach can bring in data inconsistency;
Not suitable for small enterprises.
The SAP MDM system offers many functions for mastering data. Complex matching processes,
various table objects and a domain-neutral data model create a solution that gives users great
freedom to manage any kind and any type of data. The key mapping functionality allows data
communication with external systems without changes to the code of these legacy
applications. Automated imports and exports make data loads and distribution much faster
and more precise. However, all this freedom of choice brings an additional burden to users
during data preparation: there is a long checklist that needs to be completed before processes
are ready for execution. A helpful circumstance is the ability to save this preparation work for
similar scenarios in the future.
Overall, SAP has largely succeeded in its intention to automate master data management
processes, but the system's functionality needs to be improved so that users have less work in
the preparation phase. Due to its size and complexity, this solution is not appropriate for small
enterprises, but rather for large and complex businesses.
3.3.3. IBM InfoSphere
IBM offers a great variety of products for data integration and management; InfoSphere is the
line of applications that supports these processes. Therefore, I cannot limit this review of
MDM implementation through IBM solutions to just one application, but have to mention
several of them to explain the different MDM processes.
IBM's first MDM developments started in 2004 with acquisitions of products from
different vendors. For example, IBM InfoSphere Information Server was first launched in
2004, when IBM purchased the data integration company Ascential Software and
rebranded its suite as IBM Information Server. The same year IBM also acquired Trigo, a
product MDM software vendor, and renamed its software WebSphere Product Center.
The next year, IBM acquired Customer Data Integration software from DWL and
rebranded the product as WebSphere Customer Center. In 2008 IBM released the full version
of InfoSphere Information Server. IBM Master Data Management Server has a similar
development history: it was released in 2008 and is a combination of IBM's customer
integration tools from WebSphere Customer Center (WCC) with workflow capabilities from
WebSphere Product Center (WPC) (Press release notes from IBM, retrieved February 7,
2013, from http://www-03.ibm.com/press/us/en/index.wss).
Other known products that fall under the IBM InfoSphere brand and are used for managing
data are (Zhu et al., 2011, p. 47):
IBM InfoSphere Blueprint Director;
IBM InfoSphere Business Glossary;
IBM InfoSphere Discovery;
IBM InfoSphere Metadata Workbench;
IBM InfoSphere Asset Manager;
IBM InfoSphere Information Analyzer;
IBM InfoSphere QualityStage;
IBM InfoSphere Audit Stage;
IBM InfoSphere FastTrack;
IBM InfoSphere DataStage;
IBM InfoSphere Data Architect.
IBM offers a rich application suite that covers all processes in master data management, from
documenting business rules, workflows and terminology to cleansing and merging duplicate
records and distributing them to external systems or files. Each component from the list above
performs different functions, and the same functionality can be supported by several
applications.
Data modeling There is no particular database vendor or database schema that IBM follows
during data modeling. Trying to provide a platform-independent product, IBM made an
MDM solution that is domain, software-platform and database neutral. The IBM MDM
repository can be prebuilt, in case there is a data model for the specific domain, or blank,
where the user builds the database from scratch. Planning and building the master repository
is a three-step process supported by three different types of models: (1) logical, (2) domain
and (3) physical (Wilson et al., 2011, p. 60-82):
(1) Logical model - the first step of data model development, where the planning
process occurs. It is a diagram of entities, attributes and relationships that
represents the database structure and the workflows of the business processes that
work with master data;
(2) Domain model - used to define the data tables that will store the future master data. It
follows the logical model, defined above, to "draw" the data objects of the master
repository. The lowest level that can be modeled here is the data field, along with its
data type, length and restrictions (if any). Like the logical model, the domain model
is vendor neutral and is used to set general standards for the master database
architecture;
(3) Physical model - the final step of data modeling, when the actual database is
created. It is a vendor-related model, so users have to choose the appropriate database
management system. Data objects and rules are created based on the concepts defined
in the first two models.
Data models are built in the IBM InfoSphere Data Architect solution. Once the modeling
process is done, InfoSphere Data Architect can generate database-specific data definition
language (DDL) scripts from the physical model. DDL scripts contain statements to create,
update or drop data objects and can be run on a specific database server.
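The step from a physical model to a DDL script can be illustrated with a small sketch. The model structure and table definition below are invented for illustration; Data Architect works from its own model files, not from Python dictionaries.

```python
# Hedged sketch of turning a physical model into a DDL script, in the
# spirit of what a modeling tool does when it emits CREATE statements.
physical_model = {
    "table": "CUSTOMER",                      # hypothetical example table
    "columns": [("CUST_ID", "INTEGER NOT NULL"),
                ("FULL_NAME", "VARCHAR(120)"),
                ("EMAIL", "VARCHAR(254)")],
    "primary_key": "CUST_ID",
}

def to_ddl(model):
    """Render one table of the physical model as a CREATE TABLE statement."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in model["columns"])
    return (f"CREATE TABLE {model['table']} (\n  {cols},\n"
            f"  PRIMARY KEY ({model['primary_key']})\n);")

print(to_ddl(physical_model))
```

The generated script could then be run against the chosen database server, which is exactly the role the vendor-specific physical model plays in the three-step process above.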
Figure 16: Logical model Figure 17: Domain model and physical model
Source: E.Wilson et al, InfoSphere Data Architect, 2011, p. 60, 82
Figure 18: Physical model
Source: E.Wilson et al, InfoSphere Data Architect, 2011, p.90
Once created, existing database models and objects can be updated with another application,
IBM InfoSphere Asset Manager, which is used to import physical models and to create or
update data objects. Like every advanced MDM system, IBM also uses a staging area where
all data changes are stored first; once validations are passed, the changes are applied to the
actual central data storage.
Data modeling is not the only option for designing the master database. IBM also supports
reverse engineering, which allows users to convert already existing data objects into a physical
model. This feature allows the reuse of existing database structures when building the central
repository, instead of starting from scratch. Another advantage is the possibility to easily
compare different databases before merging data from different sources (Wilson et al., 2011,
p. 129).
Data import IBM InfoSphere provides different ways of importing data. Depending on the
database structure and business scenarios, data can be loaded through batch transactions or
through one of the applications mentioned earlier. Batch transaction processing is used when
the database is empty and large amounts of data need to be loaded into the repository. Each
record to be imported is read, parsed and distributed into the appropriate business objects.
MDM assigns a unique identification key that serves as an internal key for every record
imported into the master repository. Data files for this type of import must be in SIF
(Standard Interface Format), a pipe-delimited file format.
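The read-and-parse step for a pipe-delimited file can be sketched as follows. The field layout here is hypothetical, chosen only to show the general shape of a pipe-delimited batch load; it is not the real SIF schema.

```python
# Parse pipe-delimited records of the general shape used in batch loads.
# Field names and sample data are illustrative assumptions.
import csv

sif_lines = ["A1001|Jane|Doe|jane.doe@example.com",
             "A1002|John|Smith|john.smith@example.com"]

FIELDS = ["source_key", "first_name", "last_name", "email"]

def parse_sif(lines):
    """Turn pipe-delimited lines into records keyed by field name."""
    reader = csv.reader(lines, delimiter="|")
    return [dict(zip(FIELDS, row)) for row in reader]

records = parse_sif(sif_lines)
print(records[0]["last_name"])   # -> Doe
```

Each parsed record would then be distributed into the appropriate business objects and assigned its internal identification key, as described above.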
InfoSphere FastTrack is an application used for data import, mostly during updates, merges
and smaller data loads. The most important part of this import is mapping the source file data
to the appropriate database columns in the master repository. The whole process is similar to
the familiar ETL process.
Figure 19: Example of field mappings during data import
Source: IBM InfoSphere FastTrack, 2011, p. 10
Data export Data from the master repository can be shared through direct transfer from
master repository tables to external application tables, or through web services. The first
option is available in InfoSphere FastTrack, and the export process is similar to the import,
except that the data flows in the opposite direction: the mapping is done from master data
objects to external system tables.
Since the IBM MDM architecture is based on SOA, another way of sharing master data is
through web services. External applications can retrieve master data with web service
requests for a certain entity. There is no specific rule about which approach to use; it all
depends on the business scenario and the user's choice.
Data validation IBM InfoSphere validates data in the same manner as the other MDM
solutions: before, during and after import. Techniques for data cleansing and management
are organized in four steps: (1) understand organizational goals and how they determine user
requirements, (2) understand and analyze the nature and content of the source data, (3) design
and develop the jobs that cleanse the data and (4) evaluate the results (IBM InfoSphere
QualityStage, 2011, p. 2-5):
(1) In order to properly manage master data, users need to get acquainted with the
business requirements. The role IBM InfoSphere has in this first step is to assist users
in graphically representing their business rules. As discussed earlier, this is done
while building the logical model and defining business entities;
(2) IBM InfoSphere applications offer different kinds of data analysis. The application
mostly used for analyzing data content is InfoSphere Information Analyzer. It
provides different kinds of analysis, among which column, cross-domain and key
analysis are the best known. Column analysis is performed on the data in a certain
column and gives a general overview of the column's properties as well as detecting
anomalies in the column's records. Cross-domain analysis matches data between
different tables in order to find duplicate and redundant data. Key analysis is used to
detect relationships between tables and columns and to define primary and foreign
keys based on the uniqueness of data.
Another way to explore data content is through matching. IBM InfoSphere tools
provide matching by value and by pattern. Value matching is similar to a free-form
lookup, where data is matched against a given value. Pattern matching looks for data
that matches a given format, such as an SSN or email address; IBM uses regular
expressions to perform it. Below is an example of the results of an SSN pattern
match, listing all tables that contain fields in SSN format.
Figure 20: Example of SSN pattern match
Source: J. Zhu, Metadata Management with IBM InfoSphere Information Server, 2011, p. 241
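The pattern-matching idea from step (2) can be sketched with a regular expression for the common NNN-NN-NNNN SSN layout. The sample values are invented; the real tools apply such patterns across whole columns, not short lists.

```python
# Flag column values that fit the SSN format NNN-NN-NNNN using a regular
# expression, the technique described in step (2) above (sketch only).
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

values = ["123-45-6789", "987654321", "123-45-678", "000-12-3456"]
ssn_like = [v for v in values if SSN_PATTERN.match(v)]
print(ssn_like)   # only the properly formatted values survive
```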
(3) Once data is analyzed, the next step is to define jobs that will match and cleanse it.
IBM MDM offers prebuilt matching jobs that ship with the product; however, users
can define their own matching jobs and rules based on business requirements. The
matching process is similar to the ones described for the other vendors: it is based on
starting points (cutoffs) and weights that measure the similarity of data. An interesting
approach IBM introduces here is speeding up data matching jobs by setting up rules
that group data into different blocks. This approach lowers the number of comparisons
that arise when two columns are matched. Blocking works on a sort-group-divide rule.
However, it may turn into a costly operation that requires building complex subqueries
for data processing and comparison. Also, incorrect data blocks may result in false
negatives, when a record pair that does represent the same entity is not matched
because the records are not members of the same block;
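The blocking idea in step (3) can be sketched as follows: records are grouped by a cheap blocking key, and candidate pairs are formed only within each group. This is illustrative only; real blocking is configured in the tooling, not hand-coded, and the sample records are invented.

```python
# Sketch of sort-group-divide blocking: compare pairs only within blocks,
# which sharply reduces the number of comparisons.
from collections import defaultdict
from itertools import combinations

records = [{"id": 1, "name": "Anna Novak", "zip": "1000"},
           {"id": 2, "name": "Ana Novak",  "zip": "1000"},
           {"id": 3, "name": "Anna Novak", "zip": "2000"}]

def candidate_pairs(recs, block_key):
    blocks = defaultdict(list)
    for r in recs:
        blocks[r[block_key]].append(r)        # group by blocking key
    pairs = []
    for block in blocks.values():             # divide: compare within blocks
        pairs.extend(combinations(block, 2))
    return pairs

pairs = candidate_pairs(records, "zip")
print([(a["id"], b["id"]) for a, b in pairs])   # only (1, 2) is compared
```

Note that records 1 and 3 describe the same person but land in different blocks, so the pair is never compared: exactly the false-negative risk described above.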
(4) The last step in data validation is the evaluation of results and the setting up of rules
to prevent further inconsistencies. When several duplicates are found for one master
entity, IBM MDM rules merge all the unique data representations that refer to the
same master object. The goal is to retain as much information as possible in the
master record.
Figure 21: Example of record merge
Source: IBM InfoSphere QualityStage, 2011, p. 150
Similar to the matching engine that contains predefined matching jobs, IBM also
offers a rule engine to save all user-defined rule jobs so that they can later be reused.
Besides data rules, consistency of master data can also be achieved with data
transformations. This usually applies to common values like gender codes, streets
and addresses: data is transformed into a general format used across the whole
master database.
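The merge in step (4) can be sketched as a simple survivorship rule. The "prefer the longest non-empty value" rule and the sample records are assumptions for illustration; real survivorship rules are configured per field.

```python
# Minimal survivorship sketch: merge duplicate records into one master
# record, keeping the most complete value per field (illustrative rule).
duplicates = [
    {"name": "J. Smith",   "phone": "040-111-222", "email": ""},
    {"name": "John Smith", "phone": "",            "email": "j.smith@example.com"},
]

def merge(records):
    """Build one master record by picking the longest non-empty value per field."""
    master = {}
    for field in records[0]:
        master[field] = max((r[field] for r in records), key=len)
    return master

print(merge(duplicates))   # one record combining the best of both duplicates
```

This mirrors the goal stated above: after the merge, the master record carries the fullest name, the phone from one source and the email from the other.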
Data security Security in the IBM Information suite is based on user/password authentication,
role-based permissions and monitoring. User permissions are checked at several levels. As
mentioned earlier, the IBM InfoSphere platform is based on SOA, and user transactions are
web service requests to the MDM server. Therefore, the security system checks whether the
user has permission to invoke such requests and to make updates, and it also controls the
visibility of data objects in the master repository to which users do or do not have access.
Another benefit of this system is that MDM is configured to keep a history log of all changes,
so changed records can be reconstructed at any time. Monitoring is done when users connect
to the system: an administrator can observe and control their actions (work sessions).
Advantages and disadvantages of IBM InfoSphere - as Table 6 shows, there are far more
advantages than disadvantages, due to the size and variety of this solution. These
characteristics support different models and functionalities that fit different types of business
requirements.
Table 6: Advantages and disadvantages of IBM InfoSphere

Advantages:
Domain neutral, but also offers prebuilt models for the Party, Product and Account domains;
Provides documenting of workflow processes and their further reuse;
Systematic planning of the master data model through several types of models: logical, domain, physical;
Exports of models in reusable files: XML and DDL scripts;
Offers prebuilt matching jobs and a rule engine for rule definitions and their common use;
Variety of data content analyses, data transformations and standardizations;
Compatible with different kinds of databases and platforms;
Blocking process techniques for more efficient matching;
Data can be shared through web requests, so external applications do not have to make major changes to their databases;
Reverse engineering;
Security provided at different levels in the system.

Disadvantages:
The same functionalities are repeated in different applications in the InfoSphere portfolio; many of the applications are intertwined, which can often be confusing to users;
Special file adjustments to the SIF format during batch transaction processing;
Excessive mapping needs to be done before importing data from external sources, and the same applies on export;
The great variety of data transformations and standardizations can change the data substantially and may produce completely new records; such transformations can result in false positives, matches that are not actual matches;
The blocking process during matching can be efficient but also complex and time consuming; irregular block division can cause false negatives, records that match but are not detected because they were placed in different blocks.
IBM InfoSphere is a rich portfolio of tools for data management and integration. It offers a
great variety of applications that cover all the processes of data management, from planning
and modeling to cleansing, merging and distribution to external systems. It supports all
domains, implementation styles and methods of use, is platform independent and is
compatible with all types of software. IBM did not create a solution just for the moment: it
also included features that allow users to save all their documentation, models and rules in a
common knowledge base that they can recall afterwards. Another novelty IBM can be proud
of is reverse engineering, which facilitates system integration during mergers or acquisitions.
However, this solution has a few disadvantages. With the various data transformation
techniques supported in QualityStage, users are given the freedom to transform data for
easier matching. However, there is no limit to how far a user can go in changing data, and
such transformations can often change data in a way that makes it lose its context and no
longer represent the correct business object.
Defining complicated subsets of data for faster matching can be expensive and time
consuming, and it can create wrong results, the false negatives and positives mentioned
earlier. It is good that users have the freedom to work with master data any way they want,
but there should still be some system restrictions that give users guidance and warn them of
possible mistakes.
Another thing I noticed is that many similar features can be found in different applications.
For example, data cleansing can be done in DataStage and QualityStage; data analysis in
Information Analyzer and InfoSphere Discovery; and import and export of data in Metadata
Workbench and Asset Manager, but also in any other component. IBM's intention with this
shared functionality was perhaps to broaden each application's feature set so that users would
not have to work in several applications to get clean data. However, there should be either
one application that supports the whole data management process, or several components
with a precise set of features, so that it would be less confusing for the user.
Overall, IBM InfoSphere is a mature solution that implements great techniques for data
management. Both Information Server and MDM Server can be used for managing data from
large and complex systems, and many modules from InfoSphere can be acquired and used
independently for data analysis and cleansing. Therefore, the InfoSphere line of products is
suitable for all sorts of enterprises and lines of business.
3.3.4. Oracle MDM Suite
Oracle introduced its MDM products ten years ago, starting with programs for managing
customer and product data and ending up with solutions for data management called the
Customer and Product Data Hubs. The idea of developing applications in the MDM area
started internally, when Oracle's E-Business Suite was dealing with customer data quality
issues. Oracle first developed a program to manage the customer data model, called Oracle
Customers Online, and shortly after its release built Oracle Advanced Product Catalogue,
another program in the same suite, to manage product data. With the addition of data quality,
source system management and application integration capabilities, these two products grew
into the Oracle Customer Data Hub and the Oracle Product Hub. A major breakthrough on
the MDM market happened when Oracle acquired Siebel and Hyperion Data Relationship
Management (DRM). After releasing the Customer and Product Hubs, Oracle expanded its
MDM line of products with the Finance, Site and Supplier Hubs (Butler, 2011).
Oracle is currently focused on developing Fusion versions of its existing Hubs. These Fusion
applications are a combination of SOA and MDM: they provide integration, management and
distribution of master data among applications from external systems. So far, the Customer,
Product and Accounting Fusion Hubs are available on the market.
The MDM solutions from this vendor will be discussed through several products from the
Oracle MDM Suite. Below is a list of applications that belong to the Oracle MDM portfolio
(Oracle Master Data Management. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/master-data-
management-ds-075053.pdf, 2010, p. 1):
Enterprise Data Quality;
Oracle Customer Hub;
Oracle Product Hub;
Oracle Supplier Hub;
Oracle Site Hub;
Oracle Higher Education Constituent Hub;
Hyperion Data Relationship Management.
Data modeling Oracle's MDM products come with predefined data models for each
entity. Users cannot start from a blank database, but they can extend the existing tables in the
master repository with new columns. The data models Oracle uses are based on the Trading
Community Architecture (TCA), a data model that allows users to manage complex
information about customers, organizations and customer accounts. The base of this model
is reused and readjusted when designing models for other domains, such as product, site,
etc. Tables in the master repository have standardized names, each starting with the HZ
prefix followed by the name of the entity whose attributes are stored; for example,
HZ_PARTIES stores data for parties and HZ_CONTACT_POINTS for a party's contact
points. The database is relational, organized into tables (entities), columns (attributes) and
relationships (hierarchies) (Oracle Trading Community Architecture, 2006, p. 1).
Figure 22: List of predefined tables for Customer entity
Source: S. Anand, Trading Community Architecture, 2008
Data import Since Oracle provides a predefined data model, data is imported into the HZ
tables discussed earlier. There are several ways to import data: (1) SQL/ETL Load, (2)
D&B Load and (3) File Load (Oracle Trading Community Architecture, 2006, p. 8-18):
(1) SQL/ETL Load: data is first extracted with scripts or tools, values are transformed to
meet the data requirements of the interface tables, and afterwards the data is loaded;
(2) D&B Load: data prepared by D&B is sent in a standard D&B bulk file, which is
then run through the D&B Import Adapter and automatically mapped and loaded into
the interface tables;
(3) File Load: data is loaded from a comma-separated values (CSV) file, or a file
delimited by another allowed character, with Oracle Customers Online (OCO) or
Oracle Customer Data Librarian (CDL).
Before being loaded into the master repository, data is first imported into staging tables,
matched and cleansed, and afterwards imported into the interface tables. The staging tables
are copies of the existing tables and serve as temporary storage for the external data being
imported into the repository. Even after importing data, TCA runs post-import processes for
data standardization. These include various data transformations, such as name conversions
to meet database standards, replacement of letters in phone numbers, removal of NULLs, etc.
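Two of the standardizations just mentioned can be sketched briefly. The functions below are illustrative assumptions about what such transformations do, not Oracle's actual implementation, which is configured rather than hand-coded.

```python
# Sketch of two post-import standardizations: replace letters in phone
# numbers with their keypad digits, and drop NULL-like field values.
KEYPAD = {c: d for d, letters in
          {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
           "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}.items()
          for c in letters}

def standardize_phone(raw):
    """Map vanity letters to digits, leaving digits and punctuation alone."""
    return "".join(KEYPAD.get(ch.upper(), ch) for ch in raw)

def drop_nulls(row):
    """Remove NULL-like values so they do not pollute the master record."""
    return {k: v for k, v in row.items() if v not in (None, "", "NULL")}

print(standardize_phone("1-800-FLOWERS"))          # -> 1-800-3569377
print(drop_nulls({"name": "Acme", "fax": "NULL"})) # fax field is dropped
```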
Data export Data for a certain entity can be exported to an Excel spreadsheet. For data
distribution to external applications, however, Oracle uses cross-referencing, an approach
that assigns a unique key ID to each record in the central repository and maps it to the
corresponding record in the external systems (similar to the key mapping discussed for the
SAP solution earlier).
Figure 23: Example of cross reference between PARTIES (master table) and
SYS_REFERENCES (external systems)
Source: Better Information through Master Data Management – MDM as a Foundation for BI, 2011, p. 9
With the help of the Application Integration Architecture (AIA), data can be shared with
other applications through web services. This enables external applications from different
platforms to receive managed data from the Oracle MDM Hubs.
Since every Oracle MDM hub has its own domain of concern and a different architecture, the
cross-referencing processes also differ. There are two possibilities: one-way and two-way
cross-referencing. In the first approach, data flows one way, from the Hub to the external
applications: data is managed and updated only in the Hub and afterwards sent out to the
other systems. This approach is used in the Product Hub. Two-way cross-referencing is
implemented in the Customer Hub, where data flow is managed in both directions: from the
hub to the external systems and vice versa. Data is managed in the hub, but can also be
updated in the external systems and sent to the master repository for import. Changed data
sent from external systems must pass predefined validations before it is loaded into the
central database. This type of data sharing gives external systems the freedom to use
managed data from the Oracle Hubs without major changes to their legacy applications
(Cross-Referencing for Master Data Management with Oracle Application Integration
Architecture Foundation Pack, 2008, p. 5).
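The one-way and two-way flows just described can be sketched around a cross-reference table in the spirit of Figure 23. System names, IDs and function names here are illustrative assumptions.

```python
# Sketch of a cross-reference table: each master record maps to its
# counterparts in the source systems. Outbound (one-way) publishing fans a
# change out to every mapped system; inbound (two-way) updates are first
# resolved back to the master ID.
xref = [
    {"master_id": 10, "system": "BILLING", "source_id": "B-77"},
    {"master_id": 10, "system": "CRM",     "source_id": "C-12"},
    {"master_id": 11, "system": "CRM",     "source_id": "C-99"},
]

def publish_targets(master_id):
    """One-way flow: where must an update to this master record be sent?"""
    return [(x["system"], x["source_id"]) for x in xref
            if x["master_id"] == master_id]

def resolve_master(system, source_id):
    """Two-way flow: map an inbound change back to the master record."""
    for x in xref:
        if x["system"] == system and x["source_id"] == source_id:
            return x["master_id"]
    return None

print(publish_targets(10))             # -> [('BILLING', 'B-77'), ('CRM', 'C-12')]
print(resolve_master("CRM", "C-99"))   # -> 11
```

In the two-way case, the resolved record would still have to pass the predefined validations before being loaded into the central database.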
Data validation As mentioned earlier, data is checked for errors right after being imported
into the repository. Oracle uses several techniques to validate, cleanse and manage data, most
of them similar to those discussed for the previous MDM solutions. The data validation
techniques are based on transformation, matching and merging, and are part of Data Quality
Management (DQM), the data management mechanism of the TCA model.
Figure 24: Example of data validation workflow
Source: Data Quality Management, 2012, p. 9
Oracle MDM examines data through several steps (Data Quality Management, 2002, p.
2-25):
(1) Step one - transformation functions. These functions include character or blank
space replacement, removal of doubled letters, or any other data changes that enforce
certain standards throughout the database. Oracle also uses word replacement, which
replaces similar word variations with one standard word. Users often enter different
data for the same item, in some cases using the item's full name and in others a
shortcut. To avoid such irregularities, MDM keeps one name per item: if Slovenia
was previously entered in the system as Slovenia, SLO or SI, MDM can replace all
these variations with SLO. Word replacement and transformation functions can
reveal duplicate data throughout the system that was previously hard to detect;
(2) Step two - match rules. Matching is done in a similar manner as in the solutions
discussed earlier. Because a user cannot be familiar with each and every record, the
best way to detect a match is for the user to define matching points that need to be
reached. Oracle calls these values thresholds, and based on such limits the user can
define whether two records are a match or not. Thresholds should be set to an
average value, between 40 and 60, because a small threshold may return results that
are not a match, whereas a high value may exclude a lot of data, and some possible
matches with it;
(3) Step three - duplicate identification and merging. Once data is transformed and the
match rules are defined, subsets of data can be prepared for duplicate identification.
Oracle DQM provides batch jobs that compare records from different groups,
looking for duplicates: each record from one dataset is matched against all records
from the other datasets. Batch jobs can run for a long time if there are many records
to compare, which is why it is often better for the user to define subsets of records
and apply the batch job to those subsets. Once duplicates are identified, merging of
identical records follows; the old record is deleted because it has already been
merged into the new one.
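Steps (1) and (2) can be combined in a small sketch: a word-replacement transformation followed by threshold-based scoring. The replacement table, scoring function and mid-band threshold are simplified stand-ins, not Oracle DQM's actual rules.

```python
# Sketch of word replacement (step 1) feeding threshold matching (step 2):
# variant spellings are first normalized, then scored against a threshold.
from difflib import SequenceMatcher

REPLACEMENTS = {"slovenia": "SLO", "si": "SLO"}   # step 1: word replacement

def transform(value):
    """Replace known variants with the standard word, else just normalize case."""
    return REPLACEMENTS.get(value.strip().lower(), value.strip().upper())

def score(a, b):
    """0-100 similarity of the transformed values."""
    return round(SequenceMatcher(None, transform(a), transform(b)).ratio() * 100)

THRESHOLD = 50   # a mid-band value, per the 40-60 recommendation above

print(transform("Slovenia"), transform("SI"))   # both become SLO
print(score("Slovenia", "SI") >= THRESHOLD)     # a match after transformation
```

Without the transformation step, "Slovenia" and "SI" would score near zero; normalizing first is what lets the threshold rule surface such duplicates.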
An extra feature that Oracle MDM provides in the validation process is monitoring and
managing data decay. Data repositories accumulate large amounts of obsolete data over time;
however, an enterprise cannot always delete this data, because it may still be used for
analyzing historical transactions. Oracle MDM supports this data lifecycle by monitoring
data decay and flagging data as active or passive. This tracking marks the currently active
data and makes it available to the live applications. Data that is no longer used is flagged as
passive and stored in remote locations; it is not accessible to external applications and is
usually used for reporting.
Data security Oracle provides a robust and precise security model that gives users rights to
work with certain data or hierarchies. It is based on roles and authentication, and as in any
other security model, administrators have all rights. Security is set at a granular level that
even controls user access to different versions of the data.
Advantages and disadvantages of Oracle MDM are shown in Table 7. Even though there
are several disadvantages, regarding Oracle's domain-specific solutions and the lack of
support for collaborative MDM, there are still workarounds that can compensate for these
deficiencies.
Table 7: Advantages and disadvantages of Oracle MDM

Advantages:
Supports several domains: Customer, Product, Account, Site;
Prebuilt data models, so users do not have to start from a blank database;
Keeping a copy of the data in staging tables prevents errors on import;
Versioning data saves archives of data changes;
Different ways to import data;
Cross-reference data sharing, which makes Oracle compatible with different kinds of external systems (including non-Oracle ones);
Monitoring data decay and decreasing the data load based on active and passive data;
Automatic batch processes for faster and more efficient duplicate identification.

Disadvantages:
Does not cover all domains;
Data models can be modified, but not built from scratch for a completely new entity;
Hubs are domain dependent: for each new domain Oracle launches a new Hub;
Oracle focuses more on data cleansing and deduplication, but does not offer great support for setting up rules to govern data;
Collaboration is excluded: only the operational and analytical implementation styles are supported (a collaborative implementation style may be implemented in the new line of advanced MDM hubs, called Fusion Hubs).
Based on the various features discussed earlier, the Oracle MDM suite can stand alongside the IBM and SAP MDM solutions with its complex architecture and the various data management functionalities it offers. Organizing data management in different hubs, based on the type of entity, is of great help for users because they do not have to purchase the whole suite but only those applications that are needed for managing their data. Also, with the use of Application Integration Architecture and web services, these "parts" of the suite can be easily integrated with applications from different platforms and vendors. The prebuilt data models that come with the Oracle software are also of great help, as they give users more time for data validation instead of creating a data model. Another advantage worth mentioning is the abundance of transformation and matching rules that help in data management. Also, organizing the data in versions and comparing it across versions allows users to keep track of changes over time, without making these versions separate data sets with no connection between themselves.
However, it seems that Oracle tries to simplify the job for users by giving them everything prebuilt, and constraints can appear because of this reduced flexibility. First, not all domains are supported. The database structure is already given, and users may have to make changes in their own systems before loading data into the MDM repository. Defining a Hub for each domain is a different approach than that of IBM and SAP; each Hub seems to function as an independent application. Also, data governance is at a lower level: in some cases an Oracle Hub seems like a passive registry that serves cleansed data to external systems but does not do much to keep it clean and managed.
Overall, the main domains are covered, and for special lines of business these Hubs can be readjusted. With the new Fusion Hubs that are already launched on the market, Oracle strengthens the collaborative methods and evolves into a multi-dimensional MDM suite at a lower total cost of ownership.
4. ANALYSIS OF SELECTED MASTER DATA MANAGEMENT
ARCHITECTURES
There are several aspects of comparison that can be considered when analyzing the discussed MDM architectures. I have chosen three approaches: (1) data quality dimensions, (2) the three-dimensional model and (3) the five data management phases, which cover: profile, consolidate, govern, share and leverage. These three approaches give an overview of the problem, the solution model and the management processes described through the selected MDM solutions. The first approach covers data quality, which is a common problem in every enterprise's data. The second approach is based on the three-dimensional model that gives a general view of an MDM solution. And the last approach summarizes the different management techniques that are present in each of the selected products.
4.1. MDM of selected architectures and quality dimensions
In the first part of the thesis I defined and covered data quality and the dimensions that describe this subject. Since Master Data Management was founded and developed to deal with data quality improvement, it is understandable that MDM solutions include various validation techniques that work towards improved quality of data, and not of all kinds of data, but of enterprise data only.
Table 8: DQ dimensions and MDM

Intrinsic (believability, accuracy, objectivity, reputation) – common to all four solutions:
- Standardization
- Matching and de-duplication
- Stewardship

Contextual (value-added, relevancy, timeliness, completeness):
- Microsoft MDS: central repository, versions, publication to slave repositories
- SAP NetWeaver: history logs, web request updates
- IBM InfoSphere: history logs, merge
- Oracle MDM Suite: data decays, merge

Representational (interpretability, ease of understanding, consistency, concise representation) – common to all four solutions: data standardization and data transformations; in addition:
- Microsoft MDS: data ranges
- SAP NetWeaver: dropdown fields (qualifiers), different types of table structures
- IBM InfoSphere: domain model, mapping on import and export
- Oracle MDM Suite: word replacement, mapping on import and export

Accessibility (accessibility, access security) – common to all four solutions:
- Database queries, web requests, exports
- User roles, username/password authentication
The table above gives an overview of the data quality dimensions and how each MDM product contributes to their maintenance and improvement. Similar techniques are used for managing the different quality dimensions, and some of them are common to several MDM products.
The first set of DQ dimensions is managed in a similar way in all four solutions. Since intrinsic quality is based on the actual data content, given as it is, the main goal of MDM is to keep the data content accurate for each item it relates to. Data standardization, matching, de-duplication and stewardship are the core processes of master data management, and it is expected that all of them are present in the solutions. As discussed earlier for each solution individually, the data validation techniques are lists of transformations, replacements, NULL removals, word arrangements etc. Matching and de-duplication are present in every MDM product, because one of the main data quality issues is duplicate data. And in order to maintain data quality in the system, data rules are available in each of the four solutions; they are defined based on the business requirements each user has.
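The matching and de-duplication idea shared by all four solutions can be sketched as follows. This is a minimal illustration using Python's standard difflib; the normalization step and the 0.85 threshold are assumptions for the example, not any vendor's actual rules.

```python
from difflib import SequenceMatcher

# Illustrative sketch of standardization followed by similarity scoring,
# the common pattern behind matching and de-duplication in MDM tools.

MATCH_THRESHOLD = 0.85  # assumed cut-off, not a vendor default

def normalize(value):
    """Standardize before matching: trim, lowercase, collapse whitespace."""
    return " ".join(value.strip().lower().split())

def match_score(a, b):
    """Similarity score between two standardized values, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_duplicate(a, b, threshold=MATCH_THRESHOLD):
    """Treat two values as duplicate candidates if the score passes the cut-off."""
    return match_score(a, b) >= threshold
```

Real MDM engines use far richer scoring (phonetic keys, per-attribute weights, logical operators), but the standardize-score-merge pipeline follows this shape.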
Contextual data quality represents how complete and up-to-date the information about a certain object is. The fact that each of these solutions has a central database where it merges data from all external systems into a single "golden" record covers completeness as a DQ dimension; the MDM database being the central repository of an enterprise system is a common strategy in all solutions. Keeping data accurate and current is managed in different ways. For example, MDS keeps versions of data, to "freeze" data and models at a certain point in time, while Oracle keeps track of data decay, marks unusable data and stores it in separate locations. However, to have real-time data, all MDM solutions implement SOA architecture. This is the most suitable way to integrate data from different sources and propagate changes as they occur; all other exports, in CSV or flat files, are delayed versions of the data.
Representational data quality is again partly managed in the same way, but each solution also has additional specific features that support this group of quality dimensions. To keep a unique and standard format throughout the database, each of these solutions applies data standardization rules as well as transformations. In addition, Microsoft MDS, for example, uses data ranges to predefine the allowed values that an attribute can possess; SAP supports different types of tables and fields, lists of values and qualifiers; IBM InfoSphere defines data formats in its modeling phase, using the domain model; and Oracle provides different string functions for data transformation and standardization.
Accessibility is maintained in all solutions through SOA services and their security models. Each solution has a different type of security, but in general they are all based on a role-and-authentication model. The level of security is defined on a general as well as a modular level. Depending on the structure of each solution, users can have permission to view certain applications, models or data. Users can also have different privileges, from only browsing data to processing all CRUD operations. With defined roles, data can be queried with database statements or retrieved through web requests.
4.2. Comparison of selected architectures through the three dimensional model
MDM architecture is based on the three-dimensional model defined by Zachman. This model includes: domains, methods of use and implementation styles. Because MDM solutions are spread out over various modules, each with different functionality, the best way to form a general picture of one vendor's solution is to unite them according to the principles of this model.
Table 9: MDM solutions and the three-dimensional model

Latest version:
- Microsoft MDS: add-in to Microsoft SQL Server 2012
- SAP NetWeaver: version 7.1
- IBM InfoSphere: version 9.1
- Oracle MDM Suite: last release of Customer Hub, version 8.2

Domain:
- Microsoft MDS: domain neutral
- SAP NetWeaver: domain neutral or domain based (Customer, Employee, Supplier, Material, Business Partner, Product)
- IBM InfoSphere: domain neutral or domain based (Party, Product, Account)
- Oracle MDM Suite: domain based (Product, Account, Customer, Site)

Method of use:
- Microsoft MDS: operational and analytical
- SAP NetWeaver: operational and analytical; collaborative when combined with SAP Business Process Manager
- IBM InfoSphere: operational, analytical and collaborative
- Oracle MDM Suite: operational and analytical; collaborative can be achieved with additional applications for business process management (BPM)

Implementation style:
- Microsoft MDS: physical master repository, transactional hub
- SAP NetWeaver: physical master repository, registry, reconciliation engine, transactional hub
- IBM InfoSphere: physical master repository, registry, reconciliation engine, transactional hub
- Oracle MDM Suite: physical master repository, registry, reconciliation engine, transactional hub
(1) Domains – All of these MDM products, except Oracle MDM, are domain-neutral solutions: they support a model for every type of domain. Some of them, like SAP and IBM, offer prebuilt models for the most common enterprise master domains, Customer and Product, which is of great help for users because they do not have to model and plan their database from scratch. Oracle, on the other hand, is the only one among these solutions that does not allow complete freedom when choosing a domain model; all of its MDM products come with a predefined domain model. However, since the basic architecture and concepts of MDM are well implemented, Oracle has no problem customizing the model to suit different business master objects;
(2) Methods of use – The operational and analytical methods of use are supported in all four selected MDM architectures. With support for CRUD operations for data processing and for business requirements, these products cover the operational side of enterprise transactions. Also, since all of them serve as a central repository for the external systems, storing master objects as well as the related attributes and dimensions, they give users a 360-degree view of their main domains. Processed and cleansed data in the master repository is a main source of data for reports, OLAP cubes and various BI tools; accumulating data from the whole enterprise system in one place provides rich information for analytical use. The collaborative method is the only one that is not fully integrated in all of these products. IBM InfoSphere is the only solution that has IBM BPM Express integrated in its MDM architecture, and with this feature it supports the management not only of the data but also of the business processes. This method can be implemented in the other solutions by integrating different applications, in most cases business process management tools from the same or other vendors. Currently only IBM InfoSphere offers a whole package that supports all three methods of use without additional upgrades;
(3) Implementation styles – Microsoft MDS supports the lowest number of implementation styles due to its limited architecture. The other three solutions, SAP, IBM and Oracle, offer different ways of storing and sharing data. Depending on the business requirements and the structure of the legacy applications, these three solutions can offer a central repository that stores all the data, or only a registry style of database that plays the role of a system of reference. They also offer the most advanced way of integrating and communicating data, the transactional hub, since they all have implemented SOA services for sharing data across various platforms. Microsoft MDS, as an add-in to the database management system, can offer physical storage for the cleansed data, which can be accessed by querying the subscription views with ordinary SQL queries. A transactional hub in this solution may be achieved with the Application Programming Interface (API) and the use of Windows Communication Foundation (WCF) services. Key mapping is not supported in this solution as it is in the other three MDM architectures; it may be achieved by customizing the model in combination with WCF services, but it is not something that comes with the product.
4.3. Comparison of selected architectures through the five MDM activities
The main goal of MDM is to consolidate data from different sources and generate a single "golden" record for use. MDM achieves this goal through five main activities that occur in the following order: (1) profile, (2) consolidate, (3) govern, (4) share and (5) leverage.
(1) Profile – Data assessment is usually done on import. All of these solutions support ETL processes. When new data is mapped to the existing data structures, the user can perform part of the assessment of the new records based on the predefined rules in the master repository;
(2) Consolidate – The selected MDM architectures use different approaches, but in all of them data import is based on the well-known ETL process: data is extracted from the external sources, transformed to match the structure of the master database and loaded into the central repository. Even though most of these solutions try to automate the import process, mapping is still done by the user. One facility that SAP NetWeaver and IBM InfoSphere offer is documenting such mappings for future reuse. Data is also consolidated with unique key-value pairs that are created and stored in the master repository and used as external references to the source-system data;
(3) Govern – Cleansing is part of the govern phase. Typical for all solutions is the scoring strategy, in which they start off with predefined match scores that estimate the probability of duplicates in the data. Before running match jobs, they also transform data into a standard format for easier duplicate detection, and different logical operators are used in the matching processes. Once duplicates are detected, they are merged into one record and the obsolete one is removed from the master database. In order to prevent future duplicates and errors, these solutions implement rules engines in their architecture, to detect potentially bad input and warn the user to change data that does not match the standards. An additional BPM tool combined with MDM works even better, managing the business process and preventing the generation of improper data, which greatly facilitates the job of MDM;
(4) Share – In order to provide real-time cleansed data, all systems support SOA. This architecture, mentioned several times already, is an open architecture for sharing data between applications;
(5) Leverage – Once data is well structured and cleansed, it serves as a reliable source for BI tools and analytical systems. Reporting is supported by all of these architectures as the last phase, for data preview.
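The key-value cross-referencing from the consolidate activity can be sketched as follows. This is an illustrative model, not any vendor's schema: each master record remembers the source-system keys it was consolidated from, so source updates map back to the same golden record.

```python
# Sketch of consolidate-phase key mapping: (source_system, source_key) pairs
# act as external references into the master repository.

class MasterRepository:
    def __init__(self):
        self.records = {}   # master_id -> golden record attributes
        self.xref = {}      # (source_system, source_key) -> master_id
        self._next_id = 1

    def consolidate(self, source_system, source_key, attributes):
        """Load one source record; reuse the master id if the key is known."""
        key = (source_system, source_key)
        if key not in self.xref:
            # First time this source record is seen: create a golden record.
            self.xref[key] = self._next_id
            self.records[self._next_id] = dict(attributes)
            self._next_id += 1
        else:
            # Known source key: merge the update into the existing record.
            self.records[self.xref[key]].update(attributes)
        return self.xref[key]
```

In a real suite the merge step would also apply transformation and survivorship rules rather than a plain dictionary update.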
Below is a table that lists the different tools from the selected vendors that are used in the five phases of MDM.
Table 10: MDM overview through the five data management phases

Profile – common to all four solutions: data is assessed on import, when mapping new data to the existing data structures in the master repository.

Consolidate:
- Microsoft MDS: single central repository where data is loaded through ETL
- SAP NetWeaver: MDM Import Manager, MDM Import Server, key mapping, batch transaction processing of SIF files
- IBM InfoSphere: InfoSphere FastTrack, external reference keys, SQL/ETL jobs
- Oracle MDM Suite: D&B batch loads, file loads from CSV files with Oracle Customers Online (OCO) or Oracle Customer Data Librarian (CDL)

Govern:
- Microsoft MDS: match operator, Data Quality Services, data standards, triggers for sending notifications and e-mails for data change approvals
- SAP NetWeaver: SAP Data Manager, similarity scores, logical operators, merge, data transformations, hierarchies, validation rules
- IBM InfoSphere: InfoSphere Analyzer, InfoSphere Discovery, InfoSphere DataStage, InfoSphere QualityStage
- Oracle MDM Suite: Data Quality Management (DQM), word replacements, validation rules, data decays

Share:
- Microsoft MDS: subscription views, Master Data Manager web application, WCF services
- SAP NetWeaver: MDM Syndication Server, MDM Syndicator, web services
- IBM InfoSphere: InfoSphere FastTrack, SOA and web services
- Oracle MDM Suite: Excel sheets, Application Integration Architecture (AIA) web services

Leverage – common to all four solutions: managed data is a great source for analytical systems; all of these architectures support reporting, the final step in MDM that provides data preview.
5. CASE STUDY OF MDM SOLUTION USED IN STUDIO MODERNA
As an addition to this discussion about MDM architectures, I decided to add another solution, developed for the business requirements of Studio Moderna (SM). The reason why I chose to add a custom-developed MDM is that I wanted to show how a company can handle MDM processes internally without purchasing an off-the-shelf product. For this case study, I will briefly describe the Central Product Register (CPR), a solution for managing product data.
The research methodology used to gather data for this case study is based on unstructured interviews with the project lead of CPR. Communication was done over e-mail and in a couple of meetings, where we discussed the architecture and development phases of this solution. I also spoke with another SM employee who was in charge of testing CPR and entering data into CPR's repository. In addition, I used SM documents that describe the architecture as well as the business logic developed in this solution.
5.1. Problems with product data management
Studio Moderna (SM) is a marketing and sales company that has existed on the market for 20 years. With 5,500 employees, the organization operates in 21 countries in Central and Eastern Europe, Russia and Turkey. There are 5,000 different types of products for various purposes, from electronics and health & fitness to products for kitchen & household, sold through five different channels: TV, Internet, print, shops and telemarketing. Some of the most popular brands are: Dormeo, Delimano, Kosmodisk, Top Shop etc. Studio Moderna is the distributor of choice for all the major global direct-response marketing companies and has been responsible for all the major DRTV product winners in the region. It also works directly with manufacturers from Europe and Asia. With strong direct customer relationships managed through 130 transactional websites, 220+ retail stores, 22 call center locations, 300+ hours of daily TV advertising airtime, 6 of its own TV channels, thousands of retail distributors, 15+ million catalogs and 70+ million calls handled annually, Studio Moderna strives to turn consumer brands and products into household names (Overview of Our Company [Studio Moderna – portal], Retrieved February 7, 2013, from http://www.studio-moderna.com).
Working with a rich portfolio of 5,000 products across 21 countries, SM's system was experiencing problems in managing product data from all those locations. Examples of such problems are:
- Product data was scattered around various applications (systems for eOrdering, Telemarketing, Shop POS, PIS (Product Information System), OLAP Admin);
- The same products were stored in different applications, and their updates required data changes across all applications where their records existed;
- Decentralization of data was producing duplicate records;
- Products from different channels and for different countries were not following the same workflows. For example, products ordered through TV channels were submitted to internal ordering, a step that was skipped for SM fashion group products;
- It was hard to track product status (orders, prices, promotions) in each country, because each country managed product data in its own repository;
- In addition to this last problem, it was difficult for management to track customers' interest in each product. Because product data was managed on a country (local) level, it was hard to determine whether a certain brand was selling enough to stay on the market, or whether its marketing was no longer paying off.
In order to reduce the problems listed above, SM decided to centralize products, change the business processes related to this domain, and store and manage product data with an in-house-built solution called the Central Product Register (CPR).
5.2. Central Product Register (CPR)
The Central Product Register is a Master Data Management solution for product data, built for the purposes of SM. It is developed with Microsoft tools, using the Microsoft Dynamics ERP platform and a SQL database. It is a central repository that stores and manages product data on the local (country) and international (all countries) level.
Development of this system began in May 2011 and it was launched for the first time in March 2012. The solution was designed to cover the following functionalities: (1) basic product management, (2) managing permissions, (3) managing the product lifecycle, (4) managing product marketing data, (5) managing supply chain data, (6) managing central pricing data (central prices: purchase, Suggested Retail Price (SRP), calculation, vendor SRP) and (7) managing local pricing data (local prices: CPO price, retail price, transfer price).
Data is entered through CPR's user interface. There are no bulk imports or data transfers from different sources; instead, data is manually entered by a person assigned to this position. Since local and international data is stored in one place, not everyone has permission to enter or update product data. On the local level, data updates are done by a Local Sourcing Officer, whereas on the international level they are done by a Central Sourcing Officer. Entering all data in one place, by a limited number of people, avoids duplicate data as well as multiple data entry in different applications.
Once data is entered in CPR, it is stored in the central database that represents the main source for all other applications that work with product data, regardless of whether they are used for marketing, analytical or any other purposes.
The whole architecture of this solution is designed in such a way that it does not support data flows from different sources; it only allows manual data entry. Duplicates are prevented by triggers set up on the database level. However, this does not cover all scenarios and does not prevent potential duplicates or misspellings from being imported into the database again. Unfortunately, the system is not developed to detect that Dormeo and Drmeo, for example, may be the same product with a potential misspelling. There are no matching mechanisms scheduled to run and compare what has been imported into the central database.
External applications still work with their own databases; they are not connected to CPR's database in such a way as to use product data directly from this place. Product data can be loaded from CPR into client-system databases in two ways: (1) nightly jobs and (2) the "pull" method.
(1) Nightly jobs are scheduled to run at a certain time of night, when the number of the application's users is very low. These jobs contain complex queries that check for updates on both sides, the CPR master database and the client-system database, and transfer the changes into the client's databases;
(2) In case an external application needs to work with a product that was entered in CPR at that same moment, and cannot wait until the next day when the nightly jobs are executed, the application can make a web request to CPR and retrieve the needed data (pull method).
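The change-detection idea behind the nightly jobs can be sketched as follows. This is a hedged illustration: the field names and timestamp comparison are assumptions for the example, not CPR's actual schema or queries.

```python
# Sketch of a nightly sync step: transfer only master rows that changed since
# the last run, or that the client system does not have yet.

def nightly_sync(master_rows, client_rows, last_run):
    """Return master rows the client must apply after the previous run."""
    client_ids = {row["id"] for row in client_rows}
    changes = []
    for row in master_rows:
        # Changed since the last run, or entirely missing at the client.
        if row["modified"] > last_run or row["id"] not in client_ids:
            changes.append(row)
    return changes
```

The pull method described in point (2) would serve the same per-record lookup on demand instead of waiting for this batch pass.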
Figure 25: Example of data flow in CPR (data import by the Local and Central Sourcing Officers into the CPR master database; data shared with the external applications eOrdering, Shop POS, PIS and OLAP)
Two features that contribute to product data management and are unique for this solution are:
(1) product statuses and (2) data security.
5.2.1. Product statuses
One of the many problems SM's systems experienced with product data was the inability to track product statuses in each country as well as on the international level. This problem becomes more complicated and harder to solve in situations when the same products follow different
business workflows. As a solution for proper data management, CPR introduced two new
concepts: product statuses and product operational lifecycle.
Product statuses are business terms defined specifically for the needs of SM. Examples of such statuses are:
- "New" (the product has been created in CPR);
- "Evaluating" (the international sourcing department is evaluating whether to suggest this product to the countries for consideration);
- "In local test decision" (the countries are deciding whether they would like to sell this product) etc.
The product operational lifecycle covers:
- Management of product statuses;
- Transitions of products between different statuses – it determines whether a product can "cross" from one status to another based on predefined business rules;
- Determining which applications can use product data in a certain status.
Not all products that Studio Moderna markets have the same set of statuses. The statuses are based on the brand, sales channel or other business requirements; therefore, some of them can be activated or disabled for various products.
Interdependency is also defined between product statuses and countries. There is an international and a local status defined for each product. In some cases the local status is marked as primary and dictates the current state of the product regardless of changes on the global level; in other cases the international status takes over the primary role in determining the product's state. For example, Dormeo Pillow is a product sold in all 21 countries. Its local and international statuses are set to "Active" and the primary role is on the local level. If sales drop internationally and the global product status changes to "Retired", Turkey's Dormeo Pillow will still have the "Active" status, based on the local status definition.
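The precedence rule illustrated by the Dormeo Pillow example can be sketched as a one-line resolution function. The function and parameter names are assumptions for illustration, not CPR's actual code.

```python
# Sketch of the local/international status precedence rule: whichever side is
# marked primary for a product determines the effective status per country.

def effective_status(local_status, international_status, primary="local"):
    """Resolve the status that external applications should see."""
    return local_status if primary == "local" else international_status
```

With the local side primary, a country's "Active" status survives a global change to "Retired", matching the Turkey example above.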
Product statuses are changed manually, by the Local or Central Sourcing Officer, based on the sales and customers' response for a certain product. The statuses are involved in the business rules for a product and dictate whether its data is visible in a certain application. Unfortunately, CPR does not support any business logic that could track sales and change product statuses automatically.
5.2.2. Product data security
Placing data into central storage created an additional problem for data security. SM wanted to have all product data in the same place but somehow prevent countries from seeing all product records stored in the master repository. To accomplish this, unique security logic was implemented in CPR.
The security model is based on two perspectives: (1) domain and (2) role.
(1) Domain – Security based on the domain perspective defines two levels of permissions: local and international. Users who have local permission (Local Sourcing Officer) can only work with data from the country for which they have permission and cannot view data from other countries. On the other hand, international permission (Central Sourcing Officer) allows users to work with data from all countries;
(2) Roles are defined based on the various data operations that a user can execute. There are three types of permission: no permission – the user has no access to product data and will not be able to view it or be aware of its existence; read permission – the user can see the data, but will not be able to modify it; full permission – the user can see the data and has the rights to perform CRUD operations.
In addition to this security model, permissions are also determined by product statuses. In some business scenarios one product status can be visible to the Central Sourcing Officer but invisible to other roles.
Users' roles and permissions are hardcoded in web configuration files. Each country has its own configuration file that defines the roles of each user, and those files are updated manually.
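The combination of the domain perspective, the role and the status visibility can be sketched as a permission check. The data shapes and rule encoding here are illustrative assumptions, not CPR's actual configuration format.

```python
# Sketch of CPR-style permission resolution: country (local vs. international
# level), role (none / read / full) and product-status visibility must all
# allow access before a user can see or modify a product.

NO_PERMISSION, READ, FULL = 0, 1, 2

def can_view(user, product, visible_statuses):
    """A user sees a product only if country, role and status all allow it."""
    if user["level"] == "local" and user["country"] != product["country"]:
        return False  # local users never see other countries' data
    if user["role"] == NO_PERMISSION:
        return False
    return product["status"] in visible_statuses.get(user["role"], set())

def can_modify(user, product, visible_statuses):
    """CRUD operations require full permission on a visible product."""
    return user["role"] == FULL and can_view(user, product, visible_statuses)
```

In CPR these rules live in hardcoded per-country configuration files; the sketch only shows how the three perspectives compose at check time.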
5.3. Benefits of CPR
Implementing CPR addressed the product data problems as follows:
Table 11: CPR solutions for product data management

Problems solved by the central master repository:
- Product data was scattered around various applications (systems for eOrdering, Telemarketing, Shop POS, PIS, OLAP Admin);
- The same products were stored in different applications multiple times;
- Decentralization of data was producing duplicate records.

Problems solved by product statuses and the operational product lifecycle:
- Products from different channels and countries were not following the same workflows;
- It was hard to track product status (orders, prices, promotions) in each country, because each country managed product data in its own repository;
- It was difficult for management to track customers' interest in each product: because product data was managed on a country (local) level, it was hard to determine whether a certain brand was selling enough to stay on the market or whether its marketing was no longer paying off.
The benefits that SM gains from this solution are more intangible than measurable. CPR offers:
- A unified repository of product data ("one truth");
- Infrastructure that enables product data management in one system;
- A consolidated view of product data from different client systems;
- No manual retyping of product data between systems, which reduces human errors;
- Centralized data that is a great source for analytical systems (OLAP) and gives a complete summary of product data on the local and international level, something that was much harder to achieve before this solution was introduced.
Unfortunately, CPR has not yet been assessed for ROI; therefore, I cannot present any actual numbers that would show how much SM has saved since it started using CPR.
5.4. Comparison of CPR and selected MDM architectures
CPR and the selected MDM applications are compared from three perspectives: (1) the three-dimensional model, (2) the MDM activities and (3) the cost and time of building and implementing the solutions. These three perspectives are selected to find out: (1) whether the custom-built CPR follows MDM standards in its structure, (2) whether it supports the five activities discussed earlier and (3) whether CPR's development and implementation are worth the invested time and money.
(1)
Table 12: Comparison of the MDM architectures and CPR through the three-dimensional model

Domain:
- Microsoft MDS, SAP NetWeaver, IBM InfoSphere and Oracle MDM Suite: domain neutral, which means they can support every master domain object that the business requires.
- Central Product Register: Product.

Method of use:
- Selected MDM architectures: operational, analytical and collaborative. Most of the solutions offer operational and analytical use, while the collaborative style can be achieved when a BPM application is integrated into the MDM environment.
- Central Product Register: operational, analytical and, to some extent, collaborative. Due to the complex product status logic, various workflows are supported.

Implementation style:
- Selected MDM architectures: physical master repository, transactional hub or registry.
- Central Product Register: physical master repository.
(2)
Table 13: Comparison of MDM architectures and CPR's MDM phases
(packaged solutions = Microsoft MDS, SAP NetWeaver, IBM InfoSphere and Oracle MDM Suite;
CPR = Central Product Register)

Profile
- Packaged solutions: profiling is done on data import.
- CPR: predefined business rules determine which data should be imported for product items.

Consolidate
- Packaged solutions: ETL processes are mainly used for data import, through bulk loads or
  imports from Excel or CSV files. Key mapping is also included, to map master unique key IDs
  to the appropriate data in external applications.
- CPR: there is no actual consolidation of data from various client systems. New data import
  is regulated by user roles; only persons with full permission can make changes in the master
  repository.

Govern
- Packaged solutions: various tools for data quality improvement are used, such as column
  analysis, matching and merging. Data is validated through the rule engines that these
  solutions support.
- CPR: most data validations are handled through triggers that are activated on incompatible
  values, types or NULLs. CPR also implements complex coding logic to enforce the business
  rules that depend on product statuses.

Share
- Packaged solutions: all of these solutions implement an SOA architecture to support data
  retrieval on request. Both pull and push modes are supported: data can be retrieved on
  client request (pull mode) or distributed by the master repository (push mode).
- CPR: supports only pull mode, which means that for every update, client systems need to
  make a request. Automatic updates from CPR to client systems are performed as scheduled
  night-time jobs.

Leverage
- Both of the compared subjects provide a unified source of data for analytical systems.
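The trigger-based validations described for CPR's Govern phase can be illustrated with a minimal sketch in application code. The field names, allowed statuses and rules below are hypothetical examples, not CPR's actual schema:

```python
# Minimal sketch of trigger-style validation on product records.
# Field names, required fields and status values are hypothetical.

ALLOWED_STATUSES = {"draft", "active", "discontinued"}

def validate_product(record):
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    # A database trigger would reject NULLs in required columns on insert/update.
    for field in ("product_id", "name", "status"):
        if record.get(field) is None:
            errors.append(f"{field} must not be NULL")
    # Reject incompatible types.
    price = record.get("price")
    if price is not None and not isinstance(price, (int, float)):
        errors.append("price must be numeric")
    # Reject incompatible values.
    status = record.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        errors.append(f"unknown status: {status}")
    return errors

print(validate_product(
    {"product_id": 1, "name": "Blanket", "status": "active", "price": 19.9}))  # []
print(validate_product(
    {"product_id": None, "name": "Pan", "status": "sold", "price": "cheap"}))  # three errors
```

In a real deployment these checks would live in database triggers or a rule engine rather than application code; the sketch only shows the kind of condition (NULLs, wrong types, invalid status values) that fires them.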
(3)
Table 14: Comparison of MDM architectures and CPR's time and cost
(packaged solutions = Microsoft MDS, SAP NetWeaver, IBM InfoSphere and Oracle MDM Suite;
CPR = Central Product Register)

Time for development, implementation and testing
- Packaged solutions: less than 6 months.
- CPR: 10 months for development and 3 years for implementation in all countries.

Cost
- Packaged solutions: in the range of $500K (350K euros).
- CPR: more than 84K euros (labor alone).
These last estimates in Table 14 are based on the following facts.
Development of the Central Product Register began in May 2011 and it went into production in
March 2012, so it took SM 10 months to build and launch the first complete release. Even
though the solution has been in use for months now, upcoming versions and releases are still
improving CPR. Implementation, on the other hand, is a long-term process planned for the next
three years. The reason is that Studio Moderna implements CPR-related changes one country at
a time. Data transfers, changes in client applications in each country and testing are
time-consuming processes that need to be carried out in all 21 SM locations.
Purchased packaged software is introduced much faster; it usually takes less than six months
to implement the solution. I spoke with Mr. Boštjan Kos, Information Management Client
Technical Professional at IBM, about the IBM InfoSphere MDM Solution, and his estimates for
implementation and testing were the following:
“Difficulty to say as it depends on InfoSphere product you have in mind and which products
are in the scope for specific project. Installation and configuration would take 3-5 days
(simple installation on a single server, without high-availability, without disaster recovery,
etc.), connecting to data sources and data targets would take another couple of days, data
migration depends if it is from source A to target B without any complex transformations is
very easy and done on few clicks per table, but if there are complex transformations needed it
might take much longer. Training would take 3-5 days per module. Looking the whole
migration project I believe it should be finished within 1-6 months, depends from
complexity.”
Regarding the cost of these solutions, it is a bit hard to make a precise comparison, because
the numbers given for both "types" of solutions are rough estimates and really depend on the
scenario. For CPR, I was not able to retrieve the final amount invested in designing the
solution. On average, six to seven people worked on CPR each month; four of them were
external contractors, including consultants (at the beginning of the project). According to
data from SURS (Statistični urad Republike Slovenije), the average net salary of a programmer
was 1,200 euros in March 2011, a couple of months before the project started. So the minimum
amount spent on labor was 84K euros (7 programmers * 1,200 euros * 10 months). But this is
just the minimum: the salaries may be higher, other people such as testers and business
analysts were involved in the process, additional software licenses were purchased, and so on.
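The minimum labor figure is simple arithmetic on the stated (rounded) assumptions:

```python
# Minimum labor cost for CPR development, using the figures stated above:
# 7 programmers at an average net salary of 1,200 euros/month (SURS, March 2011),
# over the 10 months of development. Real costs were certainly higher.
programmers = 7
monthly_net_salary_eur = 1200
months = 10

min_labor_cost_eur = programmers * monthly_net_salary_eur * months
print(min_labor_cost_eur)  # 84000, i.e. 84K euros
```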
On the other hand, packaged software looks much more expensive. Three to five years earlier
(2003-2007), the typical MDM solution cost in excess of $1 million just for the software,
plus an additional $3-4 million for implementation services during the first year. During
2008, price points and product packaging (or rather "repackaging") provided more modest MDM
functionality, and accordingly less complexity, which supported market pricing in the
sub-$500K range. Overall, MDM matured from "early adopter IT project" status to become a
mainstay "Global 5000 business strategy" during 2007-08. These new price points reflect
various types of projects and the related product capabilities, i.e., an enterprise MDM
initiative vs. a very specific business solution. Moreover, market dynamics further drove
price differentiation as the market became more sophisticated and understood the price:value
ratio of hybrid vs. registry vs. tool kit vs. fully fledged MDM applications (Zornes, 2009,
p. 6).
5.5. Build vs Buy MDM solution
The comparison in the previous chapter was made to bring out the positive and negative sides
of packaged and custom-developed solutions and to determine which choice brings more benefits
to an organization.
When deciding on an MDM solution, customers usually consider the following criteria
(Stratature White Paper, pp. 3-6): (1) Model adaptation, (2) Security, (3) Performance and
usability optimization, (4) Notification and workflow, (5) Business rules and validation, (6)
Import and export, (7) Time to complete, (8) Cost of implementation, (9) Risk of failure.
(1) Model adaptation - Based on the comparison in Table 9 presented earlier, all selected
architectures provide freedom in choosing any domain requested by the organization.
Some, like SAP NetWeaver, even offer prebuilt data models to make the implementation
process easier. On the other hand, CPR is limited to just one domain, Product. If
Studio Moderna decides to build another MDM solution for the Customer domain, it has
to start from scratch and develop a new model, because the processing logic for
Customer would differ from the one developed for Product;
(2) Security - A security model is supported in every MDM architecture, based on the same
principles of roles and permissions. CPR has its own security model, which is unique
and also supports the product lifecycle process;
(3) Performance and usability - All of the selected MDM architectures provide a friendly
user interface for data visibility and management. Data analysis, matching, merging
and other data processing techniques are implemented in the front end of the
application and don't require writing excessive queries. CPR likewise uses Dynamics to
build its user interface, so that data stewards can manage product data more easily;
(4) Notification and workflow - Notifications appear in Microsoft MDS, integrated into the
MDM solution to notify users about approvals. However, this logic is not necessarily
present in every architecture, because it wasn't discussed for the other three
solutions. Other than IBM, none of the packaged software supports business processes
unless BPM is integrated into the solution. CPR put great thought into managing
workflows for product lifecycle statuses;
(5) Business rules and validation - MDM packaged software offers powerful rule engines
with predefined rules that users can apply to data. Users can also define their own
rules and store them as future system knowledge. Business rules in CPR are applied,
for example, when automatically activating or disabling product data based on its
status. Its security model also uses business rules to control the visibility of some
product statuses;
(6) Import and export - Data flow is based on ETL processes for import and SOA
architecture for export. Solutions nowadays try to be more open and available for
different platforms so that they can handle data from various external sources.
Import and export in CPR are rather limited: data import is done by users with certain
permissions, while export happens only on client system requests or through nightly
jobs, so there is no automatic data transfer at the moment a change is made in the CPR
master database;
(7) Time to complete - Packaged software is implemented much faster than a custom-built
solution (Table 14). There may be update fixes along the way, but no changes as
drastic as in custom solution development. From personal experience, when custom
development is done, people from different departments are involved in the decision
making and the development itself, which additionally slows development while waiting
for decisions, presentations and approvals during meetings. And even though CPR's
development was done in less than a year, implementation is still ongoing and will
take far more time than implementing packaged software;
(8) Cost of implementation - According to Stratature research (Stratature White Paper,
p. 5), it is much cheaper to buy than to build an MDM solution. The people involved in
such complex solutions are developers with well-compensated salaries and above-average
annual incomes; there are also external advisors, business consultants and additional
software to pay for. Development may run past the project deadline, because a
custom-made solution always carries a risk of failure. Outsourcing is an option to
lower costs, but it is not recommended because of the strong connection between
business and IT and the hands-on support this type of solution needs. In the case of
CPR, the cost was much lower than for packaged software; however, the solution is very
limited and domain focused, and any additional modification brings further costs for
the company;
(9) Risk of failure - This last criterion depends on organizational needs and proper
business planning rather than on the technical difficulties experienced when
implementing MDM. Based on the previous experiences of other companies, an
organization can gather information about MDM packaged software and decide whether a
selected solution suits its business requirements. However, it cannot expect this
software to fix business problems in the organization. An MDM solution can be adjusted
to centralize customer data, for example, but it cannot solve the constant errors and
ambiguities that occur in the system because of the organization's failure to define
the difference between a customer and a supplier. A custom-built solution is
specifically designed according to a predefined business model. It can therefore be
expected to accomplish the set goals, but only at the beginning, while there are no
business model changes. For example, CPR offers a strict and limited number of
workflows focused on the product lifecycle and security, but any introduction of new
logic, a new status or a new business rule means changes in the code and processing
logic, and a higher risk of failure of CPR's current operation.
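The pull-only data distribution discussed under Import and export (criterion 6) can be sketched as follows; the class and method names are illustrative, not CPR's actual interfaces:

```python
# Illustrative sketch of pull-only distribution: client systems must ask the
# master repository for changes; the master never pushes updates on its own.
# All names are hypothetical, not CPR's actual interfaces.

class MasterRepository:
    def __init__(self):
        self._products = {}
        self._version = 0

    def update(self, product_id, data):
        """A change in the master database; nothing is sent to clients here."""
        self._version += 1
        self._products[product_id] = (self._version, data)

    def pull_changes(self, since_version):
        """Clients call this on demand, or as a scheduled nightly job."""
        return {pid: data for pid, (ver, data) in self._products.items()
                if ver > since_version}

master = MasterRepository()
master.update("P-1", {"name": "Blanket", "status": "active"})

client_version = 0                               # last version this client saw
changes = master.pull_changes(client_version)    # client-initiated request
print(changes)  # {'P-1': {'name': 'Blanket', 'status': 'active'}}
```

In push mode, by contrast, `update` itself would notify subscribed client systems, so changes would propagate at the moment they are made rather than at the next request or nightly job.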
CONCLUSION
Master Data Management architecture has matured over the years from single-domain hubs for
data cleansing into complex applications that provide not only data quality improvement but
also collaboration with users and business processes. When determining its business value,
there are two types of benefits to look into: (1) intangible and (2) tangible.
(1) Intangible benefits - MDM helps organizations solve four key issues: data redundancy,
data inconsistency, business inefficiency caused by data errors, and supporting
business changes (White, 2007, p. 2). Data quality dimensions as well as data
inconsistencies were discussed in the first part of the thesis. MDM was defined as a
discipline for improving data quality; therefore, the immeasurable benefits mostly
consist of solving existing data errors;
(2) Tangible benefits - Even though I reviewed a great number of publications on MDM, only
a few presented quantitative gains that organizations get from MDM applications. These
studies (Table 1) show monetary returns in many areas of the business, but say very
little about how the numbers were determined: was the profit achieved in the first
year or after a few years of MDM implementation; how much was invested in order to
reach these savings; and so on. Even in the Studio Moderna case study, I could not
find out whether they have seen any actual increase in sales or return on investment
since their solution was brought into use. This difficulty in estimating actual return
on investment points to the productivity paradox that IT companies have struggled with
since the 1970s and 1980s. The cost of an MDM solution is around 500K dollars, and yet
the savings are hard to measure.
Four selected architectures were presented, and they follow the MDM activities: profile,
consolidate, govern, share and leverage. Microsoft MDS offers the least functionality of the
four solutions and still needs further development to become a complex suite comparable to
its competitors. SAP, IBM and Oracle, on the other hand, offer all kinds of possibilities for
data management and cover various business domains. They have implemented powerful matching
engines and validation rules used in detecting bad data. Applications from these MDM
portfolios can also be purchased separately, integrated into an organization's existing
system, and made to work with tools from other vendors.
The Central Product Register that I presented is a unique solution for Studio Moderna and
covers its needs for product data management. However, it is specifically designed for one
domain and one business scenario covering the product lifecycle, and it still requires a lot
of manual work. Imports are done by a person; they don't come as a data flow from external
sources. Export also works in only one way, the already mentioned pull method. The same
solution cannot be reused for another business domain because it doesn't support the business
logic of any domain but Product. Product status is based on sales, but processes involving
suppliers are not covered. For example, how can CPR handle the status of a product that sells
actively but has problems with its supplier? This opens another discussion about managing
data and involves product-related logistics systems.
Considering that most activities for managing product data in CPR are done manually and don't
require unique automation or architecture, I would say that the selected architectures
discussed earlier could be a suitable replacement for CPR, for the following reasons:
- Data model - all of these solutions are domain neutral, with the exception of Oracle;
but Oracle offers a Product Data Hub, which means the Product domain is covered in all
of the packaged software. Some of them even have predefined product data models, which
would help when defining the data model for SM;
- Data import - all four solutions offer a user interface for entering data. Some of
them, like SAP or IBM, have this process automated, which additionally speeds things
up when large sets of product data records for an SM client country need to be
migrated to the CPR database;
- Data export - all four solutions support SOA, which means they are platform
independent and can send data upon client systems' web requests, something that is
also implemented in CPR;
- Data validation - a rules engine is supported in all four solutions and can implement
SM's business rules as well. Since product statuses are changed by the Local or
Central Sourcing Officer, visibility of the rest of the product data can also be
controlled by defining additional business rules that depend on these statuses;
- Data security - as discussed earlier, all of these solutions have a strong security
model as part of their architecture that can also be adjusted for SM.
From the above summary, in my opinion it is better if an organization decides on packaged
software, because:
- An organization can buy as many applications as it needs, which lets it choose
according to its requirements without spending money on useless applications;
- The solutions are easily adaptable to various environments and platforms, and also
leave room for future upgrades and for adding new modules and domains. So, instead of
building an MDM solution on top of existing ERP systems, organizations can use SAP MDM
Data Manager or IBM InfoSphere QualityStage and integrate them with their current
systems;
- The time for implementation and testing is much shorter than for custom development;
- The cost may vary depending on the number of applications, licenses and so on, but a
custom-built solution costs a lot, starting with the large number of business and IT
people who work on such projects.
As far as deciding on a particular vendor from the four presented architectures, I cannot
choose the best solution, because further research is needed to reach that result. It all
depends on the type of organization: its business domain, database size and variety, budget,
and existing IT systems. Most important of all, however, organizations need to be aware of
their bad data problem, what causes it and how it should be prevented, before choosing a
suitable solution, because problems often stem from bad business definitions that IT
solutions cannot solve.
As mentioned in the introduction, forecasts show growth of the MDM market. Contributing to
this growth are the sophisticated MDM applications that evolved from simple data quality
tools into complex collaborative management suites, as well as the ongoing investments in
their improvement. Three main aspects will dictate future MDM development (Radcliffe, 2012):
(1) multi-domain support, (2) cloud technology and (3) Big Data.
(1) MDM solutions try to cover as many business requests as they can. That is why in
recent years more vendors have focused on developing solutions that support various
domains in one application. In the past, customers with multi-domain businesses dealt
with multiple products that each supported a single domain, or with one solution that
had a limited number of multi-domain features. This problem challenged MDM vendors to
invest in new development supporting multi-domain functionality;
(2) Following the path of other IT solutions, MDM vendors are also trying to find a way to
support cloud computing and store their master repositories in the cloud. The major
concern is security, because the repository is still a central data store containing
business-critical data, and vendors still cannot trust the existing security models
enough to implement MDM in the cloud;
(3) The third and most interesting trend is MDM and its work with big data. Due to the
rapid popularization of and extreme interest in social networking, many companies have
found it a great marketing blackboard for presenting their products and services and
for finding new customers. The tendency for MDM architectures is to improve MDM to
work with unstructured data retrieved from social networks, to make predictions about
suitable customers or to connect them with current organizational sales. This would
create more potential customers and increase cross-sales. SAP HANA is such a solution
for fast processing of large amounts of data, but MDM is quite new in this area. If
this is developed and implemented, it will raise MDM to a completely new level, not
just as a solution for data improvement but also for data mining and predictive
analysis.
From what is offered on today's MDM market, what was presented in this thesis, and the
predictions of Gartner, a growing number of organizations are willing to turn their
long-term idea of implementing master data management into reality. In the process of
choosing the best solution, organizations should first look into their bad data problem
and perhaps try to improve their business model. If business model changes are overlooked
and an MDM solution is purchased without thorough consideration, organizations face the
risk of yet another data silo in their system that will cause more damage than
improvement to their master data.
LIST OF REFERENCES
1. Alon, T., Arkus, G., Duran, R., Haber, M., Liebke, R., Morreale, F. Jr., Roth, I.,
Sumano, A. & Zhu, J. (October, 2011). Metadata Management with IBM InfoSphere
Information Server.
2. An Informatica and Capgemini White Paper (2011). Building the Business Case for
Master Data Management (MDM). Strategies to quantify and articulate the business
value of MDM.
3. Arlbjørn, S. J. & Haug, A. (2011). Barriers to master data quality. Journal of
Enterprise Information Management, 24(3).
4. Ballard, C., Farrell, D. M., Lee, M., Stone, P. D., Thibault, S. & Tucker, S. (2010,
September). IBM InfoSphere Streams: Harnessing Data in Motion.
5. Berson, A. & Dubov, L. (2007). Master Data Management and Customer Data
Integration
6. Berson, A. & Dubov, L. (2010). Master Data Management and Data Governance
7. Bhatia, C., Jain, R., Perniu, L., Raveendramurthy, S., Samuel, R., Vibhute, S. &
Wilson, E. (2011, June). InfoSphere Data Architect.
8. Better Information through Master Data Management – MDM as a Foundation for
BI. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/042444.pdf
9. Bracht, J., Rehr, J., Siebert, M., & Thimm, R. (2012, July) Smarter Modeling of IBM
InfoSphere Master Data Management Solutions
10. Böttcher, O., Heilig, L., Karch, S., Hofmann, C. & Pfennig, R. (2007, March). SAP
NetWeaver Master Data Management
11. Bullerwell, M., Kashel, J., & Kent, T. (2011, July). Microsoft SQL Server 2008 R2
Master Data Services.
12. Butler, D.(2011). Boiling the Ocean. Retrieved December 27, 2012, from
https://blogs.oracle.com/mdm/entry/boiling_the_ocean
13. Crosman, P. (2010). Gartner Expects 14% Growth in Master Data Management
Software Revenue for 2010. Retrieved December 27, 2012, from
http://www.banktech.com/architecture-infrastructure/gartner-expects-14-
growth-in-master-data/228800031
14. Cross-Referencing for Master Data Management with Oracle Application Integration
Architecture Foundation Pack. Retrieved December 27, 2012, from
http://www.oracle.com/us/products/applications/056910.pdf
15. Dreibelbis, A., Hechler, E., Milman, I., Oberhofer, M., Run, P. & Wolfson, D. (2008).
Enterprise Master Data Management. An SOA Approach to Managing Core
Information
16. Ferguson, R.(2004). SAP Buys A2is Technology for Master Data Management.
Retrieved December 27, 2012, from http://www.eweek.com/c/a/Enterprise-
Applications/SAP-Buys-A2is-Technology-for-Master-Data-Management/
17. Graham, T. & Selhorn, S. (2011). Master Data Services: Implementation &
Administration.
18. Gryz, J., Hazlewood, S., Pawluk, P.,& Run, P. (2011). Trusted Data in IBM’s Master
Data Management
19. Haapasalo, H., Kropsu-Vehkapera, H., Jaaskelainen, O. & Silvola, R. (2011).
Managing one master data – challenges and preconditions. Industrial Management &
Data Systems, 111(1).
20. Hillard, R. (2010). Information-Driven Business: How to Manage Data and
Information for Maximum Advantage.
21. IBM InfoSphere FastTrack (2007). Retrieved December 27, 2012, from
http://publibfp.boulder.ibm.com/epubs/pdf/c1934780.pdf
22. IBM InfoSphere Information Analyzer Retrieved December 27, 2012, from
http://publibfp.boulder.ibm.com/epubs/pdf/c1934261.pdf
23. IBM Multiform Master Data Management: The evolution of MDM applications.
(June, 2007). Retrieved March 5, 2012, from
http://www.itworldcanada.com/WhitePaperLibrary/PdfDownloads/IBM-LI-
Evolution_of_MDM.pdf
24. IBM InfoSphere Master Data Management Server Retrieved December 27, 2012,
from
http://origin01.aws.connect.clarityelections.com/Assets/Connect/RootPublish/soe-
testclient6.connect.clarityelections.com/Maps/MDMUnderstandingAndPlanning.pdf
25. IBM InfoSphere QualityStage Retrieved December 27, 2012, from
http://publibfp.boulder.ibm.com/epubs/pdf/c1934790.pdf
26. InfoSphere Metadata Asset Manager Tutorial (2012). Retrieved December 27, 2012,
from http://www-01.ibm.com/support/docview.wss?uid=swg27024462
27. IBM Software White Paper (2011). How master data management serves the
business. Retrieved March 5, 2012, from http://www-
01.ibm.com/software/data/master-data-management/overview.html
28. Kahn, B., Strong, D. & Wang, R. (2002). Information Quality Benchmarks: Product
and Service Performance.
29. Kokemuller, J. & Weisbecker, A. Master Data Management: Product and Research.
30. Loshin, D. (2009). Master Data Management.
31. Loshin, D. (2008). MDM Paradigms and Architectures.
32. Magic Quadrant for Business Intelligence Platforms(2012). Retrieved December 27,
2012, from http://businessintelligence.info/docs/estudios/Magic-Quadrant-for-
Business-Intelligence-Platforms-2012.pdf
33. Magic Quadrant for Data Quality Tools (2012). Retrieved December 27, 2012, from
http://www.gartner.com/technology/reprints.do?id=1-1BO662V&ct=120809&st=sb
34. Magic Quadrant for Data Warehouse Database Management System (2012).
Retrieved December 27, 2012, from
http://www.gartner.com/technology/reprints.do?id=1-196T8S5&ct=120207&st=sb
35. Master Data Management (2011, September). Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/018876.pdf
36. Mauri, D. & Sarka, D. (2011, June). Data Quality and Master Data Management with
Microsoft SQL Server 2008 R2.
37. McKnight, W. (2006). Justifying and Implementing Master Data Management for the
Enterprise. Retrieved March 5, 2012, from http://web.ebscohost.com.nukweb.nuk.uni-
lj.si/ehost/results?sid=67111ba7-9fd6-49a5-ab58-
4b092e1a5797%40sessionmgr13&vid=5&hid=108&bquery=Justifying+AND+Imple
menting+Master+Data+Management+for+the&bdata=JmRiPWE5aCZkYj1idWgmZ
GI9bmxlYmsmZGI9cG9oJmRiPXNpaCZkYj11ZmgmZGI9bXRoJmRiPWY1aCZkY
j1yaWgmZGI9bmZoJmRiPWM4aCZkYj1id2gmZGI9aGNoJmRiPWNtZWRtJmRiP
WVyaWMmZGI9aHhoJmRiPWx4aCZkYj04Z2gmbGFuZz1zbCZ0eXBlPTAmc2l0Z
T1laG9zdC1saXZl
38. MDM Aware Applications. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/021486.pdf
39. Messerschmidt, M. & Stuben, J.(2011) Hidden Treasure.
40. Michnik, J. & Lo, M. (2007). The assessment of the information quality with the aid
of multiple criteria analysis.
41. Oracle Information Framework - The power of the combined ODI and MDM suites.
Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/042446.pdf
42. Oracle Master Data Management Strategy. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/042448.pdf
43. Oracle Master Data Management. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/master-
data-management-ds-075053.pdf
44. Oracle Trading Community Architecture. Data Quality Management. Retrieved
December 27, 2012, from
http://docs.oracle.com/cd/A99488_01/acrobat/115hzdqm.pdf
45. Otto, B. (2011). How to design the master data architecture: Findings from a case
study at Bosch. International Journal of Information Management. Retrieved March
5, 2012, from http://www.oracle.com/us/products/applications/master-data-
management/018876.pdf
46. Overview of Our Company [Studio Moderna –portal]. Retrieved February 7, 2013,
from http://www.studio-moderna.com
47. Press release notes from IBM, Retrieved February 7, 2013, from http://www-
03.ibm.com/press/us/en/index.wss
48. Rao, U. (2011). SAP NetWeaver MDM 7.1 Administrator’s Guide
49. Radcliffe, J. (2012). Three trends that will shape the master data management market.
Retrieved December 27, 2012, from
http://www.computerweekly.com/opinion/Three-trends-that-will-shape-the-
master-data-management-market
50. Rivard, F., Harb, G. & Meret, P.(2009). Transverse Information Systems : New
Solutions for IS and Business Performance
51. SAP NetWeaver Master Data Management (MDM). MDM Data Manager.(2011,
October). Retrieved March 5, 2012, from
http://help.sap.com/saphelp_mdm71/helpdata/en/4b/72b8aaa42301bae100000
00a42189b/MDMDataManager71.pdf
52. SAP NetWeaver Master Data Management (MDM). MDM Console. (2011, October).
Retrieved March 5, 2012, from
http://help.sap.com/saphelp_mdm71/helpdata/en/4b/71608566ae3260e100000
00a42189b/MDMConsole71.pdf
53. SAP NetWeaver Master Data Management (MDM). MDM Import Manager. (2011,
October). Retrieved December 27, 2012, from
http://help.sap.com/saphelp_nwmdm71/helpdata/en/4b/72b8e7a42301bae1000
0000a42189b/MDMImportManager71.pdf
55. Sarngadharan, M., Minimol, C. (2010). Management Information System.
56. Schneider-Neureither, A. (2004, May). SAP System Landscape Optimization.
57. Smith, M. (2006). Master data management trends with Mark Smith, CEO of Ventana
Research. [Podcast] Retrieved December 27, 2012, from
http://searchdatamanagement.techtarget.com/podcast/Master-data-
management-trends-with-Mark-Smith-CEO-of-Ventana-
Research?vgnextfmt=aiog&cc=8c98ce6166128210VgnVCM1000000d01c80aRCRD
58. Smith, H. A., & McKeen, J. D. (2008). Developments in practice XXX: Master data
management: Salvation or snake oil? Communications of the AIS
59. Stratature White Paper. Master Data Management – The Build vs. Buy Decision
60. Strong, D. M. & Wang, R. Y. (1996). Beyond accuracy: What data quality
means to data consumers. Journal of Management Information Systems, 12(4).
61. Thackeray, N. (2009, August). Administrator’s Perspective of SAP NetWeaver MDM
– Part 1 & 2. Retrieved March 5, 2012, from
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/1041a80a-f462-
2c10-3ab3-9acb03bdb816?QuickLink=index&overridelayout=true&44714905772576
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/70c286a6-0375-
2c10-bfbb-e6e83d72d804?QuickLink=index&overridelayout=true&44925358039672
62. Understand IBM InfoSphere MDM Server security, Part 1: Overview of Master Data
Management Server security. Retrieved December 27, 2012, from
http://www.ibm.com/developerworks/data/library/techarticle/dm-
0809mccallum/
63. Venkatagiri, S. SQL Server Master Data Services – A Point of View.
64. Wand, Y. & Wang, R. Y. (1996, November). Anchoring Data Quality Dimensions in
Ontological Foundations. Communications of the ACM, 39(11).
65. Wang, R., Pierce, M. & Madnick, S. (2005) Information Quality.
66. White, C. (2007). Using Master Data in Business Intelligence.
67. Yang, S. (2005, June). Master Data Management.
68. Zornes, A. (2009). Enterprise Master Data Management; Market Review & Forecast
for 2008 - 12