UNIVERSITY OF LJUBLJANA
FACULTY OF ECONOMICS
MASTER THESIS
COMPARISON OF SELECTED MASTER DATA MANAGEMENT
ARCHITECTURES
Ljubljana, February 2013 Katerina Atanasovska
TABLE OF CONTENTS
INTRODUCTION .................................................................................................................................. 1
RESEARCH PROBLEM AND PURPOSE OF MASTER THESIS ..................................................................... 1
RESEARCH GOALS ............................................................................................................................... 5
RESEARCH METHODS .......................................................................................................................... 5
1.DEFINITION OF DATA, DATA TYPES, DATA DIMENSIONS AND DATA
INCONSISTENCIES .............................................................................................................................. 6
1.1.DATA TYPES .................................................................................................................................. 6
1.1.1.Analytical data ....................................................................................................................... 7
1.1.2.Transactional data .................................................................................................................. 7
1.1.3.Master data ............................................................................................................................. 8
1.1.4.Metadata ................................................................................................................................. 8
1.2.DATA QUALITY DIMENSIONS ......................................................................................................... 9
1.2.1.Intrinsic data quality ............................................................................................................ 11
1.2.2.Contextual data quality ........................................................................................................ 12
1.2.3.Representational data quality ............................................................................................... 12
1.2.4.Accessibility data quality ..................................................................................................... 13
1.3.DATA INCONSISTENCY ................................................................................................................ 13
1.4.DATA QUALITY IMPROVEMENT ................................................................................................... 15
2.MASTER DATA MANAGEMENT ................................................................................................. 16
2.1.DEFINITION ................................................................................................................................. 16
2.2.GOALS OF MDM ......................................................................................................................... 18
2.3.MDM ACTIVITIES ........................................................................................................................ 20
2.4.BENEFITS FROM MDM ................................................................................................................ 21
3.MASTER DATA MANAGEMENT SOLUTIONS .......................................................................... 22
3.1.HISTORICAL REVIEW OF MDM SOLUTIONS ................................................................................ 22
3.2.FUNCTIONALITIES, CONCEPTS AND ARCHITECTURE ................................................................... 24
3.3.ARCHITECTURE OF MDM DESCRIBED THROUGH SELECTED MDM SOLUTIONS ......................... 29
3.3.1.Microsoft Master Data Services ........................................................................................... 30
3.3.2.SAP Netweaver .................................................................................................................... 33
3.3.3.IBM InfoSphere ................................................................................................................... 37
3.3.4.Oracle MDM Suite ............................................................................................................... 44
4.ANALYSIS OF SELECTED MASTER DATA MANAGEMENT ARCHITECTURES ................ 50
4.1.MDM OF SELECTED ARCHITECTURES AND QUALITY DIMENSIONS ............................................. 50
4.2.COMPARISON OF SELECTED ARCHITECTURES THROUGH THE THREE DIMENSIONAL MODEL ...... 52
4.3.COMPARISON OF SELECTED ARCHITECTURES THROUGH THE FIVE MDM ACTIVITIES ............... 54
5.CASE STUDY OF MDM SOLUTION USED IN STUDIO MODERNA ........................................ 56
5.1.PROBLEMS WITH PRODUCT DATA MANAGEMENT ....................................................................... 56
5.2.CENTRAL PRODUCT REGISTER (CPR)......................................................................................... 57
5.2.1.Product statuses .................................................................................................................... 58
5.2.2.Product data security ............................................................................................................ 59
5.3.BENEFITS OF CPR ....................................................................................................................... 60
5.4.COMPARISON OF CPR AND SELECTED MDM ARCHITECTURES .................................................. 61
5.5.BUILD VS BUY MDM SOLUTION ................................................................................................. 64
CONCLUSION ..................................................................................................................................... 66
LIST OF REFERENCES ...................................................................................................................... 70
LIST OF FIGURES
Figure 1: Definition of master data and the master record ..................................................................... 1
Figure 2: Applications used for MDM ................................................................................................... 4
Figure 3: Enterprise data ........................................................................................................................ 7
Figure 4: List of data attributes ............................................................................................................ 10
Figure 5: List of techniques for solving data inconsistencies .............................................................. 15
Figure 6: Workflow of MDM .............................................................................................................. 16
Figure 7: The data quality activity levels ............................................................................................. 19
Figure 8: MDM Activities ................................................................................................................... 20
Figure 9: Evolution of IBM MDM applications .................................................................................. 24
Figure 10: Dimensions of master data management ............................................................................ 25
Figure 11: Traditional MDM architecture ........................................................................................... 28
Figure 12: MDM architecture with additional published services ...................................................... 28
Figure 13: MDS data model ................................................................................................................ 31
Figure 14: Table types ......................................................................................................................... 34
Figure 15: Key mapping during import and export ............................................................................. 35
Figure 16: Logical model ...................................................................................................................... 39
Figure 17: Domain model and physical model .................................................................................... 39
Figure 18: Physical model ................................................................................................................... 39
Figure 19: Example of field mappings during data import .................................................................. 40
Figure 20: Example of SSN pattern match .......................................................................................... 41
Figure 21: Example of record merge ................................................................................................... 42
Figure 22: List of predefined tables for Customer entity ..................................................................... 45
Figure 23: Example of cross reference between PARTIES and SYS_REFERENCE .......................... 46
Figure 24: Example of data validation workflow ................................................................................ 47
Figure 25: Example of data flow in CPR .............................................................................................. 58
LIST OF TABLES
Table 1: An example estimating the positive impact of customer MDM ............................................ 22
Table 2: Gartner’s Magic Quadrant for Data ....................................................................................... 29
Table 3: MDS repository objects vs. Relational database objects ....................................................... 31
Table 4: Advantages and disadvantages of MDS ................................................................................ 33
Table 5: Advantages and disadvantages of SAP .................................................................................. 36
Table 6: Advantages and disadvantages of IBM InfoSphere ............................................................... 43
Table 7: Advantages and disadvantages of Oracle MDM ................................................................... 48
Table 8: DQ dimension and MDM ...................................................................................................... 50
Table 9: MDM solutions and three-dimensional model ...................................................................... 52
Table 10: MDM overview through four data management phases ...................................................... 55
Table 11: CPR solutions for product data management ....................................................................... 60
Table 12: Comparison of MDM architectures and CPR’s three dimensional model ........................... 61
Table 13: Comparison of MDM architectures and CPR’s MDM Phases ............................................ 61
Table 14: Comparison of MDM architectures and CPR’s time and cost ............................................. 62
INTRODUCTION
Research problem and purpose of master thesis
Most businesses today perform and track their everyday transactions with the help of various information systems. Companies use them to automate their business processes, store their data and make further business decisions based on the end results produced by various applications. The great success of these systems rests not only on the complex processing logic in their backend software, but also on the friendly user interfaces that make such software easy to work with.
The development of new and sophisticated information technologies (IT) in the past decade has resulted in the growth and expansion of numerous business solutions on the market. The benefits of this development are visible in the improved workflows of many companies. However, the side effects of the fast growth of IT created additional headaches for businesses and redirected them back to IT vendors in search of solutions. One of the major problems that users of such applications deal with is the constant growth of “dirty” data in the system.
There are two reasons why IT is responsible for producing bad data:
1. In trying to get closer to their customers, vendors focused on application design and various business scenarios while neglecting data validation and filtering across the whole architecture. This weakened the system's ability to check the content of entered data;
2. The new service-oriented architecture (SOA) allows integration of different applications into one system. Since each application carries its own database, there is a high possibility that the same data is stored in different sources, which automatically produces data redundancy in the system.
Figure 1: Definition of master data and the master record
Source: J. Bracht et al, Smarter Modeling of IBM InfoSphere MDM Solutions, 2012, p. 29
The problem of bad data became most visible and hard to handle when companies started experiencing revenue loss, increased costs, customer complaints, employee frustration, etc.
The statistics below, based on research by Arlbjørn and Haug (2010, p. 294), show the alarming situation companies find themselves in because of poor data quality:
- 88 per cent of all data integration projects either fail completely or significantly over-
run their budgets;
- 75 per cent of organizations have identified costs stemming from dirty data;
- 33 per cent of organizations have delayed or cancelled new IT systems because of
poor data;
- $611bn per year is lost in the US in poorly targeted mailings and staff overheads
alone;
- According to Gartner, bad data is the number one cause of CRM system failure;
- Less than 50 per cent of companies claim to be very confident in the quality of their
data;
- Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data;
- Only 15 per cent of companies are very confident in the quality of external data
supplied to them;
- Customer data typically degenerates at 2 per cent per month or 25 per cent annually;
- Organizations typically overestimate the quality of their data and underestimate the
cost of errors;
- Business processes, customer expectations, source systems and compliance rules are
constantly changing.
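The customer-data decay figure in the list above can be sanity-checked with a few lines of arithmetic. A sketch (the 2 per cent monthly rate is the quoted statistic; the interpretation as simple vs. compounded loss is my assumption):

```python
# Sanity check of the quoted decay statistic: customer data degenerates
# at 2% per month. Simple (non-compounded) loss over 12 months gives
# 24%, close to the quoted 25% annual figure; compounding gives less.
monthly_decay = 0.02

simple_annual = monthly_decay * 12                  # 24% per year
compound_annual = 1 - (1 - monthly_decay) ** 12     # about 21.5% per year

print(f"simple:   {simple_annual:.1%}")
print(f"compound: {compound_annual:.1%}")
```

Either reading supports the qualitative point: left unmanaged, roughly a quarter of customer records go stale within a year.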
Working as a database analyst at Studio Moderna, I have been dealing with examples of bad data every day. Duplicates, misspellings and missing values are some of the irregularities that appear in customer databases. It is very hard to produce statistics and analyses knowing that the numbers contain duplicates, but the huge data load and constant time pressure do not allow you to go through and cleanse what is considered obsolete. In the end, the picture you present for the requested business scenario may be unreliable, not because of incorrect query statements or miscalculations, but because of the content of the data involved in the whole process. It is very frustrating for anyone working on data analysis to spend time hunting for an error, trying to find the reason for mismatching results, only to discover that it is just another misspelled name or missing address.
There are several techniques that help solve problems with bad data, among them data mining, data cleansing, data profiling and data governance. Depending on the tasks they perform, these techniques are divided into four major groups: techniques to clean, consolidate, govern and share data. Today, all of them fall under Master Data Management (MDM), a discipline that brings together any method, technique or technology that deals with data quality improvement.
In much of the literature MDM is defined as software for improving data quality, but Master Data Management covers a much broader area than that. In the more formal definition given by Mauri and Sarka (2011, p. 16), MDM is a set of coordinated processes, policies, tools and technologies used to create and maintain accurate master data. There is no single tool for MDM; anything that is used to solve a data quality issue falls under the category of MDM. For example, running nightly procedures for data cleansing, defining table constraints to check inserted data, or defining table users and permissions can all be considered managing data. Master data is singled out in this discipline because it represents the core data of every enterprise; it needs to be correct and precisely maintained in systems so that the company can work with the lowest possible number of data issues.
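The everyday examples above can be made concrete. A minimal sketch using Python's built-in sqlite3 module (the customer table and its rules are hypothetical, not taken from any product discussed later): a check constraint rejects bad input at insert time, and an update statement stands in for a nightly cleansing procedure:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Table constraints that check inserted data -- one of the routine
# data management measures mentioned above.
con.execute("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL CHECK (length(trim(name)) > 0),
        email TEXT CHECK (email LIKE '%_@_%')
    )
""")

con.execute("INSERT INTO customer (name, email) VALUES ('  Ana ', 'Ana@Example.com')")

# The constraint blocks an obviously bad record at insert time.
try:
    con.execute("INSERT INTO customer (name, email) VALUES ('', 'x@example.com')")
except sqlite3.IntegrityError:
    print("rejected: empty name")

# A simplified 'nightly cleansing procedure': normalize whitespace and case.
con.execute("UPDATE customer SET name = trim(name), email = lower(email)")

print(con.execute("SELECT name, email FROM customer").fetchall())
```

Each of these small measures is a piece of master data management in the broad sense of the definition above, even though no dedicated MDM product is involved.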
In addition to this strategy, vendors have developed sophisticated MDM software solutions in which they implement numerous techniques for improving data quality. These solutions are designed for large, medium and small enterprises. Entire software suites are appropriate for larger companies that work with great amounts of data. Another example where MDM suites are used is companies that have extended their businesses through mergers or acquisitions and are confronting problems of bad data created by introducing new systems into their existing environment. Individual modules of the suites are appropriate for medium and small companies, where certain MDM applications are used for analysis and data cleaning.
There is a significant number of established vendors who offer Master Data Management
products. D&B/Purisma Data Hub, DataFlux MDM, Data Foundations OneData, i2 MDM,
IBM InfoSphere MDM Server, Initiate Systems Master Data Services, Kalido MDM,
Liaison Technologies MDM, Microsoft MDM, Oracle Customer Data Hub, Oracle Hyperion
DRM, Oracle UCM, Orchestra Networks EBX, SAP NetWeaver MDM, Siperian MDM Hub,
Sun MDM Suite/Mural, Teradata MDM, TIBCO CIM, VisionWare MultiVue are just part of
the list of various MDM applications. Considering that MDM is a fairly new technology that has been establishing itself on the market over the past 10 years, it is hard to decide which of the products listed above would be the best solution for a given organization.
Market reviews predict a bright future for MDM vendors. The aggregate MDM market will
grow from US$2.8 billion to US$4 billion over the forecast period (2008-2012), including
revenues from both MDM packaged solutions and implementation services as well as the
billion plus dollars related to data service providers such as Acxiom and Dun & Bradstreet.
The aggregate enterprise MDM market (customer and product hubs, plus systems
implementation services) totaled US$730 million at YE2007 and will reach US$2 billion by
the end of 2012. Software sales are but one portion as MDM systems integration services
reached US$510 million alone during 2007 and are projected to exceed US$1.3 billion per
year by 2012 (Zornes, 2009, p. 3).
Despite these predictions, the majority of companies still favor in-house solutions over packaged MDM software. In 2006, Ventana Research surveyed 515 companies on this matter. They found that only 4% of the surveyed companies had completed their MDM implementation project, 7% were in an ongoing implementation phase and 33% were in progress. Fewer than half of these companies had some kind of packaged software, whereas 20% had their own developed solution. Nearly half were considering implementing some data governance tool, but only 24% were planning to realize that some time in the future (Smith, 2006).
Similar numbers were recently confirmed by Messerschmidt and Stuben (2011, p. 5). They interviewed 49 companies from 12 different countries and eight industries, including small and large businesses. The numbers showed that most of these companies are willing to implement MDM software but are still using their own home-built MDM solutions. Figure 2 shows the answers companies gave regarding the MDM application they use; most answered that they still use in-house development instead of packaged software.
Figure 2: Applications used for MDM
Source: M. Messerschmidt & J. Stuben, Hidden Treasure, 2011, p. 33
From the various statistics presented earlier, it seems that the majority of organizations are looking into implementing some kind of MDM tool, but are still not quite ready for the packaged software available on the market. When an organization has a certain budget to invest in a technological upgrade, it strives to make the best decision money can buy. That decision introduces the problem of this master thesis: the never-ending debate over packaged vs. custom-built solutions. The problem is examined through four architectures of already established vendors, Microsoft, IBM, SAP and Oracle, as well as a case study of a custom-built solution developed for the requirements of Studio Moderna.
The structure of this thesis is divided into two parts. The first part explores the problem of “bad data”, discussed through some general concepts of data, data quality, standards for quality data and possible causes of data inconsistencies. The second part covers the purpose of my thesis, which is an analysis of the data management process implemented in selected MDM software solutions offered on the market, and of how their MDM architecture assists in improving data quality. This analysis is made by researching and comparing different MDM architectures and the way they perform data modeling, validation, import and export of data, and system security. The MDM software solutions compared in this thesis are Microsoft Master Data Services, SAP NetWeaver MDM, IBM InfoSphere and Oracle MDM Suite. Many vendors offer MDM solutions, but I chose these four because they are already known for their database management systems as well as for many business intelligence (BI) tools. In addition to these four products, I included one custom-made solution called the Central Product Register (CPR), developed for product data management in Studio Moderna.
Research goals
The goal of my work is to create a comparison model for products from the selected MDM vendors. This model discusses domain, method of use and implementation style as the main dimensions that characterize each MDM system. I also discuss in more detail some of the techniques used to consolidate, cleanse, govern and share data. The comparison describes how each vendor understands and implements data management in its solution. It also highlights the advantages and disadvantages of each product, and tries to establish whether implementing such a solution will really benefit the business, or whether it is just another fancy application for better organization and viewing of data that does not actually solve the core problem of data quality.
Including a custom-built solution as a case study of an MDM product shows how master data management is understood within a company, and describes the company's attempt to solve the problem without help from an off-the-shelf product. Discussing the company's internal data management introduces another goal of this thesis: to show that users should not rely on MDM software as the only path to quality data. In most cases the problem must be looked for much deeper, not in the data itself but in the sources that produce it, whether user or application. Often the problem lies in a lack of knowledge or experience with business processes and the company's workflows. In such cases MDM products can improve data and solve current issues, but they are not a long-term solution, because the problem exists elsewhere and sooner or later it will produce bad results again. If proven, this finding can be very useful, because it can save users the time and money of purchasing and implementing software that was not the right choice in the first place.
Research methods
There are two research methods used in this thesis:
1. Comparative Analysis;
2. Case study of in-house MDM solution in Studio Moderna.
All of the data and statistics used in this paper are collected from literature and publications, so only secondary data is used. Since this topic compares products from a technical point of view, I chose literature, white papers and technical notes as the most appropriate sources for my thesis. Given the title of my research, the most suitable method of comparison is comparative analysis itself. This method is used when researching the collected materials and creating a general summary of the four different MDM solutions.
The other method used in this research is a case study of the in-house MDM solution built for Studio Moderna's needs. I consider this case study a suitable example because it deals with the problem of data quality and covers data management processes, the same as the four products sold by Microsoft, Oracle, IBM and SAP. It can also extend the discussion of data management in terms of business process changes and workflow restructuring, not just data cleansing and governance, as options for data quality improvement.
In addition to these two methods, I also used unstructured interviews. The following people were interviewed:
- Mr. Bostjan Kos, Information Management Client Technical Professional at IBM,
Slovenia;
- Tadej Zajc, Sales Representative in Oracle, Slovenia and
- Sasa Strah, project lead for Central Product Register solution in Studio Moderna.
These were informal interviews conducted over email, containing questions about the Master Data Management products that the representatives listed above work with, as well as their experience with customers who use their software.
1. DEFINITION OF DATA, DATA TYPES, DATA DIMENSIONS AND DATA INCONSISTENCIES
1.1. Data types
Data is part of our everyday life. Words, numbers, dates and pictures are all examples of data. 'Data' represents a collection of unorganized facts which can be organized into useful information. 'Processing' refers to a group of actions that convert inputs into outputs. The series of operations performed to convert unorganized data into organized information is called data processing, and it includes resources like people, procedures and devices to convert data into information (Minimol & Sarngadharan, 2010, p. 85).
As introduced earlier in this thesis, the main goal of MDM is to improve data quality in organizations; therefore, organizational (enterprise) data is discussed further in this thesis. Enterprise data represents all the inputs that are produced, processed and stored in an enterprise. It can be used in different business scenarios and for different purposes within the company.
For easier management, enterprise data is divided into three categories: analytical, transactional and master data. This grouping is based not on the content or format of the data, but on the different ways the same data is used. There is no strict rule that splits data into one of these three categories; one record can be defined as analytical or transactional depending on the way it is used in a certain business scenario.
For example, sales data can be seen as transactional data representing the daily sales transactions in a company. On the other hand, sales data can also be used for analytical purposes, to present the sales status of an organization for a certain time period. This example puts sales data in two groups, transactional and analytical, depending on the way it is used in a given situation.
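The sales example can be sketched in a few lines: the very same records serve as transactional data, and a summary over them becomes analytical data (the records below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical transactional data: one record per sales event.
transactions = [
    {"date": "2012-01-05", "product": "blender", "amount": 49.90},
    {"date": "2012-01-05", "product": "pillow",  "amount": 19.90},
    {"date": "2012-02-11", "product": "blender", "amount": 49.90},
]

# The same records become analytical data once aggregated,
# e.g. revenue per month for a management report.
revenue_per_month = defaultdict(float)
for t in transactions:
    month = t["date"][:7]                    # "YYYY-MM"
    revenue_per_month[month] += t["amount"]

for month, total in sorted(revenue_per_month.items()):
    print(month, round(total, 2))
```

Nothing about the records themselves changes; only the perspective, detailed events versus a summarized view, determines which category they fall into.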
Figure 3: Enterprise data
Source: An Oracle White Paper on Master Data Management, 2011, p. 4
1.1.1. Analytical data
Analytical data is used to provide a general picture of a company's work. It is the end result of statistics, analyses or other calculations performed over collected inputs. Its main use is to show the business situation in a given time period. It is usually stored in the business intelligence (BI) part of the company's system and is shown in reports, OLAP cubes, graphs, etc. Examples of analytical data are a client demographics overview, yearly profits and losses, or any summary results collected at the global enterprise level. It helps in making major business decisions and often determines the course of a company's progress.
1.1.2. Transactional data
Transactional data represents records that refer to transactions in a system. Transactions are activities related to business events, for example payments, sales, creation of a new account or creation of a new student record; in other words, any change related to an object at a given time. Compared to analytical data, this type is much more detailed, and it tracks and records every new input, update or delete in a system. That is why the amount of transactional data increases every day, proportionally with the growth of the number of transactions.
Even though analytical and transactional data are opposite and completely different categories, they cannot function without one another. It is very hard to review the numerous transaction records created on a daily basis, so analytical data is used to summarize transactional data and provide the final number of daily changes in the system. On the other hand, we can always examine anomalies in analytical data by reviewing each transactional record that went into those numbers.
1.1.3. Master data
Master data is the core data of each enterprise and contains detailed information about its main domains. Since every enterprise is engaged in a different business, it has different domains as well. Examples would be customer, product, location, etc.
Master data can be categorized according to the kinds of questions user will address; three of
the most common questions - “Who?”, “What?,” and “How?” return the most common
domains: party, product, and account. Each of them represents a class of things - for
example, the party domain can represent any kind of person or organization, including
customers, suppliers, employees, citizens, distributors, and organizations. Similarly, the
product domain can represent all kinds of things that companies sell or use - from tangible
consumer goods to service products such as mortgages, telephone services, or insurance
policies. The account domain describes how a party is related to a product or service that the
organization offers. What are the relations of the parties to this account, and who owns the
account? Which accounts are used for which products? What are the terms and conditions
associated with the products and the accounts? And how are products bundled? (Dreibelbis, Hechler, Milman, Oberhofer, van Run & Wolfson, 2008, p. 14).
However, this grouping cannot be taken as a general rule that all companies apply. Depending on its business, rules and logic, each enterprise has its own master data objects defined for the needs of the company.
Based on some of the domains given above, examples of master data would be a customer's date of birth, gender, name and address, or a product's name, SKU, price and supplier. Master data is entered once into the system and changes only on rare occasions. Because the business relies on this information, it is very important to maintain its consistency over time. It is damaging for a company to lose sales records for a customer, but it would be even more critical to lose that customer's personal or contact information. Managing this type of data is discussed later in the thesis.
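A master record for the customer domain, carrying the rarely-changing attributes listed above, might look like the following sketch (field names and values are illustrative, not drawn from any of the discussed products):

```python
from dataclasses import dataclass
from datetime import date

# Illustrative master record: core attributes entered once and
# changed only rarely, in contrast to daily transactional history.
@dataclass
class CustomerMasterRecord:
    customer_id: int
    name: str
    date_of_birth: date
    gender: str
    address: str

record = CustomerMasterRecord(
    customer_id=1001,
    name="Ana Novak",
    date_of_birth=date(1980, 3, 14),
    gender="F",
    address="Dunajska cesta 1, Ljubljana",
)
print(record.name, record.date_of_birth.isoformat())
```

Transactional records (orders, payments) would reference such a record by its identifier, which is exactly why a single consistent copy matters.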
1.1.4. Metadata
Another group of enterprise data worth mentioning is metadata. “Data about data” is the well-known definition of metadata found throughout the literature. However, metadata has a much broader value and meaning for the enterprise, especially when Master Data Management is discussed. Metadata helps the enterprise relate correct information to the appropriate business terms. For example, it helps in differentiating concepts with similar meanings, such as client, customer and buyer.
There are two types of metadata: (1) semantic and (2) syntactic (Sheth, 2003).
(1) Semantic metadata describes contextually relevant or domain-specific information about content (in the right context) based on an industry-specific or enterprise-specific custom metadata model or ontology;
(2) In contrast, syntactic metadata focuses on elements such as size of the document,
location of a document or date of document creation that do not provide a level of
understanding about what the document says or implies.
Another categorization of metadata is based on its type of usage. In this case there are three
broad categories (Berson and Dubov, 2007, p. 129):
(1) Business metadata includes definitions of data files and attributes in business terms. It
may also contain definitions of business rules that apply to these attributes, data
owners and stewards, data quality metrics, and similar information that helps business
users to navigate the “information ocean.”
(2) Technical metadata is created and used by the tools and applications that create,
manage, and use data. Technical metadata typically includes database system names,
table and column names and sizes, data types and allowed values, and structural
information such as primary and foreign key attributes and indices.
(3) Operational metadata contains information that is available in operational systems and
run-time environments. It may contain data file size, date and time of last load,
updates, and backups, names of the operational procedures and scripts that have to be
used to create, update, restore, or otherwise access data, etc.
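The three usage-based categories can be illustrated with a small sketch. All names and values below are hypothetical, chosen only to show what each category typically records for a single business entity.

```python
# Hypothetical metadata for one "Customer" entity, split into the three
# usage-based categories described above.

business_metadata = {
    "entity": "Customer",
    "definition": "A person or organization that purchases our products",
    "data_owner": "Sales department",
    "quality_metric": "completeness of contact data >= 95%",
}

technical_metadata = {
    "table": "CRM.CUSTOMER",
    "columns": {"CUST_ID": "INTEGER", "LAST_NAME": "VARCHAR(50)"},
    "primary_key": ["CUST_ID"],
}

operational_metadata = {
    "row_count": 125000,
    "last_load": "2013-01-15T02:00:00",
    "load_script": "etl_load_customer.sh",
}

# An MDM repository would link all three views under one business term.
customer_metadata = {
    "business": business_metadata,
    "technical": technical_metadata,
    "operational": operational_metadata,
}

print(sorted(customer_metadata))  # ['business', 'operational', 'technical']
```

Keeping the three views linked under one business term is what lets MDM answer both "what does Customer mean?" and "where is it stored, and when was it loaded?" from a single place.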
Based on the definitions and categorizations of metadata, I can conclude that this type of enterprise data supports MDM in two ways:
(1) It contains background information about the context and technical properties of data, which supports more precise data modeling in MDM as well as appropriate mapping of data to master domains;
(2) It sets general data rules for business and technical definitions of data, which supports data standardization, another process in managing master data.
1.2. Data quality dimensions
Companies need to be acquainted with data quality standards so they can easily detect deficiencies in their data. In my opinion, quality in general relates to how much we can expect to gain from something and how reliable or useful it is. With this in mind, data quality shows how much information we can gain from given data and how reliable that information is for us as users.
The classic definition found in the literature describes data quality as "fitness for use", i.e. the extent to which data successfully serves the purposes of its users (e.g. Tayi and Ballou, 1998; Cappiello et al., 2003; Lederman et al., 2003; Watts et al., 2009).
Defining data quality is very subjective, and it is not seen equally by everyone. Some users may consider data very reliable, whereas others may argue that there are still improvements to be made. To reconcile such opposing views, the literature sets common standards for data quality defined through data dimensions. Data dimensions define data quality as a multidimensional concept and help in determining data's "fitness for use". According to Strong and Wang (1996, p. 6), data quality dimensions are a set of data quality attributes that represent a single aspect or construct of data quality.
Attributes are characteristics of data, and the easiest way to define them is by answering simple data-related questions. For example, the question "Which data is duplicated?" points to uniqueness as an attribute, while "What data is incorrect?" implies accuracy, and so on. The figure below lists several questions for determining data attributes.
Figure 4: List of data attributes
Source: R. Hillard, Information-Driven Business: How to Manage Data and Information for Maximum Advantage, 2010, p. 136
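This question-driven view of attributes can be sketched in code: each quality question becomes a small check over the data. The records and value ranges below are hypothetical.

```python
from collections import Counter

# Hypothetical customer records; one duplicated e-mail, one bad birth year.
records = [
    {"id": 1, "email": "ana@example.com", "birth_year": 1985},
    {"id": 2, "email": "ben@example.com", "birth_year": 1978},
    {"id": 3, "email": "ana@example.com", "birth_year": 1985},  # duplicate
    {"id": 4, "email": "cid@example.com", "birth_year": 2199},  # inaccurate
]

def duplicated_values(rows, field):
    """'Which data is duplicated?' -> the uniqueness attribute."""
    counts = Counter(r[field] for r in rows)
    return {value for value, n in counts.items() if n > 1}

def inaccurate_rows(rows, field, lo, hi):
    """'What data is incorrect?' -> the accuracy attribute."""
    return [r["id"] for r in rows if not lo <= r[field] <= hi]

print(duplicated_values(records, "email"))                 # {'ana@example.com'}
print(inaccurate_rows(records, "birth_year", 1900, 2013))  # [4]
```

Each additional question from the figure (missing values, out-of-range values, inconsistent formats) translates into another small check of the same shape.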
There are many attempts in the literature to determine which attributes are most important and best define data quality. For example, Strong and Wang (1996, p. 7) took (1) an intuitive, (2) a theoretical and (3) an empirical approach to find out which data characteristics matter most.
(1) The intuitive approach is based on the researchers' own understanding of an attribute's importance. They choose freely which attributes best define data quality, and in this case the researchers do not ask which data attributes are important to system users;
(2) The theoretical approach, on the other hand, does not rely on the researcher's subjectivity seen in the previous example; it defines data characteristics based on the data deficiencies that can be found in a system. Attributes are derived in reverse: if there is duplicate data in the system, for example, then uniqueness is the missing attribute that is crucial for data quality. Like the previous approach, it does not consider which data attributes are important to users;
(3) In the third, empirical approach, data quality is defined in terms of the data attributes that are important to system users. Even though this approach tries to be as objective as possible by using the general opinion of consumers, the final results can still be diverse and inconsistent because opinions differ from person to person. The difficulty in this approach is setting basic rules against which the dimensions can be compared.
The general conclusion from these approaches is that there are no strict rules or fixed attributes that define data quality. Data quality dimensions are relative to user requirements, and since these requirements often change, the priority and importance of data quality dimensions can change as well.
Wang and Strong (1996, p. 21) used a two-stage survey and a two-phase sorting study to develop a hierarchical framework that consolidates 118 data-quality attributes collected from data consumers into fifteen dimensions, which in turn are grouped into the following four categories, each focusing on a key issue:
1. Intrinsic - What degree of care was taken in the creation and preparation of information?;
2. Contextual - To what degree does the information provided meet the needs of the user?;
3. Representational - What degree of care was taken in the presentation and organization of information for users?;
4. Accessibility - What degree of freedom do users have to use data and to define and/or refine the manner in which information is entered, processed, or presented to them?
1.2.1. Intrinsic data quality
Intrinsic data quality, according to Wang et al (2005, p. 7), implies that information has quality in its own right. Attributes in this category show how truthfully data describes the real objects around us. This group refers to data that comes along with the object and does not change because of business requirements. For example, a person's name is given as it is and does not change due to some business requirement; the same holds for a person's weight, height, or eye color. Such values are intrinsic to the person, and the only anomalies found in this data are NULLs or badly formatted values. Quality in this case is therefore measured by the existence and correctness of the input, not by whether it satisfies business needs. Below is a list of the most commonly used dimensions along with their definitions (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Believability - The extent to which data are accepted or regarded as true, real and
credible;
- Accuracy - The extent to which data are correct, reliable and certified free of error;
- Objectivity - The extent to which data are unbiased (unprejudiced) and impartial;
- Reputation - The extent to which data are trusted or highly regarded in terms of their source or content.
1.2.2. Contextual data quality
Contextual data dimensions highlight the requirement that information quality be considered within the context of the task at hand (Wang et al, 2005, p. 7). As the category's name suggests, these dimensions define how precisely data captures the context of business objects. If I again take a person as an example and the address as a representative data element, I am interested in whether this is the only address that can be assigned to the person and whether the address is currently valid. To improve quality in contextual terms, every business needs to increase the amount of data related to its business objects and update it in a timely manner, to avoid old and obsolete information in the system. The following dimensions are defined in this group (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Value-added - The extent to which data are beneficial and provide advantages from
their use;
- Relevancy - The extent to which data are applicable and helpful for the task at hand;
- Timeliness - The extent to which the age of the data is appropriate for the task at
hand;
- Completeness - The extent to which data are of sufficient depth, breadth, and scope
for the task at hand;
- Appropriate amount of data - The extent to which the quantity and volume of available data is appropriate.
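Two of these contextual dimensions, completeness and timeliness, lend themselves to simple measurement. The sketch below, with hypothetical address records, computes the share of populated values and flags records that are too old for the task at hand.

```python
from datetime import date

# Hypothetical address records for contextual quality checks.
addresses = [
    {"person": "A", "city": "Ljubljana", "updated": date(2012, 11, 1)},
    {"person": "B", "city": None,        "updated": date(2009, 3, 5)},
    {"person": "C", "city": "Maribor",   "updated": date(2012, 12, 20)},
]

def completeness(rows, field):
    """Share of rows in which the field is populated (0.0 - 1.0)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def stale_rows(rows, as_of, max_age_days):
    """Timeliness: rows older than the age allowed for the task at hand."""
    return [r["person"] for r in rows
            if (as_of - r["updated"]).days > max_age_days]

print(round(completeness(addresses, "city"), 2))     # 0.67
print(stale_rows(addresses, date(2013, 1, 1), 365))  # ['B']
```

Note that both measures are parameterized by the task: a different task may tolerate older data (a larger `max_age_days`) or require other fields to be complete, which is exactly what "contextual" means here.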
1.2.3. Representational data quality
Representational data dimensions address the way computer systems store and present information (Wang et al, 2005, p. 8). This category is explored more from a technical than a content perspective. Data quality in this case depends on how well the data model and business logic are integrated in systems. If the database model is well designed, business objects are represented by correct and unique data; otherwise, duplicates, orphan records, and obsolete data accumulate, consuming database space with no particular use. To meet these data dimensions, companies need to focus on the technical planning and development of their information systems. The representational data quality category includes the following dimensions (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Interpretability - The extent to which data are in appropriate language and units and
the data definitions are clear;
- Ease of understanding - The extent to which data are clear without ambiguity and
easily comprehended;
- Representational consistency - The extent to which data are always presented in the
same format and are compatible with previous data;
- Concise representation - The extent to which data are compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point).
1.2.4. Accessibility data quality
Accessibility data quality is another category that defines dimensions from a technical perspective. This multidimensional nature of information quality means that organizations must use multiple measures to fully evaluate whether their data are fit to use for a given purpose by a given consumer at a given time (Wang et al, 2005, p. 8).
The ability of today's systems to serve multiple users at the same time can often cause erroneous data. Duplicates, overwriting of important information, and database changes are some of the risks that systems face in everyday usage. To lower such risks, companies spend considerable time building a security model and limiting access to the system's data. Data dimensions of this type are defined not by the content of data but by the system's security model. There are two known dimensions in the accessibility group (Kahn, Strong and Wang, 2002, p. 184 - 192):
- Accessibility - The extent to which data are available or easily and quickly
retrievable;
- Access security - The extent to which access to data can be restricted and hence kept secure.
1.3. Data inconsistency
Data inconsistencies are irregularities found in data, such as duplicates, misspellings, and undefined values. They are the "bad" data in systems; any data that is obsolete, incorrect, or useless falls into this category.
Bad data can have tangible and intangible effects on a business. According to older research cited by Haapasalo et al (2010, p. 147), incorrect data is estimated to cost the retail business alone $40 billion annually, and at the organizational level costs are approximately 10 percent of revenues. It is said that the decisions a company makes are no better than the data on which they are based, and that better data leads to better business decisions.
Looking at the intangible consequences, data inconsistencies also cause mistrust in existing data. Working with different data versions of the same enterprise object is time-consuming, requires additional work for tracking errors, and causes frustration among employees. Incorrect data cannot give an accurate picture of a business, and it cannot help in making the right business decisions for future progress and success.
Two factors play a major role in producing bad data: the human factor and system design. Human errors occur every day, usually on input or during various calculations. From my personal experience, most of the work I do is data analysis, and a high number of the errors I see are misspellings or data imported into the wrong fields. A database cannot detect whether a customer's last name was entered in the first-name field or vice versa; erroneous data is therefore produced unless it is detected and corrected at the moment of input.
System design is another reason for producing bad data. Wand and Wang (1996, p. 91)
discuss four states of design deficiencies in systems: (1) incomplete, (2) ambiguous, (3)
meaningless and (4) incorrect. These states are based on deficiencies that appear when user
definitions (what users expect to see in the system) are improperly mapped to the system’s
values.
(1) Incomplete state occurs when there is no system value to represent a user definition. This state can lead to inaccurate and incomplete data;
(2) Ambiguous state appears when two or more user definitions are represented by the same value. In this case, precision and accuracy are affected;
(3) Meaningless state produces irrelevant data that cannot be used for any of the requirements. It is an "orphan" value that stays in the system unused. This state may have no immediate effect on data, but in the future it may lead to ambiguity or incorrectness if a new user definition is required and happens to map to that same "orphan" value;
(4) Incorrect state appears when data refers to the wrong user definition, making the data incorrect and unreliable.
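Three of the four states can be detected structurally from the mapping between user definitions and system values; incorrectness cannot, because it requires knowing the true mapping. A minimal sketch with hypothetical customer states:

```python
# Hypothetical user definitions (real-world states) and system values.
user_defs = {"active customer", "former customer", "prospect"}
system_values = {"A", "F", "X"}

# The mapping actually implemented in the system (deliberately flawed):
# "former customer" shares "A" with "active customer", "prospect" is
# unmapped, and "F"/"X" are mapped to by nothing.
mapping = {"active customer": "A", "former customer": "A"}

def incomplete(defs, m):
    """(1) User definitions with no system value."""
    return {d for d in defs if d not in m}

def ambiguous(m):
    """(2) User definitions that share one system value."""
    by_value = {}
    for d, v in m.items():
        by_value.setdefault(v, set()).add(d)
    return {d for ds in by_value.values() if len(ds) > 1 for d in ds}

def meaningless(values, m):
    """(3) System values no user definition maps to."""
    return values - set(m.values())

print(incomplete(user_defs, mapping))                     # {'prospect'}
print(meaningless(system_values, mapping) == {"F", "X"})  # True
```

Checks like these could run as part of data profiling, surfacing design deficiencies before they turn into inaccurate or ambiguous records.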
Data issues can be of a technical or business character. Technical data issues refer to data structure and representation. Examples of such technical errors are (Gryz and Pawluk, 2011, p. 3):
- Different or inconsistent standards in structure, format, or values;
- Missing data, default values;
- Spelling errors;
- Data in wrong fields;
- Buried information in free-form fields.
Business issues, on the other hand, are unique to each organization. They refer to the context of the data and appear as a result of incorrect representation of business terms and relations. For example, a customer's address is entered as home instead of work, or a person is linked to transactions he never made.
It is hard to define a general list of business data inconsistencies that covers the irregularities of all organizations, as was possible for technical issues. The best way to detect and define such business errors is through data analysis, which reveals whether the entered data corresponds to the defined business concepts.
Despite the fact that bad data lowers data quality and produces incorrect information, in many cases it is also useful, because it indicates what changes need to be undertaken to improve the systems. As seen in the previous chapter, another way to explore data quality is through the data deficiencies found in systems. Following this approach, the existence of data inconsistencies can reveal the missing factors for data quality. Data errors are the starting point for solving the problems that produce them. Once these problems are detected, appropriate measures can be applied to fix them and improve data quality in the system.
1.4. Data quality improvement
Data quality improvement is a systematic process that occurs in several phases. It starts with finding the source of the problem, continues with cleansing the errors, and ends with setting data standardization rules that prevent future problems.
Data quality improvement proceeds in the following order (Rivard et al, 2009, p. 62):
(1) data profiling - analyzes data to find inconsistencies, data redundancy and incomplete
information;
(2) data cleansing - corrects, standardizes and verifies data;
(3) data integration - semantically links data; reconciles, merges and associates;
(4) data augmentation - improves data by using internal and external sources, and
removes duplicates;
(5) data monitoring - monitors and checks the integrity of data over time.
Figure 5: List of techniques for solving data inconsistencies
Source: F. Rivard et al., Transverse Information Systems: New Solutions for IS and Business Performance, 2009, p. 62
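The five phases can be sketched as a minimal pipeline over a toy customer list; all field names and cleansing rules below are hypothetical.

```python
# Minimal sketch of the five improvement phases on a toy customer list.

raw = [
    {"name": " Ana Kos ", "email": "ANA@EXAMPLE.COM"},
    {"name": "Ana Kos",   "email": "ana@example.com"},
    {"name": "Ben Zu",    "email": None},
]

def profile(rows):
    """(1) Profiling: count missing e-mail addresses."""
    return {"missing_email": sum(r["email"] is None for r in rows)}

def cleanse(rows):
    """(2) Cleansing: trim names, lower-case e-mails."""
    return [{"name": r["name"].strip(),
             "email": r["email"].lower() if r["email"] else None}
            for r in rows]

def integrate(rows):
    """(3)+(4) Integration/augmentation: merge rows sharing a key,
    which also removes duplicates."""
    merged = {}
    for r in rows:
        merged[(r["name"], r["email"])] = r
    return list(merged.values())

def monitor(rows):
    """(5) Monitoring: re-profile after the run."""
    return profile(rows)

clean = integrate(cleanse(raw))
print(len(clean), monitor(clean))  # 2 {'missing_email': 1}
```

Note the ordering matters: cleansing before integration is what makes the two spellings of the first customer collapse into one record, while monitoring simply repeats profiling to track quality over time.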
Various tools support the DQ improvement stages listed in Figure 5; the leaders among them are Informatica, SAP, IBM, and SAS/DataFlux (Gartner, 2012, p. 2).
The DQ improvement phases are unified in master data management, whose concepts and goals will be discussed in the second part of this thesis.
2. MASTER DATA MANAGEMENT
2.1. Definition
The problem of bad data is well known to every company. No enterprise has perfect, error-free data; therefore, companies constantly try to improve data quality and protect their systems from further data inconsistencies. Earlier I discussed four types of enterprise data: transactional, analytical, master data and metadata. All of these types are equally important in every organization, but the core data that describes an organization's business domains is master data. Therefore, the management of this type of enterprise data will be discussed further in the thesis.
There are a number of data stewards, database administrators, software architects, and business analysts who work with different software platforms, data methods and techniques used for data cleansing and governance. All the people, software and methods involved in solving master data errors are united in a discipline called Master Data Management (MDM).
Master Data Management (MDM) is often defined as a software package for improving data quality. In fact, MDM is much more than a software application for data cleansing: it is a dedicated IT discipline that includes people, software tools, and business rules for managing master data. Different sources in the literature share similar views on what MDM is. For example, Berson and Dubov (2010, p. 79) define MDM as a framework of processes and technologies aimed at creating and maintaining an authoritative, reliable, sustainable, accurate and secure data environment that represents a "single and holistic version of the truth" for master data and its relationships, as well as accepted benchmarks used within an enterprise and across enterprises, spanning a diverse set of applications, lines of business, channels, and user communities.
Loshin (2008, p. 8) defines MDM as a collection of best data management practices that
orchestrate key stakeholders, participants, and business clients in incorporating the business
applications, information management methods, and data management tools to implement the
policies, procedures, services, and infrastructure to support the capture, integration, and
subsequent shared use of accurate, timely, consistent, and complete master data.
Figure 6: Workflow of MDM
Source: D. Loshin, MDM - Paradigms and Architectures, 2008, p. 9
In other words, MDM is developed to improve, maintain and govern company’s master data
following the business rules of that enterprise.
Even though there are several types of enterprise data, MDM's main concern is master data. This fact should not underestimate the significance of the other types, analytical and transactional data, but the choice is made because every company's business processes are designed and developed around master data. Master data holds information about the key objects of every enterprise.
There has always been a need for MDM, but in recent years interest has grown constantly, especially in large and complex companies. Many reasons can be found for this need for new data quality management standards. Examples include: (1) lines of business, (2) mergers and acquisitions and (3) new packaged software (Dreibelbis et al, 2008, p. 6 - 11).
(1) Lines of business - What these reasons have in common is that they bring additional data into the system, which is often a different version of already existing data. Lines of business, for example, create different modules within the same enterprise, and each module functions independently. They work with the same master business domains, but each line of business keeps its own master data for the common enterprise objects. In a sales company, customers can make purchases through different channels such as store, online, or catalogue. If each sales channel represents a different line of business, several versions of the same customer may be created, one for each channel.
(2) Mergers and acquisitions - It is very common nowadays for one company to purchase another, or for two companies to merge their businesses into a large enterprise. In such cases, the master domains from both companies are included in the new business, and the same problem as in the first example can appear. Even though I am taking the example of large businesses that may work with different sets of customers, there may still be a group of people stored in both systems. When the two data stores are merged, duplicate data is automatically created.
(3) Packaged software - As a result of SOA architecture and all the independent platforms on the market, companies often decide to use different applications for different business processes. They can use Enterprise Resource Planning (ERP) software to manage their sales, purchases and stock, or Customer Relationship Management (CRM) software to manage their customers. In both cases, there needs to be some connection between these different applications, so they can "communicate" and share the same data for the key objects of the company. MDM is the link in this case.
Among all the existing ERP, CRM and SCM solutions, the question often arises why companies need another management tool when there are already so many on the market. Why can't the existing management solutions, which were on the market long before MDM appeared, solve the problems just described? The answer lies in the following four factors (Loshin, 2008, p. 13):
(1) Despite the recognition of their expected business value, to some extent many of the
aspects of these earlier projects were technology driven and the technical challenges
often eclipsed the original business need, creating an environment that was
information technology centric. IT-driven projects had characteristics that suggest
impending doom: large budgets, little oversight, long schedules, and few early
business deliverables;
(2) MDM’s focus is not necessarily to create yet another silo consisting of copies of
enterprise data (which would then itself be subject to inconsistency) but rather to
integrate methods for managed access to a consistent, unified view of enterprise data
objects;
(3) These systems are seen as independent applications that address a particular stand-
alone solution, with limited ability to embed the technologies within a set of business
processes guided by policies for data governance, data quality, and information
sharing;
(4) An analytical application’s results are only as good as the organization’s ability both
to take action on discovered knowledge and to measure performance improvements
attributable to those decisions. Most of these early projects did not properly prepare
the organization along these lines.
From all that was stated above, MDM is not a new technology or approach for improving data quality, but rather a standardization of the data management workflow, something that was not formally defined before. There were data stewards and data management methods used in different systems, times and places, but they did not belong to any category, even though they did the same job: data integration, cleansing, governance and sharing. MDM is now that category, and it expands with every new master data management method that is defined. Considering the serious role it has in governing master data, MDM has yet to develop and prove itself as an efficient tool for data quality improvement.
2.2. Goals of MDM
Most of the literature refers to the creation of a single source of truth for master data as the main goal of MDM. Yang (2005, p. 3), for example, states that the main goal of MDM is to allow unrelated applications to share a common pool of synchronized data. According to Berson and Dubov (2007, p. 3), the focus of MDM is to create an integrated, accurate, timely and complete set of data needed to manage and grow the business.
Beyond this "golden record" goal, MDM focuses on lowering cost and complexity through standards, and on supporting business intelligence and information integration (Otto, 2011, p. 2).
Some of the most important goals of MDM include (Mauri and Sarka, 2011, p. 17):
- Unifying or at least harmonizing master data between different transactional or operational systems;
- Maintaining multiple versions of master data for different needs in different operational systems;
- Integrating master data for analytical and CRM systems;
- Maintaining history for analytical systems;
- Capturing information about hierarchies in master data, especially useful for analytical applications;
- Supporting compliance with government regulations (e.g., Sarbanes-Oxley) through auditing and versioning;
- Having a clear CRUD process through a prescribed workflow;
- Maximizing Return On Investment (ROI) through the reuse of master data.
The MDM goals in the list above can be summarized into two main goals: one that strives to cleanse the data, and another that tries to keep the data clean. In other words, the goal of MDM is a two-step process that increases data quality: the first step is to review, organize and cleanse existing data; the second is its maintenance and governance.
As the list by Mauri and Sarka (2011, p. 17) shows, MDM has numerous goals to accomplish; defining MDM merely as the creation of a single version of data for enterprise key objects is therefore a partial explanation that does not cover the whole issue of data quality. That may be the final point to accomplish when MDM is implemented for the first time, but data management does not stop there. Since data changes on a daily basis, a one-time data reorganization and cleansing does not solve the bad data problem, because with each new data load the problem may reappear. What a company needs is a long-term solution, and this is achieved in the second step of MDM goal realization: the constant governance of data quality standards.
MDM is a long-term solution for keeping enterprise data quality at a satisfactory level. Each MDM project should strive to achieve the top data quality level and show proactiveness in managing data. Creating a single source of data and governing it provides the flexibility for an organization to grow and increase its information pool without confronting issues of redundant data.
Figure 7: The data quality activity levels
Source: H. Haapasalo et al, Managing one master data – challenges and preconditions, 2011, p. 158
In order to accomplish the desired goal, MDM should have a business focus instead of a technology focus (Loshin, 2009). MDM's main concern is master data, the type of data that defines the business in each enterprise; technology in this case is just a tool for realizing the management, whereas business standards are the core issues MDM should deal with.
In addition to this, Smith and McKeen (2008) have defined four prerequisites for successful MDM: (1) developing an enterprise information policy, (2) defining business ownership, (3) data governance and (4) the role of IT systems. In this list, only the last prerequisite involves IT; the other three are all business-focused.
2.3. MDM Activities
MDM provides the following activities to accomplish the goals discussed: (1) profile, (2)
consolidate, (3) govern, (4) share and (5) leverage. These five categories contain different
methods, techniques and tools that support activities appropriate for each of the groups (An
Oracle White Paper, 2011, p. 14).
(1) Profile - This first phase of data management examines the current data quality state of all sources. It is essentially a data assessment that checks whether the current data follows predefined rules in the master repository: for example, the completeness of the data, the distribution of values, or the acceptable range of values. Profiling can also be done during data import and data integration tasks;
(2) Consolidate - In this phase, data from different sources is integrated. Depending on the MDM architecture, data is integrated into the master repository, or key references to external applications are created or updated;
(3) Govern - Major changes can happen in this phase because the actual data updates occur here. Deduplication, cleansing, updates and deletions are performed based on the assessment results provided by data profiling;
(4) Share - Once data is cleansed, it is passed on to external systems. Master data synchronization between the master repository and external applications is supported by SOA, an architecture that allows data sharing between different system platforms;
(5) Leverage - This last phase is used for analytical purposes. Enriched master data is a great source for BI tools and gives a complete view of the master business domains.
Figure 8: MDM Activities
Source: An Oracle White Paper, 2011, p. 14
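The consolidate and govern activities ultimately produce a "golden record". Below is a minimal sketch of one common survivorship rule (the newest non-empty value wins), with hypothetical CRM and ERP source records; real MDM tools apply far richer matching and precedence rules.

```python
# Sketch: consolidating two source records into a "golden record" using a
# simple survivorship rule: the most recent non-empty value wins.
# Field names, values and the rule itself are hypothetical.

crm = {"name": "Ana Kos", "phone": "",         "updated": 2012}
erp = {"name": "A. Kos",  "phone": "555-0101", "updated": 2011}

def golden(*sources):
    """Build one consolidated record from several source records."""
    fields = {k for s in sources for k in s if k != "updated"}
    # Prefer the newest source; fall back to older ones for empty values.
    ordered = sorted(sources, key=lambda s: s["updated"], reverse=True)
    record = {}
    for f in sorted(fields):
        record[f] = next((s[f] for s in ordered if s.get(f)), None)
    return record

print(golden(crm, erp))  # {'name': 'Ana Kos', 'phone': '555-0101'}
```

The empty CRM phone number illustrates why survivorship must look past the newest source: a blindly "latest wins" rule would have discarded the only usable phone number.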
Managing master data follows this order; it is logical that the workflow starts with data analysis and ends with data reporting. However, not all phases occur with the same frequency. It would be very expensive and time-consuming for an organization to run column analysis or matching on a daily basis. Sharing of data, on the other hand, may be more frequent, especially if external applications send direct requests for data retrieval.
2.4. Benefits from MDM
A successful MDM solution can bring positive value to an enterprise, providing benefits of both an intangible and a tangible character. Intangible benefits are seen in the following areas: (1) data quality, (2) business processes and (3) users and customers (Dreibelbis, 2008, p. 37).
(1) MDM offers improved data quality, seen through some of the dimensions discussed at the beginning. Better accuracy, consistency and completeness are a few of the dimensions improved by this strategy. In addition, the same version of data is shared across the system and used by various applications;
(2) Business processes and workflows are better organized thanks to correct data. They are not only improved by the data they work with, but are also reorganized to produce and maintain correct data that results in reliable information. This reorganization of business processes also has a predictive nature: it helps detect the most valuable data and trigger new business innovations and more profitable decisions for the future progress of the enterprise;
(3) Users' trust in data is restored, because they can now rely on the same version of data across the information system. Customers are also more satisfied, because most of the delays and irregularities that resulted from wrong data are greatly reduced by MDM.
Tangible benefits are seen in the actual profit that organizations gain after implementing
MDM. An example of such quantitative data is shown in Table 1. The benefits with the highest
amounts in the table are sales, customer loyalty (which again leads to increased sales) and
efficiency (of sales representatives and IT systems). Based on these facts, MDM improves
organizational work from both a business and a technical perspective.
Table 1: An example estimating the positive impact of customer MDM
Source: Building the Business Case for Master Data Management (MDM), 2011, p. 9
MDM is often identified with MDM software. This confusion arises because MDM
applications embody data management processes in the most concrete manner. What follows
is a review of four MDM product lines. The discussion covers their architecture,
processes and the usage of MDM in enterprises.
3. MASTER DATA MANAGEMENT SOLUTIONS
3.1. Historical review of MDM solutions
As seen from the definitions of MDM in the previous chapter, every process or person
involved in data quality improvement is part of MDM. Data mining, cleansing, redefining
business rules, changing application logic: it is all part of managing data. Therefore, I can say
that data management appeared with the first introduction of databases. However, the
standardization of methods and rules has become more widespread in recent years.
The first attempts at managing data are found in CRM and ERP applications. However, the
main problem with these applications was that they managed only their own data and
could not provide a single common source of master data shared between different
solutions. Master Data Management appeared in the late 1990s with the release of Customer
Data Integration (CDI) and Product Information Management (PIM) products on the market.
The development of MDM applications has historically followed two directions: (1) functionality
centric and (2) domain centric.
(1) The first approaches to MDM were made through data warehouses. However, this way of
managing data did not prove efficient. The idea was to centralize data in one place.
But managing does not mean just keeping everything in the same storage; it also
requires some functionality on top, which was missing in the data warehouse
approach.
The second idea for managing data was through enterprise application integration
(EAI). The development of this technology made it possible for different applications
to work together and exchange data. The missing part in this case was the central storage
that would keep the single source of truth. MDM evolved from these two ideas as
common ground that creates and maintains the single source of truth and also shares it
with the various applications in the system;
(2) Because master data represents the key objects of a business, customer and product
are the main domains found in enterprise data. Understandably, MDM started with
customer master data implemented in the well-known customer data integration (CDI)
applications. Customer data models were initially of an account-centric design, which
means that they were designed around the different roles customers can have in the
system (buyer, sales person, administrator etc.). Because the number of such business
models grew with the growth of customer data, it became more difficult for
organizations to maintain several databases for just one type of entity and to
consolidate data from all of them. Therefore, the account-centric model was replaced
with the entity-centric model, which uses one schema design for buyers, sales persons,
administrators and client organizations; they are all covered by the Customer
domain. After the solution for the customer domain, vendors came up with product
information management (PIM), applications that support product master data.
Nowadays, the latest trend is to implement several domain types in one master data
management application, called multi-domain MDM.
The evolution of IBM MDM applications is a great example of these two development
directions (IBM Multiform Master Data Management, 2007). In the development cycle of
IBM MDM applications there are two significant points: the first is the transition from a
data-centric tool to a functional-centric application, and the second is the
transition from a single focused usage style or domain to a multiform application. These two
points are important in MDM evolution, because they represent the culmination of problems
found in the data management tools of the time, which caused the transition from one
approach to another.
The first approach MDM used was through index and reference tools. In this case
there was no significant storage for keeping the master data; only the indexes (IDs,
references) were kept in a single repository. This approach could show the various versions of
data for the master domains, but it lacked the functionality to deal with them and resolve
them. This is the point where the first evolutionary chasm appeared, causing MDM
solutions to develop as applications from that time on.
The second chasm appeared while MDM was being developed as a functional approach with
its own physical storage of data as well as functions to manage the data. Initial MDM
applications were focused either on a unique usage style or on one domain. Such an approach
created difficulties in the exchange of master data between different domains. Given that
enterprises have different lines of business and multiple domains, it was hard to merge and
maintain data from uniform MDM applications. This problem introduced the next step in the
development of MDM applications: the launch of multiform MDM applications. Multiform
MDM applications are functional-centric solutions that support various domains as well as
various usage styles. Several vendors still produce single-domain applications, but their
mission and vision are directed towards multiform applications. An example of MDM
evolution can be seen in the graph below.
Figure 9: Evolution of IBM MDM applications
Source: IBM Multiform Master Data Management: The evolution of MDM applications, 2007, p. 9
MDM solutions grow and develop along with technology. The newest trend, cloud
computing, is also present in this discipline. The focus of MDM nowadays is on
developing multi-domain cloud solutions.
3.2. Functionalities, concepts and architecture
The benefits of Master Data Management are best seen when an MDM solution is implemented
in an enterprise and used to manage its data. System, application and hub are terms that all
refer to an MDM system in general, which is why they are used interchangeably further in the
text.
An MDM system is a solution that creates a single version of master data, maintains master
data through various processes and makes it available to the other legacy applications in the
information system.
Per Gryz and Pawluk (2011, p. 2-3), MDM solutions should offer the following
functionalities:
- Consolidate data locked within the native systems and applications;
- Manage common data and common data processes independently with functionality
for use in business processes;
- Trigger business processes that originate from data change;
- Provide a single understanding of the domain (customer, product, account, location) for
the enterprise.
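The third functionality in the list, triggering business processes from data changes, can be sketched with a simple observer pattern. Everything in this example (class names, fields, the re-verification workflow) is an invented illustration under that assumption, not the API of any MDM product.

```python
# Illustrative sketch: a master record that triggers a business process
# (here, an address re-verification) when one of its fields changes.
# All names are hypothetical.

class MasterRecord:
    def __init__(self, data):
        self._data = dict(data)
        self._listeners = []

    def on_change(self, callback):
        """Register a business process to run on data change."""
        self._listeners.append(callback)

    def update(self, field, value):
        old = self._data.get(field)
        if old != value:
            self._data[field] = value
            for cb in self._listeners:
                cb(field, old, value)

log = []
customer = MasterRecord({"name": "ACME", "address": "Old Street 1"})
# e.g. a change of address could trigger a re-verification workflow
customer.on_change(lambda f, old, new: log.append(f"re-verify {f}: {old} -> {new}"))
customer.update("address", "New Street 5")
print(log)   # ['re-verify address: Old Street 1 -> New Street 5']
```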
The functionalities of MDM are mainly developed to support data unification and are
manifested through import and export of data, business rules, validation and any other method
that assists in data consolidation and transfer. Even though different vendors try to provide
different functionalities so they can lead the MDM area, there are still some basic concepts on
which MDM solutions are built.
The best way to describe an MDM system's data management functionality, working concepts
and architecture is through the three-dimensional model. This model is a shortened
version of the 30-viewpoint framework proposed by Zachman.
The main dimensions that describe MDM systems are: (1) domain, (2) methods of use and (3)
implementation styles. There are three main guidelines that define the scope of the three-
dimensional model (Dreibelbis et al., 2008, p. 12):
(1) Business scope determines the number of domains;
(2) Primary business drivers determine the methods of use;
(3) Data volatility (instability) determines the implementation styles.
Figure 10: Dimensions of master data management
Source: A. Dreibelbis et al, Enterprise Master Data Management, 2008, p. 12
(1) The first dimension, domain, is based on the nature of the business and the type of
master data it works with. Each enterprise has different lines of business which work
with various key objects. The most common domains are customer, product and account.
The Location domain is often added to this list. However, this classification can be
further expanded with new domains, depending on enterprise requirements. The names of
these domains are largely self-explanatory and describe the business objects they
cover. Customer covers people, organizations and all the roles they can have in the
system, for example supplier, buyer, employee, employer etc. Products depend on the
lines of business and cover the various items a company may work with. The Account
domain covers relationships between customers and products. Depending on the type
of business there are different types of accounts as well: checking, savings, student
accounts etc. Based on the number of domains an MDM can support, there are single-
domain MDM solutions as well as multi-domain solutions which work with several
different domains;
(2) The second dimension, methods of use, is defined according to the different purposes of
use each business has. Based on this dimension, MDM systems can belong to three
groups: collaborative, operational and analytical. Collaborative MDM is used to
support complex workflows and data that comes from different sources. The best
example of such usage is the introduction of a new item (product) into the system.
In this case there is a list of people involved in defining the product properties,
approving them and launching the product on the market. Data validations,
integration of different properties as well as triggering approvals for the item are all
supported by collaborative MDM. The main functionalities this style of MDM
solution should have are task management, data validation and integration of
properties from different legacy applications.
Operational MDM acts as an Online Transaction Processing (OLTP) system that
responds to requests from multiple applications and users. However, this type of
MDM supports processes that are predefined by the MDM users, and does not
have the decisive role of collaborative MDM. The operational method of use is
best seen in SOA services as well as basic database operations, where MDM supports
transactions that retrieve, update, create and delete data.
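An operational hub answering such create/retrieve/update/delete requests can be reduced to a toy in-memory class. This is a sketch of the general idea only; the class and method names are my own, not those of any product's service interface.

```python
# A toy operational-MDM hub: an in-memory store answering the four
# transaction types (create, retrieve, update, delete) that external
# applications would normally call as SOA services. Names hypothetical.

class MdmHub:
    def __init__(self):
        self._store = {}
        self._next_id = 1

    def create(self, record):
        rid = self._next_id
        self._next_id += 1
        self._store[rid] = dict(record)   # copy, so callers can't mutate it
        return rid

    def retrieve(self, rid):
        return self._store.get(rid)

    def update(self, rid, **changes):
        self._store[rid].update(changes)

    def delete(self, rid):
        self._store.pop(rid, None)

hub = MdmHub()
rid = hub.create({"name": "ACME", "country": "SI"})
hub.update(rid, country="AT")
print(hub.retrieve(rid))   # {'name': 'ACME', 'country': 'AT'}
hub.delete(rid)
print(hub.retrieve(rid))   # None
```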
Analytical MDM has a completely different method of use: it is about the
intersection between Business Intelligence (BI) and MDM. It is a one-way
communication where data from different systems is sent to the MDM hub for
consolidation and preparation for analytical systems. Since the MDM repository
stores all master data cleansed, organized and managed, it is an excellent source for
OLAP, star schemas for data warehouses, data mining, predictive analysis based on
scoring etc.
However, MDM systems cannot be strictly divided into these three categories.
Depending on the different business processes in each enterprise and the frequency of
their change, MDM systems can often cross over from one type to another.
(3) The last dimension, implementation styles, is based on the different ways data
attributes can be stored in the system. This dimension covers the various architectural
styles of MDM. There are four general implementation styles defined throughout the
literature: external reference, registry, reconciliation engine and transaction hub.
The external reference architecture is the simplest MDM solution. It acts as a system
of reference instead of a system of record, because it does not contain actual data,
only references to data which remains in the legacy systems. The external reference
architecture may be simple and easy to implement, but it lacks control over its data.
All this architecture can provide is a reference to data in a legacy system; any further
functionality is disabled because MDM has no access to the data itself.
The registry style is at a higher architectural level, where the MDM solution is a
limited-size data store that contains only unique identity attributes. This means
that instead of containing all data from several applications in one storage, the MDM
system stores only unique attributes for an object, such as ID, name and description,
and references the other data attributes that remain in the legacy systems. This
implementation style is a step forward in MDM development, because it stores some
basic information and also integrates data from different systems with the help of
references. The disadvantage of this architecture is that MDM still does not have all
data available. Keeping references still does not solve the problem of bad data. Also,
it often cannot retrieve all information due to legacy system failures.
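The registry idea, identity attributes plus references, can be sketched in a few lines. The system names, record IDs and fields below are invented for illustration; a real registry would call out to the legacy systems over the network, which is exactly where the failure risk mentioned above comes in.

```python
# Registry-style sketch: the MDM registry keeps only identity attributes
# plus references into the legacy systems that hold the remaining data.
# All system names, IDs and fields are hypothetical.

legacy_crm = {"C-17": {"phone": "+386 1 234", "segment": "retail"}}
legacy_erp = {"9001": {"credit_limit": 5000}}

registry = {
    1: {"name": "ACME Corp",                       # identity attribute
        "refs": [("crm", "C-17"), ("erp", "9001")]},  # references only
}

def resolve(master_id):
    """Assemble the full view by following references into legacy systems."""
    sources = {"crm": legacy_crm, "erp": legacy_erp}
    entry = registry[master_id]
    full = {"name": entry["name"]}
    for system, key in entry["refs"]:
        full.update(sources[system].get(key, {}))  # empty if the system fails
    return full

print(resolve(1))
# {'name': 'ACME Corp', 'phone': '+386 1 234', 'segment': 'retail', 'credit_limit': 5000}
```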
Reconciliation engine – In this architectural style, data can be exchanged in both
directions: from the MDM database to legacy applications and vice
versa. The MDM system can store the complete set of data attributes for a domain, but
it is not the only place that manages data. Legacy applications still manage their data
and synchronize it with the copy stored in the MDM system. The ongoing matching
and synchronization in the MDM repository keeps the master data up to date. The main
challenge in this architecture is that data is still changed in other systems,
which can often cause unreliable data in the MDM system. With the growth of data
attributes in external sources, it becomes more difficult and complex to keep up with
the synchronization updates.
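One possible reconciliation rule is last-writer-wins: when copies diverge, keep the field value with the newest change timestamp. This is a simplified sketch under that assumption; real reconciliation engines use far richer survivorship rules, and all record shapes here are invented.

```python
# Reconciliation-engine sketch: the MDM copy is periodically reconciled
# with legacy copies. Each field carries a (value, timestamp) pair and
# the most recently changed value wins (an illustrative rule only).

def reconcile(mdm_rec, legacy_recs):
    """Merge field values, keeping the one with the newest timestamp."""
    merged = dict(mdm_rec)
    for rec in legacy_recs:
        for field, (value, ts) in rec.items():
            if field not in merged or ts > merged[field][1]:
                merged[field] = (value, ts)
    return merged

mdm = {"address": ("Old Street 1", 100)}
crm = {"address": ("New Street 5", 250)}   # changed later in the CRM
erp = {"address": ("Old Street 1", 90)}
print(reconcile(mdm, [crm, erp]))
# {'address': ('New Street 5', 250)}
```

The sketch also shows why this style gets harder as attributes multiply: every new field in every legacy source is one more (value, timestamp) pair to track and merge.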
The transaction hub is the most sophisticated architectural style of MDM systems. This
implementation style is the actual system of record for the other applications. The
central data storage is placed in the MDM system, where master data is cleansed,
organized and managed. All the other external systems use the data from the MDM
repository. This architectural concept is the core of master data management and
achieves all the goals of a single version of data. However, the complexity
of this structure brings some difficulties when implementing it among external legacy
systems. There are two major changes that need to be made during the implementation
of this architecture: (1) data needs to be integrated and centralized into one data
storage and (2) the other systems need to be changed to work with the new transaction
hub. The idea of the transaction hub fulfills all requirements for data quality
improvement, but its realization can be impossible in some large enterprises with
complex systems.
The fourth implementation style, the transaction hub, is shown in Figures 11 and 12.
As seen in both figures, data from external processes is imported with an Extract,
Transform and Load (ETL) process and accessed through different user interfaces
(UI).
Figure 11: Traditional MDM architecture
Source: A. Berson and L. Dubov, Master Data Management and Data Governance, 2011, p. 108
Figure 12 shows a more advanced model practiced in the latest solutions, where MDM
architects try to solve the problem of data sharing between the MDM repository and
external applications, including various SOA services. The goal is to make the MDM
solution a metadata-driven SOA platform that provides and consumes services that
allow the enterprise to resolve master entities and relationships and move from
traditional account-centric legacy systems to a new entity-centric model rapidly and
incrementally (Berson and Dubov, 2011, p. 85).
Figure 12: MDM architecture with additional published services
Source: A. Berson and L. Dubov, Master Data Management and Data Governance, 2011, p. 108
As seen from the discussion above, there are different styles of MDM systems based on the
three-dimensional model. Which approach is chosen depends on the business requirements of
the enterprise. In recent years, trying to serve all types of business, vendors have been moving
towards multiform MDM systems: solutions that implement various domains, methods of use
and implementation styles in order to be suitable for every type of business.
3.3. Architecture of MDM described through selected MDM solutions
Despite the great variety of MDM solutions on the market, I chose Microsoft, IBM, Oracle
and SAP because they are well-known and well-established vendors of database software
as well as Business Intelligence (BI) solutions. According to Gartner (2012, p. 2), in 2012
they were placed in the leaders' quadrant for Data Warehouse Database Management Systems
and also for BI platforms. Since MDM systems' main concern is data management and they
are also involved in BI processes, I was curious to find out what these leaders have to offer
on the MDM market.
Table 2: Gartner's Magic Quadrants for Data Warehouse Database Management Systems and
for BI Platforms
Source: Gartner, 2012, p. 2
I will present a short overview of the MDM solutions of the four vendors mentioned above.
The following concepts will be covered for each of them:
History of development;
Data modeling;
Data import and export;
Data validation;
Data security;
Advantages and disadvantages.
3.3.1. Microsoft Master Data Services
Master Data Services (MDS) is a product which Microsoft acquired from Stratature in 2007.
Already a customer of Stratature, Microsoft had been impressed with the rapid time to value
and the ease of customization that Stratature’s +EDM product provided. Microsoft initially
planned to ship its MDM solution as part of SharePoint, because information workers are the
primary consumers of master data. However, because IT plays a significant role in managing
MDM solutions, MDS moved to the Server and Tools division and became a feature of SQL
Server 2008 R2 (Graham and Selhorn, 2011, p. 6). MDS can be installed as an additional
feature of SQL Server 2008 R2 or any newer version.
Data modeling. MDS comes with a blank database, which means that there is no data in
the MDM repository, no tables and no pre-defined data models. There is a metadata model
that comes with every installation, as well as sample models for Product, Customer and
Account, but they serve more as examples than as templates that can be used as a starting
point for developing a data model. The MDS model is based on the relational database model,
just with different terminology. There are four data objects made available to users: entity,
attribute, member and hierarchy, each corresponding to a certain object in the relational data
model (Graham and Selhorn, 2011, p. 56). The table below relates MDS and relational
database objects.
Table 3: MDS repository objects vs. Relational database objects
MDS repository Relational database
Entity Table
Attribute Column
Member Row
Hierarchy Relationship
MDS supports several models in a repository; however, it allows relationships only between
entities from the same model. Hierarchical relationships are supported, and this parent-child
structure allows the grouping of data into collections, hierarchical groups for better
organization and maintenance of data. Figure 13 shows an example of the MDS Model
explorer. As seen in this picture, there are several models: Chart of Accounts, Customer and
Product, each having its objects organized in a tree structure. Hierarchies are supported only
between entities of the Product model, while relationships between objects from Product and
any other model are not allowed.
Figure 13: MDS data model
Source: Bullerwell, Kashel & Kent, Microsoft SQL Server 2008 R2 Master Data Services, 2011, p. 75
Data import is done through an ETL package created in SSIS. There is no feature that
allows a direct connection between SSIS and MDS, so special skills are required to set up
the whole loading process. However, MDS does not rely on Microsoft SSIS alone; it can also
use ETL tools from other vendors such as Informatica or InfoSphere DataStage. In order to
protect data, each repository comes with staging tables that are copies of the tables of the
existing data model objects. Staging tables are used during data load, to store newly imported
data before it enters the production data model.
Data export is done by publishing subscription views on a defined server. Subscription
views are nothing but views over the tables in MDS. Once exported, data from these
views can be queried in SQL Server Management Studio with a simple SELECT statement.
The subscription views can be set up to refresh every night in case a frequent update of the
data is needed. To be used in other systems, they can be exported as flat files and then
imported into different databases. Web services are also available in this system, so data
import and export can be performed programmatically as calls to such services.
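The staging-then-promote pattern and the read-only subscription view can be simulated with plain data structures. This is a conceptual sketch only: the validation rule, field names and function names are my own, and in MDS itself these steps are staging tables, batches and SQL views rather than Python lists.

```python
# Sketch of the staging pattern: rows land in a staging area first,
# only valid rows are promoted into the model, and the "subscription
# view" is a read-only snapshot of the model. Names hypothetical.

staging, model = [], []

def load_to_staging(rows):
    staging.extend(rows)

def promote():
    """Move only rows that pass basic checks from staging into the model."""
    global staging
    for row in staging:
        if row.get("code") and row.get("name"):   # reject incomplete rows
            model.append(row)
    staging = []                                  # staging is emptied

def subscription_view():
    """A read-only snapshot, analogous to querying a subscription view."""
    return [dict(row) for row in model]

load_to_staging([{"code": "P1", "name": "Widget"},
                 {"code": "", "name": "Bad row"}])
promote()
print(subscription_view())   # [{'code': 'P1', 'name': 'Widget'}]
```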
Data validation is done in different areas of MDS. Data management, validation and
cleansing can be done on data import and also using matching techniques or validation rules.
The first line of defense is data import: since data is first imported into staging tables, they
act as a safeguard against bad data. Also, using the same structure of predefined
tables for every model sets a general standard for the organization of each data domain. Data
is additionally checked when loaded from the staging tables into the actual data model.
Batches that run in MDS check the data for compatibility with the MDM model structure and
report any inconsistencies, like NULLs, improper format, length etc.
The main tool for duplicate detection and data cleansing is the matching operator in MDS.
This operator has two values, Match and Does Not Match, and works on a user-defined
similarity level. The similarity level is a decimal number which defines how precisely two
values must agree to be considered a match; the closer the value is to 1, the more exact the
match must be. To prevent the entry of erroneous data, MDS uses validation rules, which are
also logical operators and return an error when wrong data is entered. Examples of such rules
are the detection of NULLs, defining a range of allowed values for an attribute etc.
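A similarity-level match of this kind can be approximated with a generic string-similarity ratio. In this sketch, Python's `difflib` merely stands in for MDS's internal matching algorithm, which is not public; the 0.8 threshold is an arbitrary example of a user-defined similarity level.

```python
# Sketch of a similarity-level match: two values "Match" when their
# similarity ratio reaches a user-defined level between 0 and 1.
# difflib is a stand-in for the actual MDS matching algorithm.
from difflib import SequenceMatcher

def matches(a, b, similarity_level):
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= similarity_level

print(matches("ACME Corp", "Acme Corp.", 0.8))   # True: near-duplicate
print(matches("ACME Corp", "Globex", 0.8))       # False
```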
MDS also helps speed up some workflow processes by sending emails or
notifications to users who are in charge of some action or approval. These notifications are
usually triggered by data changes. This feature is an attempt at the collaborative method of
use, and a step towards making MDS a solution that fully supports all the defined
methods of use.
Data security is supported through different roles assigned to MDS users. There is an admin
role with full permissions in the system, and specific groups of roles with limited
access to data in MDS, such as browse-only or edit.
Another way of securing data is through the creation of versions, which are snapshots of the
data currently in the MDS repository. Versions are used (1) to track changes that
were made in the past, (2) to roll back changes and (3) to track the version of the model that
each external system is using (Graham and Selhorn, 2011, p. 234). Even though versioning is
not linked to roles and permissions, the ability to save data at a certain point in time can save
the company time and work in case of major database failures.
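The snapshot-and-rollback idea behind versioning can be shown in a few lines. This copy-based sketch is only loosely analogous to MDS versions (which are managed inside the repository itself); the repository contents and function names are invented.

```python
# Versioning sketch: save snapshots of the repository and roll back to
# one of them, loosely analogous to MDS versions. Names hypothetical.
import copy

repository = {"P1": {"name": "Widget", "price": 10}}
versions = []

def save_version():
    versions.append(copy.deepcopy(repository))   # deep copy = true snapshot

def rollback(index):
    global repository
    repository = copy.deepcopy(versions[index])

save_version()                      # version 0
repository["P1"]["price"] = 99      # a bad update slips in
rollback(0)                         # restore the snapshot
print(repository["P1"]["price"])    # 10
```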
A new addition in SQL Server 2012 is DQS (Data Quality Services). These services work
on a knowledge base that the organizational data steward maintains. Based on this knowledge
base, DQS cleanses, matches and validates data according to business rules. The
services are integrated with MDS through the MDS Excel Add-in for Matching.
The advantages and disadvantages of MDS are presented in Table 4. Based on the discussed
functionalities and architecture, there are many ways in which MDS can improve the quality
of organizational data. However, due to various limits in its design, the user still needs to use
workarounds and extend the data model to support certain business requirements related to
the enterprise data.
Table 4: Advantages and disadvantages of MDS

Advantages:
- Domain neutral; does not limit the user to specific types of master objects
- Familiar database structure similar to an RDBMS
- Simple interface that does not require IT skills or programming knowledge
- Versioning of data is enabled

Disadvantages:
- No prebuilt data model, which requires time, work and user knowledge to define a data
model for each domain
- No relationships are allowed between different domains
- No support for multi-valued attributes, which requires additional tables and relationships
for their implementation
- Data import and export are done with different tools and require special skills and
knowledge to set up the whole loading environment
Despite the fact that Microsoft has long been in the database management software market
and is one of the leaders according to Gartner's Magic Quadrant for 2012, it did not earn a
place in the Magic Quadrant for MDM applications. MDS is a simple solution for small and
medium enterprises, but due to the limited functionality discussed earlier it cannot support
complex enterprise businesses. The way MDS is designed now, all it can offer is a simple
user interface for data model structure and maintenance in limited database scenarios; it
requires a lot of remodeling and additional functionality to fully develop in the following
areas: data import and export tooling, integration with other external systems, support for
complex decision workflows, thorough data analysis functionality and support for
relationships between different domains.
3.3.2. SAP NetWeaver
SAP NetWeaver MDM is part of the NetWeaver computing platform, consisting of several
core products such as Application Server, Business Intelligence, Enterprise Portal etc. MDM
was introduced into this family of products in 2004, when SAP purchased a small vendor in
the PIM space called A2i (Ferguson, 2004). Because this code was specifically intended for
the product domain, the first release, SAP MDM 5.5, was considered a PIM solution rather
than a general MDM system. In 2008 SAP released an enhanced version of MDM, called SAP
NetWeaver MDM 7.1, and a year later it launched a full MDM suite containing various
applications as well as improved MDM technology to build pre-packaged business scenarios
and integration. The current version of the SAP NetWeaver MDM Suite contains the
following components (Rao, 2011, p. 21):
MDM Import Manager;
MDM Import Server;
MDM Data Manager;
MDM Syndication Server;
MDM Syndicator.
Data modeling. The SAP MDM solution stores its data in a central MDM repository. It is a
complex structure of several table types, so that it can store different kinds of data, from
simple integers to PDF files and pictures.
Figure 14: Table types
Source: L. Heilig et al, SAP NetWeaver™ Master Data Management, 2007, p. 192
The SAP data model resembles a star schema. Master data attributes are stored in main
tables, which are flat tables in most cases. They contain the main data attributes and
references to subtables where additional data attributes of the master objects are stored. The
MDM repository supports various types of data fields. A novelty here is that multi-valued
attributes are supported, which is a plus because it "saves" the repository from the additional
tables that would otherwise be needed to store the values of such attributes. Also,
relationships and hierarchies are implemented in the same way as in the relational model
(Heilig et al., 2007, p. 193).
This MDM system can support any type of domain. In addition to the neutral environment
that SAP offers, there are also several predefined repositories for the Customer,
Employee, Supplier, Material, Business Partner and Product domains, which can be used as a
starting point for data model development and extended depending on the business
requirements of each enterprise.
Data import. The SAP MDM suite has automated the process of loading data into the
repository. Also, importing is done field by field instead of record by record, which
significantly speeds up the process. There are two ways in which data is imported into the
system. Both are done in the Import Manager, but SAP MDM allows either loading the
actual data into its tables or assigning key mapping pairs, used mostly for external systems.
In the first case, SAP supports different data sources such as database servers, XML, text and
Excel files. During import, the preparation of the source data requires more time and work
than the import itself. Data needs to be validated, matched and mapped to the existing fields
in the MDM repository. However, this is done only once, and the whole process can be saved
as an import map and reused on the next import.
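The idea of a reusable import map, a once-defined mapping from source columns to repository fields, can be sketched as a simple dictionary. The column names and the helper function below are invented for illustration and do not reflect the Import Manager's actual format.

```python
# Sketch of a reusable "import map": source columns are mapped once to
# repository fields and the map is re-applied on every later import.
# Column and field names are hypothetical.

import_map = {"CUST_NAME": "name", "CUST_CITY": "city"}   # source -> repository

def apply_import_map(source_rows, mapping):
    """Rename mapped columns and drop everything that is not mapped."""
    return [{mapping[col]: val for col, val in row.items() if col in mapping}
            for row in source_rows]

source = [{"CUST_NAME": "ACME", "CUST_CITY": "Ljubljana", "JUNK": "x"}]
print(apply_import_map(source, import_map))
# [{'name': 'ACME', 'city': 'Ljubljana'}]
```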
Another way of integrating data from different systems is through key mapping. Instead of
loading the actual data from the legacy repositories, key-value pairs are created in the SAP
MDM database. They contain a unique MDM ID that is the same for matching records in
different external systems. In this case, the original record ID is kept in the legacy repository
and the MDM unique ID is the link between the external source and the master data stored in
the central database (SAP NetWeaver Master Data Management (MDM). MDM Import
Manager, 2011, p. 407).
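The key-mapping structure itself is small: one MDM ID per real-world entity, mapped to the record IDs each external system uses. All IDs and system names in this sketch are invented; it only illustrates the lookup in both directions.

```python
# Key-mapping sketch: the MDM repository assigns one MDM ID per entity
# and maps it to the record IDs kept in each external system.
# IDs and system names are invented for illustration.

key_mapping = {
    "MDM-001": {"crm": "CUST-17", "erp": "9001"},
    "MDM-002": {"crm": "CUST-42"},
}

def legacy_ids(mdm_id):
    """Which record IDs in which systems represent this master entity?"""
    return key_mapping.get(mdm_id, {})

def mdm_id_for(system, local_id):
    """Reverse lookup: find the MDM ID for a legacy record."""
    for mdm_id, systems in key_mapping.items():
        if systems.get(system) == local_id:
            return mdm_id
    return None

print(legacy_ids("MDM-001"))      # {'crm': 'CUST-17', 'erp': '9001'}
print(mdm_id_for("erp", "9001"))  # MDM-001
```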
Data export is done in a similar manner to data load. SAP MDM calls this process data
syndication. Exporting can also be done automatically, but in this case users need to create
export maps that define the flow of data between the MDM repository and the destination
items. The final product of this export is an XML schema or flat files that are then imported
into other systems. What users need to be careful of are changes in master data that may
happen during export: in case master data is updated while an export is being executed, the
exported file may contain a mix of old and new data.
Figure 15: Key mapping during import and export
Source: L. Heilig et al, SAP NetWeaver™ Master Data Management, 2007, p. 201
Another way of using MDM data is the key mapping technique discussed earlier for
import. External systems can access master data in SAP using the unique MDM IDs
assigned to each record from the legacy systems (Figure 15) (SAP NetWeaver Master Data
Management (MDM). MDM Syndicator, 2012, p. 15-86).
Data validation This MDM suite validates data on several occasions: during data import,
data export and data management. SAP is built in such a way that any work with data
involves data validation and management. Matching is the core functionality used to check
for duplicates, cleanse them from the repository and prevent their import into the system. In
order to detect duplicates and validate data, SAP has put a lot of thought into this process and
developed it as a complex set of rules and strategies. All processes that fall under matching are
used for data validation and for cleansing wrong values from the database. Transformations,
matching functions, matching rules, strategies and substitutions are some of the features that
are part of MDM matching. Identical records are detected during matching based on
user-defined similarity scores. Other matching rules and functions use logical operators to
determine equality between values. The whole process is record centric, which means that for
each record there is a group of zero or more potential matches. Once matches are found, they
are merged into a single record. An additional advantage for data management is the
architectural structure of the MDM database, which supports various types of tables, fields
and relationships.
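The record-centric scoring described above can be sketched as follows. The similarity function and the 70-point threshold here are simplified stand-ins for illustration, not SAP's actual rule engine.

```python
# Minimal sketch of record-centric matching with user-defined similarity
# scores: for each record, collect the group of zero or more potential
# matches whose score reaches the threshold.
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude 0-100 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def find_matches(record, candidates, threshold=70):
    """Return the group of potential matches for one record."""
    matches = []
    for cand in candidates:
        score = similarity(record["name"], cand["name"])
        if score >= threshold:
            matches.append((cand, round(score)))
    return matches

repo = [{"id": 1, "name": "Acme Corp."},
        {"id": 2, "name": "ACME Corporation"},
        {"id": 3, "name": "Globex Ltd."}]
new = {"name": "Acme Corp"}
print(find_matches(new, repo))   # the two Acme variants score high
```

Once such a group is confirmed, the records in it would be merged into a single master record, as described above.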
Data security By default, MDM servers are not password protected and everyone can access
them. Therefore, there has to be an admin user who creates passwords and restricts user
access to the system. There are two levels of password protection: the server level, which
includes password protection for the various applications in the system, and the repository
level, which covers repository passwords and access. User roles and permissions are stored in
separate tables in the MDM repository. For example, the record for a system user contains a
username, a password and a reference to the privileges table where additional permission
values are stored.
Another way of keeping the central data safe is through copies of the repository, supported
by a master/slave concept. The master is where changes occur, and the slave is an auxiliary
repository that is updated by synchronizing with the master repository. There is another
type of slave repository, called a publication slave, that acts as a backup version of the master
repository. Once data is loaded into a publication slave repository it stays unchanged unless
the repository is loaded into the system again and put online for work. Another way to keep
versions of master data is by duplicating the existing MDM repository. This copy of the data
can be saved on other disks and loaded whenever the user needs it (Heilig et al., 2007, p.
211).
Advantages and disadvantages of SAP are listed in Table 5. Most of the advantages are
related to the various types of data objects supported in the database, as well as the automation
of import and export, which greatly facilitates users' work. The disadvantages refer to the
complex interface and the large amount of preparation work required for automated imports
and exports.
Table 5: Advantages and disadvantages of SAP

Advantages:
Domain neutral;
Offers prebuilt repositories for certain domains;
Supports different types of tables, data types and relationships, multivalued attributes;
Automated import and export;
Effective matching rules that cleanse and prevent duplicates in the repository;
Various IT and business scenarios for MDM implementation and usage;
Security architecture that enables different roles for work with the data.

Disadvantages:
Complex interface;
Requires time for the user to get acquainted with it;
Time-consuming preparation process for import and export;
Inconsistent updates and exports of data can lead to a mix of old and new data in exports;
The key mapping approach can bring in data inconsistency;
Not suitable for small enterprises.
The SAP MDM system offers many functions for mastering data. Complex matching processes,
various table objects and a domain-neutral data model create a solution that gives users great
freedom to manage any kind and any type of data. The key mapping functionality allows data
communication with external systems without changes to the code of these legacy
applications. Automated imports and exports make data loads and distribution much faster
and more precise. However, all this freedom of choice brings an additional burden to users
during data preparation: there is a long checklist that needs to be completed before processes
are ready for execution. A helpful circumstance is the ability to save this preparation work for
similar scenarios in the future.
Overall, SAP has largely succeeded in its intention to automate master data management
processes, but the system's functionality needs to be improved so that users have less work in
the preparation phase. Due to its size and complexity, this solution is not appropriate for small
enterprises, but rather for large and complex businesses.
3.3.3. IBM InfoSphere
IBM offers a great variety of products for data integration and management; InfoSphere is the
line of applications that supports these processes. Therefore, I cannot limit this review of
MDM implementation through IBM solutions to just one application, but have to mention
several of them to explain the different MDM processes.
IBM's first MDM developments started in 2004 with acquisitions of products from
different vendors. For example, IBM InfoSphere Information Server was first launched in
2004, when IBM purchased the data integration company Ascential Software and
rebranded its suite as IBM Information Server. The same year IBM also acquired Trigo, a
product MDM software vendor, and renamed its software WebSphere Product Center.
The next year, IBM acquired Customer Data Integration software from DWL and
rebranded the product as WebSphere Customer Center. In 2008 IBM released the full version
of InfoSphere Information Server. IBM Master Data Management Server has a similar
development history: it was released in 2008 and is a combination of IBM's customer
integration tools from WebSphere Customer Center (WCC) with workflow capabilities from
WebSphere Product Center (WPC) (Press release notes from IBM, retrieved February 7,
2013, from http://www-03.ibm.com/press/us/en/index.wss).
Other known products that fall under the IBM InfoSphere brand and are used for managing
data are (Zhu et al., 2011, p. 47):
IBM InfoSphere Blueprint Director;
IBM InfoSphere Business Glossary;
IBM InfoSphere Discovery;
IBM InfoSphere Metadata Workbench;
IBM InfoSphere Asset Manager;
IBM InfoSphere Information Analyzer;
IBM InfoSphere QualityStage;
IBM InfoSphere Audit Stage;
IBM InfoSphere FastTrack;
IBM InfoSphere DataStage;
IBM InfoSphere Data Architect.
IBM offers a rich application suite that covers all processes in master data management, from
documenting business rules, workflows and terminology to cleansing and merging duplicate
records and distributing them to external systems or files. Each component from the list above
performs different functions, and the same functionality can be supported by several
applications.
Data modeling There is no particular database vendor or database schema that IBM follows
during data modeling. Trying to provide a platform-independent product, IBM made an
MDM solution that is domain, software-platform and database neutral. The IBM MDM
repository can be prebuilt, in case there is a data model for the specific domain, or blank,
where the user builds the database from scratch. Planning and building the master repository
is a three-step process supported by three different types of models: (1) logical, (2) domain
and (3) physical (Wilson et al., 2011, p. 60-82):
(1) Logical model - the first step of data model development, where the planning
process occurs. It is a diagram of entities, attributes and relationships that
represents the database structure and the workflows of the business processes that
work with master data;
(2) Domain model - used to define the data tables that will store the future master data. It
follows the logical model, defined above, to "draw" the data objects of the master
repository. The lowest level that can be modeled here is the data field, along with its
data type, length and restrictions (if any). Like the logical model, the domain model
is vendor neutral and is used to set general standards for the master database
architecture;
(3) Physical model - the final step of data modeling, when the actual database is
created. It is a vendor-related model, so users have to choose the appropriate database
management system. Data objects and rules are created based on the concepts defined
in the first two models.
Data models are built in the IBM InfoSphere Data Architect solution. Once the modeling
process is done, InfoSphere Data Architect can generate database-specific data definition
language (DDL) scripts from the physical model. DDL scripts contain statements to create,
update or drop data objects and can be run on a specific database server.
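The step from a physical model to a DDL script can be illustrated with a small sketch. The model structure and table definition below are invented for illustration; Data Architect works from its own model files, not from Python dictionaries.

```python
# Hedged sketch of turning a physical model into a DDL script, in the
# spirit of what a modeling tool does when it emits CREATE statements.
physical_model = {
    "table": "CUSTOMER",                      # hypothetical example table
    "columns": [("CUST_ID", "INTEGER NOT NULL"),
                ("FULL_NAME", "VARCHAR(120)"),
                ("EMAIL", "VARCHAR(254)")],
    "primary_key": "CUST_ID",
}

def to_ddl(model):
    """Render one table of the physical model as a CREATE TABLE statement."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in model["columns"])
    return (f"CREATE TABLE {model['table']} (\n  {cols},\n"
            f"  PRIMARY KEY ({model['primary_key']})\n);")

print(to_ddl(physical_model))
```

The generated script could then be run against the chosen database server, which is exactly the role the vendor-specific physical model plays in the three-step process above.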
Figure 16: Logical model Figure 17: Domain model and physical model
Source: E.Wilson et al, InfoSphere Data Architect, 2011, p. 60, 82
Figure 18: Physical model
Source: E.Wilson et al, InfoSphere Data Architect, 2011, p.90
Once created, existing database models and objects can be updated with another application,
IBM InfoSphere Asset Manager, which is used to import physical models and to create or
update data objects. Like every advanced MDM system, IBM also uses a staging area where
all data changes are stored first; once validations are passed, the changes are applied to the
actual central data storage.
Data modeling is not the only option for designing the master database. IBM also supports
reverse engineering, which allows users to convert already existing data objects into a physical
model. This feature allows the reuse of existing database structures when building the central
repository, instead of starting from scratch. Another advantage is the possibility to easily
compare different databases before merging data from different sources (Wilson et al., 2011,
p. 129).
Data import IBM InfoSphere provides different ways of importing data. Depending on the
database structure and business scenarios, data can be loaded through batch transactions or
through one of the applications mentioned earlier. Batch transaction processing is used when
the database is empty and large amounts of data need to be loaded into the repository. Each
record to be imported is read, parsed and distributed into the appropriate business objects.
MDM assigns a unique identification key that serves as an internal key for every record
imported into the master repository. Data files for this type of import must be in SIF
(Standard Interface Format), a pipe-delimited file format.
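The read-and-parse step for a pipe-delimited file can be sketched as follows. The field layout here is hypothetical, chosen only to show the general shape of a pipe-delimited batch load; it is not the real SIF schema.

```python
# Parse pipe-delimited records of the general shape used in batch loads.
# Field names and sample data are illustrative assumptions.
import csv

sif_lines = ["A1001|Jane|Doe|jane.doe@example.com",
             "A1002|John|Smith|john.smith@example.com"]

FIELDS = ["source_key", "first_name", "last_name", "email"]

def parse_sif(lines):
    """Turn pipe-delimited lines into records keyed by field name."""
    reader = csv.reader(lines, delimiter="|")
    return [dict(zip(FIELDS, row)) for row in reader]

records = parse_sif(sif_lines)
print(records[0]["last_name"])   # -> Doe
```

Each parsed record would then be distributed into the appropriate business objects and assigned its internal identification key, as described above.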
InfoSphere FastTrack is an application used for data import, mostly during updates, merges
and smaller data loads. The most important part of this import is mapping the source file data
to the appropriate database columns in the master repository. The whole process is similar to
the familiar ETL process.
Figure 19: Example of field mappings during data import
Source: IBM InfoSphere FastTrack, 2011, p. 10
Data export Data from the master repository can be shared through direct transfer from
master repository tables to external application tables, or through web services. The first
option is available in InfoSphere FastTrack, and the export process is similar to the import,
except that the data flows in the opposite direction: the mapping is done from master data
objects to external system tables.
Since the IBM MDM architecture is based on SOA, another way of sharing master data is
through web services. External applications can retrieve master data with web service
requests for a certain entity. There is no specific rule about which approach to use; it all
depends on the business scenario and the user's choice.
Data validation IBM InfoSphere validates data in the same manner as the other MDM
solutions: before, during and after import. Techniques for data cleansing and management
are organized in four steps: (1) understand organizational goals and how they determine user
requirements, (2) understand and analyze the nature and content of the source data, (3) design
and develop the jobs that cleanse the data and (4) evaluate the results (IBM InfoSphere
QualityStage, 2011, p. 2-5):
(1) In order to properly manage master data, users need to get acquainted with the
business requirements. The role IBM InfoSphere has in this first step is to assist users
in graphically representing their business rules. As discussed earlier, this is done
while building the logical model and defining business entities;
(2) IBM InfoSphere applications offer different kinds of data analysis. The application
mostly used for analyzing data content is InfoSphere Information Analyzer. It
provides different kinds of analysis, among which column, cross-domain and key
analysis are the best known. Column analysis is performed on the data in a certain
column and gives a general overview of the column's properties as well as detecting
anomalies in the column's records. Cross-domain analysis matches data between
different tables in order to find duplicate and redundant data. Key analysis is used to
detect relationships between tables and columns and to define primary and foreign
keys based on the uniqueness of data.
Another way to explore data content is through matching. IBM InfoSphere tools
provide matching by value and by pattern. Value matching is similar to a free-form
lookup, where data is matched against a given value. Pattern matching looks for data
that matches a given format, such as an SSN or email address; IBM uses regular
expressions to perform it. Below is an example of the results of an SSN pattern
match, listing all tables that contain fields in SSN format.
Figure 20: Example of SSN pattern match
Source: J. Zhu, Metadata Management with IBM InfoSphere Information Server, 2011, p. 241
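The pattern-matching idea from step (2) can be sketched with a regular expression for the common NNN-NN-NNNN SSN layout. The sample values are invented; the real tools apply such patterns across whole columns, not short lists.

```python
# Flag column values that fit the SSN format NNN-NN-NNNN using a regular
# expression, the technique described in step (2) above (sketch only).
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

values = ["123-45-6789", "987654321", "123-45-678", "000-12-3456"]
ssn_like = [v for v in values if SSN_PATTERN.match(v)]
print(ssn_like)   # only the properly formatted values survive
```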
(3) Once data is analyzed, the next step is to define jobs that will match and cleanse it.
IBM MDM offers prebuilt matching jobs that ship with the product; however, users
can define their own matching jobs and rules based on business requirements. The
matching process is similar to the ones described for the other vendors: it is based on
starting points (cutoffs) and weights that measure the similarity of data. An interesting
approach IBM introduces here is speeding up data matching jobs by setting up rules
that group data into different blocks. This approach lowers the number of comparisons
that arise when two columns are matched. Blocking works on a sort-group-divide rule.
However, it may turn into a costly operation that requires building complex subqueries
for data processing and comparison. Also, incorrect data blocks may result in false
negatives, when a record pair that does represent the same entity is not matched
because the records are not members of the same block;
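The blocking idea in step (3) can be sketched as follows: records are grouped by a cheap blocking key, and candidate pairs are formed only within each group. This is illustrative only; real blocking is configured in the tooling, not hand-coded, and the sample records are invented.

```python
# Sketch of sort-group-divide blocking: compare pairs only within blocks,
# which sharply reduces the number of comparisons.
from collections import defaultdict
from itertools import combinations

records = [{"id": 1, "name": "Anna Novak", "zip": "1000"},
           {"id": 2, "name": "Ana Novak",  "zip": "1000"},
           {"id": 3, "name": "Anna Novak", "zip": "2000"}]

def candidate_pairs(recs, block_key):
    blocks = defaultdict(list)
    for r in recs:
        blocks[r[block_key]].append(r)        # group by blocking key
    pairs = []
    for block in blocks.values():             # divide: compare within blocks
        pairs.extend(combinations(block, 2))
    return pairs

pairs = candidate_pairs(records, "zip")
print([(a["id"], b["id"]) for a, b in pairs])   # only (1, 2) is compared
```

Note that records 1 and 3 describe the same person but land in different blocks, so the pair is never compared: exactly the false-negative risk described above.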
(4) The last step in data validation is the evaluation of results and the setting up of rules
to prevent further inconsistencies. When several duplicates are found for one master
entity, IBM MDM rules merge all the unique data representations that refer to the
same master object. The goal is to retain as much information as possible in the
master record.
Figure 21: Example of record merge
Source: IBM InfoSphere QualityStage, 2011, p. 150
Similar to the matching engine that contains predefined matching jobs, IBM also
offers a rule engine to save all user-defined rule jobs so that they can later be reused.
Besides data rules, consistency of master data can also be achieved with data
transformations. This usually applies to common values like gender codes, streets
and addresses: data is transformed into a general format used across the whole
master database.
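The merge in step (4) can be sketched as a simple survivorship rule. The "prefer the longest non-empty value" rule and the sample records are assumptions for illustration; real survivorship rules are configured per field.

```python
# Minimal survivorship sketch: merge duplicate records into one master
# record, keeping the most complete value per field (illustrative rule).
duplicates = [
    {"name": "J. Smith",   "phone": "040-111-222", "email": ""},
    {"name": "John Smith", "phone": "",            "email": "j.smith@example.com"},
]

def merge(records):
    """Build one master record by picking the longest non-empty value per field."""
    master = {}
    for field in records[0]:
        master[field] = max((r[field] for r in records), key=len)
    return master

print(merge(duplicates))   # one record combining the best of both duplicates
```

This mirrors the goal stated above: after the merge, the master record carries the fullest name, the phone from one source and the email from the other.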
Data security Security in the IBM Information suite is based on user/password authentication,
role-based permissions and monitoring. User permissions are checked at several levels. As
mentioned earlier, the IBM InfoSphere platform is based on SOA, and user transactions are
web service requests to the MDM server. Therefore, the security system checks whether the
user has permission to invoke such requests and to make updates, and it also controls the
visibility of data objects in the master repository to which users do or do not have access.
Another benefit of this system is that MDM is configured to keep a history log of all changes,
so changed records can be reconstructed at any time. Monitoring is done when users connect
to the system: an administrator can observe and control their actions (work sessions).
Advantages and disadvantages of IBM InfoSphere - as Table 6 shows, there are far more
advantages than disadvantages, due to the size and variety of this solution. These
characteristics support different models and functionalities that fit different types of business
requirements.
Table 6: Advantages and disadvantages of IBM InfoSphere

Advantages:
Domain neutral, but also offers prebuilt models for the Party, Product and Account domains;
Provides documenting of workflow processes and their further reuse;
Systematic planning of the master data model through several types of models: logical, domain, physical;
Exports of models in reusable files: XML and DDL scripts;
Offers prebuilt matching jobs and a rule engine for rule definitions and their common use;
Variety of data content analyses, data transformations and standardizations;
Compatible with different kinds of databases and platforms;
Blocking process techniques for more efficient matching;
Data can be shared through web requests, so external applications do not have to make major changes to their databases;
Reverse engineering;
Security provided at different levels in the system.

Disadvantages:
The same functionalities are repeated in different applications in the InfoSphere portfolio; many of the applications are intertwined, which can often be confusing to users;
Special file adjustments to the SIF format during batch transaction processing;
Excessive mapping needs to be done before importing data from external sources, and the same applies on export;
The great variety of data transformations and standardizations can change the data substantially and may produce completely new records; such transformations can result in false positives, matches that are not actual matches;
The blocking process during matching can be efficient but also complex and time consuming; irregular block division can cause false negatives, records that match but are not detected because they were placed in different blocks.
IBM InfoSphere is a rich portfolio of tools for data management and integration. It offers a
great variety of applications that cover all the processes of data management, from planning
and modeling to cleansing, merging and distribution to external systems. It supports all
domains, implementation styles and methods of use, is platform independent and is
compatible with all types of software. IBM did not create a solution just for the moment: it
also included features that allow users to save all their documentation, models and rules in a
common knowledge base that they can recall afterwards. Another novelty IBM can be proud
of is reverse engineering, which facilitates system integration during mergers or acquisitions.
However, this solution has a few disadvantages. With the various data transformation
techniques supported in QualityStage, users are given the freedom to transform data for
easier matching. However, there is no limit to how far a user can go in changing data, and
such transformations can often change data in a way that makes it lose its context and no
longer represent the correct business object.
Defining complicated subsets of data for faster matching can be expensive and time
consuming, and it can create wrong results, the false negatives and positives mentioned
earlier. It is good that users have the freedom to work with master data any way they want,
but there should still be some system restrictions that give users guidance and warn them of
possible mistakes.
Another thing I noticed is that many similar features can be found in different applications.
For example, data cleansing can be done in DataStage and QualityStage; data analysis in
Information Analyzer and InfoSphere Discovery; and import and export of data in Metadata
Workbench and Asset Manager, but also in any other component. IBM's intention with this
shared functionality was perhaps to broaden each application's feature set so that users would
not have to work in several applications to get clean data. However, there should be either
one application that supports the whole data management process, or several components
with a precise set of features, so that it would be less confusing for the user.
Overall, IBM InfoSphere is a mature solution that implements great techniques for data
management. Both Information Server and MDM Server can be used for managing data from
large and complex systems, and many modules from InfoSphere can be acquired and used
independently for data analysis and cleansing. Therefore, the InfoSphere line of products is
suitable for all sorts of enterprises and lines of business.
3.3.4. Oracle MDM Suite
Oracle introduced its MDM products ten years ago, starting with programs for managing
customer and product data and ending up with solutions for data management called the
Customer and Product Data Hubs. The idea of developing applications in the MDM area
started internally, when Oracle's E-Business Suite was dealing with customer data quality
issues. Oracle first developed a program to manage the customer data model, called Oracle
Customers Online, and shortly after its release built Oracle Advanced Product Catalogue,
another program in the same suite, to manage product data. With the addition of data quality,
source system management and application integration capabilities, these two products grew
into the Oracle Customer Data Hub and the Oracle Product Hub. A major breakthrough on
the MDM market happened when Oracle acquired Siebel and Hyperion Data Relationship
Management (DRM). After releasing the Customer and Product Hubs, Oracle expanded its
MDM line of products with the Finance, Site and Supplier Hubs (Butler, 2011).
Oracle is currently focused on developing Fusion versions of its existing Hubs. These Fusion
applications are a combination of SOA and MDM: they provide integration, management and
distribution of master data among applications from external systems. So far, the Customer,
Product and Accounting Fusion Hubs are available on the market.
The MDM solutions from this vendor will be discussed through several products from the
Oracle MDM Suite. Below is a list of applications that belong to the Oracle MDM portfolio
(Oracle Master Data Management. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/master-data-
management-ds-075053.pdf, 2010, p. 1):
Enterprise Data Quality;
Oracle Customer Hub;
Oracle Product Hub;
Oracle Supplier Hub;
Oracle Site Hub;
Oracle Higher Education Constituent Hub;
Hyperion Data Relationship Management.
Data modeling Oracle's MDM products come with predefined data models for each
entity. Users cannot start from a blank database, but they can extend the existing tables in the
master repository with new columns. The data models Oracle uses are based on the Trading
Community Architecture (TCA), a data model that allows users to manage complex
information about customers, organizations and customer accounts. The base of this model
is reused and readjusted when designing models for other domains, such as product, site,
etc. Tables in the master repository have standardized names, each starting with the HZ
prefix followed by the name of the entity whose attributes are stored; for example,
HZ_PARTIES stores data for parties and HZ_CONTACT_POINTS for a party's contact
points. The database is relational, organized into tables (entities), columns (attributes) and
relationships (hierarchies) (Oracle Trading Community Architecture, 2006, p. 1).
Figure 22: List of predefined tables for Customer entity
Source: S. Anand, Trading Community Architecture, 2008
Data import Since Oracle provides a predefined data model, data is imported into the HZ
tables discussed earlier. There are several ways to import data: (1) SQL/ETL Load, (2)
D&B Load and (3) File Load (Oracle Trading Community Architecture, 2006, p. 8-18):
(1) SQL/ETL Load: data is first extracted with scripts or tools, values are transformed to
meet the data requirements of the interface tables, and afterwards the data is loaded;
(2) D&B Load: data prepared by D&B is sent in a standard D&B bulk file, which is
then run through the D&B Import Adapter and automatically mapped and loaded into
the interface tables;
(3) File Load: data is loaded from a comma-separated values (CSV) file, or a file
delimited by another allowed character, with Oracle Customers Online (OCO) or
Oracle Customer Data Librarian (CDL).
Before being loaded into the master repository, data is first imported into staging tables,
matched and cleansed, and afterwards imported into the interface tables. The staging tables
are copies of the existing tables and serve as temporary storage for the external data being
imported into the repository. Even after importing data, TCA runs post-import processes for
data standardization. These include various data transformations, such as name conversions
to meet database standards, replacement of letters in phone numbers, removal of NULLs, etc.
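Two of the standardizations just mentioned can be sketched briefly. The functions below are illustrative assumptions about what such transformations do, not Oracle's actual implementation, which is configured rather than hand-coded.

```python
# Sketch of two post-import standardizations: replace letters in phone
# numbers with their keypad digits, and drop NULL-like field values.
KEYPAD = {c: d for d, letters in
          {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
           "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}.items()
          for c in letters}

def standardize_phone(raw):
    """Map vanity letters to digits, leaving digits and punctuation alone."""
    return "".join(KEYPAD.get(ch.upper(), ch) for ch in raw)

def drop_nulls(row):
    """Remove NULL-like values so they do not pollute the master record."""
    return {k: v for k, v in row.items() if v not in (None, "", "NULL")}

print(standardize_phone("1-800-FLOWERS"))          # -> 1-800-3569377
print(drop_nulls({"name": "Acme", "fax": "NULL"})) # fax field is dropped
```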
Data export Data for a certain entity can be exported to an Excel spreadsheet. For data
distribution to external applications, however, Oracle uses cross-referencing, an approach
that assigns a unique key ID to each record in the central repository and maps it to the
corresponding record in the external systems (similar to the key mapping discussed for the
SAP solution earlier).
Figure 23: Example of cross reference between PARTIES (master table) and
SYS_REFERENCES (external systems)
Source: Better Information through Master Data Management – MDM as a Foundation for BI, 2011, p. 9
With the help of the Application Integration Architecture (AIA), data can be shared with
other applications through web services. This enables external applications from different
platforms to receive managed data from the Oracle MDM Hubs.
Since every Oracle MDM hub has its own domain of concern and a different architecture, the
cross-referencing processes also differ. There are two possibilities: one-way and two-way
cross-referencing. In the first approach, data flows one way, from the Hub to the external
applications: data is managed and updated only in the Hub and afterwards sent out to the
other systems. This approach is used in the Product Hub. Two-way cross-referencing is
implemented in the Customer Hub, where data flow is managed in both directions: from the
hub to the external systems and vice versa. Data is managed in the hub, but can also be
updated in the external systems and sent to the master repository for import. Changed data
sent from external systems must pass predefined validations before it is loaded into the
central database. This type of data sharing gives external systems the freedom to use
managed data from the Oracle Hubs without major changes to their legacy applications
(Cross-Referencing for Master Data Management with Oracle Application Integration
Architecture Foundation Pack, 2008, p. 5).
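The one-way and two-way flows just described can be sketched around a cross-reference table in the spirit of Figure 23. System names, IDs and function names here are illustrative assumptions.

```python
# Sketch of a cross-reference table: each master record maps to its
# counterparts in the source systems. Outbound (one-way) publishing fans a
# change out to every mapped system; inbound (two-way) updates are first
# resolved back to the master ID.
xref = [
    {"master_id": 10, "system": "BILLING", "source_id": "B-77"},
    {"master_id": 10, "system": "CRM",     "source_id": "C-12"},
    {"master_id": 11, "system": "CRM",     "source_id": "C-99"},
]

def publish_targets(master_id):
    """One-way flow: where must an update to this master record be sent?"""
    return [(x["system"], x["source_id"]) for x in xref
            if x["master_id"] == master_id]

def resolve_master(system, source_id):
    """Two-way flow: map an inbound change back to the master record."""
    for x in xref:
        if x["system"] == system and x["source_id"] == source_id:
            return x["master_id"]
    return None

print(publish_targets(10))             # -> [('BILLING', 'B-77'), ('CRM', 'C-12')]
print(resolve_master("CRM", "C-99"))   # -> 11
```

In the two-way case, the resolved record would still have to pass the predefined validations before being loaded into the central database.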
Data validation As mentioned earlier, data is checked for errors right after being imported
into the repository. Oracle uses several techniques to validate, cleanse and manage data, most
of them similar to those discussed for the previous MDM solutions. The data validation
techniques are based on transformation, matching and merging, and are part of Data Quality
Management (DQM), the data management mechanism of the TCA model.
Figure 24: Example of data validation workflow
Source: Data Quality Management, 2012, p. 9
Oracle MDM examines data through several steps (Data Quality Management, 2002, p.
2-25):
(1) Step one - transformation functions. These functions include character or blank
space replacement, removal of doubled letters, or any other data changes that enforce
certain standards throughout the database. Oracle also uses word replacement, which
replaces similar word variations with one standard word. Users often enter different
data for the same item, in some cases using the item's full name and in others a
shortcut. To avoid such irregularities, MDM keeps one name per item: if Slovenia
was previously entered in the system as Slovenia, SLO or SI, MDM can replace all
these variations with SLO. Word replacement and transformation functions can
reveal duplicate data throughout the system that was previously hard to detect;
(2) Step two - match rules. Matching is done in a similar manner as in the solutions
discussed earlier. Because a user cannot be familiar with each and every record, the
best way to detect a match is for the user to define matching points that need to be
reached. Oracle calls these values thresholds, and based on such limits the user can
define whether two records are a match or not. Thresholds should be set to an
average value, between 40 and 60, because a small threshold may return results that
are not a match, whereas a high value may exclude a lot of data, and some possible
matches with it;
(3) Step three - duplicate identification and merging. Once data is transformed and the
match rules are defined, subsets of data can be prepared for duplicate identification.
Oracle DQM provides batch jobs that compare records from different groups,
looking for duplicates: each record from one dataset is matched against all records
from the other datasets. Batch jobs can run for a long time if there are many records
to compare, which is why it is often better for the user to define subsets of records
and apply the batch job to those subsets. Once duplicates are identified, merging of
identical records follows; the old record is deleted because it has already been
merged into the new one.
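Steps (1) and (2) can be combined in a small sketch: a word-replacement transformation followed by threshold-based scoring. The replacement table, scoring function and mid-band threshold are simplified stand-ins, not Oracle DQM's actual rules.

```python
# Sketch of word replacement (step 1) feeding threshold matching (step 2):
# variant spellings are first normalized, then scored against a threshold.
from difflib import SequenceMatcher

REPLACEMENTS = {"slovenia": "SLO", "si": "SLO"}   # step 1: word replacement

def transform(value):
    """Replace known variants with the standard word, else just normalize case."""
    return REPLACEMENTS.get(value.strip().lower(), value.strip().upper())

def score(a, b):
    """0-100 similarity of the transformed values."""
    return round(SequenceMatcher(None, transform(a), transform(b)).ratio() * 100)

THRESHOLD = 50   # a mid-band value, per the 40-60 recommendation above

print(transform("Slovenia"), transform("SI"))   # both become SLO
print(score("Slovenia", "SI") >= THRESHOLD)     # a match after transformation
```

Without the transformation step, "Slovenia" and "SI" would score near zero; normalizing first is what lets the threshold rule surface such duplicates.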
An extra feature that Oracle MDM provides in the validation process is monitoring and
managing data decay. Data repositories accumulate large amounts of obsolete data over time;
however, an enterprise cannot always delete this data, because it may still be used for
analyzing historical transactions. Oracle MDM supports this data lifecycle by monitoring
data decay and flagging data as active or passive. This tracking marks the currently active
data and makes it available to the live applications. Data that is no longer used is flagged as
passive and stored in remote locations; it is not accessible to external applications and is
usually used for reporting.
Data security Oracle provides a robust and precise security model that gives users rights to
work with certain data or hierarchies. It is based on roles and authentication, and as in any
other security model, administrators have all rights. Security is set at a granular level that
even controls user access to different versions of the data.
Advantages and disadvantages of Oracle MDM are shown in Table 7. Even though there
are several disadvantages, regarding Oracle's domain-specific solutions and the lack of
support for collaborative MDM, there are still workarounds that can compensate for these
deficiencies.
Table 7: Advantages and disadvantages of Oracle MDM

Advantages:
Supports several domains: Customer, Product, Account, Site;
Prebuilt data models, so users do not have to start from a blank database;
Keeping a copy of the data in staging tables prevents errors on import;
Versioning data saves archives of data changes;
Different ways to import data;
Cross-reference data sharing, which makes Oracle compatible with different kinds of external systems (including non-Oracle ones);
Monitoring data decay and decreasing the data load based on active and passive data;
Automatic batch processes for faster and more efficient duplicate identification.

Disadvantages:
Does not cover all domains;
Data models can be modified, but not built from scratch for a completely new entity;
Hubs are domain dependent: for each new domain Oracle launches a new Hub;
Oracle focuses more on data cleansing and deduplication, but does not offer great support for setting up rules to govern data;
Collaboration is excluded: only the operational and analytical implementation styles are supported (a collaborative implementation style may be implemented in the new line of advanced MDM hubs, called Fusion Hubs).
Based on the various features discussed earlier, the Oracle MDM suite can stand alongside the IBM and SAP MDM solutions with its complex architecture and the various data management functionalities it offers. Organizing data management in different hubs, based on the type of entity, is of great help for users because they do not have to purchase the whole suite but only those applications that are needed for managing their data. Also, with the use of Application Integration Architecture and web services, these "parts" of the suite can be easily integrated with applications from different platforms and vendors. The prebuilt data models that come with the Oracle software are also of great help, as they give users more time for data validation instead of creating a data model. Another advantage worth mentioning is the abundance of transformation and matching rules that help in data management. Also, organizing the data in versions and comparing it across versions allows users to keep track of changes over time, without making these versions separate data sets with no connection between themselves.
However, it seems that Oracle tries to simplify the job for users by giving them everything prebuilt, and constraints can appear because of this reduced flexibility. First, not all domains are supported. The database structure is already given, and users may have to make changes in their own systems before loading data into the MDM repository. Defining a Hub for each domain is a different approach than that of IBM and SAP; each Hub seems to function as an independent application. Also, data governance is at a lower level: in some cases an Oracle Hub seems like a passive registry that serves cleansed data to external systems but does not do much to keep it clean and managed.
Overall, the main domains are covered, and for special lines of business these Hubs can be readjusted. With the new Fusion Hubs that are already launched on the market, Oracle strengthens the collaborative methods and evolves into a multi-dimensional MDM suite at a lower total cost of ownership.
4. ANALYSIS OF SELECTED MASTER DATA MANAGEMENT
ARCHITECTURES
There are several aspects of comparison that can be considered when analyzing the discussed MDM architectures. I have chosen three approaches: (1) data quality dimensions, (2) the three-dimensional model and (3) the five data management phases, which cover: profile, consolidate, govern, share and leverage. These three approaches give an overview of the problem, the solution model and the management processes described through the selected MDM solutions. The first approach covers data quality, which is a common problem in every enterprise's data. The second approach is based on the three-dimensional model that gives a general view of an MDM solution. And the last approach summarizes the different management techniques that are present in each of the selected products.
4.1. MDM of selected architectures and quality dimensions
In the first part of the thesis I defined and covered data quality and the dimensions that describe this subject. Since Master Data Management was founded and developed to deal with data quality improvement, it is understandable that MDM solutions include various validation techniques that work towards improved quality of data, and not of all kinds of data, but of enterprise data only.
Table 8: DQ dimensions and MDM

Intrinsic (believability, accuracy, objectivity, reputation) – common to all four solutions:
- Standardization
- Matching and de-duplication
- Stewardship

Contextual (value-added, relevancy, timeliness, completeness):
- Microsoft MDS: central repository, versions, publication to slave repositories
- SAP NetWeaver: history logs, web request updates
- IBM InfoSphere: history logs, merge
- Oracle MDM Suite: data decays, merge

Representational (interpretability, ease of understanding, consistency, concise representation) – common to all four solutions: data standardization and data transformations; in addition:
- Microsoft MDS: data ranges
- SAP NetWeaver: dropdown fields (qualifiers), different types of table structures
- IBM InfoSphere: domain model, mapping on import and export
- Oracle MDM Suite: word replacement, mapping on import and export

Accessibility (accessibility, access security) – common to all four solutions:
- Database queries, web requests, exports
- User roles, username/password authentication
The table above gives an overview of the data quality dimensions and how each MDM product contributes to their maintenance and improvement. Similar techniques are used for managing the different quality dimensions, and some of them are common to several MDM products.
The first set of DQ dimensions is managed in a similar way in all four solutions. Since intrinsic quality is based on the actual data content, given as it is, the main goal of MDM is to keep the data content accurate for each item it relates to. Data standardization, matching, de-duplication and stewardship are the core processes of master data management, and it is expected that all of them are present in the solutions. As discussed earlier for each solution individually, the data validation techniques are lists of transformations, replacements, NULL removals, word arrangements etc. Matching and de-duplication are present in every MDM product, because one of the main data quality issues is duplicate data. And in order to maintain data quality in the system, data rules are available in each of the four solutions; they are defined based on the business requirements each user has.
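The matching and de-duplication idea shared by all four solutions can be sketched as follows. This is a minimal illustration using Python's standard difflib; the normalization step and the 0.85 threshold are assumptions for the example, not any vendor's actual rules.

```python
from difflib import SequenceMatcher

# Illustrative sketch of standardization followed by similarity scoring,
# the common pattern behind matching and de-duplication in MDM tools.

MATCH_THRESHOLD = 0.85  # assumed cut-off, not a vendor default

def normalize(value):
    """Standardize before matching: trim, lowercase, collapse whitespace."""
    return " ".join(value.strip().lower().split())

def match_score(a, b):
    """Similarity score between two standardized values, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_duplicate(a, b, threshold=MATCH_THRESHOLD):
    """Treat two values as duplicate candidates if the score passes the cut-off."""
    return match_score(a, b) >= threshold
```

Real MDM engines use far richer scoring (phonetic keys, per-attribute weights, logical operators), but the standardize-score-merge pipeline follows this shape.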
Contextual data quality represents how complete and up-to-date the information about a certain object is. The fact that each of these solutions has a central database where it merges data from all external systems into a single "golden" record covers completeness as a DQ dimension; the MDM database being the central repository of an enterprise system is a common strategy in all solutions. Keeping data accurate and current is managed in different ways. For example, MDS keeps versions of data, to "freeze" data and models at a certain point in time, while Oracle keeps track of data decay, marks unusable data and stores it in separate locations. However, to have real-time data, all MDM solutions implement SOA architecture. This is the most suitable way to integrate data from different sources and propagate changes as they occur; all other exports, in CSV or flat files, are delayed versions of the data.
Representational data quality is again partly managed in the same way, but each solution also has additional specific features that support this group of quality dimensions. To keep a unique and standard format throughout the database, each of these solutions applies data standardization rules as well as transformations. In addition, Microsoft MDS, for example, uses data ranges to predefine the allowed values that an attribute can possess; SAP supports different types of tables and fields, lists of values and qualifiers; IBM InfoSphere defines data formats in its modeling phase, using the domain model; and Oracle provides different string functions for data transformation and standardization.
Accessibility is maintained in all solutions through SOA services and their security models. Each solution has a different type of security, but in general they are all based on a role-and-authentication model. The level of security is defined on a general as well as a modular level. Depending on the structure of each solution, users can have permission to view certain applications, models or data. Users can also have different privileges, from only browsing data to processing all CRUD operations. With defined roles, data can be queried with database statements or retrieved through web requests.
4.2. Comparison of selected architectures through the three dimensional model
MDM architecture is based on the three-dimensional model defined by Zachman. This model includes: domains, methods of use and implementation styles. Because MDM solutions are spread out over various modules, each with different functionality, the best way to form a general picture of one vendor's solution is to unite them according to the principles of this model.
Table 9: MDM solutions and the three-dimensional model

Latest version:
- Microsoft MDS: add-in to Microsoft SQL Server 2012
- SAP NetWeaver: version 7.1
- IBM InfoSphere: version 9.1
- Oracle MDM Suite: last release of Customer Hub, version 8.2

Domain:
- Microsoft MDS: domain neutral
- SAP NetWeaver: domain neutral or domain based (Customer, Employee, Supplier, Material, Business Partner, Product)
- IBM InfoSphere: domain neutral or domain based (Party, Product, Account)
- Oracle MDM Suite: domain based (Product, Account, Customer, Site)

Method of use:
- Microsoft MDS: operational and analytical
- SAP NetWeaver: operational and analytical; collaborative when combined with SAP Business Process Manager
- IBM InfoSphere: operational, analytical and collaborative
- Oracle MDM Suite: operational and analytical; collaborative can be achieved with additional applications for business process management (BPM)

Implementation style:
- Microsoft MDS: physical master repository, transactional hub
- SAP NetWeaver: physical master repository, registry, reconciliation engine, transactional hub
- IBM InfoSphere: physical master repository, registry, reconciliation engine, transactional hub
- Oracle MDM Suite: physical master repository, registry, reconciliation engine, transactional hub
(1) Domains – All of these MDM products, except Oracle MDM, are domain-neutral solutions: they support a model for every type of domain. Some of them, like SAP and IBM, offer prebuilt models for the most common enterprise master domains, Customer and Product, which is of great help for users because they do not have to model and plan their database from scratch. Oracle, on the other hand, is the only one among these solutions that does not allow complete freedom when choosing a domain model; all of its MDM products come with a predefined domain model. However, since the basic architecture and concepts of MDM are well implemented, Oracle has no problem customizing the model to suit different business master objects;
(2) Methods of use – The operational and analytical methods of use are supported in all four selected MDM architectures. With support for CRUD operations for data processing and for business requirements, these products cover the operational side of enterprise transactions. Also, since all of them serve as a central repository for the external systems, storing master objects as well as the related attributes and dimensions, they give users a 360-degree view of their main domains. Processed and cleansed data in the master repository is a main source of data for reports, OLAP cubes and various BI tools; accumulating data from the whole enterprise system in one place provides rich information for analytical use. The collaborative method is the only one that is not fully integrated in all of these products. IBM InfoSphere is the only solution that has IBM BPM Express integrated in its MDM architecture, and with this feature it supports the management not only of the data but also of the business processes. This method can be implemented in the other solutions by integrating different applications, in most cases business process management tools from the same or other vendors. Currently only IBM InfoSphere offers a whole package that supports all three methods of use without additional upgrades;
(3) Implementation styles – Microsoft MDS supports the lowest number of implementation styles due to its limited architecture. The other three solutions, SAP, IBM and Oracle, offer different ways of storing and sharing data. Depending on the business requirements and the structure of the legacy applications, these three solutions can offer a central repository that stores all the data, or only a registry style of database that plays the role of a system of reference. They also offer the most advanced way of integrating and communicating data, the transactional hub, since they all have implemented SOA services for sharing data across various platforms. Microsoft MDS, as an add-in to the database management system, can offer physical storage for the cleansed data, which can be accessed by querying the subscription views with ordinary SQL queries. A transactional hub in this solution may be achieved with the Application Programming Interface (API) and the use of Windows Communication Foundation (WCF) services. Key mapping is not supported in this solution as it is in the other three MDM architectures; it may be achieved by customizing the model in combination with WCF services, but it is not something that comes with the product.
4.3. Comparison of selected architectures through the five MDM activities
The main goal of MDM is to consolidate data from different sources and generate a single "golden" record for use. MDM achieves this goal through five main activities that occur in the following order: (1) profile, (2) consolidate, (3) govern, (4) share and (5) leverage.
(1) Profile – Data assessment is usually done on import. All of these solutions support ETL processes. When new data is mapped to the existing data structures, the user can perform part of the assessment of the new records based on the predefined rules in the master repository;
(2) Consolidate – The selected MDM architectures use different approaches, but in all of them data import is based on the well-known ETL process: data is extracted from the external sources, transformed to match the structure of the master database and loaded into the central repository. Even though most of these solutions try to automate the import process, mapping is still done by the user. One facility that SAP NetWeaver and IBM InfoSphere offer is documenting such mappings for future reuse. Data is also consolidated with unique key-value pairs that are created and stored in the master repository and used as external references to the source-system data;
(3) Govern – Cleansing is part of the govern phase. Typical for all solutions is the scoring strategy, in which they start off with predefined match scores that estimate the probability of duplicates in the data. Before running match jobs, they also transform data into a standard format for easier duplicate detection, and different logical operators are used in the matching processes. Once duplicates are detected, they are merged into one record and the obsolete one is removed from the master database. In order to prevent future duplicates and errors, these solutions implement rules engines in their architecture, to detect potentially bad input and warn the user to change data that does not match the standards. An additional BPM tool combined with MDM works even better, managing the business process and preventing the generation of improper data, which greatly facilitates the job of MDM;
(4) Share – In order to provide real-time cleansed data, all systems support SOA. This architecture, mentioned several times already, is an open architecture for sharing data between applications;
(5) Leverage – Once data is well structured and cleansed, it serves as a reliable source for BI tools and analytical systems. Reporting is supported by all of these architectures as the last phase, for data preview.
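The key-value cross-referencing from the consolidate activity can be sketched as follows. This is an illustrative model, not any vendor's schema: each master record remembers the source-system keys it was consolidated from, so source updates map back to the same golden record.

```python
# Sketch of consolidate-phase key mapping: (source_system, source_key) pairs
# act as external references into the master repository.

class MasterRepository:
    def __init__(self):
        self.records = {}   # master_id -> golden record attributes
        self.xref = {}      # (source_system, source_key) -> master_id
        self._next_id = 1

    def consolidate(self, source_system, source_key, attributes):
        """Load one source record; reuse the master id if the key is known."""
        key = (source_system, source_key)
        if key not in self.xref:
            # First time this source record is seen: create a golden record.
            self.xref[key] = self._next_id
            self.records[self._next_id] = dict(attributes)
            self._next_id += 1
        else:
            # Known source key: merge the update into the existing record.
            self.records[self.xref[key]].update(attributes)
        return self.xref[key]
```

In a real suite the merge step would also apply transformation and survivorship rules rather than a plain dictionary update.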
Below is a table that lists the different tools from the selected vendors that are used in the five phases of MDM.
Table 10: MDM overview through the five data management phases

Profile – common to all four solutions: data is assessed on import, when mapping new data to the existing data structures in the master repository.

Consolidate:
- Microsoft MDS: single central repository where data is loaded through ETL
- SAP NetWeaver: MDM Import Manager, MDM Import Server, key mapping, batch transaction processing of SIF files
- IBM InfoSphere: InfoSphere FastTrack, external reference keys, SQL/ETL jobs
- Oracle MDM Suite: D&B batch loads, file loads from CSV files with Oracle Customers Online (OCO) or Oracle Customer Data Librarian (CDL)

Govern:
- Microsoft MDS: match operator, Data Quality Services, data standards, triggers for sending notifications and e-mails for data change approvals
- SAP NetWeaver: SAP Data Manager, similarity scores, logical operators, merge, data transformations, hierarchies, validation rules
- IBM InfoSphere: InfoSphere Analyzer, InfoSphere Discovery, InfoSphere DataStage, InfoSphere QualityStage
- Oracle MDM Suite: Data Quality Management (DQM), word replacements, validation rules, data decays

Share:
- Microsoft MDS: subscription views, Master Data Manager web application, WCF services
- SAP NetWeaver: MDM Syndication Server, MDM Syndicator, web services
- IBM InfoSphere: InfoSphere FastTrack, SOA and web services
- Oracle MDM Suite: Excel sheets, Application Integration Architecture (AIA) web services

Leverage – common to all four solutions: managed data is a great source for analytical systems; all of these architectures support reporting, the final step in MDM that provides data preview.
5. CASE STUDY OF MDM SOLUTION USED IN STUDIO MODERNA
As an addition to this discussion about MDM architectures, I decided to add another solution, developed for the business requirements of Studio Moderna (SM). The reason why I chose to add a custom-developed MDM is that I wanted to show how a company can handle MDM processes internally without purchasing an off-the-shelf product. For this case study, I will briefly describe the Central Product Register (CPR), a solution for managing product data.
The research methodology used to gather data for this case study is based on unstructured interviews with the project lead of CPR. Communication was done over e-mail and in a couple of meetings, where we discussed the architecture and development phases of this solution. I also spoke with another SM employee who was in charge of testing CPR and entering data into CPR's repository. In addition, I used SM documents that describe the architecture as well as the business logic developed in this solution.
5.1. Problems with product data management
Studio Moderna (SM) is a marketing and sales company that has existed on the market for 20 years. With 5,500 employees, the organization operates in 21 countries in Central and Eastern Europe, Russia and Turkey. There are 5,000 different types of products for various purposes, from electronics and health & fitness to products for kitchen & household, sold through five different channels: TV, Internet, print, shops and telemarketing. Some of the most popular brands are: Dormeo, Delimano, Kosmodisk, Top Shop etc. Studio Moderna is the distributor of choice for all the major global direct-response marketing companies and has been responsible for all the major DRTV product winners in the region. It also works directly with manufacturers from Europe and Asia. With strong direct customer relationships managed through 130 transactional websites, 220+ retail stores, 22 call center locations, 300+ hours of daily TV advertising airtime, 6 of its own TV channels, thousands of retail distributors, 15+ million catalogs and 70+ million calls handled annually, Studio Moderna strives to turn consumer brands and products into household names (Overview of Our Company [Studio Moderna – portal], Retrieved February 7, 2013, from http://www.studio-moderna.com).
Working with a rich portfolio of 5,000 products across 21 countries, SM's system was experiencing problems in managing product data from all those locations. Examples of such problems are:
- Product data was scattered around various applications (systems for eOrdering, Telemarketing, Shop POS, PIS (Product Information System), OLAP Admin);
- The same products were stored in different applications, and their updates required data changes across all applications where their records existed;
- Decentralization of data was producing duplicate records;
- Products from different channels and for different countries were not following the same workflows. For example, products ordered through TV channels were submitted to internal ordering, a step that was skipped for SM fashion group products;
- It was hard to track product status (orders, prices, promotions) in each country, because each country managed product data in its own repository;
- In addition to this last problem, it was difficult for management to track customers' interest in each product. Because product data was managed on a country (local) level, it was hard to determine whether a certain brand was selling enough to stay on the market, or whether its marketing was no longer paying off.
In order to reduce the problems listed above, SM decided to centralize products, change the business processes related to this domain, and store and manage product data with an in-house-built solution called the Central Product Register (CPR).
5.2. Central Product Register (CPR)
The Central Product Register is a Master Data Management solution for product data, built for the purposes of SM. It is developed with Microsoft tools, using the Microsoft Dynamics ERP platform and a SQL database. It is a central repository that stores and manages product data on the local (country) and international (all countries) level.
Development of this system began in May 2011 and it was launched for the first time in March 2012. The solution was designed to cover the following functionalities: (1) basic product management, (2) managing permissions, (3) managing the product lifecycle, (4) managing product marketing data, (5) managing supply chain data, (6) managing central pricing data (central prices: purchase, Suggested Retail Price (SRP), calculation, vendor SRP) and (7) managing local pricing data (local prices: CPO price, retail price, transfer price).
Data is entered through CPR's user interface. There are no bulk imports or data transfers from different sources; instead, data is manually entered by a person assigned to this position. Since local and international data is stored in one place, not everyone has permission to enter or update product data. On the local level, data updates are done by a Local Sourcing Officer, whereas on the international level they are done by a Central Sourcing Officer. Entering all data in one place, by a limited number of people, avoids duplicate data as well as multiple data entry in different applications.
Once data is entered in CPR, it is stored in the central database that represents the main source for all other applications that work with product data, regardless of whether they are used for marketing, analytical or any other purposes.
The whole architecture of this solution is designed in such a way that it does not support data flows from different sources; it only allows manual data entry. Duplicates are prevented by triggers set up on the database level. However, this does not cover all scenarios and does not prevent potential duplicates or misspellings from being imported into the database again. Unfortunately, the system is not developed to detect that Dormeo and Drmeo, for example, may be the same product with a potential misspelling. There are no matching mechanisms scheduled to run and compare what has been imported into the central database.
External applications still work with their own databases; they are not connected to CPR's database in such a way as to use product data directly from this place. Product data can be loaded from CPR into client-system databases in two ways: (1) nightly jobs and (2) the "pull" method.
(1) Nightly jobs are scheduled to run at a certain time of night, when the number of the application's users is very low. These jobs contain complex queries that check for updates on both sides, the CPR master database and the client-system database, and transfer the changes into the client's databases;
(2) In case an external application needs to work with a product that was entered in CPR at that same moment, and cannot wait until the next day when the nightly jobs are executed, the application can make a web request to CPR and retrieve the needed data (pull method).
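The change-detection idea behind the nightly jobs can be sketched as follows. This is a hedged illustration: the field names and timestamp comparison are assumptions for the example, not CPR's actual schema or queries.

```python
# Sketch of a nightly sync step: transfer only master rows that changed since
# the last run, or that the client system does not have yet.

def nightly_sync(master_rows, client_rows, last_run):
    """Return master rows the client must apply after the previous run."""
    client_ids = {row["id"] for row in client_rows}
    changes = []
    for row in master_rows:
        # Changed since the last run, or entirely missing at the client.
        if row["modified"] > last_run or row["id"] not in client_ids:
            changes.append(row)
    return changes
```

The pull method described in point (2) would serve the same per-record lookup on demand instead of waiting for this batch pass.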
Figure 25: Example of data flow in CPR (data import by the Local and Central Sourcing Officers into the CPR master database; data shared with the external applications eOrdering, Shop POS, PIS and OLAP)
Two features that contribute to product data management and are unique for this solution are:
(1) product statuses and (2) data security.
5.2.1. Product statuses
One of the many problems SM's systems experienced with product data was the inability to track product statuses in each country as well as on the international level. This problem becomes more complicated and harder to solve in situations when the same products follow different
business workflows. As a solution for proper data management, CPR introduced two new
concepts: product statuses and product operational lifecycle.
Product statuses are business terms defined specifically for the needs of SM. Examples of such statuses are:
- "New" (the product has been created in CPR);
- "Evaluating" (the international sourcing department is evaluating whether to suggest this product to the countries for consideration);
- "In local test decision" (the countries are deciding whether they would like to sell this product) etc.
The product operational lifecycle covers:
- Management of product statuses;
- Transitions of products between different statuses – it determines whether a product can "cross" from one status to another based on predefined business rules;
- Determining which applications can use product data in a certain status.
Not all products that Studio Moderna markets have the same set of statuses. The statuses are based on the brand, sales channel or other business requirements; therefore, some of them can be activated or disabled for various products.
Interdependency is also defined between product statuses and countries. There is an international and a local status defined for each product. In some cases the local status is marked as primary and dictates the current state of the product regardless of changes on the global level; in other cases the international status takes over the primary role in determining the product's state. For example, Dormeo Pillow is a product sold in all 21 countries. Its local and international statuses are set to "Active" and the primary role is on the local level. If sales drop internationally and the global product status changes to "Retired", Turkey's Dormeo Pillow will still have the "Active" status, based on the local status definition.
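The precedence rule illustrated by the Dormeo Pillow example can be sketched as a one-line resolution function. The function and parameter names are assumptions for illustration, not CPR's actual code.

```python
# Sketch of the local/international status precedence rule: whichever side is
# marked primary for a product determines the effective status per country.

def effective_status(local_status, international_status, primary="local"):
    """Resolve the status that external applications should see."""
    return local_status if primary == "local" else international_status
```

With the local side primary, a country's "Active" status survives a global change to "Retired", matching the Turkey example above.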
Product statuses are changed manually, by the Local or Central Sourcing Officer, based on the sales and customers' response for a certain product. The statuses are involved in the business rules for a product and dictate whether its data is visible in a certain application. Unfortunately, CPR does not support any business logic that could track sales and change product statuses automatically.
5.2.2. Product data security
Placing data into central storage created an additional problem for data security. SM wanted to have all product data in the same place but somehow prevent countries from seeing all product records stored in the master repository. To accomplish this, unique security logic was implemented in CPR.
The security model is based on two perspectives: (1) domain and (2) role.
(1) Domain – Security based on the domain perspective defines two levels of permissions: local and international. Users who have local permission (Local Sourcing Officer) can only work with data from the country for which they have permission and cannot view data from other countries. On the other hand, international permission (Central Sourcing Officer) allows users to work with data from all countries;
(2) Roles are defined based on the various data operations that a user can execute. There are three types of permission: no permission – the user has no access to product data and will not be able to view it or be aware of its existence; read permission – the user can see the data, but will not be able to modify it; full permission – the user can see the data and has the rights to perform CRUD operations.
In addition to this security model, permissions are also determined by product statuses. In some business scenarios one product status can be visible to the Central Sourcing Officer but invisible to other roles.
Users' roles and permissions are hardcoded in web configuration files. Each country has its own configuration file that defines the roles of each user, and those files are updated manually.
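The combination of the domain perspective, the role and the status visibility can be sketched as a permission check. The data shapes and rule encoding here are illustrative assumptions, not CPR's actual configuration format.

```python
# Sketch of CPR-style permission resolution: country (local vs. international
# level), role (none / read / full) and product-status visibility must all
# allow access before a user can see or modify a product.

NO_PERMISSION, READ, FULL = 0, 1, 2

def can_view(user, product, visible_statuses):
    """A user sees a product only if country, role and status all allow it."""
    if user["level"] == "local" and user["country"] != product["country"]:
        return False  # local users never see other countries' data
    if user["role"] == NO_PERMISSION:
        return False
    return product["status"] in visible_statuses.get(user["role"], set())

def can_modify(user, product, visible_statuses):
    """CRUD operations require full permission on a visible product."""
    return user["role"] == FULL and can_view(user, product, visible_statuses)
```

In CPR these rules live in hardcoded per-country configuration files; the sketch only shows how the three perspectives compose at check time.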
5.3. Benefits of CPR
Implementing CPR addressed the product data problems as follows:
Table 11: CPR solutions for product data management

Problems solved by the central master repository:
- Product data was scattered around various applications (systems for eOrdering, Telemarketing, Shop POS, PIS, OLAP Admin);
- The same products were stored in different applications multiple times;
- Decentralization of data was producing duplicate records.

Problems solved by product statuses and the operational product lifecycle:
- Products from different channels and countries were not following the same workflows;
- It was hard to track product status (orders, prices, promotions) in each country, because each country managed product data in its own repository;
- It was difficult for management to track customers' interest in each product: because product data was managed on a country (local) level, it was hard to determine whether a certain brand was selling enough to stay on the market or whether its marketing was no longer paying off.
The benefits that SM gains from this solution are more intangible than measurable. CPR offers:
- A unified repository of product data ("one truth");
- Infrastructure that enables product data management in one system;
- A consolidated view of product data from different client systems;
- No manual retyping of product data between systems, which reduces human errors;
- Centralized data that is a great source for analytical systems (OLAP) and gives a complete summary of product data on the local and international level, something that was much harder to achieve before this solution was introduced.
Unfortunately, CPR has not yet been assessed for ROI; therefore, I cannot present any actual numbers that would show how much SM has saved since it started using CPR.
5.4. Comparison of CPR and selected MDM architectures
CPR and the selected MDM applications are compared from three perspectives: (1) the three-dimensional model, (2) the MDM activities and (3) the cost and time of building and implementing the solutions. These three perspectives are selected to find out: (1) whether the custom-built CPR follows MDM standards in its structure, (2) whether it supports the five activities discussed earlier and (3) whether CPR's development and implementation are worth the invested time and money.
(1)
Table 12: Comparison of the MDM architectures and CPR through the three-dimensional model

Domain:
- Microsoft MDS, SAP NetWeaver, IBM InfoSphere and Oracle MDM Suite: domain neutral, which means they can support every master domain object that the business requires.
- Central Product Register: Product.

Method of use:
- Selected MDM architectures: operational, analytical and collaborative. Most of the solutions offer operational and analytical use, while the collaborative style can be achieved when a BPM application is integrated into the MDM environment.
- Central Product Register: operational, analytical and, to some extent, collaborative. Due to the complex product status logic, various workflows are supported.

Implementation style:
- Selected MDM architectures: physical master repository, transactional hub or registry.
- Central Product Register: physical master repository.
(2)
Table 13: Comparison of MDM architectures and CPR's MDM phases
(packaged solutions = Microsoft MDS, SAP NetWeaver, IBM InfoSphere and Oracle MDM Suite;
CPR = Central Product Register)

Profile
- Packaged solutions: profiling is done on data import.
- CPR: predefined business rules determine which data should be imported for product items.

Consolidate
- Packaged solutions: ETL processes are mainly used for data import, through bulk loads or
  imports from Excel or CSV files. Key mapping is also included, to map master unique key IDs
  to the appropriate data in external applications.
- CPR: there is no actual consolidation of data from various client systems. New data import
  is regulated by user roles; only persons with full permission can make changes in the master
  repository.

Govern
- Packaged solutions: various tools for data quality improvement are used, such as column
  analysis, matching and merging. Data is validated through the rule engines that these
  solutions support.
- CPR: most data validations are handled through triggers that are activated on incompatible
  values, types or NULLs. CPR also implements complex coding logic to enforce the business
  rules that depend on product statuses.

Share
- Packaged solutions: all of these solutions implement an SOA architecture to support data
  retrieval on request. Both pull and push modes are supported: data can be retrieved on
  client request (pull mode) or distributed by the master repository (push mode).
- CPR: supports only pull mode, which means that for every update, client systems need to
  make a request. Automatic updates from CPR to client systems are performed as scheduled
  night-time jobs.

Leverage
- Both of the compared subjects provide a unified source of data for analytical systems.
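The trigger-based validations described for CPR's Govern phase can be illustrated with a minimal sketch in application code. The field names, allowed statuses and rules below are hypothetical examples, not CPR's actual schema:

```python
# Minimal sketch of trigger-style validation on product records.
# Field names, required fields and status values are hypothetical.

ALLOWED_STATUSES = {"draft", "active", "discontinued"}

def validate_product(record):
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    # A database trigger would reject NULLs in required columns on insert/update.
    for field in ("product_id", "name", "status"):
        if record.get(field) is None:
            errors.append(f"{field} must not be NULL")
    # Reject incompatible types.
    price = record.get("price")
    if price is not None and not isinstance(price, (int, float)):
        errors.append("price must be numeric")
    # Reject incompatible values.
    status = record.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        errors.append(f"unknown status: {status}")
    return errors

print(validate_product(
    {"product_id": 1, "name": "Blanket", "status": "active", "price": 19.9}))  # []
print(validate_product(
    {"product_id": None, "name": "Pan", "status": "sold", "price": "cheap"}))  # three errors
```

In a real deployment these checks would live in database triggers or a rule engine rather than application code; the sketch only shows the kind of condition (NULLs, wrong types, invalid status values) that fires them.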
(3)
Table 14: Comparison of MDM architectures and CPR's time and cost
(packaged solutions = Microsoft MDS, SAP NetWeaver, IBM InfoSphere and Oracle MDM Suite;
CPR = Central Product Register)

Time for development, implementation and testing
- Packaged solutions: less than 6 months.
- CPR: 10 months for development and 3 years for implementation in all countries.

Cost
- Packaged solutions: in the range of $500K (350K euros).
- CPR: more than 84K euros (labor alone).
These last estimates in Table 14 are based on the following facts.
Development of the Central Product Register began in May 2011 and it went into production in
March 2012, so it took SM 10 months to build and launch the first complete release. Even
though the solution has been in use for months now, upcoming versions and releases are still
improving CPR. Implementation, on the other hand, is a long-term process planned for the next
three years. The reason is that Studio Moderna implements CPR-related changes one country at
a time. Data transfers, changes in client applications in each country and testing are
time-consuming processes that need to be carried out in all 21 SM locations.
Purchased packaged software is introduced much faster; it usually takes less than six months
to implement the solution. I spoke with Mr. Boštjan Kos, Information Management Client
Technical Professional at IBM, about the IBM InfoSphere MDM Solution, and his estimates for
implementation and testing were the following:
“Difficulty to say as it depends on InfoSphere product you have in mind and which products
are in the scope for specific project. Installation and configuration would take 3-5 days
(simple installation on a single server, without high-availability, without disaster recovery,
etc.), connecting to data sources and data targets would take another couple of days, data
migration depends if it is from source A to target B without any complex transformations is
very easy and done on few clicks per table, but if there are complex transformations needed it
might take much longer. Training would take 3-5 days per module. Looking the whole
migration project I believe it should be finished within 1-6 months, depends from
complexity.”
Regarding the cost of these solutions, it is a bit hard to make a precise comparison, because
the numbers given for both "types" of solutions are rough estimates and really depend on the
scenario. For CPR, I was not able to retrieve the final amount invested in designing the
solution. On average, six to seven people worked on CPR each month; four of them were
external contractors, including consultants (at the beginning of the project). According to
data from SURS (Statistični urad Republike Slovenije), the average net salary of a programmer
was 1,200 euros in March 2011, a couple of months before the project started. So the minimum
amount spent on labor was 84K euros (7 programmers * 1,200 euros * 10 months). But this is
just the minimum: the salaries may be higher, other people such as testers and business
analysts were involved in the process, additional software licenses were purchased, and so on.
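The minimum labor figure is simple arithmetic on the stated (rounded) assumptions:

```python
# Minimum labor cost for CPR development, using the figures stated above:
# 7 programmers at an average net salary of 1,200 euros/month (SURS, March 2011),
# over the 10 months of development. Real costs were certainly higher.
programmers = 7
monthly_net_salary_eur = 1200
months = 10

min_labor_cost_eur = programmers * monthly_net_salary_eur * months
print(min_labor_cost_eur)  # 84000, i.e. 84K euros
```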
On the other hand, packaged software looks much more expensive. Three to five years earlier
(2003-2007), the typical MDM solution cost in excess of $1 million just for the software,
plus an additional $3-4 million for implementation services during the first year. During
2008, price points and product packaging (or rather "repackaging") provided more modest MDM
functionality, and accordingly less complexity, which supported market pricing in the
sub-$500K range. Overall, MDM matured from "early adopter IT project" status to become a
mainstay "Global 5000 business strategy" during 2007-08. These new price points reflect
various types of projects and the related product capabilities, i.e., an enterprise MDM
initiative vs. a very specific business solution. Moreover, market dynamics further drove
price differentiation as the market became more sophisticated and understood the price:value
ratio of hybrid vs. registry vs. tool kit vs. fully fledged MDM applications (Zornes, 2009,
p. 6).
5.5. Build vs Buy MDM solution
The comparison in the previous chapter was made to bring out the positive and negative sides
of packaged and custom-developed solutions and to determine which choice brings more benefits
to an organization.
When deciding on an MDM solution, customers usually consider the following criteria
(Stratature White Paper, pp. 3-6): (1) Model adaptation, (2) Security, (3) Performance and
usability optimization, (4) Notification and workflow, (5) Business rules and validation, (6)
Import and export, (7) Time to complete, (8) Cost of implementation, (9) Risk of failure.
(1) Model adaptation - Based on the comparison in Table 9 presented earlier, all selected
architectures provide freedom in choosing any domain requested by the organization.
Some, like SAP NetWeaver, even offer prebuilt data models to make the implementation
process easier. On the other hand, CPR is limited to just one domain, Product. If
Studio Moderna decides to build another MDM solution for the Customer domain, it has
to start from scratch and develop a new model, because the processing logic for
Customer would differ from the one developed for Product;
(2) Security - A security model is supported in every MDM architecture, based on the same
principles of roles and permissions. CPR has its own security model, which is unique
and also supports the product lifecycle process;
(3) Performance and usability - All of the selected MDM architectures provide a friendly
user interface for data visibility and management. Data analysis, matching, merging
and other data processing techniques are implemented in the front end of the
application and don't require writing excessive queries. CPR likewise uses Dynamics to
build its user interface, so that data stewards can manage product data more easily;
(4) Notification and workflow - Notifications appear in Microsoft MDS, integrated into the
MDM solution to notify users about approvals. However, this logic is not necessarily
present in every architecture, because it wasn't discussed for the other three
solutions. Other than IBM, none of the packaged software supports business processes
unless BPM is integrated into the solution. CPR put great thought into managing
workflows for product lifecycle statuses;
(5) Business rules and validation - MDM packaged software offers powerful rule engines
with predefined rules that users can apply to data. Users can also define their own
rules and store them as future system knowledge. Business rules in CPR are applied,
for example, when automatically activating or disabling product data based on its
status. Its security model also uses business rules to control the visibility of some
product statuses;
(6) Import and export - Data flow is based on ETL processes for import and SOA
architecture for export. Solutions nowadays try to be more open and available for
different platforms so that they can handle data from various external sources.
Import and export in CPR are rather limited: data import is done by users with certain
permissions, while export happens only on client system requests or through nightly
jobs, so there is no automatic data transfer at the moment a change is made in the CPR
master database;
(7) Time to complete - Packaged software is implemented much faster than a custom-built
solution (Table 14). There may be update fixes along the way, but no changes as
drastic as in custom solution development. From personal experience, when custom
development is done, people from different departments are involved in the decision
making and the development itself, which additionally slows development while waiting
for decisions, presentations and approvals during meetings. And even though CPR's
development was done in less than a year, implementation is still ongoing and will
take far more time than implementing packaged software;
(8) Cost of implementation - According to Stratature research (Stratature White Paper,
p. 5), it is much cheaper to buy than to build an MDM solution. The people involved in
such complex solutions are developers with well-compensated salaries and above-average
annual incomes; there are also external advisors, business consultants and additional
software to pay for. Development may run past the project deadline, because a
custom-made solution always carries a risk of failure. Outsourcing is an option to
lower costs, but it is not recommended because of the strong connection between
business and IT and the hands-on support this type of solution needs. In the case of
CPR, the cost was much lower than for packaged software; however, the solution is very
limited and domain focused, and any additional modification brings further costs for
the company;
(9) Risk of failure - This last criterion depends on organizational needs and proper
business planning rather than on the technical difficulties experienced when
implementing MDM. Based on the previous experiences of other companies, an
organization can gather information about MDM packaged software and decide whether a
selected solution suits its business requirements. However, it cannot expect this
software to fix business problems in the organization. An MDM solution can be adjusted
to centralize customer data, for example, but it cannot solve the constant errors and
ambiguities that occur in the system because of the organization's failure to define
the difference between a customer and a supplier. A custom-built solution is
specifically designed according to a predefined business model. It can therefore be
expected to accomplish the set goals, but only at the beginning, while there are no
business model changes. For example, CPR offers a strict and limited number of
workflows focused on the product lifecycle and security, but any introduction of new
logic, a new status or a new business rule means changes in the code and processing
logic, and a higher risk of failure of CPR's current operation.
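The pull-only data distribution discussed under Import and export (criterion 6) can be sketched as follows; the class and method names are illustrative, not CPR's actual interfaces:

```python
# Illustrative sketch of pull-only distribution: client systems must ask the
# master repository for changes; the master never pushes updates on its own.
# All names are hypothetical, not CPR's actual interfaces.

class MasterRepository:
    def __init__(self):
        self._products = {}
        self._version = 0

    def update(self, product_id, data):
        """A change in the master database; nothing is sent to clients here."""
        self._version += 1
        self._products[product_id] = (self._version, data)

    def pull_changes(self, since_version):
        """Clients call this on demand, or as a scheduled nightly job."""
        return {pid: data for pid, (ver, data) in self._products.items()
                if ver > since_version}

master = MasterRepository()
master.update("P-1", {"name": "Blanket", "status": "active"})

client_version = 0                               # last version this client saw
changes = master.pull_changes(client_version)    # client-initiated request
print(changes)  # {'P-1': {'name': 'Blanket', 'status': 'active'}}
```

In push mode, by contrast, `update` itself would notify subscribed client systems, so changes would propagate at the moment they are made rather than at the next request or nightly job.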
CONCLUSION
Master Data Management architecture has matured over the years from single-domain hubs for
data cleansing into complex applications that provide not only data quality improvement but
also collaboration with users and business processes. When determining its business value,
there are two types of benefits to look into: (1) intangible and (2) tangible.
(1) Intangible benefits - MDM helps organizations solve four key issues: data redundancy,
data inconsistency, business inefficiency caused by data errors, and supporting
business changes (White, 2007, p. 2). Data quality dimensions as well as data
inconsistencies were discussed in the first part of the thesis. MDM was defined as a
discipline for improving data quality; therefore, the immeasurable benefits mostly
consist of solving existing data errors;
(2) Tangible benefits - Even though I reviewed a great number of publications on MDM, only
a few presented quantitative gains that organizations get from MDM applications. These
studies (Table 1) show monetary returns in many areas of the business, but say very
little about how the numbers were determined: was the profit achieved in the first
year or after a few years of MDM implementation; how much was invested in order to
reach these savings; and so on. Even in the Studio Moderna case study, I could not
find out whether they have seen any actual increase in sales or return on investment
since their solution was brought into use. This difficulty in estimating actual return
on investment points to the productivity paradox that IT companies have struggled with
since the 1970s and 1980s. The cost of an MDM solution is around 500K dollars, and yet
the savings are hard to measure.
Four selected architectures were presented, and they follow the MDM activities: profile,
consolidate, govern, share and leverage. Microsoft MDS offers the least functionality of the
four solutions and still needs further development to become a complex suite comparable to
its competitors. SAP, IBM and Oracle, on the other hand, offer all kinds of possibilities for
data management and cover various business domains. They have implemented powerful matching
engines and validation rules used in detecting bad data. Applications from these MDM
portfolios can also be purchased separately, integrated into an organization's existing
system, and made to work with tools from other vendors.
The Central Product Register that I presented is a unique solution for Studio Moderna and
covers its needs for product data management. However, it is specifically designed for one
domain and one business scenario covering the product lifecycle, and it still requires a lot
of manual work. Imports are done by a person; they don't come as a data flow from external
sources. Export also works in only one way, the already mentioned pull method. The same
solution cannot be reused for another business domain because it doesn't support the business
logic of any domain but Product. Product status is based on sales, but processes involving
suppliers are not covered. For example, how can CPR handle the status of a product that sells
actively but has problems with its supplier? This opens another discussion about managing
data and involves product-related logistics systems.
Considering that most activities for managing product data in CPR are done manually and don't
require unique automation or architecture, I would say that the selected architectures
discussed earlier could be a suitable replacement for CPR, for the following reasons:
- Data model - all of these solutions are domain neutral, with the exception of Oracle;
but Oracle offers a Product Data Hub, which means the Product domain is covered in all
of the packaged software. Some of them even have predefined product data models, which
would help when defining the data model for SM;
- Data import - all four solutions offer a user interface for entering data. Some of
them, like SAP or IBM, have this process automated, which additionally speeds things
up when large sets of product data records for an SM client country need to be
migrated to the CPR database;
- Data export - all four solutions support SOA, which means they are platform
independent and can send data upon client systems' web requests, something that is
also implemented in CPR;
- Data validation - a rules engine is supported in all four solutions and can implement
SM's business rules as well. Since product statuses are changed by the Local or
Central Sourcing Officer, visibility of the rest of the product data can also be
controlled by defining additional business rules that depend on these statuses;
- Data security - as discussed earlier, all of these solutions have a strong security
model as part of their architecture that can also be adjusted for SM.
From the above summary, in my opinion it is better if an organization decides on packaged
software, because:
- An organization can buy as many applications as it needs, which lets it choose
according to its requirements without spending money on useless applications;
- The solutions are easily adaptable to various environments and platforms, and also
leave room for future upgrades and for adding new modules and domains. So, instead of
building an MDM solution on top of existing ERP systems, organizations can use SAP MDM
Data Manager or IBM InfoSphere QualityStage and integrate them with their current
systems;
- The time for implementation and testing is much shorter than for custom development;
- The cost may vary depending on the number of applications, licenses and so on, but a
custom-built solution costs a lot, starting with the large number of business and IT
people who work on such projects.
As far as deciding on a particular vendor from the four presented architectures, I cannot
choose the best solution, because further research is needed to reach that result. It all
depends on the type of organization: its business domain, database size and variety, budget,
and existing IT systems. Most important of all, however, organizations need to be aware of
their bad data problem, what causes it and how it should be prevented, before choosing a
suitable solution, because problems often stem from bad business definitions that IT
solutions cannot solve.
As mentioned in the introduction, forecasts show growth of the MDM market. Contributing to
this growth are the sophisticated MDM applications that evolved from simple data quality
tools into complex collaborative management suites, as well as the ongoing investments in
their improvement. Three main aspects will dictate future MDM development (Radcliffe, 2012):
(1) multi-domain support, (2) cloud technology and (3) Big Data.
(1) MDM solutions try to cover as many business requests as they can. That is why in
recent years more vendors have focused on developing solutions that support various
domains in one application. In the past, customers with multi-domain businesses dealt
with multiple products that each supported a single domain, or with one solution that
had a limited number of multi-domain features. This problem challenged MDM vendors to
invest in new development supporting multi-domain functionality;
(2) Following the path of other IT solutions, MDM vendors are also trying to find a way to
support cloud computing and store their master repositories in the cloud. The major
concern is security, because the repository is still a central data store containing
business-critical data, and vendors still cannot trust the existing security models
enough to implement MDM in the cloud;
(3) The third and most interesting trend is MDM and its work with big data. Due to the
rapid popularization of and extreme interest in social networking, many companies have
found it a great marketing blackboard for presenting their products and services and
for finding new customers. The tendency for MDM architectures is to improve MDM to
work with unstructured data retrieved from social networks, to make predictions about
suitable customers or to connect them with current organizational sales. This would
create more potential customers and increase cross-sales. SAP HANA is such a solution
for fast processing of large amounts of data, but MDM is quite new in this area. If
this is developed and implemented, it will raise MDM to a completely new level, not
just as a solution for data improvement but also for data mining and predictive
analysis.
From what is offered on today's MDM market, what was presented in this thesis, and the
predictions of Gartner, a growing number of organizations are willing to turn their
long-term idea of implementing master data management into reality. In the process of
choosing the best solution, organizations should first look into their bad data problem
and perhaps try to improve their business model. If business model changes are overlooked
and an MDM solution is purchased without thorough consideration, organizations face the
risk of yet another data silo in their system that will cause more damage than
improvement to their master data.
LIST OF REFERENCES
1. Alon, T., Arkus, G., Duran, R., Haber, M., Liebke, R., Morreale, F. Jr., Roth, I.,
Sumano, A. & Zhu, J. (October, 2011). Metadata Management with IBM InfoSphere
Information Server.
2. An Informatica and Capgemini White Paper (2011). Building the Business Case for
Master Data Management (MDM). Strategies to quantify and articulate the business
value of MDM.
3. Arlbjørn, S. J. & Haug, A. (2011). Barriers to master data quality. Journal of
Enterprise Information Management, 24(3).
4. Ballard, C., Farrell, D. M., Lee, M., Stone, P. D., Thibault, S. & Tucker, S. (2010,
September). IBM InfoSphere Streams: Harnessing Data in Motion.
5. Berson, A. & Dubov, L. (2007). Master Data Management and Customer Data
Integration
6. Berson, A. & Dubov, L. (2010). Master Data Management and Data Governance
7. Bhatia, C., Jain, R., Perniu, L., Raveendramurthy, S., Samuel, R., Vibhute, S. &
Wilson, E. (2011, June). InfoSphere Data Architect.
8. Better Information through Master Data Management – MDM as a Foundation for
BI. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/042444.pdf
9. Bracht, J., Rehr, J., Siebert, M., & Thimm, R. (2012, July) Smarter Modeling of IBM
InfoSphere Master Data Management Solutions
10. Böttcher, O., Heilig, L., Karch, S., Hofmann, C. & Pfennig, R. (2007, March). SAP
NetWeaver Master Data Management
11. Bullerwell, M., Kashel, J., & Kent, T. (2011, July). Microsoft SQL Server 2008 R2
Master Data Services.
12. Butler, D.(2011). Boiling the Ocean. Retrieved December 27, 2012, from
https://blogs.oracle.com/mdm/entry/boiling_the_ocean
13. Crosman, P. (2010). Gartner Expects 14% Growth in Master Data Management
Software Revenue for 2010. Retrieved December 27, 2012, from
http://www.banktech.com/architecture-infrastructure/gartner-expects-14-
growth-in-master-data/228800031
14. Cross-Referencing for Master Data Management with Oracle Application Integration
Architecture Foundation Pack. Retrieved December 27, 2012, from
http://www.oracle.com/us/products/applications/056910.pdf
15. Dreibelbis, A., Hechler, E., Milman, I., Oberhofer, M., Run, P. & Wolfson, D. (2008).
Enterprise Master Data Management. An SOA Approach to Managing Core
Information
16. Ferguson, R.(2004). SAP Buys A2is Technology for Master Data Management.
Retrieved December 27, 2012, from http://www.eweek.com/c/a/Enterprise-
Applications/SAP-Buys-A2is-Technology-for-Master-Data-Management/
17. Graham, T. & Selhorn, S. (2011). Master Data Services: Implementation &
Administration.
18. Gryz, J., Hazlewood, S., Pawluk, P.,& Run, P. (2011). Trusted Data in IBM’s Master
Data Management
19. Haapasalo, H., Kropsu-Vehkapera, H., Jaaskelainen, O. & Silvola, R. (2011).
Managing one master data – challenges and preconditions. Industrial Management &
Data Systems, 111(1).
20. Hillard, R. (2010). Information-Driven Business: How to Manage Data and
Information for Maximum Advantage.
21. IBM InfoSphere FastTrack (2007). Retrieved December 27, 2012, from
http://publibfp.boulder.ibm.com/epubs/pdf/c1934780.pdf
22. IBM InfoSphere Information Analyzer Retrieved December 27, 2012, from
http://publibfp.boulder.ibm.com/epubs/pdf/c1934261.pdf
23. IBM Multiform Master Data Management: The evolution of MDM applications.
(June, 2007). Retrieved March 5, 2012, from
http://www.itworldcanada.com/WhitePaperLibrary/PdfDownloads/IBM-LI-
Evolution_of_MDM.pdf
24. IBM InfoSphere Master Data Management Server Retrieved December 27, 2012,
from
http://origin01.aws.connect.clarityelections.com/Assets/Connect/RootPublish/soe-
testclient6.connect.clarityelections.com/Maps/MDMUnderstandingAndPlanning.pdf
25. IBM InfoSphere QualityStage Retrieved December 27, 2012, from
http://publibfp.boulder.ibm.com/epubs/pdf/c1934790.pdf
26. InfoSphere Metadata Asset Manager Tutorial (2012). Retrieved December 27, 2012,
from http://www-01.ibm.com/support/docview.wss?uid=swg27024462
27. IBM Software White Paper (2011). How master data management serves the
business. Retrieved March 5, 2012, from http://www-
01.ibm.com/software/data/master-data-management/overview.html
28. Kahn, B., Strong, D. & Wang, R. (2002). Information Quality Benchmarks: Product
and Service Performance.
29. Kokemuller, J. & Weisbecker, A. Master Data Management: Product and Research.
30. Loshin, D. (2009). Master Data Management.
31. Loshin, D. (2008). MDM Paradigms and Architectures.
32. Magic Quadrant for Business Intelligence Platforms(2012). Retrieved December 27,
2012, from http://businessintelligence.info/docs/estudios/Magic-Quadrant-for-
Business-Intelligence-Platforms-2012.pdf
33. Magic Quadrant for Data Quality Tools (2012). Retrieved December 27, 2012, from
http://www.gartner.com/technology/reprints.do?id=1-1BO662V&ct=120809&st=sb
34. Magic Quadrant for Data Warehouse Database Management System (2012).
Retrieved December 27, 2012, from
http://www.gartner.com/technology/reprints.do?id=1-196T8S5&ct=120207&st=sb
35. Master Data Management (2011, September). Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/018876.pdf
36. Mauri, D. & Sarka, D. (2011, June). Data Quality and Master Data Management with
Microsoft SQL Server 2008 R2.
37. McKnight, W. (2006). Justifying and Implementing Master Data Management for the
Enterprise. Retrieved March 5, 2012, from http://web.ebscohost.com.nukweb.nuk.uni-
lj.si/ehost/results?sid=67111ba7-9fd6-49a5-ab58-
4b092e1a5797%40sessionmgr13&vid=5&hid=108&bquery=Justifying+AND+Imple
menting+Master+Data+Management+for+the&bdata=JmRiPWE5aCZkYj1idWgmZ
GI9bmxlYmsmZGI9cG9oJmRiPXNpaCZkYj11ZmgmZGI9bXRoJmRiPWY1aCZkY
j1yaWgmZGI9bmZoJmRiPWM4aCZkYj1id2gmZGI9aGNoJmRiPWNtZWRtJmRiP
WVyaWMmZGI9aHhoJmRiPWx4aCZkYj04Z2gmbGFuZz1zbCZ0eXBlPTAmc2l0Z
T1laG9zdC1saXZl
38. MDM Aware Applications. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/021486.pdf
39. Messerschmidt, M. & Stuben, J.(2011) Hidden Treasure.
40. Michnik, J. & Lo, M. (2007). The assessment of the information quality with the aid
of multiple criteria analysis.
41. Oracle Information Framework - The power of the combined ODI and MDM suites.
Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/042446.pdf
42. Oracle Master Data Management Strategy. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/042448.pdf
43. Oracle Master Data Management. Retrieved March 5, 2012, from
http://www.oracle.com/us/products/applications/master-data-management/master-
data-management-ds-075053.pdf
44. Oracle Trading Community Architecture. Data Quality Management. Retrieved
December 27, 2012, from
http://docs.oracle.com/cd/A99488_01/acrobat/115hzdqm.pdf
45. Otto, B. (2011). How to design the master data architecture: Findings from a case
study at Bosch. International Journal of Information Management. Retrieved March
5, 2012, from http://www.oracle.com/us/products/applications/master-data-
management/018876.pdf
46. Overview of Our Company [Studio Moderna –portal]. Retrieved February 7, 2013,
from http://www.studio-moderna.com
47. Press release notes from IBM, Retrieved February 7, 2013, from http://www-
03.ibm.com/press/us/en/index.wss
48. Rao, U. (2011). SAP NetWeaver MDM 7.1 Administrator’s Guide
49. Radcliffe, J. (2012). Three trends that will shape the master data management market.
Retrieved December 27, 2012, from
http://www.computerweekly.com/opinion/Three-trends-that-will-shape-the-
master-data-management-market
50. Rivard, F., Harb, G. & Meret, P.(2009). Transverse Information Systems : New
Solutions for IS and Business Performance
51. SAP NetWeaver Master Data Management (MDM). MDM Data Manager.(2011,
October). Retrieved March 5, 2012, from
http://help.sap.com/saphelp_mdm71/helpdata/en/4b/72b8aaa42301bae100000
00a42189b/MDMDataManager71.pdf
52. SAP NetWeaver Master Data Management (MDM). MDM Console. (2011, October).
Retrieved March 5, 2012, from
http://help.sap.com/saphelp_mdm71/helpdata/en/4b/71608566ae3260e100000
00a42189b/MDMConsole71.pdf
53. SAP NetWeaver Master Data Management (MDM). MDM Import Manager. (2011,
October). Retrieved December 27, 2012, from
http://help.sap.com/saphelp_nwmdm71/helpdata/en/4b/72b8e7a42301bae1000
0000a42189b/MDMImportManager71.pdf
55. Sarngadharan, M., Minimol, C. (2010). Management Information System.
56. Schneider-Neureither, A. (2004, May). SAP System Landscape Optimization.
57. Smith, M. (2006). Master data management trends with Mark Smith, CEO of Ventana
Research. [Podcast] Retrieved December 27, 2012, from
http://searchdatamanagement.techtarget.com/podcast/Master-data-
management-trends-with-Mark-Smith-CEO-of-Ventana-
Research?vgnextfmt=aiog&cc=8c98ce6166128210VgnVCM1000000d01c80aRCRD
58. Smith, H. A., & McKeen, J. D. (2008). Developments in practice XXX: Master data
management: Salvation or snake oil? Communications of the AIS
59. Stratature White Paper. Master Data Management – The Build vs. Buy Decision
60. Strong, D. M. & Wang, R. Y. (1996). Beyond accuracy: What data quality
means to data consumers. Journal of Management Information Systems, 12(4).
61. Thackeray, N. (2009, August). Administrator’s Perspective of SAP NetWeaver MDM
– Part 1 & 2. Retrieved March 5, 2012, from
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/1041a80a-f462-
2c10-3ab3-9acb03bdb816?QuickLink=index&overridelayout=true&44714905772576
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/70c286a6-0375-
2c10-bfbb-e6e83d72d804?QuickLink=index&overridelayout=true&44925358039672
62. Understand IBM InfoSphere MDM Server security, Part 1: Overview of Master Data
Management Server security. Retrieved December 27, 2012, from
http://www.ibm.com/developerworks/data/library/techarticle/dm-
0809mccallum/
63. Venkatagiri, S. SQL Server Master Data Services – A Point of View.
64. Wand, Y. & Wang, R. Y. (1996, November). Anchoring Data Quality Dimensions in
Ontological Foundations. Communications of the ACM, 39(11).
65. Wang, R., Pierce, M. & Madnick, S. (2005) Information Quality.
66. White, C. (2007). Using Master Data in Business Intelligence.
67. Yang, S. (2005, June). Master Data Management.
68. Zornes, A. (2009). Enterprise Master Data Management; Market Review & Forecast
for 2008 - 12