DESIGN AND IMPLEMENTATION OF DATA ANALYSIS COMPONENTS


DESIGN AND IMPLEMENTATION

OF

DATA ANALYSIS COMPONENTS

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Grace C. Shiao

May, 2006


DESIGN AND IMPLEMENTATION

OF

DATA ANALYSIS COMPONENTS

Grace C. Shiao

Thesis

Approved:
Advisor: Dr. Chien-Chung Chan
Committee Member: Dr. Xuan-Hien Dang
Committee Member: Dr. Zhong-Hui Duan
Department Chair: Dr. Wolfgang Pelz

Accepted:
Dean of the College: Dr. Ronald F. Levant
Dean of the Graduate School: Dr. George R. Newkome
Date:


ABSTRACT

This thesis describes the design and implementation of data analysis components. Many features of modern database systems facilitate the decision-making process. Recently, Online Analytical Processing (OLAP) and data mining have been used in an increasingly wide range of applications. OLAP allows users to analyze data from a wide variety of viewpoints. Data mining is the process of selecting, exploring, and modeling large amounts of data to discover previously unknown patterns for business advantage. Microsoft® SQL Server™ 2000 Analysis Services provides a rich set of tools to create and maintain OLAP and data mining objects. To use these tools, however, users must fully understand the underlying architecture and specialized technical terms that are unrelated to the data analysis itself. This complexity prevents data analysts from using the tools effectively. In this work, we developed several components that can serve as the foundation of analytical applications. Using these components, software applications can hide the technical complexities while providing tools to build OLAP cubes and mining models and to access information from them. Developers can also reuse the components without coding from scratch. This reusability enhances application reliability and reduces development cost and time.


DEDICATION

Dedicated to my late parents

Mr. and Mrs. K. C. Chang

Who taught me the value of Education

And

Opened my eyes to the Power of Knowledge


ACKNOWLEDGEMENTS

First of all, I want to thank my adviser, Dr. Chien-Chung Chan, for his guidance and support throughout my graduate research. His feedback helped to strengthen my research skills and contributed greatly to this thesis. I want to thank my thesis committee members, Dr. Xuan-Hien Dang and Dr. Zhong-Hui Duan, for their guidance and encouragement. In addition, I want to thank the faculty members of the Department of Computer Science for building the foundation of my computer knowledge.

I also want to thank my late parents, and wish they had been able to see this finished manuscript. I appreciate both of them for their love, support, and encouragement in my life. I thank my husband, S. Y., for his love and support through these years, and my daughter Ming-Hao and my son Ming-Jay for their love, humor, and understanding. Lastly, I thank the Mighty God for all His grace and blessing in my life.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

I. INTRODUCTION
1.1 What is Online Analytical Processing (OLAP)?
1.2 Data Mining
1.3 Statement of the Problem
1.4 Motivations and Contributions
1.5 Organization of the Thesis

II. MICROSOFT SQL SERVER 2000 ANALYSIS SERVICES
2.1 Overview
2.2 Architecture
2.2.1 Server Architecture
2.2.2 Client Architecture
2.3 OLAP Cube
2.4 Analysis Manager
2.4.1 Creating the Basic Cube Model
2.4.2 Browsing a Cube
2.4.3 Building the Data Mining Models

III. DESIGN OF DATA ANALYSIS COMPONENTS
3.1 Component-Based Development
3.2 What Is a Component?
3.3 The cubeBuilder Component
3.4 The cubeBrowser Component
3.4.1 Browsing OLAP Objects
3.4.1.1 Retrieving Information of Cube Schema
3.4.1.2 Analytical Querying of Cube Data
3.5 The DMBuilder Component
3.6 Conclusions

IV. CASE STUDIES AND RESULTS
4.1 A Case Study of the Heart Disease Datasets
4.1.1 Heart Disease Sample File
4.1.2 Software Implementation
4.2 Implementation of the cubeBuilder Component
4.2.1 Creating a New Cube
4.2.2 The Fact Table and Measures Selections
4.2.3 Adding Dimensions to the Cube
4.2.4 Processing and Building the New Cube
4.2.5 The Results
4.3 Implementation of the cubeBrowser Component
4.3.1 Connection to the Analysis Server
4.3.2 Retrieving the Cardio Cube Data
4.3.3 Displaying the Cardio Cube Data
4.3.4 Drill-down and Drill-up Capacities
4.4 Implementation of the DMBuilder Component

V. DISCUSSIONS AND FUTURE WORKS
5.1 Contributions and Evaluations
5.2 Future Works

BIBLIOGRAPHY

APPENDICES
APPENDIX A. DATASET USED FOR CASE STUDIES
APPENDIX B. APPLICATION INTERFACE OF OLAP CUBE BUILDER
APPENDIX C. SOURCE CODE OF CUBEBUILDER
APPENDIX D. SOURCE CODE OF CUBEBROWSER
APPENDIX E. SOURCE CODE OF DMBUILDER

LIST OF TABLES

2.1 Storage options supported by Analysis Services
2.2 Summary of cube process options
3.1 Values of the connection string
3.2 Listings of properties required for OLAP mining model objects

LIST OF FIGURES

2.1 Analysis Services architecture
2.2 The star and snowflake schemas
2.3 Screenshot of the Analysis Manager
2.4 Screenshot of the database dialog box of Cube Wizard
2.5 Screenshot of the Provider for the Data Link dialog box
2.6 Screenshot of the Connection tab of the Data Link dialog box
2.7 Screenshot of the "Select a fact table" dialog box with a selected fact table
2.8 Screenshot of the "Defining measures" dialog box
2.9 Screenshot of the Dimension Wizard
2.10 Screenshot of the "Select Dimension Table" dialog box
2.11 Screenshot of the "Select levels" dialog box
2.12 Screenshot of the "Dimension Finish" dialog box
2.13 Screenshot of the "Storage Design Wizard" for selecting storage options
2.14 Screenshot of the "Set aggregation options" dialog box
2.15 Screenshot of the "Process" window
2.16 Screenshot of the "Process a cube" dialog box
2.17 Screenshot of the "Cube Browser" and sample results
2.18 Screenshot of the "Select source type" dialog box
2.19 Screenshot of the "Select source cube" window
2.20 Screenshot of selecting the mining model technique
2.21 Screenshot of the "Select case" dialog box for specifying a case of analysis
2.22 Screenshot of the "Select predicted entity" window
2.23 Screenshot of the "Select training data" window
2.24 Screenshot of "Saving the data model" in the Mining Model Wizard
2.25 Screenshot of the "Model execution diagnostics" window
2.26 Screenshot of the content details of a created mining model
3.1 Architecture of the cubeBuilder component
3.2 Relationship of cubeBrowser to the Analysis Server
3.3 The basic workflow of browsing OLAP cube data using cubeBrowser
3.4 The architecture and logic relations of DMBuilder with DSO
3.5 Flow logic of the DMBuilder component
4.1 Relationship of the heart disease test data
4.2 Screenshot of the cardio cube builder interface
4.3 Screenshot of the "Data Source/Cube" section
4.4 Screenshot of sample entries for both the "Data Source/Cube" and "Specify Fact/Measures" sections
4.5 Screenshot of sample entries of the "Specify Fact/Measure" section
4.6 Screenshot of the "Add Dimensions to Cube" section
4.7 Screenshot of sample entries for cube dimension
4.8 Screenshot of the "Process/Build Cube" section
4.9 Screenshot of the cardio test database object before building the new cardio cube
4.10 Screenshot of the cardio test database object after building the sample "cube1"
4.11 Screenshot of the web form BrowseCube.aspx
4.12 Screenshot of the listing of available cubes
4.13 Screenshot of specifying cube entry and measures
4.14 Screenshot of selections of measures and the pre-defined view options
4.15 Screenshot of selections of location for the Pain-Type option
4.16 Screenshot of selections of pain-type for the Patient option
4.17 Results of cube data for the Pain-Type option with test country
4.18 Results of cube data for the angina chest pains per patient test city
4.19 Screenshot of drill-down to the test center level of the Patient option
4.20 Screenshot of drill-up to the country level of the Patient option
4.21 Screenshot of the main interface of DMBuilder
4.22 Screenshot of the "Server/Database" section
4.23 Screenshot of mining model setup
4.24 Screenshot of setting the mining model role
4.25 Screenshot of setting properties and algorithm for the mining model
4.26 Screenshot of setting the attributes of an analytical column
4.27 Screenshot of the cardio mining model using the Microsoft Decision Trees algorithm
B.1 Screenshot of the OLAP cube builder interface for the power users


CHAPTER I

INTRODUCTION

Data are not only valuable assets but also strategic resources in today's competitive environment. Organizations around the world are accumulating vast and growing amounts of data in different database formats. Companies need to understand the effectiveness of their marketing efforts and to manage the large volumes of data created each day. These challenges require a well-defined database system that can bring together disparate data with different dimensionality and granularity. Making the data meaningful is no small task, especially given the different aspects of data analysis. Companies need quality analysis of operational information to understand their business strengths and weaknesses. Business analysis focuses on the effective use of data and information to drive positive business actions. With good and accurate data analysis, business decision makers can make well-informed decisions for the future of their organizations. Business Intelligence (BI) tools allow companies to automate their analysis, strategy, and forecasting functions to make better business decisions. Online Analytical Processing (OLAP) and data mining models are key features of BI tools that help companies extract data from operational systems, summarize data into working totals, find hidden patterns in data for future analysis and prediction, and present these results intuitively to end users [1, 2].


1.1 What is Online Analytical Processing (OLAP)?

The standard definition of OLAP provided by the OLAP Council [2] is:

“A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user”.

According to the OLAP Council's definition, OLAP functionality lets users complete the following tasks [2]:

• Calculations and modeling applied across dimensions, through hierarchies and/or across members
• Trend analysis over sequential time periods
• Slicing subsets for on-screen viewing
• Drill-down to deeper levels of consolidation
• Reach-through to underlying detail data
• Rotation to new dimensional comparisons in the viewing area.

OLAP thus performs multidimensional analysis of enterprise data and provides capabilities for complex calculations, trend analysis, and sophisticated data modeling. In addition, OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they need for better decision making.

An OLAP structure created from operational data is called an OLAP cube [1, 2]. OLAP cubes are data processing units consisting of the fact and the dimensions from the database. They provide multidimensional views and analytical querying capabilities. OLAP technology can therefore provide fast answers to complex queries on operational data for decision-making management.
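The cube operations listed above (slicing, drill-down/roll-up) can be illustrated with a toy example. The following Python sketch is illustrative only and is not part of the thesis code; the dimension names and patient counts are invented:

```python
# Illustrative sketch (hypothetical data): a tiny "cube" stored as fact
# records keyed by dimension members, with slice and roll-up helpers.
from collections import defaultdict

facts = [
    {"country": "USA", "year": 2004, "pain_type": "angina", "patients": 12},
    {"country": "USA", "year": 2005, "pain_type": "angina", "patients": 15},
    {"country": "USA", "year": 2005, "pain_type": "non-anginal", "patients": 7},
    {"country": "Hungary", "year": 2005, "pain_type": "angina", "patients": 9},
]

def slice_cube(rows, **members):
    """Slicing: keep only rows whose dimensions match the fixed members."""
    return [r for r in rows if all(r[d] == v for d, v in members.items())]

def roll_up(rows, by, measure="patients"):
    """Roll-up: aggregate the measure over all dimensions except `by`."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[by]] += r[measure]
    return dict(totals)

usa_2005 = slice_cube(facts, country="USA", year=2005)  # a slice of the cube
by_country = roll_up(facts, by="country")               # consolidated totals
```

Drill-down is the inverse movement: starting from `by_country`, re-aggregating by a lower level (here, individual fact rows) recovers the detail data.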


1.2 Data Mining

Data mining is defined as the automated extraction of hidden predictive information from database systems [3, 4]. Generally, it is the process of analyzing data from different perspectives and discovering patterns and regularities in sets of data. Specifically, the hidden patterns and correlations discovered in the data can provide strategic business advantages for decision making in organizations.

1.3 Statement of the Problem

Microsoft® Analysis Services, shipped with SQL Server™ 2000, is an OLAP database engine able to build multidimensional cubes [1, 5]. It also provides application programs to browse cube data and tools that support data mining algorithms for discovering trends in data and predicting future results. Analysis Services is heavily wizard-oriented in building and managing data cubes and data mining models. Although many features are also available through predefined editors, the wizard-intensive process still requires users to fully understand the cube structure and its associated objects during the definition process. The complexity of cube development makes it difficult for end users with little technical experience to gain access to these analysis tools.

1.4 Motivations and Contributions

In reality, most decision makers within an enterprise want to use the insights gained from their data for tactical decision making. However, they are generally not interested in spending time building cubes or mining models to answer their business questions. Analysis Services provides extensive wizards and editors for developing OLAP cubes and mining models. It has been designed to be flexible for all levels of users, but users have difficulty learning to use these features effectively and creating useful models for decision making. The best solution is to design a specific front-end interface that meets the user's requirements, offers the ability to cross-analyze data with a single click, and masks the underlying complexities of the application from the users.

Analysis applications contain sensitive and confidential information that should be protected against unauthorized access and made available only to appropriate decision makers. Analysis Services automatically creates an OLAP Administrators group in the operating system. A member of the OLAP Administrators group has complete access to the analysis objects. A user who is not a member of the OLAP Administrators group has read or write access to the extent permitted by dimension-level or cell-level security but can perform no administrative tasks. However, the active user must be a member of the OLAP Administrators group to use Analysis Manager; therefore, a non-administrator user cannot explore cube information through Analysis Manager. One goal of this thesis is to construct a client application interface, using Multidimensional Expressions (MDX) and ActiveX® Data Objects (Multidimensional) (ADO MD), to query OLAP data and resolve this conflict [1, 6].
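To give a feel for the kind of statement such a client interface would send, the following sketch composes a basic MDX SELECT. This is not the thesis code, and the cube, measure, and member names are hypothetical; only the SELECT ... ON COLUMNS ... ON ROWS ... FROM ... WHERE shape follows standard MDX:

```python
# Hypothetical sketch: building an MDX statement of the kind a client
# application might send through ADO MD / PivotTable Service.
def build_mdx(measures, rows_set, cube, slicer=None):
    """Return a basic MDX SELECT with measures on columns and a row set."""
    columns = "{ " + ", ".join(f"[Measures].[{m}]" for m in measures) + " }"
    mdx = f"SELECT {columns} ON COLUMNS, {rows_set} ON ROWS FROM [{cube}]"
    if slicer:
        mdx += f" WHERE ({slicer})"   # optional slicer axis
    return mdx

# Invented names for illustration only:
query = build_mdx(
    measures=["Patient Count"],
    rows_set="{ [Location].[Country].Members }",
    cube="Cardio",
    slicer="[Pain Type].[angina]",
)
```

In a real client, this string would be executed against the Analysis Server via an ADO MD connection rather than merely constructed.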

The main contributions of this thesis are as follows:

• Development of a component, cubeBuilder, that lets software developers design application interfaces that build OLAP cube models to meet users' analytical requirements

• Development of a component, DMBuilder, that lets developers design specific user interfaces for creating data mining models with which users can uncover previously unknown patterns

• Development of a component, cubeBrowser, that lets developers design client interfaces for browsing cube data for users outside the OLAP Administrators group.

These data analysis components not only help software developers design specific applications without coding from scratch, but also hide the complexities of development from less technically oriented users.

1.5 Organization of the Thesis

This thesis covers the development of the data analysis components cubeBuilder, cubeBrowser, and DMBuilder for OLAP and mining model solutions. It is organized as follows:

Chapter II provides an overview of Microsoft SQL Server Analysis Services, including its fundamental operations and architecture for OLAP and data mining. The step-by-step processes used to create an OLAP cube, browse existing cube data, and create a data mining model with Analysis Manager are also illustrated and described.

Chapter III focuses on the design and structure of the analysis components for OLAP and mining model solutions.

Chapter IV describes the implementation of these analysis components in desktop and web-based application interfaces for OLAP cube and mining model systems. It also describes a case study with a heart disease dataset to demonstrate the application of the analysis components.

Chapter V presents a summary of the work done in this thesis. It compares the functionality of the analysis components and Analysis Manager in building OLAP cubes and mining models. The directions of future work and the conclusion of this thesis are also presented in Chapter V.


CHAPTER II

MICROSOFT SQL SERVER 2000 ANALYSIS SERVICES

2.1 Overview

Microsoft® SQL Server™ 2000 Analysis Services provides a fully functional OLAP environment, including both OLAP and data mining functionality [5]. It is a suite of decision support engines and tools. It can also function as an intermediate layer that converts relational warehouse data into a form, called a cube, that makes creating analytical reports fast and flexible.

2.2 Architecture

The architecture of Analysis Services can be divided into two portions: the server and the client, as shown in Figure 2.1. The server portion, including the engines, provides the functionality and power, while the client portion provides interfaces for front-end applications [5].

2.2.1 Server Architecture

The primary component of Analysis Services is the Analysis Server. The Analysis Server runs as a Microsoft Windows NT or Windows 2000 service and is specifically designed to create and maintain multidimensional data structures [5, 6]. It also provides multidimensional data values to client queries and manages connections to the specified data sources and local access security. Figure 2.1 illustrates the Analysis Manager, a snap-in console in Analysis Services, which communicates with the server through the Decision Support Objects (DSO) component. DSO is a set of programming interfaces for applications to work with Analysis Services [7].

Figure 2.1 Analysis Services architecture

2.2.2 Client Architecture

The client side of Analysis Services primarily provides an access interface, the PivotTable Service, between the server and custom applications, as shown in Figure 2.1 [6, 7]. PivotTable Service communicates with the Analysis Server and provides interfaces for client applications to access OLAP and data mining data on the server. It provides the OLE DB interface through which users, custom programs, and client tools access data managed by Analysis Services.

2.3 OLAP Cube

The primary form of data representation within Analysis Services is the OLAP cube [5-8]. A cube is a logical construct: a multidimensional representation of both detailed and summary data. Cubes are designed according to the client's analytical requirements. Each cube represents data values of different business entities, and each side of the cube presents a different aspect of the data.

Cubes in Analysis Services are built using one of two types of database schema: the star schema and the snowflake schema [9]. Both schemas consist of a fact table and dimension tables, from which Analysis Services aggregates data to build cubes. As shown in Figure 2.2, the star schema consists of a fact table and several dimension tables. Each dimension table corresponds to a column in the fact table, and the data in the dimension tables are used to form analytical queries against the fact table. In the snowflake schema, by contrast, several dimension tables are joined before being linked to the fact table.


Figure 2.2 The star and snowflake schemas
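The difference between the two schemas can be sketched in code. In this hypothetical example (not from the thesis; table contents are invented), a star schema resolves a fact row's dimension attributes with a single lookup, while a snowflake schema chains lookups through normalized dimension tables:

```python
# Illustrative sketch (hypothetical data): resolving a fact row's location
# attributes under a star schema versus a snowflake schema.

# Star schema: the fact row joins directly to one denormalized dimension row.
dim_location_star = {1: {"city": "Cleveland", "country": "USA"}}

# Snowflake schema: the city dimension references a separate country table,
# so the dimension tables are joined before reaching the fact table.
dim_city = {1: {"city": "Cleveland", "country_id": 10}}
dim_country = {10: {"country": "USA"}}

fact = {"location_id": 1, "patients": 12}

def resolve_star(fact_row):
    """One join: fact -> denormalized dimension table."""
    return dim_location_star[fact_row["location_id"]]

def resolve_snowflake(fact_row):
    """Chained joins: fact -> city dimension -> country dimension."""
    city = dim_city[fact_row["location_id"]]
    country = dim_country[city["country_id"]]
    return {"city": city["city"], "country": country["country"]}
```

Both paths yield the same attributes; the snowflake form trades extra joins for less redundancy in the dimension tables.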

2.4 Analysis Manager

The Analysis Manager is the administration tool for the Analysis Server in Microsoft SQL Server 2000 Analysis Services [5-9]. It is a snap-in application within the Microsoft Management Console (MMC), the common framework for hosting administrative tools. Figure 2.3 shows the hierarchical, tree-view representation of the server and all its components in the left pane of the console.

Figure 2.3 Screenshot of the Analysis Manager


The major functional features of the Analysis Manager are summarized as follows:

• Administering Analysis server

• Creating database and specifying data sources

• Creating and processing cubes

• Creating dimensions for the specified database

• Specifying storage options and optimizing performance

• Authorizing and managing cube security

• Browsing cube data, shared dimensions and other objects

• Creating data mining model from relational and multidimensional data

• Viewing the Mining Model.

2.4.1 Creating the Basic Cube Model

Analysis Services provides wizards and editors within the Analysis Manager to let the user create cubes easily [6, 8]. The step-by-step instructions for building a basic cube model in the Analysis Manager using the Cube Wizard are summarized as follows:

1. Creating an Analysis Server database

A database acts like a folder that holds cubes, data sources, shared dimensions, mining models, and database roles, as illustrated in Figure 2.3. To create a new database on a server, launch the Analysis Manager, right-click the server name, and select New Database from the pop-up menu [1, 2]. The Database dialog box appears for the user to enter a name for the new cube model's database, as shown in Figure 2.4.


Figure 2.4 Screenshot of the database dialog box of Cube Wizard

2. Specifying the data source

After creating a new database, a data source must be specified for the cube. The data source contains the connection information for the data used in the cube [6, 7]. Adding a data source lets the Analysis Server establish connections to the source data. The Data Link dialog box, illustrated in Figure 2.5, can be opened by right-clicking the Data Sources folder and selecting New Data Source from the pop-up menu.

Figure 2.5 Screenshot of the Provider for the Data Link dialog box


In the Data Link dialog box shown in Figure 2.6, the user can specify a provider, the

server name, login information and a database name for the connection to the source database.

Figure 2.6 Screenshot of the Connection tab of the Data Link dialog box

3. Selecting the fact table and the measures

The Cube Wizard and the Cube Editor are the tools to be used in the Analysis

Manager to create the OLAP cube [8]. A fact table contains the measure fields, which

consist of the numeric values for the analysis, and the key fields that are used to join to

dimension tables. The fact table should not contain any descriptive information or any

labels in addition to the measures and the index fields. Each cube must be based on only

one fact table. As shown in Figure 2.7, the panel displays all the tables in the specified

data source. After the fact table is selected and the "Next" button is clicked, the Wizard

displays all of the available numeric fields in the selected table, as shown in Figure 2.8.


Figure 2.7 Screenshot of the “Select a fact table” dialog box with a selected fact table

After the measures are specified from the list and the "Next" button is clicked, the

Cube Wizard asks the user to select or create dimensions.

Figure 2.8 Screenshot of the “Defining measures” dialog box

4. Adding dimensions and levels to the cube

Dimensions are the categories by which the user analyzes and summarizes the data [6-8].

In other words, dimensions are organized hierarchies that describe the data in the

fact table. There are two types of dimensions that can be created for use in a cube. A

dimension created for use in an individual cube is called a private dimension. A shared

dimension is one that multiple cubes can use [8]. A cube must contain at least one

dimension, and the dimension must exist in the database object where the cube will be

created.


In the Analysis Manager, a new dimension can be created with either the Cube Editor

or the Cube Wizard. If the editors are used to build the cube, then a dimension has to be

created before it can be added to the cube. However, if the Cube Wizard is used to create

a cube, it launches the Dimension Wizard to handle the task as part of the process of

creating the cube [8]. The step-by-step process of creating a new shared dimension with

the Dimension Wizard is summarized as follows:

a. Selecting the type of dimension schema in the "Choose how you want to

create the dimension" screen, as shown in Figure 2.9.

Figure 2.9 Screenshot of the Dimension Wizard

b. Specifying the dimension table from the available table list in the "Select

the dimension table" screen, as shown in Figure 2.10.

c. Selecting the levels in the "Select the levels for your dimension" screen,

as shown in Figure 2.11.


Figure 2.10 Screenshot of the “Select Dimension table” dialog box

Figure 2.11 Screenshot of the “Select levels” dialog box

d. Specifying the new dimension name and previewing the dimension data in

the “Finish” dialog box of the Dimension Wizard, as illustrated in Figure

2.12.


Figure 2.12 Screenshot of the "Dimension Finish" dialog box

5. Setting the storage options and setting up the cube aggregations

The storage mode determines how the data is organized on the server [8, 9]. It

affects disk-storage requirements and data-retrieval performance. There are three

storage options supported by Analysis Services: multidimensional OLAP (MOLAP),

relational OLAP (ROLAP), and hybrid OLAP (HOLAP). The descriptions and storage

locations of each mode are summarized in Table 2.1. The Storage Design Wizard is used

to select the option for the cube in the Analysis Manager, as shown in Figure 2.13.


Table 2.1 Storage options supported by Analysis Services

ROLAP (Relational OLAP)
Description: 1. Slow processing; 2. Slow query response; 3. Huge storage requirements; 4. Suitable for large databases or legacy data.
Fact data stored in: relational database server. Aggregated values stored in: relational database server.

MOLAP (Multidimensional OLAP)
Description: 1. Requires data duplication; 2. Pre-summarizes the data to improve performance in querying and displaying the data; 3. High performance; 4. Good for small to medium size data sets.
Fact data stored in: cube. Aggregated values stored in: cube.

HOLAP (Hybrid OLAP)
Description: a combination of ROLAP and MOLAP. 1. Does not create a copy of the data; 2. Provides connectivity to a large number of relational databases; 3. Good when storage space is limited but faster query responses are needed.
Fact data stored in: relational database server. Aggregated values stored in: cube.

Figure 2.13 Screenshot of the "Storage Design Wizard" for selecting storage options

Page 32: DESIGN AND IMPLEMENTATION OF DATA ANALYSIS COMPONENTS

After deciding on the storage option, the next step is to specify the aggregation

options in the Set Aggregation Options dialog, as illustrated in Figure 2.14 [8, 9]. This

dialog allows the user to set the level of aggregation for the cube to boost the

performance of queries.

Aggregations are pre-calculated summaries of data that improve query response

time. The higher the cube's aggregation level, the faster queries will execute, but more

disk space will be needed and more time will be required to process the cube.

In Analysis Services, there are three aggregation options to choose from:

• Estimated storage reaches: specifies the maximum storage size in either megabytes (MB) or gigabytes (GB)

• Performance gain reaches: specifies the percentage of performance gain for queries

• Until I click stop: gives the user manual control of the balance.

Figure 2.14 Screenshot of the “Set aggregation options” dialog box


6. Processing the cube

Processing the cube is required before attempting to browse the cube data, especially

after designing its storage options and aggregations, because the aggregations need to be

calculated before the user can view the cube data [8, 9].

The major activities involved in the cube processing are described in a

“Process” window, as shown in Figure 2.15, and summarized as follows:

a. Reading the dimension tables to populate the levels from the actual data

b. Reading the fact table

c. Calculating specified aggregations

d. Storing the results in the cube.

Figure 2.15 Screenshot of the “Process” window


In the Analysis Manager, there are three options for processing a cube, depending on

how the cube or its data has changed. These options, summarized in Table 2.2, can be

selected in the "Process a Cube" dialog box, as shown in Figure 2.16 [9].

Table 2.2 Summary of cube process options

Full process: the structure of the cube has been modified

Incremental update: new data is being added to the cube

Refresh data: a cube's source data is being cleared out and replaced

Figure 2.16 Screenshot of the “Process a cube” dialog box


2.4.2 Browsing a Cube

In the Analysis Manager, the Cube Browser is one way to view cube data [5-9].

There are two ways to open the Cube Browser and load cube data into it:

a. Right-click the cube name in the Analysis Manager tree pane and select

"Browse Data" from the pop-up menu

b. Click "Browse Sample Data" in the last step of the Cube Wizard

The Cube Browser not only lets users view the multidimensional data in a flattened

two-dimensional grid format, as shown in Figure 2.17, but also makes it possible to drill

up or drill down different dimensions of the data. However, the Cube Browser cannot be

used to view unprocessed cube data [6].

Figure 2.17 Screenshot of the “Cube Browser” and sample results


2.4.3 Building the Data Mining Models

Data mining is the process of extracting knowledge hidden in large volumes of

data [10, 11]. It involves uncovering patterns, trends, and relationships in historical

data and predicting outcomes of future situations. The primary mechanism for data

mining is the data mining model, an abstract object that stores data mining information in

a series of schema rowsets. The mining model serves as the blueprint for how data

should be analyzed or processed. Once the model is processed, the information associated

with the mining model not only represents what was learned from the data, but also

allows users to discover business trends for future decision making [11]. Two data

mining algorithms are built into Microsoft SQL Server 2000 Analysis Services: Microsoft

Decision Trees and Microsoft Clustering [12, 13].

A. Decision Trees Algorithm:

The Microsoft Decision Trees algorithm uses recursive partitioning to divide the data

into a tree structure, and continually performs this search for predictive factors until there

is no more data to split [10-13]. A node in the tree structure represents each

predictive factor used to classify the data. This method focuses on providing information

paths for rules and patterns within the data, and is useful in predicting exact outcomes for

future problems [12, 13].
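The recursive-partitioning idea can be illustrated with a small sketch. This is a generic information-gain split, not Microsoft's implementation, and the toy rows, labels and field names ("outlook", "windy") are invented purely for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, attributes):
    """Pick the attribute whose partition of the rows yields the largest
    information gain; this is the core step that recursive partitioning
    repeats at every node of the tree."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for attr in attributes:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = attr, gain
    return best

# Toy data: "outlook" perfectly separates the classes, "windy" does not,
# so the splitter should choose "outlook" for the root node.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain",  "windy": "yes"},
    {"outlook": "rain",  "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]
print(best_split(rows, labels, ["outlook", "windy"]))  # -> outlook
```

A full tree learner would apply this split recursively to each resulting group until no attribute yields further gain, which mirrors the "until there is no more data to split" behavior described above.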

B. Microsoft Clustering Algorithm:

The Microsoft Clustering algorithm is based on the Expectation-Maximization (EM)

algorithm [11, 12]. It uses iterative refinement techniques to group records into

neighborhoods (clusters) that exhibit similar, predictable characteristics [13]. These

clusters are useful for uncovering relationships among data items in a large database with

hundreds of evaluated attributes.

The following steps describe the process of creating a mining model using the

mining model wizard in the Analysis Manager [13]:

1. Specifying the type of data:

In the "Select source type" window, as shown in Figure 2.18, users

can select either relational data or OLAP data to build the target mining

model.

Figure 2.18 Screenshot of the “Select source type” dialog box

2. Selecting the source cube:

In the “select source cube” window, as shown in Figure 2.19, users need to

highlight the target cube from the available cube lists [11, 13].


Figure 2.19 Screenshot of “Select source cube” window

3. Specifying the data mining method:

In the "Select data mining technique" window, as shown in Figure 2.20,

users can select one of the two mining algorithms provided with Analysis

Services: Microsoft Decision Trees and Microsoft Clustering [9, 10].

Figure 2.20 Screenshot of the selecting mining model technique


4. Identifying the case base or unit of analysis

In the “Select case” window, as shown in Figure 2.21, users need to

specify the case base of the analysis for the modeling task. A case is the basic

unit of analysis for the mining task.

Figure 2.21 Screenshot of the “Select case” dialog box for specifying a case of analysis

5. Selecting the predicted entity:

In this step users must provide information for prediction used in the

mining model [12], as shown in Figure 2.22. The predicted entity can be

chosen as one of the following items:

• A measure of the source table

• A member property of the case dimension and level

• Members of another dimension in the cube.

This feature provides flexibility in the process of predictive analysis using

OLAP data.


Figure 2.22 Screenshot of “Select predicted entity” window

6. Selecting the training data:

The training data is used to process the OLAP data mining model and to

define the column structure of the data mining case set. As shown in

Figure 2.23, users should select at least one additional data item for the

training data [12, 13].


Figure 2.23 Screenshot of the “Select training data” window

7. Naming and processing the model:

After the user enters a model name and selects the "Save and process now"

check box, as shown in Figure 2.24, the wizard will process the model and

train it with data based on the specified algorithm. Figure 2.25

displays the progress of the model execution [13]. When the process is complete, the

message "Processing completed successfully" appears at the bottom of the

dialog box.


Figure 2.24 Screenshot of the “Saving the data model” of the Mining Model Wizard

Figure 2.25 Screenshot of the “Model execution diagnostics” window


After clicking the "Close" button, the OLAP Mining Model Editor is launched

and the system displays the content details of the proposed mining model, as shown in Figure

2.26.

Figure 2.26 Screenshot of the content details of a created mining model


CHAPTER III

DESIGN OF DATA ANALYSIS COMPONENTS

Microsoft SQL Server 2000 provides the OLAP functionality

to build and manage multidimensional models of data and applications for use in large

enterprise systems [1, 2]. There are three programmatic interfaces in Analysis

Services for user applications: ActiveX Data Objects Multidimensional (ADO MD),

OLE DB for Online Analytical Processing (OLE DB for OLAP), and Decision Support

Objects (DSO) [10-14].

ADO MD is an extension to the ADO programming interface that can be used to

access multidimensional schema, to query cubes, and to retrieve the results [10]. It uses

an underlying OLE DB provider, which is Microsoft's strategic low-level application

program interface (API) for access to different data sources [11]. OLE DB for Online

Analytical Processing (OLE DB for OLAP) is a set of objects and interfaces that extend

the ability of OLE DB to provide access to multidimensional data stores [12]. DSO is

the administrative programming interface to create and alter cubes, dimensions, and

calculated members. It can also perform other functions that are available interactively

through the Analysis Manager application [13, 14].


Using these programmatic tools provides more control over OLAP and

data mining operations. Developers can hide the complexities of creating

cubes and mining models from less technical users. ADO MD, the data abstraction tool,

allows developers to create either a local or remote front-end interface for exploring

metadata, databases and analysis functions. In particular, it provides an analytical tool for

end users who do not have the OLAP administrator privileges needed to access cube data with the

Analysis Manager.

This chapter introduces the data analysis components developed for building and

viewing cubes and data mining models in the Microsoft SQL Server 2000 environment.

3.1 Component-Based Development

Component-based development (CBD) is a software development methodology that

allows developers to reuse existing components. Reuse and flexibility

are the main characteristics of CBD [15, 16]. Developers no longer need to construct

software applications from scratch; they only need to reuse existing pre-built components

to meet application requirements. This code reuse reduces production costs

and enhances the maintainability of the software system. Flexibility is another useful

trait, allowing components to be easily replaced, modified and maintained. Using

CBD, the process of software design is made more effective and flexible.

3.2 What Is a Component?

A software component involves three essential parts: a service interface, an

implementation, and deployment [17]. A service interface specifies the component. An


implementation implements the interface to make the component work. The deployment

is the executable file that makes the component run to meet the requirements. Kirby

McInnis [17] has given a single comprehensive definition of a component:

“A component is a language-neutral, independently implemented package of software services, delivered in an encapsulated and replaceable container, accessed via one or more published interfaces. A component is not platform contained or application bound.”

The reuse of existing components reduces development and maintenance costs.

It also increases productivity, since there is no need to build new applications from

scratch.

3.3 The cubeBuilder Component

The cubeBuilder component, developed in this work, is built on top of DSO to

allow developers to create OLAP cubes programmatically without using the

Analysis Manager [17, 18]. Figure 3.1 not only depicts the component's architecture

and its relation to the server, but also shows the workflow for creating a data cube with the

component.

The sequence of operations involved in building an OLAP cube is described as

follows:

1. Connecting to an Analysis server:

The first step in the process of building an OLAP cube is to connect to an

Analysis server. The clsServer object of the DSO object model is the main entry point for

accessing the Analysis server.

Figure 3.1 Architecture of the component cubeBuilder

The cubeBuilder component provides a method called ConnectToServer that uses the

DSO server object to connect to a computer where the Analysis server service is

running [18].

2. Creating a database object to contain dimensions and cubes:

After connecting to the Analysis server, the database object is the first object that needs

to be created in the process of building the OLAP cube. A database object is a

container for related cubes and other objects. It consists of data sources, shared

dimensions and database roles, and is also used to store cubes, data mining models and

other related objects. The cubeBuilder component can either create a new database

object or open an existing database on the server.

3. Adding a data source that contains the data.

After setting up the database object, a link to a data source has to

be added to the database before an OLAP cube can be constructed. The DSO data source object

specifies the source database to be used for the cube. The cubeBuilder

component is able to handle the following tasks through the DSO data source object:

• Setting up the connection to the data source

• Finding a specified data source

• Adding a new data source to the specified database object

• Setting the link to the specified data source

4. Creating dimensions and their levels

A dimension is a structural attribute of an OLAP cube: an organized hierarchy

of categories that describe the data in the fact table of the data warehouse system. These

categories provide users the basis for data analysis. The cubeBuilder component uses the

DSO dimension object to create a shared dimension in the user-specified database

object. The dimension object provides a specific implementation of the DSO Dimension

interface. Through this interface, the cubeBuilder component can perform the

following tasks:

• Creating a new dimension object in the database object

• Creating a new level on the dimension and setting the associated level's properties.


5. Creating a cube and specifying dimensions and measures

The following steps illustrate how to add a cube to the user-specified database

object by using the cubeBuilder component:

• Adding the user-specified cube name to the collections of the database object

• Specifying the data source of the cube

• Specifying the fact table of the cube

• Setting the SourceTable and EstimatedRows properties of the cube through the AddFactTblToCube method

• Specifying the measures from the fact table for analysis

• Adding the database's dimensions to the cube's collections with the AddSharedDimToCube method.

6. Processing the cube to load its structure and data

After the cube and its measures are defined in the database object, the cube can be

processed. The cube can be fully processed by using the ProcessCube method of the

cubeBuilder component to load the cube's structure and data.
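The six-step sequence above can be sketched as a small, runnable outline. The method names mirror those in the text (ConnectToServer, AddFactTblToCube, ProcessCube), but in place of live DSO COM calls each method simply records the operation, so the sequencing can be followed without an Analysis server; the server, database, table and dimension names are invented placeholders:

```python
# A minimal sketch of the cubeBuilder workflow; operations are logged
# instead of issued to DSO, purely to show the order of the six steps.
class CubeBuilderSketch:
    def __init__(self):
        self.log = []

    def connect_to_server(self, server_name):
        # Real component: use the DSO server object to reach the machine
        # where the Analysis server service is running (step 1).
        self.log.append(("connect", server_name))

    def create_database(self, db_name):
        # Real component: create or open the database object that will
        # hold data sources, shared dimensions and cubes (step 2).
        self.log.append(("database", db_name))

    def add_data_source(self, conn_string):
        # Real component: add a DSO data source object pointing at the
        # relational source database (step 3).
        self.log.append(("datasource", conn_string))

    def create_dimension(self, name, levels):
        # Real component: create a shared dimension and its levels (step 4).
        self.log.append(("dimension", name, tuple(levels)))

    def create_cube(self, name, fact_table, measures, dimensions):
        # Real component: add the cube, set its fact table (SourceTable,
        # EstimatedRows), measures and shared dimensions (step 5).
        self.log.append(("cube", name, fact_table,
                         tuple(measures), tuple(dimensions)))

    def process_cube(self, name):
        # Real component: ProcessCube loads the cube's structure and data
        # (step 6). Here we just return the recorded operation order.
        self.log.append(("process", name))
        return [op for op, *_ in self.log]

builder = CubeBuilderSketch()
builder.connect_to_server("localhost")
builder.create_database("SalesDB")
builder.add_data_source("Provider=SQLOLEDB;Data Source=localhost")
builder.create_dimension("Time", ["Year", "Quarter", "Month"])
builder.create_cube("Sales", "sales_fact", ["sales_amount"], ["Time"])
order = builder.process_cube("Sales")
# order == ["connect", "database", "datasource", "dimension", "cube", "process"]
```

The point of the sketch is the strict ordering: a data source cannot be added before its database exists, dimensions must exist before the cube references them, and processing only happens once the cube is fully defined.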

3.4 The cubeBrowser Component

The component cubeBrowser can be used in software applications to access data

from the multidimensional data sources in Microsoft Analysis Services.

It is a layer on top of ADO MD that can be used to write OLAP applications that retrieve

data from an OLAP cube. Figure 3.2 shows how the component

cubeBrowser fits into the Analysis Services architecture. The PivotTable Service is the

client side of Microsoft Analysis Services and implements OLE DB for OLAP, which

is a standard interface for returning OLAP data. OLE DB for OLAP is a high-performance

COM interface that does not support OLE Automation. ADO MD is Microsoft's

extension to ADO for accessing and manipulating data cubes [16, 17].

Figure 3.2 Relationship of cubeBrowser to the Analysis Server

3.4.1 Browsing OLAP objects

The component cubeBrowser can be used in OLAP applications to allow end

users to browse OLAP cubes, to view the properties of the cubes and their

underlying structures, and to execute analytical queries for their business questions.

There are two options for accessing data from OLAP cubes. The first is the cube schema

information, which includes the concept of an OLAP database containing all the cubes and their

underlying structures; the other is the execution of analytical

queries and the display of the queried results for business analysts [15, 16].


The basic workflow of the usage of the component cubeBrowser in browsing the

cube objects is shown in Figure 3.3 and is summarized as follows:

A. Retrieving the Information of cube schema

a. Setting up connection string and connect to server.

b. Displaying the results.

B. Execution of an analytical query

a. Setting up the direct connection to the Analysis Server.

b. Displaying the hierarchical structures of an OLAP database.

c. Constructing the MDX queries and displaying the retrieval results.

d. Illustrating the definition of a particular OLAP cube and its underlying

dimensions.

3.4.1.1 Retrieving information of cube schema

The information of the cube schema includes the concept of an OLAP database

containing all the cubes and their underlying structures. In order to get information of the

cube schema, the first step is to set up a connection to the Analysis Services engine. The

connection string consists of values for the provider, data source, initial catalog, and other

user and system information.

Table 3.1 lists the primary values needed to construct a connection string. The

provider is the name of the OLE DB for OLAP provider, which is used to connect to the

OLAP engine. In Analysis Services, the value is MSOLAP2, the name of

the Microsoft OLE DB Provider for OLAP Services 8.0 [19]. The data source is the

hostname of the server. The initial catalog is the particular database object on the specified

server.

Figure 3.3 The basic workflow of browsing OLAP cube data using cubeBrowser

Table 3.1 Values of the connection string

Provider: the name of the OLE DB for OLAP provider used to connect to the OLAP engine. In Analysis Services, this value is MSOLAP2.

Data Source: the location of the server, expressed as a hostname.

Initial Catalog: the name of the OLAP database object to be connected to.

User ID: the username used to connect to the server.

Password: the password used to connect to the server.
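The parameters in Table 3.1 can be assembled into an OLE DB connection string; the sketch below does just that. The host, catalog and user values are placeholders, and the default provider keyword shown here uses the MSOLAP.2 form of the OLAP Services 8.0 provider name, which differs slightly from the spelling in the table, so treat it as an assumption to adjust per installation:

```python
def build_olap_connection_string(data_source, initial_catalog,
                                 user_id=None, password=None,
                                 provider="MSOLAP.2"):
    """Assemble an OLE DB for OLAP connection string from the parameters
    listed in Table 3.1. Credentials are only included when supplied."""
    parts = {
        "Provider": provider,
        "Data Source": data_source,
        "Initial Catalog": initial_catalog,
    }
    if user_id is not None:
        parts["User ID"] = user_id
    if password is not None:
        parts["Password"] = password
    # Keyword=value pairs separated by semicolons, in insertion order.
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn = build_olap_connection_string("analysis-host", "SalesDB", user_id="grace")
# conn == "Provider=MSOLAP.2;Data Source=analysis-host;Initial Catalog=SalesDB;User ID=grace"
```

A string of this shape is what the cubeBrowser component would hand to the ADO MD Catalog object in the connection step described next.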

After the connection string is constructed, the component cubeBrowser provides a

method to connect an ADO MD Catalog object to the server and the database object

specified in the connection string. The detailed hierarchical structure of a cube can be

viewed by using the ViewCubeStrct method of the component cubeBrowser after

a particular cube is specified. This method uses the CubeDef object of ADO MD to

display the definition of a particular OLAP cube and its underlying dimensions [15, 16].

In summary, by using the component cubeBrowser in conjunction with ADO MD

object in the OLAP application, the end user can retrieve the complete information about

the structure of any cube stored in the Analysis Services [20, 21].

3.4.1.2 Analytical querying of cube data

In addition to drilling down and displaying the object schema of a particular OLAP

database, the component cubeBrowser provides features to support querying an

Analysis Services cube with MDX. The result of an MDX query of a cube is

returned in a structure called a Cellset [16].


The query language for manipulating data through ADO MD is called

Multidimensional Expressions (MDX). The MDX syntax supports the definition and

manipulation of multidimensional objects and the data stored in the cubes of the Analysis

Server [22]. In addition to its query capabilities, MDX can be used to define cube

structures and, in some cases, to change the data. It can also be used in conjunction with

ADO MD to build client applications that access OLAP data for business analysts [16, 23].

The following steps are required to process an MDX query:

a. Creating a new Cellset object:

A Cellset object is used to store the results of a multidimensional MDX query in the

ADO MD object model. The Cellset object is created based on an MDX query for the

user's analysis.

b. Establishing the connection:

To make a connection to the Analysis Services engine, it is necessary to specify the

values of the provider, data source and initial catalog of the connection string of a Cellset

object.

c. Constructing an MDX query:

The general syntax for an MDX statement is shown as follows:

SELECT <member selection> ON axis1,
       <member selection> ON axis2, ...
FROM <cube name>
WHERE <slicer>

The three clauses shown above describe the nature and scope of an MDX query. The

axis clause specifies the data wanted and the format in which to display the results. The

FROM clause defines the specific cube that contains the required data. The WHERE

clause specifies the conditional selection for data slicing. The component


cubeBrowser provides a function to set up the MDX query based on the user’s

specification and the analytical questions.

d. Performing the query and populating the results:

After the required query is constructed, the component cubeBrowser provides a

method to open a specific Cellset object. Once the Cellset object is open, the resulting

data can be traversed along the positions of the Cellset and the value in each cell can be

displayed.
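The steps above can be sketched in code. The MDX-building part is pure string assembly and stands alone; the cube, measure and member names are invented for the example, and the Cellset-opening part, which assumes the standard ADO MD COM automation names (ADODB.Connection, ADOMD.Cellset) and a live Analysis server plus pywin32, is kept inside a function that is shown for completeness but not executed here:

```python
def build_mdx(columns, rows, cube, slicer=None):
    """Construct an MDX SELECT statement of the general form shown above
    (step c): an axis clause per axis, a FROM clause, an optional WHERE."""
    mdx = (f"SELECT {columns} ON COLUMNS, "
           f"{rows} ON ROWS "
           f"FROM [{cube}]")
    if slicer:
        mdx += f" WHERE ({slicer})"
    return mdx

def run_query(mdx, connection_string):
    """Open an ADO MD Cellset over the query (steps a, b and d).
    Requires pywin32 and a reachable Analysis server; not run here."""
    import win32com.client  # non-stdlib; imported only when actually called
    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open(connection_string)                 # step b: connect
    cellset = win32com.client.Dispatch("ADOMD.Cellset")
    cellset.Open(mdx, conn)                      # steps a and c: open the query
    return [[cellset(col, row).Value             # step d: walk cell positions
             for col in range(cellset.Axes(0).Positions.Count)]
            for row in range(cellset.Axes(1).Positions.Count)]

query = build_mdx("{ [Measures].[Sales Amount] }",
                  "{ [Time].[Year].Members }",
                  "Sales",
                  "[Store].[USA]")
# query == "SELECT { [Measures].[Sales Amount] } ON COLUMNS,
#           { [Time].[Year].Members } ON ROWS FROM [Sales]
#           WHERE ([Store].[USA])"   (one line)
```

Separating query construction from execution mirrors the component's design: the MDX string can be built and inspected from the user's specification before any connection to the server is opened.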

3.5 The DMBuilder Component

In addition to providing programmatic access to OLAP cube resources, the Decision

Support Objects (DSO) can also be used to create and maintain data mining objects

programmatically [10, 11]. The component DMBuilder, developed in this work, sits on

top of DSO to give software developers direct programmatic access

to the data mining functionality within the Analysis Services. The component

DMBuilder provides an object model for programming a varied set of objects, including

servers, databases, mining structures and algorithms, as well as OLAP

cube objects. It also allows developers to embed data mining functionality into

applications to meet users' mining requirements. The architecture and the logical relations

of the component DMBuilder to DSO are depicted in Figure 3.4.

Figure 3.4 The architecture and logic relations of DMBuilder with DSO

The following steps describe the basic operations involved in the process of creating

a data mining model programmatically using the developed DMBuilder component in

conjunction with DSO [17] (Figure 3.5):

1. Connecting to the target Analysis server:

The component DMBuilder can connect to the target Analysis server through the

ConnectToServer function with the user-specified server name.

2. Selecting a target database object which contains the OLAP cube data sources:

After connecting to the target server, the database object which contains the target

OLAP cube data sources can be selected from the target server by using the

component's SelectDbObj function.


Figure 3.5 Flow Logic of the DMBuilder Component

3. Creating a new data mining model:

A new data mining model object can be created by using the AddNewMiningModel

function of the DMBuilder component with the user-specified mining model name and class type.

When an OLAP mining model is created, the class type is set to sbclsOLAP.

4. Creating and assigning a mining model role:

Using the "AddMiningModel" function of the DMBuilder component, the user-specified

mining model role can be created and assigned to the new OLAP mining model object.

5. Setting the needed properties for the new mining model:

There are several properties that need to be set for OLAP mining model objects.


Table 3.2 summarizes the properties required for the OLAP mining model

object. These properties can be set by using the "SetModelProperty" function of the

component DMBuilder.

Table 3.2 Listings of properties required for OLAP mining model objects [17]

Property                  Description
Case dimension            Defines the case dimension.
Case level                Defines the case level of the case dimension; it identifies the lowest level in the dimension.
Mining model algorithm    Defines the data mining algorithm provider. In Analysis Services there are two types: Microsoft Decision Trees and Microsoft Clustering.
Source cube               Defines the OLAP cube used for training data.
Subclass type             Defines the column option type. The value for an OLAP mining model object is set to sbclsOLAP.

6. Creating a new mining model column and setting its properties:

A data mining model column has several property types that must be set for the new mining model; the most important are data type, content type and usage. In data mining with SQL Server 2000 Analysis Services, there are four types of column usage: input, predict, disabled and key. The DMBuilder component provides the "EnableColumnProperty" function to perform this task and to send the column metadata to the server.

7. Training and processing the mining model object:

Once all the necessary properties and definitions for the target mining model are set, the model must be trained before it can be used for analysis, so that useful information or patterns can be found in the data. This processing step is executed on the server, and the time it takes depends on the amount of data involved and on the complexity of the analytical category. Before training and processing, the model holds only the defined metadata; after processing, the discovered patterns are stored in the model. The ProcessMiningModel function of the DMBuilder component handles this task using the ProcessFull option [3, 11].
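The seven steps above can be summarized as a single call sequence. The sketch below is a minimal, hypothetical driver: the DMBuilder function names come from the description above, but their exact signatures, and all server, database, model and column names, are illustrative assumptions rather than the component's documented interface.

```vbnet
' Hypothetical driver for the DMBuilder steps; signatures and names
' are assumptions, not the component's documented interface.
Dim builder As New DMBuilder()

' Step 1: connect to the target Analysis server
builder.ConnectToServer("LocalAnalysisServer")

' Step 2: select the database object holding the OLAP cube sources
builder.SelectDbObj("CardioDB")

' Steps 3-4: create an OLAP mining model and assign its role
builder.AddNewMiningModel("CardioModel", "sbclsOLAP")
builder.AddMiningModel("CardioRole")

' Step 5: set the required properties (see Table 3.2)
builder.SetModelProperty("CaseDimension", "Patient")
builder.SetModelProperty("SourceCube", "CardioCube")
builder.SetModelProperty("MiningAlgorithm", "Microsoft_Decision_Trees")

' Step 6: define column usage and send the metadata to the server
builder.EnableColumnProperty("cp", "input")      ' chest pain type
builder.EnableColumnProperty("num", "predict")   ' diagnosis outcome

' Step 7: train the model on the server with the ProcessFull option
builder.ProcessMiningModel()
```

This call order mirrors the flow logic of Figure 3.5: connection and database selection must precede model creation, and processing is always last.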

3.6 Conclusions

The analysis components cubeBuilder, cubeBrowser and DMBuilder provide a set of functions for creating and managing OLAP solutions and data mining models on the Analysis server. Fully compatible with the .NET environment, these components let developers easily embed code into user-specific applications to build and process the target OLAP solutions and mining model systems. Using these data analysis components, the SQL Server 2000 business intelligence features can be integrated directly into user-friendly applications, and OLAP solutions can be created and managed programmatically to meet users' needs and specifications for their daily business analysis and decision-making.


CHAPTER IV

CASE STUDIES AND RESULTS

The data analysis components developed in this thesis are applied to a case study of a heart disease database in the Microsoft SQL Server 2000 environment. The purpose of this case study is to provide user application interfaces, wrapped around the analysis components, for building the OLAP cube, browsing the cube data and creating the mining model with the cardio test dataset. The results and implementation of the case study illustrate the advantages of using these data analysis components for OLAP solutions and mining models. Each of the following sections describes the practical aspects of the developed analysis components.

4.1 A Case Study of the Heart Disease Datasets

The heart disease datasets were collected at four different locations and are the results of heart disease diagnosis tests [24]. Each database has the same instance format, using only thirteen of a possible seventy-five attributes for analysis. Appendix A provides detailed descriptions of the heart disease datasets.


4.1.1 Heart Disease Sample File

The heart disease datasets were downloaded and saved as a Microsoft Access 2003 database. These database samples consist of four tables, whose relationships are depicted in Figure 4.1. The constructed schema resembles the structure of a star schema.

Figure 4.1 Relationship of the heart disease test data

4.1.2 Software Implementation

These data analysis components are implemented in the Microsoft SQL Server 2000 environment using Visual Basic .NET (VB.NET) as the major programming language for both the OLAP solutions and the mining model objects. The Windows front-end applications implemented with the cubeBuilder and DMBuilder components are stand-alone desktop applications and require the Analysis Services to reside on


the same system. The advantage of this approach is that runtime access security is not an issue when connecting to the Analysis server. In addition, an ASP.NET web-based application implemented with the cubeBrowser component is also developed, using VB.NET as the major source language, for end-users to browse the OLAP cube data.

4.2 Implementation of the cubeBuilder Component

The cardio cube builder interface, cardioCube, shown in Figure 4.2, implements the cubeBuilder component to demonstrate the process of building the OLAP cube from the heart disease database. The detailed procedure for building the cardio test cube is described, step by step, in the following subsections.

Figure 4.2 Screenshot of the cardio cube builder interface


4.2.1 Creating a New Cube

When the form is loaded, only the "Data Source/Cube" section is visible, allowing users to specify the name of the data source and the name of the new cube (Figure 4.3). The data source name specifies the cardio data file used to build the cardio cube (Figure 4.4). The name of the new cube is saved in the database object for future reference. These specified names are added to the cardio database object in the Analysis server through the SetDataSource and AddCubetoDb functions of the cubeBuilder component.

Figure 4.3 Screenshot of the “Data Source/Cube” section

Figure 4.4 Screenshot of sample entries for both sections of “Data Source/Cube” and “Specify Fact/Measures”
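The data source setup described above can be sketched with the CubeBuilder members listed in Appendix C. The server name, database name and file path below are hypothetical, and the AddCubetoDb signature is an assumption since that function is named in the text but not shown in the appendix excerpt.

```vbnet
' Sketch of section 4.2.1 using the Appendix C CubeBuilder class.
' Server name, database name and file path are hypothetical.
Dim cb As New CubeBuilder("LocalAnalysisServer", _
    "Microsoft.Jet.OLEDB.4.0", "C:\data\cardio.mdb")
cb.ConnectToServer(cb.SerName, cb.DataServer)

' Select the cardio database object, creating it if necessary
Dim db As DSO.MDStore
If cb.FindDataProj("CardioDB", cb.DataServer) Then
    db = cb.SetDataProj("CardioDB", cb.DataServer)
Else
    db = cb.AddNewDataProj("CardioDB", cb.DataServer)
End If

' Register the cardio data file as the cube's data source
Dim ds As DSO.DataSource = cb.AddNewDataSource("Cardio Source", db)
cb.SetLinkDataSource(cb.GetDataLink(cb.DSProvider, cb.DataPath), ds)

' Add the new cube name to the database object
' (AddCubetoDb's signature is assumed for illustration)
cb.AddCubetoDb("test1", db)
```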


4.2.2 The Fact Table and Measures Selections

After setting the data source and adding the new cube name "test1" to the target database object, the "Specify Fact/Measures" section becomes visible for the user to specify the fact table and the measures used in building the cardio cube (Figure 4.4). The fact table lists the core features for the queries used in the analysis; it contains a column for each measure as well as a column for each dimension. The measures are a set of numeric data based on the column values of the fact table and are the key indicators of the user's primary analytical interest [6]. Figure 4.5 shows the details of the sample entries for the selection of the fact table and measures.

Figure 4.5 Screenshot of sample entries of “Specify Fact/Measure” section

4.2.3 Adding Dimensions to the Cube

Dimensions are the categories of the data analysis. As shown in Figure 4.6, the "Add Dimensions to Cube" section is used to add dimensions to the cube. Pre-defined shared dimensions are available in the cardio database object; these dimensions can be specified and added to the cube through the AddDimToCube function of the cubeBuilder component after clicking the "Add Dimension" button. Figure 4.7 shows the sample entries of the dimension and its related key column.

Figure 4.6 Screenshot of the “Add Dimensions to Cube” section

Figure 4.7 Screenshot of sample entries for cube dimension
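The dimension step above can be sketched in a few lines. Both the AddDimToCube signature and the dimension names are assumptions, and the server, database and cube names are the same hypothetical placeholders used elsewhere in this chapter.

```vbnet
' Sketch: adding pre-defined shared dimensions to the "test1" cube.
' AddDimToCube's signature and the dimension names are assumed.
Dim cb As New CubeBuilder("LocalAnalysisServer", _
    "Microsoft.Jet.OLEDB.4.0", "C:\data\cardio.mdb")
cb.ConnectToServer(cb.SerName, cb.DataServer)

Dim db As DSO.MDStore = cb.SetDataProj("CardioDB", cb.DataServer)
Dim cube As DSO.MDStore = db.MDStores.Item("test1")

cb.AddDimToCube("Pain Type", cube)   ' shared dimension from the database
cb.AddDimToCube("Location", cube)
```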

4.2.4 Processing and Building the New Cube

After determining the measures, the dimensions and the fact table of the cube, the "Process/Build Cube" section of the form becomes visible, as shown in Figure 4.8. After clicking the "Build Cube" button, multidimensional OLAP (MOLAP) is chosen as the storage mode for the cardio cube [1, 6]. The storage format affects the disk-storage space requirements and the data-retrieval performance. The MOLAP mode is chosen because it stores the fact data and the aggregations on the Analysis server in a space-efficient, highly indexed multidimensional form [1, 5]. In addition, MOLAP mode summarizes the transactions into multidimensional views ahead of time. Data retrieval from these types of databases is extremely fast, because all calculations are pre-generated when the cube is created.

Figure 4.8 Screenshot of the “Process/Build Cube” section

4.2.5 The Results

The detailed hierarchical database objects before and after the processes of building

the cardio cube are depicted in Figure 4.9 and Figure 4.10 respectively. The difference

between these two figures is that the new sample cube was added to the cube folder of the

cardio database object after the process of building the cardio cube. However, these

figures do not show the cube data of the cardio cube; the detailed cube data can be

accessed in the following section with the web-based application, Cardio Cube Browser,

which is implemented with the cubeBrowser component and is developed in this thesis.


Figure 4.9 Screenshot of the cardio test database object before building the new cardio cube

Figure 4.10 Screenshot of the cardio test database object after building the sample "cube1"


4.3 Implementation of the cubeBrowser Component

The ASP.NET web-based application, written in VB.NET, implements the cubeBrowser component for end-users to browse the cardio cube data. The application's user interface, developed in this work, is contained within a single web form, cubeBrowser.aspx, as shown in Figure 4.11 [20, 21]. The application provides the following functional features for retrieving cube data:

A. Connecting to the Analysis server where the target cardio cube is located

B. Retrieving the cardio cube data based on the user's specifications

Figure 4.11 Screenshot of the web form BrowseCuber.aspx

4.3.1 Connection to the Analysis Server

In querying the cardio cube, the first stage is to set up the connection to the Analysis server, which hosts the target cardio cube. Three functions in the analysis component cubeBrowser, ConnectToServer, SetUpDatabase and SetUpDataSource, are used to connect to the server and to set up the database object and data sources in the Analysis server, respectively. When the page is requested by the user, the server processes the request and sends the page to the browser. The server also connects to the Analysis server and lists the available cubes of the cardio database object for users to view, as shown in Figure 4.12.

Figure 4.12 Screenshot of the listing of available cubes

4.3.2 Retrieving the Cardio Cube Data

Once a connection has been made to the OLAP data source, the multidimensional data of the cardio cube can be queried and manipulated through the MDX query language [22, 23]. The first step in creating the MDX query is to select the target cube from the dropdown list (Figure 4.13). After specifying the target cube, the user selects the measures whose data are in the cube, as shown in Figure 4.13.

Figure 4.13 Screenshot of specifying cube entry and measures


As shown in Figure 4.14, there are two pre-defined queries the user can select, according to their analytical purpose, to view the cardio cube data:

a. Pain type-location data

This option displays the cardio cube data for the different chest pain types across different geographical locations for the selected target measures (Figure 4.15).

b. Patient-pain type data

This option displays the cube data for different patients with the selected chest pain type for the target measures. It requires the user to select a chest pain type from the dropdown list, as shown in Figure 4.16.

Figure 4.14 Screenshot of selections of measures and the pre-defined view options


Figure 4.15 Screenshot of selections of location for Pain-Type option

Figure 4.16 Screenshot of selections of pain-type for Patient option
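A pre-defined query of this kind can be sketched with the classic ADO MD Cellset object, which the cubeBrowser literature cited here describes [15, 16, 20, 21]. The server, catalog, cube, dimension and measure names in the MDX string are assumptions, not the actual cardio cube schema.

```vbnet
' Illustrative ADO MD query for the "pain type-location" view;
' all object names in the connection string and MDX are assumed.
Dim conn As New ADODB.Connection()
conn.Open("Provider=MSOLAP;Data Source=LocalAnalysisServer;" & _
          "Initial Catalog=CardioDB;")

Dim mdx As String = _
    "SELECT {[Measures].[Chol], [Measures].[Trestbps]} ON COLUMNS, " & _
    "CROSSJOIN([Pain Type].Members, [Location].[Country].Members) ON ROWS " & _
    "FROM [test1]"

Dim cst As New ADOMD.Cellset()
cst.Open(mdx, conn)   ' the cellset is then bound to the results grid
```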

4.3.3 Displaying the Cardio Cube Data

After selecting the measures and specifying the view options, the user clicks the "Browse" button; the server then processes the request and displays the cube data in grid format, as shown in Figure 4.17 and Figure 4.18.

Figure 4.17 Results of cube data for Pain-Type option with test country


Figure 4.18 Results of cube data for the angina chest pains per patient test city

4.3.4 Drill-down and Drill-up Capabilities

OLAP tools organize data in multiple dimensions and hierarchies. Dimensions are usually associated with hierarchies, which organize data into levels. Drilling down and drilling up are the two analytical techniques whereby the user navigates among levels of data, ranging from the most summarized (up) to the most detailed (down) [20, 21]. For example, when viewing the cardio cube data for different cities, a drill-down operation in the patient test center dimension displays each of the test centers tc001 to tc004, as shown in Figure 4.19. A drill-up operation goes in the reverse direction, to a higher level, and displays the data by test country, as shown in Figure 4.20.


Figure 4.19 Screenshot of drill-down to the test center level of Patient option

Figure 4.20 Screenshot of drill-up to the country’s level of Patient option
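In MDX, the drill operations described above can be expressed with the DrilldownMember function, which expands selected members one level; re-querying at the higher level drills back up. The dimension, member and cube names below are assumptions for illustration.

```vbnet
' Drill-down/drill-up sketches in MDX; member names are assumed.
' DrilldownMember expands the chosen country to its test centers:
Dim drillDownMdx As String = _
    "SELECT {[Measures].[Chol]} ON COLUMNS, " & _
    "DRILLDOWNMEMBER([Location].[Country].Members, " & _
    "{[Location].[Country].[USA]}) ON ROWS FROM [test1]"

' Drilling up simply re-queries at the higher country level:
Dim drillUpMdx As String = _
    "SELECT {[Measures].[Chol]} ON COLUMNS, " & _
    "[Location].[Country].Members ON ROWS FROM [test1]"
```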


4.4 Implementation of the DMBuilder Component

Figure 4.21 depicts the application interface DMMBuilder, which implements the DMBuilder component. This interface is used to create a mining model, with the Microsoft Decision Trees algorithm as the rule-construction method, on the cardio cube data created in the previous section. The interface is coded and designed as a simple Windows form, using VB.NET as the major programming language in the MS SQL Server 2000 environment [12]. As shown in Figure 4.21, the "Mining Model Builder" form is divided into five groups. The following steps describe how to use this form to build the mining model with the cardio cube data.

Figure 4.21 Screenshot of the main interface DMMBuilder


Step 1: Setting up server and database information:

The first step in creating the mining model is to provide not only the names of the Analysis server and database on which the user wants to perform the mining model task, but also the mining model's name for storing the mining model attributes, as shown in Figure 4.22. After clicking the "OK" button, an empty mining model is created on the user-specified server and added to the user-specified database object. The "Mining Model Setup" section also becomes available for the rest of the process, as shown in Figure 4.23.

Figure 4.22 Screenshot of the “Server/Database” section

Figure 4.23 Screenshot of Mining model setup


Step 2: Setting up the mining model role:

The "Mining Model Role" screen collects the mining model role information in order to set up the security role for the new mining model, as shown in Figure 4.24. The SetMiningRole method of the DMBuilder component performs this task.

Figure 4.24 Screenshot of setting the mining model role

Step 3: Setting up the properties of the mining model:

In this step, the user not only provides the data source name, source cube name, case dimension and a general description of the model, but also specifies the mining model algorithm to be used for the target mining model (Figure 4.25). The Microsoft Decision Trees algorithm is chosen as the prediction method [10]. After clicking the "Add to DB" button, the attribute information is added into each related property of the target mining model.

Figure 4.25 Screenshot of setting properties and algorithm for the mining model


Step 4: Adding analysis column attribute:

The properties needed for the new data mining model column are set through the "Analysis Column Entry" form section, as shown in Figure 4.26. In this step, the user identifies the training case and the predictive outcome for the purpose of the analysis.

Figure 4.26 Screenshot of setting the attributes of analytical column

Step 5: Saving and processing data mining model:

The new mining model object can be saved to the Analysis server by clicking the "Save DMM" button (Figure 4.26). At this point the new mining model is created but not yet processed, and it cannot be viewed until processing is completed.

After clicking the "Process DMM" button, the new mining model is processed in full-process mode, and the information about the patterns and rules discovered in the training data is stored as the mining model content. The actual data from the training dataset is not stored in the target server database. Figure 4.27 is a screenshot of the content detail after processing the cardio mining model.


Figure 4.27 Screenshot of the cardio mining model using Microsoft Decision Trees Algorithm


CHAPTER V

DISCUSSIONS & FUTURE WORKS

This chapter summarizes the main contributions and conclusions of this thesis regarding the data analysis components for OLAP solutions in the Microsoft SQL Server 2000 system. It also outlines future work based on the current results.

5.1 Contributions and Evaluations

The main purpose of this thesis is to develop data analysis components as a foundation for developers to build user-friendly interface applications for OLAP solutions. These analysis components also hide the complexity and the heavy technical terminology from non-technical users in the process of building the OLAP cubes and mining model systems.

Our contributions are summarized as follows:

- A detailed review of the functionality of Analysis Manager in the process of building and viewing OLAP cubes as well as of building data mining models [5, 7]

- Development of the data analysis components for OLAP solutions

- Development of the stand-alone desktop interfaces implemented with the cubeBuilder and DMBuilder components

- Application of the data analysis components developed in this work to a case study of the cardio disease dataset through a user-specific application

- Development of the web-based interface application implemented with the cubeBrowser component; this web-based interface is also used to browse the cardio cube data created in the current work

Both Microsoft Analysis Manager and the analysis components can perform the following tasks for OLAP solutions:

A. Creating the database objects and specifying the data sources in the Analysis server

B. Building and processing the OLAP cubes

C. Creating and processing the data mining models

D. Specifying the storage options and optimizing the query performance

E. Browsing the cube data

Although Analysis Manager provides wizards and editors to help users build and process OLAP cubes and mining models, the technical terms and the full understanding of the underlying structure that these tools require remain a barrier to using them efficiently. The analysis components help developers design user-friendly interface applications that hide these technical complexities from non-technical users. They also offer the potential to assemble applications much more rapidly and efficiently: a key to developing applications quickly is the ability to reuse existing pre-built components to meet the user's application requirements [6].

Analysis Manager installs the PivotTable Service on the database server, which includes an OLE DB provider that allows connecting to OLAP data sources. The PivotTable Service is an OLE DB provider for multidimensional data and data mining operations [7, 14]. It is the primary method of communication between a client application and a multidimensional data source or data mining model, and it is used to build client applications that interact with multidimensional data. It also provides methods for online and offline data-mining analysis of multidimensional and relational data, and it offers connectivity to multidimensional cubes and data mining models managed by Analysis Services.

The major limitation of the PivotTable Service is that it must be installed on the client machine; otherwise, the client's PivotTable control is unable to communicate with the OLAP data sources.

To overcome this limitation, the data analysis component implemented in the web-based OLAP browsing application interface provides reusable business solutions and disseminates information more effectively. The architecture presented here is designed to use several sophisticated technologies, including SQL 2000 Analysis Services, the cubeBrowser component, and ASP.NET, to the best of their capacities.


5.2 Future Works

This research developed data analysis components for OLAP solutions and mining model systems, and demonstrated their functionality with the cardio databases. However, an analysis component for viewing the data of a mining model has not yet been developed or implemented; the development of a component for visualizing the mining model is left as future work.

The new release, Microsoft SQL Server 2005, enhances many Business Intelligence features and builds complex business analytics with Analysis Services [25, 26]. In addition, ADOMD.NET uses the XML for Analysis protocol to communicate with the analytical data source [27]. More work will be needed to extend the data analysis components to the new features of Microsoft SQL Server 2005 and ADOMD.NET to provide a user-friendly interface for OLAP solutions. In addition to offloading the design burden from developers, these analysis components can help end users navigate a rich, complex data set with a higher degree of confidence in their analysis.


BIBLIOGRAPHY

[1]. Mailvaganam, H. 2004. "Introduction to OLAP: Slice Dice and Drill". Retrieved August 22, 2005 from http://www.dwreview.com/OLAP/Introduction_OLAP.html.
[2]. The OLAP Council. OLAP and OLAP Server definitions. Retrieved August, 2005 from http://altaplana.com/olap/glossary.html.
[3]. Thearling, K. 1995. "From Data Mining to Database Marketing". Data Intelligence Group.
[4]. Thearling, K. 2000. "An Introduction to Data Mining: Discovering Hidden Value in Your Data Warehouse". Retrieved August 18, 2005 from http://www.thearling.com/text/dmwhite/dmwhite.htm.
[5]. Pearson, W. 2002. "Introduction to SQL Server 2000 Analysis Services: Creating Our First Cube". http://www.databasejournal.com/feature/mssql/article.php/1429671.
[6]. OLAP Train and Jacobson, R. 2000. Microsoft SQL Server 2000 Analysis Services Step by Step. Microsoft Press.
[7]. Garcia, L. 2003. "Understanding Microsoft SQL Server 2000 Analysis Services". http://www.phptr.com/articles/article.asp.
[8]. Bertucci, P. 2002. "Microsoft SQL Server Analysis Services". Microsoft SQL Server 2000 Unleashed, Second Edition, Chapter 42, 1347-1392.
[9]. Soni, S. and Kurtz, W. 2001. "Analysis Services: Optimizing Cube Performance Using Microsoft SQL Server 2000 Analysis Services". Retrieved April, 2005 from http://msdn.microsoft.com/library/en-us/dnsql2k/html/olapunisys.asp.
[10]. de Ville, B. 2001. "Data Mining in SQL Server 2000". SQL Server Magazine. http://www.windowsitpro.com/SQLServer/Article/ArticleID/16175/16175.html.
[11]. Charran, E. 2002. "Introduction to Data Mining with SQL Server". Retrieved August, 2005 from http://www.sql-server-performance.com/ec_data_mining.asp.
[12]. Rae, S. 2005. "Building Intelligent .NET Applications: Data-Mining Predictions". http://www.awprofessional.com/articles/article.asp.
[13]. Data Mining: http://www.megaputer.com/dm/dm101.php3.
[14]. Microsoft OLE DB Programmer's Reference: http://msdn.microsoft.com/library.
[15]. Brust, A. J. 1999. "Put OLAP and ADO MD to Work". VBPJ, November 1999 Issue, 94-97.
[16]. Youness, S. 2000. "Using MDX and ADOMD to Access Microsoft OLAP Data". http://www.topxml.com/conference/wrox/2000_vegas/text/sakhr_olap.pdf.
[17]. Whitney, R. 2002. "Collaboration through DSO". http://www.windowsitpro.com/SQLServer/Article/ArticleID/26564/26564.html.
[18]. Rice, F. C. 2002. "Programming OLAP Databases from Microsoft Access Using DSO". http://msdn.microsoft.com/library/default.
[19]. Microsoft OLE DB Programmer's Reference: http://msdn.microsoft.com/library.
[20]. Nolan, C. 1999. "Manipulate and Query OLAP Data Using ADOMD and MDX - Part I". Microsoft Systems Journal, August, 1999.
[21]. Nolan, C. 1999. "Manipulate and Query OLAP Data Using ADOMD and MDX - Part II". Microsoft Systems Journal, September, 1999.
[22]. Pearson, W. 2002. "MDX in Analysis Services". Retrieved December, 2004 from http://www.databasejournal.com/features/mssql/article.php/1495511.
[23]. Pearson, W. 2002. "MDX Essentials". Retrieved December, 2004 from http://www.databasejournal.com/features/mssql/article.php/1550061.
[24]. Heart Disease database. http://www.ics.uci.edu/~mlearn/MLSummary.html.
[25]. Frawley, M. 2004. "Analysis Services Comparison: SQL 2000 vs. 2005". Retrieved October, 2005 from http://www.devx.com/dbzone/Article/21539.
[26]. Utley, C. 2005. "Solving Business Problems with SQL Server 2005 Analysis Services". Retrieved January, 2006 from http://www.microsoft.com/technet/prodtechnol/sql/2005/solvngbp.mspx.
[27]. Analysis Services Data Access Interfaces: ADOMD.NET Client Programming. Retrieved January, 2006 from http://msdn2.microsoft.com/en-us/library/ms123483.aspx.


APPENDICES


APPENDIX A

DATASET USED FOR CASE STUDIES

The database used in this work was downloaded from the web site of the Repository of Machine Learning Databases [24]. The heart-disease directory contains four databases concerning heart disease diagnosis. The data was collected at the following locations:

1. Cleveland Clinic Foundation (Cleveland.data)
2. Hungarian Institute of Cardiology, Budapest (Hungarian.data)
3. V. A. Medical Center, Long Beach, CA (long-beach-va.data)
4. University Hospital, Zurich, Switzerland (Switzerland.data)

Each database has the same instance format. While the databases have seventy-six raw attributes, all published experiments refer to using a subset of fourteen of them.

The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for the data collection. They are:

A. Creators:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

B. Donor: David W. Aha ([email protected]). Date: July, 1988.


C. Attributes:

a. age: age in years
b. sex: gender (1 = male; 0 = female)
c. cp: chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
d. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
e. chol: serum cholesterol in mg/dl
f. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
g. restecg: resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
h. thalach: maximum heart rate achieved
i. exang: exercise-induced angina (1 = yes; 0 = no)
j. oldpeak: ST depression induced by exercise relative to rest
k. slope: the slope of the peak exercise ST segment
l. ca: number of major vessels (0-3) colored by fluoroscopy
m. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
n. num (prediction attribute): diagnosis (0 = healthy; 1, 2, 3, 4 = sick)


APPENDIX B

APPLICATION INTERFACE OF OLAP CUBE BUILDER

Figure B.1 Screenshot of the OLAP cube builder interface for the power users


APPENDIX C

SOURCE CODE OF CUBEBUILDER

This section consists of the source code for the analysis component, cubeBuilder,

which was written in the Visual Basic.NET programming language.

'Visual Basic.NET source code

Public Class CubeBuilder

    'Declarations
    Public DataServer As New DSO.Server()
    Public DataSource As DSO.DataSource
    Public DataProj As DSO.MDStore
    Public Provider As String
    Public DataPath As String
    Public SerName As String

    'Initializations
    Sub New()
    End Sub

    Sub New(ByVal inServ As String, ByVal inProv As String, ByVal inPath As String)
        SerName = inServ
        Provider = inProv
        DataPath = inPath
    End Sub


    'Class properties
    Public ReadOnly Property Server() As DSO.Server
        Get
            Return DataServer
        End Get
    End Property

    Public Property DSProvider() As String
        Get
            Return Provider
        End Get
        Set(ByVal Value As String)
            Provider = Value
        End Set
    End Property

    Public Property DataProject() As DSO.MDStore
        Get
            Return DataProj
        End Get
        Set(ByVal Value As DSO.MDStore)
            DataProj = Value
        End Set
    End Property

    'Connects to the specified server
    Public Sub ConnectToServer(ByVal servName As String, ByRef serv As DSO.Server)
        serv.Connect(servName)
    End Sub

    'Closes the connection to the server
    Public Sub CloseServerConnect(ByRef inServer As DSO.Server)
        inServer.CloseServer()
    End Sub


    'Checking the validation status of a server
    Public Function ServerValid(ByRef serv As DSO.Server) As Boolean
        If serv.IsValid Then
            Return True
        Else
            Return False
        End If
    End Function

    'Finding the target database object in the server
    Public Function FindDataProj(ByVal db As String, ByRef dServ As DSO.Server) As Boolean
        If dServ.MDStores.Find(db) Then
            Return True
        Else
            Return False
        End If
    End Function

    'Adding a new database object to the server
    Public Function AddNewDataProj(ByVal db As String, ByRef dServ As DSO.Server) As DSO.MDStore
        Return dServ.MDStores.AddNew(db)
    End Function

    'Setting the database object
    Public Function SetDataProj(ByVal db As String, ByRef dServ As DSO.Server) As DSO.MDStore
        Return dServ.MDStores.Item(db)
    End Function


    'Searching for the specified data source
    Public Function FindDataSource(ByVal ds As String, ByRef dDB As DSO.MDStore) As Boolean
        If dDB.DataSources.Find(ds) Then
            Return True
        Else
            Return False
        End If
    End Function

    'Adding a new data source
    Public Function AddNewDataSource(ByVal ds As String, ByRef dDB As DSO.MDStore) As DSO.DataSource
        Return dDB.DataSources.AddNew(ds)
    End Function

    'Setting the data source
    Public Function SetDataSource(ByVal ds As String, ByRef dDB As DSO.MDStore) As DSO.DataSource
        Return dDB.DataSources.Item(ds)
    End Function

    'Building the data link connection string
    Public Function GetDataLink(ByVal p As String, ByVal dp As String) As String
        Dim str As String
        str = "Provider=" & p & ";Data Source=" & dp & ";Persist Security Info=False;Jet OLEDB:SFP=True;"
        Return str
    End Function


    'Setting the data link of a data source
    Public Sub SetLinkDataSource(ByVal dLink As String, ByRef ds As DSO.DataSource)
        ds.ConnectionString = dLink
        ds.Update()
    End Sub

    'Creating a database dimension
    Public Function CreateDbaseDimension(ByRef dDbase As DSO.MDStore, _
            ByRef dataSrc As DSO.DataSource, ByVal strDim As String, _
            ByVal strDescr As String, ByVal strFromClause As String, _
            ByVal strJoin As String, ByVal strDimType As String) As DSO.Dimension
        Dim dsoNewDim As DSO.Dimension
        dsoNewDim = dDbase.Dimensions.AddNew(strDim)
        dsoNewDim.DataSource = dataSrc
        dsoNewDim.Description = strDescr
        dsoNewDim.FromClause = strFromClause
        dsoNewDim.JoinClause = strJoin
        dsoNewDim.DimensionType = strDimType
        Return dsoNewDim
    End Function

    'Adding a level to the dimension table
    Public Sub AddLeveltoDim(ByRef dsoDim As DSO.Dimension, ByVal levStr As String, _
            ByVal strDimtbl As String, ByVal ColumnStr As String, _
            ByVal ColType As Short, ByVal colSize As Integer, ByVal EstSize As Integer)
        Dim dsoLev As DSO.Level
        Dim strKeyColumn As String
        dsoLev = dsoDim.Levels.AddNew(levStr)
        strKeyColumn = """" & strDimtbl & """" & "." & """" & ColumnStr & """"
        dsoLev.MemberKeyColumn = strKeyColumn
        dsoLev.ColumnType = ColType
        dsoLev.ColumnSize = colSize


        dsoLev.EstimatedSize = EstSize
        dsoDim.Update()
    End Sub

    'Alternative method for adding a level to the dimension table
    Public Sub AddLeveltoDim1(ByRef dsoDim As DSO.Dimension, ByVal levStr As String, _
            ByVal strDimtbl As String, ByVal ColumnStr As String, ByVal ColType As String)
        Dim dsoLev As DSO.Level
        Dim strKeyColumn As String
        dsoLev = dsoDim.Levels.AddNew(levStr)
        strKeyColumn = """" & strDimtbl & """" & "." & """" & ColumnStr & """"
        dsoLev.MemberKeyColumn = strKeyColumn
        dsoLev.ColumnType = CShort(ColType)
        dsoLev.ColumnSize = 255
        dsoLev.EstimatedSize = 1
        dsoDim.Update()
    End Sub

    'Adding a new cube to the database object
    Public Function AddNewCube(ByRef dSer As DSO.Server, ByVal dDB As String, _
            ByVal DtSrc As String, ByVal dtCube As String) As DSO.MDStore
        Dim dsoCube As DSO.MDStore
        dsoCube = dSer.MDStores.Item(dDB).MDStores.AddNew(dtCube)
        dsoCube.DataSources.AddNew(DtSrc)
        dsoCube.Update()
        Return dsoCube
    End Function

    'Adding the fact table to the cube
    Public Sub AddFactTblToCube(ByRef inCube As DSO.MDStore, ByVal strFactTblName As String)
        inCube.SourceTable = strFactTblName
        inCube.EstimatedRows = 100000
    End Sub


    'Adding a shared dimension to the cube
    Public Sub AddShareDDimToCube(ByRef inCube As DSO.MDStore, ByVal strDimName As String)
        inCube.Dimensions.AddNew(strDimName)
        inCube.Update()
    End Sub

    'Adding a measure to the cube
    Public Sub AddMeasureToCube(ByRef inCube As DSO.MDStore, ByVal inMeaText As String, _
            ByVal inDescr As String, ByVal factTbl As String, ByVal inField As String)
        Dim dsoMeasure As DSO.Measure
        dsoMeasure = inCube.Measures.AddNew(inMeaText)
        dsoMeasure.Description = inDescr
        dsoMeasure.SourceColumn = """" & factTbl & """" & "." & """" & inField & """"
        dsoMeasure.SourceColumnType = ADODB.DataTypeEnum.adDouble
        dsoMeasure.AggregateFunction = DSO.AggregatesTypes.aggSum
        inCube.Update()
    End Sub

    'Processing the cube
    Public Sub ProcessCube(ByRef iCube As DSO.MDStore)
        iCube.Process(DSO.ProcessTypes.processFull)
    End Sub

End Class
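The CubeBuilder class above can be driven as follows. This is a hypothetical usage sketch, not part of the thesis code: it assumes a local Analysis Services instance named "LocalHost", the Microsoft Jet OLE DB provider, and an Access file at C:\data\sales.mdb; the database, data source, fact table, and cube names are illustrative.

```vbnet
'Hypothetical usage sketch of CubeBuilder (assumes a local Analysis Services
'instance and a Jet data source; all names below are illustrative).
Module CubeBuilderDemo
    Sub Main()
        Dim builder As New CubeBuilder("LocalHost", "Microsoft.Jet.OLEDB.4.0", "C:\data\sales.mdb")

        'Connect and create (or reuse) the target database object
        builder.ConnectToServer(builder.SerName, builder.DataServer)
        If Not builder.FindDataProj("SalesDB", builder.DataServer) Then
            builder.DataProj = builder.AddNewDataProj("SalesDB", builder.DataServer)
        Else
            builder.DataProj = builder.SetDataProj("SalesDB", builder.DataServer)
        End If

        'Attach a data source via the generated data link string
        Dim ds As DSO.DataSource = builder.AddNewDataSource("SalesSource", builder.DataProj)
        builder.SetLinkDataSource(builder.GetDataLink(builder.Provider, builder.DataPath), ds)

        'Build a cube on the fact table and process it
        Dim cube As DSO.MDStore = builder.AddNewCube(builder.DataServer, "SalesDB", "SalesSource", "SalesCube")
        builder.AddFactTblToCube(cube, """sales_fact""")
        builder.ProcessCube(cube)

        builder.CloseServerConnect(builder.DataServer)
    End Sub
End Module
```

Dimensions and measures would be added between cube creation and processing via CreateDbaseDimension, AddLeveltoDim, AddShareDDimToCube, and AddMeasureToCube.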


APPENDIX D

SOURCE CODE OF CUBEBROWSER

This section consists of the source code for the analysis component, cubeBrowser, which was written in the Visual Basic.NET programming language.

'Visual Basic.NET source code

Public Class cubeBrowser

    'Declarations
    Public cbServer As String
    Public cbDatabase As String
    Public cbDBconnect As New ADODB.Connection()
    Public cbCellset As New ADOMD.Cellset()
    'Dim conStr As String

    'Initialization
    Public Sub New(ByVal oSer As String, ByVal oDb As String)
        cbServer = oSer
        cbDatabase = oDb
    End Sub

    'Getting the connection string for the Catalog object
    Public Function GetConCatalogString() As String
        Dim strTemp As String
        strTemp = " "
        strTemp = strTemp & "Provider=msolap; data source=" & cbServer
        strTemp = strTemp & "; Initial Catalog=" & cbDatabase & ";"


        Return strTemp
    End Function

    'Connecting to the Catalog object
    Public Function ConnectToCatalog(ByVal conStr As String) As Object
        Dim adomdCatlog As New ADOMD.Catalog()
        adomdCatlog.let_ActiveConnection(conStr)
        Return adomdCatlog
    End Function

    'Getting the Cellset connection string
    Public Function GetCellConnectString() As String
        Dim strCon As String
        strCon = " "
        strCon = strCon & "Provider=msolap; data source=" & cbServer
        strCon = strCon & "; database=" & cbDatabase & ";"
        Return strCon
    End Function

    'Getting the connection to the cellset
    Public Function GetConnectToCell(ByVal olapDb As ADODB.Connection) As Object
        cbCellset.ActiveConnection = olapDb
        Return cbCellset
    End Function

    'Connecting to the Database object
    Public Function ConnectToDB(ByVal oS As String) As Object
        cbDBconnect.Open(oS)
        Return cbDBconnect
    End Function

    'Connecting to the cube
    Public Function ConnectToCube(ByVal oStr As String, ByVal oMdx As String) As Object
        cbDBconnect.Open(oStr)
        cbCellset.ActiveConnection = cbDBconnect


        cbCellset.Open(oMdx)
        Return cbCellset
    End Function

    'Displaying the Cellset I
    Public Sub ViewCubeStruct(ByRef lstBox As Object, ByRef inCat As Object, ByVal inCubeName As String)
        Dim cbDef As ADOMD.CubeDef
        Dim cbDim As ADOMD.Dimension
        Dim strDim As String
        Dim cbHir As ADOMD.Hierarchy
        Dim strLevel As String
        Dim cbLev As ADOMD.Level
        Dim strTemp As String
        cbDef = inCat.CubeDefs(inCubeName)
        strTemp = "Cube: " & inCubeName
        lstBox.Items.Add(strTemp)
        strTemp = " "
        For Each cbDim In cbDef.Dimensions
            strDim = " "
            strDim = " -Dimension: " & cbDim.Name
            lstBox.Items.Add(strDim)
            For Each cbHir In cbDim.Hierarchies
                For Each cbLev In cbHir.Levels
                    strLevel = " -- " & cbLev.Name
                    lstBox.Items.Add(strLevel)
                Next
            Next
        Next
    End Sub

    'Displaying the Cellset II
    Public Sub ViewCubeStruct(ByRef lstBox As Object, ByRef inCat As Object)
        Dim cbDef As ADOMD.CubeDef
        Dim cbDim As ADOMD.Dimension
        Dim strDim As String
        Dim cbHir As ADOMD.Hierarchy
        Dim strLevel As String


        Dim cbLev As ADOMD.Level
        Dim strTemp As String
        For Each cbDef In inCat.CubeDefs
            strTemp = "Cube: " & cbDef.Name
            lstBox.Items.Add(strTemp)
            strTemp = " "
            For Each cbDim In cbDef.Dimensions
                strDim = " "
                strDim = " -Dimension: " & cbDim.Name
                lstBox.Items.Add(strDim)
                For Each cbHir In cbDim.Hierarchies
                    For Each cbLev In cbHir.Levels
                        strLevel = " -- " & cbLev.Name
                        lstBox.Items.Add(strLevel)
                    Next
                Next
            Next
        Next
    End Sub

    'Closing the object connection
    Public Sub CloseConnection(ByVal iConn As Object)
        iConn.Close()
    End Sub

End Class
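The cubeBrowser class above might be used from a Windows Forms application as sketched below. This is a hypothetical example, not part of the thesis code: the server name, database name, cube name, and MDX query are illustrative, and lstOutput is assumed to be a ListBox on the calling form.

```vbnet
'Hypothetical usage sketch of cubeBrowser (server, database, cube, and MDX
'below are illustrative; lstOutput is assumed to be a Windows Forms ListBox).
Sub BrowseCube(ByVal lstOutput As System.Windows.Forms.ListBox)
    Dim browser As New cubeBrowser("LocalHost", "SalesDB")

    'List the structure of every cube in the catalog
    Dim cat As Object = browser.ConnectToCatalog(browser.GetConCatalogString())
    browser.ViewCubeStruct(lstOutput, cat)

    'Run an MDX query against one cube and fetch the resulting cellset
    Dim mdx As String = "SELECT Measures.MEMBERS ON COLUMNS FROM SalesCube"
    Dim cells As ADOMD.Cellset = _
        CType(browser.ConnectToCube(browser.GetCellConnectString(), mdx), ADOMD.Cellset)
    'The caller would then walk cells.Axes and read individual cell values

    browser.CloseConnection(browser.cbDBconnect)
End Sub
```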


APPENDIX E

SOURCE CODE OF DMBUILDER

This section consists of the source code for the analysis component, DMBuilder, which was written in the Visual Basic.NET programming language.

'Visual Basic.NET source code

Public Class clsBuildMiningModel

    'Declarations

Public dsoCol As DSO.Column

    'Initialization

Public Sub New()

End Sub

    'Clearing the object
    Public Sub ClearObject(ByRef inObj As Object)
        inObj = Nothing
    End Sub

    'Connecting to the server
    Public Sub ConnectToServer(ByVal strSer As String, ByRef ser As DSO.Server)
        ser = New DSO.Server()
        ser.Connect(strSer)

End Sub


    'Closing the server connection
    Public Sub CloseServerConnection(ByRef s As DSO.Server)
        s.CloseServer()
    End Sub

    'Checking the server connection status
    Public Function IsServerConnect(ByRef ser As DSO.Server) As Boolean
        If ser.IsValid Then
            Return True
        Else
            Return False
        End If
    End Function

    'Checking the target object's status
    Public Function IsExistingModel(ByRef db As DSO.MDStore, ByVal strName As String) As Boolean
        If db.MiningModels.Item(strName) Is Nothing Then
            Return False
        Else
            Return True
        End If
    End Function

    'Checking the target cube's status
    Public Function IsValidCube(ByRef db As DSO.MDStore, ByVal sCube As String) As Boolean
        If db.MDStores.Find(sCube) Then
            Return True
        Else
            Return False
        End If
    End Function

    'Adding a new mining model


    Public Sub AddNewMiningModel(ByRef db As DSO.MDStore, ByVal mName As String, _
            ByVal dtType As DSO.SubClassTypes, ByRef dMM As DSO.MiningModel)
        dMM = db.MiningModels.AddNew(mName, dtType)
    End Sub

    'Adding a new model role
    Public Sub AddNewMMRole(ByRef dmm As DSO.MiningModel, ByVal rName As String, _
            ByRef dRole As DSO.Role)
        dRole = dmm.Roles.AddNew(rName)
    End Sub

    'Setting the properties of the target mining model
    Public Sub SetModelProperty(ByRef dmm As DSO.MiningModel, ByVal dtSrc As String, _
            ByVal mDescr As String, ByVal dtType As DSO.SubClassTypes, _
            ByVal mmAlgo As String, ByVal srcCube As String, _
            ByVal cDim As String, ByVal mTrainQ As String)
        With dmm
            .DataSources.AddNew(dtSrc, DSO.SubClassTypes.sbclsOlap)
            .Description = mDescr
            .MiningAlgorithm = mmAlgo
            .SourceCube = srcCube
            .CaseDimension = cDim
            .TrainingQuery = mTrainQ
            .Update()
        End With
    End Sub

    'Enabling the column's property
    Public Sub EnableColumnProperty(ByRef dmm As DSO.MiningModel, ByVal strCol As String, _
            ByVal CheckFlag As Boolean, ByVal InputSelect As Boolean, _
            ByVal PredictSelect As Boolean)


        dsoCol = dmm.Columns.Item(strCol)
        If CheckFlag = True Then
            dsoCol.IsInput = InputSelect
            dsoCol.IsPredictable = PredictSelect
        End If
        dsoCol.IsDisabled = False
    End Sub

    'Saving the target mining model

    Public Sub SaveMiningModel(ByRef dMM As DSO.MiningModel)
        dMM.LastUpdated = Now
        dMM.Update()
    End Sub

    'Processing the mining model
    Public Sub ProcessMiningModel(ByRef dsoDMM As DSO.MiningModel, _
            ByRef dsoLockType As DSO.OlapLockTypes, _
            ByRef dsoLockDescr As String, ByVal prcType As DSO.ProcessTypes)
        With dsoDMM
            .LockObject(dsoLockType, dsoLockDescr)
            .Process(prcType)
            .UnlockObject()
        End With
    End Sub

End Class
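The mining-model helper above might be used as sketched below. This is a hypothetical example, not part of the thesis code: the server, database, cube, case dimension, model, and column names are illustrative placeholders (loosely following the heart-disease dataset of Appendix A), and the algorithm name follows the decision-tree algorithm shipped with Analysis Services.

```vbnet
'Hypothetical usage sketch of clsBuildMiningModel (server, database, cube,
'dimension, model, and column names below are illustrative).
Sub BuildHeartModel()
    Dim helper As New clsBuildMiningModel()
    Dim server As DSO.Server
    Dim db As DSO.MDStore
    Dim model As DSO.MiningModel

    helper.ConnectToServer("LocalHost", server)
    db = server.MDStores.Item("HeartDB")

    'Create an OLAP mining model over an existing cube and mark the
    'prediction column before saving and processing
    If Not helper.IsExistingModel(db, "HeartModel") Then
        helper.AddNewMiningModel(db, "HeartModel", DSO.SubClassTypes.sbclsOlap, model)
        helper.SetModelProperty(model, "HeartSource", "Heart diagnosis model", _
            DSO.SubClassTypes.sbclsOlap, "Microsoft_Decision_Trees", _
            "HeartCube", "Patient", "")
        helper.EnableColumnProperty(model, "Num", True, False, True)
        helper.SaveMiningModel(model)
        helper.ProcessMiningModel(model, DSO.OlapLockTypes.olapLockProcess, _
            "Processing HeartModel", DSO.ProcessTypes.processFull)
    End If

    helper.CloseServerConnection(server)
End Sub
```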