Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University
Semester I Academic year 2016 www.ceb-rama.org
MEDICAL INFORMATICS & DATABASE MANAGEMENT
MODULE 5: BIG DATA MANAGEMENT AND ANALYSIS
DR. ORALUCK PATTANAPRATEEP
RACE 614 Medical Informatics &
Database Management
Module 5: Big data management and analysis
Contents
Objectives
References
I. Big data and data science
   Why big data
   Data science
II. Data warehouse and visualization
   What is a data warehouse
   - Basic data warehousing environment
   - Data mart and its components
   Big data and data lake
   Data visualization
   - Infographic
III. Machine learning algorithm and big data analytic
   What are machine learning algorithms
   - Classification model
   - Regression model
   - Cluster analysis
   - Association analysis
Objectives
Students should be able to:
1. Understand the concepts of big data, data science, and data warehousing
2. Apply the data science process to big data problems
3. Select appropriate data visualizations to communicate analytic insights clearly to audiences
4. Apply appropriate machine learning algorithms to analyze big data
References
1. Lantz B. Machine learning with R (2nd edition). Packt Publishing; 2015.
2. Provost F, Fawcett T. Data science for business. O’Reilly Media, Inc.; 2013.
3. Reeves LL. A manager’s guide to data warehousing. Wiley Publishing, Inc.; 2009.
4. Berka P, Rauch J, Zighed DA. Data mining and medical knowledge management: cases and applications. Information Science Reference; 2009.
5. Han J, Kamber M. Data mining: concepts and techniques (2nd edition). Morgan Kaufmann Publishers, CA, USA; 2006.
6. Kimball R, Ross M. The data warehouse toolkit: the complete guide to dimensional modeling (2nd edition). Wiley Computer Publishing; 2002.
In previous modules, we explored the management of data from primary sources, starting with designing record forms and managing a database in Epidata®. In the real world, however, another source of data is secondary data, especially in electronic format, which has recently grown greatly in size. It is increasingly gathered by high-performance and convenient devices, and is therefore called “big data”.
In this final module, we cover three domains of big data:
1. Big data and data science: introduces the concept of adding value to data and presents data science, the new era of data management.
2. Data warehouse and visualization: provides the concepts of building a data warehouse and demonstrates how to communicate the findings.
3. Machine learning algorithm and big data analytics: explores how to mine data with 4 main machine learning algorithms.
I. Big data and data science
How do we find information and knowledge in data, or big data? Figure 1 demonstrates the value added as we move from meaningless raw data at the base of the pyramid to meaningful information, knowledge, and wisdom. For example, 2 numbers at the raw data level, 115 and 90, have no meaning without any clue. By adding meaning to the numbers, we find the relationship between them: it is a fasting blood sugar (FBS) that decreases from 115 to 90. But the next question is whether a lower FBS is good or bad.
Figure 1: from data to wisdom (a pyramid: raw data, “115, 90”; data with meaning, “FBS decreases from 115 to 90”; information, understanding relations via the context “FBS should be less than 100”; knowledge, understanding patterns; wisdom, understanding principles, “controlling the diet will improve the patient’s health”)
By adding the context that FBS should be less than 100, the information we found, “FBS decreases from 115 to 90”, becomes good news. Moreover, tools and techniques called “machine learning algorithms” may be applied to understand patterns and predict the future. From this example, we may find patterns of patients who control their diet well and
predict their level of FBS. Finally, at the top, we may conclude that “controlling one’s diet will improve the patient’s health”.
Why big data
Big data simply means datasets that are too large, too varied, and too rapid for traditional data processing systems. In the past, we processed only small volumes of data of a single type, in overnight batches, to find information and knowledge. With today’s hardware performance and technology, data are generated in many forms and at high volume, and can be stored, retrieved, and analyzed far more rapidly. Big data can therefore be described mainly by the following 3 characteristics:
- Volume: the amount of data, from a few to millions of records, from one to hundreds of tables.
- Variety: the range of data types and sources, from structured to unstructured, from text to image.
- Velocity: the speed of data in and out, from batch to real time.
Data science
Once massive data in flexible forms can be processed in a few minutes, more opportunities arise to find information and discover knowledge. The next questions are who will perform the analysis and what special skills they need.
Figure 2: data science Venn diagram 2.0 (circles: computer science, maths and statistics, subject matter expertise; pairwise overlaps: machine learning, traditional research, traditional software; the three-way intersection, data science, is the domain of the rare “unicorn”)
Ref: http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
The three essential skills of a data scientist (figure 2) are 1. computer science, 2. maths and statistics, and 3. subject matter expertise (Venn diagram 2.0). Since it is quite hard to find one person who is skilled in all 3 areas, most data scientists work in teams.
The data science process (figure 3) starts with collecting raw data from real-world situations, e.g. human behavior, financial transactions, or medicine utilization. A hypothesis is then formulated, and the data are processed and cleaned for exploratory data analysis, which may take the form of summary statistics or graphs. If the data are insufficient or cannot answer the question, more data should be collected, processed, and cleaned. The next step is building models with machine learning algorithms. The final outcome of the data science process is a data product, i.e. value-added data. Throughout the process, the key activity is communicating with the audience so that they can make decisions, for example through reports or dashboards.
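As an illustration only, the sketch below walks these steps in Python with pandas and scikit-learn; the patient data, column names, and the choice of a logistic model are hypothetical, not part of the module’s materials.

```python
# A minimal sketch of the data science process, using a small hypothetical
# dataset of patients' fasting blood sugar (FBS) before and after follow-up.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1. Raw data is collected (here: a hypothetical in-memory sample).
raw = pd.DataFrame({
    "age":        [45, 60, 38, 52, 70, 41],
    "fbs_before": [115, 160, 98, 130, 170, 105],
    "diet_ctrl":  [1, 0, 1, 1, 0, 0],      # 1 = controlled diet
    "fbs_after":  [90, 155, 95, 110, 168, 120],
})

# 2. Process and clean: drop impossible values, derive the outcome.
clean = raw[(raw.fbs_before > 0) & (raw.fbs_after > 0)].copy()
clean["improved"] = (clean.fbs_after < clean.fbs_before).astype(int)

# 3. Exploratory data analysis: summary statistics.
print(clean.describe())

# 4. Models and algorithms: does diet control predict improvement?
model = LogisticRegression().fit(clean[["age", "diet_ctrl"]], clean["improved"])

# 5. Data product to communicate: a reusable scoring function.
print(model.predict_proba([[50, 1]]))  # P(improved) for a 50-year-old dieter
```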
Figure 3: data science process (raw data is collected → data is processed → clean data → exploratory data analysis → models & algorithms → data product → communicate → make decision)
Ref: http://www.kdnuggets.com/2016/03/data-science-process.html
II. Data warehouse and visualization
In the data science process, we discussed collecting, processing, and cleaning data (called ETL in data warehousing), and also the importance of communicating with the audience for decision making. We begin with a comparison of the 2 forms in which information is kept, each serving a different purpose.
Information is mainly kept in 2 forms: the operational systems of record and the data warehouse. The operational systems are where the data are put in, and they almost always deal with one record at a time; the data warehouse is where data from different operational systems are integrated, and it almost never deals with one row at a time. Table 1 compares operational systems and the data warehouse.
Table 1: comparison of operational systems and data warehouse

Area of comparison | Operational systems | Data warehouse
Purpose of data | Daily business tasks | Analysis, planning, decision support
Function | Day-to-day operation, detailed data | Long-term information, summarized data
Design | Application oriented, real time | Subject oriented, depends on the length of the cycle of data supplements to the warehouse
Access | Read and write | Mostly read
Size | 100 MB to GB | 100 GB to TB
What is a data warehouse
A data warehouse is a central repository of integrated data from one or more disparate transactional data sources, such as a relational database or an enterprise resource planning system. Figure 4 shows the basic data warehousing environment, starting from the transactional data sources and using a process called ETL, which:
- Extracts data from the transactional data sources, normally keeping it temporarily in staging tables;
- Transforms the data into the proper format for querying and analysis; and
- Loads it into the final target, which is designed and modeled in dimensional format.
Then, at the client side, a user retrieves data that is already in the data warehouse to create their own dashboard or report, for either exploratory or analytic purposes.
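As a minimal illustration of ETL, the Python sketch below uses the built-in sqlite3 module; the table and column names are hypothetical stand-ins, and real warehouses would use the dedicated ETL tools listed under figure 4.

```python
# A minimal ETL sketch: extract from a transactional source, transform to the
# fact-table grain, load into a dimensional target. Schemas are hypothetical.
import sqlite3

src = sqlite3.connect(":memory:")  # stands in for the transactional system
dwh = sqlite3.connect(":memory:")  # stands in for the warehouse

# Hypothetical operational data: one row per patient visit.
src.execute("CREATE TABLE visits (visit_date TEXT, clinic TEXT, scheme TEXT)")
src.executemany("INSERT INTO visits VALUES (?, ?, ?)",
                [("2016-01-01", "medicine", "NHSO"),
                 ("2016-01-01", "medicine", "NHSO"),
                 ("2016-01-01", "surgery", "CSMBS")])

# Extract: pull the day's records from the operational source (staging).
rows = src.execute("SELECT visit_date, clinic, scheme FROM visits").fetchall()

# Transform: aggregate record-at-a-time data to the grain of the fact table
# (number of visits per date, clinic, and health scheme).
counts = {}
for key in rows:
    counts[key] = counts.get(key, 0) + 1

# Load: insert into a fact table modeled in dimensional (star) format.
dwh.execute("""CREATE TABLE fact_visits
               (date_key TEXT, clinic_key TEXT, scheme_key TEXT, n_visits INT)""")
dwh.executemany("INSERT INTO fact_visits VALUES (?, ?, ?, ?)",
                [k + (v,) for k, v in counts.items()])
print(dwh.execute("SELECT * FROM fact_visits").fetchall())
```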
- Basic data warehousing environment
Figure 4 explains the basic data warehousing environment. From left to right, data in various forms are extracted, transformed, and loaded into a data warehouse, which consists of many data marts. The ETL process, or metadata management, deals with the ODS (operational data store, a mirror or backup of the transactional database), the staging tables, which are temporary databases, and the master tables, which mainly keep the master data warehouse dimensions.
Figure 4: basic data warehouse environment
Figure 4 depicts, from left to right: transactional data sources (ERP, relational DB/HIS, flat files, other sources); the metadata management layer (ODS, staging tables, master tables, metadata); the data warehouse with its data marts; and the query/report/visualization clients (dashboards in BI tools, pivots in MS Excel®). The skills and tools associated with each stage are:

Stage | Transactional data sources | Metadata management | Dimensional modeling | Query/Report/Visualization
Skills | DB design and administration, SQL | Extract, transform, load (ETL) | Dimensional modeling | Multidimensional queries, data mining, predictive analysis
Tools | MS SQL Server, MySQL, Oracle 11g, MS Access, etc. | IBM Data Manager, Informatica, Oracle ODI, SAS DI Studio | IBM Framework Manager, Oracle Warehouse Builder, DB2, SQL Server Integration Services | IBM Cognos Business Objects, MS Analysis Services (MS Power BI, MS PowerPivot, QlikSense, Tableau)
DB = database, HIS = hospital information system, ERP = enterprise resource planning,
ODS = operational data store
- Data mart and its components
A simple form of data warehouse focused on a single functional area is called a data mart, or cube. A data mart is designed in dimensional format as a fact table, which comprises measures and dimensions.
Typically, measures are values that can be aggregated, and dimensions are groups of hierarchies that describe the facts. For example, in figure 5, the number of visits is a measure; date, clinic, and health scheme are dimensions. A dimension may have zero, one, or more hierarchies. Health scheme has no hierarchy. Clinic has one hierarchy with one level, which means clinic can be drilled up to building. Date has 2 hierarchies, calendar year and fiscal year, and each hierarchy has several levels, i.e., week, month, and year. In addition, the date dimension has one attribute, day (Monday-Sunday).
Figure 5: a data mart in star schema
Figure 6 shows a data mart as a cube reporting the number of visits in 3 dimensions: date on the X axis, clinic on the Y axis, and health scheme on the Z axis. Each box contains a number of visits, e.g., on 1/1/16, 22 NHSO patients visited the medicine clinic (3 dimensions: date, clinic, and health scheme). A cube can accommodate the data of whatever dimensions define a business problem. When a dimension is dropped, the measure is summed as boxes are combined, e.g., 119 patients visited the medicine clinic on 1/1/16 (2 dimensions: clinic and date); when drilling up, 198 patients visited building 1 on 1/1/16. The sketch below reproduces these aggregations.
Figure 6: a data mart as a cube
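As an illustration, the pandas sketch below reproduces the cube’s aggregations; the fact rows are hypothetical values chosen so that the totals match the figures quoted above (119 and 198).

```python
# Figure 6's cube in pandas: each row of the fact table is one box
# (date x clinic x health scheme); changing dimensions re-aggregates.
import pandas as pd

fact = pd.DataFrame({
    "date":   ["1/1/16"] * 4,
    "clinic": ["medicine", "medicine", "surgery", "surgery"],
    "scheme": ["NHSO", "CSMBS", "NHSO", "CSMBS"],
    "visits": [22, 97, 40, 39],   # hypothetical measure values
})

# 3 dimensions: one box, e.g. 22 NHSO visits to medicine on 1/1/16.
print(fact)

# 2 dimensions (drop health scheme): boxes are combined by summing the
# measure, e.g. 119 visits to the medicine clinic on 1/1/16 (22 + 97).
print(fact.groupby(["date", "clinic"])["visits"].sum())

# Drill up the clinic hierarchy to building, then sum again,
# e.g. 198 visits to building 1 on 1/1/16.
building = {"medicine": "building 1", "surgery": "building 1"}
fact["building"] = fact["clinic"].map(building)
print(fact.groupby(["date", "building"])["visits"].sum())
```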
Big data and data lake
With the growth of data in the last decade, a new term for the data management system has emerged: the “data lake”. Table 2 compares the key differences between a data warehouse and a data lake.
Table 2: comparison of data warehouse and data lake

Area of comparison | Data warehouse | Data lake
Data structure | Structured | Structured and unstructured
Data type | Cleansed/aggregated | Raw
Data volume | Large (terabytes) | Extremely large (petabytes)
Access methods | SQL | NoSQL
However, a data lake can be added to a data system alongside the data warehouse to maximize the use of data. In figure 7, Hadoop is added to retrieve data from unstructured data sources; a sketch of querying such data follows the figure.
Figure 7: basic data warehouse environment plus data lake architecture (the figure 4 pipeline plus an unstructured data file source feeding the data system)
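As a hedged illustration of querying a data lake, the sketch below uses PySpark, which runs over Hadoop-style storage; the file path and field names are hypothetical.

```python
# "Schema on read" analysis over raw files in a data lake with PySpark,
# in contrast to the warehouse's cleansed, pre-modeled ("schema on write") data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Read raw JSON clinical notes straight from the lake: no upfront ETL is
# needed, the schema is inferred when the data are read.
notes = spark.read.json("hdfs:///lake/clinical_notes/*.json")

# Query the raw data with SQL; a result like this could then be loaded into
# a data mart next to the warehouse's structured content (figure 7).
notes.createOrReplaceTempView("notes")
spark.sql("SELECT clinic, COUNT(*) AS n FROM notes GROUP BY clinic").show()
```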
Data visualization
Following the last column of figures 4 and 7, where transformed data become information and knowledge, data visualization, which is both an art and a science, is the step of the data science process (figure 3) that communicates information, knowledge, or even data products clearly and efficiently to audiences. Effective visualization helps users analyze data and draw evidence from them. To visualize data, we need to understand the data we are trying to visualize, know what the audience wants to learn, and then use a visual in the best and simplest form to convey the information.
There are many tools for data visualization, from simple tools such as MS Excel, to small BI (business intelligence) tools such as MS Power BI, QlikSense, and Tableau, to large BI tools such as IBM Cognos. Figures 8 and 9 are examples of using pivot tools in MS Excel and a dashboard in MS Power BI to present data from a data mart; a scripted alternative follows figure 9.
Figure 8: pivot table and chart in MS Excel®
Figure 9: a dashboard designed in MS Power BI®
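Beyond the BI tools above, a chart can also be scripted. The sketch below draws a simple bar chart with Python’s matplotlib; the clinic names and visit counts are hypothetical.

```python
# A minimal scripted visualization: visits per clinic from a data mart.
import matplotlib.pyplot as plt

clinics = ["medicine", "surgery", "pediatrics"]
visits = [119, 79, 54]          # hypothetical measures from the fact table

fig, ax = plt.subplots()
ax.bar(clinics, visits)         # simplest form that conveys the comparison
ax.set_xlabel("Clinic")
ax.set_ylabel("Number of visits")
ax.set_title("Visits by clinic, 1/1/16")
plt.show()
```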
- Infographic
Infographic is a combination of 2 words, information and graphic. It is a kind of data visualization composed of three parts: visual, content, and knowledge. The visual concerns how to make an attractive and memorable graphic, since vision is the sense through which humans receive significantly more information than through any of the other four (touch, hearing, smell, taste). The content must be a statistically proven fact and must be able to transfer the knowledge to audiences. Figure 10 is a sample infographic from the WHO in 2013.
Figure 10: a sample of infographic from WHO
III. Machine learning algorithm and big data analytic
What are machine learning algorithms
In previous sections, we discussed how to manage big data to get information. In this section, we move on to how to transform data or information into knowledge with a set of algorithms called “machine learning”.
Figure 11: machine learning and its combination
AI = Artificial Intelligence, KDD = Knowledge Discovery and Data mining
As statisticians, we may ask what the difference is between statistical modelling and machine learning. The answer is that statistical models are formalizations of relationships between variables in the form of mathematical equations, while machine learning algorithms can learn from data without relying on rules-based programming.
Machine learning algorithms are generally divided into 2 major categories (descriptive and predictive) applied to 2 major types of data (continuous and categorical), as shown with sample techniques in table 3. The objective of descriptive tasks is to derive patterns that summarize the underlying relationships in the data; they are often exploratory in nature, and the results need to be validated and explained. Predictive tasks aim to predict the value of a particular attribute based on the values of other attributes.
Table 3: machine learning techniques/algorithms

Data type | Descriptive tasks (unsupervised) | Predictive tasks (supervised)
Continuous | Clustering | Regression
Categorical | Association | Classification
Choosing the best algorithm to use for a specific analytical task can be a
challenge. While we can use different algorithms to perform the same business task,
each algorithm produces a different result, and some algorithms can produce more than
one type of result. For example, we can use the Microsoft Decision Trees algorithm not
only for prediction, but also as a way to reduce the number of columns in a dataset,
because the decision tree can identify columns that do not affect the final mining model.
In the machine learning process there are 6 steps, as shown in figure 12 (the CRISP-DM model): 1. understanding the business and the type of problem; 2. understanding the data (which may come from different sources); 3. preparing the data (ETL); 4. creating the model; 5. evaluating the model; and 6. deploying it.
Figure 12: CRISP-DM model
- Classification model
Classification consists of predicting a certain outcome based on a given input. To predict the outcome, the algorithm processes a training set containing a set of attributes and their respective outcomes. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the prediction set, which contains the same set of attributes except for the outcome to be predicted, is applied to test the classification model. Many algorithms are used for classification, such as k-NN, Naïve Bayes, and decision trees.
The k-NN, or k-nearest neighbors, algorithm uses information about an example’s k nearest neighbors to classify an unknown outcome. Figure 13 is an example of diagnosing breast cancer with the k-NN algorithm. With 2 dimensional attributes, texture and radius, each dot represents a malignant (m) or benign (b) case. To classify x as m or b, k-NN calculates the distances from x to the labeled points and decides the outcome for x from its nearest neighbors.
Figure 13: an example of k-NN algorithm
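A from-scratch sketch of the algorithm in Python is shown below; the (texture, radius) training points and the choice k = 3 are hypothetical.

```python
# k-NN for the figure 13 setting: classify an unknown point x by majority
# vote of its k nearest labeled neighbours in (texture, radius) space.
from collections import Counter
import math

train = [((2.0, 3.0), "b"), ((2.5, 2.5), "b"), ((3.0, 3.5), "b"),
         ((7.0, 8.0), "m"), ((8.0, 7.5), "m"), ((7.5, 9.0), "m")]

def knn_classify(x, train, k=3):
    # Sort training examples by Euclidean distance to x, take the k nearest,
    # and return the most common label among them.
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((3.0, 3.0), train))  # -> 'b' (benign neighbourhood)
print(knn_classify((7.8, 8.2), train))  # -> 'm' (malignant neighbourhood)
```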
Of the other two algorithms, Naïve Bayes (a Bayesian method) uses the training data to calculate the probability of an unknown outcome via Bayes’ theorem, P(A|B) = P(B|A)P(A) / P(B), as in the worked example below, while a decision tree uses a tree structure to model the relationships among attributes and outcomes.
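A small worked example of the formula, with hypothetical probabilities, follows.

```python
# Worked example of P(A|B) = P(B|A) P(A) / P(B), with hypothetical numbers:
# A = patient has the disease, B = test is positive.
p_a = 0.10          # prior: 10% of patients have the disease
p_b_given_a = 0.90  # test sensitivity
p_b = 0.17          # overall rate of positive tests

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.529: P(disease | positive test)
```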
- Regression model
While classification algorithms apply to categorical outcomes, regression algorithms apply to continuous ones in supervised models. The algorithm is the same as in a statistics class: independent variables are used to predict a dependent variable, as in the sketch below.
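As an illustration, the sketch below fits an ordinary least squares line with NumPy; the age and FBS values are hypothetical.

```python
# Regression sketch: predict a continuous outcome (FBS) from an independent
# variable (age) by ordinary least squares.
import numpy as np

age = np.array([35, 42, 50, 58, 65, 71])
fbs = np.array([92, 98, 110, 118, 131, 140])

# Fit fbs = slope * age + intercept (degree-1 polynomial).
slope, intercept = np.polyfit(age, fbs, deg=1)
print(f"predicted FBS at age 60: {slope * 60 + intercept:.1f}")
```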
- Cluster analysis
Clustering is an unsupervised model that divides data into clusters. Unlike classification, where we have training data with known outcomes and predict outcomes for testing data, clustering has no predefined outcomes. For example, in figure 14, as medical staff we want to organize diabetes patients into 3 groups, based on age and blood sugar level, to help them learn how to control their diet and exercise.
Figure 14: an example of diabetes patients
The most common algorithm for cluster analysis is k-means. k-means first assigns each of the n examples to one of k clusters; it then tries to minimize the differences within each cluster and maximize the differences between clusters. Figure 15 shows the result of a k-means cluster analysis in which patients are divided into 3 groups based on the similarity of their age and blood sugar level; a code sketch follows the figure.
Figure 15: an example of diabetes patients
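As an illustration, the sketch below runs k-means with scikit-learn on hypothetical (age, blood sugar) values; in practice the features should be standardized first, since blood sugar would otherwise dominate the Euclidean distance.

```python
# k-means for figure 15: divide diabetes patients into k = 3 clusters by
# age and blood sugar level. Patient values are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: age (years), blood sugar (mg/dL)
patients = np.array([[25, 100], [30, 110], [28, 105],
                     [55, 150], [60, 160], [58, 155],
                     [45, 200], [50, 210], [48, 205]])

# Standardize so blood sugar does not dominate the distance calculation.
X = StandardScaler().fit_transform(patients)

# k-means assigns each of the n examples to one of k clusters, then iterates
# to minimize within-cluster and maximize between-cluster differences.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment (0, 1, or 2) for each patient
```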
- Association analysis
Association, or market basket, analysis is another unsupervised model; it finds relationships among categorical variables in a dataset. Table 4 is an example of 5 prescriptions from one clinic.
Table 4: an example of drugs in prescriptions

Rx no. | Drug items
1 | {PPI, NSAIDs, Calcium}
2 | {Antidepressant, NSAIDs, Antianxiety, Muscle relaxant}
3 | {NSAIDs, Muscle relaxant, PPI}
4 | {Antidepressant, Antianxiety, Calcium}
5 | {NSAIDs, PPI, Calcium}
Looking at this dataset of only 5 prescriptions, we may guess some patterns: Rx nos. 1, 3, and 5 are for orthopedic patients, while Rx nos. 2 and 4 are for psychiatric patients. Applying similar rules to large transaction databases, association analysis uses statistical measures (the support and confidence measures) to locate associations of items and group them into the same basket. The most common method is the Apriori approach.
The Apriori approach is an association rule mining method based on the principle of frequent pattern mining. Performing an Apriori analysis involves 2 steps, as follows:
1. Generate the candidate set: the first step finds itemsets that occur with a frequency exceeding a specified threshold (the support measure) in the dataset, that is:
Support = (number of observations having A ∩ B) / (total number of observations)
2. Derive the association rules: the second step analyses the itemsets in the candidate set to mine association rules, which indicate conditional probabilities between pairs of item groups. Rules are generated from pairs whose conditional probability exceeds a user-defined threshold (the confidence measure), that is:
Confidence = (number of observations having A ∩ B) / (number of observations having A)
Both measures are computed in the sketch below.
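The sketch below computes both measures in Python on the five prescriptions of table 4, for the candidate rule {NSAIDs} → {PPI}.

```python
# Support and confidence from table 4: five prescription "baskets".
prescriptions = [
    {"PPI", "NSAIDs", "Calcium"},
    {"Antidepressant", "NSAIDs", "Antianxiety", "Muscle relaxant"},
    {"NSAIDs", "Muscle relaxant", "PPI"},
    {"Antidepressant", "Antianxiety", "Calcium"},
    {"NSAIDs", "PPI", "Calcium"},
]

def support(itemset, baskets):
    # Fraction of all baskets that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    # Of the baskets containing the antecedent, the fraction that also
    # contain the consequent: P(consequent | antecedent).
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"NSAIDs", "PPI"}, prescriptions))       # 3/5 = 0.6
print(confidence({"NSAIDs"}, {"PPI"}, prescriptions))  # 3/4 = 0.75
```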