Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University
Semester I Academic year 2016 www.ceb-rama.org
MEDICAL INFORMATICS & DATABASE MANAGEMENT
MODULE 5: BIG DATA MANAGEMENT AND ANALYSIS
DR. ORALUCK PATTANAPRATEEP
RACE 614 Medical Informatics &
Database Management
Module 5: Big data management and analysis
Contents
Objectives
References
I. Big data and data science
   Why big data
   Data science
II. Data warehouse and visualization
   What is a data warehouse
   - Basic data warehousing environment
   - Data mart and its components
   Big data and data lake
   Data visualization
   - Infographic
III. Machine learning algorithm and big data analytic
   What are machine learning algorithms
   - Classification model
   - Regression model
   - Cluster analysis
   - Association analysis
Objectives
Students should be able to:
1. Understand the concepts of big data, data science, and data warehousing
2. Apply the data science process to big data problems
3. Select appropriate data visualizations to communicate analytic insights clearly to audiences
4. Apply appropriate machine learning algorithms to analyze big data
References
1. Lantz B. Machine learning with R (2nd edition). Packt Publishing; 2015.
2. Provost F, Fawcett T. Data science for business. O’Reilly Media, Inc.; 2013.
3. Reeves LL. A manager’s guide to data warehousing. Wiley Publishing, Inc.; 2009.
4. Berka P, Rauch J, Zighed DA. Data mining and medical knowledge management: cases and applications. Information Science Reference; 2009.
5. Han J, Kamber M. Data mining: concepts and techniques (2nd edition). Morgan Kaufmann Publishers, CA, USA; 2006.
6. Kimball R, Ross M. The data warehouse toolkit: the complete guide to dimensional modeling (2nd edition). Wiley Computer Publishing; 2002.
In previous modules, we explored the management of data from primary sources, starting with designing record forms and managing a database in Epidata®. In the real world, however, another source of data is secondary data, especially in electronic format, which has recently grown greatly in size. It is increasingly gathered by high-performance and convenient devices, and is therefore called “big data”.
In this final module, we cover three domains of big data:
1. Big data and data science: introduces the concept of adding value to data and presents data science, the new era of data management.
2. Data warehouse and visualization: provides the concepts of building a data warehouse and demonstrates how to communicate the findings.
3. Machine learning algorithm and big data analytics: explores how to mine data with 4 main machine learning algorithms.
I. Big data and data science
How do we find information and knowledge in data, or big data? Figure 1 demonstrates the value added as we move from meaningless raw data at the base of the pyramid to meaningful information, knowledge, and wisdom. For example, 2 numbers at the raw data level, 115 and 90, have no meaning without any clue. By adding meaning to the numbers, we find the relationship between them: it is a fasting blood sugar (FBS) that decreases from 115 to 90. But the next question is whether a lower FBS is good or bad.
Figure 1: from data to wisdom (a pyramid: raw data, “115, 90”; data with meaning, “FBS decreases from 115 to 90”; information, understanding relations via the context “FBS should be less than 100”; knowledge, understanding patterns; wisdom, understanding principles, “controlling the diet will improve the patient’s health”)
By adding the context that FBS should be less than 100, the information we found, “FBS decreases from 115 to 90”, becomes good news. Moreover, tools and techniques called “machine learning algorithms” may be applied to understand patterns and predict the future. From this example, we may find patterns of patients who control their diet well and
predict their level of FBS. Finally, at the top, we may conclude that “controlling one’s diet will improve the patient’s health”.
Why big data
Big data simply means datasets that are too large, too varied, and too rapid for traditional data processing systems. In the past, we processed only small volumes of data of a single type, in overnight batches, to find information and knowledge. With today’s hardware performance and technology, data are generated in many forms and at high volume, and can be stored, retrieved, and analyzed far more rapidly. Big data can therefore be described mainly by the following 3 characteristics:
- Volume: the amount of data, from a few to millions of records, from one to hundreds of tables.
- Variety: the range of data types and sources, from structured to unstructured, from text to image.
- Velocity: the speed of data in and out, from batch to real time.
Data science
Once massive data in flexible forms can be processed in a few minutes, more opportunities arise to find information and discover knowledge. The next questions are who will perform the analysis and what special skills they need.
Figure 2: data science Venn diagram 2.0 (circles: computer science, maths and statistics, subject matter expertise; pairwise overlaps: machine learning, traditional research, traditional software; the three-way intersection, data science, is the domain of the rare “unicorn”)
Ref: http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
The three essential skills of a data scientist (figure 2) are 1. computer science, 2. maths and statistics, and 3. subject matter expertise (Venn diagram 2.0). Since it is quite hard to find one person who is skilled in all 3 areas, most data scientists work in teams.
The data science process (figure 3) starts with collecting raw data from real-world situations, e.g. human behavior, financial transactions, or medicine utilization. A hypothesis is then formulated, and the data are processed and cleaned for exploratory data analysis, which may take the form of summary statistics or graphs. If the data are insufficient or cannot answer the question, more data should be collected, processed, and cleaned. The next step is building models with machine learning algorithms. The final outcome of the data science process is a data product, i.e. value-added data. Throughout the process, the key activity is communicating with the audience so that they can make decisions, for example through reports or dashboards.
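As an illustration only, the sketch below walks these steps in Python with pandas and scikit-learn; the patient data, column names, and the choice of a logistic model are hypothetical, not part of the module’s materials.

```python
# A minimal sketch of the data science process, using a small hypothetical
# dataset of patients' fasting blood sugar (FBS) before and after follow-up.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1. Raw data is collected (here: a hypothetical in-memory sample).
raw = pd.DataFrame({
    "age":        [45, 60, 38, 52, 70, 41],
    "fbs_before": [115, 160, 98, 130, 170, 105],
    "diet_ctrl":  [1, 0, 1, 1, 0, 0],      # 1 = controlled diet
    "fbs_after":  [90, 155, 95, 110, 168, 120],
})

# 2. Process and clean: drop impossible values, derive the outcome.
clean = raw[(raw.fbs_before > 0) & (raw.fbs_after > 0)].copy()
clean["improved"] = (clean.fbs_after < clean.fbs_before).astype(int)

# 3. Exploratory data analysis: summary statistics.
print(clean.describe())

# 4. Models and algorithms: does diet control predict improvement?
model = LogisticRegression().fit(clean[["age", "diet_ctrl"]], clean["improved"])

# 5. Data product to communicate: a reusable scoring function.
print(model.predict_proba([[50, 1]]))  # P(improved) for a 50-year-old dieter
```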
Figure 3: data science process (raw data is collected → data is processed → clean data → exploratory data analysis → models & algorithms → data product → communicate → make decision)
Ref: http://www.kdnuggets.com/2016/03/data-science-process.html
II. Data warehouse and visualization
In the data science process, we discussed collecting, processing, and cleaning data (called ETL in data warehousing), and also the importance of communicating with the audience for decision making. We begin with a comparison of the 2 forms in which information is kept, each serving a different purpose.
Information is mainly kept in 2 forms: the operational systems of record and the data warehouse. The operational systems are where the data are put in, and they almost always deal with one record at a time; the data warehouse is where data from different operational systems are integrated, and it almost never deals with one row at a time. Table 1 compares operational systems and the data warehouse.
Table 1: comparison of operational systems and data warehouse

Area of comparison | Operational systems | Data warehouse
Purpose of data | Daily business tasks | Analysis, planning, decision support
Function | Day-to-day operation, detailed data | Long-term information, summarized data
Design | Application oriented, real time | Subject oriented, depends on the length of the cycle of data supplements to the warehouse
Access | Read and write | Mostly read
Size | 100 MB to GB | 100 GB to TB
What is a data warehouse
A data warehouse is a central repository of integrated data from one or more disparate transactional data sources, such as a relational database or an enterprise resource planning system. Figure 4 shows the basic data warehousing environment, starting from the transactional data sources and using a process called ETL, which:
- Extracts data from the transactional data sources, normally keeping it temporarily in staging tables;
- Transforms the data into the proper format for querying and analysis; and
- Loads it into the final target, which is designed and modeled in dimensional format.
Then, at the client side, a user retrieves data that is already in the data warehouse to create their own dashboard or report, for either exploratory or analytic purposes.
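As a minimal illustration of ETL, the Python sketch below uses the built-in sqlite3 module; the table and column names are hypothetical stand-ins, and real warehouses would use the dedicated ETL tools listed under figure 4.

```python
# A minimal ETL sketch: extract from a transactional source, transform to the
# fact-table grain, load into a dimensional target. Schemas are hypothetical.
import sqlite3

src = sqlite3.connect(":memory:")  # stands in for the transactional system
dwh = sqlite3.connect(":memory:")  # stands in for the warehouse

# Hypothetical operational data: one row per patient visit.
src.execute("CREATE TABLE visits (visit_date TEXT, clinic TEXT, scheme TEXT)")
src.executemany("INSERT INTO visits VALUES (?, ?, ?)",
                [("2016-01-01", "medicine", "NHSO"),
                 ("2016-01-01", "medicine", "NHSO"),
                 ("2016-01-01", "surgery", "CSMBS")])

# Extract: pull the day's records from the operational source (staging).
rows = src.execute("SELECT visit_date, clinic, scheme FROM visits").fetchall()

# Transform: aggregate record-at-a-time data to the grain of the fact table
# (number of visits per date, clinic, and health scheme).
counts = {}
for key in rows:
    counts[key] = counts.get(key, 0) + 1

# Load: insert into a fact table modeled in dimensional (star) format.
dwh.execute("""CREATE TABLE fact_visits
               (date_key TEXT, clinic_key TEXT, scheme_key TEXT, n_visits INT)""")
dwh.executemany("INSERT INTO fact_visits VALUES (?, ?, ?, ?)",
                [k + (v,) for k, v in counts.items()])
print(dwh.execute("SELECT * FROM fact_visits").fetchall())
```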
- Basic data warehousing environment
Figure 4 explains the basic data warehousing environment. From left to right, data in various forms are extracted, transformed, and loaded into a data warehouse, which consists of many data marts. The ETL process, or metadata management, deals with the ODS (operational data store, a mirror or backup of the transactional database), the staging tables, which are temporary databases, and the master tables, which mainly keep the master data warehouse dimensions.
Figure 4: basic data warehouse environment
Figure 4 depicts, from left to right: transactional data sources (ERP, relational DB/HIS, flat files, other sources); the metadata management layer (ODS, staging tables, master tables, metadata); the data warehouse with its data marts; and the query/report/visualization clients (dashboards in BI tools, pivots in MS Excel®). The skills and tools associated with each stage are:

Stage | Transactional data sources | Metadata management | Dimensional modeling | Query/Report/Visualization
Skills | DB design and administration, SQL | Extract, transform, load (ETL) | Dimensional modeling | Multidimensional queries, data mining, predictive analysis
Tools | MS SQL Server, MySQL, Oracle 11g, MS Access, etc. | IBM Data Manager, Informatica, Oracle ODI, SAS DI Studio | IBM Framework Manager, Oracle Warehouse Builder, DB2, SQL Server Integration Services | IBM Cognos Business Objects, MS Analysis Services (MS Power BI, MS PowerPivot, QlikSense, Tableau)
DB = database, HIS = hospital information system, ERP = enterprise resource planning,
ODS = operational data store
- Data mart and its components
A simple form of data warehouse focused on a single functional area is called a data mart, or cube. A data mart is designed in dimensional format as a fact table, which comprises measures and dimensions.
Typically, measures are values that can be aggregated, and dimensions are groups of hierarchies that describe the facts. For example, in figure 5, the number of visits is a measure; date, clinic, and health scheme are dimensions. A dimension may have zero, one, or more hierarchies. Health scheme has no hierarchy. Clinic has one hierarchy with one level, which means clinic can be drilled up to building. Date has 2 hierarchies, calendar year and fiscal year, and each hierarchy has several levels, i.e., week, month, and year. In addition, the date dimension has one attribute, day (Monday-Sunday).
Figure 5: a data mart in star schema
Figure 6 shows a data mart as a cube reporting the number of visits in 3 dimensions: date on the X axis, clinic on the Y axis, and health scheme on the Z axis. Each box contains a number of visits, e.g., on 1/1/16, 22 NHSO patients visited the medicine clinic (3 dimensions: date, clinic, and health scheme). A cube can accommodate the data of whatever dimensions define a business problem. When a dimension is dropped, the measure is summed as boxes are combined, e.g., 119 patients visited the medicine clinic on 1/1/16 (2 dimensions: clinic and date); when drilling up, 198 patients visited building 1 on 1/1/16. The sketch below reproduces these aggregations.
Figure 6: a data mart as a cube
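As an illustration, the pandas sketch below reproduces the cube’s aggregations; the fact rows are hypothetical values chosen so that the totals match the figures quoted above (119 and 198).

```python
# Figure 6's cube in pandas: each row of the fact table is one box
# (date x clinic x health scheme); changing dimensions re-aggregates.
import pandas as pd

fact = pd.DataFrame({
    "date":   ["1/1/16"] * 4,
    "clinic": ["medicine", "medicine", "surgery", "surgery"],
    "scheme": ["NHSO", "CSMBS", "NHSO", "CSMBS"],
    "visits": [22, 97, 40, 39],   # hypothetical measure values
})

# 3 dimensions: one box, e.g. 22 NHSO visits to medicine on 1/1/16.
print(fact)

# 2 dimensions (drop health scheme): boxes are combined by summing the
# measure, e.g. 119 visits to the medicine clinic on 1/1/16 (22 + 97).
print(fact.groupby(["date", "clinic"])["visits"].sum())

# Drill up the clinic hierarchy to building, then sum again,
# e.g. 198 visits to building 1 on 1/1/16.
building = {"medicine": "building 1", "surgery": "building 1"}
fact["building"] = fact["clinic"].map(building)
print(fact.groupby(["date", "building"])["visits"].sum())
```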
Big data and data lake
With the growth of data in the last decade, a new term for the data management system has emerged: the “data lake”. Table 2 compares the key differences between a data warehouse and a data lake.
Table 2: comparison of data warehouse and data lake

Area of comparison | Data warehouse | Data lake
Data structure | Structured | Structured and unstructured
Data type | Cleansed/aggregated | Raw
Data volume | Large (terabytes) | Extremely large (petabytes)
Access methods | SQL | NoSQL
However, a data lake can be added to a data system alongside the data warehouse to maximize the use of data. In figure 7, Hadoop is added to retrieve data from unstructured data sources; a sketch of querying such data follows the figure.
Figure 7: basic data warehouse environment plus data lake architecture (the figure 4 pipeline plus an unstructured data file source feeding the data system)
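As a hedged illustration of querying a data lake, the sketch below uses PySpark, which runs over Hadoop-style storage; the file path and field names are hypothetical.

```python
# "Schema on read" analysis over raw files in a data lake with PySpark,
# in contrast to the warehouse's cleansed, pre-modeled ("schema on write") data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Read raw JSON clinical notes straight from the lake: no upfront ETL is
# needed, the schema is inferred when the data are read.
notes = spark.read.json("hdfs:///lake/clinical_notes/*.json")

# Query the raw data with SQL; a result like this could then be loaded into
# a data mart next to the warehouse's structured content (figure 7).
notes.createOrReplaceTempView("notes")
spark.sql("SELECT clinic, COUNT(*) AS n FROM notes GROUP BY clinic").show()
```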
Data visualization
Following the last column of figures 4 and 7, where transformed data become information and knowledge, data visualization, which is both an art and a science, is the step of the data science process (figure 3) that communicates information, knowledge, or even data products clearly and efficiently to audiences. Effective visualization helps users analyze data and draw evidence from them. To visualize data, we need to understand the data we are trying to visualize, know what the audience wants to learn, and then use a visual in the best and simplest form to convey the information.
There are many tools for data visualization, from simple tools such as MS Excel, to small BI (business intelligence) tools such as MS Power BI, QlikSense, and Tableau, to large BI tools such as IBM Cognos. Figures 8 and 9 are examples of using pivot tools in MS Excel and a dashboard in MS Power BI to present data from a data mart; a scripted alternative follows figure 9.
Figure 8: pivot table and chart in MS Excel®
Figure 9: a dashboard designed in MS Power BI®
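Beyond the BI tools above, a chart can also be scripted. The sketch below draws a simple bar chart with Python’s matplotlib; the clinic names and visit counts are hypothetical.

```python
# A minimal scripted visualization: visits per clinic from a data mart.
import matplotlib.pyplot as plt

clinics = ["medicine", "surgery", "pediatrics"]
visits = [119, 79, 54]          # hypothetical measures from the fact table

fig, ax = plt.subplots()
ax.bar(clinics, visits)         # simplest form that conveys the comparison
ax.set_xlabel("Clinic")
ax.set_ylabel("Number of visits")
ax.set_title("Visits by clinic, 1/1/16")
plt.show()
```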
- Infographic
Infographic is a combination of 2 words, information and graphic. It is a kind of data visualization composed of three parts: visual, content, and knowledge. The visual concerns how to make an attractive and memorable graphic, since vision is the sense through which humans receive significantly more information than through any of the other four (touch, hearing, smell, taste). The content must be a statistically proven fact and must be able to transfer the knowledge to audiences. Figure 10 is a sample infographic from the WHO in 2013.
Figure 10: a sample of infographic from WHO
III. Machine learning algorithm and big data analytic
What are machine learning algorithms
In previous sections, we discussed how to manage big data to get information. In this section, we move on to how to transform data or information into knowledge with a set of algorithms called “machine learning”.
Figure 11: machine learning and its combination
AI = Artificial Intelligence, KDD = Knowledge Discovery and Data mining
As statisticians, we may ask what the difference is between statistical modelling and machine learning. The answer is that statistical models are formalizations of relationships between variables in the form of mathematical equations, while machine learning algorithms can learn from data without relying on rules-based programming.
Machine learning algorithms are generally divided into 2 major categories (descriptive and predictive) applied to 2 major types of data (continuous and categorical), as shown with sample techniques in table 3. The objective of descriptive tasks is to derive patterns that summarize the underlying relationships in the data; they are often exploratory in nature, and the results need to be validated and explained. Predictive tasks aim to predict the value of a particular attribute based on the values of other attributes.
Table 3: machine learning techniques/algorithms

Data type | Descriptive tasks (unsupervised) | Predictive tasks (supervised)
Continuous | Clustering | Regression
Categorical | Association | Classification
Choosing the best algorithm to use for a specific analytical task can be a
challenge. While we can use different algorithms to perform the same business task,
each algorithm produces a different result, and some algorithms can produce more than
one type of result. For example, we can use the Microsoft Decision Trees algorithm not
only for prediction, but also as a way to reduce the number of columns in a dataset,
because the decision tree can identify columns that do not affect the final mining model.
In the machine learning process there are 6 steps, as shown in figure 12 (the CRISP-DM model): 1. understanding the business and the type of problem; 2. understanding the data (which may come from different sources); 3. preparing the data (ETL); 4. creating the model; 5. evaluating the model; and 6. deploying it.
Figure 12: CRISP-DM model
- Classification model
Classification consists of predicting a certain outcome based on a given input. To predict the outcome, the algorithm processes a training set containing a set of attributes and their respective outcomes. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the prediction set, which contains the same set of attributes except for the outcome to be predicted, is applied to test the classification model. Many algorithms are used for classification, such as k-NN, Naïve Bayes, and decision trees.
The k-NN, or k-nearest neighbors, algorithm uses information about an example’s k nearest neighbors to classify an unknown outcome. Figure 13 is an example of diagnosing breast cancer with the k-NN algorithm. With 2 dimensional attributes, texture and radius, each dot represents a malignant (m) or benign (b) case. To classify x as m or b, k-NN calculates the distances from x to the labeled points and decides the outcome for x from its nearest neighbors.
Figure 13: an example of k-NN algorithm
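A from-scratch sketch of the algorithm in Python is shown below; the (texture, radius) training points and the choice k = 3 are hypothetical.

```python
# k-NN for the figure 13 setting: classify an unknown point x by majority
# vote of its k nearest labeled neighbours in (texture, radius) space.
from collections import Counter
import math

train = [((2.0, 3.0), "b"), ((2.5, 2.5), "b"), ((3.0, 3.5), "b"),
         ((7.0, 8.0), "m"), ((8.0, 7.5), "m"), ((7.5, 9.0), "m")]

def knn_classify(x, train, k=3):
    # Sort training examples by Euclidean distance to x, take the k nearest,
    # and return the most common label among them.
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((3.0, 3.0), train))  # -> 'b' (benign neighbourhood)
print(knn_classify((7.8, 8.2), train))  # -> 'm' (malignant neighbourhood)
```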
Of the other two algorithms, Naïve Bayes (a Bayesian method) uses the training data to calculate the probability of an unknown outcome via Bayes’ theorem, P(A|B) = P(B|A)P(A) / P(B), as in the worked example below, while a decision tree uses a tree structure to model the relationships among attributes and outcomes.
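A small worked example of the formula, with hypothetical probabilities, follows.

```python
# Worked example of P(A|B) = P(B|A) P(A) / P(B), with hypothetical numbers:
# A = patient has the disease, B = test is positive.
p_a = 0.10          # prior: 10% of patients have the disease
p_b_given_a = 0.90  # test sensitivity
p_b = 0.17          # overall rate of positive tests

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.529: P(disease | positive test)
```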
- Regression model
While classification algorithms apply to categorical outcomes, regression algorithms apply to continuous ones in supervised models. The algorithm is the same as in a statistics class: independent variables are used to predict a dependent variable, as in the sketch below.
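As an illustration, the sketch below fits an ordinary least squares line with NumPy; the age and FBS values are hypothetical.

```python
# Regression sketch: predict a continuous outcome (FBS) from an independent
# variable (age) by ordinary least squares.
import numpy as np

age = np.array([35, 42, 50, 58, 65, 71])
fbs = np.array([92, 98, 110, 118, 131, 140])

# Fit fbs = slope * age + intercept (degree-1 polynomial).
slope, intercept = np.polyfit(age, fbs, deg=1)
print(f"predicted FBS at age 60: {slope * 60 + intercept:.1f}")
```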
- Cluster analysis
Clustering is an unsupervised model that divides data into clusters. Unlike classification, where we have training data with known outcomes and predict outcomes for testing data, clustering has no predefined outcomes. For example, in figure 14, as medical staff we want to organize diabetes patients into 3 groups, based on age and blood sugar level, to help them learn how to control their diet and exercise.
Figure 14: an example of diabetes patients
The most common algorithm for cluster analysis is k-means. k-means first assigns each of the n examples to one of k clusters; it then tries to minimize the differences within each cluster and maximize the differences between clusters. Figure 15 shows the result of a k-means cluster analysis in which patients are divided into 3 groups based on the similarity of their age and blood sugar level; a code sketch follows the figure.
Figure 15: an example of diabetes patients
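As an illustration, the sketch below runs k-means with scikit-learn on hypothetical (age, blood sugar) values; in practice the features should be standardized first, since blood sugar would otherwise dominate the Euclidean distance.

```python
# k-means for figure 15: divide diabetes patients into k = 3 clusters by
# age and blood sugar level. Patient values are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: age (years), blood sugar (mg/dL)
patients = np.array([[25, 100], [30, 110], [28, 105],
                     [55, 150], [60, 160], [58, 155],
                     [45, 200], [50, 210], [48, 205]])

# Standardize so blood sugar does not dominate the distance calculation.
X = StandardScaler().fit_transform(patients)

# k-means assigns each of the n examples to one of k clusters, then iterates
# to minimize within-cluster and maximize between-cluster differences.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment (0, 1, or 2) for each patient
```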
- Association analysis
Association, or market basket, analysis is another unsupervised model; it finds relationships among categorical variables in a dataset. Table 4 is an example of 5 prescriptions from one clinic.
Table 4: an example of drugs in prescriptions

Rx no. | Drug items
1 | {PPI, NSAIDs, Calcium}
2 | {Antidepressant, NSAIDs, Antianxiety, Muscle relaxant}
3 | {NSAIDs, Muscle relaxant, PPI}
4 | {Antidepressant, Antianxiety, Calcium}
5 | {NSAIDs, PPI, Calcium}
Looking at this dataset of only 5 prescriptions, we may guess some patterns: Rx nos. 1, 3, and 5 are for orthopedic patients, while Rx nos. 2 and 4 are for psychiatric patients. Applying similar rules to large transaction databases, association analysis uses statistical measures (the support and confidence measures) to locate associations of items and group them into the same basket. The most common method is the Apriori approach.
The Apriori approach is an association rule mining method based on the principle of frequent pattern mining. Performing an Apriori analysis involves 2 steps, as follows:
1. Generate the candidate set: the first step finds itemsets that occur with a frequency exceeding a specified threshold (the support measure) in the dataset, that is:
Support = (number of observations having A ∩ B) / (total number of observations)
2. Derive the association rules: the second step analyses the itemsets in the candidate set to mine association rules, which indicate conditional probabilities between pairs of item groups. Rules are generated from pairs whose conditional probability exceeds a user-defined threshold (the confidence measure), that is:
Confidence = (number of observations having A ∩ B) / (number of observations having A)
Both measures are computed in the sketch below.
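The sketch below computes both measures in Python on the five prescriptions of table 4, for the candidate rule {NSAIDs} → {PPI}.

```python
# Support and confidence from table 4: five prescription "baskets".
prescriptions = [
    {"PPI", "NSAIDs", "Calcium"},
    {"Antidepressant", "NSAIDs", "Antianxiety", "Muscle relaxant"},
    {"NSAIDs", "Muscle relaxant", "PPI"},
    {"Antidepressant", "Antianxiety", "Calcium"},
    {"NSAIDs", "PPI", "Calcium"},
]

def support(itemset, baskets):
    # Fraction of all baskets that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    # Of the baskets containing the antecedent, the fraction that also
    # contain the consequent: P(consequent | antecedent).
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"NSAIDs", "PPI"}, prescriptions))       # 3/5 = 0.6
print(confidence({"NSAIDs"}, {"PPI"}, prescriptions))  # 3/4 = 0.75
```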