20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Copyright © UCC 2014 September 2014 1 Copyright © UCC 2014 Prof. Barry O’Sullivan University College Cork [email protected]

description

Many tools, techniques, vendors, and research efforts exist to tackle the major issue of data management for analytics initiatives. Managing data still consumes significant effort in analytics projects. This challenge has a major effect in limiting the full potential of analytics capabilities for organisations.

Transcript of 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Page 1: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Prof. Barry O’Sullivan University College Cork

[email protected]

Page 2: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Theme 2 Data Management

Page 3: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

The Science of Better

Page 4: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Data Management for Analytics

Many tools, techniques, vendors, and research efforts exist to tackle the major issue of data management for analytics initiatives.

Managing data still consumes significant effort in analytics projects.

This challenge has a major effect in limiting the full potential of analytics capabilities for organisations.

Page 5: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Data Management - The Sensible Shoe of Data Analytics

Page 6: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Data Management for Analytics

Sub-themes:

– reduce data management effort for analytics
– data validation
– relevance of event(s) to relationships
– data curation (determining useful data)
– adaptive ETL

Page 7: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Focus of the Theme

Theme 2.1 Reduce data management effort for analytics:

– The goal of this theme is to develop approaches, methods and tools to improve, simplify and reduce the effort involved in the management of data for analytics purposes.

Theme 2.2: Data validation:

– The goal of this theme is to develop advanced analytics techniques and demonstrators to manage the validity and quality of the data subsequently being used for data analytics purposes.

Page 8: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Focus of the Theme

Theme 2.3 Relevance of event(s) to relationships:

– The goal of this theme is to develop analytical approaches, methods, models, and tools to understand and improve the relevance of an event to relationships (people, things, other data).

Theme 2.4: Data curation:

– The goal of this theme is to use advanced analytics techniques to determine which data may be considered “useful” to improve (among other things) data archiving and data storage approaches.

Page 9: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Focus of the Theme

Theme 2.5 Adaptive ETL:

– The goal of this research theme is to create tools, demonstrators, and advanced analytics techniques to prevent STP (Straight Through Processing) breaks by automatically compensating for changes in data received – Adaptive ETL.

Page 10: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Some CeADAR Technologies

Automated Data Management Workflows

Querying with Confidence

Test Database Generation

Supply-Chain Inventory Management

Process Analytics (later this morning)

Page 11: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Automated Data Management Workflows

Each action modelled with logical preconditions and effects

Develop a library of standard actions and processes

Planning software generates, validates and analyses new workflows

A software tool assists users in designing new workflows
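The idea of modelling each action with logical preconditions and effects, and letting planning software assemble a workflow, can be illustrated with a minimal sketch. The action names, the facts, and the breadth-first search below are all assumptions made for illustration; they are not the CeADAR library or planner.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    """A workflow step described by logical preconditions and effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


# Hypothetical library of standard data management actions.
LIBRARY = [
    Action("load", frozenset({"raw-file"}), frozenset({"loaded"}), frozenset()),
    Action("validate", frozenset({"loaded"}), frozenset({"validated"}), frozenset()),
    Action("deduplicate", frozenset({"validated"}), frozenset({"deduplicated"}), frozenset()),
    Action("export", frozenset({"validated", "deduplicated"}),
           frozenset({"exported"}), frozenset({"loaded"})),
]


def plan(initial, goal, actions):
    """Breadth-first forward search: a shortest list of action names, or None."""
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            # An action is applicable only when all its preconditions hold.
            if action.preconditions <= state:
                successor = (state - action.delete_effects) | action.add_effects
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, steps + [action.name]))
    return None


workflow = plan({"raw-file"}, {"exported"}, LIBRARY)
print(workflow)
```

Because every step is derived from declared preconditions and effects, the same search can also validate a hand-written workflow: a goal that no sequence of library actions can reach yields `None`.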

Page 12: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Technology solution

Workflow model

Abstract planning operators

Planning domain for new workflows

Plan and workflow analysis model

Tool for creation of new workflows

Workflow execution monitor

Page 13: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Principles of Planning

[ Malte Helmert ]

Page 14: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

An auto-generated workflow

Page 15: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Planning Data Analytics Tasks

[ Fernandez et al. ICKEPS, 2009 ]

Page 16: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Data Preparation Action

(:action datasetPreparation
  :parameters (?d - DataSet ?t - TestMode)
  :precondition (and (loaded ?d)
                     (can-learn ?d ?t))
  :effect (and (eval-on ?d ?t)
               (not (preprocess-on ?d))
               (not (loaded ?d))
               (increase (exec-time)
                         (* (thousandsofInstances)
                            (preparationFactor ?t)))))

[ Fernandez et al. ICKEPS, 2009 ]
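To make the semantics of the action above concrete, the following sketch applies it to a state by hand: it checks the two preconditions, removes the negated facts, adds `eval-on`, and increases the execution time. The state layout, the dataset name `iris`, the test mode `cross-validation`, and the numeric values are hypothetical; a real PDDL planner performs this bookkeeping internally.

```python
def dataset_preparation(state, d, t):
    """Apply datasetPreparation to a state; return None if preconditions fail."""
    facts = state["facts"]
    # :precondition (and (loaded ?d) (can-learn ?d ?t))
    if ("loaded", d) not in facts or ("can-learn", d, t) not in facts:
        return None
    # :effect, logical part: delete the negated facts, then add eval-on.
    new_facts = (facts - {("preprocess-on", d), ("loaded", d)}) | {("eval-on", d, t)}
    # :effect, numeric part: exec-time grows by instances times the mode's factor.
    cost = state["thousands_of_instances"] * state["preparation_factor"][t]
    return {**state, "facts": new_facts, "exec_time": state["exec_time"] + cost}


# Hypothetical initial state with illustrative values.
state = {
    "facts": frozenset({
        ("loaded", "iris"),
        ("can-learn", "iris", "cross-validation"),
        ("preprocess-on", "iris"),
    }),
    "exec_time": 0,
    "thousands_of_instances": 5,
    "preparation_factor": {"cross-validation": 2},
}

after = dataset_preparation(state, "iris", "cross-validation")
print(sorted(after["facts"]), after["exec_time"])
```

Applying the action a second time to `after` returns `None`, because the first application deleted `(loaded iris)`.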

Page 17: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Querying with Confidence

Database → Incorrect query → Incorrect results → Poor decisions

Database → Correct query → Correct results → Good decisions

Page 18: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Querying with Confidence

Process is important

E.g. many eyes

– Pair programming

– Code reviews

– Separation of the development and testing teams

But technology can help

Page 19: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Querying with Confidence

Query-Aware Test-Database Generator

Query Audit Tool

Page 20: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Query Audit Tool

Provenance

For a record r in the result of a query, its provenance is the set of records from the database that contribute to r

Page 21: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Query Audit Tool

PROJECT

PROJID  PROJNAME   PROJDURATION
1       ProductX   12
2       SystemY    4
3       StrategyZ  23

WORKS_ON

EMPID  PROJID  HOURS
11     1       10
11     2       7
12     1       16
12     2       2
12     3       19
13     1       4
14     2       6
14     3       6

EMPLOYEE

EMPID  EMPNAME  EMPJOB
11     Brown    Developer
12     White    Developer
13     Smith    Sysadmin
14     Jones    Sysadmin
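The provenance definition can be tried out on the three tables above. The sketch below loads them into an in-memory SQLite database, runs a hypothetical query (the job titles of people working on SystemY; this is not a query taken from the slides), and computes the provenance of one result record by re-running the same joins with all source columns exposed. This query-rewriting approach is only one simple way to obtain provenance and is an assumption, not a description of how the Query Audit Tool works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROJECT (PROJID INTEGER, PROJNAME TEXT, PROJDURATION INTEGER);
CREATE TABLE WORKS_ON (EMPID INTEGER, PROJID INTEGER, HOURS INTEGER);
CREATE TABLE EMPLOYEE (EMPID INTEGER, EMPNAME TEXT, EMPJOB TEXT);
INSERT INTO PROJECT VALUES (1,'ProductX',12),(2,'SystemY',4),(3,'StrategyZ',23);
INSERT INTO WORKS_ON VALUES (11,1,10),(11,2,7),(12,1,16),(12,2,2),
                            (12,3,19),(13,1,4),(14,2,6),(14,3,6);
INSERT INTO EMPLOYEE VALUES (11,'Brown','Developer'),(12,'White','Developer'),
                            (13,'Smith','Sysadmin'),(14,'Jones','Sysadmin');
""")

# Hypothetical user query: which job titles work on SystemY?
FROM_WHERE = """
FROM EMPLOYEE AS E
JOIN WORKS_ON AS W ON W.EMPID = E.EMPID
JOIN PROJECT AS P ON P.PROJID = W.PROJID
WHERE P.PROJNAME = 'SystemY'
"""

result = conn.execute(
    "SELECT DISTINCT E.EMPJOB " + FROM_WHERE + " ORDER BY E.EMPJOB"
).fetchall()


def provenance(job):
    """Source records, per table, that contribute to the result record (job,)."""
    rows = conn.execute(
        "SELECT E.EMPID, E.EMPNAME, E.EMPJOB, W.EMPID, W.PROJID, W.HOURS, "
        "P.PROJID, P.PROJNAME, P.PROJDURATION " + FROM_WHERE + " AND E.EMPJOB = ?",
        (job,),
    ).fetchall()
    return {
        "EMPLOYEE": {r[0:3] for r in rows},
        "WORKS_ON": {r[3:6] for r in rows},
        "PROJECT": {r[6:9] for r in rows},
    }


print(result)
print(provenance("Developer"))
```

The result record `('Developer',)` is supported by two employees (Brown and White), their two WORKS_ON rows for project 2, and the single SystemY project row, which shows why provenance is defined as a set of records rather than a single record.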

Page 22: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Query Audit Tool

[Screenshot: the user’s query and the query result; the user selects a record and the tool shows the provenance of the selected record]

Page 23: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Query Audit Tool

[Screenshot: the user’s query and the query result; the user selects a record and the tool shows the provenance of the selected record]

Page 24: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

“To test a query, run it against the database and compare the actual results with the expected results”

But what if your database has

– no data, or

– too much data?

Page 25: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

Query-aware generator

– Analyses the query

– Generates data

The data ‘exercises’ the query (similar to program testing)

Page 26: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

Page 27: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

Presently,

– The generator realises that the query is a 3-way join

– So it inserts records into each of the 3 tables

In future, it must create more records for more comprehensive testing
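The behaviour described above (recognise a 3-way join, then insert a record into each of the three tables) can be sketched as follows. The query is a hypothetical 3-way join over the PROJECT, WORKS_ON and EMPLOYEE tables, since the query on the slide is not reproduced in this transcript, and the regex-based `analyse` function is a deliberately crude stand-in for real query analysis.

```python
import itertools
import re
import sqlite3

SCHEMA = {
    "PROJECT": ["PROJID", "PROJNAME", "PROJDURATION"],
    "WORKS_ON": ["EMPID", "PROJID", "HOURS"],
    "EMPLOYEE": ["EMPID", "EMPNAME", "EMPJOB"],
}

# Hypothetical 3-way join; not the query shown on the slide.
QUERY = """
SELECT E.EMPNAME, P.PROJNAME, W.HOURS
FROM EMPLOYEE AS E, WORKS_ON AS W, PROJECT AS P
WHERE E.EMPID = W.EMPID AND W.PROJID = P.PROJID
"""


def analyse(query):
    """Return the tables used and the column pairs equated in the query."""
    aliases = {alias: table for table, alias in re.findall(r"(\w+)\s+AS\s+(\w+)", query)}
    joins = [((aliases[a], ca), (aliases[b], cb))
             for a, ca, b, cb in re.findall(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", query)]
    return sorted(set(aliases.values())), joins


def generate(schema, query):
    """One record per table; columns equated by the query share a value."""
    tables, joins = analyse(query)
    counter = itertools.count(1)
    values = {(t, c): next(counter) for t in tables for c in schema[t]}
    for left, right in joins:
        old, new = values[right], values[left]
        for key in list(values):
            if values[key] == old:
                values[key] = new  # merge the two columns into one value class
    return {t: tuple(values[(t, c)] for c in schema[t]) for t in tables}


rows = generate(SCHEMA, QUERY)

conn = sqlite3.connect(":memory:")
for table, columns in SCHEMA.items():
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
for table, row in rows.items():
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", row)

results = conn.execute(QUERY).fetchall()
print(rows)
print(results)
```

With a single record per table the join produces one result row, so the query is exercised at least once. More comprehensive testing would also need records that fail each join condition.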

Page 28: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

“Who works on every project?”

select distinct EMPID
from WORKS_ON as T1
where not exists
  (select PROJID
   from PROJECT
   where not exists
     (select EMPID
      from WORKS_ON as T2
      where T2.PROJID = PROJECT.PROJID
        and T2.EMPID = T1.EMPID));

Page 29: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

PROJECT

PROJID  PROJNAME  PROJDURATION
1       ProductX  12
2       SystemY   4

WORKS_ON

EMPID  PROJID  HOURS
11     1       10
11     2       7
12     1       16

EMPLOYEE

EMPID  EMPNAME  EMPJOB
11     Brown    Developer
12     White    Developer

Page 30: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Database Generator

PROJECT

PROJID  PROJNAME  PROJDURATION
(no rows)

WORKS_ON

EMPID  PROJID  HOURS
(no rows)

EMPLOYEE

EMPID  EMPNAME  EMPJOB
11     Brown    Developer
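The “Who works on every project?” query can be checked against the small test databases above, and against the larger tables from the Query Audit Tool slide, by running it in an in-memory SQLite database and comparing actual with expected results. The helper `who_works_on_every_project` and the choice of SQLite are assumptions for illustration. The EMPLOYEE table is omitted because the query never reads it.

```python
import sqlite3

QUERY = """
select distinct EMPID
from WORKS_ON as T1
where not exists
  (select PROJID
   from PROJECT
   where not exists
     (select EMPID
      from WORKS_ON as T2
      where T2.PROJID = PROJECT.PROJID
        and T2.EMPID = T1.EMPID));
"""


def who_works_on_every_project(projects, works_on):
    """Load one test database, run the query, return the sorted EMPIDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table PROJECT (PROJID integer, PROJNAME text, PROJDURATION integer)")
    conn.execute("create table WORKS_ON (EMPID integer, PROJID integer, HOURS integer)")
    conn.executemany("insert into PROJECT values (?, ?, ?)", projects)
    conn.executemany("insert into WORKS_ON values (?, ?, ?)", works_on)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return sorted(row[0] for row in rows)


# Tables from the Query Audit Tool slide.
full = who_works_on_every_project(
    [(1, "ProductX", 12), (2, "SystemY", 4), (3, "StrategyZ", 23)],
    [(11, 1, 10), (11, 2, 7), (12, 1, 16), (12, 2, 2),
     (12, 3, 19), (13, 1, 4), (14, 2, 6), (14, 3, 6)],
)

# Two-project test database.
small = who_works_on_every_project(
    [(1, "ProductX", 12), (2, "SystemY", 4)],
    [(11, 1, 10), (11, 2, 7), (12, 1, 16)],
)

# Test database with no PROJECT or WORKS_ON records.
empty = who_works_on_every_project([], [])

print(full, small, empty)
```

By inspection of the tables, employee 12 is the only one assigned to all three projects in the full data, and employee 11 is the only one assigned to both projects in the two-project database. The empty database returns no employees because WORKS_ON has no rows to select from.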

Page 31: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Supply Chain Analytics

Page 32: 20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan

Theme 2 Data Management