20140918 CeADAR Theme 2 Data Management for Analytics_Prof Barry O Sullivan
Copyright © UCC 2014, September 2014
Prof. Barry O’Sullivan University College Cork
Theme 2 Data Management
The Science of Better
Data Management for Analytics
Many tools, techniques, vendors, and research efforts exist to tackle the major issue of data management for analytics initiatives.
Managing data still consumes significant effort in analytics projects.
This challenge is a major factor limiting the full potential of analytics capabilities for organisations.
Data Management - The Sensible Shoe of Data Analytics
Data Management for Analytics
Sub-themes:
– reduce data management effort for analytics
– data validation
– relevance of event(s) to relationships
– data curation (determining useful data)
– adaptive ETL
Focus of the Theme
Theme 2.1: Reduce data management effort for analytics
– The goal of this theme is to develop approaches, methods and tools that improve, simplify and reduce the effort involved in managing data for analytics purposes.
Theme 2.2: Data validation
– The goal of this theme is to develop advanced analytics techniques and demonstrators to manage the validity and quality of the data subsequently used for analytics purposes.
Focus of the Theme
Theme 2.3: Relevance of event(s) to relationships
– The goal of this theme is to develop analytical approaches, methods, models, and tools to understand and improve the relevance of an event to relationships (people, things, other data).
Theme 2.4: Data curation
– The goal of this theme is to use advanced analytics techniques to determine which data may be considered “useful”, in order to improve (among other things) data archiving and data storage approaches.
Focus of the Theme
Theme 2.5: Adaptive ETL
– The goal of this research theme is to create tools, demonstrators, and advanced analytics techniques that prevent STP (Straight Through Processing) breaks by automatically compensating for changes in the data received.
Some CeADAR Technologies
Automated Data Management Workflows
Querying with Confidence
Test Database Generation
Supply-Chain Inventory Management
Process Analytics (later this morning)
Automated Data Management Workflows
Each action modelled with logical preconditions and effects
Develop a library of standard actions and processes
Planning software generates, validates and analyses new workflows
Software tool assists users to design new workflows
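The idea of modelling each action with logical preconditions and effects, and letting planning software generate a workflow, can be sketched as follows. This is a minimal illustration, not the CeADAR tool: the action names (`extract`, `validate`, `load`) and the facts they use are hypothetical, and breadth-first search stands in for whatever planner the project actually uses.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset     # facts that must hold before the action runs
    add: frozenset     # facts the action makes true
    delete: frozenset  # facts the action makes false


def plan(initial, goal, actions):
    """Breadth-first search for a shortest action sequence that reaches the goal."""
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            if action.pre <= state:
                successor = (state - action.delete) | action.add
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, steps + [action.name]))
    return None  # no workflow reaches the goal


# Hypothetical library of standard data-management actions.
LIBRARY = [
    Action("extract", frozenset({"source-available"}), frozenset({"raw-data"}), frozenset()),
    Action("validate", frozenset({"raw-data"}), frozenset({"validated"}), frozenset()),
    Action("load", frozenset({"raw-data", "validated"}), frozenset({"in-warehouse"}), frozenset()),
]

workflow = plan({"source-available"}, {"in-warehouse"}, LIBRARY)
print(workflow)
```

Because every action declares what it needs and what it changes, the same library can be reused to generate workflows for different goals, and a request with no valid workflow is reported rather than silently producing a broken one.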
Technology solution
– Workflow model
– Abstract planning operators
– Planning domain for new workflows
– Plan and workflow analysis model
– Tool for creation of new workflows
– Workflow execution monitor
Principles of Planning
[ Malte Helmert ]
An auto-generated workflow
Planning Data Analytics Tasks
[ Fernandez et al. ICKEPS, 2009 ]
Data Preparation Action
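(:action datasetPreparation
  :parameters (?d - DataSet ?t - TestMode)
  ; preconditions: both facts must hold before the action applies
  :precondition (and (loaded ?d)
                     (can-learn ?d ?t))
  ; effects: add eval-on, delete preprocess-on and loaded,
  ; and add the preparation cost to exec-time
  :effect (and (eval-on ?d ?t)
               (not (preprocess-on ?d))
               (not (loaded ?d))
               (increase (exec-time)
                         (* (thousandsofInstances)
                            (preparationFactor ?t)))))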
[ Fernandez et al. ICKEPS, 2009 ]
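To make the semantics of this action concrete, the sketch below applies it to a state by hand. It is an illustrative reading of the PDDL above, not code from the cited work: the state is a set of fact tuples plus a dictionary of numeric fluents, and the dataset name `sales`, the test mode `split`, and all numeric values are assumptions chosen for the example.

```python
def dataset_preparation(facts, fluents, d, t):
    """Return the successor (facts, fluents), or None if a precondition fails."""
    preconditions = {("loaded", d), ("can-learn", d, t)}
    if not preconditions <= facts:
        return None
    # Apply delete effects first, then add effects.
    new_facts = (facts - {("preprocess-on", d), ("loaded", d)}) | {("eval-on", d, t)}
    # Numeric effect: exec-time grows by size times the mode's preparation factor.
    new_fluents = dict(fluents)
    new_fluents["exec-time"] += (
        fluents["thousandsofInstances"] * fluents[("preparationFactor", t)]
    )
    return new_facts, new_fluents


# Hypothetical state: object names and numeric values are illustrative only.
facts = {("loaded", "sales"), ("can-learn", "sales", "split"), ("preprocess-on", "sales")}
fluents = {"exec-time": 0.0, "thousandsofInstances": 50, ("preparationFactor", "split"): 0.5}

new_facts, new_fluents = dataset_preparation(facts, fluents, "sales", "split")
print(sorted(new_facts))
print(new_fluents["exec-time"])
```

Since the action deletes `(loaded ?d)`, it cannot be applied twice in a row to the same dataset; a planner would have to schedule a loading action again first.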
Querying with Confidence
– Database + incorrect query → incorrect results → poor decisions
– Database + correct query → correct results → good decisions
Querying with Confidence
Process is important
E.g. many eyes
– Pair programming
– Code reviews
– Separation of the development and testing teams
But technology can help
Querying with Confidence
Query-Aware Test-Database Generator
Query Audit Tool
Query Audit Tool
Provenance
For a record r in the result of a query, its provenance is the set of records from the database that contribute to r
Query Audit Tool
PROJECT
PROJID  PROJNAME   PROJDURATION
1       ProductX   12
2       SystemY    4
3       StrategyZ  23
WORKS_ON
EMPID  PROJID  HOURS
11     1       10
11     2       7
12     1       16
12     2       2
12     3       19
13     1       4
14     2       6
14     3       6
EMPLOYEE
EMPID  EMPNAME  EMPJOB
11     Brown    Developer
12     White    Developer
13     Smith    Sysadmin
14     Jones    Sysadmin
Query Audit Tool
User’s query
User selects a record
Query result
Provenance of selected record
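The transcript does not preserve the query shown on this slide or how the Query Audit Tool computes provenance, so the following is only a sketch of the definition. It loads the sample PROJECT, WORKS_ON and EMPLOYEE tables into SQLite, runs a hypothetical user query ("which projects have a developer working on them?"), and finds the provenance of a selected result record by re-running the same join with all source columns exposed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROJECT (PROJID INTEGER, PROJNAME TEXT, PROJDURATION INTEGER);
CREATE TABLE WORKS_ON (EMPID INTEGER, PROJID INTEGER, HOURS INTEGER);
CREATE TABLE EMPLOYEE (EMPID INTEGER, EMPNAME TEXT, EMPJOB TEXT);
INSERT INTO PROJECT VALUES (1,'ProductX',12),(2,'SystemY',4),(3,'StrategyZ',23);
INSERT INTO WORKS_ON VALUES (11,1,10),(11,2,7),(12,1,16),(12,2,2),
                            (12,3,19),(13,1,4),(14,2,6),(14,3,6);
INSERT INTO EMPLOYEE VALUES (11,'Brown','Developer'),(12,'White','Developer'),
                            (13,'Smith','Sysadmin'),(14,'Jones','Sysadmin');
""")

# Shared FROM/WHERE part of the hypothetical user query.
JOIN = """
FROM EMPLOYEE E
JOIN WORKS_ON W ON W.EMPID = E.EMPID
JOIN PROJECT P ON P.PROJID = W.PROJID
WHERE E.EMPJOB = 'Developer'
"""

result = [r[0] for r in conn.execute(
    "SELECT DISTINCT P.PROJNAME" + JOIN + "ORDER BY P.PROJNAME")]


def provenance(selected):
    """Source rows, grouped by table, that contribute to the selected result record."""
    rows = conn.execute(
        "SELECT E.*, W.*, P.*" + JOIN + "AND P.PROJNAME = ?", (selected,)
    ).fetchall()
    prov = {"EMPLOYEE": set(), "WORKS_ON": set(), "PROJECT": set()}
    for r in rows:
        prov["EMPLOYEE"].add(r[0:3])
        prov["WORKS_ON"].add(r[3:6])
        prov["PROJECT"].add(r[6:9])
    return prov


print(result)
print(provenance("StrategyZ"))
```

This hand-written rewrite only works for this one select-project-join query; a general audit tool would need to derive the provenance query from arbitrary SQL.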
Database Generator
“To test a query, run it against the database and compare the actual results with the expected results”
But what if your database has
– no data, or
– too much data?
Database Generator
Query-aware generator
– Analyses the query
– Generates data
The data ‘exercises’ the query (similar to program testing)
Database Generator
Presently,
– The generator realises that the query is a 3-way join
– So it inserts records into each of the 3 tables
In future, it must create more records for more comprehensive testing
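The behaviour described here, recognising a 3-way join and inserting a record into each of the three tables, might look roughly like the sketch below. It is not the CeADAR generator: the query-analysis step is replaced by a hand-written list of join conditions (`JOIN_CONDITIONS`), the 3-way join itself is a hypothetical query over the sample schema, and the generated values are arbitrary integers.

```python
import sqlite3

SCHEMA = {
    "EMPLOYEE": ["EMPID", "EMPNAME", "EMPJOB"],
    "WORKS_ON": ["EMPID", "PROJID", "HOURS"],
    "PROJECT": ["PROJID", "PROJNAME", "PROJDURATION"],
}
# Assumed output of analysing the query: pairs of columns equated by the join.
JOIN_CONDITIONS = [
    (("EMPLOYEE", "EMPID"), ("WORKS_ON", "EMPID")),
    (("WORKS_ON", "PROJID"), ("PROJECT", "PROJID")),
]


def generate_rows(schema, join_conditions):
    """Return one row per table; columns equated by a join condition share a value.

    Simplification: each condition must either start a new group of columns or
    extend one already seen; merging two existing groups is not handled.
    """
    value = {}  # (table, column) -> generated value
    next_value = 1
    for left, right in join_conditions:
        shared = value.get(left, value.get(right))
        if shared is None:
            shared = next_value
            next_value += 1
        value[left] = value[right] = shared
    rows = {}
    for table, columns in schema.items():
        row = []
        for column in columns:
            if (table, column) not in value:
                value[(table, column)] = next_value
                next_value += 1
            row.append(value[(table, column)])
        rows[table] = tuple(row)
    return rows


rows = generate_rows(SCHEMA, JOIN_CONDITIONS)

# Load the generated rows and check that they exercise the join.
conn = sqlite3.connect(":memory:")
for table, columns in SCHEMA.items():
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" * len(columns))
    conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", rows[table])

QUERY = """
SELECT E.EMPNAME, P.PROJNAME
FROM EMPLOYEE E
JOIN WORKS_ON W ON W.EMPID = E.EMPID
JOIN PROJECT P ON P.PROJID = W.PROJID
"""
result = conn.execute(QUERY).fetchall()
print(rows)
print(result)
```

One row per table is enough to make the join return a result, which matches the "presently" behaviour on the slide; more comprehensive testing would also need rows that fail each join condition.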
Database Generator
“Who works on every project?”
select distinct EMPID
from WORKS_ON as T1
where not exists
  (select PROJID
   from PROJECT
   where not exists
     (select EMPID
      from WORKS_ON as T2
      where T2.PROJID = PROJECT.PROJID
        and T2.EMPID = T1.EMPID));
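The double `not exists` reads as "employees for whom there is no project that they do not work on". As a cross-check of that reading, the sketch below expresses the same condition with Python sets, using the PROJID values and the (EMPID, PROJID) pairs from the three-project sample tables shown earlier in the deck. The function name `works_on_every_project` is an illustrative choice, not something from the slides.

```python
projects = {1, 2, 3}
works_on = [(11, 1), (11, 2), (12, 1), (12, 2), (12, 3), (13, 1), (14, 2), (14, 3)]


def works_on_every_project(projects, works_on):
    """Employees in WORKS_ON whose set of projects covers every project."""
    assigned = {}
    for emp, proj in works_on:
        assigned.setdefault(emp, set()).add(proj)
    # projects <= projs means "no project exists that this employee lacks".
    return sorted(emp for emp, projs in assigned.items() if projects <= projs)


print(works_on_every_project(projects, works_on))
```

Like the SQL, this only considers employees that appear in WORKS_ON, so an employee with no assignments is never returned, even when there are no projects at all.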
Database Generator
PROJECT
PROJID  PROJNAME  PROJDURATION
1       ProductX  12
2       SystemY   4
WORKS_ON
EMPID  PROJID  HOURS
11     1       10
11     2       7
12     1       16
EMPLOYEE
EMPID  EMPNAME  EMPJOB
11     Brown    Developer
12     White    Developer
Database Generator
PROJECT
PROJID  PROJNAME  PROJDURATION
(no rows)
WORKS_ON
EMPID  PROJID  HOURS
(no rows)
EMPLOYEE
EMPID  EMPNAME  EMPJOB
11     Brown    Developer
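Following the advice to run a query against a database and compare actual with expected results, the sketch below runs the "Who works on every project?" query against the two small generated databases from these slides, using SQLite. The expected results are worked out by hand from the table contents; the helper name `run` is illustrative.

```python
import sqlite3

QUERY = """
select distinct EMPID
from WORKS_ON as T1
where not exists
  (select PROJID
   from PROJECT
   where not exists
     (select EMPID
      from WORKS_ON as T2
      where T2.PROJID = PROJECT.PROJID
        and T2.EMPID = T1.EMPID))
"""


def run(project_rows, works_on_rows, employee_rows):
    """Load the given rows into a fresh in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table PROJECT (PROJID integer, PROJNAME text, PROJDURATION integer);
        create table WORKS_ON (EMPID integer, PROJID integer, HOURS integer);
        create table EMPLOYEE (EMPID integer, EMPNAME text, EMPJOB text);
    """)
    conn.executemany("insert into PROJECT values (?, ?, ?)", project_rows)
    conn.executemany("insert into WORKS_ON values (?, ?, ?)", works_on_rows)
    conn.executemany("insert into EMPLOYEE values (?, ?, ?)", employee_rows)
    return sorted(r[0] for r in conn.execute(QUERY))


# First generated database: two projects, three assignments, two employees.
first = run(
    [(1, "ProductX", 12), (2, "SystemY", 4)],
    [(11, 1, 10), (11, 2, 7), (12, 1, 16)],
    [(11, "Brown", "Developer"), (12, "White", "Developer")],
)

# Second generated database: only one EMPLOYEE row, no projects or assignments.
second = run([], [], [(11, "Brown", "Developer")])

print(first)
print(second)
```

In the first database only employee 11 is assigned to both projects. In the second, the query returns nothing because it draws candidates from WORKS_ON, so employee 11 is not reported even though there is no project they fail to work on. Whether that is the intended answer is exactly the kind of question such a test database brings to the surface.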
Supply Chain Analytics