University of Vienna, P. Brezany

Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture

Peter Brezany

University of Vienna

Collecting Data

[Figure: data sources, data repositories and analysis]
• Satellites
• Laboratories (microscopes, MRI/CT scanners, ...)
• Computer simulations
• Experiments (high energy physics, ...)
• Business
• Data repositories
• Analysis

Motivation

• Computational Grid – a new-generation infrastructure
• Challenge: advanced analysis of data managed by the Grid
• Typical data in modern Grid applications:
  – files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.
• Our research aims:
  – Phase 1: Knowledge discovery Grid system (GridMiner)
  – Phase 2: Intelligent Grid system (WisdomGrid)

Outline

• Motivation

• Background and Related Work

• Basic Concepts and GridMiner Architecture

• Grid Data Integration System

• Data Mining Layer

• Implementation Issues and Experiments

• Future Research

• Conclusions

Background and Related Work

• Basic Grid development (Globus 1) – metacomputing

• Data Grid (Globus 2, DataGrid of CERN, etc.)

• Semantic Grid (myGrid)

• Open Grid Service Architecture (Globus 3, OGSA-DAIS)

• Parallel and Distributed Data Mining and Data Warehousing

• Knowledge Grid (GridMiner and work of others)

• Web Intelligence

GridMiner Requirements

• Open architecture

• Data distribution, complexity, heterogeneity, and large data size

• Applying different kinds of analysis strategies

• Compatibility with existing Grid infrastructure

• Openness to tools and algorithms

• Scalability

• Grid, network, and location transparency

• Security and data privacy

• OLAP support

GridMiner (Layered) Abstract Architecture

Computational & Data Grid

Information Grid

Knowledge Grid

Data to Knowledge

Control

User Interface

Built on K. G. Jeffery's proposal

GridMiner Conceptual Architecture

Job Control

Service Architecture

Based on OGSA-DAIS

Data Distribution Scenarios

1. Single data source

2. Federated data sources with different types of partitioning

Example

Vertical and horizontal distribution of the virtual data source
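
The two kinds of partitioning can be illustrated with a small sketch. With horizontal partitioning the rows of the virtual data source are spread over several sites, so they can be reassembled with a union; with vertical partitioning the columns are spread, so fragments are joined on a shared key. The table contents, the key name `pat_id` and both helper functions are hypothetical and only illustrate the idea; this is not the GridMiner mediation service.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_horizontal(*fragments: List[Row]) -> List[Row]:
    """Union of row fragments that share the same columns."""
    merged: List[Row] = []
    for fragment in fragments:
        merged.extend(fragment)
    return merged


def merge_vertical(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Inner join of column fragments on a shared key."""
    right_by_key = {row[key]: row for row in right}
    joined: List[Row] = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical fragments of one virtual data source.
site_a = [{"pat_id": 1, "age": 34}, {"pat_id": 2, "age": 57}]   # rows, part 1
site_b = [{"pat_id": 3, "age": 41}]                             # rows, part 2
site_c = [{"pat_id": 1, "outcome": "good"},                     # extra column
          {"pat_id": 3, "outcome": "poor"}]

all_rows = merge_horizontal(site_a, site_b)
virtual_table = merge_vertical(all_rows, site_c, key="pat_id")
```

The inner join drops patient 2, for which `site_c` holds no record; a real mediator would have to decide how to treat such incomplete rows.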

Mapping Schema

Grid Data Mediation Services

Architecture of a Data Mining System

Components of the Data Mining Layer

• GridMiner Service Factory

• GridMiner Service Registry

• GridMiner Data Mining Service

• GridMiner Preprocessing Service

• GridMiner Presentation Service

• GridMiner Orchestration Service

Centralized Data Mining

Parallel and Distributed Data Mining

GridMiner Orchestration Service

[Figure: a Client, GMSR, GMSF instances, GMOrchS with its Workflow Engine, and the service instances GMPPS 1, GMPPS 2, GMDMS and GMPRS. Arrows are labelled with notifications, "query SDEs", GSHs, and <read>/<write> data accesses.]

Numbered interactions in the figure:
1. browse (Client)
2. create GMDMS
3. execute Workflow
4. create
5. performActivity
6. create
7. performActivity
8. create
9. performActivity
10. create
11. performActivity

GridMiner Job Description (Workflow Outline):
• Header
• Resource Declarations
• Workflow
  – Activity: use GMPPS for filling missing values, remove noise
  – Activity: use GMPPS for selection and preliminary aggregations
  – Activity: use GMDMS for generating a decision tree
  – Activity: use GMPRS for a graphical, interactive representation
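
The create/performActivity pattern of this slide can be sketched as follows. The sketch assumes that the workflow is a plain sequence of activities and that a factory creates one service instance per activity; the class names, the `header` and `resources` values and the string log are hypothetical, and nothing here reproduces the actual GridMiner interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Activity:
    service: str   # service type that performs the step, e.g. "GMPPS"
    task: str      # description of the step


@dataclass
class JobDescription:
    header: str
    resources: List[str]
    workflow: List[Activity] = field(default_factory=list)


class ServiceFactory:
    """Stand-in for a GMSF: creates one service instance per request."""

    def __init__(self) -> None:
        self.created: List[str] = []

    def create(self, service: str) -> Callable[[str], str]:
        self.created.append(service)
        # The "instance" is just a function reporting what it performed.
        return lambda task: f"{service}: {task}"


def execute_workflow(job: JobDescription, factory: ServiceFactory) -> List[str]:
    """Run the activities in order: create an instance, then perform it."""
    log: List[str] = []
    for activity in job.workflow:
        perform_activity = factory.create(activity.service)
        log.append(perform_activity(activity.task))
    return log


job = JobDescription(
    header="decision tree job",    # hypothetical header text
    resources=["dat1"],            # hypothetical resource name
    workflow=[
        Activity("GMPPS", "fill missing values, remove noise"),
        Activity("GMPPS", "selection and preliminary aggregations"),
        Activity("GMDMS", "generate a decision tree"),
        Activity("GMPRS", "graphical, interactive representation"),
    ],
)
factory = ServiceFactory()
log = execute_workflow(job, factory)
```

Each activity triggers exactly one `create` followed by one `performActivity`, which mirrors the alternating steps 4 to 11 in the figure.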

GridMiner Job Specification Language

Implementation Prototype

• Implementation of the Mediation Service for horizontal data partitioning

• Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release
• We use
  – Weka, a freely available Java-based data mining system (data preprocessing and data mining tasks; main-memory oriented)
  – a home-grown Java implementation of the SPRINT algorithm (disk-oriented)
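
SPRINT builds decision trees by scanning pre-sorted attribute lists and choosing the split with the lowest gini index. The following sketch shows that split evaluation for one numeric attribute. It is an in-memory illustration with hypothetical data; it is not the home-grown Java implementation mentioned above and omits the disk-resident lists and categorical attributes.

```python
from collections import Counter
from typing import List, Optional, Tuple


def gini(counts: Counter, total: int) -> float:
    """Gini index 1 - sum(p_j^2) of a class histogram."""
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_numeric_split(
    attribute_list: List[Tuple[float, str]],
) -> Tuple[Optional[float], float]:
    """Scan a pre-sorted (value, class) list; return (threshold, gini_split).

    Records with value <= threshold go to the left partition.
    """
    total = len(attribute_list)
    below: Counter = Counter()                      # classes seen so far
    above: Counter = Counter(cls for _, cls in attribute_list)
    best_threshold: Optional[float] = None
    best_score = float("inf")

    for i in range(total - 1):
        value, cls = attribute_list[i]
        below[cls] += 1
        above[cls] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue                                # no split between equal values
        n_below = i + 1
        n_above = total - n_below
        score = ((n_below / total) * gini(below, n_below)
                 + (n_above / total) * gini(above, n_above))
        if score < best_score:
            best_score = score
            best_threshold = (value + next_value) / 2
    return best_threshold, best_score


# Hypothetical attribute list for "age", sorted by value.
ages = sorted([(23, "good"), (30, "good"), (45, "poor"), (61, "poor")])
threshold, score = best_numeric_split(ages)
```

For this toy list the best threshold lies midway between 30 and 45, where both partitions are pure.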

Experimental Environment

• Test data suites
  – synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  – TBI (Traumatic Brain Injury) databases
• Grid testbed
  – Vienna
  – CERN
  – Dublin
  – Zagreb
  – Cracow
• Goals in the first phases
  – Verifying model accuracy
  – Overhead of the service layers

Extending the Functionality

OLAM

Example: Mining Patterns for Data Classification and Associations

    use database dat1, dat2
    mine classifications
    analyze patient_outcome
    using g_parsimony
    display as tree

    use database DBs attributes
    mine associations
    using method_attributes
    display as rules
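
The clause structure of these queries can be made explicit with a small parser. The sketch below assumes that a query is a sequence of keyword/argument clauses and that the keyword set is exactly the one visible in the two examples; the function name `parse_query` is hypothetical, and the real query language may have further clauses.

```python
import re
from typing import Dict

# Clause keywords taken from the two example queries; multi-word keywords
# are listed first so that they are matched as units.
KEYWORDS = ["use database", "display as", "analyze", "using", "mine"]
_PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def parse_query(text: str) -> Dict[str, str]:
    """Split a query into {keyword: argument} pairs."""
    normalized = " ".join(text.split())       # collapse line breaks and spaces
    parts = _PATTERN.split(normalized)        # [prefix, kw1, arg1, kw2, arg2, ...]
    clauses: Dict[str, str] = {}
    for keyword, argument in zip(parts[1::2], parts[2::2]):
        clauses[keyword] = argument.strip()
    return clauses


query = """
use database dat1, dat2
mine classifications
analyze patient_outcome
using g_parsimony
display as tree
"""
clauses = parse_query(query)
```

A front end could then dispatch on `clauses["mine"]` to select the mining task and on `clauses["display as"]` to select the presentation.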

Workflow 1: Interactive Mode

Workflow 2: Batch Mode

Workflow 3: Hybrid Mode

Execution Model Based on Static Workflow

Execution Model Based on Dynamic Workflow

Towards the Wisdom Grid (WG)

WG Architecture

Wisdom Grid

Agent Grid Service

Knowledge Base Service

Knowledge Discovery Service

Agent Platform

External Services

External Knowledge Base

Domain Knowledge Agents

Knowledge Explorer Agent

End User (personal) Agent

Grid

KB

Work-Flow

End User Agent

Knowledge Agent

Knowledge Explorer Agent

Knowledge Base Service

External Agents

Knowledge Base

Agent Service

Knowledge Discovery Service

Services ...

Knowledge Discovery Service

• Client for other services
• Knowledge Discovery in Databases
  – GridMiner
  – data mining
  – on-line analytical processing (OLAP)
• Web Mining
  – semantic web
  – Online libraries
  – Web/Grid Services
• Knowledge Explorer Agent

Knowledge Base Service / KB

• KBS – Search, Query, Expand Knowledge Base
• KB – Database that stores particular data about real objects and relations between these objects and their properties
  – Consists of ontologies and instances
  – Information about resources (location, query lang.)
    » on the Web
    » web/grid services, agents
    » references to the online database
• Languages: XML/RDF/DAML+OIL/DAML-S/OWL

Ontology - example

[Figure: Patient is a Human; Human has Age]

DAML+OIL Language:

    <daml:Class rdf:ID="Human">
      <rdfs:subClassOf>
        <daml:Restriction daml:cardinality="1">
          <daml:onProperty rdf:resource="#Age"/>
        </daml:Restriction>
      </rdfs:subClassOf>
    </daml:Class>

    <daml:DatatypeProperty rdf:ID="Age">
      <rdfs:domain rdf:resource="#Human"/>
    </daml:DatatypeProperty>

    <daml:Class rdf:ID="Patient">
      <rdfs:subClassOf rdf:resource="#Human"/>
    </daml:Class>
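
To show how such an ontology fragment can be consumed by a program, the sketch below wraps a well-formed rendering of the fragment above in an `rdf:RDF` root element and reads the class hierarchy with Python's standard XML parser. The enclosing root element and the variable names are assumptions added for the example; a real system would more likely use a dedicated RDF toolkit.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

DOCUMENT = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="Human">
    <rdfs:subClassOf>
      <daml:Restriction daml:cardinality="1">
        <daml:onProperty rdf:resource="#Age"/>
      </daml:Restriction>
    </rdfs:subClassOf>
  </daml:Class>
  <daml:DatatypeProperty rdf:ID="Age">
    <rdfs:domain rdf:resource="#Human"/>
  </daml:DatatypeProperty>
  <daml:Class rdf:ID="Patient">
    <rdfs:subClassOf rdf:resource="#Human"/>
  </daml:Class>
</rdf:RDF>
"""

root = ET.fromstring(DOCUMENT.strip())

# Map each class to its named superclasses; anonymous restrictions
# (subClassOf elements without rdf:resource) are skipped.
superclasses = {}
for cls in root.findall(f"{{{DAML}}}Class"):
    name = cls.get(f"{{{RDF}}}ID")
    parents = []
    for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
        resource = sub.get(f"{{{RDF}}}resource")
        if resource is not None:
            parents.append(resource.lstrip("#"))
    superclasses[name] = parents

# Map each datatype property to the class named in its rdfs:domain.
domains = {}
for prop in root.findall(f"{{{DAML}}}DatatypeProperty"):
    domain = prop.find(f"{{{RDFS}}}domain")
    domains[prop.get(f"{{{RDF}}}ID")] = domain.get(f"{{{RDF}}}resource").lstrip("#")
```

The result records that Patient is a subclass of Human and that Age is a property whose domain is Human.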

Knowledge Base - example

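[Figure: knowledge base graph with the nodes Patient, Human, Temperature, Value, Attribute and Database Tables, connected by edges labelled "is" and "has"]

Database reference shown in the figure:
    jdbc://foo/hospital
    table: PATIENTS
    attribute: PAT_ID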

Semantic mediator

• Distributed heterogeneous databases
  – Different database schemas
  – Different query languages
  – Different names of attributes/tables…
  but the same semantics!

• WG enables semantics mediation at a higher level

Semantic mediator (cont.)

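Tables shown in the figure (Database in Hospital X, Database in Hospital Z):

PATIENTS
PAT_ID | PAT_AGE | PAT_BLOOD_TYPE
...    | ...     | ...

PAT_TAB
ID  | AGE | BT
... | ... | ...

Ontology shown in the figure: Patient is Human; Age and Blood Type are attached by "has" edges

Mappings:
• AGE samePropertyAs PAT_AGE
• BT samePropertyAs PAT_BLOOD_TYPE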
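
The effect of the samePropertyAs mappings can be sketched as a query rewriting step: a request phrased in ontology terms is translated into the column names of each source. The column and table names are those of the two tables above; the labels `source_1` and `source_2`, the ontology property names and the function `rewrite_query` are hypothetical, and the slide does not say which table belongs to which hospital.

```python
from typing import Dict, List

# Hypothetical mapping from ontology properties to column names per source.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "source_1": {"table": "PATIENTS", "id": "PAT_ID", "age": "PAT_AGE",
                 "blood_type": "PAT_BLOOD_TYPE"},
    "source_2": {"table": "PAT_TAB", "id": "ID", "age": "AGE",
                 "blood_type": "BT"},
}


def rewrite_query(properties: List[str], source: str) -> str:
    """Translate ontology-level property names into SQL for one source."""
    mapping = MAPPINGS[source]
    columns = ", ".join(f"{mapping[p]} AS {p}" for p in properties)
    return f"SELECT {columns} FROM {mapping['table']}"


queries = {src: rewrite_query(["age", "blood_type"], src) for src in MAPPINGS}
```

Because both rewritten queries alias their columns to the same ontology names, the results from the two sources can be combined without further renaming.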

Distributed Knowledge base

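[Figure: one ontology whose classes and properties are defined under different URIs]

Nodes:
• uri:fooX#Patient (class)
• uri:fooY#Human (class)
• uri:fooZ#Temperature (property)
• uri:fooX#Ill_Person (class)

Edge labels: is subclass, has property, is same class as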

Agent Grid Service

• Gives the system the ability to communicate with the outside world in standard languages
• FIPA Standards
• ACL – Agent Communication Language
• KQML – Knowledge Query and Manipulation Language
• Agent Platform (JADE, FIPA-OS)
• Agents
  – Domain Knowledge Agent
  – Knowledge Explorer Agent
  – End-user Agent (personal)

Querying

• End-user agent with own ontology – subset of ontology
• Merging of ontologies
• Without own ontology
• Negotiating about domain of interest
• Queries created from ontology
• Templates

    <Patient rdf:ID="ID001">
      <Temperature/>
    </Patient>

Answers

• Mined Knowledge (GridMiner)
  – Decision trees / rules
    » (clinical pathways)
  – Association rules
• Instances of domain ontology
  – Particular data
  – References
  – Links to Web sites
  – Information about other knowledge providers

Case Study - Medical Application

[Figure: flow of a question and its answer in the medical case study]

End User (personal) Agent

Q: Outcome? + data about patient's condition

Knowledge Agent

Training set

GridMiner

Test set

Hospital Databases

Knowledge Discovery Service

Knowledge Base

Semantic Web/Grid

A: probability of survival + references to the diagnoses

Knowledge Explorer Agent

resources

Conclusions and Future Work

• Application and extension of the Grid technology to knowledge discovery – an important, but non-traditional Grid application domain
• Introduction of a new Grid Data Mediation Service
• Future work
  – Performance evaluation on large synthetic data volumes
  – Coupling of the Data Mining services architecture with the OLAP services architecture
  – Development of a knowledge discovery oriented Grid Workflow Language and the appropriate Workflow Engine
  – Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
  – Development of the Wisdom Grid