Post on 23-Dec-2015
University of ViennaP. Brezany1
Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture
Peter Brezany
University of Vienna
Collecting Data
• Data sources
  – Satellites
  – Laboratories (microscopes, MRI/CT scanners, ...)
  – Computer simulations
  – Experiments (high energy physics, ...)
  – Business
• Data Repositories
• Analysis
Motivation
• Computational Grid – a new-generation infrastructure
• Challenge: Advanced analysis of data managed by the Grid
• Typical data in modern Grid applications:
  – files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.
• Our research aims:
  – Phase 1: Knowledge discovery Grid system (GridMiner)
  – Phase 2: Intelligent Grid system (WisdomGrid)
Outline
• Motivation
• Background and Related Work
• Basic Concepts and GridMiner Architecture
• Grid Data Integration System
• Data Mining Layer
• Implementation Issues and Experiments
• Future Research
• Conclusions
Background and Related Work
• Basic Grid development (Globus 1) – metacomputing
• Data Grid (Globus 2, DataGrid of CERN, etc.)
• Semantic Grid (myGrid)
• Open Grid Service Architecture (Globus 3, OGSA-DAIS)
• Parallel and Distributed Data Mining and Data Warehousing
• Knowledge Grid (GridMiner and work of others)
• Web Intelligence
GridMiner Requirements
• Open architecture
• Data distribution, complexity, heterogeneity, and large data size
• Applying different kinds of analysis strategies
• Compatibility with existing Grid infrastructure
• Openness to tools and algorithms
• Scalability
• Grid, network, and location transparency
• Security and data privacy
• OLAP support
GridMiner (Layered) Abstract Architecture
Computational & Data Grid
Information Grid
Knowledge Grid
Data to Knowledge Control
User Interface
Built on K. G. Jeffery's proposal
Data Distribution Scenarios
1. Single data source
2. Federated data sources with different types of partitioning
Example
Vertical and horizontal distribution of the virtual data source
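The two kinds of partitioning can be illustrated with a small sketch. It is not the GridMiner mediation service; the site names, column names, and rows are hypothetical, and the code only shows how a virtual table could be rebuilt from horizontal fragments (subsets of rows) and vertical fragments (subsets of columns sharing a key).

```python
# Hypothetical fragments of one virtual "patients" table.
# Horizontal partitioning: each site holds a subset of the rows.
site_a = [{"id": 1, "age": 34}, {"id": 2, "age": 58}]
site_b = [{"id": 3, "age": 41}]

# Vertical partitioning: another site holds different columns,
# linked to the rows above by the shared key "id".
site_c = [
    {"id": 1, "outcome": "good"},
    {"id": 2, "outcome": "poor"},
    {"id": 3, "outcome": "good"},
]


def union_rows(*fragments):
    """Rebuild a horizontally partitioned table by concatenating rows."""
    rows = []
    for fragment in fragments:
        rows.extend(fragment)
    return rows


def join_on_key(left, right, key):
    """Rebuild a vertically partitioned table by joining on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# The virtual data source seen by the client: all rows, all columns.
virtual_table = join_on_key(union_rows(site_a, site_b), site_c, "id")
```

A client mining `virtual_table` does not need to know which site contributed which rows or columns, which is the kind of location transparency the requirements list asks for.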
Components of the Data Mining Layer
• GridMiner Service Factory
• GridMiner Service Registry
• GridMiner Data Mining Service
• GridMiner Preprocessing Service
• GridMiner Presentation Service
• GridMiner Orchestration Service
GridMiner Orchestration Service
[Figure: orchestration of a GridMiner job. Components shown: Client, GMSR (Service Registry), GMSF (Service Factory), GMOrchS (Orchestration Service) with its Workflow Engine, GMPPS 1 and GMPPS 2 (Preprocessing Services), GMDMS (Data Mining Service), GMPRS (Presentation Service). Edge labels: notifications, query SDEs, <read>, <write>.]
Numbered steps in the figure:
1. browse GSHs (Client)
2. create GMDMS
3. execute Workflow
4., 6., 8., 10. create (through a GMSF)
5., 7., 9., 11. performActivity
GridMiner Job Description
• Header
• Resource Declarations
• Workflow (outline)
  – Activity: use GMPPS for filling missing values, remove noise
  – Activity: use GMPPS for selection and preliminary aggregations
  – Activity: use GMDMS for generating a decision tree
  – Activity: use GMPRS for a graphical, interactive representation
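The create/performActivity pattern in the figure can be sketched as a simple sequential workflow engine. This is a minimal illustration under assumptions: the class and method names (`Service`, `ServiceFactory`, `execute_workflow`) are hypothetical stand-ins, not the GridMiner or Globus APIs, and real services would exchange data through the Grid rather than through Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Stand-in for a created service instance (GMPPS, GMDMS, GMPRS)."""
    kind: str
    log: list = field(default_factory=list)

    def perform_activity(self, description, data):
        # Record the activity and pass an extended trace downstream.
        self.log.append(description)
        return data + [f"{self.kind}: {description}"]


class ServiceFactory:
    """Stand-in for a GMSF: creates one service instance per request."""

    def create(self, kind):
        return Service(kind)


def execute_workflow(activities, factory, data=None):
    """Run activities in order: create a service, then perform the activity."""
    data = list(data or [])
    for kind, description in activities:
        service = factory.create(kind)
        data = service.perform_activity(description, data)
    return data


# The four activities of the job description, in order.
workflow = [
    ("GMPPS", "fill missing values, remove noise"),
    ("GMPPS", "selection and preliminary aggregations"),
    ("GMDMS", "generate a decision tree"),
    ("GMPRS", "graphical, interactive representation"),
]
trace = execute_workflow(workflow, ServiceFactory())
```

Each loop iteration corresponds to one create/performActivity pair (steps 4 to 11 in the figure); the output of one activity becomes the input of the next.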
Implementation Prototype
• Implementation of the Mediation Service for horizontal data partitioning
• Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release
• We use
  – Weka, a freely available Java-based data mining system (data preprocessing and data mining tasks; main-memory oriented)
  – a home-grown Java implementation of the SPRINT algorithm (disk-oriented)
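SPRINT, as described in the literature, selects decision tree splits by the gini index evaluated over sorted attribute lists. The sketch below shows only that split criterion for one numeric attribute. It is an in-memory illustration with hypothetical sample data, not the disk-oriented Java implementation mentioned above, and it recomputes class counts at each candidate split rather than updating them incrementally.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_c ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_numeric_split(records):
    """Return (threshold, weighted_gini) of the best split 'value < threshold'.

    records is a list of (value, label) pairs for one numeric attribute.
    """
    records = sorted(records)
    labels = [label for _, label in records]
    best = (None, float("inf"))
    for i in range(1, len(records)):
        if records[i - 1][0] == records[i][0]:
            continue  # no threshold fits between equal values
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            threshold = (records[i - 1][0] + records[i][0]) / 2
            best = (threshold, score)
    return best


# Hypothetical (age, outcome) pairs.
sample = [(23, "good"), (31, "good"), (47, "poor"), (62, "poor")]
threshold, score = best_numeric_split(sample)
```

On this sample the midpoint between 31 and 47 separates the two classes completely, so the weighted gini index of that split is zero.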
Experimental Environment
• Test data suites
  – synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  – TBI (Traumatic Brain Injury) databases
• Grid testbed
  – Vienna
  – CERN
  – Dublin
  – Zagreb
  – Cracow
• Goals in the first phases
  – Verifying model accuracy
  – Overhead of the service layers
Example: Mining Patterns for Data Classification and Associations
Classification:
  use database dat1, dat2
  mine classifications
  analyze patient_outcome
  using g_parsimony
  display as tree
Associations:
  use database DBs attributes
  mine associations
  using method_attributes
  display as rules
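To make the clause structure of these queries concrete, here is a small parser sketch. The grammar is inferred only from the two examples on the slide (one clause per line, introduced by `use database`, `mine`, `analyze`, `using`, or `display as`); the actual query language may have more clauses and different rules, so treat the keyword list and the `parse_query` function as assumptions.

```python
# Clause keywords inferred from the two example queries; not a full grammar.
CLAUSE_KEYWORDS = ("use database", "mine", "analyze", "using", "display as")


def parse_query(text):
    """Parse one query (one clause per line) into a dict keyed by keyword."""
    query = {}
    for line in text.strip().splitlines():
        line = line.strip()
        for keyword in CLAUSE_KEYWORDS:
            if line.startswith(keyword + " "):
                query[keyword] = line[len(keyword):].strip()
                break
        else:
            raise ValueError(f"unrecognized clause: {line!r}")
    if "use database" in query:
        # Several data sources may be listed, separated by commas.
        query["use database"] = [
            name.strip() for name in query["use database"].split(",")
        ]
    return query


classification = parse_query("""
use database dat1, dat2
mine classifications
analyze patient_outcome
using g_parsimony
display as tree
""")
```

The resulting dictionary could then be handed to whichever service performs the requested mining task and presentation.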
WG Architecture
Wisdom Grid
Agent Grid Service
Knowledge Base Service
Knowledge Discovery Service
Agent Platform
External Services
External Knowledge Base
Domain Knowledge Agents
Knowledge Explorer Agent
End User (personal) Agent
Grid
KB
Work-Flow
[Figure: message flow among the End User Agent, Knowledge Agent, Knowledge Explorer Agent and External Agents, and the Agent Service, Knowledge Base Service (with the Knowledge Base), Knowledge Discovery Service and further services.]
Knowledge Discovery Service
• Client for other services
• Knowledge Discovery in Databases
  – GridMiner
  – data mining
  – on-line analytical processing (OLAP)
• Web Mining
  – semantic web
  – Online libraries
  – Web/Grid Services
• Knowledge Explorer Agent
Knowledge Base Service / KB
• KBS – Search, Query, Expand Knowledge Base
• KB – Database that stores particular data about real objects, the relations between these objects, and their properties
  – Consists of ontologies and instances
  – Information about resources (location, query language) on the Web: web/grid services, agents, references to the online database
• Languages: XML/RDF/DAML+OIL/DAML-S/OWL
Ontology - example
[Figure: Patient is a Human; Human has Age.]
DAML+OIL Language:
  <daml:Class rdf:ID="Human">
    <rdfs:subClassOf>
      <daml:Restriction daml:cardinality="1">
        <daml:onProperty rdf:resource="#Age"/>
      </daml:Restriction>
    </rdfs:subClassOf>
  </daml:Class>
  <daml:DatatypeProperty rdf:ID="Age">
    <rdfs:domain rdf:resource="#Human"/>
  </daml:DatatypeProperty>
  <daml:Class rdf:ID="Patient">
    <daml:subClassOf rdf:resource="#Human"/>
  </daml:Class>
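A short sketch of how such an ontology could be read and used: it parses a simplified version of the example (the cardinality restriction on Human is omitted, and `rdfs:subClassOf` is used throughout) with Python's standard XML parser and derives that Patient inherits the Age property from Human. The helper `properties_of` is hypothetical; a real system would use an RDF toolkit and a reasoner rather than direct XML traversal.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

# Simplified version of the ontology example, wrapped in an rdf:RDF root.
ONTOLOGY = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="Human"/>
  <daml:DatatypeProperty rdf:ID="Age">
    <rdfs:domain rdf:resource="#Human"/>
  </daml:DatatypeProperty>
  <daml:Class rdf:ID="Patient">
    <rdfs:subClassOf rdf:resource="#Human"/>
  </daml:Class>
</rdf:RDF>
"""

root = ET.fromstring(ONTOLOGY)

# Map each class to its direct named superclass.
superclass = {}
for cls in root.findall(f"{{{DAML}}}Class"):
    for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
        target = sub.get(f"{{{RDF}}}resource")
        if target:
            superclass[cls.get(f"{{{RDF}}}ID")] = target.lstrip("#")

# Map each datatype property to the class named as its domain.
domain = {}
for prop in root.findall(f"{{{DAML}}}DatatypeProperty"):
    dom = prop.find(f"{{{RDFS}}}domain")
    domain[prop.get(f"{{{RDF}}}ID")] = dom.get(f"{{{RDF}}}resource").lstrip("#")


def properties_of(cls):
    """Properties declared on a class or inherited from its superclasses."""
    props = []
    while cls is not None:
        props += [p for p, d in domain.items() if d == cls]
        cls = superclass.get(cls)
    return props
```

Here `properties_of("Patient")` walks from Patient up to Human and finds Age, mirroring the "Patient is a Human, Human has Age" diagram.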
Knowledge Base - example
[Figure: ontology concepts Patient (is a Human), Temperature, Value and Attribute, connected by "has" relations and linked to database tables, e.g. jdbc://foo/hospital, table: PATIENTS, attribute: PAT_ID.]
Semantic mediator
• Distributed heterogeneous databases
  – Different database schemas
  – Different query languages
  – Different names of attributes/tables…
  but the same semantics!
• WG enables semantic mediation at a higher level
Semantic mediator (cont.)
Two databases (Database in Hospital X, Database in Hospital Z) store the same kind of data in differently named tables:
  PATIENTS: PAT_ID | PAT_AGE | PAT_BLOOD_TYPE
  PAT_TAB:  ID     | AGE     | BT
Ontology: Patient is a Human, with "has" relations to Age and Blood Type
Mappings:
  AGE samePropertyAs PAT_AGE
  BT samePropertyAs PAT_BLOOD_TYPE
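The effect of the samePropertyAs links can be sketched as a pair of translation tables. This is an illustration only: the ontology property names (`Id`, `Age`, `BloodType`), the sample rows, and the helper functions are hypothetical, and the Wisdom Grid would derive such mappings from the knowledge base rather than from a hard-coded dictionary.

```python
# Local column name -> ontology property, one mapping per table.
# Ontology property names here are illustrative.
MAPPINGS = {
    "PATIENTS": {"PAT_ID": "Id", "PAT_AGE": "Age", "PAT_BLOOD_TYPE": "BloodType"},
    "PAT_TAB": {"ID": "Id", "AGE": "Age", "BT": "BloodType"},
}


def to_ontology_terms(table, row):
    """Rename the columns of a local row to ontology property names."""
    mapping = MAPPINGS[table]
    return {mapping[column]: value for column, value in row.items()}


def local_columns(table, properties):
    """Translate ontology properties into the column names of one table."""
    inverse = {prop: column for column, prop in MAPPINGS[table].items()}
    return [inverse[prop] for prop in properties]


# Hypothetical rows from the two hospital databases.
row_x = {"PAT_ID": 17, "PAT_AGE": 52, "PAT_BLOOD_TYPE": "A"}
row_z = {"ID": 4, "AGE": 39, "BT": "0"}

# After mediation both rows share one vocabulary and can be mined together.
unified = [to_ontology_terms("PATIENTS", row_x), to_ontology_terms("PAT_TAB", row_z)]
```

A query phrased against the ontology (for example, "Age and BloodType of every Patient") can be rewritten per site with `local_columns`, and the answers merged with `to_ontology_terms`.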
Distributed Knowledge base
[Figure: classes and properties from different knowledge bases, identified by URI and linked across sites. Classes: uri:fooX#Patient, uri:fooY#Human, uri:fooX#Ill_Person. Property: uri:fooZ#Temperature. Relations shown: is subclass, has property, is same class as.]
Agent Grid Service
• Gives the system the ability to communicate with the outside world in standard languages
  – FIPA Standards
  – ACL – Agent Communication Language
  – KQML – Knowledge Query and Manipulation Language
• Agent Platform (JADE, FIPA-OS)
• Agents
  – Domain Knowledge Agent
  – Knowledge Explorer Agent
  – End-user Agent (personal)
Querying
• End-user agent with own ontology – subset of ontology, merging of ontologies
• End-user agent without own ontology – negotiating about domain of interest
• Queries created from ontology
• Templates
  <Patient rdf:ID="ID001">
    <Temperature/>
  </Patient>
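One possible reading of such a template is that empty elements mark the values being requested. The sketch below answers a template under that assumption: the instance store, the temperature value, and the `answer` function are hypothetical, and the `xmlns:rdf` declaration is added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical instance data held by the knowledge base.
INSTANCES = {("Patient", "ID001"): {"Temperature": "38.2", "Age": "47"}}

# The template from the slide, with the rdf namespace declared.
TEMPLATE = f'<Patient xmlns:rdf="{RDF}" rdf:ID="ID001"><Temperature/></Patient>'


def answer(template_xml):
    """Fill every empty child element of the template from the instance store."""
    query = ET.fromstring(template_xml)
    instance = INSTANCES[(query.tag, query.get(f"{{{RDF}}}ID"))]
    for slot in query:
        if slot.text is None:  # empty element marks a requested value
            slot.text = instance.get(slot.tag)
    return query


filled = answer(TEMPLATE)
```

The filled element tree has the same shape as the query, so the asking agent can read the answer using the ontology terms it already knows.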
Answers
• Mined Knowledge (GridMiner)
  – Decision trees/rules
    » (clinical pathways)
  – Association rules
• Instances of domain ontology
  – Particular data
  – References
  – Links to Web sites
  – Information about other knowledge providers
Case Study - Medical Application
[Figure: the End User (personal) Agent sends the question "Outcome?" together with data about the patient's condition. Participants: Knowledge Agent, Knowledge Base, Knowledge Discovery Service with GridMiner (training set and test set drawn from Hospital Databases), and Knowledge Explorer Agent (Semantic Web/Grid resources). Answer returned: probability of survival plus references to the diagnoses.]
Conclusions and Future Work
• Application and extension of the Grid technology to knowledge discovery – an important, but non-traditional Grid application domain
• Introduction of a new Grid Data Mediation Service
• Future work
  – Performance evaluation on large synthetic data volumes
  – Coupling of the Data Mining services architecture with the OLAP services architecture
  – Development of a knowledge discovery oriented Grid Workflow Language and the appropriate Workflow Engine
  – Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
  – Development of the Wisdom Grid