Tools and Techniques for the Data Grid
Ohio State University Department of Computer Science and Engineering
Gagan Agrawal The Ohio State University
Overall Motivation
• Computation has long been an integral part of any scientific discipline
  – Parallels theory and experiments
• Last 2 (or more) decades have seen Computational-X emerge
  – Major emphasis on computational modeling
  – Involved CS support for high-end computing
• In last 5-10 years, X-Informatics is emerging
  – Data-driven science and engineering applications
  – Needs CS support for high-end and distributed computing
Context: Grid Computing
• Wide area collaborations and pooling of resources
• Natural synergy with data-intensive applications
  – Wide-area sharing of data
  – Using distributed resources for data analysis
  – Stage multiple tasks: data generation, processing, visualization
Scientific Data Analysis on (Grid-based) Data Repositories
• Scientific data repositories
  – Large volume
    » Gigabyte, Terabyte, Petabyte
  – Distributed datasets
    » Generated/collected by scientific simulations or instruments
  – Data could be streaming in nature
• Scientific data analysis
  – Data Specification
  – Data Organization
  – Data Extraction
  – Data Movement
  – Data Analysis
  – Data Visualization
Opportunities
• Scientific simulations and data collection instruments generating large scale data
• Rapidly increasing wide-area bandwidths
• Grid standards enabling sharing of data
• Service/grid model of computing
  – Plug and play application modules / data sources
Existing Efforts
• Data grids recognized as important component of grid/distributed computing
• Major topics
  – Efficient/Secure Data Movement
  – Replica Selection
  – Metadata catalogs / Metadata services
  – Setting up workflows
Open Issues
• Accessing / Retrieving / Processing data from scientific repositories
  – Need to deal with low-level formats
• Integrating tools and services having/requiring data with different formats
• Support for processing streaming data in a distributed environment
• Developing scalable data analysis applications
Ongoing Projects
• Automatic Data Virtualization
• On the fly data integration in a distributed environment
• Middleware for Processing Streaming Data
• Compiling XQuery on Scientific and Streaming Data
• Middleware for Scalable Data Processing
• Data Mining Algorithms and Systems
Coastal Forecasting and Change Detection (Lake Erie)
An Example Application Scenario
Outline
• Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
• Data Integration
• Middleware for Streaming Data
• Cluster and Grid-based data mining middleware
Automatic Data Virtualization: Motivation
• Emergence of grid-based data repositories
  – Can enable sharing of data in an unprecedented way
• Access mechanisms for remote repositories
  – Complex low-level formats make accessing and processing of data difficult
• Main desired functionality
  – Ability to select, download, and process a subset of data
Data Virtualization
An abstract view of data
Figure labels: dataset, Data Service, Data Virtualization
By Global Grid Forum’s DAIS working group:
• A Data Virtualization describes an abstract view of data.
• A Data Service implements the mechanism to access and process data through the Data Virtualization.
Our Approach: Automatic Data Virtualization
• Automatically create data services
  – A new application of compiler technology
• A metadata descriptor describes the layout of data on a repository
• An abstract view is exposed to the users
• Two implementations:
  – Relational/SQL-based
  – XML/XQuery based
System Overview
SELECT <Data Elements>
FROM <Dataset Name>
WHERE … AND Filter(<Data Element>);
Design a Meta-data Description Language
• Requirements
  – Specify the relationship of a dataset to the virtual dataset schema
  – Describe the dataset physical layout within a file
  – Describe the dataset distribution on nodes of one or more clusters
  – Specify the subsetting index attributes
  – Easy to use for data repository administrators and also convenient for our code generation
Design Overview
• Dataset Schema Description Component
• Dataset Storage Description Component
• Dataset Layout Description Component
An Example
• Oil Reservoir Management
  – The dataset comprises several simulations on the same grid
  – For each realization and each grid point, a number of attributes are stored.
  – The dataset is stored on a 4 node cluster.

Component I: Dataset Schema Description
  [IPARS]            // {* Dataset schema name *}
  REL = short int    // {* Data type definition *}
  TIME = int
  X = float
  Y = float
  Z = float
  SOIL = float
  SGAS = float

Component II: Dataset Storage Description
  [IparsData]        // {* Dataset name *}
  // {* Dataset schema for IparsData *}
  DatasetDescription = IPARS
  DIR[0] = osu0/ipars
  DIR[1] = osu1/ipars
  DIR[2] = osu2/ipars
  DIR[3] = osu3/ipars
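To make Components I and II concrete, here is a minimal Python sketch that reads the two descriptor sections into dictionaries. The parser, the function names `parse_descriptor` and `storage_dirs`, and the assumption that `//` always starts a comment are illustrative only; they are not taken from the actual tool.

```python
import re

DESCRIPTOR = """\
[IPARS]            // {* Dataset schema name *}
REL = short int    // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

[IparsData]        // {* Dataset name *}
DatasetDescription = IPARS
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
"""

def parse_descriptor(text):
    """Parse '[Section]' headers and 'key = value' lines; '//' starts a comment (assumption)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            current = sections.setdefault(header.group(1), {})
            continue
        if current is None or "=" not in line:
            raise ValueError(f"unexpected line: {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        current[key] = value
    return sections

def storage_dirs(section):
    """Collect DIR[i] entries into a list ordered by i."""
    dirs = {}
    for key, value in section.items():
        match = re.fullmatch(r"DIR\[(\d+)\]", key)
        if match:
            dirs[int(match.group(1))] = value
    return [dirs[i] for i in sorted(dirs)]

sections = parse_descriptor(DESCRIPTOR)
# The storage section names its schema through DatasetDescription.
schema = sections[sections["IparsData"]["DatasetDescription"]]
dirs = storage_dirs(sections["IparsData"])
```

Here `schema` maps each attribute name to its declared type and `dirs` lists the four cluster directories in index order.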
Data Layout Description Component
Figure labels: Dataset Root; dataset 1, dataset 2, dataset 3; Data1, Data2, Data3, Data4, Data5, Data6

  DATASET "ROOT" {
    DATATYPE { … }
    DATAINDEX { … }
    DATA { DATASET dataset1 DATASET dataset2 DATASET dataset3 }
    DATASET "dataset1" {
      DATATYPE { … }
      DATASPACE { … }
      DATA { data1 data2 data3 }
    }
    DATASET "dataset2" {
      DATATYPE { … }
      DATASPACE { … }
      DATA { data4 }
    }
    DATASET "dataset3" {
      ….
    }
  }
An Example
• Oil Reservoir Management
  – Use LOOP keyword for capturing the repetitive structure within a file.
  – The grid has 4 partitions (0~3).
  – "IparsData" comprises "ipars1" and "ipars2". "ipars1" describes the data files with the spatial coordinates stored; "ipars2" specifies the data files with other attributes stored.

Component III: Dataset Layout Description
  DATASET "IparsData" {                 //{* Name for Dataset *}
    DATATYPE { IPARS }                  //{* Schema for Dataset *}
    DATAINDEX { REL TIME }
    DATA { DATASET ipars1 DATASET ipars2 }
    DATASET "ipars1" {
      DATASPACE {
        LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
          X Y Z
        }
      }
      DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
    } //{* end of DATASET "ipars1" *}
    DATASET "ipars2" {
      DATASPACE {
        LOOP TIME 1:500:1 {
          LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
            SOIL SGAS
          }
        }
      }
      DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
    } //{* end of DATASET "ipars2" *}
  }
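The layout description can be read as a recipe for file names and byte offsets. The Python sketch below expands the `$DIR[$DIRID]/DATA$REL` template and computes where one (SOIL, SGAS) record would sit in an "ipars2" file. It rests on several assumptions that the slides do not state: `lo:hi:step` ranges are inclusive at both ends, floats occupy 4 bytes, files have no header, and the inner GRID loop varies fastest. All function names are hypothetical.

```python
from itertools import product

DIRS = ["osu0/ipars", "osu1/ipars", "osu2/ipars", "osu3/ipars"]
FLOAT_BYTES = 4  # assumption: 4-byte floats

def expand(lo, hi, step):
    """Expand lo:hi:step, assuming both bounds are inclusive."""
    return range(lo, hi + 1, step)

def grid_range(dirid):
    """Grid points held by one partition: ($DIRID*100+1):(($DIRID+1)*100):1."""
    return expand(dirid * 100 + 1, (dirid + 1) * 100, 1)

def ipars2_files():
    """All files matching $DIR[$DIRID]/DATA$REL for $REL = 0:3:1 and $DIRID = 0:3:1."""
    return [
        f"{DIRS[dirid]}/DATA{rel}"
        for rel, dirid in product(expand(0, 3, 1), expand(0, 3, 1))
    ]

def ipars2_offset(dirid, time, grid):
    """Byte offset of the (SOIL, SGAS) record for one time step and grid point."""
    cells = grid_range(dirid)
    if time not in expand(1, 500, 1) or grid not in cells:
        raise ValueError("outside the described data space")
    record_bytes = 2 * FLOAT_BYTES  # SOIL and SGAS
    return ((time - 1) * len(cells) + (grid - cells.start)) * record_bytes
```

For example, under these assumptions the record for time step 2 at grid point 101 in partition 1 starts at byte 800, because one full time step of 100 records (8 bytes each) precedes it.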
Automatic Virtualization Using Meta-data
• Aligned file chunks
  {num_rows, {File1, Offset1, Num_Bytes1}, {File2, Offset2, Num_Bytes2}, …, {Filem, Offsetm, Num_Bytesm}}
• Our tool parses the meta-data descriptor and generates function codes. At run time, the query would provide parameters to invoke the generated functions to create Aligned File Chunks.
Figure labels: Dataset Root; dataset 1, dataset 2, dataset 3; Data1, Data2, Data3, Data4, Data5, Data6
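One way to picture an aligned file chunk is as a row count plus one (file, offset, byte count) triple per file, so that row i of the virtual table is assembled from the i-th record of every segment. The Python sketch below is an assumption about how such a structure could be consumed; the class names, the in-memory "files", and the little-endian float formats are all hypothetical.

```python
import struct
from dataclasses import dataclass

@dataclass
class Segment:
    file: str        # file name
    offset: int      # byte offset of the first row in this file
    num_bytes: int   # total bytes this segment covers

@dataclass
class AlignedFileChunk:
    num_rows: int
    segments: list

    def rows(self, files, formats):
        """Yield one tuple per row by reading the i-th record of every segment."""
        for i in range(self.num_rows):
            row = ()
            for seg, fmt in zip(self.segments, formats):
                width = seg.num_bytes // self.num_rows
                start = seg.offset + i * width
                row += struct.unpack(fmt, files[seg.file][start:start + width])
            yield row

# Hypothetical in-memory files: 3 rows of (X, Y, Z) and 3 rows of (SOIL, SGAS).
files = {
    "COORDS": struct.pack("<9f", 0, 0, 0, 1, 0, 0, 2, 0, 0),
    "DATA0": struct.pack("<6f", 0.5, 0.25, 0.5, 0.25, 0.75, 0.125),
}
# A chunk covering the last two rows of both files.
chunk = AlignedFileChunk(
    num_rows=2,
    segments=[Segment("COORDS", 12, 24), Segment("DATA0", 8, 16)],
)
rows = list(chunk.rows(files, ["<3f", "<2f"]))
```

Each resulting row concatenates (X, Y, Z) from the coordinates file with (SOIL, SGAS) from the data file, which is the alignment the chunk description encodes.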
Compiler Analysis
Figure labels: Meta-data descriptor; Create AFC; Process AFC; Index & Extraction function code

  Data_Extract {
    Find_File_Groups()
    Process_File_Groups()
  }

  Find_File_Groups {
    Let S be the set of files that match against the query
    Classify files in S by the set of attributes they have
    Let S1, …, Sm be the m sets
    T = Ø
    foreach {s1, …, sm}, si ∈ Si {    {* cartesian product between S1, …, Sm *}
      If the values of implicit attributes are not inconsistent {
        T = T ∪ {s1, …, sm}
      }
    }
    Output T
  }

  Process_File_Groups {
    foreach {s1, …, sm} ∈ T
      Find_Aligned_File_Chunks()
      Supply implicit attributes for each file chunk
      foreach Aligned File Chunk {
        Check against index
        Compute offset and length
        Output the aligned file chunk
      }
  }
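The Find_File_Groups step can be sketched in Python as follows. The file records, with their `attrs` and `implicit` fields, are a hypothetical representation chosen for illustration; only the classify, cartesian product, and consistency check steps come from the pseudocode.

```python
from itertools import product

def find_file_groups(files):
    """Group files that together supply all attributes and agree on implicit values.

    files: list of dicts with 'name', 'attrs' (frozenset of stored attributes),
    and 'implicit' (dict of implicit attribute values such as DIRID or REL).
    """
    # Classify files by the set of attributes they store.
    classes = {}
    for f in files:
        classes.setdefault(f["attrs"], []).append(f)

    groups = []
    # Cartesian product between the classes S1, ..., Sm.
    for combo in product(*classes.values()):
        merged, consistent = {}, True
        for f in combo:
            for key, value in f["implicit"].items():
                # A group is rejected if two files disagree on an implicit attribute.
                if merged.setdefault(key, value) != value:
                    consistent = False
        if consistent:
            groups.append(tuple(f["name"] for f in combo))
    return groups

coords = frozenset({"X", "Y", "Z"})
data = frozenset({"SOIL", "SGAS"})
files = [
    {"name": "osu0/ipars/COORDS", "attrs": coords, "implicit": {"DIRID": 0}},
    {"name": "osu1/ipars/COORDS", "attrs": coords, "implicit": {"DIRID": 1}},
    {"name": "osu0/ipars/DATA0", "attrs": data, "implicit": {"DIRID": 0, "REL": 0}},
    {"name": "osu1/ipars/DATA0", "attrs": data, "implicit": {"DIRID": 1, "REL": 0}},
]
groups = find_file_groups(files)
```

Of the four candidate pairs, only the two whose files share the same DIRID survive, so each coordinates file is matched with the data file from the same partition.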
Outline
• Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
• Information Integration
• Middleware for Streaming Data
• Coarse-grained pipelined parallelism
XML/XQuery Implementation
Figure labels: TEXT, …, NetCDF, RMDB, HDF5; XML; XQuery; ???
Programming/Query Language
• High-level declarative languages ease application development
  – Popularity of Matlab for scientific computations
• New challenges in compiling them for efficient execution
• XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages!
  – XPath (a subset of XQuery) embedded in an imperative language is another option
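As a small illustration of the last option, the Python sketch below embeds a path expression in imperative code using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The XML element and attribute names (`ipars`, `cell`, `soil`, `time`) are invented for this example and do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of a few simulation cells.
DOC = """
<ipars>
  <cell time="1" x="0.0" y="0.0" z="0.0"><soil>0.5</soil><sgas>0.25</sgas></cell>
  <cell time="1" x="1.0" y="0.0" z="0.0"><soil>0.75</soil><sgas>0.125</sgas></cell>
  <cell time="2" x="0.0" y="0.0" z="0.0"><soil>0.25</soil><sgas>0.5</sgas></cell>
</ipars>
"""

def mean_soil(root, time):
    """Select soil values for one time step with a path expression, then reduce in Python."""
    cells = root.findall(f"./cell[@time='{time}']/soil")
    values = [float(c.text) for c in cells]
    return sum(values) / len(values) if values else None

root = ET.fromstring(DOC)
```

The selection is declarative (the path expression) while the reduction is ordinary imperative code, which is the division of labour this option implies.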
Approach / Contributions
• Use of XML Schemas to provide high-level abstractions on complex datasets
• Using XQuery with these Schemas to specify processing
• Issues in Translation
  – High-level to low-level code
  – Data-centric transformations for locality in low-level codes
  – Issues specific to XQuery
    » Recognizing recursive reductions
    » Type inferencing and translation
System Architecture
Figure labels: External Schema, XQuery Sources, Compiler, XML Mapping Service, logical XML schema, physical XML schema, C++/C
Outline
• Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
• Information Integration
• Middleware for Streaming Data
• Cluster and Grid-based data mining middleware
Data Integration: Overall Goal
• Tools for data integration driven by:
  – Data explosion
    » Data size & number of data sources
  – New analysis tools
  – Autonomous resources
    » Heterogeneous data representation & various interfaces
  – Frequent Updates
  – Common Situations:
    » Flat-file datasets
    » Ad-hoc sharing of data
Current Approaches
• Manually written wrappers
  – Problems
    » O(N²) wrappers needed, O(N) for a single update
• Mediator-based integration systems
  – Problems
    » Need a common intermediate format
    » Unnecessary data transformation
• Integration using web/grid services
    » Needs all tools to be web-services (all data in XML?)
Our Approach
• Automatically generate wrappers
  – Stand-alone programs
  – For integrated DBs, (grid) workflow systems
• Transform data in files of arbitrary formats
  – No domain- or format-specific heuristics
  – Layout information provided by users
• Help biologists write layout descriptors using data mining techniques
• Particularly attractive for
  – flat-file datasets
  – ad hoc data sharing
  – data grid environments
Our Approach: Advantages
• Advantages:
  – No DB or query support required
  – One descriptor per resource needed
  – No unnecessary transformation
  – New resources can be integrated on-the-fly
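The "one descriptor per resource" advantage can be put next to the wrapper counts quoted for manually written wrappers. The Python sketch below works through the arithmetic, assuming one hand-written wrapper per ordered pair of formats; that counting convention is an assumption, and the function names are hypothetical.

```python
def pairwise_wrappers(n):
    """Hand-written wrappers if every format converts directly to every other one."""
    return n * (n - 1)

def wrappers_touched_by_update(n):
    """Wrappers to revise when one format changes: those reading it plus those writing it."""
    return 2 * (n - 1)

def descriptors(n):
    """Layout descriptors needed when each resource is described once."""
    return n

for n in (5, 10, 20):
    print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), descriptors(n))
```

With 10 formats this gives 90 wrappers and 18 to revise after a single format change, against 10 descriptors with one to revise.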
Our Approach: Challenges
• Description language
  – Format and logical view of data in flat files
  – Easy to interpret and write
• Wrapper generation and Execution
  – Correspondence between data items
  – Separating wrapper analysis and execution
• Interactive tools for writing layout descriptors
  – What data mining techniques to use?
Wrapper Generation System Overview
Figure labels: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, Synchronizer, DataWriter, Source Dataset, Target Dataset
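A heavily simplified Python sketch of the reader, mapping, and writer roles named in the figure follows. It assumes delimited flat files and a layout given as field names plus a delimiter; the real layout descriptors are richer, and every name and sample record below is hypothetical. Building the wrapper once and then applying it mirrors the separation of wrapper analysis from execution.

```python
def make_reader(fields, delimiter):
    """Build a reader for a delimited flat file from its layout (field names + delimiter)."""
    def read(lines):
        for line in lines:
            yield dict(zip(fields, line.rstrip("\n").split(delimiter)))
    return read

def make_writer(fields, delimiter):
    """Build a writer that emits entries in the target layout."""
    def write(entries):
        for entry in entries:
            yield delimiter.join(entry[name] for name in fields)
    return write

def make_wrapper(source_layout, target_layout, mapping):
    """Analysis step: combine layouts and a schema mapping (target field -> source field)."""
    read = make_reader(*source_layout)
    write = make_writer(*target_layout)

    def wrap(lines):
        # Execution step: read each entry, rename fields, write in the target layout.
        mapped = ({t: e[s] for t, s in mapping.items()} for e in read(lines))
        return list(write(mapped))
    return wrap

wrapper = make_wrapper(
    source_layout=(["id", "organism", "sequence"], "\t"),
    target_layout=(["accession", "seq"], ","),
    mapping={"accession": "id", "seq": "sequence"},
)
out = wrapper(["P1\tE. coli\tMKT\n", "P2\tYeast\tMAL\n"])
```

Here `out` holds the two records rewritten as comma-separated `accession,seq` lines, with the unmapped `organism` field dropped.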
Outline
• Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
• Information Integration
• Middleware for Streaming Data
• Coarse-grained pipelined parallelism
Streaming Data Model
• Continuous data arrival and processing
• Emerging model for data processing
  – Sources that produce data continuously: sensors, long running simulations
  – WAN bandwidths growing faster than disk bandwidths
• Active topic in many computer science communities
  – Databases
  – Data Mining
  – Networking ….
Summary/Limitations of Current Work
• Focus on
  – centralized processing of stream from a single source (databases, data mining)
  – communication only (networking)
• Many applications involve
  – distributed processing of streams
  – streams from multiple sources
Motivating Application
Network Fault Management System
Figure labels: Switch Network, X
Motivating Application (2)
Computer Vision Based Surveillance
Features of Distributed Streaming Processing Applications
• Data sources could be distributed
  – Over a WAN
• Continuous data arrival
• Enormous volume
  – Probably can’t communicate it all to one site
• Results from analysis may be desired at multiple sites
• Real-time constraints
  – A real-time, high-throughput, distributed processing problem
Need for a Grid-Based Stream Processing Middleware
• Application developers interested in data stream processing
  – Would like to have abstracted
    » Grid standards and interfaces
    » Adaptation function
  – Would like to focus on algorithms only
• GATES is a middleware for grid-based, self-adapting data stream processing
Adaptation for Real-time Processing
• Analysis on streaming data is approximate
• Accuracy and execution rate trade-off can be captured by certain parameters (adaptation parameters)
  – Sampling Rate
  – Size of summary structure
• Application developers can expose these parameters and a range of values
API for Adaptation
  public class Sampling-Stage implements StreamProcessing {
    …
    void init() { … }
    …
    void work(buffer in, buffer out) {
      …
      while (true) {
        Image img = get-from-buffer-in-GATES(in);
        Image img-sample = Sampling(img, sampling-ratio);
        put-to-buffer-in-GATES(img-sample, out);
      }
      …
    }
  }
Calls highlighted on the slide:
  GATES.Information-About-Adjustment-Parameter(min, max, 1)
  sampling-ratio = GATES.getSuggestedParameter();
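The slides do not say how a suggested parameter value is chosen. As one simple possibility, and not the actual GATES algorithm, the Python sketch below moves a sampling ratio within its exposed range depending on how full the stage's input queue is. The class name, the watermark thresholds, and the stride-based sampler are all assumptions made for illustration.

```python
class AdaptationParameter:
    """A tunable value with a range and step exposed by the application."""

    def __init__(self, minimum, maximum, step, value=None):
        self.minimum, self.maximum, self.step = minimum, maximum, step
        self.value = maximum if value is None else value

    def suggest(self, queue_length, low_water, high_water):
        """Lower the value when the input queue grows, raise it when the queue drains."""
        if queue_length > high_water:
            self.value = max(self.minimum, self.value - self.step)
        elif queue_length < low_water:
            self.value = min(self.maximum, self.value + self.step)
        return self.value

def sample(items, ratio):
    """Keep roughly `ratio` of the items by taking every k-th one."""
    stride = max(1, round(1 / ratio))
    return items[::stride]

# A stage exposes a sampling ratio between 0.25 and 1.0, adjustable in steps of 0.25.
ratio = AdaptationParameter(minimum=0.25, maximum=1.0, step=0.25)
```

A stage's work loop would call `ratio.suggest(...)` before each batch and pass the result to `sample`, trading accuracy for throughput when the queue backs up.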
Outline
• Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
• Information Integration
• Middleware for Streaming Data
• Cluster and Grid-based data mining middleware
Scalable Mining Problem
• Our understanding of what algorithms and parameters will give desired insights is often limited
• The time required for creating scalable implementations of different algorithms and running them with different parameters on large datasets slows down the data mining process
Mining in a Grid Environment
A data mining application in a grid environment
- Needs to exploit different forms of available parallelism
- Needs to deal with different data layouts and formats
- Needs to adapt to resource availability
FREERIDE Overview
• Framework for Rapid Implementation of datamining engines
• Demonstrated for a variety of standard mining algorithms
• Targeted distributed memory parallelism, shared memory parallelism, and combination
• Can be used as basis for scalable grid-based data mining implementations
• Published in SDM 01, SDM 02, SDM 03, Sigmetrics 02, Europar 02, IPDPS 03, IEEE TKDE (to appear)
FREERIDE-G
• Data processing may not be feasible where the data resides
• Need to identify resources for data processing
• Need to abstract data retrieval, movement and parallel processing
Students Involved
• Recent Ph.D. Grads (2005-06)
  – Ruoming Jin (Kent State University)
  – Wei Du (Yahoo)
  – Xiaogang Li (Ask.com)
  – Liang Chen (Amazon)
  – Li Weng (Oracle)
• Current Students:
  – Xuan Zhang (graduating Winter 07)
  – Kaushik Sinha (joint with Misha Belkin)
  – Leonid Glimcher (4th year)
  – Qian Zhu (3rd year)
  – Wenjing Ma (2nd year)
  – David Chiu (2nd year)
  – Fan Wang (2nd year)
Some Newer Topics
• Resource allocation, fault tolerance, and process migration in GATES (Qian Zhu)
• FREERIDE-G using SRB (Leonid Glimcher)
• FREERIDE on newer architectures (Wenjing Ma)
• Deep web mining (for bioinformatics) (Fan Wang)
• Service-oriented composition of data and services (David Chiu)
Summary
• Distributed data-driven science:
  – We have a long way to go
• The holy grail will be
  – The system finds all relevant data for you
  – The system finds all relevant analysis tools for you
  – The system best uses all possible resources to give you the fastest response
  – Does all of this transparently to you!
• We will never get there, but the journey is interesting ….