A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments

Athanasia Asiki, Katerina Doka, Ioannis Konstantinou, Antonis Zissimos and Nectarios Koziris

National Technical University of Athens
School of Electrical and Computer Engineering
Computing Systems Laboratory
e-mail: {nasia, katerina, ikons, azisi, nkoziris}@cslab.ece.ntua.gr

CGW’07, 15 – 16 October 2007, Cracow, Poland

Abstract

“A service-oriented architecture of a generic middleware platform, which provides the required services for efficient content storage, search and retrieval in a distributed environment.”

Algorithms from Peer-to-Peer computing are introduced in a Grid environment to ensure scalability, fault tolerance and data availability despite node arrivals and departures

The system consists of heterogeneous resources belonging to different Virtual Organizations

Outline

Introduction

Challenges and requirements

Overall architecture

A multidimensional indexing scheme

The Distributed Replica Location Service

GridTorrent protocol

A brief introduction

Grid computing
  Remotely located, disjoint and diverse processing and data storage facilities are integrated under a common software architecture (middleware)
  Resources are connected to a shared network and provide the software-level services needed to use and administer them remotely
  Resources are heterogeneous
  Rules and policies define the sharing of resources

P2P computing
  Oriented towards the sharing of large amounts of data
  Files are stored in a dynamic set of peers, which may join or leave the network
  No centralized structures

Our approach

Main idea
  Exploit Peer-to-Peer techniques to build a service-oriented infrastructure for data management and search

Features of the proposed system
  A powerful metadata search mechanism supporting both point and range queries
  A data transfer mechanism for efficient storage and retrieval of data in distributed and heterogeneous resources
  A distributed Replica Location Service to keep track of file locations

Motivation
  The design of the proposed architecture has been largely motivated by the requirements of the Gredia research project (http://www.gredia.eu/?Page=home)

Challenges and requirements

The data needs to be partitioned among the nodes according to a strategy that ensures
  load balance
  effective query processing

No centralized structure provides an overall view of the system

Efficient query routing is required, so that
  a small number of nodes process each query
  the number of exchanged messages remains relatively small
  data locality is preserved, namely related data are kept in the same node if feasible

New advanced features should not affect the maintenance cost of the overlay

The large size of the data, along with limitations in storage capacity and network bandwidth, should be considered and a more flexible structure should be adopted
  Metadata and data will be stored in different overlays

Architecture (1)

Three different overlays will be implemented:
  Metadata overlay
  DRLS overlay
  Storage overlay

The Metadata overlay and the DRLS overlay are implemented with the required extensions to the Kademlia DHT

The Storage overlay comprises a distributed repository

Kademlia DHT
  PING, STORE, FIND_NODE and FIND_VALUE RPCs
  A hash function is used to assign keys to the values stored in the DHT
  Distance between points in the key space is defined by an XOR metric
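The XOR metric can be illustrated with a minimal sketch. It assumes SHA-1 as the hash function and 160-bit identifiers, as in the original Kademlia design; the helper names `key_for`, `xor_distance` and `closest_nodes` are hypothetical and are not taken from the slides.

```python
import hashlib

ID_BITS = 160  # identifier length assumed here, as in the original Kademlia design


def key_for(value: bytes) -> int:
    """Derive a 160-bit key from a value (SHA-1 is an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(value).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Distance between two points of the key space: bitwise XOR read as an integer."""
    return a ^ b


def closest_nodes(node_ids, key, count=3):
    """Return the `count` node identifiers closest to `key` under the XOR metric."""
    return sorted(node_ids, key=lambda node_id: xor_distance(node_id, key))[:count]
```

A STORE would then be sent to the nodes returned by `closest_nodes`, which is why values end up on nodes whose identifiers are close to the key.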

Architecture (2)

A “data file” is described by a predefined set of attributes included in its “metadata file”

Upload a file
  The file is assigned a unique identifier
  The unique identifier is included in the “metadata file”
  The “data file” is uploaded to the Storage overlay by the GridTorrent mechanism
  The “metadata file” is inserted in the Metadata overlay
  The physical location(s) where the file is stored are inserted in the DRLS overlay

[Figure: the Storage overlay, the DRLS overlay and the Metadata overlay]

Architecture (3)

Search procedure
  The search mechanism is applied in the Metadata overlay
  A user can search the metadata files according to his / her criteria and select the data file(s) of interest
  The physical location(s) of the “data file” replicas are returned by the DRLS overlay
  The GridTorrent protocol downloads a file, exploiting sharing properties in order to boost aggregate performance

[Figure: a client searches metadata files in the Metadata overlay, searches the physical locations of a data file in the DRLS overlay, and retrieves a data item with GridTorrent from the Storage overlay]
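The upload and search procedures can be summarised in a toy sketch. Plain dictionaries stand in for the three overlays, a random UUID stands in for the unique identifier, and the class and method names are hypothetical; none of this reflects the actual DHT-based implementation.

```python
import uuid


class Platform:
    """Toy stand-in for the three overlays; dicts replace the distributed structures."""

    def __init__(self):
        self.metadata_overlay = {}  # file id -> metadata attributes
        self.drls_overlay = {}      # file id -> list of physical locations
        self.storage_overlay = {}   # physical location -> file contents

    def upload(self, data: bytes, attributes: dict, location: str) -> str:
        file_id = uuid.uuid4().hex                    # unique identifier (assumed scheme)
        metadata = dict(attributes, file_id=file_id)  # identifier goes into the metadata file
        self.storage_overlay[location] = data         # done by GridTorrent in the slides
        self.metadata_overlay[file_id] = metadata
        self.drls_overlay.setdefault(file_id, []).append(location)
        return file_id

    def search(self, **criteria):
        """Return the identifiers of files whose metadata match every criterion."""
        return [m["file_id"] for m in self.metadata_overlay.values()
                if all(m.get(k) == v for k, v in criteria.items())]

    def retrieve(self, file_id: str) -> bytes:
        """Ask the DRLS stand-in for locations, then fetch from the first one."""
        locations = self.drls_overlay[file_id]
        return self.storage_overlay[locations[0]]
```

For example, after `fid = platform.upload(b"...", {"type": "image"}, "siteA:/data/f1")`, a call to `platform.search(type="image")` returns `[fid]` and `platform.retrieve(fid)` returns the stored bytes.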

Space Filling Curves

A continuous mapping of a compact interval onto a d-dimensional space

The d-dimensional space is partitioned into 2^(kd) cells, which in turn are mapped through the Space Filling Curve to 2^(kd) points of a single dimension

Recursive nature (generation of the curve)

Preservation of locality: points that are close in the 1-dimensional space are mapped to points that are close together in the d-dimensional space

[Figure: the cells of a 2-dimensional space, with coordinates 00 to 11 on each axis, labelled with the 4-bit curve indices 0000 to 1111]
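The mapping from cells to one-dimensional indices can be sketched as follows. The slides do not name a specific curve, so this sketch uses the Z-order (Morton) curve as an assumed stand-in; `bits` plays the role of k and `len(coords)` the role of d, and the function names are hypothetical.

```python
def z_order_index(coords, bits):
    """Interleave the bits of each coordinate into a single curve index."""
    index = 0
    for bit in range(bits - 1, -1, -1):  # most significant bit first
        for c in coords:
            index = (index << 1) | ((c >> bit) & 1)
    return index


def z_order_coords(index, dims, bits):
    """Invert z_order_index: recover the cell coordinates from a curve index."""
    coords = [0] * dims
    total = dims * bits
    for pos in range(total):
        bit_val = (index >> (total - 1 - pos)) & 1
        coords[pos % dims] = (coords[pos % dims] << 1) | bit_val
    return tuple(coords)
```

With d = 2 and k = 2 the 16 cells receive the indices 0000 to 1111; for instance the cell (10, 11) maps to 1101.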

Multidimensional indexing

The set of d attributes chosen to be indexed forms a d-dimensional space

Each combination of attribute values is mapped to a point of the d-dimensional space

The key of a metadata file is produced by the Space Filling Curve

Query processing
  The clusters of the Space Filling Curve that answer the query are identified
  A lookup is issued for the specified clusters
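A range query can be turned into clusters as in the naive sketch below. It assumes that a cluster is a run of consecutive curve indices, that the curve is the Z-order (Morton) curve, and that enumerating every cell of the query box is acceptable; a real system would refine the curve recursively instead. All names are hypothetical.

```python
from itertools import product


def z_order_index(coords, bits):
    """Interleave the bits of each coordinate into a single curve index."""
    index = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            index = (index << 1) | ((c >> bit) & 1)
    return index


def query_clusters(ranges, bits):
    """Return sorted (start, end) runs of curve indices covering the query box.

    `ranges` holds one inclusive (low, high) pair per indexed attribute.
    """
    indices = sorted(
        z_order_index(cell, bits)
        for cell in product(*(range(lo, hi + 1) for lo, hi in ranges))
    )
    clusters = []
    for idx in indices:
        if clusters and idx == clusters[-1][1] + 1:
            clusters[-1][1] = idx         # extend the current run
        else:
            clusters.append([idx, idx])   # start a new run
    return [tuple(c) for c in clusters]
```

A point query is the special case where every range has equal bounds, giving a single cluster of length one. Each returned interval would then be looked up in the overlay.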

Load balance
  A virtual server is an entity that owns an interval of the identifier space
  Each physical node contains multiple virtual servers
  When a physical node is overloaded in terms of available storage space or bandwidth, it may move one or more of its virtual servers to another, underloaded physical node
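The virtual-server idea can be sketched as below. The slides do not specify how a target node is chosen, so this sketch assumes a simple greedy policy: an overloaded node gives away its lightest virtual servers to the least loaded node that can still hold them. The classes, the single `load` figure and the policy are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualServer:
    interval: tuple  # (start, end) of the identifier space it owns
    load: int        # abstract load figure, e.g. stored bytes (assumed unit)


@dataclass
class PhysicalNode:
    name: str
    capacity: int
    servers: list = field(default_factory=list)

    @property
    def load(self):
        return sum(vs.load for vs in self.servers)


def rebalance(nodes):
    """Move virtual servers off overloaded nodes; return the list of moves made."""
    moves = []
    for node in nodes:
        for vs in sorted(node.servers, key=lambda v: v.load):  # lightest first
            if node.load <= node.capacity:
                break
            candidates = [n for n in nodes
                          if n is not node and n.load + vs.load <= n.capacity]
            if not candidates:
                continue
            target = min(candidates, key=lambda n: n.load)
            node.servers.remove(vs)
            target.servers.append(vs)
            moves.append((vs.interval, node.name, target.name))
    return moves
```

Because a virtual server carries its whole interval with it, moving one shifts both the keys and the associated load in a single step.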

DRLS (1)

A Distributed Replica Location Service
  A DHT whose key-value pairs map the unique identifier of a file to its Physical File Names (PFNs)

Problems
  DHTs store read-only files, while mutable data cannot be handled
  The replication strategy causes key-value pairs to be stored in nodes whose IDs are close to the key and to be cached around the network
  The exact location(s) of a key-value pair at a given moment cannot be returned and the update of all replicas is not ensured
  PFN mappings for a given Logical File Name (LFN) change frequently

DRLS (2)

Solution
  Every lookup queries all nodes responsible for a specific key-value pair
  The lookup procedure does not stop at the first returned value, and peers that do not reply with a value are considered not up to date
  When all available results are returned, the querying node compares them based on a predefined version vector (indicating the latest update of the value)
  The querying node propagates the update to the nodes it has found responsible for storage but not yet up to date with the latest value
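The modified lookup can be sketched as follows. For brevity the sketch replaces the version vector with a single integer version counter, which is a simplification, and the `ReplicaNode` class and `lookup` function are hypothetical names.

```python
class ReplicaNode:
    """A node responsible for storing some key-value pairs."""

    def __init__(self):
        self.store = {}  # key -> (version, value)

    def find_value(self, key):
        return self.store.get(key)  # None when the node holds no value

    def store_value(self, key, version, value):
        current = self.store.get(key)
        if current is None or current[0] < version:
            self.store[key] = (version, value)


def lookup(responsible_nodes, key):
    """Query every responsible node, keep the newest value, repair stale nodes."""
    replies = [(node, node.find_value(key)) for node in responsible_nodes]
    answers = [reply for _, reply in replies if reply is not None]
    if not answers:
        return None
    latest = max(answers, key=lambda reply: reply[0])
    for node, reply in replies:
        # nodes with no value or an older version are not up to date
        if reply is None or reply[0] < latest[0]:
            node.store_value(key, *latest)
    return latest[1]
```

The cost of this design is that a lookup always contacts all responsible nodes, in exchange for returning the latest known PFN list and repairing stale replicas as a side effect.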

The data transfer mechanism of the Storage overlay

An implementation of BitTorrent designed to interface and integrate with GridFTP

A GridTorrent client is able to request file fragments from other GridTorrent clients holding the file or to connect to GridFTP servers

Effective and scalable, even under extreme load and flash crowd conditions

Optimized replica selection based on the rarest-first policy

BitTorrent
  A peer-to-peer protocol that allows clients to download files from multiple sources while uploading them to other users at the same time
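The rarest-first policy mentioned above can be sketched as a piece-ordering function. The sketch assumes that each peer advertises the set of piece indices it holds and breaks ties by piece index for determinism, whereas real clients typically break ties randomly; the function name and data layout are hypothetical.

```python
from collections import Counter


def rarest_first(needed, peer_pieces):
    """Order the needed pieces so that those held by the fewest peers come first.

    `needed` is a set of piece indices; `peer_pieces` maps a peer name to the
    set of piece indices that peer holds. Pieces nobody holds are left out.
    """
    availability = Counter()
    for pieces in peer_pieces.values():
        availability.update(pieces & needed)
    obtainable = [p for p in needed if availability[p] > 0]
    return sorted(obtainable, key=lambda p: (availability[p], p))
```

Requesting scarce pieces first spreads them across more peers early, which helps keep the whole file available when sources leave.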

GridTorrent’s basic components

[Figure: the local node contains the RLSManager, PeerManager, DiskManager and MetadataParser components]
  DiskManager and Local Disk: store and retrieve file pieces, SHA1 check
  RLSManager and RLS: register new replica, get PFNs
  PeerManager and GridTorrent Peer: exchange pieces using the GridTorrent protocol
  PeerManager and GridFTP Peers: download pieces from GridFTP servers
  MetadataParser and XML Metadata: file unique identifier; get size, piece_size, hashes
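The SHA1 check on file pieces can be sketched with Python's standard `hashlib`. The function names are hypothetical and the layout of the hash list (one hex digest per piece, in piece order) is an assumption about what the metadata provides.

```python
import hashlib


def split_pieces(data: bytes, piece_size: int):
    """Cut the file contents into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(data: bytes, piece_size: int):
    """SHA-1 hex digest of every piece, in piece order."""
    return [hashlib.sha1(piece).hexdigest() for piece in split_pieces(data, piece_size)]


def verify_piece(piece: bytes, index: int, hashes) -> bool:
    """Check a received piece against the hash listed for its index."""
    return hashlib.sha1(piece).hexdigest() == hashes[index]
```

A piece that fails the check would be discarded and requested again, possibly from a different source.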

GridTorrent

Queries the DRLS periodically
  The DRLS returns to the GridTorrent peer the PFNs of the replicas, identified by a GridTorrent URL
    gtp://site.fully.qualified.domain.name/path/to/file
  File size, hashes of pieces, size of each piece

GridTorrent client
  The two involved peers initiate communication by exchanging the BitTorrent bit field message, informing each other of the pieces they possess
  A request message for blocks is issued

GridFTP server
  The client issues a GridFTP partial get message for the data within the specific block it intends to download
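Two details of this exchange can be sketched in isolation: packing the bit field, and computing the byte range that a partial get for one block would cover. The bit layout follows the BitTorrent convention (piece 0 is the high bit of the first byte); the function names are hypothetical, and no GridFTP call is shown because the slides do not describe the client API.

```python
def encode_bitfield(have, num_pieces):
    """Pack the set of owned piece indices into bytes, high bit first."""
    field = bytearray((num_pieces + 7) // 8)
    for index in have:
        field[index // 8] |= 0x80 >> (index % 8)
    return bytes(field)


def decode_bitfield(field, num_pieces):
    """Recover the set of piece indices a peer announced."""
    return {i for i in range(num_pieces) if field[i // 8] & (0x80 >> (i % 8))}


def block_range(piece_index, piece_size, block_offset, block_size, file_size):
    """Byte offset and length of one block, clipped at the end of the file."""
    start = piece_index * piece_size + block_offset
    length = max(0, min(block_size, file_size - start))
    return start, length
```

The same `(offset, length)` pair describes a block whether it is requested from another GridTorrent peer or fetched from a GridFTP server, which is what lets a client mix both kinds of source.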

Conclusions

The paper proposes a service-oriented architecture for efficient search and retrieval of annotated content

Different overlays are implemented and Peer-to-Peer techniques are introduced in the Grid environment

The design of the platform ensures scalability and fault-tolerance

An extensible architecture is presented favoring the integration with other systems and the development of Grid applications

Thank you for your attention