A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments...
-
Upload
shonda-oliver -
Category
Documents
-
view
228 -
download
0
Transcript of A Distributed Architecture for Multi-dimensional Indexing and Data Retrieval in Grid Environments...
A Distributed Architecture for Multi-dimensional Indexing and
Data Retrieval in Grid Environments
Athanasia Asiki, Katerina Doka,Ioannis Konstantinou, Antonis Zissimos
and Nectarios Koziris
National Technical University of AthensSchool of Electrical and Computer Engineering
Computing Systems Laboratorye-mail: {nasia, katerina, ikons, azisi, nkoziris}@cslab.ece.ntua.gr
2CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Abstract “A service-oriented architecture of a generic middleware
platform, which provides the required services for efficient content storage, search and retrieval in a distributed environment.”
Algorithms from Peer-to-Peer computing are introduced in a grid environment in order scalability, fault-tolerance and data availability despite nodes arrivals and departures to be ensured
The system consists of heterogeneous resources belonging to different Virtual Organizations
3CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Outline Introduction
Challenges and requirements
Overall architecture
A multidimensional indexing scheme
The Distributed Replica Location Service
GridTorrent protocol
4CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
A brief introduction Grid computing
Remotely located, disjoint and diverse processing and data storage facilities are integrated under common software architecture (middleware)
Resources connected to a shared network and provide the necessary software-level services to be remotely used and administered
Heterogeneity of resources Rules and policies define the sharing of resources
P2P computing Oriented towards the sharing of large amount of data Files are stored in a dynamic set of peers, which may join or the leave the
network Absence of centralized structures
5CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Our approach Main Idea
Exploitation of Peer-to-Peer techniques to build a service-oriented infrastructure for data management and search
Features of the proposed system:A powerful metadata search mechanism supporting both point and range
queries A data transfer mechanism for efficient storage and retrieval of data in
distributed and heterogeneous resources A distributed Replica Location Service to keep track of file locations
MotivationThe design of the proposed architecture has been largely motivated by the requirements posed by the Gredia research project (http://www.gredia.eu/?Page=home)
6CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Challenges and requirements The data needs to be partitioned among the nodes according to a strategy
ensuring: load balance effective query processing
Absence of centralized structures to provide an overall view of the system Efficient query routing is required so as:
A small number of nodes to process the query The number of exchanged messages to remain relative small Data locality shall be preserved, namely relative data to be kept in the same node if
feasible
New advanced features should not affect the maintenance cost of the overlay
The large size of data along with limitations in storage capacity and network bandwidth should be considered and a more flexible structure should be adopted Metadata and data will be stored in different overlays
7CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Architecture (1) Three different overlays will be implemented:
Metadata overlay DRLS overlay Storage overlay
The Metadata overlay and the DRLS overlay are implemented with the required extensions to the Kademlia DHT
The Storage overlay comprises a distributed repository
Kademlia DHT PING, STORE, FIND_NODE and FIND_VALUE RPCs A hash function is used to assign keys to values stored in the DHT Distance among points in the key space is defined by and XOR metric
8CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Architecture (2) A “data file” is described by a
predefined set of attributes included in its metadata file
Upload a file The file is assigned with a unique
identifier The unique identifier is included in
the “metadata file” The “data file” is uploaded to the
Storage Overlay by the GridTorrent mechanism
The “metadata file” is inserted in the Metadata Overlay
The physical location(s) where the file is (are) inserted in the DRLS overlay
Storage overlay
DRLS overlay
Metadata overlay
9CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Architecture (3) Search procedure
The search mechanism is applied in the Metadata overlay
A user can search the metadata files according to his / her criteria and select the data file(s) of its own interest
The physical location(s) of the “data file” replicas are returned by the DRLS overlay
The GridTorrent protocol downloads a file exploiting sharing properties in order to boost aggregate performance
Sea
rch
phys
ical
lo
catio
ns o
f a d
ata
file
Ret
rieve
a d
ata
item
w
ith G
ridT
orre
ntStorage overlay
DRLS overlay
Metadata overlay
Sea
rch
met
adat
a fil
es
10CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Space Filling Curves Map continuously a compact interval to a d-dimensional space Partitioning of the d-dimensional space into 2kd cells, which in turn are
mapped through the Space Filling Curve to 2kd points of a single dimension Recursive nature (generation of the curve) Perseverance of locality, points being close in the 1-dimensional space are
mapped to points that are close together in the d-dimensional space
11
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
00 01 10 11
00
11
0000 0001
0010
0011
0100
0101 0110
0111 1000
1001 1010
1011
1100
1101
1110 1111
11CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Multidimensional indexing The set of d attributes chosen to be indexed form a d-dimensional space
Each combination of attributes’ values is depicted to a point of the d-dimensional space
The key of a metadata file is produced by the Space Filling curve
Query processing Clusters of the Space Filling curve will be defined answering the query Lookup for the specified clusters
Load balance Virtual servers entity that owns an interval of the identifier space Each physical node contains multiple virtual servers When a physical node is overloaded by virtue of available storage space or bandwidth,
it may move one ore more of its virtual servers to another, underloaded physical node
12CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
DRLS (1) A Distributed Replica Location Service
A DHT by correlates its inherent key-value pairs to the unique identifier of a file to Physical File Names (PFNs) mappings
Problems In DHTs read-only files are stored while mutable data cannot be
handled The replication strategy results in key-value pairs to be stored in
nodes that are close to the ID of the key and cached around the network
The exact location(s) of a key-value pair in a given moment cannot be returned and the update of all replicas is not ensured
PFN mappings for a given LFN change frequently
13CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
DRLS (2)
Solution Every lookup always queries all nodes responsible for a
specific key-value pair The Lookup procedure does not stop to the first returned
value and peers that do not reply with a value, they are considered as not uptodate
When all available results are returned, the query node compares the results based on some predefined version vector (indicating the latest update of the value)
Updates are propagated to the nodes it has found responsible for storage but not yet up-todate with the latest value
14CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
The data transfer mechanism of the Storage overlay Implementation of BitTorrent designed to interface and integrate with
GridFTP
A GridTorrent client is able to request file fragments from other GridTorrent clients holding the file or connect to GridFTP servers
Effectiveness and scalability, even under extreme load and flash crowd conditions
Optimized replica selection based on the rarest-first policy
BitTorrent A peer-to-peer protocol that allows clients to download files from multiple
sources while uploading them to other users at the same time
15CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
GridTorrent’s basic components
RLSManager PeerManager DiskManager
Local Disk
Store and Retrieve file piecesSHA1 check
RLSRegister new replica
Get PFNs
GridFTP Peer
GridFTP Peer
GridFTP Peer
GridTorrent Peer
Exchange pieces using GridTorrent ProtocolDownload pieces from
GridFTP servers
Local node
MetadataParser
XML Metadata
File unique identifier
Get size, piece_size, hashes
16CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
GridTorrent Queries the RDLS periodically
The DRLS returns to the GridTorrent peer PFN of replicas identified by the GridTorrent URL
gtp://site.fully.qualified.domain.name/path/to/file File size, hashes of pieces, size of each piece
GridTorrent client The two invlolved peers initiate communication by exchanging the BitTorrent
bit field message, informing each other of the pieces they possess A request message for blocks is issued
GridFTP server The client issues a GridFTP partial get message for the data within the
specific block it intends to download
17CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Conclusions The paper proposes a service-oriented architecture for efficient
search and retrieval of annotated content
Different overlays are implemented and Peer-to-Peer techniques are introduced in the Grid environment
The design of the platform ensures scalability and fault-tolerance
An extensible architecture is presented favoring the integration with other systems and the development of Grid applications
18CGW’07, 15 – 16 October 2007, Cracow, PolandCGW’07, 15 – 16 October 2007, Cracow, Poland
Thank you for your attention