An Introduction to Sector/Sphere

An Introduction to Sector/Sphere Sector & Sphere Yunhong Gu Univ. of Illinois at Chicago and VeryCloud LLC @CHUG, June 22, 2010

Page 1: An Introduction to Sector/Sphere

An Introduction to Sector/Sphere

Sector & Sphere

Yunhong Gu Univ. of Illinois at Chicago and VeryCloud LLC

@CHUG, June 22, 2010

Page 2: An Introduction to Sector/Sphere

What is Sector/Sphere?

Sector: Distributed File System
Sphere: Simplified Parallel Data Processing Framework
Goal: handling big data on commodity clusters
Open source software, BSD license, written in C++
Started in 2006, current version 2.3
http://sector.sf.net

Page 3: An Introduction to Sector/Sphere

Motivation: Data Locality

[Figure labels: Storage, Compute, Data]

Super-computer model: expensive, data IO bottleneck

Sector/Sphere model: inexpensive, parallel data IO, data locality

Page 4: An Introduction to Sector/Sphere

Motivation: Simplified Programming

Parallel/distributed programming with MPI, etc.: flexible and powerful, but application development is very complicated.

Sector/Sphere model (cloud model): clusters are regarded as a single entity by the developer, with a simplified programming interface. Limited to certain data parallel applications.

Page 5: An Introduction to Sector/Sphere

Motivation: Global-scale System

Systems for single data centers: require additional effort to locate and move data.

Sector/Sphere model: supports wide-area data collection and distribution.

[Figure: Sector/Sphere spanning three data centers. Data providers in US and Europe locations upload data, a data user in a US location runs processing, and data readers in an Asia location read the data.]

Page 6: An Introduction to Sector/Sphere

Sector Distributed File System

DFS designed to work on commodity hardware
  Racks of computers with internal hard disks and high-speed network connections
File-system-level fault tolerance via replication
Supports wide area networks
  Can be used for data collection and distribution
Not POSIX-compatible yet

Page 7: An Introduction to Sector/Sphere

Sector Distributed File System

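[Figure: Sector architecture. The security server (user accounts, data protection, system security) connects to the masters over SSL. The masters (metadata, scheduling, service provider) connect to the clients over SSL. Clients provide system access tools and application programming interfaces. Slaves provide storage and processing. Data is transferred over UDT; encryption is optional.]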

Page 8: An Introduction to Sector/Sphere

Security Server

User accounts, permissions, IP access control lists

Uses independent accounts, but can connect to an existing account database via a simple “driver”, e.g., Linux accounts, LDAP, etc.

Single security server; the system continues to run when the security server is down, but new users cannot log in

Page 9: An Introduction to Sector/Sphere

Master Servers

Maintain file system metadata
  Metadata is a customizable module; currently there are two implementations, one in-memory and one on disk
Authenticate users, slaves, and other masters (via security server)
Maintain and manage file replication, data IO and data processing requests
  Topology aware
Multiple active masters can dynamically join and leave; load balancing between masters

Page 10: An Introduction to Sector/Sphere

Slave Nodes

Store Sector files
  A Sector file is not split into blocks
  One Sector file is stored on the “native” file system (e.g., EXT, XFS, etc.) of one or more slave nodes
Process Sector data
  Data is processed on the same storage node, or the nearest storage node possible
  Input and output are Sector files
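The locality rule on this slide (process on the node that stores the data, otherwise on the nearest one) could be sketched roughly as below. This is not Sector's scheduler: the `Node` struct, the rack/data-center distance metric, and the function names are all hypothetical, chosen only to illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical description of a slave node; not a Sector type.
struct Node {
    std::string name;
    int datacenter;  // assumed topology coordinates
    int rack;
    bool busy;
};

// Assumed distance metric: 0 = same node, 1 = same rack,
// 2 = same data center, 3 = different data center.
int topologyDistance(const Node& a, const Node& b) {
    if (a.name == b.name) return 0;
    if (a.datacenter != b.datacenter) return 3;
    if (a.rack != b.rack) return 2;
    return 1;
}

// Returns the index of the idle node closest to any node holding a replica
// of the file, or -1 if every node is busy.
int pickProcessingNode(const std::vector<Node>& nodes,
                       const std::vector<int>& replicaHolders) {
    assert(!replicaHolders.empty());
    int best = -1;
    int bestDist = std::numeric_limits<int>::max();
    for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
        const Node& candidate = nodes[static_cast<std::size_t>(i)];
        if (candidate.busy) continue;
        for (int r : replicaHolders) {
            int d = topologyDistance(candidate, nodes[static_cast<std::size_t>(r)]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
    }
    return best;
}
```

With a busy storage node, the sketch falls back to an idle node in the same rack before considering other racks or data centers.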

Page 11: An Introduction to Sector/Sphere

Clients

Sector file system client API
  Access Sector files in applications using the C++ API
Sector system tools
  File system access tools
FUSE
  Mount Sector file system as a local directory
Sphere programming API
  Develop parallel data processing applications to process Sector data with a set of simple APIs
The client communicates with slaves directly for data IO, via UDT

Page 12: An Introduction to Sector/Sphere

UDT: UDP-based Data Transfer

http://udt.sf.net
Open source UDP-based data transfer protocol
  With reliability control and congestion control
Fast, firewall friendly, easy to use
Already used in many commercial and research systems for large data transfer

Page 13: An Introduction to Sector/Sphere

Application-aware File System

Files are not split into blocks
  Users are responsible for using properly sized files
Directory and File Family
  Sector will keep related files together during upload and replication
In-memory object

Page 14: An Introduction to Sector/Sphere

Sphere: Simplified Data Processing

Data parallel applications
Data is processed where it resides, or on the nearest possible node (locality)
The same user-defined functions (UDF) are applied to all elements (records, blocks, files, or directories)
Processing output can be written to Sector files or sent back to the client
Transparent load balancing and fault tolerance

Page 15: An Introduction to Sector/Sphere

Sphere: Simplified Data Processing

for each file F in (SDSS datasets)
  for each image I in F
    findBrownDwarf(I, …);

[Figure: the Sphere client application splits the input stream into segments (n, n+1, n+2, n+3, ..., n+m), locates and schedules SPEs to process them, and collects the results into the output stream (n-k, ..., n, n+1, n+2, n+3).]

SphereStream sdss;
sdss.init("sdss files");
SphereProcess myproc;
myproc.run(sdss, "findBrownDwarf", …);

findBrownDwarf(char* image, int isize, char* result, int rsize);
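To make the nested loop above concrete without a Sector installation, here is a minimal single-process sketch that applies one function with the same parameter shape as `findBrownDwarf` to every record. Everything in it is an assumption for illustration: `Udf`, `applyToAll`, and the stand-in `countBrightPixels` are hypothetical names, the input is taken as `const char*`, and the meaning of the return value (bytes written to `result`, or -1 on failure) is not stated on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Same parameter shape as the slide's findBrownDwarf. The return value
// (bytes written to result, -1 on failure) is an assumption.
using Udf = int (*)(const char* input, int isize, char* result, int rsize);

// Stand-in UDF: counts bytes brighter than a threshold and writes the
// count as decimal text into the result buffer.
int countBrightPixels(const char* image, int isize, char* result, int rsize) {
    int bright = 0;
    for (int i = 0; i < isize; ++i) {
        if (static_cast<unsigned char>(image[i]) > 200) ++bright;
    }
    const std::string text = std::to_string(bright);
    if (static_cast<int>(text.size()) > rsize) return -1;
    std::memcpy(result, text.data(), text.size());
    return static_cast<int>(text.size());
}

// Applies the same UDF to every record and collects the outputs, which is
// what Sphere does in parallel across slave nodes.
std::vector<std::string> applyToAll(const std::vector<std::string>& records, Udf udf) {
    std::vector<std::string> results;
    char buffer[64];
    for (const std::string& rec : records) {
        const int n = udf(rec.data(), static_cast<int>(rec.size()),
                          buffer, static_cast<int>(sizeof buffer));
        assert(n >= 0);
        results.emplace_back(buffer, static_cast<std::size_t>(n));
    }
    return results;
}
```

In the real system the loop body runs on the node that stores each record; the sketch only shows the "same function, every element" contract.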

Page 16: An Introduction to Sector/Sphere

Sphere: Data Movement

Slave -> Slave Local
Slave -> Slaves (Hash/Buckets)
  Each output record is assigned an ID; all records with the same ID are sent to the same “bucket” file
Slave -> Client

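[Figure: the input stream (segments n, n+1, n+2, n+3, ..., n+m) is processed by SPEs in Stage 1: Shuffling, which writes the intermediate stream of buckets 0, 1, 2, 3, ..., b. SPEs in Stage 2: Sorting process those buckets into the output stream (0, 1, 2, 3, ..., b).]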
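The two stages in the diagram can be sketched in plain C++ as follows. The slide only says that each record is assigned a bucket ID; the range-partitioning rule used here (so that sorted buckets concatenate into a globally sorted output) is an assumption, and `bucketId`, `shuffle`, and `sortBuckets` are hypothetical names rather than Sphere calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Buckets = std::vector<std::vector<std::uint32_t>>;

// Assumed ID rule: split the 32-bit key space into numBuckets contiguous
// ranges, so every key in bucket i is smaller than every key in bucket i+1.
std::size_t bucketId(std::uint32_t key, std::size_t numBuckets) {
    return static_cast<std::size_t>(
        (static_cast<std::uint64_t>(key) * numBuckets) >> 32);
}

// Stage 1 (shuffling): send each record to the bucket chosen by its ID.
Buckets shuffle(const std::vector<std::uint32_t>& records, std::size_t numBuckets) {
    assert(numBuckets > 0);
    Buckets buckets(numBuckets);
    for (std::uint32_t key : records) {
        buckets[bucketId(key, numBuckets)].push_back(key);
    }
    return buckets;
}

// Stage 2 (sorting): sort each bucket independently, then concatenate the
// buckets in ID order to form the output stream.
std::vector<std::uint32_t> sortBuckets(Buckets buckets) {
    std::vector<std::uint32_t> out;
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());
        out.insert(out.end(), bucket.begin(), bucket.end());
    }
    return out;
}
```

Because each bucket can be sorted without looking at any other bucket, Stage 2 parallelizes across nodes in the same way Stage 1 does.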

Page 17: An Introduction to Sector/Sphere

What does a Sphere program look like?

A client application
  Specify input, output, and name of UDF
  Inputs and outputs are usually Sector directories or collections of files
  May have multiple rounds of computation if necessary (iterative/combinative processing)
A UDF
  A C++ function following the Sphere specification (parameters and return value)
  Compiled into a dynamic library

Page 18: An Introduction to Sector/Sphere

Sphere/UDF vs. MapReduce

Map = UDF
MapReduce = 2x UDF
  First UDF generates bucket files and second processes the bucket files.
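As an illustration of "MapReduce = 2x UDF", the sketch below does a word count with two plain functions: the first writes words into bucket files chosen by hash, the second processes one bucket at a time. It is a single-process stand-in, not the Sphere API; `BucketFiles`, `emitWords`, `countBucket`, and `wordCount` are hypothetical names, and in-memory vectors stand in for bucket files.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// In-memory stand-in for bucket files: one vector of words per bucket.
using BucketFiles = std::vector<std::vector<std::string>>;

// First UDF (the "map" side): split a text record into words and send each
// word to the bucket selected by its hash, so equal words share a bucket.
void emitWords(const std::string& record, BucketFiles& buckets) {
    assert(!buckets.empty());
    std::istringstream in(record);
    std::string word;
    while (in >> word) {
        buckets[std::hash<std::string>{}(word) % buckets.size()].push_back(word);
    }
}

// Second UDF (the "reduce" side): count the words found in one bucket.
std::map<std::string, int> countBucket(const std::vector<std::string>& bucket) {
    std::map<std::string, int> counts;
    for (const std::string& word : bucket) ++counts[word];
    return counts;
}

// Driver: run the first UDF over all records, then the second over all buckets.
std::map<std::string, int> wordCount(const std::vector<std::string>& records,
                                     std::size_t numBuckets) {
    BucketFiles buckets(numBuckets);
    for (const std::string& record : records) emitWords(record, buckets);
    std::map<std::string, int> total;
    for (const auto& bucket : buckets) {
        for (const auto& entry : countBucket(bucket)) {
            total[entry.first] += entry.second;
        }
    }
    return total;
}
```

Note that nothing here sorts the buckets; the second function only needs all occurrences of a word to land in the same bucket.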

Page 19: An Introduction to Sector/Sphere

Sphere/UDF vs. MapReduce

Sphere is more flexible and efficient
  UDF can be applied directly on records, blocks, files, and even directories
  Supports multiple inputs/outputs with better data locality, including certain legacy applications that process files and directories
  Native binary data support w/ permanent index files
  Sorting is required by Reduce, but it is optional in Sphere
  Output locality allows Sphere to combine multiple operations more efficiently

Page 20: An Introduction to Sector/Sphere

Sphere Benchmarks

Terasort: sort 1TB data over distributed servers
Malstone: detect malware websites from billions of transactions
Graph processing: analyze very large social networks at billions of vertices (BFS and enumerating cliques)
Genome pipeline: analyze genome sequences
Satellite image processing: compare satellite images taken at different times, for disaster relief
Sphere is about 2 – 4 times faster than Hadoop

Page 21: An Introduction to Sector/Sphere

Open Cloud Testbed

15 Racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2)

10Gb/s inter-site connection on CiscoWave

1 - 2Gb/s inter-rack connection

Two dual-core AMD CPUs, 8 - 16GB RAM, 1-4TB RAID-0 disk

Page 22: An Introduction to Sector/Sphere

Open Cloud Testbed

Page 23: An Introduction to Sector/Sphere

Development Status

Current version 2.3; all core functions are ready, and work continues on improving code quality and details for certain modules.
Partly funded by NSF for NCDM/UIC
Commercial support via VeryCloud LLC
Next step: support column-based data tables (similar to BigTable)
Open source contributors are welcome

Page 24: An Introduction to Sector/Sphere

More Information

Sector Website: http://sector.sourceforge.net

Email: [email protected]