Large Scale Sharing: GFS and PAST (Mahesh Balakrishnan)

Posted: 21-Dec-2015


Page 1: Large Scale Sharing: GFS and PAST

Mahesh Balakrishnan

Page 2: Distributed File Systems

Traditional definition: data and/or metadata stored at remote locations, accessed by clients over the network.

Various degrees of centralization: from NFS to xFS.

GFS and PAST: unconventional, specialized functionality; large-scale in both data and nodes.

Page 3: The Google File System

Specifically designed for Google's backend needs: web spiders append to huge files.

Application data patterns: multiple producer, multiple consumer; many-way merging.

GFS vs. traditional file systems.

Page 4: Design Space Coordinates

Commodity components.

Very large files (multi-GB).

Large sequential accesses.

Co-design of applications and the file system.

Small files and random-access reads and writes are supported, but not efficiently.

Page 5: GFS Architecture

Interface: the usual operations (create, delete, open, close, etc.) plus two special ones: snapshot and record append.

Files are divided into fixed-size chunks; each chunk is replicated at chunkservers.

A single master maintains metadata.

Master, chunkservers, and clients all run as user-level processes on Linux workstations.

Page 6: Client File Request

The client finds the chunkid for an offset within the file.

The client sends <filename, chunkid> to the master.

The master returns the chunk handle and chunkserver locations.
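The client-side offset-to-chunk translation can be sketched as follows. This is a minimal illustration assuming GFS's fixed 64 MB chunk size; the helper name is hypothetical, not from the paper:

```python
# Sketch of the client-side lookup step: translate a byte offset into
# the (filename, chunk index) pair sent to the master.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

def chunk_request(filename: str, offset: int) -> tuple[str, int]:
    """Return the (filename, chunk index) the client asks the master about."""
    chunk_index = offset // CHUNK_SIZE
    return (filename, chunk_index)

# The master answers with a chunk handle plus replica locations, which
# the client caches to avoid repeated round trips to the master.
```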

Page 7: Design Choices: Master

A single master maintains all metadata...

Simple design; enables global decision making for chunk replication and placement.

Is it a bottleneck? A single point of failure?

Page 8: Design Choices: Master

A single master maintains all metadata... in memory!

Fast master operations; allows background scans of the entire data.

What about the memory limit? Fault tolerance?

Page 9: Relaxed Consistency Model

File regions are:

Consistent: all clients see the same thing.

Defined: after a mutation, all clients see exactly what the mutation wrote.

Ordering of concurrent mutations: for each chunk's replica set, the master gives one replica a primary lease; the primary replica decides the ordering of mutations and sends it to the other replicas.

Page 10: Anatomy of a Mutation

1-2. The client gets chunkserver locations from the master.

3. The client pushes data to the replicas, in a chain.

4. The client sends the write request to the primary; the primary assigns a sequence number to the write and applies it.

5-6. The primary tells the other replicas to apply the write.

7. The primary replies to the client.
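The ordering role of the primary (steps 4 through 7) can be sketched as below. Class and method names are hypothetical; the point is only that the primary picks one total order and every replica applies writes in that order:

```python
# Minimal sketch of primary-driven write ordering in a replica set.
class SecondaryReplica:
    def __init__(self):
        self.log = []                       # (seq, write) pairs applied

    def apply(self, seq, write):
        self.log.append((seq, write))

class PrimaryReplica:
    def __init__(self, secondaries):
        self.seq = 0
        self.log = []
        self.secondaries = secondaries

    def handle_write(self, write):
        self.seq += 1                       # step 4: assign sequence number
        self.log.append((self.seq, write))  # step 4: apply locally
        for s in self.secondaries:
            s.apply(self.seq, write)        # steps 5-6: forward the order
        return self.seq                     # step 7: reply to the client
```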

Page 11: Connection with the Consistency Model

If a secondary replica encounters an error while applying a write (step 5), the region is Inconsistent.

If client code breaks up a single large write into multiple small writes, the region is Consistent but Undefined.

Page 12: Special Functionality

Atomic record append: the primary appends to itself, then tells the other replicas to write at that offset. If a secondary replica fails to write the data (step 5), there are duplicates in the successful replicas and padding in the failed ones; the region is defined where the append succeeded and inconsistent where it failed.

Snapshot: copy-on-write; chunks are copied lazily on the same replica.
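The offset-selection decision behind record append can be sketched as follows, assuming GFS's 64 MB chunks; the function name and return convention are illustrative. Because the primary picks the offset, all replicas place the record identically; a record that would cross the chunk boundary causes padding and a client retry on the next chunk:

```python
# Sketch of the primary's record-append decision.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunk size

def record_append(chunk_used: int, record: bytes) -> tuple[str, int]:
    """Return (action, offset) for a record append at the primary."""
    if chunk_used + len(record) > CHUNK_SIZE:
        # Record would straddle the chunk boundary: pad the remainder
        # of the chunk and have the client retry on a fresh chunk.
        return ("pad_and_retry", chunk_used)
    # Otherwise append at the primary's current end-of-chunk offset;
    # secondaries are told to write at exactly this offset.
    return ("append", chunk_used)
```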

Page 13: Master Internals

Namespace management.

Replica placement.

Chunk creation, re-replication, and rebalancing.

Garbage collection.

Stale replica detection.

Page 14: Dealing with Faults

High availability: fast master and chunkserver recovery; chunk replication; master state replication with read-only shadow replicas.

Data integrity: each chunk is broken into 64 KB blocks, each with a 32-bit checksum. Checksums are kept in memory and logged to disk. Checksumming is optimized for appends, since no verification of existing data is required.
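The per-block checksumming scheme can be sketched as below. CRC-32 stands in for whatever 32-bit checksum GFS actually uses, and the function names are hypothetical:

```python
# Sketch of per-block chunk checksumming: each 64 KB block of a chunk
# carries its own 32-bit checksum, re-verified before data is returned.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks within a chunk

def block_checksums(chunk: bytes) -> list[int]:
    """Compute a 32-bit CRC for every 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, sums: list[int]) -> bool:
    """A chunkserver re-checks block checksums on reads, so corruption
    on one replica never propagates to clients or other replicas."""
    return block_checksums(chunk) == sums
```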

Page 15: Micro-benchmarks

Page 16: Storage Data for 'Real' Clusters

Page 17: Performance

Page 18: Workload Breakdown

% of operations for a given size; % of bytes transferred for a given operation size.

Page 19: GFS: Conclusion

Very application-specific: more engineering than research.

Page 20: PAST

An Internet-based P2P global storage utility: strong persistence, high availability, scalability, security.

Not a conventional FS: files have unique ids; clients can insert and retrieve files; files are immutable.

Page 21: PAST Operations

Nodes have random, unique nodeIds.

No searching, directory lookup, or key distribution.

Supported operations:

Insert: (name, key, k, file) → fileId; stores the file on the k nodes closest in id space.

Lookup: (fileId) → file.

Reclaim: (fileId, key).
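The id-space placement behind Insert can be sketched as follows. The fileId construction here assumes, as in the PAST paper, a SHA-1 hash of the file name, the owner's public key, and a salt; function names are illustrative, and the closeness metric ignores wrap-around in the circular id space for brevity:

```python
# Sketch of fileId generation and k-closest replica placement in PAST.
import hashlib

def make_file_id(name: str, owner_pubkey: bytes, salt: bytes) -> int:
    """fileId = SHA-1(file name, owner's public key, salt): a 160-bit id
    in the same space as nodeIds. In PAST the salt is chosen randomly."""
    h = hashlib.sha1(name.encode() + owner_pubkey + salt).digest()
    return int.from_bytes(h, "big")

def k_closest(file_id: int, node_ids: list[int], k: int) -> list[int]:
    """The file is replicated on the k nodes whose nodeIds are
    numerically closest to the fileId (wrap-around omitted here)."""
    return sorted(node_ids, key=lambda n: abs(n - file_id))[:k]
```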

Page 22: Pastry

A P2P routing substrate.

route(key, msg): routes to the node numerically closest to key in less than log_{2^b} N steps.

Routing table size: (2^b - 1) * log_{2^b} N + 2l entries.

b determines the tradeoff between per-node state and lookup hop count.

l determines failure tolerance: delivery is guaranteed unless l/2 nodes with adjacent nodeIds fail.
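The two formulas above can be made concrete with a worked example; the parameter values are illustrative (b = 4 and l = 16 are commonly used Pastry settings):

```python
# Worked example of the Pastry hop-count and per-node state formulas.
import math

def routing_hops(n: int, b: int) -> int:
    """Upper bound on routing steps: ceil(log_{2^b} N)."""
    return math.ceil(math.log(n, 2 ** b))

def routing_state(n: int, b: int, l: int) -> int:
    """Per-node state: (2^b - 1) * ceil(log_{2^b} N) + 2l entries."""
    return (2 ** b - 1) * math.ceil(math.log(n, 2 ** b)) + 2 * l

# For N = 1,000,000 nodes with b = 4 and l = 16: about 5 hops and
# (16 - 1) * 5 + 32 = 107 routing entries per node.
```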

Page 23: Routing Table of Node 10233102

Leaf set: the |L|/2 larger and |L|/2 smaller nearby nodeIds.

Routing entries.

Neighborhood set: the |M| closest nodes.

Page 24: PAST Operations and Security

Insert: a certificate is created containing the fileId, a hash of the file content, and the replication factor, signed with the private key. The file and certificate are routed through Pastry; the first node among the k closest accepts the file and forwards it to the other k-1.

Security via smartcards: they hold public/private keys, generate and verify certificates, and ensure the integrity of nodeId and fileId assignments.

Page 25: Storage Management

Design goals: high global storage utilization; graceful degradation near maximum utilization.

PAST tries to: balance free storage space among nodes; maintain the k-closest-nodes replication invariant.

Sources of storage load imbalance: variance in the number of files assigned to a node; variance in the size distribution of inserted files; variance in the storage capacity of PAST nodes.

Page 26: Storage Management

Large-capacity storage nodes take multiple nodeIds.

Replica diversion: if node A cannot store a file, it stores a pointer to the file at a leaf-set node B that is not among the k closest. What if A or B fails? A duplicate pointer is kept at the (k+1)-closest node.

Policies for diverting and accepting replicas: thresholds t_pri and t_div on the ratio of file size to free space.

File diversion: if an insert fails, the client retries with a different fileId.
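The acceptance test behind t_pri and t_div can be sketched as below. The function name and exact comparison are assumptions; the threshold values are the ones the later slides vary (t_pri = 0.1, t_div = 0.05), with the stricter t_div applied when a node is asked to hold a diverted replica:

```python
# Sketch of a PAST node's accept/reject decision for a replica.
def accepts(file_size: int, free_space: int, diverted: bool,
            t_pri: float = 0.1, t_div: float = 0.05) -> bool:
    """Accept the file only if its size is a small enough fraction of
    the node's remaining free space; diverted replicas face the
    stricter threshold, so diversion gets harder as disks fill up."""
    if free_space <= 0:
        return False
    threshold = t_div if diverted else t_pri
    return file_size / free_space <= threshold
```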

Page 27: Storage Management

Maintaining the replication invariant under failures and joins.

Caching: k-replication in PAST is for availability; extra copies are cached to reduce client latency and network traffic, using otherwise unused disk space. Replacement policy: GreedyDual-Size.

Page 28: Performance

Workloads: 8 web proxy logs; combined file systems.

Parameters: k = 5, b = 4; number of nodes = 2250; node storage sizes drawn from 4 normal distributions.

Without replica and file diversion: 51.1% of insertions failed, at 60.8% global utilization.

Page 29: Effect of Storage Management

Page 30: Effect of t_pri

t_div = 0.05, t_pri varied.

Lower t_pri: better utilization, but more failures.

Page 31: Effect of t_div

t_pri = 0.1, t_div varied.

The trend is similar to that of t_pri.

Page 32: File and Replica Diversions

Ratio of file diversions vs. utilization; ratio of replica diversions vs. utilization.

Page 33: Distribution of Insertion Failures

Web logs trace; file system trace.

Page 34: Caching

Page 35: Conclusion