Hash-based Eventual Consistency to Scale the HDFS Block Report


DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017

Hash-based Eventual Consistency to Scale the HDFS Block Report

AUGUST BONDS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Abstract

The architecture of the distributed hierarchical file system HDFS imposes limitations on its scalability. All metadata is stored in-memory on a single machine, and in practice, this limits the cluster size to about 4000 servers. Larger HDFS clusters must resort to namespace federation, which divides the file system into isolated volumes and changes the semantics of cross-volume file system operations (for example, file move becomes a non-atomic combination of copy and delete). Ideally, organizations want to consolidate their data in as few clusters and namespaces as possible to avoid such issues, increase operating efficiency and utility, and ease maintenance. HopsFS, a new distribution of HDFS developed at KTH, uses an in-memory distributed database for storing metadata. It scales to 10k nodes and has shown that in principle it can support clusters of at least 15 times the size of traditional non-federated HDFS clusters. However, an eventually consistent data loss protection mechanism in HDFS, called the Block Report protocol, prevents HopsFS from reaching its full potential.

This thesis provides a solution for scaling the Block Report protocol in HopsFS that uses an incremental, hash-based eventual consistency mechanism to avoid duplicated work. In the average case, our simulations indicate that the solution can reduce the load on the database by an order of magnitude, at the cost of less than 10 percent overhead on file mutations, while performing similarly to the old solution in the worst case.


Sammanfattning

The architecture of the distributed, hierarchical file system Apache HDFS limits its scalability. All metadata is stored in memory on one of the cluster's nodes, and in practice this limits an HDFS cluster's size to roughly 4000 nodes. Larger clusters are forced to partition the file system into isolated parts, which changes the behavior of operations that cross partition boundaries (for example, file moves become non-atomic combinations of copy and delete). Ideally, organizations could merge all their storage into one and the same file tree to avoid such behavioral changes, thereby reducing administration and increasing the utilization of the hardware they choose to keep. HopsFS is a new distribution of Apache HDFS, developed at KTH, that uses an in-memory distributed database solution to store metadata. The solution can handle a cluster size of 10000 nodes and has shown that it can in principle support cluster sizes of up to fifteen times that of Apache HDFS. One of the obstacles that remains before HopsFS can reach these levels is an eventually consistent algorithm for data-loss protection in Apache HDFS called the Block Report.

This work proposes a solution for increasing the scalability of the HDFS Block Report that uses a hash-based, eventually consistent mechanism to avoid duplicated work. Simulations indicate that, on average, the new solution can reduce the pressure on the database by a whole order of magnitude, at a performance cost of less than ten percent on the file system's ordinary operations, while worst-case database usage is comparable to the old solution.


Acknowledgements

Firstly, I would like to thank my examiner Jim Dowling for trusting me with the responsibility of tackling this complicated problem. Secondly, I would like to thank my supervisor Salman Niazi for helping me along the way and having patience with me despite the endless flow of questions. Thirdly, I would like to thank my team at Logical Clocks for all the laughs, and finally, RISE SICS for the financial support and for providing me with a good working environment.

A special thanks to Monika Tolgo for always pushing me to be better.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Goal
    1.3.1 Contributions
    1.3.2 Benefits, Ethics and Sustainability
  1.4 Method
  1.5 Delimitation
  1.6 Outline

2 Background
  2.1 Hadoop
    2.1.1 HDFS
  2.2 NDB
  2.3 Hops
    2.3.1 Transactions in Hops
    2.3.2 HopsFS
  2.4 Eventual Consistency
  2.5 HDFS Block Management
    2.5.1 Blocks and Replicas
    2.5.2 HDFS Block Report
  2.6 HopsFS Block Report
  2.7 Merkle Trees
  2.8 State of the Art
    2.8.1 Federated HDFS
    2.8.2 Amazon S3
    2.8.3 Google Cloud Storage

3 Method
  3.1 Goals
  3.2 Tasks
  3.3 Evaluation

4 The Improved HopsFS Block Report
  4.1 Solution Idea
    4.1.1 Incremental Hashing
  4.2 Design
    4.2.1 A First Attempt
    4.2.2 Handling Concurrent Writes
    4.2.3 Handling Data Node Restarts
    4.2.4 Handling Appends
    4.2.5 Handling Recovery
    4.2.6 Final Design

5 Analysis
  5.1 Results
    5.1.1 Simulations
    5.1.2 Real-world Performance
  5.2 Analysis
    5.2.1 Correctness
    5.2.2 Pitfalls
      5.2.2.1 Sensitivity to Misconfiguration
      5.2.2.2 Real-world Large Scale Systems
      5.2.2.3 Computation and Storage Overhead
  5.3 Catastrophic Failure

6 Conclusions
  6.1 Conclusion
  6.2 Future Work

Bibliography


List of Figures

2.1 HDFS Architecture
2.2 Replica State Transitions [1]
2.3 Block State Transitions [1]
2.4 HDFS Federation [2]
2.5 Consistency-Scalability Matrix
5.1 Impact of Number of Buckets on Block Report Time
5.2 Impact of Block Processing Speed on Block Report Time


List of Tables

2.1 Block States
2.2 Replica States
2.3 Reported Replica Fields
5.1 Simulation Configuration


List of Listings

4.1 Simple Hash-based approach, Name Node
4.2 Simple Hash-based approach, Data Node
4.3 Hash-based approach with Concurrent Modification Handling, Name Node
4.4 Hash-based approach with Concurrent Modification Handling, Data Node
4.5 Handling data node restarts and stale data nodes, Name Node
4.6 Handling Appends, Name Node
4.7 Handling Recovery, Name Node
4.8 Final Design, Name Node
4.9 Final Design, Data Node


List of Acronyms and Abbreviations

This document requires readers to be familiar with certain terms and concepts. For clarity, we summarize some of these terms and give a short description of each before they appear in the following sections.

HDFS Hadoop Distributed File System [3]

HopsFS The new distribution of HDFS developed at KTH [4]

DAL Data Access Layer, the metadata abstraction layer of HopsFS

CRDTs Conflict-free Replicated Data Types [5]

YARN Yet Another Resource Negotiator [6]


Chapter 1

Introduction

1.1 Background

With over 4 million YouTube videos watched, 46,000 Instagram posts shared, and 2.5 petabytes of internet data consumed by American users every minute of every day [7], the absolute necessity of large-scale distributed storage systems is evident. The design of such systems is, however, a very complicated task with many pitfalls and unavoidable compromises.

Because of the thermodynamic limitations of scaling vertically, i.e., building more powerful computers, large-scale distributed systems resort to scaling horizontally, i.e., adding more computers to the system. The problem with horizontal scaling is that increasing the number of devices in a distributed system also increases communication complexity and adds to the number of failure points. In the case of distributed storage, as the network grows, so does the problem of maintaining data consistency, data integrity, and redundancy.

Arguably the most famous big data cluster framework, Hadoop, is entirely open source and is used by large organizations such as Yahoo!, Facebook, and Spotify for data storage and analysis. Hadoop stores its files in the Hadoop Distributed File System (HDFS), which allows for reading, writing, and appending files of unlimited size. However, as will be discussed, because of its design, the file system and the cluster size cannot safely grow beyond 4000 nodes.

In a Hadoop cluster, there is at least one control node (Name Node) and one or more storage/worker nodes (Data Nodes). The Name Node(s) decides, among other things, where clients should write new data and how the existing data should be balanced and distributed over the cluster. The Name Node stores all the cluster information in-memory.

Because all cluster metadata is stored in-memory on that single Name Node, the cluster has a limit to the number of files it can keep track of. Storing metadata in-memory also means that all file system operations and all data node reports must eventually go through it. There are possibilities for moving the synchronization of metadata outside of the Name Node using message-passing middleware, but, for simplicity's sake, the Name Node has complete control.

Hadoop Open Platform as a Service (Hops) is a new distribution of Hadoop that aims to increase scalability and performance by removing critical bottlenecks in the Hadoop platform. To scale the metadata layer, Hops uses a distributed in-memory database called NDB. Using NDB for metadata raises the potential maximum number of files to 17 billion. Since NDB supports ACID transactions, it can also be used to distribute Name Node work.

HopsFS is the Hops equivalent of HDFS and takes care of file storage. Being a distributed system, the number of potential errors is significant, and crashes are frequent. Asynchronous operations that postpone synchronizing with the Name Node improve scalability. However, HopsFS still maintains a high level of redundancy and data-loss protection, and to do this the Name Node needs to keep an up-to-date view of the cluster with its files and blocks. The working solution for data consistency and redundancy checks is called the block report.

Block Report is a mechanism in HDFS for maintaining an up-to-date view of the cluster on the Name Node with respect to stored files. Once an hour, every data node sends a complete list of its blocks, the block report, to the Name Node. The Name Node processes each block in the list and compares it with its view, updating the view when appropriate. To avoid misunderstanding, we will use the word block when talking about the Name Node's idea of a block, and replica when talking about a copy of a block stored on a Data Node. In these terms, the Data Node submits a list of replicas as its block report.

Data Nodes used in production at companies such as Spotify regularly host more than a million blocks, so the corresponding multi-megabyte block report becomes a heavy burden for the Name Node to process. The Name Node also has to query the distributed database for the most up-to-date view of the reporting node, so it adds load to the database system.

The focus of this thesis is improving the performance and processing time of the block report mechanism.

1.2 Problem

Ideally, organizations that do Big Data analysis would like to consolidate their data in the fewest clusters possible for gains in operating efficiency and utilization. However, the metadata architecture of Hadoop limits the cluster size because all metadata is stored in-memory on a single node called the Name Node. HopsFS is a new distribution of HDFS, developed at KTH Royal Institute of Technology in Stockholm, Sweden, that scales metadata size and throughput using a distributed in-memory database. HopsFS has shown that in principle it can support clusters at least 15 times larger than HDFS [4]. However, an internal protocol in HDFS called the block report prevents HopsFS from handling clusters of these sizes.

Currently, the block reporting protocol consumes an increasing amount of computational and bandwidth resources as HDFS cluster sizes grow. In practice, for clusters larger than 10K servers, block reporting saturates the metadata layer, preventing HDFS from scaling further.

Hence, the question becomes: can we improve the scalability of HDFS without compromising performance and durability? And what would such a solution look like? In this thesis, we design and develop a more scalable, eventually consistent protocol for block reporting in HopsFS (and HDFS).

1.3 Goal

The goal of this work is to redesign the current HopsFS block reporting mechanism for improved performance and scalability, as well as to perform the necessary evaluation of the implementation. Furthermore, we hope that our solution will be useful for the original Hadoop project, as a contribution to the open source community.

1.3.1 Contributions

This work consists of an explanation of the existing HDFS block report mechanism, a design and implementation of a new, hash-based block report mechanism, and a comparison of the old and new designs.

1.3.2 Benefits, Ethics and Sustainability

We do not see any ethical issues with this project. The risk is that we do not find a solution, but a negative result (impossibility of further optimization of block reports) would also be a contribution. On the other hand, a positive outcome could increase server utilization and efficiency in large-scale clusters, reducing electricity, maintenance, and hardware requirements for Big Data operations.

1.4 Method

We will attempt to formalize the existing consistency mechanism and then create an improved design. After improving on the original design, benchmarks will be conducted to get a quantitative measure of the performance increase. Reasoning and tests are used to verify the correctness of the new solution.

1.5 Delimitation

This thesis focuses solely on implementing and evaluating a better-performing consistency mechanism for the distributed file system HopsFS. We wish for our improved solution to maintain the same consistency properties as the old solution so that they can be used interchangeably. Hopefully, the algorithm design will also be applicable to the original HDFS project, but the actual application of the new design to HDFS is outside the scope of this work.

1.6 Outline

The thesis follows this outline. The introduction (Chapter 1) gives an overview of the problem and the background. After that, the background (Chapter 2) goes in depth into the context necessary for understanding the work and gives an overview of existing solutions in the space. Then comes an explanation of the methods used (Chapter 3). The hash-based approach chapter (Chapter 4) describes the new solution. At the end of the work is an evaluation (Chapter 5), followed by a summary and suggested future work (Chapter 6).


Chapter 2

Background

2.1 Hadoop

Hadoop is a software framework for ”reliable, scalable, distributed computing” [3]. It started out as an open source implementation of the Google File System (GFS) [8], coupled with cluster management and support for MapReduce jobs [9], and ended up being one of the core technologies of the big data sphere. Today, Hadoop has a 25% market share in big data analytics [10].

Google built GFS and MapReduce to deal with its ever-growing index of web pages and to perform aggregated analysis on that index. Therefore, Hadoop initially came with some core assumptions: users do not delete data, data is written in a stream and read by multiple readers, and data loss is intolerable. Furthermore, MapReduce used the concept of data locality to increase processing speed in a communication-bound system. The assumptions about data storage and retrieval will be discussed in Section 2.1.1, dedicated to Hadoop's distributed file system HDFS.

Hadoop uses data locality to deal with processing large datasets. Instead of moving the data over the limited network, it moves the computation to the nodes where the data resides. One of the insights that led to the creation of GFS, MapReduce, and finally Hadoop was that cheap commodity hardware could be used instead of expensive specialized hardware, as long as the slightly higher failure rate was handled at the software level. The architecture is an example of horizontal scaling: increasing the number of machines instead of increasing the performance of each individual machine.

In May 2012, the first alpha of Hadoop 2 was released [11], which extended the framework in a significant way: it added a scheduler and resource allocator called YARN [6]. YARN made it possible for third-party software to request cluster resources and run jobs, which allowed new data processing frameworks to plug into the platform. Nowadays, Hadoop supports a number of popular Big Data frameworks such as Spark, Hive, Kafka, HBase, and Storm.

2.1.1 HDFS

The default file system in Hadoop is the Hadoop Distributed File System (HDFS). HDFS is an open source clone of the Google File System, based on the paper Google published in 2003 [8]. An HDFS cluster consists of two types of nodes: Data Nodes, and one or more (backup) Name Nodes. The Name Node synchronizes access to the file system and maintains a view of the cluster: which data nodes store which files, what the desired replication level is, and which files need re-replication. The data nodes serve as simple data storage servers.

Files in HDFS are divided into chunks called blocks. The client operates on files, but internally, and transparently to the user, the files are split, distributed, and replicated on the block level. A block has a predefined maximum size, generally 64 MB or more. HDFS stores these blocks, not the complete files, on its data nodes.

The Name node controls the cluster and maintains the namespace. Users interact with HDFS via a provided client. When a user wants to read a file, it requests a read stream from the client. This stream does the work of asking the Name node for locations (Data nodes) that hold the next block to read and then proceeds with setting up a read socket to one of those data nodes. To the user, it looks like a local file read.

A file can contain a virtually unlimited number of blocks. Blocks are limited in size, so when a block becomes full, the writer is notified and is forced to request a new block to write to. The client handles this transparently to the user.

2.2 NDB

NDB [12] is a distributed, in-memory, shared-nothing database solution that supports ACID transactions with row-level locking, capable of 200M queries per second [13]. Storing database data in volatile memory requires specific work to maintain durability. In the case of NDB, this means a replication factor of two, coupled with regular snapshotting to disk.

ACID transactions (Atomicity, Consistency, Isolation, Durability) are a concept that defines the properties of concurrent single- or multi-statement operations on a database. They ensure that a) either all statements complete, or the effects are entirely discarded (Atomicity), b) consecutive reads of the same data rows give the same result (Consistency), c) concurrent statements operate on the database as if executed alone (Isolation), and d) the result of a completed operation is never lost (Durability). Such transactions are necessary for a database system to emulate the corresponding properties of a data-race-free in-memory data store. NDB supports such transactions with an isolation level of READ COMMITTED. This means that data read by a transaction is guaranteed to be the result of a completed transaction. In essence, a transaction only operates on its own copy of the row (the last committed state before the transaction began) and will only complete successfully if no other simultaneous transaction has updated that row in the meantime. If the transaction fails to complete because its read values were modified by another transaction, it has to roll back and be retried.

To scale the system horizontally, NDB employs sharding. Sharding is a partitioning of a dataset according to some consistent mapping of row keys to data nodes in the cluster. This is done automatically unless another mechanism is specified, e.g., sharding based on a specific column or full-table replication. Sharding your tables correctly can ensure that transactions never involve more than one data node, removing the need for cross-data-node communication and thereby improving transaction speed. NDB falls back to cross-partition transactions when this is not possible. A toy illustration of key-based sharding (not NDB's actual partitioning function; names are hypothetical):
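import hashlib

def partition_for(shard_key: str, num_partitions: int) -> int:
    # Consistently map a row's sharding key to a partition.
    digest = hashlib.md5(shard_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Sharding a table on a chosen column (e.g. a parent inode id) keeps
# related rows on one partition, so a transaction touching only those
# rows involves a single data node.
p = partition_for("parent_inode_id=42", num_partitions=8)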

A key property of NDB that makes it useful for high-performance workloads is that it scales linearly with respect to reads and writes as the number of data nodes in the system increases.

2.3 Hops

Hops is an independent fork of Hadoop (see Section 2.1) that aims to solve some of the inherent scalability issues present in the platform. One of the main complaints against Hadoop is the decision to store all cluster metadata in-memory inside a single Java Virtual Machine (JVM) on the primary Name Node. Furthermore, all Hadoop file system operations are synchronized over the same lock, which limits scalability. To remedy these issues, Hops stores all its metadata in a distributed in-memory database called NDB (see Section 2.2).

Storing the metadata outside the JVM does not come without drawbacks, though. Access times are slower because of network latency, and the network bandwidth between the name nodes and the database nodes limits the speed and number of parallel operations that can be performed. Hops has redesigned some aspects of the file system internals to remove the synchronization bottleneck present in Hadoop. In HopsFS, all file system operations are executed as concurrent transactions [4].


2.3.1 Transactions in Hops

To reduce the number of round-trips to the NDB database, Hops uses a local cache on each Name Node. Tables are accessed through so-called EntityContexts, which manage the cache state. All metadata operations need to implement a transaction interface consisting of three operations: setUp, acquireLocks, and performTask. setUp is allowed to perform individual requests to the database to resolve any additional information necessary for the main task. acquireLocks is the main cache population phase, where the developer specifies a number of database rows that need to be populated (through ”locks”), and performTask performs database updates. Updates performed in performTask are made to the local cached copy of the database rows, and at the end of the method, results are automatically committed back to the database by the transaction handler. The transaction handler takes care of rollbacks and retries in case of failure. The shape of this interface can be sketched as follows (Python pseudocode with hypothetical names; the actual Hops interface is Java):
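class ConflictError(Exception):
    """Another transaction committed a change to rows we read."""

class TransactionalRequest:
    """The three-phase life cycle every metadata operation implements."""

    def setUp(self, db):
        # Phase 1: individual reads allowed, e.g. resolving a path to inode ids.
        pass

    def acquireLocks(self, locks):
        # Phase 2: declare the rows to lock; the handler populates the
        # per-transaction cache (the EntityContexts) with those rows.
        pass

    def performTask(self, cache):
        # Phase 3: mutate the cached rows; the handler commits them afterwards.
        pass

def execute(handler, request, max_retries=3):
    # The handler commits on success, and rolls back and retries on conflict.
    for _ in range(max_retries):
        try:
            return handler.run(request)
        except ConflictError:
            continue
    raise ConflictError("gave up after %d attempts" % max_retries)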

Acquiring locks on multiple rows in multiple tables concurrently can lead to so-called deadlocks, where two transactions each wait for the other to release a lock it has already acquired. The Data Access Layer, discussed in the following section, enforces a total order of lock acquisition that spans both tables and individual rows, which makes such deadlocks impossible.

2.3.2 HopsFS

HopsFS is a drop-in replacement for the Hadoop Distributed File System that replaces the in-memory metadata store of the control node with a database abstraction called the Hops Data Access Layer (DAL for short). The data access layer has one main implementation, interfacing with the in-memory, shared-nothing, distributed database system Oracle NDB. Moving metadata into a database makes it queryable and batch-editable. It also allows for a much larger potential amount of metadata compared to keeping the metadata in memory. In the case of NDB version 7.5, this equates to 24TB of metadata in a 48-node cluster with 512GB of RAM on each NDB data node. Consequently, HopsFS has the potential to store metadata information for up to 17 billion files [4].

At the heart of the Hops File System is the representation of a file or directory: the inode. All inodes have a parent inode, an associated timestamp, ownership information, and permissions. The inode dataset is partitioned by parent inode to enable quick directory listings. The root inode is accessed by all file system operations and is therefore made immutable and cached on all name nodes. However, because of having a single namespace root, the whole directory tree would end up in the same partition. To mitigate this, the children of top-level directories are partitioned according to a configurable hashing scheme, spreading the second-level directories across the database nodes [4].

All file system operations are performed as transactions. However, transactions on their own are not sufficient to scale out a hierarchical file system, where simultaneous subtree operations would rarely succeed without being invalidated by concurrent modifications further down the tree. To handle this, HopsFS uses an application-level locking mechanism where a subtree lock cannot be acquired if a node further down the tree is locked, and vice versa.

While moving the metadata into a database results in slower access times than storing it in-memory, the ability to perform concurrent operations on the data as transactions enables a safe way to scale out the system. Compare this to the HDFS single namespace lock, enforcing serializability, which hampers performance and limits HDFS to a sixteenth of the throughput of HopsFS [4].

HopsFS relies on the same block report protocol for replica management as HDFS and is therefore limited in a similar way. However, as HopsFS introduces a database layer for storing cluster metadata, it actually worsens the scalability constraint of the Block Report. Now not only does the report need to be transmitted over the network from the data node to the name node, but processing on the name node requires reading a number of database rows comparable to the number of replicas in the report. Operations that modify replica metadata are performed in transactions, and so every row read must eventually be locked.

The result is that HopsFS needs to resort to making full reports less frequently to operate at similar cluster sizes. An improved design of the HopsFS block reporting mechanism, to lessen the load on the database and increase block report processing speed, is the main contribution of this work and is presented in Chapter 4.

2.4 Eventual Consistency

Properties in distributed systems are either safety properties or liveness properties. Safety properties give guarantees about what bad things will never happen, and liveness properties state that good things eventually happen. A well-implemented distributed system never violates safety properties and never ceases to make progress (liveness). Consistency models are used to reason about how quickly a system fulfills its liveness properties.

Distributing resources always requires compromises. As famously stated in the CAP Theorem [14] by Eric Brewer, any distributed system can provide at most two of the following three properties: Consistency, Availability, and Partition Tolerance. Consistency considers the behavior of the system under concurrent operations, Availability the continuous function of the system, and Partition Tolerance the case of lost messages between nodes in the system. Because of this inherent limitation of distributed systems, and the almost inevitable loss of messages, most applications that need to scale must compromise between availability and consistency.

Shared-memory abstractions are architectures that distribute memory over multiple units of storage while appearing, to the outside observer, more or less like a single unit of storage. HDFS is an example of this. Since consistency is not a binary property, we use consistency models to help reason about the different levels of consistency a shared-memory abstraction can exhibit. The consistency model best matching that of the HDFS block report is the Eventual Consistency model.

Eventual Consistency guarantees that after all updates have arrived, the system will eventually be in sync, meaning that there is a period of time, after updates have arrived, allocated for conflict resolution. This allows for quicker general operation of the system, with performance penalties only when node states diverge. Another example of a consistency model is Eventual Consistency's more powerful cousin, Strong Consistency, which guarantees that as soon as each node has received all updates, the system is consistent.

2.5 HDFS Block Management

2.5.1 Blocks and Replicas

The file storage architecture of HDFS (Figure 2.1) is modelled after the Google File System [8]. Similar to GFS, files in HDFS are divided into fixed-size parts called blocks (GFS: ”chunks”). Block sizes are configurable, with a default of 64MB, but the division is completely transparent to the user.

Like Hadoop, HDFS relies on the ”single control node, multiple worker nodes” architecture. The name node maintains a complete view of the cluster and decides where new blocks should be placed. By default, HDFS replicates every block three times, and the replication level of a file is the same as that of its blocks. Block placement decisions are made with the goal of balancing load on the cluster and minimizing the risk of data loss.

Physical copies of blocks are called replicas. Replicas are stored on data nodes, which serve as dumb replica stores with no knowledge of the rest of the cluster. Since data nodes are simple, the complexity is moved to the client. To write a file, a client first asks the name node for locations to write the first block. After receiving a list of data nodes to write to, it sets up a pipeline between itself and the data nodes and proceeds with the data transmission. Once a block has been filled and all data nodes in the pipeline have sent acknowledgements, the client requests from the name node a new block and new locations to write to. As a heavily simplified sketch of this write path (the client API names are hypothetical, and the real pipeline chains the data nodes instead of contacting each directly):
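def write_file(name_node, data, block_size=64 * 2**20):
    # Write `data` as a sequence of blocks through data node pipelines.
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        # 1. Ask the name node for a new block and locations to host it.
        block_id, pipeline = name_node.allocate_block()
        # 2. Stream the bytes to the pipeline of data nodes.
        for dn in pipeline:
            dn.receive(block_id, chunk)
        # 3. Wait for all acknowledgements before requesting the next block.
        assert all(dn.ack(block_id) for dn in pipeline)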

Throughout the lifetime of a block, the block itself and its replicas go through a cycle of states. Block states are described in Table 2.1 and replica states in Table 2.2.


Figure 2.1: HDFS Architecture


Table 2.1: Block States

Block State          Description
UNDER CONSTRUCTION   The block is being written.
COMMITTED            The client has closed the block or requested a new block to write to, and the minimum number of data nodes has not yet reported finalized replicas.
COMPLETE             The client has closed the block or requested a new block to write to, and it has reached minimum replication.
UNDER RECOVERY       A writing client's lease has expired and recovery has started.

Table 2.2: Replica States

Replica State   Description
RBW             Replica Being Written: a client has started/is writing data to this replica.
FINALIZED       The client has closed the file/requested a new block, and all data has arrived.
RUR             Replica Under Recovery: recovery of the corresponding block has started.
RWR             Replica Waiting to be Recovered: the state all RBW replicas enter after the data node restarts.
TEMPORARY       Temporary replicas are ephemeral and appear during the transfer of blocks for re-replication.

If no errors are encountered during writing, a new block goes from UNDER CONSTRUCTION to COMMITTED to COMPLETE, while the replicas go from RBW to FINALIZED. A deeper dive into the state transitions can be found in Section 2.5.2.

2.5.2 HDFS Block Report

Every file has a target replication level. Ideally, every file in the file system should be maximally replicated at all times, but because of node failures, disk failures, and network failures, replicas can be lost or corrupted. The replication level is maintained by a combination of client and name node processes.

On the name node, a periodic procedure triggers re-replication of the blocks known to be under-replicated. To avoid overloading the cluster, the batches of blocks scheduled for repair are sized proportionally to the cluster. Blocks can be marked as under-replicated in several ways: by a client reporting corrupted reads, a client reporting pipeline failure when writing, data nodes not reporting finalized replicas, data nodes being marked dead, and file replication upgrades. The data nodes themselves also periodically run checksum controls on their stored replicas.

Since the Name Node maintains a complete view of the cluster, all writes and reads need to be confirmed by the Name Node first. However, data nodes operate independently, so the Name Node's knowledge of the state on the data nodes can quickly become obsolete. To keep track of where blocks are stored and in which state they are, Hadoop deploys a consistency mechanism called Block Report.

To deal with the case of missing messages or other situations that cause replicas to be left in states inconsistent with their blocks, there is a mechanism called Block Report. The Block Report is a two-part mechanism for synchronizing block and replica states.

On the one hand, there is the once-an-hour full report. The full report consists of a complete list of replicas and their states (replica contents excluded, see Table 2.3). The full report is also sent upon data node registration (startup). On the other hand, there are incremental reports. Incremental reports are used to synchronize data node and name node state on the fly as replicas are modified. Specific replica transitions are reported with the following messages:

• Receiving (blockId, generationStamp, RBW)

• Received (blockId, generationStamp, length, FINALIZED)

• Deleted (blockId, generationStamp)

The ”Receiving” message is sent when a write is set up, that is, on create, append, append-recovery (append pipeline failed), and pipeline recovery (normal write failed). The ”Received” message is sent whenever the replica is finalized: close, recovery and close-recovery, and replication (TEMPORARY to FINALIZED). The ”Deleted” message is sent when a replica is deleted. See Figure 2.2 for a detailed overview. The corresponding block transitions can be found in Figure 2.3. Pictured as records, the three messages carry the following fields (a sketch based on Table 2.3, not the actual wire format):
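from dataclasses import dataclass

@dataclass
class Receiving:        # a write to the replica is being set up (state RBW)
    blockId: int
    generationStamp: int

@dataclass
class Received:         # the replica was finalized (state FINALIZED)
    blockId: int
    generationStamp: int
    length: int

@dataclass
class Deleted:          # the replica was removed from the data node
    blockId: int
    generationStamp: int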

Incremental block reports and the full block report together provide a guarantee that replica states will eventually be synchronized with the name node. Every data- or version-changing modification is reported on its own, and in case of lost messages, there is the full report to catch any remaining differences. The Block Report is therefore equivalent to an eventually consistent protocol. With time, if all messages arrive, the state of the data nodes will be in sync with the name node view.


Table 2.3: Reported Replica Fields

Field             Description
blockId           id of the corresponding block
length            number of bytes in the block
generationStamp   a version number
status            RBW/FINALIZED/RUR

Figure 2.2: Replica State Transitions [1]


Figure 2.3: Block State Transitions [1]


The time and space complexity of the block report is linear in the number of blocks on a data node. In HDFS, this has imposed a strict limitation on cluster sizes: since the complete reporting time for 1M blocks is on the order of one second, a report interval of 1h means that roughly 3600 such reports already fill the Name Node's hour, and clusters have therefore been limited to around 4000 nodes.

From the structure of the Block Report mechanism, we can extract the essential requirements it fulfills:

1. On data node start/restart, the state in terms of RUR, RBW, FINALIZED, and deleted replicas is synchronized by an initial full block report.

2. Every replica modification is additionally reported on its own.

3. Every hour, the state in terms of RUR, RBW, FINALIZED, and deleted replicas is synchronized by the full block report.

The fact that incremental reports are sent on every important data modification can be used to define a secondary property: if data nodes receive all replica modifications, and the name node receives all incremental reports, the state is synchronized. We utilize this fact to design the new hash-based block report.

2.6 HopsFS Block Report

While identical in design, the HopsFS Block Report exhibits different performance properties from the HDFS Block Report because it interacts with the database for metadata. As explained in Section 2.3.1, transactions go through three steps: initial data resolution, cache population, and updates. This is true also for the HopsFS Block Report. However, because locking hundreds of thousands of rows within the time constraints of the block report is not feasible in NDB, the replica information is initially read outside of a transaction as an index scan. Then, the list of replicas in the report is compared with the previously read replica information from the database to compute a diff. The differences found in the diff are processed individually as transactions, causing each modification to require a round-trip to the database.

Our measurements show that on an unloaded NDB cluster, we can read the block and replica information at around 200k replicas per second outside a transaction, and a single replica modification transaction takes around 20ms. As we expect a data node in a typical cluster to contain at least 1M blocks, the initial reading phase takes at least 10s. Furthermore, if every block requires a modification, there is an additional 20,000s of total transaction processing. For improved start-up times, this second step can be parallelized, and after start-up, we expect the number of modifications needed in a typical block report to be low. Therefore, the focus is on optimizing the first step.
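As a back-of-the-envelope check of these figures (assuming, to match the 10 s figure, that the scan covers one block row and one replica row per stored replica):

\[
t_{\mathrm{scan}} \approx \frac{2 \times 10^{6}\ \mathrm{rows}}{2 \times 10^{5}\ \mathrm{rows/s}} = 10\,\mathrm{s},
\qquad
t_{\mathrm{tx}} \approx 10^{6} \times 20\,\mathrm{ms} = 2 \times 10^{4}\,\mathrm{s} \approx 5.5\,\mathrm{h}.
\]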


2.7 Merkle Trees

Merkle Trees are balanced binary trees of hashes where the bottom level is full. Merkle Trees were initially conceived as a solution for digitally signing large amounts of data: a pre-existing, proven cryptographic hash function is applied over equally sized chunks of the data, neighboring hashes are combined and hashed again to compute a parent hash, and so on, all the way up to the root. In case of alteration during transmission, the offending chunk(s) can be found simply by extracting the smallest subtree of offending hashes, and only that segment needs to be re-transmitted. A minimal sketch of the construction (Python, using SHA-256 over fixed-size chunks):
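import hashlib

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks: list[bytes]) -> bytes:
    """Merkle root over equally sized chunks; pads the bottom level
    to a power of two so the tree is balanced and the bottom level full."""
    assert chunks, "need at least one chunk"
    level = [_sha256(c) for c in chunks]
    while len(level) & (len(level) - 1):
        level.append(level[-1])          # pad by repeating the last hash
    while len(level) > 1:
        # Combine neighbor hashes pairwise into the parent level.
        level = [_sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

Two parties exchange roots; on a mismatch, they recurse into the child subtree whose hash differs, locating a corrupted chunk in a logarithmic number of hash comparisons.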

2.8 State of the Art

As companies have to deal with ever larger amounts of data, and the need for high-speed access to that data increases as the market becomes more competitive, the performance and scalability of storage systems become a central component of their success. Companies such as Google and MapR and organizations like Apache Hadoop are working on scaling their data storage architectures to larger and larger clusters, but as the CAP impossibility result dictates [14], every solution has to compromise on either consistency, availability, or partition tolerance. As network connectivity problems are the norm rather than the exception, partition tolerance is practically a necessity. Hence, the compromise is between consistency and availability. Varying degrees of consistency and availability require different mechanisms for fault tolerance and redundancy.

This chapter discusses some notable distributed data storage solutions and compares them with the properties and performance of the Hadoop Distributed File System. Figure 2.5 illustrates the differences between the platforms, and the prospects of HopsFS, in terms of scalability and consistency. As Federated HDFS does not provide a unified namespace, it is omitted from the figure.

2.8.1 Federated HDFS

HDFS namespace federation tries to lessen the load on the name node by running several overlapping clusters on the same data nodes. Each cluster has its own folder hierarchy and acts as an independent file system. The architecture is illustrated in Figure 2.4. As HDFS in and of itself scales to 4000 nodes, and you can keep adding federated clusters, the scalability of the solution is practically unlimited. Namespace federation does not come without drawbacks, however. As the file hierarchies are isolated from each other, a file move across namespaces becomes a non-atomic operation. Management is also more complex, since every cluster is managed independently. For file operations within a namespace, the performance is the same as that of single-cluster HDFS.

Figure 2.4: HDFS Federation [2]

2.8.2 Amazon S3

Amazon S3 is a direct competitor to HDFS in the data storage market. In terms of elasticity and cost, S3 wins by a landslide [15]. With S3, you pay for what you store, and the pricing is competitive. HDFS forces you to pay the cost up front for buying or renting the hardware, and any storage space you are not using is wasted. However, to scale the system, Amazon has chosen to strongly reduce the consistency guarantees that S3 provides.

Contrary to the hierarchical nature of HDFS, S3 is a simple object store. Objects are stored in so-called buckets, each with a unique identifier key. According to the documentation, a PUT operation on a new object provides read-after-write consistency, while updates and deletes are eventually consistent [16]. That means that a new file can be read right after being written but might not show up in bucket listings for some time. It also means that in the case of simultaneous updates, a read might return the old data or the result of either update until consistency has been reached. Put simply, a user cannot expect data to be available immediately after writing. This stands in stark contrast to the consistent operations on an HDFS instance.

Figure 2.5: Consistency-Scalability Matrix

2.8.3 Google Cloud Storage

Google Cloud Storage, like S3, is a non-hierarchical object store where objects are organized in buckets. It does, however, provide consistent PUTs of new objects, consistent PUT updates, and consistent deletes. Only access control is eventually consistent. Google Cloud Storage provides higher throughput than S3 for both uploads and downloads, but as an inevitable consequence of the stronger consistency, access latencies are also much higher [17].


Chapter 3

Method

This chapter presents the scientific approach used in this work and the expected outcomes.

3.1 Goals

As an engineering thesis, this work focuses on the design and evaluation of a new solution to an engineering problem. The main goal is to produce a rationale for, a design of, and a verification of a new HDFS Block Report. More concretely, the hope is to provide an improved Block Report mechanism in HopsFS that has the potential to scale to clusters at least 10 times the previous maximum.

3.2 Tasks

The tasks to be undertaken in this work are:

1. A thorough review of related materials

2. An analysis of the state of the art in distributed storage

3. An in-depth analysis of the properties and behavior of the existing HDFS Block Report

4. The production of a rationale for, and the design of, a new Block Report mechanism

5. An evaluation of the design in terms of correctness and performance


3.3 Evaluation

The correctness evaluation is done by reasoning. To evaluate the potential performance of the new solution, we will run simulations on a model of the system, using model parameters based on real-world measurements. If a correct implementation can be achieved in time, performance measurements will also be taken on a 40-node HopsFS cluster running within the SICS ICE data center in Luleå, Sweden.


Chapter 4

The Improved HopsFS Block Report

The block report protocol in HDFS and HopsFS is a limiting factor for cluster scalability. This chapter outlines a new design that has the potential to scale to 10x previous cluster sizes.

4.1 Solution Idea

Our new block report has some strict requirements; namely, it should provide the same guarantees as the old block report (Section 2.5.2). However, it has to do so without the linear-complexity full block report dominating the processing time, while at the same time imposing no more than a marginal overhead on general operation.

The key insight is that because the incremental block report already reports every replica modification, the full block report is mostly redundant. We use this fact to create a CRDT-inspired incremental hashing solution that can be used to determine whether the full block report is necessary or not. To support concurrent modifications, we split the report into a number of buckets, each with its own hash, so that some buckets can be inconsistent without redoing the whole report.

4.1.1 Incremental Hashing

In essence, the block report synchronizes a list of replicas and their states on a data node with the name node's knowledge of that data node. We avoid duplicated work by representing the list of replicas by an integer that we keep up to date based on the incremental reports we have seen. To do this, we use one-way functions and the commutative property of ”+” over the integers.

Assume there is a one-way function that compresses a reported replica (see Table 2.3) into an integer:

\[
hash(\mathit{Replica}) \to \mathbb{Z}, \qquad \mathit{Replica} = (id,\ length,\ generationStamp,\ status)
\]

We can use that function to compute a single integer that represents the whole report, by simply applying it over all the elements in the list and adding up the results:

\[
hash(\mathit{BlockReport}) = \sum_{r \in \mathit{BlockReport}} hash(r)
\]

The key realization is that because of the commutative property of $+$ over $\mathbb{Z}$, the hash function also commutes, such that for replicas $R_1, R_2$ the following holds:

\[
hash([R_1]) + hash([R_2]) = hash([R_2]) + hash([R_1]) = hash([R_1, R_2])
\]

The old block reporting scheme already reports every change to a replica's state in an at-most-once fashion. That allows us to draw the following conclusion:

Theorem 1. Assume un-ordered, exactly-once delivery of incremental reports. Given an initial hash ($H_0 = 0$) on the Name Node and a Data Node with zero replicas, allow only replica modifications that are additions or deletions of complete blocks. Upon performing a number of file writes and file deletions, if the hash is updated incrementally every time an update from the modified Data Node comes in ($+$ for block additions, $-$ for deletions), then when all incremental reports have arrived, the stored hash will match the hash of the complete block report.
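To make the scheme concrete, here is a minimal sketch of the bookkeeping in Theorem 1 (Python; the names are hypothetical and this is not the HopsFS implementation):

import hashlib

MOD = 2**64  # keep the running hash in a fixed-width integer

def replica_hash(block_id, length, gen_stamp, status):
    # One-way compression of a reported replica (Table 2.3) into an integer.
    h = hashlib.sha256(f"{block_id}:{length}:{gen_stamp}:{status}".encode())
    return int.from_bytes(h.digest()[:8], "big")

class NameNodeHash:
    def __init__(self):
        self.hash = 0  # H0 = 0 for a data node with zero replicas

    def on_received(self, replica):
        # Incremental "Received": a complete block was added.
        self.hash = (self.hash + replica_hash(*replica)) % MOD

    def on_deleted(self, replica):
        # Incremental "Deleted": a block was removed.
        self.hash = (self.hash - replica_hash(*replica)) % MOD

    def needs_full_report(self, full_report):
        # Addition commutes, so the order of the replica list is irrelevant.
        reported = sum(replica_hash(*r) for r in full_report) % MOD
        return reported != self.hash

Under the theorem's assumptions, matching hashes mean the expensive per-replica comparison against the database can be skipped entirely; Section 4.2 deals with the cases where the assumptions break.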

The introduction of hflush and append to Hadoop [1] also introduced an array of added messages between the Data Node and Name Node, namely the BlockReceivedAndDeleted message that serves as an indication that a block is either being written, finalized, or deleted on the Data Node. The idea was to create an intermediary block report that catches changes with higher granularity, to enable reading intermediary replicas, while the complete block report was kept as a catch-all solution. We use this pre-existing incremental block report and extend it in our hash-based solution.

4.2 Design

4.2.1 A First Attempt

Assuming successful writing of files and, for now, ignoring appends and internal maintenance, the operations left are file creation and file deletion. Creating or deleting a file without errors will result in corresponding additions and removals of replicas. Theorem 1 handles precisely that, but has two limitations: it assumes that every update arrives exactly once, and it assumes that there is only one possible state, FINALIZED, that a replica can be in. We use it to design our first implementation in Listings 4.1 and 4.2, where processReport is the old block report processing code that checks every reported replica against its knowledge of the associated block, schedules work, and updates metadata. The reported data node state is considered the "truth".

This solution is safe but insufficient. Note that the complete block report blockReport/2 calculates the hash of the actual report, not only of the FINALIZED blocks. That way we can be sure that the actual state of the data node is considered. However, given that at least one replica in a non-FINALIZED state is very likely to exist on a Data Node at any given time, the hash will virtually never match, and so the performance of the block report remains the same. In addition, if a data node restarts and misses updates it will falsely report matching hashes, file appends will cause hash inconsistencies since the last block of the appended file will be reported finalized twice, and replicas that get their generation stamp bumped will cause hash inconsistencies as well.

To summarize, there are four detrimental issues with this solution:

1. Hashes match only if all replicas are FINALIZED (cannot handle concurrent writes)

2. A data node that restarts and misses updates will falsely report matching hashes (incorrect)

3. Hashes conflict if a file has been appended, since the last block of the appended file will be reported as FINALIZED at least twice: once the first time it was written, and again when the append finishes (cannot handle appends)

4. Hashes conflict if a block has been recovered, since the bump in generation stamp counts as a FINALIZED block (cannot handle recovery)

These issues will be addressed in the following sections.


Listing 4.1: Simple Hash-based approach, Name Node

NAME NODE
upon init():
    data_nodes = the set of data nodes in the cluster
    for dn in data_nodes:
        set hashes[dn] = 0

upon incBlockReport(dn, replica):
    keptBlock = processReportedBlock(replica)
    if (keptBlock && replica.state == FINALIZED):
        hashes[dn] = hashes[dn] + hash(replica)
    else if (!keptBlock):
        hashes[dn] = hashes[dn] - hash(replica)

upon blockReport(dn, replicas):
    hash = hash(replicas)
    if (hashes[dn] != hash):
        keptBlocks = processReport(replicas)
        hashes[dn] = hash(keptBlocks)
    else:
        // do nothing

Listing 4.2: Simple Hash-based approach, Data Node

DATA NODE
upon init():
    // Read block states from disk
    blocks = get_disk_blocks()
    do_work()

do_work():
    loop:
        receive commands || do work || receive blocks;
        on block modified:
            trigger(self, incBlockReport(block))
        once an hour:
            trigger(blockReport(self, blocks))


4.2.2 Handling Concurrent Writes

Traditional Hadoop workloads are read-heavy. The Spotify workload that was handed to the Hops team was roughly 97% reads [4], which means a comparatively small number of replica modifications. However, for a current-day Criteo workload, more than 80 terabytes worth of data is added every day. If we assume that the data is being streamed, that each file is at least as big as the block size of 64 MB, and that the replication factor is three, the number of replicas added per day is bounded by 3 · 80 TB / 64 MB ≈ 4M. 4M replicas over 1000 nodes is 4000 replicas per data node per day, ≈ 6 replicas/s/dn. If a replica of 64 MB takes three seconds to write, that means at any point in time there are about 18 replicas being written per data node. Therefore, our single-hash solution is rendered useless.

To account for these concurrent writes, we divide the blockId space into a configurable number of "buckets". Blocks are assigned to buckets consistently, using modulus on the blockId. Each bucket gets an assigned hash and is processed independently of the other buckets. Given a high enough number of buckets, concurrent modifications render only a small fraction of the hashes inconsistent, and major performance gains can still be achieved. The only modification needed is replacing the single-hash-per-datanode data structure with a datanode-to-buckets map (see Listings 4.3 and 4.4).

Listing 4.3: Hash-based approach with Concurrent Modification Handling, Name Node

NAME NODE:
upon init():
    data_nodes = the set of data nodes in the cluster
    for dn in data_nodes:
        for bucketId in bucketIds:
            set hashes[dn][bucketId] = 0

upon incBlockReport(dn, replica):
    keptReplica = processReportedBlock(replica)
    bucketId = replica.blockId % NUM_BUCKETS
    if (keptReplica && replica.state == FINALIZED):
        hashes[dn][bucketId] = hashes[dn][bucketId] + hash(replica)
    else if (!keptReplica):
        hashes[dn][bucketId] = hashes[dn][bucketId] - hash(replica)

upon blockReport(dn, buckets):
    for bucket in buckets:
        bucketId = bucket.id
        hash = hash(bucket.blocks)
        if (hashes[dn][bucketId] != hash):
            keptBlocks = processReport(bucket.blocks)
            hashes[dn][bucketId] = hash(keptBlocks)
        else:
            // do nothing

Listing 4.4: Hash-based approach with Concurrent Modification Handling, Data Node

DATA NODE:
upon init():
    // Read block states from disk
    blocks = get_disk_blocks()
    do_work()

do_work():
    loop:
        receive commands || do work || receive blocks;
        on block modified:
            trigger(self, incBlockReport(block))
        once an hour:
            trigger(blockReport(self, bucketize(blocks)))
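To get a feel for how the number of buckets trades off against write concurrency, here is a back-of-envelope illustration of our own (not part of the design itself): with w replicas concurrently being written on a data node and h buckets, uniform block-to-bucket assignment leaves about h · (1 − (1 − 1/h)^w) buckets inconsistent on average.

// Illustration only: expected number of inconsistent buckets when w replicas
// are concurrently being written, assuming blockIds map uniformly onto the
// h buckets via bucketId = blockId % NUM_BUCKETS.
final class BucketEstimate {
    static double expectedInconsistentBuckets(int h, int w) {
        return h * (1.0 - Math.pow(1.0 - 1.0 / h, w));
    }

    public static void main(String[] args) {
        // With the ~18 concurrent writes per data node estimated in
        // Section 4.2.2 and 1000 buckets, only about 17.8 buckets (<2%)
        // would need a full re-check.
        System.out.println(expectedInconsistentBuckets(1000, 18));
    }
}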

4.2.3 Handling Data Node Restarts

To handle the case of a stale data node, either due to a restart or because of missed messages, we treat the first report, and any report from a data node detected as stale, as if all hashes were inconsistent. Nodes are detected as stale using a mechanism provided internally by HDFS.

Listing 4.5: Handling data node restarts and stale data nodes, Name Node

NAME NODE
// The first block report,
// and reports from stale data nodes,
// should be completely processed
upon blockReport(dn, buckets):
    if (isFirstReport(dn) || isStale(dn)):
        for bucket in buckets:
            keptBlocks = processReport(bucket.blocks)
            hashes[dn][bucket.id] = hash(keptBlocks)
    else:
        for bucket in buckets:
            hash = hash(bucket.blocks)
            if (hashes[dn][bucket.id] != hash):
                keptBlocks = processReport(bucket.blocks)
                hashes[dn][bucket.id] = hash(keptBlocks)

4.2.4 Handling Appends

To handle file appends, whenever a client begins an append operation, we undo the hash contribution of the last FINALIZED replica (see Listing 4.6).

Listing 4.6: Handling Appends, Name Node

NAME NODE
upon Client.append(path):
    file = getFile(path)
    block = getLastBlock(file)
    bucketId = block.id % NUM_BUCKETS
    datanodes = getLocations(block)
    for dn in datanodes:
        hashes[dn][bucketId] = hashes[dn][bucketId] - hash(block, FINALIZED)
    ... // Rest of old append logic

4.2.5 Handling Recovery

Upon block recovery initialization, if no recovery is in progress, we undo the hashes of the corresponding replicas on the data nodes. Furthermore, if a client is forced to update its pipeline due to a write error, we undo the hash of the FINALIZED replica on all data nodes in the previous pipeline (see Listing 4.7).

Listing 4.7: Handling Recovery, Name Node

NAME NODE
// internal name node code,
// called during lease recovery
FSNamesystem.initializeBlockRecovery(block):
    if !inRecovery(block):
        oldDataNodes = getLocations(block)
        bucketId = block.id % NUM_BUCKETS
        for dn in oldDataNodes:
            // hash block as replica by adding status
            hashes[dn][bucketId] -= hash(block, FINALIZED)

// Make sure we do not leave stale replicas unnoticed.
upon Client.updatePipeline(block):
    oldDataNodes = getLocations(block)
    bucketId = block.id % NUM_BUCKETS
    for dn in oldDataNodes:
        // hash block as replica by adding status
        hashes[dn][bucketId] -= hash(block, FINALIZED)
    ... // rest of update pipeline code

4.2.6 Final Design

Listings 4.8 and 4.9 contain the code of the new hash-based block report with all improvements applied.

Listing 4.8: Final Design, Name Node

NAME NODE
upon init():
    data_nodes = the set of data nodes in the cluster
    for dn in data_nodes:
        for bucketId in bucketIds:
            set hashes[dn][bucketId] = 0

upon incBlockReport(dn, replica):
    keptReplica = processReportedBlock(replica)
    bucketId = replica.blockId % NUM_BUCKETS
    if (keptReplica && replica.state == FINALIZED):
        hashes[dn][bucketId] = hashes[dn][bucketId] + hash(replica)
    else if (!keptReplica):
        hashes[dn][bucketId] = hashes[dn][bucketId] - hash(replica)

upon blockReport(dn, buckets):
    if (isFirstReport(dn) || isStale(dn)):
        for bucket in buckets:
            keptBlocks = processReport(bucket.blocks)
            hashes[dn][bucket.id] = hash(keptBlocks)
    else:
        for bucket in buckets:
            hash = hash(bucket.blocks)
            if (hashes[dn][bucket.id] != hash):
                keptBlocks = processReport(bucket.blocks)
                hashes[dn][bucket.id] = hash(keptBlocks)

// Internal code, called during lease recovery
FSNamesystem.initializeBlockRecovery(block):
    if !inRecovery(block):
        oldDataNodes = getLocations(block)
        bucketId = block.id % NUM_BUCKETS
        for dn in oldDataNodes:
            // hash block as replica by adding status
            hashes[dn][bucketId] -= hash(block, FINALIZED)

upon Client.updatePipeline(block):
    oldDataNodes = getLocations(block)
    bucketId = block.id % NUM_BUCKETS
    for dn in oldDataNodes:
        // hash block as replica by adding status
        hashes[dn][bucketId] -= hash(block, FINALIZED)
    ... // rest of update pipeline code

upon Client.append(path):
    file = getFile(path)
    block = getLastBlock(file)
    bucketId = block.id % NUM_BUCKETS
    datanodes = getLocations(block)
    for dn in datanodes:
        hashes[dn][bucketId] = hashes[dn][bucketId] - hash(block, FINALIZED)
    ... // Rest of old append logic

Listing 4.9: Final Design, Data Node

DATA NODE
upon init():
    // Read block states from disk
    blocks = get_disk_blocks()
    do_work()

do_work():
    loop:
        receive commands || do work || receive blocks;
        on block modified:
            trigger(self, incBlockReport(block))
        once an hour:
            trigger(blockReport(self, bucketize(blocks)))


Chapter 5

Analysis

5.1 Results

The performance evaluation was done as simulations of processing time, given different cluster configurations and numbers of file system operations performed per second. Correctness and design analysis is done in Section 5.2.

5.1.1 Simulations

The model for calculating block report speed in HopsFS is given below, where $h$ is the number of buckets, $b$ is the number of blocks per data node, $cps$ is replica changes per data node per second, $bts$ is index-scan rows read per second, $bms$ is replica states corrected per second, $RTT$ is DN-NN round-trip time, $RTT_{DB}$ is NN-database round-trip time, $i_n$ is the number of incorrect hashes after the $n$th block report, and $e$ is the expected number of lost incremental updates between any two block reports:

$$i_0 = 0$$

$$i_n = \max(e,\; BR_n \cdot cps), \quad n \ge 1$$

$$BR_1 = RTT + \frac{bytesInReport}{netSpeed}$$

$$BR_n = RTT + RTT_{DB} + \frac{\min(i_{n-1}, h)}{h} \cdot \frac{b}{bts} + \frac{e}{bms}, \quad n \ge 2$$

Keep in mind that this assumes that stale data nodes are avoided for new block writes, which is the case if the number of stale data nodes is below a configurable threshold percentage, defaulting to 50%.

In every case, we assume a 2.7% write percentage and a network latency of 5 ms. In Figure 5.1 we simulate the effects of increasing the number of buckets, and measure block reporting time as a function of the number of file system operations performed per unit time. In Figure 5.2 we simulate the effects of database read speed on block reporting time. Configuration parameter values are defined in Table 5.1, according to measurements taken in our production cluster. Because of the impact of concurrent operations, we need to do a step-wise simulation until a stable time has been reached. Every individual measurement is therefore the last measurement of a sequence of 50 block reports for that configuration.

Property                         Value
RTT                              5 ms
RTT_DB                           1 ms
Bytes per reported block         30
Network speed                    5 Gbps
Corrections per second (bms)     1 000
Errors between reports (e)       50
Blocks read per second (bts)     150 000

Table 5.1: Simulation Configuration
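As an illustration of how these measurements were produced, the following is a minimal sketch of the step-wise simulation under the model of Section 5.1.1 (our own code, not the HopsFS implementation; parameter values follow Table 5.1, and the load in main is hypothetical):

// Sketch of the step-wise block report simulation (model of Section 5.1.1).
final class BlockReportSimulation {
    static final double RTT = 0.005;          // DN-NN round-trip time [s]
    static final double RTT_DB = 0.001;       // NN-database round-trip time [s]
    static final double BTS = 150_000;        // index-scan rows read per second
    static final double BMS = 1_000;          // replica states corrected per second
    static final double E = 50;               // expected lost updates between reports
    static final double BYTES_PER_BLOCK = 30; // bytes per reported block
    static final double NET_SPEED = 5e9 / 8;  // 5 Gbps in bytes/s

    // Runs 50 consecutive reports and returns the (stable) time of the last one.
    static double stableReportTime(double h,    // buckets per data node
                                   double b,    // blocks per data node
                                   double cps)  // replica changes per dn per second
    {
        double br = RTT + (BYTES_PER_BLOCK * b) / NET_SPEED;  // BR_1
        double i = Math.max(E, br * cps);                     // i_1
        for (int n = 2; n <= 50; n++) {
            // BR_n = RTT + RTT_DB + min(i_{n-1}, h)/h * b/bts + e/bms
            br = RTT + RTT_DB + Math.min(i, h) / h * (b / BTS) + E / BMS;
            i = Math.max(E, br * cps);                        // i_n
        }
        return br;
    }

    public static void main(String[] args) {
        // Hypothetical load: 1M blocks per data node, 100 replica changes/s/dn.
        for (double h : new double[] {1_000, 2_000, 3_000, 4_000}) {
            System.out.printf("%.0f buckets/dn -> %.3f s per report%n",
                              h, stableReportTime(h, 1_000_000, 100));
        }
    }
}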

5.1.2 Real-world Performance

Unfortunately, we did not manage to reproduce these results in a cluster, because of remaining issues with the implementation. However, to get as close as possible to real-world values in our simulation, we adjusted the model parameters according to values measured on the HopsFS production cluster in SICS ICE, Luleå.

5.2 Analysis

5.2.1 Correctness

Correctness is an essential property for a data-loss protection mechanism. The traditional HDFS block report is quite simple to reason about, since the complete data node state is transmitted and replicas are compared one by one. Our hash-based solution instead compares a number of calculated hashes, and we must therefore show that these hashes correspond to the expected state. The hash-based reporting uses knowledge of the underlying state and its transformations to keep the hashes up to date.

We start by showing correctness of a single-hash solution, and afterwards extrapolate it to multiple "buckets". As we are comparing a hash of the state instead of the actual state, we need to make sure that the hash actually corresponds to the state it is supposed to represent.


[Figure 5.1: Impact of Number of Buckets on Block Report Time. x-axis: millions of file system operations per second (2.7% writes) [M ops/s]; y-axis: time per report [s]; series: 1k, 2k, 3k, and 4k buckets/dn.]

[Figure 5.2: Impact of Block Processing Speed on Block Report Time. x-axis: millions of file system operations per second (2.7% writes) [M ops/s]; y-axis: time per report [s]; series: 50k, 100k, and 150k blocks/s.]


On the data node, processing power is abundant, so we generate a new hash every time we send a report. On the name node it gets more complicated. Block and replica information is spread over multiple tables: block_infos, replicas, and replica_under_constructions. The block_infos table contains length, block id, inode reference, and generation stamp. The replicas table is simply a mapping between blocks and storages (data nodes), and the replica_under_constructions table contains non-finalized replica states. Our hash depends on all of the values block id, block length, generation stamp, and construction state, so if a block_infos entry is modified in either length or generation stamp, this must be reflected in the hash of every replica of that block.
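A sketch of what that propagation entails, reusing the hypothetical ReplicaHash helper from the sketch in Section 4.1.1 (the class, record, and map layouts here are our own illustration, not the HopsFS schema):

import java.util.List;
import java.util.Map;

// Sketch: when a block_infos row changes (length or generation stamp),
// the per-bucket hash of every data node hosting a replica of that block
// must be updated as well.
final class BlockInfoHashUpdater {
    record Block(long id, long length, long genStamp) {}
    static final int NUM_BUCKETS = 1_000;
    static final int FINALIZED = 3; // illustrative status code

    final Map<String, long[]> hashes;        // data node id -> per-bucket hashes
    final Map<Long, List<String>> locations; // block id -> hosting data nodes

    BlockInfoHashUpdater(Map<String, long[]> hashes, Map<Long, List<String>> locations) {
        this.hashes = hashes;
        this.locations = locations;
    }

    void onBlockInfoUpdate(Block oldB, Block newB) {
        // Remove the old contribution and add the new one in a single delta.
        long delta = ReplicaHash.hash(newB.id(), newB.length(), newB.genStamp(), FINALIZED)
                   - ReplicaHash.hash(oldB.id(), oldB.length(), oldB.genStamp(), FINALIZED);
        int bucketId = (int) (oldB.id() % NUM_BUCKETS);
        for (String dn : locations.get(oldB.id())) {
            hashes.get(dn)[bucketId] += delta;
        }
    }
}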

Since we are not re-calculating the hash, but rather incrementally updating it on replica or block updates, we need to make sure that all updates are reflected in the hash, and that in case of doubt we invalidate the hash rather than give a false positive. We believe that the reasoning put forward in the first attempt at a solution (see Section 4.2.1) and the improvements introduced in the following Sections 4.2.3, 4.2.4 and 4.2.5 are sufficient to show the validity of the solution.

5.2.2 Pitfalls

The motivation behind this project was a previously infeasible block reporting time of roughly 10 s per million blocks which, given the high expectations on durability in HDFS, was not enough to scale an hourly-reporting cluster configuration to more than perhaps a thousand nodes at best. HopsFS comes with load balancing between name nodes and durable metadata in the distributed database NDB, but that is not quite enough. With the improved design we can see that, given enough buckets, the block reporting speed is improved by at least an order of magnitude, with a worst-case report time comparable to the old solution. However, there are several pitfalls to consider.

5.2.2.1 Sensitivity to Misconfiguration

The most obvious problem is that of configuring block reports with too few buckets to handle concurrent operations. The threshold between large improvements and no improvement is very narrow. The results of the simulation clearly show that, given enough buckets to handle concurrent operations, block report times remain very short. However, what happens when the number of file system operations per second increases to the point of hitting the threshold where the buckets are no longer enough? Currently, the only way to change the number of buckets is to shut down the whole cluster, reconfigure every data node, reconfigure the name nodes, and update the bucket ids for every replica in the database. If downtime is not acceptable, we are stuck. During the course of the work a few ideas have been discussed.

Replacing the static bucket assignment with a Merkle tree (see Section 2.7) has the potential to solve this problem gracefully. Using a Merkle tree, one can dynamically adjust the level of the tree from which the hashes are extracted, giving the opportunity to go further down the tree if any part of the block id space becomes over-crowded with blocks. The problem is that of incremental updates: every incremental update would need to recalculate hashes all the way up the tree to the root node, causing multiple round-trips to the database. There is the potential of using MySQL stored procedures to perform the cascading updates on the database side, but we deemed this to be outside the scope of the thesis.

Another idea was to assign the blocks to buckets in a ring, placing newly created blocks in the newest bucket, thereby pushing most possible inconsistencies to the end of the ring. However, we did not manage to come up with a working concept.

The static bucket assignment could potentially lead to imbalanced bucket populations, by pure chance, but also from recurring operations, for example if the creation of new blocks is timed in an unfortunate manner that creates periodicity in block-to-bucket assignments. At this time we have not identified any such pattern.

5.2.2.2 Real-world Large Scale Systems

To thoroughly test the viability of this type of solution, in-depth benchmarks of a working implementation in a large cluster would be necessary. As a fully functional implementation was not achieved, and we did not have access to a large enough cluster, this kind of analysis has not been possible within the scope of the thesis.

5.2.2.3 Computation and Storage Overhead

Despite not completing the implementation bug-free, we could still test the performance overhead of the solution, as we believe that what we have is nearly finished. In our general performance tests we have not seen more than a marginal change in performance. Furthermore, the storage overhead of hash buckets, on the order of thousands of rows per data node, is marginal with respect to the potentially millions of rows representing replicas and block information.

5.3 Catastrophic Failure

So far we have identified at least two problematic failure scenarios: the first is mass block deletion, the other is mass expiration of data node heartbeats.

Since deleting blocks is much faster than writing blocks (a file delete on the data node's local file system would just be the deletion of the corresponding inode), the number of changes per data node per second could spike during mass deletion of data. A block report being processed during such a deletion would see a high number of inconsistent hashes due to the modifications happening concurrently, and so the next block report for the same data node would probably hit the worst-case scenario of a full block report, increasing the load on the name node further. Over time, if the operation frequency goes back to previous values, so will the block reporting time.

The simulation model assumes a low number of stale data nodes, which causes the name node to avoid writing new blocks to those that are stale. For the hash-based reporting mechanism, this means that during the first full block report after becoming stale, the number of concurrent modifications would remain relatively small, and so would the number of inconsistent hashes. If the number of stale data nodes ever passes the threshold, the name node writes new blocks also to the stale data nodes, and the number of inconsistent hashes would increase sharply, causing the next report to be slow as well.


Chapter 6

Conclusions

6.1 Conclusion

By analyzing the results of the simulations done using the fine-grained model presented in Section 5.1.1, we conclude that the new HopsFS block report has clear performance benefits compared to the HDFS block report, at least for an implementation that uses distributed metadata. Given a correct configuration, the resulting block report times are at least 10 times faster, which allows for scaling the system to clusters at least 10 times the previous size. Despite an initial goal of providing a complete implementation and real-world results, we did not manage to do so within the time frame of the thesis. We believe that our model is accurate, however, and that the work will come to fruition as soon as the last software bugs are ironed out.

The solution does not come without drawbacks, such as the ones presented in Section 5.2.2, e.g. the issue of reconfiguration and that of handling catastrophic failures. In addition, benchmarks of a fully working implementation would be required to verify the model used to simulate large clusters. In the following section we present proposed areas of future work.

6.2 Future Work

We propose the following areas of further research.

Implementation: A complete implementation and real-world analysis of behavior over time in a production cluster, to help verify the results of this work.

HDFS Applicability: Exploring the feasibility of the proposed hash-based solution for replacing the Apache HDFS implementation.

Failure Scenarios: Finding more failure scenarios and solutions to deal with them, to increase the robustness of the solution.

Reconfiguration: Exploring solutions to the reconfiguration problem of the static block id space partition.

General Applicability: Using the concept of divide-and-hash to improve the performance of other distributed systems that require synchronization of large amounts of state.


Bibliography

[1] H. Kuang, K. Shvachko, N. Sze, S. Radia, and R. Chansler, “Append/Hflush/Read Design,” Tech. Rep., 2009. [Online]. Available: https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf

[2] S. Shrinvas, “An Introduction to HDFS Federation,” Aug 2011. [Online]. Available: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/

[3] “Hadoop Front Page.” [Online]. Available: http://hadoop.apache.org/

[4] S. Niazi, M. Ismail, S. Haridi, J. Dowling, S. Grohsschmiedt, and M. Ronstrom, “HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases,” arXiv preprint arXiv:1606.01588, pp. 1–15, 2016.

[5] M. Letia, N. Preguica, and M. Shapiro, “CRDTs: Consistency without concurrency control,” arXiv preprint arXiv:0907.0929, no. RR-6956, pp. 1–20, 2009. [Online]. Available: http://arxiv.org/abs/0907.0929

[6] V. Kumar Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in SOCC '13: Proceedings of the 4th Annual Symposium on Cloud Computing, vol. 13, pp. 1–3, 2013. doi: 10.1145/2523616.2523633

[7] “Data never sleeps 5.0 — domo.” [Online]. Available: https://www.domo.com/learn/data-never-sleeps-5

[8] S. Ghemawat, H. Gobioff, and S.-t. Leung, “The Google File System,” 2003.

[9] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pp. 137–149, 2004. doi: 10.1145/1327452.1327492

[10] “Apache Hadoop Review, Customers, and Alternatives.” [Online]. Available: https://siftery.com/apache-hadoop

[11] “Hadoop Releases.” [Online]. Available: https://hadoop.apache.org/releases.html

[12] M. Mether, “MySQL Cluster Internal Architecture,” 2013.

[13] M. Ronstrom, “200M reads per second in MySQL Cluster 7.4,” 2015. [Online]. Available: http://mikaelronstrom.blogspot.se/2015/03/200m-reads-per-second-in-mysql-cluster.html

[14] E. A. Brewer, “Towards robust distributed systems,” in Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '00), 2000, p. 7. doi: 10.1145/343477.343502. ISBN 1581131836. ISSN 0163-5700. [Online]. Available: http://portal.acm.org/citation.cfm?doid=343477.343502

[15] “Top 5 Reasons for Choosing S3 over HDFS.” [Online]. Available: https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html

[16] “Amazon S3 Consistency Model.” [Online]. Available: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

[17] Z. Bjornson, “Cloud Storage Performance.” [Online]. Available: http://blog.zachbjornson.com/2015/12/29/cloud-storage-performance.html


TRITA-ICT-EX-2017:150

www.kth.se