Universite catholique de Louvain
Louvain School of Engineering
Computing Science Engineering Department
Designing an elastic and scalable social network application
Promoter:
Pr. Peter Van Roy
Readers:
Pr. Marc Lobelle
Boris Mejías
Master's thesis presented to obtain the degree of
Master in Computer Engineering, option
networking and security, by
Xavier De Coster and Matthieu Ghilain.
Louvain-la-Neuve
Academic year 2010 - 2011
Acknowledgments
The Bwitter team would like to thank Pr. Peter Van Roy for his help and insightful
comments.
We also want to thank Boris Mejías for his guidance and availability during the
whole project.
We thank Florian Schintke, member of the Scalaris development team, for his help
during our analysis of Scalaris and the numerous answers he provided to our questions.
We also thank Quentin Hunin for his support and constructive feedback during the
last few weeks of writing.
Finally, we would also like to thank our families, our friends and our girlfriends,
Ines and Lorraine, for their unconditional support and encouragement.
Abstract
The amount of traffic on web-based social networks is very difficult to predict. In
order to avoid wasting resources during low-traffic periods or being overloaded during
peak periods, it is desirable to adapt the amount of resources dedicated to the service.
In this work we detail the design and implementation of our own social network
application, called Bwitter. Our first goal is to make Bwitter's performance scale with
the number of machines we dedicate to it. Our second goal follows from the first: we
want Bwitter to be elastic, so that it can react to flash crowds by adding resources to
handle the load without suspending its services. To achieve the desired scalability and
elasticity, Bwitter is implemented on a scalable key/value datastore with transactional
capabilities running on the cloud.
In our tests we study the behaviour of Bwitter using the Scalaris datastore, with
both running on Amazon's Elastic Compute Cloud. We show that the performance of
Bwitter increases almost linearly with the amount of resources we allocate to it. Bwitter
is also able to improve its performance significantly in a matter of minutes.
Contents
I The Project
1 Introduction
1.1 Social networks
1.2 Scalable Data Stores
1.3 The Cloud
1.4 The Bwitter project
1.4.1 Twitter
1.4.2 Bwitter
1.4.3 Contributions
1.5 Roadmap
2 State-of-the-art
2.1 Scalable datastores
2.1.1 Key/value Stores
2.1.2 Document Stores
2.1.3 Extensible Record Stores
2.1.4 Relational Databases
2.2 Peer-to-peer systems
2.3 DHT
2.4 Study of scalable key/value stores properties
2.4.1 Network topology
2.4.2 Storage abstraction
2.4.3 Replication strategy and consistency model
2.4.4 Transactions
2.4.5 Churn
2.4.6 Security
2.5 The Cloud
2.6 Conclusion
3 The Architecture
3.1 The requirements
3.1.1 Non-Functional requirements
3.1.2 Functional requirements
3.1.3 Conclusion
3.2 Architecture
3.2.1 Open peer-to-peer architecture
3.2.2 Cloud Based architecture
3.2.3 The popular value problem
3.2.4 Conclusion
4 The Datastore
4.1 The datastore choice
4.1.1 Identifying what we need
4.1.2 Our two choices
4.2 General Design
4.3 Design of the datastore
4.3.1 Key uniqueness
4.3.2 Push approach design details
4.3.3 The Pull Variation
4.3.4 Conclusion
4.4 Running multiple services using the same datastore
4.4.1 The unprotected data problem
4.4.2 Key already used problem
4.4.3 Conclusion
5 Algorithms and Implementation
5.1 Implementation of the cloud based architecture
5.1.1 Open peer-to-peer implementation
5.1.2 First cloud based implementation
5.1.3 Final cloud based implementation
5.2 Nodes Manager
5.3 Scalaris Connections Manager
5.3.1 Failure handling
5.4 Bwitter Request Handler
5.4.1 The push approach
5.4.2 The pull approach
5.4.3 Theoretical comparison of Pull and Push approach
5.5 Conclusion
6 Experiments
6.1 Working with Amazon
6.1.1 Choosing the right instance type
6.1.2 Choosing an AMI
6.1.3 Instance security group
6.1.4 Constructing Scalaris AMI
6.2 Working with Scalaris
6.2.1 Launching a Scalaris ring
6.2.2 Scalaris performance analysis
6.3 Bwitter tests
6.3.1 Experiment measures discussion
6.3.2 Push design tests
6.3.3 Pull scalability test
6.3.4 Conclusion: Pull versus Push
6.4 Conclusion
7 Conclusion
7.1 Further work
II The Annexes
8 Beernet Secret API
8.1 Without replication
8.1.1 Put
8.1.2 Delete
8.2 With replication
8.2.1 Write
8.2.2 CreateSet
8.2.3 Add
8.2.4 Remove
8.2.5 DestroySet
9 Bwitter API
9.1 User management
9.1.1 createUser
9.1.2 deleteAccount
9.2 Tweets
9.2.1 postTweet
9.2.2 reTweet
9.2.3 reply
9.2.4 deleteTweet
9.3 Lines
9.3.1 addUser
9.3.2 removeUser
9.3.3 allUsersFromLine
9.3.4 allTweet
9.3.5 getTweetsFromLine
9.3.6 createLine
9.3.7 deleteLine
9.3.8 getLineNames
9.4 Lists
9.4.1 addTweetToList
9.4.2 removeTweetFromList
9.4.3 getTweetsFromList
9.4.4 createList
9.4.5 deleteList
9.4.6 getListNames
10 The paper
Part I
The Project
Chapter 1
Introduction
Web 2.0 offers many new services to Internet users, who can now share, generate
and upload content online faster and more easily than ever before. All these services
require computing, bandwidth and storage resources. Predicting the required amount
of those resources can be tricky, especially if a service wants to avoid wasting them
while still being able to face high usage peaks. We are going to take a closer look at
the scalability and elasticity of perhaps the most famous of these Web 2.0 services,
namely social networks.
1.1 Social networks
Social networks such as Facebook and Twitter are an increasingly popular way for
people to interact and express themselves. Facebook, for instance, has 600 million active
users [6]. People can now create content and easily share it with others. Social networks
are now a means of communication in their own right, used by politicians, artists and
brands to easily reach large communities and promote themselves or their products.
They also allow people to quickly organise social events, from barbecues to nationwide
revolutions such as those in Tunisia [21, 40] and Egypt [16].
Social networks are also a powerful communication tool during natural disasters.
Twitter and Facebook were very useful for getting updates from relatives and friends
when the mobile phone networks and some telephone landlines collapsed in the hours
following the magnitude 8.9 earthquake in Japan. The US State Department even used
Twitter to publish emergency numbers [35]. Other examples are the Haiti [18] and
Chile [9] earthquakes, which were covered in real time thanks to social networks, with
photos sent to the rest of the world directly via Twitter.
It is thus critical that social networks do not crash when their users need them
most. However, the servers of these social networks can only handle a given number of
simultaneous requests; if there are too many, the servers become overloaded. A typical
result of overloading is Twitter suspending its services and displaying the "Fail Whale"
shown in Figure 1.1.
Figure 1.1: "Lifting a Dreamer", aka the Fail Whale, illustration by Yiying Lu displayed when Twitter is overloaded.
Avoiding overload efficiently is a tricky problem, as the load is related to many
social factors, some of which are impossible to predict. For instance, we want to be able
to handle the large number of people sending Christmas or New Year wishes, but also
those reacting to natural disasters. This is why we turn towards scalable and elastic
solutions, allowing the system to add and remove resources on the fly in order to fit
the required load.
Social networks are also platforms where users share personal information meant
to be seen only by specific peers. Other personal information, such as contact details,
is sometimes stored in the system too. More and more users are beginning to worry
about who ultimately has access to this information and what can be done with it. It
is thus important to have a system that is secure and enforces the privacy of the end
user.
1.2 Scalable Data Stores
Web 2.0 called for a different kind of database than the previous Relational
Database Management System (RDBMS) solutions: data stores able to host huge
amounts of data and handle many parallel requests at the same time. There are now
numerous scalable and elastic storage solutions answering this demand. These scalable
data stores can store increasingly more data and handle more requests as we allocate
more resources to them, because they have been built to share their load over the
machines allocated to them. These data stores also have elastic properties, allowing
them to add or remove resources to gracefully scale up or down without having to be
rebooted. This elasticity is crucial for scaling up to face sudden increases in traffic, but
also for scaling down when the hype is over, in order to avoid wasting resources.
As our work revolves around the scalability and elasticity of social network applications,
we are bound to work with these scalable data stores. Many different kinds of
scalable data stores exist, and we present them in our state-of-the-art in Chapter 2.
1.3 The Cloud
The cloud is a phenomenon that is hard to ignore these days, as most web applications
tend to rely on it to provide their services. The cloud refers to on-demand
resources such as storage, bandwidth and processing power, but also to on-demand
services such as e-mail or word processing [2]. Computation can thus be transferred
from the user's machine, as was the case in the past, to the machines forming the cloud.
This allows users with very little computational power or storage to still execute heavy
calculations or store huge amounts of data. A typical analogy for the usage of cloud
resources is public utilities such as water or electricity: specialised companies provide
those services at a fraction of what it would cost us to deploy and maintain all the
required infrastructure ourselves.
The cloud is thus the ideal platform if we do not want to invest in costly hardware
and maintenance. This is especially true if we do not know beforehand whether our
service will be successful. We can start small and pay for only a small amount of
resources. If our service is popular, we can easily grow by requesting more resources,
and thus pay a higher price. But if our service does not manage to attract many people,
we have not wasted money investing in powerful servers. Furthermore, the resources
the cloud offers are elastic, meaning we can increase or decrease them on the fly and
only pay for the amount we really need to keep our service going.
We use the scalability and elasticity properties of the cloud throughout this work,
which is why we detail it further in our state-of-the-art in Chapter 2.
1.4 The Bwitter project
Bwitter is a lighter version of Twitter, the famous social network. Since some
readers might be unfamiliar with Twitter, we introduce it briefly before going further
with the description of Bwitter.
1.4.1 Twitter
Twitter is a micro-blogging system that allows users to post small text messages
of 140 characters called tweets. An enormous number of tweets is posted each day:
according to Twitter themselves [33], 177 million tweets were posted in March 2011,
and the record is 6,939 tweets per second, set 4 seconds after midnight in Japan on
New Year's Day.
Users can choose to display the messages of other users they find interesting by
following them. In Figure 1.2 you can see the home screen of Twitter with user
Zulag (aka Xavier De Coster, co-author of this Master's thesis) logged in; the messages
of the users he follows are displayed as a stream in his "Timeline".
Figure 1.2: Home screen of Twitter’s web interface.
Twitter offers additional functionality on top of this message posting: for instance,
a user can reply to or retweet (share) any message he wants. He can also address a
message directly to another user by starting his message with "@destinationUser". The
main difference between Twitter, and now also Google+ 1, and other social networks
such as Facebook is the asymmetry of the social connections. The connection does not
go in both directions: a user A can follow a user B without user B having to follow
user A. This is unlike the Facebook system, where two users become "Friends" and
automatically see each other's updates. This behaviour encourages Twitter to be used
as a place where fans can follow their favourite stars, so much so that 10% of the user
accounts generate 90% of the traffic [17].
Tweets can also be tagged by users using hashtags. Tweets containing a hashtag
are automatically added to a group of tweets associated with this hashtag.
1https://plus.google.com/, last accessed 13/08/2011
4
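The asymmetric follow relation described above can be modelled as a directed graph. The sketch below is ours, not part of Twitter or Bwitter; the names and functions are illustrative:

```python
# A follow relation as a directed graph: follows[a] is the set of users
# that user a follows. Edges are one-way, unlike Facebook's symmetric
# "Friend" relation, so A can follow B without B following A.
follows = {"A": set(), "B": set()}

def follow(follower, followee):
    follows[follower].add(followee)

def timeline_sources(user):
    # The users whose tweets appear in `user`'s timeline.
    return follows[user]

follow("A", "B")                      # A follows B ...
assert "B" in timeline_sources("A")
assert "A" not in timeline_sources("B")  # ... without B following A
```

This one-way edge is what lets a star accumulate millions of followers without following anyone back.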
1.4.2 Bwitter
We decided to develop Bwitter as an elastic and scalable social network application
and to study how it behaves when faced with flash crowds and heavy traffic.
Bwitter is an open-source version of Twitter based on a scalable data store and
developed to run on a highly elastic cloud architecture. Bwitter thus offers functionality
similar to Twitter's. We chose Twitter because it is one of the more basic social
networks and because it is now incredibly famous. The data store used by Bwitter is a
key/value store, which we present in detail in Chapter 4. Bwitter was designed so that
other services could run on the same data store without interfering with each other's
data.
Bwitter is developed in multiple loosely coupled layers, allowing for maximal
modularity. We added an optional cache layer on top of the key/value data store
in order to maximise performance. Bwitter manages the cloud machines on which
it runs as well as the data store nodes, restarting them when needed. During the
implementation we took advantage of existing and proven technologies, leading to an
efficient and robust implementation.
1.4.3 Contributions
The main contributions of this work are:
• The design of a scalable social network for microblogging.
• Improvements to Beernet's API.
• Helping to improve the bootstrapping of Scalaris and studying its behaviour on
the Amazon Elastic Compute Cloud.
During the development of Bwitter we identified some potential improvements in
one of the datastores we were using, namely Beernet [19, 23], and designed a new API
for protecting and managing the rights to the stored data. This new API, supporting
secrets, is now implemented and supported in Beernet version 0.9.
In order to further understand the behaviour of Bwitter, we ran performance tests
with the Scalaris [29] data store on Amazon's Elastic Compute Cloud (EC2), testing
its scalability and elasticity. We also studied the impact of machine resources, the
number of parallel requests and conflicting operations on Scalaris' performance. During
our discussions with the developers of Scalaris we helped them locate an instability in
the booting of their system.
Ultimately, we implemented two different designs for Bwitter and tested both on
Amazon's EC2, showing very good scalability and elasticity properties. During the
course of the development we presented a demo of our project at the Beernet stand of
the "Foire du Libre"2, held on the 6th of April at Louvain-la-Neuve. We have also
co-written an article, along with Peter Van Roy and Boris Mejías, entitled "Designing
an Elastic and Scalable Social Network Application", in which we detail some of the
observations and design decisions developed in this master's thesis. This article, which
can be found in Chapter 10 of our annexes, has been accepted for the Second
International Conference on Cloud Computing, GRIDs, and Virtualization3, organised
by IARIA and held from the 25th to the 30th of September 2011 in Rome, Italy.
2The "Foire du Libre" is a fair celebrating open source software, organised by Louvain-li-Nux: http://www.louvainlinux.be/foire-du-libre/, last accessed 05/08/2011
1.5 Roadmap
We start with our state-of-the-art in Chapter 2, where we discuss the different
technologies we used and explored during the development of Bwitter, such as scalable
data stores and cloud services.
We then identify the main requirements of our project and discuss the general
architecture of Bwitter in Chapter 3. We explain why we chose to base it on the
cloud instead of letting it run in the wild on an open peer-to-peer system. In this
chapter we also explain how a cache could solve potential problems due to values being
too popular.
The next step is an in-depth look at the data store we are going to use, in
Chapter 4. We detail our main objectives in terms of data representation and explain
how we decided to store the different data abstractions we use in our data store. We
also look at how we can avoid conflicts between two different applications using the
same data store.
We detail the different modules composing the Bwitter system in Chapter 5,
highlighting their purpose and the main algorithms developed to implement them. We
also compare more thoroughly the two different approaches, push and pull, for our
application's most crucial functions, posting and reading tweets. We end this chapter
with a global overview of the implemented architecture and detail how the different
modules fit together.
We carry on with a series of experiments in Chapter 6. We start by testing
Scalaris and measuring the impact of a few chosen parameters on its performance,
scalability and elasticity. We then measure the performance, scalability and elasticity
of Bwitter and compare the results for the push and pull approaches.
We finish this master's thesis with a conclusion in Chapter 7, where we reflect on
the achieved work, the lessons learned and the further improvements that could be
made to our application.
In the annexes you will find the new API we designed for Beernet, the API of
Bwitter and a section for our mathematical demonstrations.
3CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011
Chapter 2
State-of-the-art
In this chapter we take a look at the relevant technologies that could be useful
to the Bwitter project. We start with the different existing scalable datastores, in
order to decide which kind is most appropriate for our application. From there we take
a closer look at peer-to-peer systems and their lookup performance, and further study
the properties of Distributed Hash Tables (DHTs). Finally, we give an overview of the
different services the cloud has to offer.
2.1 Scalable datastores
We start our state-of-the-art with a section about scalable datastores. As our
application is going to rely heavily on a datastore, it is important to understand the
different kinds available today, as well as their pros and cons [7].
Several kinds of scalable datastores are available, each with its own specificities,
but four main classes can be put forward: key/value stores, document stores,
extensible record stores and relational databases. We are going to compare the
functionality they provide and the way they achieve scalability.
Most of these datastores do not provide ACID properties, but BASE properties.
ACID stands for Atomicity, Consistency, Isolation, Durability; BASE stands for
Basically Available, Soft state, Eventually consistent. This eventual consistency is often
said to be a consequence of Eric Brewer's CAP theorem [29], which states that a system
can have only two out of three of the following properties: consistency, availability, and
partition tolerance. Most scalable datastores decide to give up consistency, but some
opt for more complex trade-offs.
2.1.1 Key/value Stores
These are the simplest kind of datastore: they store values at user-defined indexes
called keys and behave as hash tables. They are very useful if you need to look up
objects based on only one attribute; otherwise you might want a more complex
datastore. Some key/value stores provide a key/set abstraction allowing multiple values
to be stored at a single key. Key/value stores all support insert, delete and lookup
operations, but they also generally provide a persistence mechanism and additional
functionality such as versioning, locking and transactions. Replication can be
synchronous or asynchronous; the second option allows faster operations, but some
updates may be lost on a crash and consistency cannot be guaranteed. Their scalability
is ensured through key distribution over nodes, and some present ACID properties. In
conclusion, this solution, by its simplicity, scales easily, but this simplicity comes at
the cost of poor data structure abstractions. Notable examples are Scalaris, Riak,
Voldemort, Redis and Beernet.
Figure 2.1: Data organisation in a key/value datastore.
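The operations listed above can be summarised in a minimal in-memory model. This is our own sketch; the class and method names are illustrative and do not correspond to any particular store's API:

```python
# Minimal in-memory model of a key/value store with the key/set
# abstraction mentioned above. Real stores (Scalaris, Riak, ...) add
# replication, persistence and, for some, transactions on top of this.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # insert or overwrite the value stored at `key`
        self._data[key] = value

    def get(self, key):
        # lookup is possible only by the single key attribute
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def add_to_set(self, key, value):
        # key/set abstraction: multiple values under one key
        self._data.setdefault(key, set()).add(value)

store = KVStore()
store.put("user:42:name", "alice")
store.add_to_set("user:42:followers", "bob")
store.add_to_set("user:42:followers", "carol")
```

Note that everything beyond get-by-key (e.g. "all users in city X") must be built by the application on top of such keys, which is the "poor data structure abstractions" cost mentioned above.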
2.1.2 Document Stores
These systems store documents and index them. A document can be seen as an
object whose attribute names are dynamically defined for each document at runtime.
These attributes are not necessarily predefined in a global schema, unlike, for instance,
SQL, which imposes defining the schema beforehand. Moreover, the attributes can be
complex, meaning that nested and composite values are allowed. It is possible to
explicitly define indexes to speed up searches.
Replication is asynchronous in order to increase the speed of operations. Often
scalability is ensured by reading only one replica, thus sacrificing strong consistency,
but some document stores, like MongoDB, can obtain scalability without that
compromise. MongoDB allows parts of a collection to be split across several nodes in
order to increase scalability, instead of relying on replication. This technique is called
sharding.
Figure 2.2: Data organisation in a document store.
A popular abstraction, called domain, database, collection or bucket depending on
the document store, is often provided to allow the user to group documents together.
Users can query collections based on multiple attribute-value constraints. Document
stores are useful for storing different kinds of objects and for making queries on
attributes those objects share. Other notable examples are CouchDB, SimpleDB and
TerraStore.
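A query over such a collection can be sketched as a filter on attribute-value constraints. The `query` helper and the sample fields below are hypothetical, not the API of any specific document store:

```python
# Schema-less documents: fields vary per document and are not declared
# in a global schema. A collection query keeps the documents matching
# every given attribute-value constraint.
users = [
    {"name": "alice", "city": "Brussels", "age": 25},
    {"name": "bob", "city": "Brussels"},          # no "age" field at all
    {"name": "carol", "city": "Rome", "age": 30},
]

def query(collection, **constraints):
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in constraints.items())]

# Multiple constraints at once, something a plain key/value store
# cannot answer without application-side indexes:
brussels_25 = query(users, city="Brussels", age=25)
```

A real store would consult an index rather than scan the whole collection, but the queryable-attributes model is the same.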
2.1.3 Extensible Record Stores
These systems, also known as wide-column stores and probably motivated by
Google's success with BigTable, store extensible records. Extensible records are
hybrids between tuples, the simple rows of relational tables with predefined attribute
names, and documents, which have attribute names defined on a per-record basis.
Indeed, extensible record stores have families of attributes defined in a global schema,
but inside these families new attributes can be defined at run-time.
The extensible record store data model relies on rows and columns that can be
partitioned vertically and horizontally across nodes to ensure scalability. Rows are
split across nodes based on the primary key; usually they are grouped by key range
rather than randomly. Columns of a table are distributed across nodes based on
user-defined "column groups", regrouping attributes that are usually best stored
together on the same node. For instance, all attributes of an employee concerning his
address (address, city, country) will be placed in one column group, and all the
attributes concerning the means of contacting him (email, phone number, fax number)
will be stored in another column group.
Like document stores, extensible record stores are useful for storing different kinds
of objects and for making queries on shared attributes. Moreover, they can provide
higher throughput, at the cost of a bit more complexity for the programmer when
defining the column groups. Notable examples are HBase, Cassandra and HyperTable.
Figure 2.3: Data organisation in an extensible record store.
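The employee example above can be sketched with nested records. The group and field names are illustrative, not any particular store's schema:

```python
# Extensible records: the attribute families ("column groups") are fixed
# in a global schema, but the attributes inside a family are created at
# run-time, per record.
schema = {"employee": ["address_group", "contact_group"]}  # global schema

record = {
    "address_group": {"address": "1 Main St", "city": "Rome", "country": "IT"},
    "contact_group": {"email": "a@example.eu", "phone": "+32 2 000 0000"},
}

# A new attribute can appear inside an existing family without any
# schema change -- this is the "extensible" part:
record["contact_group"]["fax"] = "+32 2 000 0001"

# A store places each column group of a row on the node serving it, so
# the address fields are always fetched together in one access.
```

Vertical partitioning then means each family can live on a different node, while horizontal partitioning splits rows by primary-key range.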
2.1.4 Relational Databases
These systems store, index, and query tuples via the well-known SQL interface.
They offer less flexibility than document stores and extensible record stores because
tuples are fixed by a global schema defined during the design of the database.
Moreover, the classical relational database model is not well suited to scalability [29].
Several solutions to scale the database have been proposed [13], but they all suffer
from disadvantages. A classical solution is to use a master/slave approach to
distribute the work: the slaves handle the reads and the master server is responsible
for the writes. The first drawback is eventual consistency: each slave has its own copy
of the data, and even if we normally have near real-time replication, we do not have
the strong consistency that is sometimes needed. The second immediate drawback is
that the master server quickly becomes a bottleneck when the number of writes
increases.
Cluster computing solutions improve on this by using the same data for several
nodes, with only one node responsible for writing. They thus provide strong
consistency, but the bottleneck problem remains.
Finally, the shared-nothing architecture, introduced by Google [10], should scale to
an arbitrary number of nodes because each node shares nothing at all with the other
nodes. In this approach, each node is responsible for a different part of the database
and has its own memory, disk and CPU. To divide the database, an operation sometimes
called sharding, we split the tables into several non-overlapping tables and dispatch
these tables to different shards, which thus share nothing, so that the load
is divided between them. Usually the tables are cut horizontally. This
means that different rows are assigned to different shards according to a partition
criterion [39] based on the value of a primary key. The partition criterion can be
range partitioning (the shard is responsible for a range of keys), list partitioning
(the shard is responsible for a given list of keys) or hash partitioning (the hash
of the key determines the shard responsible for the key).
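To make the three partition criteria concrete, here is a small, hypothetical sketch (not tied to any particular database; the function names, shard counts and key formats are our own illustrative assumptions) of how a row's primary key could be mapped to a shard under each scheme:

```python
import hashlib

def hash_partition(key: str, n_shards: int) -> int:
    """Hash partitioning: the hash of the key determines the shard."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

def range_partition(key: int, boundaries: list[int]) -> int:
    """Range partitioning: shard i holds keys below boundaries[i]."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # last shard takes all remaining keys

def list_partition(key: str, shard_lists: dict[int, set]) -> int:
    """List partitioning: each shard owns an explicit list of keys."""
    for shard, keys in shard_lists.items():
        if key in keys:
            return shard
    raise KeyError(key)
```

Note that `hash_partition` uses a stable cryptographic hash so that every node computing the mapping agrees on it across runs.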
To achieve redundancy, each shard is replicated; in MySQL Cluster [20], for example,
each shard is replicated twice. But to implement this solution correctly, several
challenges have to be solved. In particular, how do we partition the data into multiple
non-overlapping shards with the load fairly divided between them? The answer to
this question is closely related to the application area. The splitting is natural if, for
example, the table to split contains data about American and European customers, but
in most cases it can be quite tricky.
Figure 2.4: In a relational database data can be subdivided and accessed via fixed fields.
These relational databases have demonstrated improved horizontal scalability, provided
that the operations do not span many nodes. While they are not as scalable as the
previously mentioned datastores, they might become so in the near future. The appeal
of relational databases is obvious: they have a well-established user base and community
support, which means that many tools already exist and are ready to be used with them.
Furthermore, they offer ACID properties, which generally makes life easier for the
programmer. Notable examples are MySQL Cluster, VoltDB, Clustrix, ScaleDB, ScaleBase and
NimbusDB.
2.2 Peer-to-peer systems
We also decided to take a close look at peer-to-peer (P2P) systems. They are an
interesting alternative to classical client/server systems because they allow a more
efficient use of resources such as bandwidth, CPU and memory. This is because every peer
is equivalent in the application and has a dual client/server role, and can therefore
serve content like a classical server, sharing the load between the members of the
network. Moreover, because of this dual role, the availability of the content increases
with the network size, which favours the scalability of the system, a property we are
very much interested in. P2P systems also have the crucial property that they have no
central point of failure, nor a central point of coordination, which often becomes a
bottleneck when the system needs to grow. These properties are extremely important
in distributed computing because they increase the robustness of the system as well as
its scalability.
There are three main categories of P2P systems [31], which vary according to their
topologies and their look-up performance. The first and oldest relies on a central index
maintaining a mapping between file references and the peers holding the files. This
index is managed by central servers that provide the look-up service. This contradicts
what we just said about peer equivalence and implies that this generation is not a true
peer-to-peer system. A peer wanting to access some file must first connect to this
server to find the peers responsible for the data, and can then connect directly to a
peer holding it. This is shown in Figure 2.5. This is the solution developed by
Napster, the famous file-sharing system.
Figure 2.5: P2P system relying on a central index to look up files: A) Searching Node(0) asks the Central Server (CS) where it can find a given file. B) The CS gives theaddress of node 3 to node 0. C) Node 0 retrieves the file directly from node 3.
The second category does not rely on any server to perform queries and has an
unstructured topology: the connections between the peers in the network are established
arbitrarily. In this category of P2P systems, there is no relation between a node and
the data for which it is responsible. It follows that the look-up mechanism must be a
flooding-like mechanism. In Gnutella, the flooding algorithm has a limited scope in
order to limit the number of messages exchanged. It can therefore happen that a value
present in the network is not found: a query may fail to reach the peer holding the
value because the flooding diameter was too small. This is illustrated in Figure 2.6.
Figure 2.6: P2P system using flooding to look up files: A) Searching Node (0) floods thenetwork with a request for a file B) A query reaches node 2 which hosts a correspondingfile and responds directly to 0. Note that if a query has a time to live of 2 and if nodes1, 2 and 3 host a corresponding file, only nodes 1 and 2 will respond to 0 as 3 is toofar away from 0.
In order to guarantee look-up consistency, the flooding diameter must be N, with N
being the number of peers in the network, but this does not scale to large systems.
To resolve this problem, the third generation of P2P systems moved from an unstructured
to a structured topology, drastically improving look-up performance. Distributed hash
tables (DHTs) are the most frequent abstraction used by P2P systems with a structured
topology. We take a closer look at them in the next section.
2.3 DHT
DHTs were designed to solve the look-up problem present in many P2P systems [3].
They provide the same operations to store and retrieve key/value pairs as a classical
hash table. A key is what identifies a value; the value is the data you want to
associate with this key. As an example, consider a movie named “Why DHTs are fun.avi”:
the key would logically be the title of the film and the value the file itself.
Each peer in a DHT system can handle key look-ups and key/value pair storing
requests, avoiding the bottleneck of central servers. Another problem addressed by
those systems is the partitioning of the responsibility for key/value pairs between the
peers. Each key/value pair and each peer has an identifier. The identifier domain can
be anything; taking the example of a Chord-like DHT, the identifier is an integer
between 0 and N, where N is a chosen parameter. Those identifiers are used to determine
which peers are responsible for which key/value pairs. Each peer is responsible for an
interval, computed from its identifier and those of the other peers in the network.
Taking the example of Chord again, a peer is responsible for all the identifiers between
its own identifier and the identifier of the next peer in the network (the latter
excluded). A peer stores all the key/value pairs with an identifier in its interval.
The identifiers are most often computed using a consistent hash function. Assuming
each peer has an associated IP address, its identifier is computed by applying this
function to its IP address. Some systems allow a peer to choose its identifier. The
identifier of a key/value pair is computed by taking the hash of its key. The use of a
consistent hash function to compute identifiers allows a roughly fair division of the
key space between peers, which is a crucial point for scalability. Moreover, this kind
of hash function has the advantage that adding a peer to the system does not cause many
identifiers to be remapped to other peers, which improves the elasticity of the system.
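The mechanism can be shown with a toy sketch (the ring size, addresses and key names are illustrative assumptions of ours): each peer owns the interval starting at its own identifier, so the peer responsible for a key is the one with the greatest identifier not exceeding the key's identifier, wrapping around the ring.

```python
import hashlib
from bisect import bisect_right

RING = 2 ** 16  # illustrative identifier space of size 2**16

def ident(name: str) -> int:
    """Consistent hash of a peer address or a key into the ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def responsible(peer_ids: list[int], key_id: int) -> int:
    """Peer whose interval [own id, next peer's id) contains key_id."""
    ring = sorted(peer_ids)
    i = bisect_right(ring, key_id) - 1
    return ring[i]  # i == -1 wraps around to the last peer

peers = [ident(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
owner = responsible(peers, ident("alice/tweet/1"))
# When a fourth peer joins, only the keys between its predecessor and
# itself change owner; every other key stays where it was.
```

The final comment is precisely the elasticity property mentioned above: a join or leave only remaps the keys of one interval.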
DHTs, as said in the point on peer-to-peer systems, are the third generation of
peer-to-peer systems. Compared to the previous generations, they mainly solve the
scalability problems of the look-up mechanism. Indeed, we now have a relation between
the key of a value and a peer, which permits better look-up performance by routing the
look-up request to the responsible peer instead of flooding the network, which was not
scalable.
2.4 Study of scalable key/value stores properties
Bwitter is built on top of a key/value datastore. Key/value datastores are systems
that implement a DHT and offer other services on top of it. We now compare possible
design choices when implementing systems offering a DHT abstraction. The comparison is
based on the following criteria: consistency model, replication strategy, storage
abstraction, network topology, churn, transactional support and finally security.
2.4.1 Network topology
The network topology refers to how peers are organized in the network; there may be
important differences between the various DHT implementations. It is also a crucial
design point because it deeply influences the performance of the look-up mechanism as
well as the fault tolerance of the network. We will take a look at some important
network topologies.
In Chord-like topologies [30], nodes are organized in a ring (see Figure 2.7) and
keep a list of successors and predecessors as well as a routing table, which is filled
with fingers chosen according to various policies. We call a finger a reference to
another peer in the system, usually the IP address of that peer. The size of the
routing table varies among systems: Chord keeps log2(N) fingers, where N is the number
of nodes in the system, while DKS, a generalization of Chord, keeps logk(N) fingers,
where k is a predefined constant. This is a trade-off between better look-up
performance and bigger routing tables. We summarize the most common choices in
Table 2.1. Each Chord node also keeps log2(N) successors in its successor list in
order to recover from node failures. This topology is widespread because it allows for
efficient routing as well as easy self-organization upon joins, leaves and failures.
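As an illustration of the finger policy just described, the following sketch (8-bit identifiers and a hand-picked peer list, both assumptions of ours) computes the finger targets of a Chord node and resolves each target to its successor on the ring:

```python
# Chord-style fingers: finger i of node n points to the first peer whose
# identifier is at or after (n + 2**i) mod 2**M, giving log-hop routing.
M = 8  # identifiers live in [0, 2**8)

def finger_targets(n: int) -> list[int]:
    return [(n + 2 ** i) % 2 ** M for i in range(M)]

def successor(peer_ids: list[int], target: int) -> int:
    """First peer at or after target, wrapping around the ring."""
    ring = sorted(peer_ids)
    for p in ring:
        if p >= target:
            return p
    return ring[0]

peers = [5, 40, 90, 160, 220]
fingers = [successor(peers, t) for t in finger_targets(5)]
# Node 5's fingers skip exponentially far around the ring.
```

Because the targets double in distance, a look-up can always halve the remaining distance to the responsible peer, which is where the log2(N)-hop bound comes from.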
The Beernet topology is similar to Chord but differs on one crucial point. In Chord,
nodes must be connected to their direct predecessor; in Beernet they only need to know
the key of their predecessor, creating a branch when a node cannot reach its direct
predecessor. This property is the reason why the topology of Beernet is called a
relaxed ring (see Figure 2.7): when a node does not have a link toward its predecessor,
the ring is not perfect. This topology is more resilient because it makes fewer
assumptions while preserving consistent look-up. You can find more information about
the Beernet topology in [19].
Scalaris currently relies on a Chord topology too. The Scalaris team is working on
another Chord-like topology called Chord#, which is very much like classic Chord except
that it stores keys in lexicographical order. Furthermore, the routing is done not in
the key space but in the node space. This enables range queries and allows the
application to choose where to place the data in the ring [32].
Number of fingers    Look-up performance
O(1)                 O(N)
O(log(N))            O(log(N)/log(log(N)))
O(log(N))            O(log(N)) (more common)
O(√N)                O(1)

Table 2.1: Number of fingers versus look-up performance for N nodes in the network.
Chord, like Beernet, does not take advantage of the underlying physical topology.
Pastry [28], Tapestry, and Kademlia [15] also assume a circular key space but try to
tackle this problem by keeping a list of nodes that they can reach with low latency.
They choose their fingers giving preference to nodes in that list.
We finally detail the topology of CAN [24] because it differs significantly from the
topology of the other DHTs. Nodes are organized so that they divide a virtual
d-dimensional Cartesian coordinate space, each node being responsible for a part of
this space. To join the network, a node, which we call A, chooses a random point in
the space. It then contacts the node responsible for this point, called B. Finally, B
splits its zone in two, giving A one half of the zone it was responsible for. Nodes
only maintain routes towards their immediate neighbours. In CAN, two nodes are
neighbours if their zones touch along d − 1 dimensions. To picture this, imagine a
square (2 dimensions) divided into rectangular chunks, which correspond to zones: two
nodes are neighbours if the rectangles they are responsible for have an edge in common.
According to the results in [24], for a d-dimensional space partitioned into n zones,
the average routing path length is (d/4)(n^(1/d)) hops and thus grows as O(n^(1/d)).
Observe that the average path length decreases as the number of dimensions increases,
but this comes at the cost of a higher space complexity for maintaining routing tables.
Moreover, each join and leave becomes more costly as the number of dimensions
increases: the number of neighbours of a node increases, and with it the complexity of
maintaining routing table consistency. As for Chord, the overlay topology is not
linked to the physical topology of the nodes. You can see an example of how nodes are
organized in CAN for d = 2 in Figure 2.7, where each rectangle represents a zone
controlled by a node.
Figure 2.7: From left to right, the ring overlay (CHORD) , the relaxed ring overlay(Beernet) and a 2-dimension CAN overlay.
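The CAN neighbour relation can be made concrete with a small sketch of our own (zones modelled as axis-aligned boxes): two zones are neighbours when their intervals overlap in d − 1 dimensions and merely touch in the remaining one.

```python
def are_neighbours(a, b):
    """a, b: one (low, high) interval per dimension of each zone."""
    overlaps = sum(1 for (al, ah), (bl, bh) in zip(a, b)
                   if al < bh and bl < ah)          # strict overlap
    abuts = sum(1 for (al, ah), (bl, bh) in zip(a, b)
                if ah == bl or bh == al)            # touching faces
    return overlaps == len(a) - 1 and abuts == 1

# Two zones of a 2-d space sharing a vertical edge are neighbours:
left = [(0.0, 0.5), (0.0, 1.0)]
right = [(0.5, 1.0), (0.0, 1.0)]
# Two zones touching only at a corner are not:
low_left = [(0.0, 0.5), (0.0, 0.5)]
up_right = [(0.5, 1.0), (0.5, 1.0)]
```

The corner case shows why the "d − 1 overlapping dimensions" condition matters: zones meeting only at a point share no face, so no route is kept between them.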
2.4.2 Storage abstraction
As mentioned before, key/value stores support all the operations provided by classical
hash tables on key/value pairs, namely look-up, store and delete. To be clear, a key
is uniquely associated with a value: storing another value under the same key erases
any previously stored value. Beernet, Redis [25] and OpenDHT [26] also allow working
with key/set pairs, where each key is associated with a set that can contain multiple
values; a look-up on such a key returns all the values in the set. OpenDHT only works
with key/set pairs, which leads to more complex algorithms for the applications using
it.
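The difference between the two abstractions can be shown with a toy in-memory model (our own illustration, not the API of any of the systems named above):

```python
from collections import defaultdict

class KeyValueStore:
    """Key/value abstraction: a put overwrites the previous value."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value      # silently erases the old value
    def get(self, key):
        return self._data.get(key)

class KeySetStore:
    """Key/set abstraction (as in OpenDHT): a put adds to the set."""
    def __init__(self):
        self._data = defaultdict(set)
    def put(self, key, value):
        self._data[key].add(value)   # values accumulate
    def get(self, key):
        return set(self._data[key])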
2.4.3 Replication strategy and consistency model
In order to provide redundancy, these systems often provide replication services.
These vary according to the guarantees they offer: improved reliability of the system
and/or availability. Replication is done by storing a value at k different nodes
instead of only one; k is called the replication factor.
Beernet and Scalaris offer symmetric replication with strong consistency using a
transactional layer built on top of their DHT implementation. Strong consistency
means that read operations always return the latest correctly written value; this is
achieved by always writing to and reading from a majority of the replica set. In
symmetric replication [12], each node identifier is associated with a set of (k − 1)
other node identifiers, which we call the replica set. When using replication, a
key/value pair is stored at the node responsible for the identifier of the key and at
all the nodes responsible for an identifier inside the replica set. Nodes maintain
routes toward the nodes with symmetric identifiers so that they can directly contact
any of the replicas of the key/value pairs they are responsible for. Strong
consistency between replicas does not come for free: each time a value is accessed, a
majority of the replicas must be contacted. In such a scheme, it is thus not possible
to increase the availability of the content through replication. We address this
problem in section 3.2.3. Beernet currently does not handle the restoration of the
replication factor when a node fails abruptly.
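The replica set of symmetric replication [12] can be computed directly. A minimal sketch, assuming an identifier space of size N with the replication factor k dividing N (the concrete numbers below are illustrative): the replicas of identifier x sit at the k points symmetric around the ring.

```python
def replica_set(x: int, n: int, k: int) -> list[int]:
    """Symmetric replication: replicas of identifier x in a ring of
    size n with replication factor k (k must divide n)."""
    step = n // k
    return [(x + i * step) % n for i in range(k)]

# With n = 16 and k = 4, key identifier 3 is replicated at 3, 7, 11, 15.
# The scheme is symmetric: every member of a replica set computes the
# same set, so any surviving replica can restore the replication factor.
```

This invertibility is exactly what the hash-based schemes discussed next lack.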
CAN does not have consistency problems because it works with immutable content,
meaning that values cannot be updated. This is a clear limitation when implementing
a social network where updates are frequent. CAN proposes replication through what it
calls realities. A node, when joining the network, joins r coordinate spaces and is in
charge of a different zone in each of them; each coordinate space is called a reality.
When a key/value pair is added, it is added in all the realities. Because the nodes
are in charge of different zones in different realities, different nodes are in charge
of the newly added pair. To create these realities, a different hash function is
applied to map the node to different coordinates in each reality. This strategy, like
every strategy relying on different hash functions, has two major drawbacks compared
to symmetric replication [12]. First, the inverse of the hash function is not
computable, so it is not possible to recover the original key before hashing, while
this is needed to fetch the value from the remaining replicas. Moreover, because of
the distribution properties of hash functions, and even if the inverse could be found,
the other replicas would be spread all over the remaining nodes. This forces the node
in charge of restoring the replication factor to contact a multitude of nodes. In
conclusion, because we cannot invert the hash function, the replication degree of
pairs decreases at each node failure.
Pastry uses a different approach based on leaf sets, which is close to the successor
set approach. Like CAN, Pastry assumes that values are immutable, so there is no
consistency problem between the replicas, but this comes at the cost of not being able
to update values. Pastry stores the replicas at the nodes whose identifiers are
closest to the value's key: if the replication factor is k, there are k/2 replicas
before and k/2 after the key. In the successor set approach, all the replicas are
stored at the k successors of the key. Both strategies make it possible to maintain
the replication factor because the other replicas can be found, contrary to CAN's
strategy. But the algorithms to maintain the replication factor are expensive compared
to the cheap symmetric replication strategy [12].
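As a toy illustration of leaf-set placement (ignoring ring wrap-around for simplicity, with made-up identifiers), the k replicas of a key go to the k nodes numerically closest to it:

```python
def leaf_set_replicas(node_ids: list[int], key: int, k: int) -> list[int]:
    """The k nodes whose identifiers are numerically closest to the key,
    nearest first (roughly k/2 below the key and k/2 above it)."""
    return sorted(node_ids, key=lambda n: abs(n - key))[:k]

# With k = 4, the replicas of key 31 land on the two nodes just below it
# and the two nodes just above it:
# leaf_set_replicas([10, 20, 30, 40, 50], 31, 4) -> [30, 40, 20, 50]
```

Because the replicas cluster around the key, a surviving neighbour can locate them all, which is what makes restoring the replication factor possible here, unlike in CAN.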
2.4.4 Transactions
While not many key/value datastores offer transactions, they are a crucial feature.
“A transaction is a group of operations that have the following properties: atomic,
consistent, isolated, and durable (ACID)” 1. A transaction can have two outcomes:
abort or commit. When a transaction commits, we can be sure that all the operations
inside the transaction have been performed successfully. On the other hand, if a
transaction aborts, we know none of the operations have been performed. We know of
only two key/value datastores that implement transactions: Beernet and Scalaris.
Transactions are usually achieved using a two-phase commit (2PC) algorithm. The two
phases are the validation phase and the write phase. Both phases are supervised by a
Transaction Manager (TM), while all the nodes responsible for the involved items
become Transaction Participants (TPs). During the validation phase, the TM tries to
lock the involved resources on every TP. If the TM receives an abort message, the
operation is aborted. Otherwise, the TM sends a commit message to all the TPs, making
the update permanent and releasing the locks.
Figure 2.8: Two-Phase Commit protocol (left) reaching termination and (right) notreaching termination, image taken from [19].
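The two phases can be sketched in a few lines. This is a deliberately simplified, failure-free model of our own (not Beernet's or Scalaris's code); real systems must also survive TM and TP crashes, which is exactly the problem discussed next.

```python
class Participant:
    """A TP holding one item involved in the transaction."""
    def __init__(self):
        self.locked = False
        self.committed = False
    def prepare(self) -> bool:       # validation phase: vote
        if self.locked:
            return False             # item already locked: vote abort
        self.locked = True
        return True
    def commit(self):                # write phase
        self.committed = True
        self.locked = False
    def abort(self):
        self.locked = False

def two_phase_commit(tps: list) -> str:
    """One abort vote aborts everything; otherwise all TPs commit."""
    if all(tp.prepare() for tp in tps):
        for tp in tps:
            tp.commit()
        return "commit"
    for tp in tps:
        tp.abort()
    return "abort"
```

Note that if the coordinator running `two_phase_commit` died between the two phases, every prepared participant would stay locked forever, which is the failure scenario of Figure 2.8.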
A serious problem arises if the TM fails during this operation: the locks would never
be released, as you can see in Figure 2.8. This is why some systems, such as Beernet
and Scalaris, decided to add replicated Transaction Managers (RTMs) that can take over
in case the TM fails. This transaction algorithm is based on the Paxos consensus
algorithm.
Beernet adds a phase to the 2PC algorithm before registering the locks. In the first
phase, the client, who is the original TM, performs its read and write operations
without taking any locks. In a second phase, and before committing the transaction, it
registers with a set of replicated transaction managers that can, as said before, take
over the transaction if the main TM fails. It then runs the prepare phase of the 2PC,
sending a message to all the TPs in order to take the locks on the required items. The
TPs send their votes to each of the RTMs, which then send their results to the main
TM. The TM can decide to commit or abort the transaction once a majority of the RTMs
have voted the same way. When the TM has taken its decision, it sends a final message
to the TPs so that they can release the locks.
1 MSDN, What is a transaction? http://msdn.microsoft.com/en-us/library/aa366402(VS.85).aspx, last accessed 13/08/2011.
This algorithm is said to be eager because modifications are done optimistically,
before any locks are requested, in the read phase. It makes the assumption that the
majority of the TPs and the TM survive the transaction. You can find more details
about this algorithm in Jim Gray and Leslie Lamport's article “Consensus on
Transaction Commit” [14].
2.4.5 Churn
We define the churn as John Buford, Heather Yu and Eng K. did it in “P2P Net-
working and Applications” [5]. The churn is ’the arrival and departure of peers to and
from the overlay, which changes the peer population of the overlay’. This is important
for DHTs who want to have good elastic properties to handle high rate of churn.
Let us take a look at how a classical Chord-like DHT handles joining nodes. As in
any peer-to-peer network, a joining node needs to know how to contact a node already
in the network. It first contacts this node, which routes it toward the node
responsible for inserting it in the ring. This last node is the successor of the
joining node in the ring, i.e. the node whose identifier follows the identifier of the
joining node. There are then two steps to perform to enter the ring: contact the
successor to warn it that its predecessor has changed, and contact the predecessor to
warn it that the joining node is its new successor. This is not robust, as a failure
of one of the nodes, or a networking problem that prevents the new node from reaching
its predecessor, can create a broken ring. Beernet solves this problem with its
relaxed ring, as explained when discussing the network topology in section 2.4.1. It
adds a phase to this protocol during which the joining node signals to the successor
that it has correctly contacted the predecessor. The successor can then remove its
pointer to the old predecessor, as you can see in Figure 2.9. This algorithm therefore
maintains look-up consistency and tolerates network failures. After joining the ring,
the new node has to retrieve the key/value pairs it is responsible for. It can do so
by contacting its successor, which was in charge of those values before.
When a node wants to leave the ring, the opposite operations are performed. If it is
a gentle leave, the node sends its data to the nodes now responsible for the values it
hosted, and it tells its neighbours to update their pointers. However, if it is an
abrupt leave, the other nodes have to detect the absence of the node and execute more
complex algorithms to find the remaining nodes responsible for the data the missing
node hosted. This operation varies a lot according to the replication strategy, as
explained in the point on replication strategies. In any case, it is a heavy and
complex operation that should be avoided, if possible, by leaving the network gently.
Figure 2.9: The join algorithm: A) Q contacts the successor R. B) R accepts the insertion and replies with P's address; R now considers Q as predecessor but keeps P in its predecessor list; Q contacts the predecessor P. C) Q tells P he is the new successor and P accepts it. D) Q tells R the insertion was successful and R drops P from its predecessor list. Image taken from [19].
It is thus clear that, while those mechanisms ensure the survivability of the system
in an environment where nodes can fail or disconnect abruptly, performance will be
better if the nodes leave gently.
2.4.6 Security
There are numerous known attacks against DHT-based systems [34]. Many DHTs are able
to work under the assumption that the number of malicious nodes stays lower than a
certain fraction f of the total number of nodes. In a Sybil attack, a malicious user
inserts many malicious nodes into the system in order to exceed that limit. Once the
attacker has enough malicious nodes in the system, it can easily interfere with the
routing and replication algorithms. In an Eclipse attack, a malicious node can
“eclipse” a correct node by manipulating the neighbours that point to that node so
that they skip it, meaning no one can access it anymore. These attacks can lead to
routing and storage disruption if malicious nodes work together to deny requests or to
return different values than the ones expected.
Ensuring security in such systems when they run in open environments such as the
Internet is thus a tough challenge. Note that these attacks are only possible if the
DHT accepts nodes from untrusted users. While most of the DHT-based systems we know
do not currently provide such a security level, we have good reasons to believe these
issues are being worked on. Still, we need to keep them in mind when designing our
architecture.
If a malicious user has access to the datastore, he can also try to delete, edit, or
forge data, causing damage to the application using that data. Such attacks are
generally avoided by using capability-based security. The idea is that if the attacker
does not know where to look, he will not be able to find the data, as it is stored at
unguessable keys.
OpenDHT goes even further and offers a secret mechanism, allowing users to associate
a secret with a given value. Anyone who wants to delete that value has to provide that
secret. Note that in OpenDHT you cannot replace one value by another, as multiple
values can be stored at a given key: doing a put on a key only adds the value to the
set.
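Both ideas, unguessable keys and delete secrets, can be illustrated with a toy in-memory store. The function names are ours and this is not OpenDHT's actual interface, just a sketch of the principle:

```python
import hashlib
import secrets

def new_capability() -> str:
    """A random, unguessable key: knowing it IS the access right."""
    return secrets.token_hex(20)        # 160 unguessable bits

def store(dht: dict, capability: str, value: str, secret: str):
    """Store the value with a digest of the deletion secret."""
    dht[capability] = (value, hashlib.sha256(secret.encode()).hexdigest())

def delete(dht: dict, capability: str, secret: str) -> bool:
    """Delete only succeeds when the right secret is presented."""
    value, digest = dht.get(capability, (None, None))
    if digest == hashlib.sha256(secret.encode()).hexdigest():
        del dht[capability]
        return True
    return False                        # wrong secret: refused
```

Storing only a digest of the secret means that even a node hosting the pair cannot forge a deletion request for it.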
2.5 The Cloud
Bwitter is intended to run on the cloud in order to take advantage of its scalable
and elastic nature. Everyone has heard about the cloud, but many different definitions
exist. We therefore state here the definition of the cloud that we are going to use
throughout this work, namely the National Institute of Standards and Technology (NIST)
definition of cloud computing [22]:
“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network
access to a shared pool of configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction. This cloud model promotes
availability and is composed of five essential characteristics, three service models,
and four deployment models.”
The five essential characteristics mentioned are on-demand self-service, broad
network access, resource pooling, rapid elasticity and measured service. On-demand
self-service means that users can adjust the amount of resources whenever they need
to, without having to go through a service provider's employee. Broad network access
means that the resources can be accessed through a broad range of mechanisms and
devices. Resource pooling means that the provider's resources can be assigned and
reassigned to different clients dynamically in order to meet the clients' requirements
in the most effective way. Rapid elasticity means that resources can be allocated and
removed transparently in order to match the amount of resources the client needs.
Measured service means that the resources provided are monitored and can be recorded
transparently.
The three service models are Cloud Software, Cloud Platform and Cloud Infrastructure
as a Service (SaaS, PaaS, IaaS). In the SaaS case, the client has access to a software
application running on the cloud, but has no access to the underlying cloud
infrastructure; the application can usually be accessed via a web browser. In the PaaS
case, the client is able to deploy applications on the cloud infrastructure and manage
them, but does not manage the infrastructure itself. Finally, in the IaaS case, the
client can manage the basic resources such as network, processing power and storage,
and can furthermore deploy and manage applications. These models are compared in
Figure 2.10.
Figure 2.10: The three service models compared to a classic model, image taken from [8].
The four deployment models are private, community, public and hybrid cloud. A
private cloud is owned by an organisation and used only by it, unlike a community
cloud, which is shared between a few selected organisations. These solutions may
provide better privacy than a public cloud, maintained by an organisation that sells
its services to end users or other organisations. A hybrid cloud is a combination of
at least two clouds that remain distinct entities but are bound together in order to
allow data and application portability.
2.6 Conclusion
In this chapter we have explored the different types of scalable datastores. We
studied DHTs in more depth, particularly those offering transactions. The
technological advancements in those fields make it possible to build an efficient and
robust implementation of a Twitter-like system on top of a peer-to-peer system, taking
advantage of its assumed scalability and elasticity properties. In the next chapter we
describe two possible architectures for Bwitter.
Chapter 3
The Architecture
In this section we present the architecture of our application. The platform on which
an application is based can have an important impact on its architecture. We thus
explore the repercussions of having an application running either on a peer-to-peer
network based on the users' machines, or on a stable cloud-based platform. The two
solutions lead to two radically different architectures in terms of performance and
accessibility of the different layers, as well as in terms of security concerns.
But before developing the architecture, we take a look at the functional and
non-functional requirements of our application.
3.1 The requirements
Bwitter is designed to be a secure social network based on Twitter, and while it
looks relatively simple at first sight, it hides some complex functionality. We
included almost all of Twitter's functionalities in Bwitter and decided to add some
others. We describe the relevant functionalities that will help us analyse the design
of the system, highlight the differences between a centralised and a decentralised
architecture, study the feasibility of overcoming the problems described above and
test the system's behaviour when faced with heavy traffic and flash crowds.
3.1.1 Non-Functional requirements
Product requirements
• Scalability:
We are facing a system that is continuously growing in terms of users [4] but
also in terms of traffic [33]. It is thus crucial that our system's performance
increases almost linearly with the number of machines we allocate to it; this is
known as horizontal scalability. We are not interested here in vertical scalability,
i.e. adding or removing resources (CPU, RAM, disk) from an individual machine, as it
is harder to achieve dynamically and usually more costly than horizontal scaling.
• Elasticity:
As we explained, the load that social network applications must handle varies in
real time for social reasons. They must sometimes face high peaks of demand for short
periods, but do not need the corresponding amount of resources the rest of the time.
A fixed number of nodes is therefore inefficient: to handle peaks of load, you would
have to over-provision the number of nodes in your data center. This is why our
system needs to be able to scale up when demand is high and to scale down easily when
the peak is over, to avoid wasting resources.
• Fault tolerance, availability and integrity:
The system has to be fault tolerant: even if some machines in the system fail, the
whole system must still be able to function. The integrity of the data and the
availability of the service also have to be ensured, as they are major requirements
of every social network.
• Security:
Bwitter must ensure authenticity, integrity and confidentiality of the data posted by users over the whole system. No malicious user should be able to forge, edit or delete data in the system. Finally, Bwitter must forbid access to confidential data such as passwords. These requirements must hold even with Bwitter's code released as open source.
• Lightness of the application:
The end user should only need a fast and light interface performing little calculation. The goal is to be as portable as possible, so that smartphones and other devices with less computing power can also use our application. This implies that the heavy calculations should be done on the server side.
• Performance:
We need good performance for many small reads and writes. Indeed, small values are frequently read, written and updated in social network applications.
Organizational requirements
• Modularity:
Our project should be built from distinct modules, and it should be possible to easily replace one layer with another based on clearly defined interfaces. For instance, the graphical user interface (GUI) module could be desktop based or web based and the main application should not see any difference.
• Open source:
We want our project to be released in the wild with its source code available for
anyone wanting to experiment with it. This also means that the libraries we use
in the development of our system should be open source.
• Use existing technologies:
We do not want to re-invent everything on our own so we decided to use already
developed open source tools during our development.
3.1.2 Functional requirements
Nomenclature
There are only a few core concepts on which our application is based:
• A tweet is basically a short message with additional meta information. It contains a message of up to 140 characters, the author's username and a timestamp of when it was posted. If the tweet is part of a discussion, it keeps a reference to the tweet it answers and also keeps references to the tweets that are replies to it.
• A user is anybody who has registered in the system. A few pieces of information
about the user are kept in the datastore, such as his complete name and the MD5
hash of his password, used for authentication.
• A line is a collection of tweets and users. The owner of the line can define which users he wants to associate with the line. The tweets posted by those users are from then on displayed in this line. This allows a user to have several lines, each with its own theme and associated users.
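To fix ideas, the three concepts above can be rendered as a small data model. The following sketch is illustrative only: the field names are our own assumptions, and in Bwitter these objects are stored as key/value pairs in the datastore rather than as in-memory classes.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tweet:
    author: str                 # author's username
    message: str                # the message, up to 140 characters
    timestamp: float            # when it was posted
    reply_to: Optional[str] = None                     # ref to the tweet it answers
    replies: List[str] = field(default_factory=list)   # refs to reply tweets

    def __post_init__(self):
        if len(self.message) > 140:
            raise ValueError("message exceeds 140 characters")

@dataclass
class User:
    username: str
    full_name: str
    password_md5: str           # MD5 hash of the password, used for authentication

    @classmethod
    def register(cls, username: str, full_name: str, password: str) -> "User":
        return cls(username, full_name,
                   hashlib.md5(password.encode("utf-8")).hexdigest())

@dataclass
class Line:
    owner: str                                          # the line's owner
    name: str                                           # custom line name
    members: List[str] = field(default_factory=list)    # associated users
    tweet_refs: List[str] = field(default_factory=list) # references, not tweets
```

Note that lines hold references to tweets, not the tweets themselves; the reason for this choice is explained in Section 4.2.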
Basic operations
Many different social networks exist today, and while they each have their own particularities, a few core operations to share, publish or discuss content are almost always present. Based on our own use of social networks and on Twitter's functionality, we identified a restricted number of operations our social network had to be capable of.
• Post a tweet:
A user can publish a message by posting a tweet. The application posts the tweet
in the lines to which the user is associated. This way all the users following him
have the tweet displayed in their line.
• Retweet a tweet:
When a user likes a tweet from another user he can decide to share it by retweeting
it. This has the effect of “sending” the retweet to all the lines to which the user
is associated. The retweet is displayed in the lines as if the original author posted
it, but with the retweeter’s name indicated.
• Reply to a tweet:
A user can decide to reply to a tweet. This adds a reference to the reply inside the initial tweet. Additionally, a reply keeps a reference to the tweet to which it responds. This makes it possible to build the whole conversation tree.
• Create a line / a list:
A user can create additional lines / lists with custom names to regroup specific
users / tweets.
• Add and remove users from a line:
A user can associate a new user to a line; from then on, all the tweets this newly added user posts will be included in the line. A user can also remove a user from a line: he will not see that user's tweets in his line anymore and will not receive his new tweets either. Note that if a user re-adds a previously removed user, the tweets posted while the removed user was still associated to the line will re-appear.
• Add and remove a tweet from a list:
A user can store a new tweet into a list to be able to retrieve it later easily. The
user can also decide later to remove this tweet from the list.
• Read tweets:
A user can read the tweets from a line in packs. The size of those packs is a parameter; for example, we can decide to retrieve the tweets in packs of 20. He can also refresh the tweets of a line or a list to retrieve the tweets that have been posted since his last refresh.
3.1.3 Conclusion
We have just presented the requirements of our application as well as its functionalities. The most important requirements are scalability, elasticity, availability and security. The next section details two different possible architectures we elaborated based on the presented requirements.
3.2 Architecture
As previously mentioned, we now present two different scalable architectures for our application. In both architectures, our application is decomposed into three loosely coupled layers, as we can see in Figure 3.1. From top to bottom: the Graphical User Interface (GUI); Bwitter, which handles the operations described in section 3.1.2; and the scalable datastore. The datastore is distributed amongst multiple nodes that we call datastore nodes. In the next chapter, we present Beernet and Scalaris, the two datastores that we have used.
Figure 3.1: Comparison of the architectures. Left: cloud-based architecture. Right: open peer-to-peer architecture.
This architecture is very modular: each layer can be changed as long as it respects the API of the layer above. We now have to decide where the datastore will run. We have two options: either let the datastore nodes run on the users' machines, or run them on the cloud, leading to two radically different architectures: the open peer-to-peer architecture and the cloud-based architecture.
In both architectures we try to achieve a secure solution, as building an insecure application would not be realistic. Indeed, if a malicious user could reveal personal information or steal someone's identity, our application would be both pointless and dangerous. We finally compare the two architectures based on the requirements we elaborated in the previous section.
3.2.1 Open peer-to-peer architecture
In a fully decentralised architecture, the user runs a datastore node and the Bwitter application on his machine. The Bwitter application makes requests directly to this local datastore node. Ideally this local datastore node should not be restricted to the Bwitter application, but should also be accessible to other applications. The problem with this approach is that the user can bypass protection mechanisms enforced at a higher level by accessing the datastore's low-level functions. Usually this is not a problem, as untrusted users do not know at which key the data is stored, so they cannot compromise it. But in our case, the data has to be at known keys so that the application can dynamically retrieve it. This means that any user understanding how our application works would be able to delete, edit or forge lines, users, tweets and references. This would be a security nightmare.
We tried to tackle this problem with the secret mechanism we designed to enrich Beernet's interface, which is presented later. But while it prevents users from editing or deleting data they did not create themselves, we could not prevent them from forging elements. To avoid this we need a way to authenticate every piece of data posted by a user.
This could be done by enforcing authentication at the datastore level, but this is a feature that is not always provided. We could also do it at the application layer. Indeed, assuming that each user has a public/private key pair, we could authenticate all the data posted using asymmetric cryptography. However, this would require a cryptographic operation for each read and write operation. It would also force users to store their private and public keys either on the datastore, or on their local machine, or a mix of both. A possible solution would be to have users store their public key in the datastore at a public location, so anyone needing the public key can retrieve it easily. The private key of a user is stored at a private location that only he can find back, for example using a key that is the hash of his password concatenated with his username. Additionally, a sealed local cache could be maintained on the user's machine containing his private key and the public keys of all the users with whom he has contacts. This cache is useful to avoid constantly reloading all the needed keys each time the user wants to use the application. Furthermore, public keys are values that seldom change. If a cryptographic problem is encountered while using a key from the cache, the key is reloaded from the datastore in order to avoid problems due to cache corruption or a public key changed by its owner.
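The private-key location scheme sketched above can be made concrete as follows. The hash function (SHA-256) and the key prefixes are our own illustrative choices; the scheme only requires that the location be derivable from the username and password alone.

```python
import hashlib

def public_key_location(username: str) -> str:
    # Public keys live at a well-known location derivable from the username,
    # so anyone needing a public key can retrieve it easily.
    return "pubkey:" + username

def private_key_location(username: str, password: str) -> str:
    # Only the owner, who knows the password, can recompute this key,
    # so the private key's storage location stays hidden from other users.
    digest = hashlib.sha256((password + username).encode("utf-8")).hexdigest()
    return "privkey:" + digest
```

Note that hiding the location alone would not suffice in an open datastore: the stored private key should additionally be encrypted, for example with a key derived from the password.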
Even with those mechanisms in place, we have to enforce security at the datastore level. Beernet uses encryption to communicate between nodes to avoid leaking confidential information. But anyone could add modified Beernet nodes behaving maliciously. Aside from the usual attacks presented in our state-of-the-art, a corrupted node could be modified to reveal all the secrets inside the requests going through it. Scalaris faces the same problem, as its code is widely available too. We thus have to make sure that the code running the datastore node is not modified, so we need a mechanism that enforces remote attestation as described in [38]. This can be done by using a Trusted Platform Module (TPM) [37], which provides cryptographic code signature in hardware, on the users' machines, in order to prove to other datastore nodes that the client's node is trustworthy. Until a datastore node has a way to tell for sure that it can trust another datastore node, we are in a dead end. This is especially true for Beernet's new secret mechanism described in section 4.4.1, as anyone stealing the secret of another user can erase any data posted by that user.
Assuming that a Twitter session is short, there could be a problem if our application is the only one running on top of our datastore. Indeed, it would result in nodes frequently joining and leaving the network with short connection times. Each of those changes in the topology of our datastore modifies the keys for which the nodes are responsible, triggering key/value pair reallocations and leading to important and undesirable churn. This would not be an ideal environment for a DHT. Furthermore, as we saw in the state-of-the-art, DHT based datastores, such as Beernet and Scalaris, are still exposed to attacks such as Sybil and Eclipse attacks if they accept malicious nodes.
In our requirements we stated that the system has to be fault tolerant and that the integrity of the data must be preserved. The integrity of the data is guaranteed thanks to replication at the datastore level. Because this environment is not stable, we need a higher replication factor than usual. The impact is twofold. First, peers are responsible for more keys, worsening the already important churn. Secondly, each transaction involves more peers, which degrades the overall performance of the system.
In conclusion, this solution has the advantage of providing free computing power that automatically grows with the number of users. But scalability, elasticity and security are compromised due to the lack of control over the machines and the difficulty of controlling direct access to the datastore by users. We now take a look at the alternative architecture based on the cloud.
3.2.2 Cloud Based architecture
With this architecture the Bwitter and datastore nodes run on a cloud platform. A Bwitter node is a machine running Bwitter but generally also a datastore node. This solution offers good elastic properties assuming we have an efficient cloud service, meaning that we can quickly obtain machines ready for use. We can thus add or remove Bwitter and datastore nodes to meet the demand, optimising our use of the machines. This solution also allows us to keep a stable DHT, as nodes are not subject to high churn as they were in the first architecture we presented. Hence, a lower replication factor is acceptable, which should boost performance. Moreover, communications between nodes should be much quicker in a cloud infrastructure than between nodes spread over the world, which in turn increases performance. Finally, all the nodes are managed by us; there are thus no Eclipse or Sybil attacks possible in this case.
Using this solution we do not have all the security issues we had with the open peer-to-peer architecture. Indeed, the users do not have direct access to the datastore nodes anymore, but have to go through a Bwitter node, which limits their possible actions to the operations defined in section 3.1.2. Furthermore, the communication channel between the GUI and the Bwitter nodes can guarantee the authenticity of the server and the encryption of data being transmitted, for instance using HTTPS. Bwitter requires users to be authenticated to modify their data, thereby providing data integrity and authenticity. For instance, Bwitter does not permit a user to delete a tweet that he did not post, or to post a tweet using someone else's username. The malicious revelation of user secrets by a corrupted node is not relevant anymore, as the datastore is fully under our control.
The cloud based architecture is more secure, more stable and offers obvious advantages for scalability and elasticity. This is why we have finally chosen to implement this solution. We now take a closer look at how the layer stack is built.
The lowest layer, the datastore, runs on the cloud and is hidden from the outside, which means no user can access it directly; all the attacks targeting the datastore are thus avoided. Indeed, all accesses to the datastore are done via Bwitter. This layer is monitored in order to detect overload and, taking advantage of the cloud, datastore nodes are added and removed on the fly to meet the demand.
The intermediate layer, Bwitter, also runs on the cloud and communicates with the datastore nodes and the GUIs. A Bwitter node is connected to several datastore nodes. Bwitter nodes have an internal load balancer that dispatches work fairly over the datastore nodes. This load balancer is the Scalaris Connection Manager (SCM) that we present in the implementation section at 5.3. In practice, the Bwitter nodes are not accessible directly; they are accessed through a fast and transparent reverse proxy that splits the load between Bwitter nodes. We also designed a module that runs in parallel with the SCM and that we call the Node Manager (NM). It is responsible for bootstrapping the ring as well as adding nodes if needed. However, we do not have any module responsible for deciding whether a new node should be launched.
The Bwitter nodes offer a REST-like (Representational State Transfer [27]) API to the higher layer. This means, among other things, that they are completely stateless, which is important because it improves the clarity of the code and makes it easier to produce bug-free code. Being stateless means that the application does not have to keep information for each client. It can thus more easily scale with the number of clients, and requests from the same client can be dispatched to different nodes, suppressing the burden of managing sessions.
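Statelessness can be illustrated with a minimal request handler: every request carries everything needed to process it (here, credentials plus parameters), so any Bwitter node can serve any request without shared session state. The handler below is a hypothetical sketch with an in-memory stand-in for the datastore, not Bwitter's actual API.

```python
import hashlib

USERS = {"alice": hashlib.md5(b"s3cret").hexdigest()}  # username -> password hash
TWEETS = []                                            # stand-in for the datastore

def handle_post_tweet(request: dict) -> dict:
    """Stateless handler: no session is kept between calls; the request
    itself carries the credentials, so any node can process it."""
    user, password = request["user"], request["password"]
    if USERS.get(user) != hashlib.md5(password.encode()).hexdigest():
        return {"status": 401}                         # authentication failed
    TWEETS.append({"author": user, "message": request["message"]})
    return {"status": 201, "tweet_id": len(TWEETS) - 1}
```

Because the handler consults no per-client state, two successive requests from the same user can be served by two different Bwitter nodes behind the reverse proxy.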
Some values are frequently accessed in a social network, so a caching system is crucial to achieve decent performance. We thus decided to add a cache at this level in order to reduce the load on the datastore. We go into more detail about the cache in the next section. A similar cache mechanism in the decentralised architecture would not be useful. Indeed, the advantage of the cache is that it contains values that are likely to be accessed by several users; if there is only one user accessing it, the gain will probably be very small.
The top layer is the GUI. It connects to a Bwitter node using a secure connection channel that guarantees the authenticity of the Bwitter node and encrypts all the communications between them. Multiple GUI modules can, of course, connect to the same Bwitter node. The GUI layer is the only one running on the client machine.
3.2.3 The popular value problem
Describing the problem
Given the properties of our datastores, both based on DHTs, a key/value pair is mapped to f nodes, where f is the replication factor chosen according to the desired redundancy level. This implies that if a key is frequently requested, the nodes responsible for it can be overloaded while the rest of the network is mostly idle, and adding additional machines is not going to improve the situation. It is not uncommon on Twitter to have wildly popular tweets that are retweeted by thousands of users. In the worst cases, retweets can be seen as an exponential phenomenon, as all the users following the retweeter are likely to retweet it too.
The solution: use an application cache
Adding nodes does not solve the problem because the number of nodes responsible for a key/value pair does not change. In order to reduce the number of requests reaching those nodes, we decided to add a cache with a Least Recently Used (LRU) replacement strategy at the application level.
This cache keeps the last values read. With each key/value pair in the cache we keep a timestamp indicating the last time the value was read. When we face a cache miss, we evict from the cache the pair that has the oldest timestamp.
This solves the retweet problem because the application now holds the tweet in its cache from the first request to read it. The tweet stays in the cache because users frequently request it. This way we reduce the load on the nodes responsible for the tweet and automatically increase the availability of popular values.
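The eviction policy just described can be sketched as follows. The capacity is an illustrative parameter, a logical clock stands in for the read timestamps, and the notification hook corresponds to the refresh mechanism for mutable values discussed next.

```python
class LRUCache:
    """Application-level cache: on a miss with a full cache, the pair
    with the oldest read timestamp is evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.values = {}      # key -> cached value
        self.last_read = {}   # key -> logical timestamp of the last read
        self.clock = 0        # logical clock standing in for wall-clock time

    def get(self, key, load_from_datastore):
        self.clock += 1
        if key in self.values:                    # cache hit
            self.last_read[key] = self.clock
            return self.values[key]
        if len(self.values) >= self.capacity:     # miss on a full cache: evict
            oldest = min(self.last_read, key=self.last_read.get)
            del self.values[oldest]
            del self.last_read[oldest]
        value = load_from_datastore(key)          # fetch from the datastore
        self.values[key] = value
        self.last_read[key] = self.clock
        return value

    def notify_update(self, key, new_value):
        # Datastore notification: refresh our replica if we hold the pair.
        if key in self.values:
            self.values[key] = new_value
```

A popular tweet, read over and over, always carries a recent timestamp and therefore never becomes the eviction candidate.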
We have to take into account that values are not immutable: they can be deleted and modified. It is thus necessary to have a mechanism to "refresh" the values inside the cache. A naive solution would be to actively poll the datastore to detect changes to the key/value pairs stored in the cache. This would be quite inefficient, as several kinds of values, like tweets, almost never change. In order to avoid polling, we need a mechanism that warns us when a change is made to a key/value pair stored in the cache. The datastore must thus allow an application to register to a key/value pair and to receive a notification when its value is updated. Our application cache thus registers to each key/value pair that it actually holds, and when it receives a notification from the datastore indicating that a pair has been updated, it updates its corresponding replica. This mechanism has the big advantage of removing unnecessary polling requests. Notifications are asynchronous, so the replicas in the cache can have different values at a given moment, leading to an eventual consistency model for the reads. It is still possible to bypass the cache if strong consistency is needed, but this is application dependent. On the other hand, writes do not go through the cache but directly to the datastore, which keeps the writes strongly consistent inside the datastore. This is an acceptable trade-off, as we do not need strong consistency for most of the reads in Bwitter. For example, it is not a problem to see a deleted tweet in the line of a user for a small period of time.
Beernet, as described in [19], offers such a notification mechanism, making it possible to design an efficient eventually consistent cache. Scalaris however does not provide such a feature, so we needed another solution to avoid active polling. We decided to use a time to live of one minute for the values in the cache, meaning that one minute after being read for the first time, the value is removed from the cache. This way any value read from the cache is at most one minute out of date, which is not a problem.
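The time-to-live variant for Scalaris can be layered on the same idea. A minimal sketch (the one-minute TTL is the value chosen above; the `now` parameter is only there to make the expiry testable):

```python
import time

class TTLCache:
    """Cache without datastore notifications: each entry expires a fixed
    time after it was stored, bounding how stale a cached read can be."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.values = {}   # key -> (value, time the value was stored)

    def get(self, key, load_from_datastore, now=None):
        now = time.monotonic() if now is None else now
        if key in self.values:
            value, stored_at = self.values[key]
            if now - stored_at < self.ttl:
                return value          # still fresh: at most ttl out of date
            del self.values[key]      # expired: drop and reload below
        value = load_from_datastore(key)
        self.values[key] = (value, now)
        return value
```
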
3.2.4 Conclusion
We have presented two different possible architectures: the open peer-to-peer and
the cloud based architecture. We summarize in Table 3.1 the differences between the
two solutions.
Security
  Open peer-to-peer: no control on the DHT, leading to numerous security flaws.
  Cloud based: full control on the DHT, which is hidden from users; the attack surface is much smaller.
DHT control and stability
  Open peer-to-peer: high, uncontrollable and undesirable churn; connections between nodes can be really bad.
  Cloud based: much stabler environment and possible control on the number of nodes in order to scale up and down.
Costs
  Open peer-to-peer: costs are supported by users (maintenance of a DHT node).
  Cloud based: high costs but directly proportional to the resources needed.
Performance
  Open peer-to-peer: number of nodes normally proportional to the number of users, but the "quality" of the nodes is uncertain.
  Cloud based: nodes are well connected and the cloud guarantees their performance; control allows optimisation.
Cache
  Open peer-to-peer: no possible performance improvement using a cache.
  Cloud based: high potential for performance increases using a cache.
Table 3.1: Comparison between the open peer-to-peer architecture and the cloud based architecture.
We have opted for the cloud based architecture as it has numerous advantages over the open peer-to-peer one. From a performance point of view, it has better network properties, less churn and a smaller replication factor, and a cache can be added to boost performance. Moreover, the security requirements are hard to achieve in the open peer-to-peer architecture, while most security problems are solved simply by moving to the cloud architecture. The only obvious advantage of the peer-to-peer solution is that it is free. In the next chapter we take a look at the datastores we are using and how we represent our data in them.
Chapter 4
The Datastore
In this chapter we take a closer look at the datastores we are going to use: Beernet and Scalaris. From there we identify the design guidelines we followed to build the datastore schema, and then detail the schema itself. We end the chapter by discussing the problem of running several services on the same datastore, which brings us to the secret API we designed for Beernet.
4.1 The datastore choice
4.1.1 Identifying what we need
As we saw in the state of the art, there are several types of datastores: key/value stores, document stores, extensible stores and relational databases. We have only a few types of objects to store in our datastore, namely lines, lists, users and tweets. Furthermore, we do not need any complex operations like the joins and queries available in RDBMSs. We want to use a simpler data model to avoid the unnecessary burden of maintaining complex structures. Moreover, we want the most scalable and elastic solution possible, and RDBMS-like systems were shown not to be efficient in those fields.
For all those reasons we opted for key/value stores, and more precisely key/value stores with transactional capabilities. Transactions allow us to pack several operations together and execute them atomically: a transaction either executes all those operations successfully, or none of them if it aborts. This allows us to generate unique keys and maintain the integrity of our data structures.
Suppose we want to store a value Bar at a key Key; nothing guarantees that something else was not already stored at Key. We thus do two operations: operation A, a look-up on Key; operation B, if the response to operation A was "not found", storing Bar at Key. But this is not correct, because nothing guarantees that no other operation C on Key can happen between operations A and B. We must thus run operations A and B in a transaction so that no other operation C can come in between them. During our discussion of the datastore design in section 4.3, we use the transactional support to generate unique IDs using counters that we read and increment atomically.
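The check-then-write pattern and the atomic counters can be illustrated with a toy transactional store. The TxStore class below is our own stand-in: it serialises transactions with a global lock, whereas Beernet and Scalaris provide real distributed transactions with the same atomicity guarantee.

```python
import threading

class TxStore:
    """Toy key/value store whose transactions run under a global lock,
    so no concurrent operation can interleave between a read and a write."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def transaction(self, body):
        with self._lock:
            return body(self._data)

def store_if_absent(store: TxStore, key, value) -> bool:
    # Operations A (lookup) and B (conditional write) in one transaction.
    def body(data):
        if key in data:        # operation A: key already taken
            return False       # some operation C got there first
        data[key] = value      # operation B
        return True
    return store.transaction(body)

def next_unique_id(store: TxStore, counter_key) -> int:
    # Read and increment a counter atomically to generate a unique ID.
    def body(data):
        n = data.get(counter_key, 0) + 1
        data[counter_key] = n
        return n
    return store.transaction(body)
```
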
Persistence is a key requirement that we do not address in Bwitter. Unfortunately, the key/value datastores that fulfil our other requirements do not provide persistence. Scalaris is planning to add this feature, but it is still in the development phase. We could use a parallel datastore for backups, as Twitter does [36], but we do not address this problem.
The datastore must be robust in the sense that it must be capable of handling a lot of churn without failing. This is crucial in the case of the fully distributed architecture. Indeed, machines would not be under our control, and a large number of machines would constantly join and abruptly leave the system. Our datastore should be able to manage those abrupt leaves, which behave like machine failures, to ensure no data is lost. As we decided to go with the cloud based architecture, we work in an environment where the machines provided are not expected to fail abruptly. Robustness is thus still critical, but datastores can afford more complex recovery algorithms, as failures are less likely to happen. Although most of the machine leaves and joins are under control, those operations must be efficient in order to have an elastic application. Handling churn correctly means that the datastore must maintain correct routing between the peers as well as the replication factor.
4.1.2 Our two choices
There are several key/value datastores available, but only two offer transactional capabilities: Beernet and Scalaris. Both fulfil our datastore requirements but differ on some points. We now introduce these two datastores.
Beernet
Beernet [19, 23] is a transactional, scalable and elastic peer-to-peer key/value datastore built on top of a DHT. Peers in Beernet are organised in a relaxed Chord-like ring [30] and keep O(log(N)) fingers for routing, where N is the number of peers in the network. This relaxed ring is more fault tolerant than a traditional ring, and its robust join and leave algorithms for handling churn make Beernet a good candidate for building an elastic system. Any peer can perform lookup and store operations for any key in O(log(N)). The key distribution is done using a consistent hash function, roughly distributing the load among the peers. These two properties are strong advantages for system scalability compared to solutions like the client/server model.
Beernet provides transactional storage with strong consistency, using different data abstractions. Fault-tolerance is achieved through symmetric replication, which has several advantages, not detailed here, over leaf-set and successor-list replication strategies [11]. In every transaction, a dynamically chosen transaction manager (TM) guarantees that if the transaction is committed, at least a majority of the replicas of an item store the latest value of the item. A set of replicated TMs guarantees that the transaction does not rely on the survival of the TM leader. Transactions can involve several items; if the transaction is committed, all items are modified. Updates are performed using optimistic locking.
With respect to data abstractions, Beernet provides not only key/value pairs, as in Chord-like networks, but also key/value sets with non-blocking add operations, as in OpenDHT-like networks [26]. The combination of these two abstractions provides more possibilities for designing and building the datastore, as we explain in Section 4.3. Moreover, key/value sets are lock-free in Beernet, providing better performance for set operations.
Elasticity in Beernet
We previously explained that to prevent overloading, the system needs to scale up
to allocate more resources to be able to answer to an increase of user requests. Once
the load of the system gets back to normal, the system needs to scale down to release
unused resources. We briefly explain how Beernet handles elasticity in terms of data
management.
Scale up: When a node j joins the ring between peers i and k, it takes over part of the responsibility of its successor, more specifically all keys in the range ]i, j]. Therefore, data migration is needed from peer k to peer j. The migration involves not only the data associated with keys in the range ]i, j], but also the replicated items symmetrically matching the range. Other NoSQL datastores, such as HBase [1], do not trigger any data migration when new nodes are added to the system, and thus show better performance when scaling up.
Scale down: There are two ways of removing nodes from the system: by gently leaving and by failing. It is very reasonable to consider gentle leaves in cloud environments, because the system explicitly decides to reduce its size. In that case, it is assumed that the leaving peer j has enough time to migrate all its data to its successor, who becomes the new responsible for the key range ]i, j], i being j's predecessor. Scaling down due to the failure of peers is much more complicated, because the new responsible node for the missing key range needs to recover the data from the remaining replicas. The difficulty comes from the fact that the application keys are unknown, since the hash function is not bijective. Therefore, the peer needs to perform a range query, as in Scalaris [29], but based on the hash keys. Another complication is that replica sets are not based on key ranges, but on each single key.
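For illustration, a common formulation of symmetric replication, which we assume here, places the f replicas of a hashed key k at positions k + i·N/f mod N for i = 0, …, f−1, where N is the size of the identifier space:

```python
def replica_keys(key_hash: int, keyspace: int, f: int):
    """Symmetric replication: f replica positions evenly spaced
    around an identifier circle of size `keyspace`."""
    step = keyspace // f
    return [(key_hash + i * step) % keyspace for i in range(f)]
```

A useful property of this placement is that replica sets are closed: applying the formula to any replica key yields the same set of positions, which helps a peer locate the remaining replicas when recovering data after a failure.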
Scalaris
Much like Beernet, Scalaris offers a transactional, scalable and elastic peer-to-peer key/value datastore, also built on top of a DHT [29]. Scalaris is currently based on a traditional Chord ring, with a possible upgrade to Chord#. While not as fault tolerant as Beernet, Scalaris is a good candidate for building elastic systems too. Lookup and store operations have the same complexity, O(log(N)), where N is the number of peers in the network. Currently the key distribution is done using a hash function, but keys could be lexicographically ordered after the upgrade to Chord#.
As in Beernet, Scalaris provides transactional storage with strong consistency, and fault-tolerance is achieved through symmetric replication. Transactions are handled by a local transaction manager associated with the node to which the user is connected. Transactions are executed optimistically: a transaction is first executed completely on the associated node and then, if it succeeded, stored at the responsible nodes.
Besides the classical key/value pairs, Scalaris also supports key/value lists as a data abstraction. Lists, as opposed to Beernet sets, are not lock-free, and there exists no add operation on lists. In order to add an element to a list atomically, we must, in a single transaction, read the list, add the element to it, and write it back to Scalaris. Lists are thus a convenient abstraction that saves the programmer from developing his own parsing system, but they do not offer any performance improvement.
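The read-modify-write pattern for lists can be sketched as follows. `run_in_transaction` is a hypothetical wrapper standing in for the real Scalaris transaction API, here simulated with a global lock so the three steps cannot be interleaved with another writer.

```python
import threading

_lock = threading.Lock()   # stands in for Scalaris's transaction machinery

def run_in_transaction(body, store):
    # Hypothetical wrapper: executes `body` atomically against `store`.
    with _lock:
        return body(store)

def add_to_list(store: dict, list_key, element):
    """Atomically append: read the list, add the element, write it back.
    Outside a transaction, a concurrent writer could make us lose updates."""
    def body(data):
        current = list(data.get(list_key, []))   # read
        current.append(element)                  # modify
        data[list_key] = current                 # write back
        return current
    return run_in_transaction(body, store)
```
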
Conclusion
Beernet and Scalaris both fit our needs with their elasticity and scalability properties and their native data abstractions. Unfortunately, due to some unexpected problems with Beernet, we were forced to continue with Scalaris alone. This was disappointing, as we were working closely with Boris Mejıas, the developer of Beernet, to further improve his system with the richer API presented in section 4.4.1.
4.2 General Design
The design of the datastore is closely linked to our application requirements. Hence, before going straight into the design of the datastore, we take some time to explain the guidelines we elicited from the requirements to build the datastore's schema. Some choices might be unclear now, but they will be clarified when we present the algorithms in Chapter 5.
Make reads cheap
While designing the lines we had to decide whether to favour reads or writes. If
we privilege reads, we push the information to the lines and put the burden on the
writes: the “post tweet” operation adds a reference to the tweet in the line of each
follower. We call this the push approach. On the other hand, we could privilege the
writes. In that case we pull the information and build the lines each time a user wants
to read them, by fetching all the tweets posted by the users he follows and reordering
them. We call this the pull approach. As people read more than they post on social
networks, and based on the assumption that each posted tweet is read at least once,
we opted to make reads cheaper than writes and thus privileged the push approach.
However, we also study the pull approach and compare it with the push approach when
we present our algorithms in Chapter 5 and our experiments in Chapter 6.
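This trade-off can be made concrete with a back-of-the-envelope cost model. The sketch below is ours and only counts datastore operations; the workload numbers in the test are purely illustrative.

```java
// Sketch: datastore-operation cost per action under the push and pull
// approaches. The follower/followee counts are hypothetical parameters,
// not Bwitter measurements.
public class TimelineCostModel {
    // Push: posting writes one reference per follower; reading a line is one fetch.
    static int pushWriteCost(int followers) { return followers; }
    static int pushReadCost() { return 1; }

    // Pull: posting is one write; reading fetches recent tweets of every followee.
    static int pullWriteCost() { return 1; }
    static int pullReadCost(int followees) { return followees; }

    // Total operations for a workload of p posts and r line reads per user.
    static int pushTotal(int p, int r, int followers) {
        return p * pushWriteCost(followers) + r * pushReadCost();
    }
    static int pullTotal(int p, int r, int followees) {
        return p * pullWriteCost() + r * pullReadCost(followees);
    }
}
```

With 1 post and 10 line reads for a user having 5 followers and 5 followees, push costs 15 operations against 51 for pull, matching the intuition that read-heavy workloads favour the push approach.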
Do not store tweets in the lines but references
There is no need to replicate the whole tweet inside each line, as a tweet could
potentially contain a lot of information and should be easy to delete. Therefore, we
prefer to store references to tweets. To delete a tweet, the application only has to edit
the stored tweet and does not need to go through every line that could contain it.
When loading the tweet, the application can see whether it has been deleted or not.
Minimise the changes to an object
We want the objects to be as immutable as possible to enable caching. This
is why we avoid storing potentially dynamic information inside the objects and rather
keep a pointer to it. For instance, tweets are only modified when we delete them; this
is why a reply to a tweet should not modify the tweet itself.
Do not make users load unnecessary things
Loading the whole line each time we want to see the new tweets would result in
an unnecessarily high number of exchanged messages and would consume a lot of
bandwidth. This is why we decided to cut the lines, which are in fact just big sorted
sets, into subsets of x tweets organised in a linked-list fashion, where x is a tunable
parameter. Set fragmentation is done differently depending on the chosen design of
the datastore; this is explained later in the algorithms section.
Retrieving tweets in order
Users want to retrieve the most recently posted tweets first; tweets are thus dated
to allow ordering. Tweets must therefore be stored so that getting the most recent
ones is easy and efficient. We have built an algorithm that guarantees the correct
ordering of the tweets inside our lines even in the presence of network reordering and
failures.
Filtering the references
When a user is dissociated from a line, we do not want our application to keep
displaying the tweets he posted previously. We decided not to scan the whole line to
remove all the references added by this user. Instead, we remove the user from the list
of users associated with the line, and filter the references against this list before
fetching the corresponding tweets.
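As a sketch, the filtering step can be written as follows; TweetRef and its field names are simplified stand-ins for our real objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the filtering step: drop references whose poster is no longer
// associated with the line, so their tweets are never fetched.
public class ReferenceFilter {
    // Simplified stand-in for the real reference object (see Figure 4.3).
    record TweetRef(String poster, String tweetKey) {}

    static List<TweetRef> filter(List<TweetRef> lineRefs, Set<String> lineUsers) {
        List<TweetRef> kept = new ArrayList<>();
        for (TweetRef ref : lineRefs) {
            if (lineUsers.contains(ref.poster())) {
                kept.add(ref);   // only fetch tweets of still-associated users
            }
        }
        return kept;
    }
}
```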
Only encrypt sensitive data
Most of the data in Twitter is not private, so there would be no point in encrypting
it. Only sensitive data, such as the users' passwords, should be protected by
encryption when stored in the datastore.
Simple data structures
We believe that maintaining complex data structures is not a good idea in a key/value
store. Indeed, maintaining them requires transactions, and those are more
likely to fail if updating a data structure requires accessing a lot of different keys at
the same time.
4.3 Design of the datastore
The design of the datastore is an important part of the project. In our case it
is more complicated than for a classical database because we do not have high-level
data structures like database tables. As a reminder, Beernet and Scalaris both provide
two different data structures, key/value pairs and key/set (or key/list) pairs, the second
one allowing to store multiple values at the same key.
As we wanted an easy way to store and retrieve Java objects from the datastore,
we decided to serialize them. When serialized, Java objects are transformed into strings
conforming to the XML1 format, and are then stored as values in our datastore.
This has the advantage that we can easily recover the Java objects later if needed, or
directly answer Bwitter requests in XML without even deserializing those objects.
Moreover, XML has the advantage of being a widely used format, so a lot of existing
libraries handle it. The process to add something to our datastore is the following:
create a Java object, serialize it, choose a unique key and finally store the key/value
pair in the datastore. We deliberately postpone the discussion of robustness and of the
shared key space, as we dedicate a section to each of those two problems after the
details of our design.
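As an illustration of this serialization step, the standard java.beans.XMLEncoder turns a Java bean into an XML string; this is a sketch with one standard library option, not necessarily the library used in Bwitter.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Sketch: serializing a Java bean to an XML string before storing it as a
// value. The UserProfile bean here is a minimal example, not the real one.
public class XmlValue {
    public static class UserProfile {
        private String realName = "";
        public String getRealName() { return realName; }
        public void setRealName(String n) { realName = n; }
    }

    static String serialize(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(bean);              // emits XML 1.0 output
        }
        return out.toString();
    }

    static Object deserialize(String xml) {
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(xml.getBytes()))) {
            return dec.readObject();            // rebuilds the Java object
        }
    }
}
```

The XML string can be stored as a value directly, and either deserialized later or returned to clients as-is.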
Our first attempt to design a social network on a key/value datastore was based
on references: everything except the user object was stored at random, meaningless
keys. The user profiles contained references to the other objects belonging to the user.
For example, the lines of a user were kept in a user set whose reference was kept in
the user object.
After some thought we decided to drop the random keys and references, and replaced
them with a design based on human-understandable and computable keys. The key
space layout now looks like a file directory. We no longer need to follow a chain of
references to access an object: it can be addressed directly. This also removes the
burden of managing the references, which in turn reduces the number of operations
needed and improves performance. Moreover, the old design had a bigger space
complexity because it had to store references along the whole path from the user profile
to the object itself. Thanks to this simple addressing, it is also easier to write clear
code and avoid bugs. Note that throughout this section, when we talk about keys, the
variable parts of a key are written in bold characters while the static parts are not.
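As a small illustration of computable keys, the helpers below forge a few of the keys of the push design directly from the username; the helper names are ours.

```java
// Sketch: with computable keys, any object can be addressed directly from
// the username and object name, with no reference chain to follow.
public class Keys {
    static String userProfile(String username) {
        return "/user/" + username;
    }
    static String password(String username) {
        return "/user/" + username + "/password";
    }
    static String tweet(String username, long tweetNbr) {
        return "/user/" + username + "/tweet/" + tweetNbr;
    }
}
```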
We have two different datastore designs: one for the push approach and one for the
pull approach. The push approach pushes the posted information to the readers, while
the pull approach retrieves it from the posters. We focus on the push approach, which
we believe is the best suited to our application, and only briefly describe the pull design.
1 http://www.w3.org/XML/, last accessed 14/08/2011
4.3.1 Key uniqueness
For now we assume only Bwitter is running on the datastore. We must still ensure
key uniqueness to avoid unwanted overwriting of data. In order to do so, information
must be kept in the datastore for each key already used, and this information must be
stored at a known location. We separate the datastore into several groups of objects,
for example the tweets of a user, a line of a user, sets of tweets, etc. For each of those
groups we keep track of the number of objects in the group, so that we can forge a
new key for each new object. Each group must have a unique base key from which
we can create new unique keys for the members of the group. As an example, we show
how we add a new tweet to the tweets already posted by a user. We assume that the
tweets of a user are stored under the base key “/user/username/tweets/” (username
being the username of the user), called tweetBase, and that the number of his tweets
is stored at “tweetBase/size”. The following pseudo code adds a new tweet to the
tweets already posted by that user.
addNewTweet(tweet) {
    begin transaction
        x = Read("tweetBase/size")
        x = x + 1
        Write("tweetBase/size", x)
        Write("tweetBase/" + x, tweet)
    end transaction
}
This ensures that we always use unique keys when adding a new object to a group.
The drawback is that all object additions go through the same key, where the number
of objects is stored. Any two parallel transactions that add an object to the same
group thus conflict. It is therefore important to keep this limitation in mind while
designing our data structures.
One problem remains: we just stated that the base keys need to be unique. We
consider that the username of a user must be unique, which allows us to create unique
base keys for each user. This uniqueness can easily be checked when each user
registers.
4.3.2 Push approach design details
Users
The user object, represented in Figure 4.1, contains the real name of the user and his
registration date. Any other personal information could be added to this object later.
We store the user object at “user/username”.
Figure 4.1: User profile object of user “Paul”.
We store the hashed password of the user at the key “user/username/password”.
We use it to authenticate each operation involving a write. We store this value on its
own because it is requested more often than the rest of the user's personal information.
We propose to add a special structure, shown in Figure 4.2, that allows searching for
users. Indeed, searches are not well supported in a key/value store because application
keys are not organised in lexical order on the ring, but according to a hash function.
We thus group in the same key/set (key/list) pair the real names that share some prefix,
making no difference between upper and lower case. The user search tree we propose
is a binary search tree; we made this choice because we know it is an efficient
structure for insertion and retrieval. Leaf nodes contain mappings between real names
and usernames, which allows finding the username of a user from his real name. Indeed,
people do not necessarily know the username of someone, and we identify users by
their username. This structure is therefore crucial for users to easily find people
they know in our system. All the leaf nodes together cover the whole alphabet.
Parent nodes do not contain any search information; they only keep references to their
children. Leaf nodes have an approximate maximum size. When the size of a leaf node
reaches this limit, we add two children to it and split its responsibility interval
between the two. Due to lack of time, we did not develop a formal algorithm for this
search tree; it is thus not present in our implementation.
Figure 4.2: Username search tree.
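Since we left the formal algorithm open, the following is only one possible in-memory sketch of the search tree: leaves hold (real name, username) mappings and split their responsibility interval when they grow past a maximum size.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of one possible realization of the username search tree. Leaves map
// lower-cased real names to usernames; parents keep no search data, only
// children. This is our illustration, not part of the implementation.
public class NameSearchTree {
    private final int maxLeafSize;
    private final Node root;

    public NameSearchTree(int maxLeafSize) {
        this.maxLeafSize = maxLeafSize;
        this.root = new Node('a', 'z');       // one leaf covering the alphabet
    }

    private static class Node {
        char lo, hi;                          // responsibility interval
        Node left, right;                     // null for leaf nodes
        TreeMap<String, String> entries = new TreeMap<>();
        Node(char lo, char hi) { this.lo = lo; this.hi = hi; }
        boolean isLeaf() { return left == null; }
    }

    public void insert(String realName, String username) {
        String key = realName.toLowerCase();
        Node n = root;
        while (!n.isLeaf()) n = (key.charAt(0) <= n.left.hi) ? n.left : n.right;
        n.entries.put(key, username);
        if (n.entries.size() > maxLeafSize && n.lo < n.hi) split(n);
    }

    private void split(Node n) {
        char mid = (char) ((n.lo + n.hi) / 2);
        n.left = new Node(n.lo, mid);
        n.right = new Node((char) (mid + 1), n.hi);
        for (Map.Entry<String, String> e : n.entries.entrySet()) {
            Node child = (e.getKey().charAt(0) <= mid) ? n.left : n.right;
            child.entries.put(e.getKey(), e.getValue());
        }
        n.entries = null;                     // parents keep no search data
    }

    public String lookup(String realName) {
        String key = realName.toLowerCase();
        Node n = root;
        while (!n.isLeaf()) n = (key.charAt(0) <= n.left.hi) ? n.left : n.right;
        return n.entries.get(key);
    }
}
```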
Lines and lists
Lines and lists are really similar, so we only detail lines: lists are simply lines
without any associated users. A line has a set of tweets and a set of associated users.
In practice, and as stated in the main guidelines, tweets are not stored in lines; instead,
we store references to them. A reference contains a date, the username of the poster
(used for filtering), the username of the original poster if it is a retweet, and the key of
the referenced tweet, as can be seen in Figure 4.3.
Figure 4.3: Reference to tweet object to be stored in a line or list.
Sets of usernames are not split like tweet sets because they are always read in their
entirety when used. We also keep a set containing all the line and list names so that
we can easily retrieve them (see Figure 4.4).
Figure 4.4: Left) Lines set of user “Paul”. Right) User set of the “coolpeople” line ofuser “Paul”.
The set of tweets associated with a line or list can become very big. Following
our main design guidelines, we do not store it in one set but as a list of
chunks organised in chronological order from most recent to oldest, as can be seen in
Figure 4.5. The head is at a fixed location (/user/username/line/linename/head),
which allows us to quickly add an element to this set and to read the latest tweets. The
other chunks are located at a fixed base key (/user/username/line/linename) to which
we concatenate a number called chunkNbr. The chunk with chunkNbr equal to
0 is the oldest. The newest chunk has a chunkNbr equal to the value contained at
the key “/user/username/line/linename/size” minus 1. It is thus easy to access any
chunk of the line.
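A sketch of the corresponding key arithmetic (the helper names are ours):

```java
// Sketch: forging the chunk keys of a line. The value stored under "size"
// holds the number of chunks, so the newest non-head chunk is size - 1.
public class LineChunks {
    static String base(String user, String line) {
        return "/user/" + user + "/line/" + line;
    }
    static String headKey(String user, String line) {
        return base(user, line) + "/head";
    }
    static String sizeKey(String user, String line) {
        return base(user, line) + "/size";
    }
    static String chunkKey(String user, String line, int chunkNbr) {
        return base(user, line) + "/" + chunkNbr;
    }
    // Key of the newest full chunk, given the value read at sizeKey.
    static String newestChunkKey(String user, String line, int size) {
        return chunkKey(user, line, size - 1);
    }
}
```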
This may not be obvious at the moment, but the number of tweets in each chunk
matters a lot, as it influences the complexity of the algorithms we present in
the next section.
Figure 4.5: Top) Number of chunks in the “coolpeople” line of user “Paul”. Bottom)The head chunk and two chunks of the “coolpeople” line of user “Paul”.
Topost set
The Topost set, represented in Figure 4.6, contains references to the lines (keys of
the lines in the datastore) in which the user must post references to his tweets. We do
not store the whole key of a line because some parts of it are constant. Instead, we
store what is needed to reconstruct it: the name of the line and the username of the
line's owner.
As was the case for the lines, the Topost set is fragmented using the same technique.
Each of its chunks contains at most nbrOfFollowersPerChunk references;
this is a parameter that has to be tuned and is further discussed in section 6.3.2 of our
experiment chapter. Moreover, each chunk also has a counter, used to implement
the post tweet algorithm robustly. This counter has a value between -1 and the number
of tweets the user has posted, the latter excluded. In Figure 4.6, you can notice that
the tweets of the owner of the Topost set were not correctly posted for all the chunks:
the counter values differ between the chunks, indicating some remaining tweets to
post. We add another counter that remembers the tweet number of the last tweet
that was correctly posted; it is also initialized at -1. In this example, assuming Paul
has already posted 12 tweets, we can see that one tweet still needs to be posted for
chunk 0 and two for chunk 1.
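The bookkeeping implied by these counters can be sketched as follows: a chunk whose counter lags behind the number of the last tweet posted still has tweets to push.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the counter bookkeeping of the Topost set. Counters start at -1;
// a chunk counter equal to the number of the last tweet posted means the
// chunk is up to date. Method names are ours.
public class TopostProgress {
    // Tweets still to push to the lines of one chunk.
    static int remaining(int lastTweetNbr, int chunkCounter) {
        return lastTweetNbr - chunkCounter;
    }

    // Chunk numbers that still have tweets to push.
    static List<Integer> pendingChunks(int lastTweetNbr, int[] chunkCounters) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < chunkCounters.length; i++) {
            if (chunkCounters[i] < lastTweetNbr) pending.add(i);
        }
        return pending;
    }
}
```

In the example of Figure 4.6, a last-tweet counter of 11 with chunk counters 10 and 9 yields one and two remaining tweets respectively.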
Figure 4.6: Different parts of the Topost set of user “Paul”. Top left) Number of chunks in the Topost set. Top right) Global counter of correctly posted tweets. Center) Chunk counters of correctly posted tweets. Bottom) Chunks of the Topost set.
Tweet
The messages the users post are called tweets. As mentioned before, a tweet is a
small message of at most 140 characters. The tweet object contains a message field as
well as a poster field. Moreover, some tweets can be retweeted; to handle this situation
we added an original author field that contains the name of the original author of the
tweet. This field is null if the tweet is not a retweet. Tweets are also dated with second
precision; the time stored in the datastore is the Greenwich Mean Time (GMT) for
the whole system, and it is up to the GUI layer to adapt the time to the local area
when displaying the tweet. A field indicates whether the tweet was deleted by its
owner. Finally, users can reply to tweets. We want to be able to reconstruct the
complete conversation from any tweet; therefore we keep a reference to a potential
parent and to a set of children. An example is given in Figure 4.7: Tweet2 is a
response to Tweet1, Tweet3 is a response to Tweet2, Tweet6 is a response to Tweet4,
and so on. Tweets are stored only once in the datastore; we made this choice in order
to make their deletion easier and to minimize the data stored in the datastore.
Figure 4.7: Conversation Tree.
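The fields described above can be summarised in a class sketch; the field names are illustrative, not the exact ones of our implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the tweet object's fields as described in the text.
public class Tweet {
    String message;                 // at most 140 characters
    String poster;                  // username of the poster
    String originalAuthor;          // null unless the tweet is a retweet
    long dateGmtSeconds;            // posting time, GMT, second precision
    boolean deleted;                // set when the owner deletes the tweet
    String parentKey;               // key of the tweet answered, or null
    List<String> childrenKeys = new ArrayList<>();  // keys of the replies

    boolean isRetweet() { return originalAuthor != null; }
}
```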
The key of a new tweet is the concatenation of the key prefix “/user/username/tweet/”
with the number of tweets already posted by the user. The schema of tweet number 42
posted by the user “Paul” is shown in Figure 4.8.
Figure 4.8: Left) Tweet number 42 object of user “Paul”. Right) Number of tweets ofuser “Paul”.
4.3.3 The Pull Variation
As explained in the introduction, we also decided to experiment with a variation
of the push-based design, and to observe how the system would behave if we pulled
the information instead of pushing it. As this was not our primary goal, we first
focused on making the design of the datastore as efficient as possible with only the
push approach in mind, and afterwards tried to fit the pull variation in. This went
very well, as the pull approach borrows a great majority of the building blocks and
even mechanisms of the push approach.
In the pull variation we store the references only on the owner's side; we explain how
those tweets are retrieved in the algorithms chapter. Furthermore, the references are
kept grouped by timestamp, meaning that the tweets posted during the same time
frame, for instance the same hour, are grouped together. The timestamp is of the
form 05/06/11 15 h 26 min 03 s GMT, with some fields set to zero according to the
chosen time granularity. For instance, if we want the references to be grouped by
hour, we would use a timestamp of the form 05/06/11 15 h 00 min 00 s GMT. The
full key looks like this: /user/username/tweet/timestamp.
We also have to store the subscription date of the user in order to compute
the equivalent of the chunk numbers. This date is stored at the key
user/username/starttime.
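A sketch of how the grouping key can be forged with the standard java.time API, assuming an hourly granularity:

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Sketch: grouping key for the pull variation. The posting time is truncated
// to the chosen granularity (here: the hour) before being embedded in the key.
public class PullKeys {
    // Minutes and seconds are rendered as zeros, matching the truncation.
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("dd/MM/yy HH' h 00 min 00 s GMT'");

    static String timestampKey(String username, ZonedDateTime postedAt) {
        ZonedDateTime gmt = postedAt.withZoneSameInstant(ZoneOffset.UTC)
                                    .truncatedTo(ChronoUnit.HOURS);
        return "/user/" + username + "/tweet/" + FMT.format(gmt);
    }
}
```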
Object                        Type   Key                                       Description
User profile                  Value  /user/username                            User profile with user information
Password                      Value  user/username/password                    Hashed password of the user
Topost set                    Set    /user/username/topost/chunkNbr            Set of lines where the user has to post the references to his tweets
Topost chunk counter          Value  /user/username/topost/chunkNbr/counter    Counter associated to each chunk of the Topost set
Topost set size               Value  /user/username/topost/size                Number of chunks in the Topost set of a user
Last tweet correctly posted   Value  /user/username/topost/lasttweetposted     Tweet number of the last tweet correctly posted
Tweet                         Value  /user/username/tweet/tweetNbr             Tweet object containing the message
Replies to tweet              Set    /user/username/tweet/tweetNbr/children    Replies to the tweet
Tweet counter                 Value  /user/username/tweet/size                 Number of tweets posted by a user
Lines set                     Set    /user/username/linenames                  Names of the lines of the user
Line chunk                    Set    /user/username/line/linename/chunkNbr     Chunk of a line containing tweet references
Line chunk counter            Value  /user/username/line/linename/size         Number of chunks in the line (head not counted)
Line users                    Set    /user/username/line/linename/users        Users associated to a line
Lists set                     Set    /user/username/listnames                  Names of the lists of the user
List chunk                    Set    /user/username/list/listname/chunkNbr     Chunk of a list containing tweet references
List chunk counter            Value  /user/username/list/listname/size         Number of chunks in the list

Table 4.1: Keys used in the datastore for the push design
4.3.4 Conclusion
Table 4.1 summarises the kinds of keys we use in our datastore for the push design.
Every key used is of course unique. Remember that the text in bold is variable while
the rest is static.
Our datastore design was rebuilt several times in order to meet the important
criteria we fixed: simplicity, scalability and clarity. We built a structure for the lines
that allows retrieving the latest tweets easily and in chronological order. We cut
the lines and the Topost set into chunks because both can become very big (billions
of tweets and millions of followers). We also designed a structure to efficiently search
for users in the system.
Concerning which approach is the best between push and pull, the intuition is that
push is the best approach for reads while pull is the best for writes. We compare the
two approaches theoretically when discussing the algorithms in Chapter 5 and test
them in the experiments of Chapter 6.
4.4 Running multiple services using the same datastore
There are numerous situations where multiple applications may want to share the
same datastore. For instance, we could easily imagine a globally distributed datastore
deployed in a peer-to-peer environment being used by multiple applications, exactly as
we suggested in our first architecture. This would encourage users to let their datastore
node run longer and would mitigate the heavy churn problem we would face if those
users only used the datastore for our Bwitter application: they would launch it to
consult their latest tweets or to post a tweet, and then directly close it. Although we
do not face this churn problem in a cloud or any other stable environment, the remark
is also valid there. Indeed, an application's plugins may want to store additional
data that should not interfere with that of the main program, while still being able to
access it.
So while we could limit access to the datastore to Bwitter, this would be a clear
limitation. We are thus going to take a closer look at the problem of sharing the
datastore, and particularly the key space. After some thought, we reduced the problem
of sharing the key space to two smaller problems: keys already used and unfortunate
or malicious data erasing.
We explored different ways to solve those problems at the datastore level. Even
though we did not use those solutions in the end, it is still relevant to expose our work
and conclusions here. Note that this work has only been done on Beernet and not on
Scalaris, due to our privileged collaboration with the developer of Beernet, Boris
Mejías, since the beginning of our project.
4.4.1 The unprotected data problem
Early in the process, we elicited a crucial requirement: the integrity of the data
posted by the users on Bwitter must be preserved. A classical mechanism, though not
without flaws, is to use a capability-based approach: data is stored under randomly
generated keys, so that other applications and users cannot erase the values because
they simply do not know at which keys the values are stored. However, in applications
where content has to be publicly available, we cannot protect all our values simply by
using unguessable keys. For example, Bwitter allows any unknown user to add his
name to the Topost set of another user in order to subscribe to his tweets. This set
must not only be readable by any user but also writable by any user. In practice, we
would use the set abstraction provided by Beernet to implement it. Any user needs
the possibility to add an element to the set, but it should be impossible for anyone
but the creator of the set and the user who added a value to remove that value. The
problem is that Beernet does not allow any form of authentication, so key/value pairs
are unprotected: anybody able to send requests to Beernet can modify and delete any
data previously stored. We detail here several solutions we imagined to solve this
problem.
Safe environment assumption
First, we assume Beernet is running in the cloud and that the nodes are managed
by a different entity than the applications running on top of it. This means that
nobody but this entity can add nodes, and that the communications between the
different nodes cannot be spied on: Beernet inter-node communications take place on
a LAN inaccessible from outside the cloud. Moreover, we assume that the
communications between Beernet nodes and applications are encrypted, so nobody is
able to spy on them.
Cooperation between applications
The most naive solution is to assume that all the applications running on Beernet
are written without bugs and are respectful of each other. This means that each time
an application wants to write a Key1/Value1 pair, it checks that no other Key1/Value2
pair with the same key was already written by another application. Additionally, this
operation has to run in a transaction to avoid race conditions. This should normally
not induce too much performance overhead, because applications usually run a
transaction anyway each time they store a value using the transactional replicated
storage of Beernet. In order to be able to perform this check, each time a value is
stored, information identifying the application that posted the value must be manually
added to the value by the posting application.
This solution makes a strong assumption, and even if this assumption holds, it adds
complexity to the code of each application running on Beernet: applications need to
parse each value they read and add information to each value they post.
Data protected by secrets
We now lift the assumption made in the previous solution: we assume that several
applications running on top of Beernet are not respectful of each other and do not
cooperate. We would like to enable an application A to protect the values it posted
from being overwritten by an application B. This is not possible without the help of
Beernet, because the two applications can access Beernet freely and are not
cooperating. We have thus designed a solution that enhances the API of Beernet: an
application can protect a key/value pair it posted using a secret of its own choosing.
This secret is then needed by any operation that tries to modify or delete the value
associated with the newly protected key. Because Beernet is running in a secure
environment, secrets will not leak from Beernet; a malicious user can still try to guess
an application's secret, but it is the application's responsibility to use secrets that are
hard to discover.
A secret mechanism was developed for OpenDHT [26]: it makes it possible to attach
a removal secret to a value, which is then requested when a delete operation is
performed. In a very similar fashion, Beernet's secret mechanism allows sharing values
with other applications while keeping them protected at the same time. Application A
can now write a value and protect it against editing and deleting using a secret.
Without this secret anyone can still read the value, but nobody can edit or delete it.
But the secret mechanism developed for Beernet goes further: sets are now
protected by secrets too, and offer much more flexibility. Three different secrets can
be used to protect the different parts of a set.
First, the Set Secret is one of the two secrets associated with the set itself when
it is created. It can be seen as a master key allowing its owner to perform any
operation on the set. The creator of the set can thus destroy the set along with all its
contents, insert items into the set, but also delete each item contained in the set
separately.
Secondly, the Write Secret is the other secret associated with the set itself when
it is created. This secret is required to add an item to the set. This way, the creator
of the set can decide to whom he gives the right to add items to his set.
Finally, the Value Secret is associated with a given item in the set. It protects a
single item against editing and deleting, so that only the user who added the item and
the owner of the set can delete it. This secret is chosen by the user who adds the value
to the set.
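To make the three rules concrete, here is a small in-memory model of a secret-protected set. It only mimics the access rules; it is not Beernet code.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the three-secret protection of Beernet sets, to make
// the rules concrete. This only mimics the access checks, not the
// distributed store itself.
public class SecretSet {
    private final String setSecret;    // master secret held by the creator
    private final String writeSecret;  // required to add items
    private final Map<String, String> items = new HashMap<>(); // value -> value secret

    public SecretSet(String setSecret, String writeSecret) {
        this.setSecret = setSecret;
        this.writeSecret = writeSecret;
    }

    // Adding requires the Write Secret; the caller picks a Value Secret.
    public boolean add(String wSecret, String value, String valueSecret) {
        if (!writeSecret.equals(wSecret)) return false;   // abort
        items.put(value, valueSecret);
        return true;                                      // commit
    }

    // Removing requires the item's Value Secret or the Set Secret.
    public boolean remove(String secret, String value) {
        String vSecret = items.get(value);
        if (vSecret == null) return false;
        if (secret.equals(vSecret) || secret.equals(setSecret)) {
            items.remove(value);
            return true;
        }
        return false;
    }

    // Reading is never restricted by a secret.
    public boolean contains(String value) { return items.containsKey(value); }
}
```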
This new way of protecting sets makes it easy to implement numerous applications
based on user-posted content. Comments on blogs, for instance, become extremely
easy. The author of a blog can give other users the permission to add comments to an
entry. All users can then see the comments posted by their peers, but can only edit
the comments they posted themselves. The author can moderate the posted
comments, as he also has the right to delete and edit them. This is only a short and
simplistic example, but we are convinced this new secret mechanism will make the
development of more complex applications much easier.
New semantics using secrets We need three new kinds of fields, one for each secret,
in addition to the existing Key and Val fields. Those new fields are automatically set
to NO SECRET when applications use the functions of the old API, which do not take
any secret. NO SECRET is a reserved value of Beernet indicating the absence of a
secret. As an example, we show the difference for the put function. It used to be:
put(K:Key V:Val)
Stores the value Val associated with the key Key at the node responsible for Hash(Key).
This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key, or there is a value stored
previously by a put operation;
• the value has successfully been stored.
Otherwise the operation returns “abort” and nothing is changed.
The new version is now:
put(S:Secret K:Key V:Val)
Stores the triplet (Hash(Secret) Key Val) at the node responsible for Hash(Key). This
operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key, or there is a triplet stored
previously by a put operation;
• there is no triplet (Secret1 Key Val1) stored at the node responsible for Hash(Key)
such that Hash(Secret) != Hash(Secret1);
• the value has successfully been stored.
Otherwise the operation returns “abort” and nothing is changed.
If no value is specified for Secret, Beernet will assume the call is equivalent to
put(S:NO SECRET K:Key V:Val).
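The semantics above can be modelled locally as follows; this sketch only captures the commit/abort rules, not Beernet's replication or transactions, and the hash function is a stand-in.

```java
import java.util.HashMap;
import java.util.Map;

// Model of the new put(S K V) semantics: a put commits only if the key is
// free or the stored triplet carries the same secret hash. NO_SECRET is the
// reserved "no secret" value.
public class SecretStore {
    public static final String NO_SECRET = "NO_SECRET";
    // key -> {hash(secret), value}
    private final Map<String, String[]> store = new HashMap<>();

    // Stand-in for Beernet's hash function.
    static String hash(String secret) { return String.valueOf(secret.hashCode()); }

    public String put(String secret, String key, String val) {
        String[] triplet = store.get(key);
        if (triplet != null && !triplet[0].equals(hash(secret))) {
            return "abort";                 // protected by a different secret
        }
        store.put(key, new String[]{hash(secret), val});
        return "commit";
    }

    // Old-API put: equivalent to put(NO_SECRET, key, val).
    public String put(String key, String val) { return put(NO_SECRET, key, val); }

    public String get(String key) {
        String[] t = store.get(key);
        return t == null ? null : t[1];     // reading never requires the secret
    }
}
```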
The whole new API of Beernet now contains a secure version of put, write, add,
remove and delete, and also allows explicit set creation. The full semantics of the new
API can be found in the annexes in Chapter 8.
4.4.2 Key already used problem
At the moment, Beernet and Scalaris, like all key/value stores we know of, have
only one key space. This means that multiple services have to share it: if a service
uses a key, another service cannot use it anymore. For some applications, not being
able to use a given key can be very annoying, as keys may have a defined meaning and
the application expects to find a certain type of information at a certain type of key.
This can be solved by designing more complex algorithms at the application level, but
this adds complexity not directly linked to the application, which is, in our view, a
bad idea. Sharing a key space can thus create problems if multiple services want to
use the exact same keys. For instance, if another service decides to store the
usernames of its users at the keys “user/username”, we have a conflict with our
Bwitter application: the applications cannot both have a user with the same
username. This problem cannot be solved with the secrets mechanism we proposed,
because, unlike the unprotected data problem we just presented, the goal is not to
protect the data but to avoid key conflicts between applications. It can, however, be
solved using a capability-based approach.
The simplest way to avoid using the same keys is to prepend a differentiation
number to every key. When an application wants to start using Beernet, it generates
a random root key, for instance 93981452. From then on, the application only uses
keys starting with 93981452. If we can be confident enough that no other application
will use this root key, we can assume that we are working in our own key space. We
can thus design the application accordingly, removing the burden of complex
algorithms to recover from a key already being used. In RFC 4122,2 the authors claim
to be able to generate globally unique identifiers; we could use those identifiers as root
keys, as the chance that such a key would be used twice on the same datastore is
infinitesimal.
This approach is also valid if you want to hide data from some applications or
users: guessing the root key is imaginable, but in practice not possible.
2 Can be found at http://www.ietf.org/rfc/rfc4122.txt
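A sketch of this root-key approach, using an RFC 4122 UUID as the root key:

```java
import java.util.UUID;

// Sketch: carving a private key space out of the shared one by prefixing
// every key with a random root key, here an RFC 4122 UUID generated once
// at first launch of the application.
public class KeySpace {
    private final String rootKey;

    public KeySpace() { this(UUID.randomUUID().toString()); }
    public KeySpace(String rootKey) { this.rootKey = rootKey; }

    // Every application key is prefixed with the root key.
    public String key(String applicationKey) { return rootKey + applicationKey; }
}
```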
4.4.3 Conclusion
In this section we addressed two problems that arise when multiple applications
share the same key space, namely the unprotected data problem and the key already
used problem. The first was solved with the secret mechanism that we designed for
Beernet, now implemented in Beernet version 0.9: key/value pairs and key/value sets
can be protected by a secret needed to modify or delete the values. We even proposed
a finer granularity at the set level: it is possible to create a set controlled by one
person that can nevertheless be read and written by several, while preventing users
other than the managers from modifying or deleting values posted in the set. The
second problem is solved with the capability-based approach. Thanks to those two
mechanisms, we can run multiple applications in parallel on the same Beernet without
any interference between them.
Chapter 5
Algorithms and Implementation
This chapter contains four sections. We first show the implementation of the cloud
based architecture we detailed in section 3.2.2. We then take a closer look at our three
main modules: the Nodes Manager, the Scalaris Connection Manager and the Bwitter
Request Handler. The Nodes Manager is responsible for launching the machines
needed, as well as performing remote operations on those machines. The task of the
Scalaris Connection Manager is to control the access of Bwitter to Scalaris. We finish
this chapter by presenting all the algorithms we designed for the Bwitter Request
Handler. Those algorithms were designed to work with a key/value datastore support-
ing transactions. We also give a theoretical estimation of the number of reads and
writes performed by Bwitter for a given social network.
5.1 Implementation of the cloud based architecture
We did not produce the current implementation of Bwitter directly: we first went
through two other implementations that share several similarities with the current one.
In this section we briefly describe those first two implementations, as they are an
integral part of our project, and finish by detailing the third and final version.
5.1.1 Open peer-to-peer implementation
The first version implemented the open peer-to-peer architecture we presented in
section 3.2.1. In this solution it was necessary to protect data from malicious or unin-
tentional modification at the datastore level. This is why we developed the secrets
mechanism for Beernet described in section 4.4.1. The secrets were used by Bwitter
to protect user data. This version was stateful, meaning that the client had to establish
a session by logging in before being able to use the functions offered by the Beernet
API. This was not really practical because the Beernet nodes had to keep track of all
the connected clients. Moreover, the load balancer had to be configured to always
assign the same client to the same Bwitter node. This first version was never fully
implemented and only reached the draft state.
5.1.2 First cloud based implementation
Along the way we realized, as explained in section 3.2.4, that the cloud architecture
was much better suited to our project. We thus made heavy changes to our imple-
mentation and came up with the second version of our application. Due to unexpected
maturity problems of Beernet, we were not able to test our implementation with it,
so it ran on an emulated DHT.
This implementation was fully operational and even had a functional GUI. It was
presented at the “Foire du Libre” held on the 6th of April at Louvain-la-Neuve1, where
visitors could try it at the Beernet stand.
As time went by, it became apparent that we would not be able to use Beernet for
our implementation, so we decided to switch to Scalaris. Furthermore, after some
preliminary tests on our second implementation, we identified some heavy changes to
be made to our Bwitter API. This was caused by the decision to get rid of the sessions
we were maintaining for our users and to have an API closer to the Representational
State Transfer (REST) principles [27]. This change in the interface, combined with the
need to switch to a new scalable database, made us decide to start a fresh third
implementation.
Figure 5.1: View of our global architecture, highlighting the three main layers: the GUI, Bwitter and Scalaris.
1“Foire du Libre” is a fair celebrating open source software, organised by Louvain-li-Nux: http://www.louvainlinux.be/foire-du-libre/, last accessed 05/08/2011
5.1.3 Final cloud based implementation
We will now present the final version of our application, which implements the cloud
based architecture we detailed in section 3.2.2. Figure 5.1 shows a full representation
of our implementation.
The GUI
We currently do not have a fully functional GUI but a minimal one demonstrating
the important features of our application. Indeed, we focused on the design of other
aspects of our implementation and thus leave the complete implementation of the GUI
as future work. We could not adapt the previous version of the GUI as it was designed
for an old version of our application using a significantly different version of the API.
The GUI was implemented using the Flex technology from Adobe2. This technology
allows the creation of Rich Internet Applications (RIA). We decided to create a GUI
that could be accessed through a web browser so that it could be used directly with
any operating system and even with smart phones. A screenshot of this basic GUI can
be seen in Figure 5.2.
Figure 5.2: The GUI of our second implementation.
2http://www.adobe.com/products/flex/, last accessed 05/08/2011
Bwitter layer
This is our main layer; it contains a Nginx3 load balancer, a Tomcat4 server, the
Bwitter Request Handler (BRH), the Nodes Manager (NM), the Scalaris Connections
Manager (SCM) and a cache system.
The Nginx load balancer is not really a part of our implementation. Indeed, we did
not modify it, and the only thing needed in order to use it is to configure it with the IP
addresses of the Bwitter nodes. As those nodes are stateless, no other special
configuration is needed.
The Tomcat 7.0 application server uses Java servlets from Java EE to offer a web-
based API and relays the requests to the BRH. Those Tomcat servers are accessed
through a reverse proxy server, in this case the Nginx load balancer, which is configured
to support 10k concurrent connections. The Nginx load balancer can be configured to
serve static content, for example the GUI application, as well as to do load balancing
for the Bwitter nodes. The connections of the GUI to the web-based API are performed
over HTTPS in order to guarantee a secure channel.
We currently have the BRH, NM and SCM running on Amazon, they are detailed
in sections 5.2, 5.3 and 5.4.
The cache The SCM uses Ehcache v2.4.05 as cache system in order to increase
performance and mitigate the popular value problem we discussed in section 3.2.3.
Note that we have one cache per Bwitter node and that the caches are not synchronised.
The values in the caches have a time to live of one minute so that they are refreshed
periodically. Values are added to the cache during read operations, not during
write operations. The cache only keeps three different kinds of values in memory:
tweets, passwords and references to tweets; all the other elements are accessed directly
through Scalaris. As previously explained, the tweets were designed to be as immutable
as possible so that they could be included in the cache. The references to tweets are
static too and are used in the posting recovery mechanism. The passwords are values
that are used very often, as for each post we must fetch the hash of the password stored
in the system in order to verify that the password provided is the correct one. The
three elements cited above are only accessed through the cache if they are accessed
via a transaction in which they are the only elements involved; this is done in order to
keep the strong consistency properties in the other cases. For example, in the first
pseudo code below, the two elements would be accessed through Scalaris.
{
begin transaction
tweet t = read(someTweetKey)
write(someKey , someValue)
end transaction
}
3http://nginx.net/, last accessed 08/08/2011
4http://tomcat.apache.org/, last accessed 08/08/2011
5http://ehcache.org/, last accessed 08/08/2011
In the second example below, the tweet can be accessed through the cache because it
is the only element involved in its transaction.
{
begin transaction
tweet t = read(someTweetKey)
end transaction
begin transaction
write(someKey , someValue)
end transaction
}
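The per-node cache policy described above can be sketched as follows. The real system relies on Ehcache; this stand-in sketch uses a plain ConcurrentHashMap with a fixed time to live, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in sketch of the per-node cache policy (the real implementation
// uses Ehcache): entries live for a fixed TTL, are inserted on reads
// only (never on writes), and each Bwitter node has its own,
// unsynchronised instance.
public class TtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Called after a successful datastore read; never on writes.
    public void putFromRead(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns null on a miss or an expired entry, forcing a fresh read
    // from Scalaris, so stale values disappear once the TTL elapses.
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || System.currentTimeMillis() > e.expiresAt) {
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```

With a one-minute TTL this matches the behaviour described above: a popular tweet is served from the local cache for at most a minute before being re-read from Scalaris.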
Scalaris layer
The lowest layer is the Scalaris layer, which is accessed and managed via the SCM
and NM. We started the development of our system with Scalaris version 0.2.3 and
switched to version 0.3.0 when it was released on the 15th of July, as it gave better
performance and corrected some bugs.
5.2 Nodes Manager
The Nodes Manager (NM) was designed to facilitate our tests and to allow us to
easily control nodes. The NM can start Bwitter nodes as well as Scalaris nodes. We
mainly use it to start the Scalaris nodes forming the initial ring for our tests and to
start additional Scalaris nodes during our elasticity tests.
As we will further explain during our experiments in chapter 6, we are working
with the Amazon cloud infrastructure. We made heavy use of the Java API6 Amazon
offers in order to control the nodes, as it is closely linked to the tasks the NM performs.
Indeed, this API allows starting new machines on the cloud and checking the state of
the machines associated with an account. We list below the main tasks the NM performs
and describe briefly how we realized them.
As just said, the NM can be used to start new Bwitter nodes, but we did not design
any mechanism to detect when nodes should be added or removed. There are different
kinds of observable behaviours preceding flash crowds in social networks, and it should
be possible to study them in order to predict flash crowds, but we did not do so. We
rather decided to focus on other aspects of our system.
Start new machines The NM can send commands to Amazon in order to start new
machines of a given type (Scalaris nodes or Bwitter nodes). We must specify the security
group (which indicates which ports must be open on a machine), the location of the
machine (US East, Europe, ...), the type of instance (m1.small, c1.medium, ...) and
finally the security keys used to access the machines remotely. It is also possible to add
6http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/ec2/AmazonEC2.html, last accessed 07/08/2011
tags to machines (in a key/value pair fashion) to identify them more easily. We use
those to make a clear distinction between the Bwitter nodes and the Scalaris nodes.
Wait for machines to be started Once the command to start the machines has
been sent to Amazon, it is necessary to wait for the machines to be running. We do this
by regularly requesting the states of all the instances of the Amazon account and
waiting until all the machines are in the running state. It is important to understand
that the objects returned by Amazon API calls are not updated dynamically. This means
that an object representing an instance may not accurately represent that instance and
must be refreshed regularly to avoid working with old information.
Machines can be in five states: running (the machine is started), shutting-down (the
machine is stopping), pending (the machine is being started), stopped (the machine is
stopped but can be restarted) and terminated (the machine is stopped and cannot be
started anymore).
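The wait-until-running loop can be sketched as follows. In the real Nodes Manager the states come from repeated describeInstances calls to the Amazon API; here a Supplier stands in for that call, and all class and method names are our own:

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of the wait-until-running loop of the Nodes Manager. In the
// real implementation the states come from a fresh Amazon API request
// on every iteration, because objects returned by the API are never
// updated after the fact; here a Supplier stands in for that call.
// Returns the number of polls performed.
public class InstanceWaiter {
    public static int waitAllRunning(Supplier<List<String>> fetchStates, long pollMillis) {
        int polls = 0;
        while (true) {
            polls++;
            // Always re-fetch: a cached instance object would be stale.
            List<String> states = fetchStates.get();
            if (states.stream().allMatch("running"::equals)) {
                return polls;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```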
Check machine reachability Some machines can be unreachable even when they are
in the running state. We do not know the reasons why machines are sometimes
unreachable, but we noticed that machines in Amazon sometimes do not respond to
ping requests even from inside the private LAN of their own security group. In addition,
they sometimes respond to ping requests but not to SSH because they did not correctly
initialize their security keys at boot time. This problem gave us a lot of trouble during
the tests: when it happens it is necessary to reboot the machine and sometimes to
restart the test.
Launch a fresh Scalaris ring Once the machines are launched we still need to start
Scalaris on them. This first requires creating the configuration file, whose main use is
to indicate which nodes are already in the ring. By default we then stop any remaining
instances of Scalaris running on those machines before restarting it. The first node is
launched and the configuration file for the other nodes is built. We then launch the
remaining nodes sequentially, one every 2 seconds. Finally, we wait a small period of
time, proportional to the size of the ring, to let it stabilise; a fresh ring is then launched.
Add nodes to an existing ring This is similar to starting a new ring, except that
several nodes are already in the ring, therefore the configuration file contains several
nodes and not only the first one.
Reboot a node Particularly with version 0.2.3 of Scalaris, which we used at the
beginning of our work, we frequently ended up in situations where a node was correctly
inserted in the ring but it was impossible to perform any write on it. We thus created
a function to restart such a Scalaris node and insert it again in the ring so that the
test could continue normally. This usually happens right after the insertion of the node,
so we do a series of dummy writes to test whether the node is correctly bootstrapped
and, if it is not, we restart it. In version 0.3.0 of Scalaris this bug is nearly nonexistent.
Most of those functions require performing remote actions. In order to send files and
run commands on remote machines we use the Runtime class of Java combined with SSH
and SCP. It was necessary to use the two options “-o UserKnownHostsFile=/dev/null”
and “-o StrictHostKeyChecking=no” with SSH and SCP in order to avoid checking
whether the host was already seen before; otherwise the execution stalls because SSH
waits for an answer that never comes. For example, to stop Scalaris on a machine, we
run the following command using the exec method of the Runtime object.
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i BwitterXM.pem
ubuntu@10.118.130.36 "sudo killall beam"
We use a thread pool in order to run several commands and send files in parallel to
improve the throughput, which is a must because the time to launch a complete ring
can otherwise be very long.
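This parallel remote execution can be sketched as follows, with a hypothetical runCommand function standing in for the actual Runtime.exec SSH invocation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

// Sketch of running remote commands in parallel with a thread pool, as
// the NM does for ssh/scp. runCommand is a stand-in for the real
// Runtime.exec() call; it returns the command's exit code.
public class ParallelCommands {
    public static List<Integer> runAll(List<String> commands,
                                       ToIntFunction<String> runCommand,
                                       int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit every command so they run concurrently.
            List<Future<Integer>> futures = new ArrayList<>();
            for (String cmd : commands) {
                futures.add(pool.submit(() -> runCommand.applyAsInt(cmd)));
            }
            // Collect exit codes, waiting for each command to finish.
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> f : futures) {
                try {
                    exitCodes.add(f.get());
                } catch (Exception e) {
                    exitCodes.add(-1); // treat a failed command as exit code -1
                }
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }
}
```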
In conclusion, the NM allows us to automatically and efficiently launch Scalaris
rings. This was a valuable tool, as we needed to start a new ring for each test we did.
Once a ring is launched we still have to connect to it; we explain how we do this in
the next section, which is dedicated to the Scalaris Connections Manager.
5.3 Scalaris Connections Manager
The Scalaris Connections Manager (SCM) is implemented in a producer/consumer
fashion. The producers are the Bwitter functions. They produce small pieces of work
which use Scalaris functions; we call them Scalaris runnables (SR). SRs typically contain
one Scalaris transaction, but they can contain several provided that the failure of one
of them does not introduce an inconsistent state. The SCM stores them in a blocking
FIFO queue, and the consumers, which we call Scalaris workers (SW) and which are
managed by the SCM, access this queue to execute the SRs. Bwitter functions can
efficiently wait until the result of an SR is computed. Accesses to the SRs are
synchronised, and the SWs notify any function that was waiting for the result of an SR
as soon as it is computed or the execution of the SR is aborted. This design allows the
Bwitter layer running on top of Scalaris to easily make parallel requests to different
Scalaris nodes without taking care of any connections or threads: it suffices to take a
big task, split it into several SRs and push them onto the queue of the SCM.
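This producer/consumer scheme can be sketched in Java roughly as follows. The class and method names are illustrative, and a Callable stands in for an SR:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the SCM pattern: Bwitter functions (producers)
// wrap a Scalaris runnable (SR) in a FutureTask and push it onto a
// blocking FIFO queue; each Scalaris worker (SW) owns one connection
// and drains the queue. Callers block on the task handle until the
// result of their SR is computed.
public class ConnectionsManager {
    private final BlockingQueue<FutureTask<?>> queue = new LinkedBlockingQueue<>();

    // Producer side: enqueue an SR and return a handle to wait on.
    public <T> FutureTask<T> submit(Callable<T> scalarisRunnable) {
        FutureTask<T> task = new FutureTask<>(scalarisRunnable);
        queue.add(task); // unbounded queue: never blocks the producer
        return task;
    }

    // Consumer side: one such loop per SW, each with its own connection.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run(); // run the SR on this SW's connection
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Block until the SR's result is available.
    public static <T> T await(FutureTask<T> task) {
        try {
            return task.get();
        } catch (Exception e) {
            throw new RuntimeException("SR aborted", e);
        }
    }
}
```

Note that, as discussed at the end of this section, an SR must not submit another SR and wait on it: with a single worker that would deadlock.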
We show in the next chapter, in section 6.2.2, that controlling the number of connec-
tions to Scalaris nodes is important to get the best performance. Opening too many
connections increases the degree of conflicts and does not improve the performance;
on the contrary, having too few connections lowers it. We thus want to control the
number of connections to Scalaris and avoid opening and closing them, as this
needlessly consumes resources. Moreover, a connection cannot be used by several
threads: it can only handle one transaction at a time, otherwise unknown errors start
appearing. It is thus crucial to control access to a connection correctly so that only
one SW accesses it at a time.
We solve this problem by associating a dedicated connection with each SW. We could
have solved it differently: the SCM could have managed a pool of connections instead of
a pool of SWs and dispatched arriving work to a new thread with a free connection. We
believe our solution is better because we do not need to create a new thread for each
SR. A thread is created only once, when a SW is created, which greatly reduces the
time used to manage the life cycle of threads. Figure 5.3 shows the architecture of the
whole SCM connected to Scalaris nodes.
Figure 5.3: Scalaris Connections manager connected to Scalaris nodes.
It is possible to call the SCM to add a new SW to the existing ones or to remove
a SW on the fly; this does not need to be statically configured. The SCM always
connects the new SW to the Scalaris node that has the lowest number of connections. It
does so by associating a Scalaris node with a SW; the SW is then responsible for opening
the connection to the Scalaris node. The SWs are responsible for managing the
connection they have opened. They automatically reconnect if the connection is lost and
they also restart a Scalaris node if it has crashed. This must be done carefully because
several SWs can be using the same Scalaris node; we thus synchronized them so that
only one SW is responsible for restarting a dead Scalaris node. The state machine of a
SW can be seen in Figure 5.4; it highlights the different states and the events leading
from one state to another. The SW starts by trying to connect to its Scalaris node; if too
many connection attempts fail, the SW restarts the node and retries. Once the SW is
connected it waits for SR jobs to be run on the Scalaris node, runs them, retrieves the
results and waits for another SR. If the connection with the Scalaris node is lost, the
SW tries to reconnect.
Figure 5.4: State machine of a Scalaris Worker.
An important thing to notice about this design is that an SR must never create another
SR, add it to the blocking queue of the SCM and wait for its result. This situation can
indeed create deadlocks. Take the simple case where we have only one SW: it takes an
SR, called sr1, and executes it. If sr1 creates another SR, called sr2, adds it to the
blocking queue and waits for its result, we have a deadlock. Indeed, no SW will ever
execute sr2, as our only SW is already busy with sr1.
5.3.1 Failure handling
An SR can fail, but as you can see in Figure 5.4, the SW runs the SR several
times before aborting it and taking a new one. This implies that SRs must be designed
in such a way that when they fail they do not introduce partial state in the database
and can thus be restarted without any risk. Being able to restart jobs at this low layer
is important because it simplifies the algorithms that run on top of it: they are not
forced to develop their own strategy to recover from failures of Scalaris operations.
This is needed if we want to avoid aborting high level tasks too often. Those tasks can
be quite complex and contain several SRs, which increases the probability that at least
one SR fails. We only throw an exception at the higher level when the SR has failed
several times. Algorithms running on top of the SCM can then decide whether they
want to completely abort the task they were running or to resend the SR to the SCM.
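The retry policy can be sketched as follows. This is a simplified illustration and the names are our own:

```java
import java.util.concurrent.Callable;

// Sketch of the SW retry policy: an SR leaves no partial state when it
// fails, so it can safely be re-run up to maxTrials times before the
// failure is surfaced to the algorithm above (which may then abort its
// task or resubmit the SR). Assumes maxTrials >= 1.
public class RetryingRunner {
    public static <T> T runWithRetries(Callable<T> sr, int maxTrials) {
        RuntimeException last = null;
        for (int trial = 1; trial <= maxTrials; trial++) {
            try {
                return sr.call();
            } catch (Exception e) {
                // Safe to retry: SRs are restartable without risk.
                last = new RuntimeException("SR failed on trial " + trial, e);
            }
        }
        // Only after maxTrials failures does the exception reach the
        // higher level.
        throw last;
    }
}
```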
5.4 Bwitter Request Handler
In this section we detail the most important algorithms we used in the Bwitter
Request Handler: posting a tweet, reading tweets and deleting tweets.
We have developed two different approaches for posting and reading tweets: the
push and the pull. As represented in Figure 5.5, the push approach (on the left) is
when the user who posts a tweet is responsible for inserting the references inside the
lines of his followers. The pull approach (on the right) is when the user fetches all the
references himself from the lines of the people he follows.
Figure 5.5: Representation of the read (dotted arrows) and write (full arrows) processes. The tweets are posted in lines (rectangles) and read from them. Left) Push design with one line per reader. Right) Pull design with one line per writer.
Note that in the following algorithms we never post the whole tweet object into
the lines of the followers, but rather a reference to it. So when we say we post a tweet
in someone's line, we mean the tweet reference. As explained in the previous chapter,
a reference contains the posting date, the username of the author or retweeter, the
username of the original poster if it is a retweet, and the key of the referenced tweet.
This limits the amount of redundant data stored and makes it possible to easily edit or
delete a tweet, as explained in section 5.4.1 where we detail the operations to delete
a tweet.
All the algorithms we designed use the Scalaris Connection Manager we just pre-
sented. We do not develop any recovery mechanism to handle the failure of the execu-
tion of one of the SRs involved in an algorithm. We assume that the recovery mechanism
of the Scalaris Connection Manager is sufficient, so that failures will be scarce. A failure
of one SR can thus, in some cases, make a higher level function abort, forcing the user
to manually restart the operation.
The algorithms are developed to work on a transactional key/value datastore with
a list abstraction. However, this abstraction is not strictly necessary: we could use a
classical key/value datastore and develop our own parsing to simulate lists.
An important fact to keep in mind is that concurrent reads on the same key do not
conflict in Scalaris: two parallel reads in two different transactions will not lead to an
abort. On the other hand, two parallel transactions, one writing a key and the other
writing or reading the same key, will conflict, and this can lead to an abort.
In this section, we explicitly detail the pseudo code for posting and reading tweets
in the pull and push approaches, and compare the number of Scalaris operations both
need. We do not take into account the fact that some operations can fail. The number
of operations in the worst case can easily be computed from the number of operations
in the normal case by multiplying it by a factor of k, where k is the maximum number
of trials for one SR. We use the notation SR(some piece of code) to indicate that a
piece of code is executed by the SCM inside an SR; this call is thus non-blocking. To
get the result of a piece of code executed inside an SR we write result = SR(some piece
of code); this blocks until the result is computed or an exception is thrown.
5.4.1 The push approach
Post a Tweet
Posting a tweet is a core function of our application; it is thus important to have an
efficient and robust way to post tweets. Our tweet posting algorithm must be able to
handle the failure of a datastore node as well as the failure of an application node
during the posting of the tweet. This algorithm must also scale with the number of
followers, keeping in mind that some users can have millions of followers.
Below you can see the skeleton of the algorithm. It is composed of three main parts:
posting the tweet object, posting the references to the followers' lines and updating
the value of the last tweet correctly posted. Those are detailed in the next subsections.
The algorithm can be adapted for retweets and replies to tweets, but we do not detail
this here as it is very similar.
postNewTweet(posterName , msg){
  // First step
  // Post the tweet object. If this step succeeds the tweet is
  // eventually posted everywhere.
  tweetNbr = SR(posttweet(posterName , msg))
  // Second step
  // Produce the chunkProcessors. Each of them is an SR responsible for
  // posting the remaining tweets to the followers' lines for a given
  // chunk of the Topost set until it reaches tweetNbr.
  SRlist = SR(produceChunkProcessors(posterName , tweetNbr))
  // Add all the chunkProcessors produced to the SCM.
  foreach sr in SRlist {
    add sr to the SCM
  }
  // Wait for the termination of the chunkProcessors and check that none
  // failed (all the tweets until tweetNbr correctly posted for all the
  // followers).
  failed = false
  foreach sr in SRlist {
    try {
      // Block until the result is computed; no real result is returned,
      // we just check that nothing went wrong.
      result = sr.get
    } catch (exception) {
      failed = true
    }
  }
  // Third step
  // If none of the previously launched chunkProcessors has failed, mark
  // the tweets as posted until tweetNbr.
  if(!failed){
    SR(markTheTweetsAsPosted(posterName , tweetNbr))
  }
}
The algorithm starts with the posting of the tweet. If this first step finishes without
errors, we can guarantee that the tweet will eventually be posted on the lines of all the
followers; otherwise we abort the posting of the tweet and the user must restart it
manually. The second step is responsible for pushing the information to the lines of the
followers. This is the heavy part of the job, so we decided to cut it into several
independent SRs that run on several Scalaris nodes in parallel. We added a repair
mechanism which logs the operations successfully performed in order to recover from
failures during this part. Finally, the last step marks the tweet as correctly posted
on all the lines. As just mentioned, only the first step is needed to have the tweet
eventually posted to all the followers. Subsequent executions of the algorithm will
indeed automatically repair previously started work that failed. This repair is done
during the “Post the references” phase detailed later in this section.
Post the tweet object This first step is executed as one SR to guarantee atomicity.
As mentioned, if it succeeds the tweet will eventually be posted on all the lines. A
tweetNbr uniquely identifies one tweet for a given user; as you can see below, it is
attributed at the creation of the tweet.
posttweet(posterName , msg){
  begin transaction
    tweetNbr = read(/user/posterName/tweet/size)
    postingDate = currentDate()
    tweetReference = buildTweetRef(posterName , tweetNbr , postingDate)
    tweet = buildTweet(posterName , tweetNbr , postingDate , msg)
    write(/user/posterName/tweet/tweetNbr/reference , tweetReference)
    write(/user/posterName/tweet/size , tweetNbr+1)
    write(/user/posterName/tweet/tweetNbr , tweet)
  end transaction
  return tweetNbr
}
You can notice that we save the tweet reference in order to easily recover from
failure later.
Post the references We now explain the next step of the posting algorithm, which
posts the references on the lines of the followers. This step repairs any previously
started tweet posting that failed after the first step, and it can also be run for this
purpose only. This repair mechanism is needed because this part of the algorithm is
highly subject to failures: it writes to the line of every follower, so it can potentially
conflict with followers reading their lines and with other posters posting their tweets.
This step is cut in two substeps: the first substep is to create the chunkProcessors and
the second one is to execute them.
Some stars have millions of followers, so it would not be scalable to do the whole
work in one big transaction. Therefore, we split the work into several SRs run on dif-
ferent Scalaris nodes. Remember that the Topost set is cut into several chunks. We
associate one SR with each chunk of the Topost set; it is responsible for posting all the
remaining tweets (usually just the newly posted tweet, if there were no failures before)
to all the followers in its attributed chunk. We call an SR with this precise task
a chunkProcessor. A chunkProcessor stops when it reaches the tweetNbr with
which it was initialized; tweetNbr corresponds to the tweet number of the last tweet
that the chunkProcessor must post to the lines. If a chunkProcessor finishes with-
out error, we are sure that all the tweets up to tweetNbr are correctly posted for this
Topost chunk. The pseudo code below details the creation of the chunkProcessors.
// tweetNbr is the last tweet to post on the lines.
produceChunkProcessors(posterName , tweetNbr){
  // First we check whether the job is already done. This step can be
  // skipped as it is just an optimisation. This test is equivalent to
  // testing that each counter associated with a chunk of the Topost set
  // has a value of at least tweetNbr, but is quicker as only one key
  // must be accessed.
  begin transaction
    lastTweetNbrCorrectlyProcessed = read(/user/posterName/tweet/processed)
  end transaction
  if(lastTweetNbrCorrectlyProcessed >= tweetNbr)
    return new emptyList
  // Read the number of chunks in the Topost set.
  begin transaction
    nbrOfToPostSetChunks =
        read(/user/posterName/topost/size) / topostSetChunkSize + 1
  end transaction
  // Create the different chunkProcessors.
  chunkIndex = 0
  SRlist = new emptyList
  while(chunkIndex < nbrOfToPostSetChunks){
    SRlist.add(new chunkProcessor(posterName , chunkIndex , tweetNbr))
    chunkIndex++
  }
  return SRlist
}
You can notice that in this part of the algorithm we only create the chunkPro-
cessors and do not execute them. They are executed by the postNewTweet skeleton
detailed at the beginning of the section. Indeed, chunkProcessors are SRs, and, as
explained at the end of section 5.3 where we present the SCM, an SR must never wait
for the result of another SR it launched, because this can create deadlocks. We detail
below the algorithm of a chunkProcessor, which shows how we post the references
to all the lines of the followers contained in a chunk.
chunkProcessor(posterName , chunkIndex , tweetNbr){
  while(true){
    begin transaction
      // Compare the value of the current chunkCounter with tweetNbr; if
      // chunkCounter is bigger or equal the job is done.
      chunkCounter = read(/user/posterName/topost/chunkIndex/counter)
      if(chunkCounter >= tweetNbr) return
      // Get the reference corresponding to the next tweet that is not
      // posted. The tweet number of this tweet is equal to
      // chunkCounter+1, as chunkCounter is the last correctly posted
      // tweet for this Topost set chunk.
      tweetReference = read(/user/posterName/tweet/(chunkCounter +1)/reference)
      // Read the Topost chunk containing the references to the
      // followers' lines.
      lineKeys = read(/user/posterName/topost/chunkIndex)
      // Add the reference to all the lines referenced in the chunk; the
      // function called must execute in the same transaction as the
      // current one.
      foreach lineKey in lineKeys {
        addReferenceToLine(tweetReference , lineKey)
      }
      // Log the progress for this chunk so that a failed posting can be
      // repaired later.
      write(/user/posterName/topost/chunkIndex/counter , chunkCounter +1)
    end transaction
    // If chunkCounter has reached tweetNbr exit the loop.
    if(chunkCounter +1 >= tweetNbr) return
  }
}
We must now detail the addReferenceToLine function, which is responsible for posting
a particular reference to the line of a follower. Remember that the lines containing the
references are divided into chunks too. For this part we thus have two choices: either
we cut the line while posting, or we put the burden of cutting at another moment (for
example when a user reads his new tweets).
Triggered cutting In the “triggered cutting” solution we do not cut the line during
the posting. We indeed prefer to do it at another moment in order to ease the post
tweet function, which is already quite subject to failures. The only necessary operation
is thus to add the tweet reference to the head.
addReferenceToLine(tweetReference, lineKey){
    add(lineKey/head, tweetReference)
}
The line must thus be cut at another moment. We chose to do it when a user reads
his tweets. Indeed the only operation needed to check if the head must be cut is to
read the head, which is almost always what the read tweet operation does, as the head
chunk contains the latest tweets. Hence the algorithm presented below must be run
each time a user reads his tweets. Most of the time the algorithm does not impose any
overhead on readings as the head must only be cut when it is full.
By taking advantage of the read tweet operation, we can avoid reading the head
during the cutting mechanism. However, we present below a version of the cut mech-
anism where we read the head in order to show the complete algorithm. In a real
implementation the head would be given as argument.
splitHead(lineKey){
    begin transaction
        headChunk = read(lineKey/head)
        headChanged = false
        nbrOfChunkCreated = 0
        if(headChunk.size <= nbrTweetsPerChunk)
            return
        // Number of chunks in the line excluding the head.
        nbrOfChunkInLine = read(lineKey/nbrchunks)
        // While the head is too big we transfer nbrTweetsPerChunk tweets to a
        // new chunk.
        while(headChunk.size > nbrTweetsPerChunk){
            headChanged = true
            // Remove the nbrTweetsPerChunk oldest tweets from headChunk; this
            // does not modify the datastore, just our local copy.
            newChunk = removeOldest(headChunk, nbrTweetsPerChunk)
            // Write the new chunk in the line.
            write(lineKey/(nbrOfChunkInLine + nbrOfChunkCreated), newChunk)
            nbrOfChunkCreated++
        }
        // If the head has changed, write the new head and update the number of
        // chunks.
        if(headChanged){
            write(lineKey/head, headChunk)
            write(lineKey/nbrchunks, nbrOfChunkInLine + nbrOfChunkCreated)
        }
    end transaction
}
In conclusion, we can observe that posting on the line is very cheap, because we only
have to add an element to a set, but we have to pay the price later to split the line.
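The triggered-cutting pair (append-only posting plus a deferred head split) can be sketched in Python, with a plain dict standing in for the Scalaris key/value store and no transactions shown; all names (`store`, `split_head`, `NBR_TWEETS_PER_CHUNK`) and the chunk size are illustrative, not Bwitter's actual API.

```python
NBR_TWEETS_PER_CHUNK = 3  # example value, kept small for illustration

def add_reference_to_line(store, line_key, tweet_ref):
    # Posting only appends to the head; cutting is deferred to read time.
    store.setdefault(f"{line_key}/head", []).append(tweet_ref)

def split_head(store, line_key):
    # Run when a user reads his tweets: move full groups of the oldest
    # references out of the head into numbered chunks.
    head = store.get(f"{line_key}/head", [])
    if len(head) <= NBR_TWEETS_PER_CHUNK:
        return
    nbr_chunks = store.get(f"{line_key}/nbrchunks", 0)
    created = 0
    while len(head) > NBR_TWEETS_PER_CHUNK:
        # Oldest references are at the front of the head list.
        new_chunk, head = head[:NBR_TWEETS_PER_CHUNK], head[NBR_TWEETS_PER_CHUNK:]
        store[f"{line_key}/{nbr_chunks + created}"] = new_chunk
        created += 1
    store[f"{line_key}/head"] = head
    store[f"{line_key}/nbrchunks"] = nbr_chunks + created

store = {}
for ref in range(7):
    add_reference_to_line(store, "/user/bob/line/main", ref)
split_head(store, "/user/bob/line/main")
```

After seven posts and one split, references 0–5 end up in two numbered chunks and only the newest reference remains in the head, mirroring the splitHead pseudocode above.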
Cutting the line while posting We now present the addReferenceToLine version
where we cut the line while posting. We add the tweet to the head of the line and, if
the head is full, we flush the head and create a new chunk. The overhead for cutting is
thus paid while posting, but not at each post.
addReferenceToLine(tweetReference, lineKey){
    // Read the head.
    headList = read(lineKey/head)
    // Check if the head is full and we need to create a new chunk.
    if(headList.size >= nbrTweetsPerChunk){
        // Replace the head by a fresh one containing only the new tweet.
        newList = new list
        newList.add(tweetReference)
        write(lineKey/head, newList)
        chunkNumber = read(lineKey/nbrchunks)
        // Write the old head to the new chunk and update the number of chunks
        // in the line.
        write(lineKey/chunkNumber, headList)
        write(lineKey/nbrchunks, chunkNumber + 1)
    }
    else{
        headList.add(tweetReference)
        write(lineKey/head, headList)
    }
}
Observe that we usually do not perform more operations than in the triggered cutting,
as we only need to create a new chunk when the head is full. Adding a reference to
a line thus usually takes 1 read and 1 write, and occasionally 2 reads and 3 writes.
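The cut-while-posting variant can be sketched the same way, again with a dict standing in for the datastore and illustrative names; the two branches correspond to the usual 1-read/1-write case and the occasional 2-read/3-write case described above.

```python
NBR_TWEETS_PER_CHUNK = 3  # example value, kept small for illustration

def add_reference_to_line(store, line_key, tweet_ref):
    head = store.get(f"{line_key}/head", [])
    if len(head) >= NBR_TWEETS_PER_CHUNK:
        # Head is full: start a fresh head holding only the new reference,
        # flush the old head into a numbered chunk, and bump the chunk count
        # (the occasional 2-read/3-write case).
        store[f"{line_key}/head"] = [tweet_ref]
        chunk_number = store.get(f"{line_key}/nbrchunks", 0)
        store[f"{line_key}/{chunk_number}"] = head
        store[f"{line_key}/nbrchunks"] = chunk_number + 1
    else:
        # Usual case: one read and one write on the head.
        head.append(tweet_ref)
        store[f"{line_key}/head"] = head

store = {}
for ref in range(7):
    add_reference_to_line(store, "/user/bob/line/main", ref)
```

The resulting layout is the same as with triggered cutting; only the moment at which the cost is paid differs.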
Chronological ordering We have shown how to post tweets on lines in a reliable and
efficient way; however, some tweets might be misplaced due to application failures
during the posts or latency in the network. We propose an improvement to maintain
strong chronological ordering between the tweets in all situations.
The idea is to associate a date to each chunk of a line. This date would be
equal to the posting date of the newest tweet of the previous chunk. This way, when
we add a tweet to a chunk, we check that its posting date is newer than the date
associated with the chunk. If it is not the case, we walk back through
the line, find the first chunk for which it is true, and add the tweet to that chunk.
This means that we can have more than nbrTweetsPerChunk tweets per chunk but
this has no repercussions on the other algorithms. We can adapt the two algorithms
described above to impose chronological ordering as just explained but we do not detail
it here.
This complicates the posting algorithm, which should be as light as possible in order
to achieve the best scalability. We believe that it is not absolutely crucial to have
perfect ordering between the tweets, and that we thus should not make the post
algorithm even more complex.
Mark tweet as correctly posted This is the final step of the algorithm: if every-
thing succeeded before, we can be sure that the tweets of the user up to tweetNbr are
correctly posted on the lines of the followers present in his Topost set at the time of
the posting. We can thus update the lastTweetNbrCorrectlyProcessed variable to
tweetNbr. As already mentioned, this step is not mandatory and could be skipped; it
only permits testing later, more efficiently, that the tweets were correctly posted in the
produceChunkProcessors part of the postTweet algorithm. We must take into account
that several runs of the postNewTweet algorithm can be running concurrently. This
can happen if a user posts two tweets in quick succession or if the recovery part of the
algorithm was called in response to some event. It is thus crucial to test the value of
lastTweetNbrCorrectlyProcessed before erasing it with tweetNbr: indeed, another
run posting a newer tweet (and thus a tweet with a higher tweetNbr than the one we
are working on) may just have written a newer value for lastTweetNbrCorrectlyPro-
cessed.
markTheTweetsAsPosted(posterName , tweetNbr){
begin transaction
lastTweetNbrCorrectlyProcessed = read(/user/posterName/tweet/processed)
if(lastTweetNbrCorrectlyProcessed < tweetNbr)
write(/user/posterName/tweet/processed , tweetNbr)
end transaction
}
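The conditional update above is a monotonic watermark. A minimal sketch, assuming a dict stands in for the store (in Bwitter the read/compare/write runs inside one Scalaris transaction); names are illustrative.

```python
def mark_tweets_as_posted(store, poster, tweet_nbr):
    key = f"/user/{poster}/tweet/processed"
    last = store.get(key, -1)
    # Only advance the watermark: a concurrent run posting a newer tweet may
    # already have written a higher value, which must not be overwritten.
    if last < tweet_nbr:
        store[key] = tweet_nbr

store = {}
mark_tweets_as_posted(store, "alice", 5)  # advances the watermark to 5
mark_tweets_as_posted(store, "alice", 3)  # stale concurrent run: no effect
```

Without the comparison, the second (stale) run would silently move the watermark backwards and later recovery would re-post tweets 4 and 5.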
Theoretical performance analysis This algorithm is heavy, which is normal: in
the push approach we favour the reads and put the burden on the writes. Let us try
to get an idea of how many operations these algorithms need. When we talk about
operations we mean reads and writes on Scalaris. We have observed while testing that
writes and reads on Scalaris take approximately the same time. Adding a value to
a list using Scalaris requires a read and a write, because there are no built-in
operations on sets. We now analyse the three steps of the algorithm.
The first step is the posting of the tweet in the datastore. It requires 4 operations (1
read and 3 writes) in one transaction to post the tweet object, post the tweet reference
and update the tweetNbr.
The second step is the posting of the references in all the lines. This is the heaviest
step, the number of operations depends on the number of followers (nbrFollowers)
that a given user has. We first check the lastTweetNbrProcessed (one read): the
job is done if the check indicates that everything is correctly posted, this can happen
during recovery and concurrent posting. Assuming that we are in the normal situation
where everything goes correctly, we have one tweet to post and all the previous tweets
were correctly posted on all the lines. We read the size of the Topost set (one read)
then we can dispatch the work for each chunk of the topost set. So we need two reads
to create the chunkProcessors.
Each chunk of the Topost set requires one transaction, the size of the transaction
(number of keys it works on) depends on the number of followers per chunk of the
Topost set (nbrOfFollowersPerChunk). The size of the transaction for each chunk is
proportional to nbrOfFollowersPerChunk and the number of transactions is inversely
proportional to nbrOfFollowersPerChunk. We assume we had no failures previously
and that there is only one tweet to post for all the chunks. Each chunkProcessor thus
reads and writes its counter once. This requires 2 × nbrOfTopostSetChunks
operations, or equivalently 2 × nbrFollowers/nbrOfFollowersPerChunk.
We must post the reference on the lines of each follower in the Topost set. The
complexity of this operation depends on whether we cut the line while writing or not.
If we do not cut it, we only need 2 operations to update the head with the new reference.
If we cut while posting, we must sometimes also create a new chunk and flush the head,
which requires 3 additional operations. On average we must cut the line every nbrTweet-
sPerChunk tweets; we thus take the amortised cost of cutting for one posting to
be 3/nbrTweetsPerChunk. Those operations must be done for every follower, so we
finally get 2 × nbrFollowers operations for the posting without cutting and
nbrFollowers × (2 + 3/nbrTweetsPerChunk) operations for the posting with cutting.
Although we do not cut the lines while posting in the first option, we would still
like to compute the overhead of cutting those lines at another moment, for example
while reading the new tweets, as this allows us to avoid the extra read of the head.
For each new chunk created while cutting the head we need to do one write (thus
nbrNewChunk writes in total). We must also flush the head (one write) and we must
update the number of chunks in the line (one read and one write). We thus have 3 +
nbrNewChunk operations. If we consider that a reader reads his tweets regularly,
nbrNewChunk is generally equal to 1.
The last step only requires 2 operations in one transaction, one to read the last-
TweetProcessed and one to update it.
To summarise, the number of operations (nbOp) to perform differs depending on
whether we cut while reading or while writing:
• Cutting while reading:

  nbOp = 8 + 2 × nbrFollowers + 2 × nbrFollowers / nbrOfFollowersPerChunk
       = 8 + nbrFollowers × (2 + 2 / nbrOfFollowersPerChunk)        (5.1)

• Cutting while writing:

  nbOp = 8 + nbrFollowers × (2 + 3 / nbrTweetsPerChunk)
           + 2 × nbrFollowers / nbrOfFollowersPerChunk
       = 8 + nbrFollowers × (2 + 2 / nbrOfFollowersPerChunk + 3 / nbrTweetsPerChunk)        (5.2)
• Difference between the two techniques:

  Diff = nbrFollowers × 3 / nbrTweetsPerChunk        (5.3)
The difference between the two is small but we believe that the overhead introduced
for cutting can have side effects.
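Equations (5.1)–(5.3) can be checked numerically. The sketch below evaluates both cost formulas; the parameter values are arbitrary examples chosen for illustration, not measurements.

```python
def nb_op_triggered(nbr_followers, followers_per_chunk):
    # Cutting while reading, equation (5.1).
    return 8 + nbr_followers * (2 + 2 / followers_per_chunk)

def nb_op_while_posting(nbr_followers, followers_per_chunk, tweets_per_chunk):
    # Cutting while writing, equation (5.2).
    return 8 + nbr_followers * (2 + 2 / followers_per_chunk
                                  + 3 / tweets_per_chunk)

# Example parameters: 1000 followers, 50 followers per Topost chunk,
# 20 tweets per line chunk.
followers, f_per_chunk, t_per_chunk = 1000, 50, 20
diff = (nb_op_while_posting(followers, f_per_chunk, t_per_chunk)
        - nb_op_triggered(followers, f_per_chunk))
# diff equals nbrFollowers * 3 / nbrTweetsPerChunk, equation (5.3):
# here 1000 * 3 / 20 = 150 extra operations per post.
```

For these example values the per-post gap stays small relative to the total cost (about 150 operations out of roughly 2200), which is consistent with the remark above that the difference between the two techniques is small.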
The number of operations involved in the posting of a tweet is mainly influenced
by two parameters that we can control: nbrOfFollowersPerChunk and nbrTweet-
sPerChunk. Increasing nbrOfFollowersPerChunk reduces the burden induced by
the management of the counter associated with each chunk of the Topost set. However,
it also makes transactions more complex because each chunkProcessor works on more
elements. On the other hand increasing nbrTweetsPerChunk means that we cut
the line less often, but we waste resources each time we update a chunk because we
are forced to load a bigger chunk. We ran tests, presented in section 6.3.2 of the
experiments chapter, to observe the impact of the nbrOfFollowersPerChunk parameter.
As pointed out previously, if we want to guarantee strong ordering of the tweets
between the chunks of a line we have to perform more operations. With the automatic
cutting design solution we would need to read one additional structure at each tweet
insertion. In the normal case the tweet is inserted in the head chunk because it is
newer than all the ones previously posted. However, sometimes we have to insert the
tweet in an older chunk. This means that we have to walk back and find the adequate
chunk, involving in the worst case as many operations as there are chunks in the line;
this case is however very unlikely to happen. With the triggered cutting solution we
would not need to do any additional operations during the insertion because we always
insert the tweet in the first chunk. The burden related to the walk back would be
transferred on the split head function.
Delete a tweet
If the full tweet object were posted on all the lines, the delete operation would need
to find back all the lines where the tweet was posted and remove the tweet from all
of them. This would be really impractical for several reasons. First, you
would need to find back where a particular tweet was posted. Indeed, it is not enough
to know the lines where a tweet was posted, you must also find the chunk of the line in
which the tweet was posted. You must thus either maintain this information for each
tweet or walk through all the chunks of the line in order to find and delete the tweet.
This is why we post references to the tweets in the lines. To delete a tweet we only
need to access the tweet object that is located at a given key and mark it as deleted.
The BRH checks the mark when fetching a tweet and discards it if it has been marked
as deleted.
Reading tweets
We will now explain how we fetch tweets from the lines. Users on social networks
usually want to retrieve the latest news and less frequently walk back to find older
posts. We thus assume that users want to retrieve the tweets from the newest to the
oldest. So we do not load the whole line, instead we load only some tweets from it.
Because lines are already cut in chunks it is natural to fetch one chunk of the line at
a time starting with the first chunk of the line, called the head, which contains the
newest tweets. However it is possible to access directly one chunk of the line if needed.
The first chunk of the line can be directly accessed because the head is at a fixed
location. We suppose that the line is already cut when we read it. If we want to access
the chunk that follows the head we have to retrieve the number of chunks in the line,
compute the key of the penultimate chunk and request it.
The next step is to filter the references in order to discard the tweets posted by
users we do not follow anymore. Indeed, we never remove the tweets posted by a user
from a line. It means that all the tweets that were posted while we were following a
user that we do not follow any more will stay forever on the line. It also implies that
if we decide to follow again a user his tweets will reappear on the line.
Chunks only contain references to tweets, we thus still have to fetch the tweets using
the references remaining after the filtering. Once we have retrieved the tweets we filter
the deleted tweets. You can notice that we are forced to load the tweets before we can
filter the dead tweets as the references do not indicate if the tweet is deleted or not.
Once the filtering has been done we can return the pack of tweets remaining.
We present below the pseudo code we have implemented to read nbrTweets tweets
from a line. This code is run in an SR. To avoid complicating the code we did not show
the recovery mechanism inside it. In the implementation, while we are fetching the
tweets we do not abort the operation if we could not fetch a tweet, instead we just skip
it. The SR fails only if other data is not accessible as it is needed to fetch the tweets.
One missing tweet, on the other hand, does not compromise the rest of the operation.
We could also split the SR in two parts if we want to add the cutting mechanism. The
first part of the SR would read the head and split it if needed. Then it would give as
argument the current head to the second part of the algorithm removing the need to
read it again.
getTweetsFromLine(nbrTweets, linename, username){
    refList = read(/user/username/line/linename/head)
    chunkIndex = read(/user/username/line/linename/nbrchunks) - 1
    while(refList.size < nbrTweets && chunkIndex > -1){
        // Read the current chunk.
        refList.add(read(/user/username/line/linename/chunkIndex))
        chunkIndex--
    }
    // Discard the references of users we do not follow anymore.
    users = read(/user/username/line/linename/users)
    filter(refList, users)
    tweets = new tweetList
    foreach tweetRef in refList{
        tweet = read(/user/tweetRef.posterName/tweet/tweetRef.tweetNbr)
        if(! tweet.isDeleted)
            tweets.add(tweet)
    }
    orderTweetsFromNewestToOldest(tweets)
    return tweets
}
Having the pseudo code we can, as we did for the posting algorithm, compute the
number of operations needed on Scalaris. The number of chunks we read depends on
nbrTweets and nbrTweetsPerChunk. We read the number of chunks in the line (1
read). We then read nbrTweets/nbrTweetsPerChunk chunks to get nbrTweets
tweet references. Then, to filter the users, we must retrieve the user list associated
with the line (one read). We must then do nbrTweets reads (minus the number of
tweets associated with users that are no longer on the line) to get the real tweets.
Considering that all the tweets we fetched are posted by users still associated with the
line, the result is:
  nbOp = 2 + nbrTweets + nbrTweets / nbrTweetsPerChunk        (5.4)
The heavy part is thus the fetching of the tweets. Had we posted tweets instead of
references, we would obtain 2 + nbrTweets/nbrTweetsPerChunk operations,
tremendously reducing the number of operations to do (but not the amount of data to
fetch); however, the delete tweet operation would have been much more complex, as
we explained before.
Add a User to a line
We explain here how we add a new follower (newfollowed) to an existing line
(linename). We first check if newfollowed is not already in the set of users associated
with linename (one read). If it is not already present we add it (one write). Once it is
done we add a reference to linename in the Topost set of newfollowed (one read and
one write). We also create an object containing a reference towards the chunk of the
Topost set in which we added the reference to linename so that we can easily remove
this one later. Note that those are not the same chunks as the ones we use to divide
lines. In total we thus have a cost of 3 writes and 2 reads. Sometimes we must also
create a new chunk, in this case we must update the number of chunks and thus add 2
writes and one read.
addUserToLine(username, linename, newfollowed){
    SR(
        begin transaction
            users = read(/user/username/line/linename/users)
            if(newfollowed belongs to users)
                return
            users.add(newfollowed)
            write(/user/username/line/linename/users, users)
            lasttopostchunk = read(/user/newfollowed/topostset/nbrchunks) - 1
            reflist = read(/user/newfollowed/topostset/lasttopostchunk)
            // We must create a new chunk.
            if(reflist.size >= nbrOfFollowersPerChunk){
                lasttopostchunk++
                reflist = new list
                write(/user/newfollowed/topostset/nbrchunks, lasttopostchunk+1)
                // Create the counter associated with the chunk.
                lastTweetNbr = read(/user/newfollowed/tweet/size) - 1
                write(/user/newfollowed/topostset/lasttopostchunk/counter,
                      lastTweetNbr)
            }
            reflist.add(new ref(username, linename))
            write(/user/newfollowed/topostset/lasttopostchunk, reflist)
            // Store the index of the Topost set chunk we posted in, for easy
            // removal later.
            write(/user/username/newfollowed/linename/, lasttopostchunk)
        end transaction
    )
}
Remove a user from line
We now want to remove a user (followingUsername) from a line (linename). We
first remove followingUsername from the set of users associated with linename (one
read and one write). We then read the object (see “Add a user to a line”) containing
the number of the chunk of the Topost set in which we added the reference to linename,
and suppress it (one read and one write). We can then locate the chunk and remove
the reference from it (one read and one write). In total this gives 3 reads and 3 writes.
Note that we do not modify the number of chunks in the Topost set even if a chunk
becomes empty. Indeed, we do not want to remap the keys attributed to already
existing chunks, as those keys depend on this number of chunks.
removeUserFromLine(username, linename, followingUsername){
    SR(
        begin transaction
            users = read(/user/username/line/linename/users)
            if(! followingUsername belongs to users)
                return
            users.remove(followingUsername)
            write(/user/username/line/linename/users, users)
            // Locate the Topost set chunk in which the reference was stored
            // and suppress the locator object.
            topostchunk = read(/user/username/followingUsername/linename/)
            delete(/user/username/followingUsername/linename/)
            // Remove the reference from the chunk.
            reflist = read(/user/followingUsername/topostset/topostchunk)
            reflist.remove(new ref(username, linename))
            write(/user/followingUsername/topostset/topostchunk, reflist)
        end transaction
    )
}
Create a user
The first thing to do when creating a user is to check that there is not already another
user with the desired username registered in the system. To do so we check whether
there is already a value at the key “/user/username”. If there is already a value at
this key we can conclude that a user is already registered with this username
and the user creation is aborted. Otherwise a user object containing all the
information of the user is created and stored at this key.
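The check-then-create logic can be sketched as follows, with a dict standing in for the datastore; in Bwitter the check and the write would run in one transaction, so that two concurrent registrations of the same username cannot both succeed. All names are illustrative.

```python
def create_user(store, username, info):
    key = f"/user/{username}"
    if key in store:
        # A user with this name is already registered: abort the creation.
        return False
    # Store the user object, containing all the information of the user.
    store[key] = {"name": username, **info}
    return True

store = {}
first = create_user(store, "alice", {"mail": "a@example.org"})   # succeeds
second = create_user(store, "alice", {"mail": "b@example.org"})  # aborted
```

The second call returns False and leaves the first user object untouched.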
5.4.2 The pull approach
As it was the case in section 4.3.3 of the design of the datastore chapter, we tried
to re-use as many of the building blocks and mechanisms from the push approach as
possible while still having efficient algorithms. With the pull approach, only
the mechanisms needed to post and retrieve the tweets are heavily modified. Other
basic mechanisms, such as adding a user, are simplified because there is no need to
keep a Topost set up to date anymore and some fields no longer need to be initialised.
Finally, some simple mechanisms, such as deleting tweets, remain exactly the same.
Post a tweet
The tweet itself is posted in much the same way as with the push approach
explained in section 5.4.1. The difference is that the user now posts the tweet
reference at a single location only. This location varies with time: all the tweets
posted during a given time frame are grouped together and accessed via the same
rounded timestamp. We call the set containing all the tweets corresponding to a time
frame a postTimeFrame. The timestamp is rounded to the desired time granularity
by setting some of its fields to 0, as explained in section 4.3.3 of the datastore chapter.
posttweet(posterName, msg){
    SR(
        begin transaction
            tweetNbr = read(/user/posterName/tweet/size)
            tweet = buildTweet(posterName, msg, tweetNbr)
            write(/user/posterName/tweet/tweetNbr, tweet)
            write(/user/posterName/tweet/size, tweetNbr+1)
            postingDate = currentDate()
            tweetReference = buildTweetRef(posterName, tweetNbr, postingDate)
            // Round the posting date to the time granularity to obtain the key
            // of the postTimeFrame (cf. section 4.3.3).
            timestamp = roundToGranularity(postingDate)
            references = read(/user/posterName/tweet/timestamp)
            references.add(tweetReference)
            // Write the reference to the given postTimeFrame.
            write(/user/posterName/tweet/timestamp, references)
        end transaction
        return tweetNbr
    )
}
As expected, the post tweet operation is much lighter in this case, with only 5
operations in total (2 reads and 3 writes). One could wonder why we still post references
instead of tweets. The reason comes from the algorithm to read tweets: as we explain
in the next section, a time frame must be read for each of the followed users. We thus
wanted to limit the size of a time frame.
Reading the tweets
This operation is now heavier as it has to retrieve the references from each author.
We have kept the chunks number format for the sake of simplicity and compatibility
with the existing API. The chunk 0 is the very first chunk associated to the user, the
timestamp of this chunk is the rounded registration time of the user. For example, if
a user registered at 05/06/11 15 h 00 min 00 s GMT and the time granularity is
counted in hours, when he requests to read the chunk 2 he will fetch all the tweets
posted between 05/06/11 17 h 00 min 00 s GMT and 05/06/11 17 h 59 min 59 s
GMT by all the users he is following. If a chunk with a negative value is requested the
latest chunk is returned along with its real chunk number.
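The chunk-number-to-time-frame mapping described above can be sketched as follows, assuming an hourly granularity; `chunk_to_timeframe` is an illustrative name, not Bwitter's actual API.

```python
from datetime import datetime, timedelta

GRANULARITY = timedelta(hours=1)  # example time granularity

def chunk_to_timeframe(start_time, chunk_nbr):
    # Chunk 0 starts at the user's rounded registration time; chunk n covers
    # [start + n * granularity, start + (n + 1) * granularity).
    start = start_time + chunk_nbr * GRANULARITY
    return start, start + GRANULARITY

# User registered on 5 June 2011 at 15:00 GMT; chunk 2 then covers
# 17:00 to 18:00 on the same day, as in the example above.
frame = chunk_to_timeframe(datetime(2011, 6, 5, 15, 0, 0), 2)
```

In the thesis design, the lower bound of this interval is the rounded timestamp used as the key of the postTimeFrame to fetch.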
In the same fashion as what we did with the post tweet in the push approach, we
create a series of smaller tasks. In this case we create one SR per user followed that
is responsible for fetching the tweets of this user. The tweets fetched are the tweets in
the chunk corresponding to the chunk number cNbr which is given as argument to the
function getTweetsFromLine that we describe below.
getTweetsFromLine(username, linename, cNbr){
    // First step
    // Produce a list of SRs to add to the SCM; each SR takes care of one user.
    SRlist = SR(produceLineProcessors(username, linename, cNbr))
    // Second step
    // Add all the SRs produced to the SCM and get their tweets.
    foreach sr in SRlist
        add sr to the SCM
    result = new tweetList
    foreach sr in SRlist {
        try{
            // Block until the result is computed.
            result.add(sr.getTweets)
        } catch (exception) {
            // The tweets of a user could not be retrieved; in this case we
            // abort the reading.
            return null
        }
    }
    chronologicalSort(result)
    return result
}
produceLineProcessors This part creates one lineProcessor per followed user. A
lineProcessor takes as argument the key of the chunk that it must fetch. To compute
the key of the chunk we must convert cNbr to a date because, as already explained,
lines are fragmented according to the time and thus each chunk in a line corresponds
to a specific date.
produceLineProcessors(username, linename, cNbr){
    SR(
        startTime = read(/user/username/starttime)
        dateKey = chunkToDate(startTime, cNbr)
        users = read(/user/username/line/linename/users)
        SRlist = new emptyList
        for(User u: users){
            SRlist.add(new lineProcessor(dateKey, u))
        }
        return SRlist
    )
}
lineProcessor This part fetches the tweets posted by a given user during the
dateKey time frame. Note that no ordering is done at this stage, as all the tweets
are ordered at the end of getTweetsFromLine. As for the previous reading tweets
operation, we do not abort an SR if one of the tweets is not accessible, but rather
ignore the error, as this does not compromise the rest of the operations. This case is
supposed to happen very rarely: tweet objects, once stored, are only modified when
the author wants to delete them; otherwise they are only read, and reads are not
conflicting and should thus not abort.
lineProcessor(dateKey, username){
    SR(
        refList = read(/user/username/tweet/dateKey)
        tweets = new tweetList
        foreach tweetRef in refList{
            tweet = SR(read(/user/tweetRef.posterName/tweet/tweetRef.tweetNbr))
            if(! tweet.isDeleted)
                tweets.add(tweet)
        }
        return tweets
    )
}
Theoretical performance analysis The whole getTweetsFromLine operation is esti-
mated to perform 2 + nbrFollowing + nbrRetrievedTweets basic Scalaris operations,
where nbrFollowing is the number of users followed and nbrRetrievedTweets the
total number of tweets to retrieve. Indeed we need one read to determine dateKey
and one read to determine the users we follow. Then to fetch the references we must for
each user read the chunk of their line corresponding to dateKey, thus nbrFollowing
operations. Finally we must do nbrRetrievedTweets operations to read the tweets
corresponding to the tweet references we just read.
Why not store the built line chunks An alternative approach would be to keep
the work done and to store the built chunk when it has been read. This would avoid
the need to build several times the same chunk of a line. The application could do a
simple check to see if a given chunk has already been built and, if it is the case, retrieve
the references from the chunk previously stored.
This might sound like an interesting optimisation but we have to keep in mind the
way our application is going to be used. Users almost never re-read tweets they have
already read, they usually want to see the last posted tweets. This means that they are
going to load the latest chunk to see if there are new references in it. This implies that
the latest chunk has to be rebuilt from scratch and thus storing the previous tweets
references will not speed up this operation. Furthermore, keeping previously built
chunks of the line increases the number of checks and operations to perform when the
references are not already stored on the follower side, which is precisely the case when
reading previously unread tweets. Finally, the most obvious advantage of not storing
line chunks is that it decreases the space complexity.
We thus decided against this solution because it complicates the implementation,
increases the amount of data to keep in the system and slows down the most used
operations in order to increase the performance of rarely used operations.
5.4.3 Theoretical comparison of Pull and Push approach
We are first going to compare the two approaches based on the complexities we
computed in the previous section. We then try to give an intuition of the impact of
those complexities on the behaviour of Bwitter when used by simulated users.
Summary of the complexities
We are now going to compare the complexity of the push and pull approaches for
the two main operations, postTweet and getTweetsFromLine. Below we present a
summary of the cost of those operations for both the push and pull approaches.
• Push - postTweet

  nbOp = 8 + nbrFollowers × (2 + 3 / nbrTweetsPerChunk + 2 / nbrOfFollowersPerChunk)        (5.5)

• Pull - postTweet

  nbOp = 5        (5.6)

• Push - getTweetsFromLine

  nbOp = 2 + nbrTweets + nbrTweets / nbrTweetsPerChunk        (5.7)

• Pull - getTweetsFromLine

  nbOp = 2 + nbrFollowings + nbrRetrievedTweets        (5.8)
Before we start, here is a reminder of the different terms involved:
• nbrFollowers: number of users the user is followed by (pull/push).
• nbrFollowing: number of users the user follows (pull/push).
• nbrTweets: minimum number of tweets we want to retrieve (push).
• nbrTweetsPerChunk: number of tweets in one chunk (push).
• nbrRetrievedTweets: number of tweets retrieved in one get (pull).
• nbrOfFollowersPerChunk: number of followers in a chunk of the Topost set
(push).
• time granularity: the time frame corresponding to a post chunk (pull).
As announced, the post operation in the push approach is clearly much heavier than
in the pull approach. Indeed, the time to post a tweet using the pull design is constant,
while in the push approach it depends on the number of users that follow you.
Conversely, the read is lighter in the push approach: it depends on the number of
tweets retrieved, as in the pull, but its complexity does not grow with nbrFollowings
as is the case in the pull.
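To make this comparison concrete, the four cost formulas (5.5)–(5.8) can be evaluated for sample user profiles; all parameter values below are arbitrary illustrations, not measurements from Bwitter.

```python
def push_post(nbr_followers, tweets_per_chunk=20, followers_per_chunk=50):
    # Equation (5.5): push post cost grows with the number of followers.
    return 8 + nbr_followers * (2 + 3 / tweets_per_chunk
                                  + 2 / followers_per_chunk)

def pull_post():
    # Equation (5.6): pull post cost is constant.
    return 5

def push_read(nbr_tweets, tweets_per_chunk=20):
    # Equation (5.7): push read cost grows with the tweets retrieved.
    return 2 + nbr_tweets + nbr_tweets / tweets_per_chunk

def pull_read(nbr_followings, nbr_retrieved_tweets):
    # Equation (5.8): pull read cost also grows with the followings.
    return 2 + nbr_followings + nbr_retrieved_tweets

# A "star" with 100000 followers pays heavily per post in the push design,
# while a pull post always costs 5 operations.
star_push_post = push_post(100_000)
# Reading 20 tweets while following 200 users:
push_cost = push_read(20)
pull_cost = pull_read(200, 20)
```

For these sample values the push read is an order of magnitude cheaper than the pull read, while the star's push post is five orders of magnitude more expensive than a pull post, illustrating where each design places its burden.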
The two designs thus each have their own heavy operation: the post for the push and
the read for the pull. However, we believe that the first is more resistant to failures,
because it does not need to succeed immediately and can be recovered later, while the
second must successfully read from all the followings in order to produce a result. If
we further consider that a user does not like to wait, the push is more reasonable: after
the first step of the post we can already tell the user that the operation was a success,
whereas for the read in the pull we must wait until the end of the whole operation
before responding to a request. On the other hand, the push operations involve many
more conflicts, as they do far more writes than the pull operations.
We still have not determined whether, from a complexity perspective, it is better to
use the push or the pull. To this end we would like to compare them according to the
number of followers and followings and the read and write rates.
Theoretical Bwitter simulation
We now simulate the two designs we have presented. The simulation aims at estimating
the global number of operations performed by the system and determining which design
is the best according to an unknown number of followers and read rate. This simulation
does not take into account failures during the algorithms, the size of the data
transferred, or the complexity of the transactions (the number of keys involved in a
transaction).
Description of the problem As we can see, the operations are not comparable as
they stand. Indeed, in the push approach we fetch at least a specified number of
tweets, while in the pull approach we retrieve an arbitrary number of tweets, depending
on the number of tweets posted during a given time frame. The two operations are
thus semantically different, and their complexities naturally depend on different
parameters. We would like to be able to compare them in terms of the total number of
operations done on Scalaris. The main problem is that the parameters of the system
are unknown. Indeed, each user of Bwitter is different; we define a user in terms of
his behaviour, with four parameters describing it:
• postingRate: the rate at which a user posts new tweets.
• readRate: the rate at which a user reads his tweets.
• nbrFollowers: the number of followers a user has.
• nbrFollowings: the number of followings a user has.
Moreover, in order to estimate the number of operations done, we must fix all the
design parameters involved in the complexities, namely nbrTweets, nbrTweetsPerChunk
and nbrFollowersPerChunk for the push, and time granularity and
nbrRetrievedTweets for the pull.
Assumptions The parameters we just described vary a lot between users and are
unknown; indeed, we did not find any precise statistics about the usage of Twitter.
Because we would still like to give an idea of the performance of our two designs
according to those parameters, we have fixed them to values chosen according to the
following assumptions:
(1) Users read their tweets more often than they post a tweet.
(2) Most of the users on Twitter have more followings than followers; we call them fans.
Other users have a lot of followers compared to the number of users they follow; we
call them stars. This means that nbrFollowers for fans is smaller than for stars.
(3) Users are only interested in new tweets, and when a user reads his tweets he reads
all the new tweets.
(4) readRate is the same for all the users and is the average of the read rates of each
user in the real network. Because we cannot compute it as we do not have the
figures, we take it as a parameter of the simulation.
(5) postingRate is the same for all the users and is the average of the posting rates
of each user in the real network. Because we cannot compute it as we do not have
the figures, we take it as a parameter of the simulation.
(6) nbrFollowings is the same for all the users.
The first three assumptions come from observing how Twitter is used. Generally,
users that connect to a Twitter application read their messages more often than they
post new ones. Moreover, 1% of the users of Twitter are responsible for 50% of its
content; this observation motivated our distinction between the star and fan behaviours.
The last three assumptions were made in order to simplify the following development.
Properties of the simulated system We define in this section two properties
that we derived from the assumptions made above. Those properties fix some of the
simulation parameters defined before.
First property: The number of new tweets a user reads when he reads his tweets
is constant and equal to nbrOfNewTweets.

nbrOfNewTweets = (postingRate × nbrFollowings) / readRate
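As a quick numeric illustration (the figures are ours, chosen only for the example, not taken from any measurement): a user with nbrFollowings = 40, a postingRate of 1 tweet per time unit and a readRate of 8 reads per time unit sees

$$\text{nbrOfNewTweets} = \frac{1 \times 40}{8} = 5$$

new tweets each time he reads.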
First, notice that nbrOfNewTweets is constant, as postingRate, nbrFollowings
and readRate are fixed. We derived this property from (3), (4), (5) and (6). It
allows us to fix nbrTweets (push) and nbrRetrievedTweets (pull) to
nbrOfNewTweets. We also decided to read only one chunk (push) and one
postTimeFrame per following (pull). This choice was made in order to simplify the
simulation, which is already rather complex, and it helps us fix the time
granularity and nbrTweetsPerChunk.
Concerning the push approach, it is easy to fix nbrTweets as it is a parameter of
the function call. In order for the tweets to be packed in the same number of chunks
at each read call, it is sufficient to choose nbrTweetsPerChunk to be a multiple of
nbrOfNewTweets, or put differently:
nbrTweetsPerChunk % nbrOfNewTweets = 0
We decided to fix nbrTweetsPerChunk to nbrOfNewTweets. Therefore, we
need to read exactly one chunk in order to have the new tweets at each read operation.
Concerning the pull approach, we cannot directly influence how many tweets are
read when performing a read tweet operation. However, we can fix the time granu-
larity so that:

time granularity % (1/readRate) = 0
This ensures that all the new tweets are always in the last postTimeFrame. We
chose to fix the time granularity to 1/readRate. This gives the smallest chunk
possible (no unused references loaded) while fulfilling the property just stated. Please
note that the choice of the time granularity does not have any direct influence on the
simulation, but we wanted to show that our design can be tuned to meet simulation
constraints.
Second property: Each user has the same number of followers (nbrFollowers
is fixed).
We now argue that this second property is not restrictive. Notice that it is aimed
at simplifying assumption (2): in other words, it claims that there is no distinction
between stars and fans, or any other way of distinguishing users based on their
nbrFollowers. This is in fact not needed. Our simulation estimates the global number
of operations performed by the system according to some user profile, and, thanks to
the two properties below, we can affirm that having some users with more followers
than others has no influence on the total number of operations.
(7) The postingRate and readRate are the same for all users (which is exactly what
we assumed at (4) and (5)).
(8) The complexities of the operations in the two designs are linear with respect
to the number of followers and followings (this can be observed by remembering that
nbrTweetsPerChunk and nbrFollowersPerChunk are constant parameters).
Property (8) states that one more follower for a user only increases the load he puts
on the system by a constant amount (the same for all users) for each operation he
performs. Hence, moving a follower from one user to another does not change the
total load put on the system, provided all users perform the same number of
operations. This last condition is exactly what property (7) states. If (7) were not
true, we could have a system with one user having lots of followers but a posting
rate equal to 0, and another user with few followers and a postingRate different
from 0. The first user would not generate any posting load as he never posts, but
transferring one of his followers to the second user would change the total load put
on the system. To summarise, thanks to (7) and (8), we can always move followers
from users having more followers to users having fewer followers without changing
the total number of operations performed on the network. It is thus not needed to
make a distinction between stars and fans.
In conclusion, the two properties we just defined fix the following relations between
the simulation parameters:
• nbrFollowers = nbrFollowings
• (postingRate × nbrFollowings) / readRate = nbrOfNewTweets = nbrTweets =
nbrRetrievedTweets = nbrTweetsPerChunk
• time granularity = 1/readRate
The simulation We now explain the final details of the simulation. Below are the
formulas we use to simulate Bwitter; the first is for the push design and the second
for the pull design.
Push:

nbOp = postingRate × (8 + nbrFollowers × (2 + 3/nbrNewTweets + 2/nbrOfFollowersPerChunk))
     + readRate × (3 + nbrNewTweets)   (5.9)

Pull:

nbOp = postingRate × 5 + readRate × (nbrFollowings + 2 + nbrNewTweets)   (5.10)
Those formulas compute the number of operations performed with respect to the
readRate, the postingRate and the nbrFollowers. Recall that an operation is a
transactional read or write. Because we do not simulate operations other than reading
and posting tweets, we have a direct relation between the two rates: if we normalise
them, readRate + postingRate = 1. We thus chose to make readRate vary from 0
to 1, with postingRate varying accordingly. We defined nbrUsers as the number of
users in the system. We chose nbrFollowers, which, as already stated, represents the
mean number of followers each user has, and thus also his number of followings as
nbrFollowings = nbrFollowers. Because we had no idea of the value of this number,
we chose some arbitrary values; the higher it is, the more socially connected the users
in our system are. Finally, we must fix the last unknown parameter:
nbrOfFollowersPerChunk. This parameter is only present in the push design; the
number of operations that must be done in a write operation decreases as it increases.
The problem is that it is difficult to fix a value for it: we cannot neglect its influence,
but we cannot decently set it very high either, as the number of keys involved in the
transactions while posting grows linearly with it. We thus made a compromise and set
it to 20. We summarise below the values of the parameters.
• nbrUsers = 100
• nbrFollowers = 10, 30, 70
• nbrOfFollowersPerChunk = 20
• (postingRate × nbrFollowings) / readRate = nbrOfNewTweets = nbrTweets =
nbrRetrievedTweets = nbrTweetsPerChunk
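The two formulas can be evaluated directly. The Java sketch below is our own illustration (the class and method names are hypothetical, not part of the Bwitter code base): it computes equations (5.9) and (5.10) under the normalisation readRate + postingRate = 1 and the relations above, and searches for the read rate at which the push becomes cheaper than the pull.

```java
// Sketch: evaluate equations (5.9) and (5.10) and locate the crossover point.
public class PushPullSim {
    static final int FOLLOWERS_PER_CHUNK = 20;   // the compromise value chosen above

    // First property: nbrOfNewTweets = postingRate * nbrFollowings / readRate,
    // with nbrFollowings = nbrFollowers and postingRate = 1 - readRate.
    static double newTweets(double readRate, int followers) {
        return (1.0 - readRate) * followers / readRate;
    }

    // Equation (5.9): operations generated per user in the push design.
    static double pushOps(double readRate, int followers) {
        double p = 1.0 - readRate, n = newTweets(readRate, followers);
        return p * (8 + followers * (2 + 3.0 / n + 2.0 / FOLLOWERS_PER_CHUNK))
             + readRate * (3 + n);
    }

    // Equation (5.10): operations generated per user in the pull design.
    static double pullOps(double readRate, int followers) {
        double p = 1.0 - readRate, n = newTweets(readRate, followers);
        return p * 5 + readRate * (followers + 2 + n);
    }

    // Smallest read rate (on a 0.01 grid) where push needs fewer operations;
    // NaN if pull is always the cheaper design.
    static double crossover(int followers) {
        for (double r = 0.01; r < 1.0; r += 0.01)
            if (pushOps(r, followers) < pullOps(r, followers)) return r;
        return Double.NaN;
    }
}
```

For nbrFollowers = 70 this sketch finds the crossover around readRate ≈ 0.7, consistent with the asymptote discussed below, and for very small follower counts it finds no crossover at all.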
We have plotted the results of our simulation in Figure 5.6. Lines go in pairs (one
push and one pull); lines with the same weight correspond to the same number of
followers. We have indicated the relevant intersections with big red dots.
Figure 5.6: Number of Scalaris operations with respect to the read rate: comparison between the pull and the push approach for nbrFollowers = 10, 30 and 70.
First, you can observe that all the lines of the pull approach are parallel; this means
that nbrFollowers influences the number of operations by a constant amount whatever
the readRate. We can also see that the number of operations in the pull approach
does not vary much with the readRate, which may be a surprising observation at first.
Even more surprisingly, it decreases slowly as the readRate grows.
Secondly, we can see that, as expected, as the readRate increases the push ap-
proach becomes more and more interesting. When nbrFollowers is smaller, we need
a higher readRate before the push approach becomes more interesting than the pull.
If you observe the red dots, an asymptote seems to appear, indicating that below some
readRate the push approach is never the better choice. We thus plotted the curve
defined by the intersections of the pull and push lines in Figure 5.7 to confirm this
intuition, keeping on the plot the lines already shown before to better visualise what
the curve represents. The curve shows the intersections for nbrFollowers between
4 and 300; nbrFollowers values smaller than 4 give intersections at a readRate
bigger than 1, which does not make sense.
Figure 5.7: Intersection of the push/pull lines for nbrFollowers between 4 and 300.
This curve can be used to determine which design is theoretically the best according
to nbrFollowers and readRate. We can observe an asymptote around readRate =
0.7; we did the math for nbrFollowers = 30000 and obtained readRate = 0.672.
We can also note that once nbrFollowers is higher than 70 the black curve becomes
nearly vertical. This means that for a readRate bigger than 0.672 and nbrFollowers
bigger than 70, the push approach is theoretically always the best in terms of the
number of Scalaris operations performed.
Conclusion
In conclusion, we have compared the push and the pull theoretically according to
an unknown mean nbrFollowers and an unknown readRate. We have seen that we
can find a value of the readRate under which the pull approach is always the best.
However, above this value, and if nbrFollowers is bigger than 70, the push approach
is the best. It seems safe to assume that we are in the second case for social networks
like Twitter.
Moreover, the read algorithm in the pull is heavier, and one must wait for its
termination in order to respond to a given call, which is not the case for the posting
in the push. Based on those observations, we believe that the push approach is the
better adapted for a social network like Twitter. We will see in the next chapter
whether the practical tests confirm this conclusion.
5.5 Conclusion
In this chapter we detailed the main modules of our implementation. The NM
is a powerful tool that allows us to manage the different machines we need to run
Scalaris nodes, and the SCM allows us to easily dispatch work on those nodes. The
BRH is the module on which we spent the most time and attention in order to design
the simplest and fastest algorithms, and we believe we have minimised the complexity
of our most used algorithms. Finally, our theoretical comparison between the push and
the pull approaches reinforces our belief that the push approach is probably better
adapted to our application. In the next chapter we run tests on Scalaris and on
Bwitter's pull and push variations.
Chapter 6
Experiments
This chapter details the experiments we ran on Scalaris and Bwitter. The first part
describes the Amazon Elastic Compute Cloud, the platform on which we did all our
tests. We then detail the tests performed on Scalaris and Bwitter; for both we run
scalability and elasticity tests. We start, in the second section of this chapter, with
Scalaris, as the results of the Bwitter tests are heavily influenced by those of Scalaris.
Bwitter is tested in the third part: we study the influence of a cache and of the
nbrOfFollowersPerChunk parameter for the push, then test the scalability and
elasticity of our Bwitter push solution. Finally, we study the scalability of the pull
approach and finish with the conclusion.
6.1 Working with Amazon
We did not want to simulate the cloud platform ourselves, as we felt it would not
reflect the way our application would ultimately be used. We thus decided to work
with the Amazon Elastic Compute Cloud (Amazon EC2), because it is a professional
and realistic work environment.
6.1.1 Choosing the right instance type
An instance is a virtual machine running on a physical machine; it is characterised
by four attributes: CPU, network capabilities (sometimes called IO capacity), RAM
and storage capacity. The last attribute is the least interesting to us, as none of our
tests use persistent storage. While working on the Amazon cloud infrastructure, we
used four kinds of instances: the standard micro, the standard small, the standard
large and the high CPU medium instance. The micro instance is the smallest possible
Amazon instance: it provides minimal CPU and IO capacity and can consume up to
2 EC2 Compute Units for short periods of burst. This is not enough to run Scalaris
correctly. According to Amazon, an EC2 Compute Unit is equivalent to the CPU
capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. You can find
more information about Amazon instances and EC2 Compute Units on the Amazon
website1. We show the description of the different Amazon instance types in Table 6.1.
              | Standard Micro      | Standard Small     | Standard Large     | High CPU Medium
--------------+---------------------+--------------------+--------------------+--------------------
Memory        | 613 MB              | 1.7 GB             | 7.5 GB             | 1.7 GB
Compute Units | Up to 2 EC2         | 1 EC2 Compute      | 4 EC2 Compute      | 5 EC2 Compute
              | Compute Units (for  | Unit (1 virtual    | Units (2 virtual   | Units (2 virtual
              | short periodic      | core with 1 EC2    | cores with 2 EC2   | cores with 2.5 EC2
              | bursts)             | Compute Unit)      | Compute Units each)| Compute Units each)
Storage       | EBS storage only    | 160 GB instance    | 850 GB instance    | 350 GB instance
              |                     | storage            | storage            | storage
Platform      | 32-bit or 64-bit    | 32-bit             | 64-bit             | 32-bit
I/O Perf      | Low                 | Moderate           | High               | Moderate
API name      | t1.micro            | m1.small           | m1.large           | c1.medium
Table 6.1: Characteristics of the different Amazon instance types we use during thetests.
The small instance is just above the micro; it provides moderate IO performance and
fixed CPU. Small instances were well suited to run up to 18 Scalaris nodes, but showed
some CPU and IO limitations when we used a high number of connections and/or nodes.
As for the micro, the characteristics of the small instance can be found in Table 6.1.
Most of the tests use small instances to run the Scalaris nodes, as they are rather cheap
and efficient, but we could have benefited from instances with higher CPU and network
capabilities, as shown later.
1Amazon EC2 FAQs, http://aws.amazon.com/ec2/faqs/, last accessed 27/07/2011
We also use the large instance, which has better network performance than the two
others, and the high CPU medium instance, which has the same network performance
but much higher CPU performance. Those two instances are used for special tests,
when we suspect that some behaviours can be explained by the lack of performance of
the previous instances.
At first, we tried to work with the micro machines, but they turned out not to be
powerful enough to support Scalaris and the operations we wanted to perform. Those
preliminary measurements are thus not relevant, and we only detail our experiments
and results with the other instances we presented.
6.1.2 Choosing an AMI
Instances need an associated Amazon Machine Image (AMI). AMIs can use two
kinds of storage: AMI storage and the Elastic Block Store (EBS). The first does not
allow the user to stop and restart the machine: once the machine is stopped, all the
modifications done are lost. The second works like a normal personal computer: you
can restart the machine and the changes done before are still present. We use the EBS
solution because it allows us to easily create custom images from existing AMIs and
store them, which is not possible with the classical AMI storage.
6.1.3 Instance security group
Amazon instances all belong to a security group. This security group defines several
firewall settings for the instances. For the sake of simplicity, we have allowed all the
TCP connections as well as all the ICMP messages between the nodes.
6.1.4 Constructing Scalaris AMI
We started from the AMI with ID ami-06ad526f, a 32-bit image of Ubuntu
11.04 (Natty Narwhal)2. The first step is to install all the packages needed to build
Scalaris: the Java JDK, Erlang, make, svn and ant. We ran the following commands
to install the required packages.
sudo apt-get install erlang
sudo apt-get install make
sudo apt-get install openjdk-6-jdk
sudo apt-get install ant
sudo apt-get install subversion
We then installed the latest version (0.3.0) of Scalaris, downloaded from the SVN.
svn checkout http://scalaris.googlecode.com/svn/trunk/
cd /home/ubuntu/trunk/
sudo ./configure
sudo make install
sudo make install java
2Can be found at http://uec-images.ubuntu.com/releases/11.04/release/ last accessed 27/07/2011
We also slightly modified the start scripts of Scalaris and added some scripts to
restart Scalaris easily on a machine. Once all those steps were performed, the new
AMI was ready to run Scalaris.
6.2 Working with Scalaris
We now detail the procedure to launch Scalaris and the different tests we did on it
before testing our Bwitter application.
6.2.1 Launching a Scalaris ring
The first thing to do is to modify the “scalaris.local.cfg” file, which is located in the
bin folder of Scalaris. The two important lines shown below must be modified.
{mgmt_server , {{127,0,0,1}, 14194, mgmt_server }}.
{known_hosts , [{{127,0,0,1}, 14195, service_per_vm }]}.
The mgmt_server, known_hosts and service_per_vm identifiers must not be
modified, otherwise Scalaris will not work correctly: nodes do not connect properly
when those values are changed. You must replace the IP address on the first line with
the IP address of the node running the management server (mgmt_server); 14194 is
the port on which the management server runs, and it can be changed. The second
line contains the known hosts, i.e. the other DHT nodes already inserted in the ring;
each known host is identified by an IP address and the port on which it listens. Below
is an example of configuration.
{mgmt_server , {{192,168,1,1}, 14194, mgmt_server}}.
{known_hosts , [{{192,168,1,1}, 14195, service_per_vm},
                {{192,168,1,2}, 14195, service_per_vm},
                {{192,168,1,3}, 14195, service_per_vm},
                {{192,168,1,1}, 14200, service_per_vm}]}.
In this configuration, one node (192.168.1.1) runs the management server and a
DHT node. Launching the nodes is quite simple: the three following commands are
used respectively to run the management server, the first node and another DHT node.
The “scalarisctl” binary is located in the bin folder of Scalaris.
./scalarisctl -n mgmt_server@hostname -p 14195 -y 8000 -m start
./scalarisctl -n FirstNodeName@hostname -p 14195 -y 8000 -s -f start
./scalarisctl -n AnotherNodeName@hostname -p 14195 -y 8000 -s start
Note that each node has a name, which is needed to communicate with Scalaris
nodes. The mapping between a node name and its location (IP address and port) is
done by the epmd server, launched automatically with Scalaris. It is possible to launch
several Scalaris nodes on the same machine; they only need different node names. The
node name is fixed with the “-n” parameter. In fact only the part before the @ is the
true name, but fixing the hostname is important if you want to avoid communication
problems when using the Java API for Scalaris. Indeed, Java does not resolve
hostnames the same way Erlang does, and Scalaris is written in Erlang. Fixing the
hostname thus prevents Erlang from choosing it itself, and using the same hostname
in Java avoids the problem.
The “-p” parameter fixes the port on which the DHT nodes communicate, which is
important for configuring the firewall settings. The “-y” parameter fixes the port on
which the webserver runs; this webserver is not mandatory, but it eases debugging as
you can do put/get operations directly from its webpage. You can also get a visual
representation of the complete ring from the webpage of the management server.
Finally, the parameters “-m”, “-f” and “-s” are used respectively to start the
management server, the first node and a normal DHT node.
6.2.2 Scalaris performance analysis
Before doing any test directly related to Bwitter, we need some important informa-
tion about Scalaris itself in order to understand our future results. Our first analysis
focuses on the connection strategy used to communicate with Scalaris nodes; we then
perform scalability and elasticity tests based on those results. Scalaris is configured
with a replication factor of 4. Scalaris does not allow choosing the consistency level
between replicas and thus always guarantees strong consistency. This means that read
and write operations are always done in a transaction, and will thus conflict if they
work on the same keys. However, concurrent reads of the same value do not conflict,
which is important to keep in mind during the tests.
One important precision is that we only run one Scalaris node per machine. We
decided to do so because the Scalaris developers told us that having more than one
node per machine was less stable and only slightly increases the overall performance
of the system. Moreover, small instances from Amazon might not be powerful enough
to handle more than one instance of Scalaris.
During our tests with Scalaris we take two measures: the time, in milliseconds, taken
to perform 20000 operations, and the number of operations that failed during the test.
We do not apply any restart strategy: if an operation fails, we report it and execute the
next operation. We then compute the throughput and the failure percentage, defined
respectively by equations 6.1 and 6.2. We have chosen to show the throughput as it is
easier to analyse and closer to what we want to measure than the raw time. Moreover,
time can be difficult to interpret on its own and cannot be compared with other tests'
results unless exactly the same number of operations is done. The failure percentage
has the advantage of being easily comparable for other people doing similar tests.
Throughput = number of Scalaris operations successfully performed / measured total time   (6.1)

Failure percentage = (number of operations failed / number of operations performed) × 100   (6.2)
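The two metrics translate directly into code. The helper below is our own sketch (the class and method names are hypothetical); it mirrors equations (6.1) and (6.2).

```java
// Sketch: compute the two benchmark metrics defined by equations (6.1) and (6.2).
public class BenchmarkMetrics {
    // Successful operations per millisecond of measured time, eq. (6.1).
    static double throughput(long successfulOps, long totalTimeMs) {
        return (double) successfulOps / totalTimeMs;
    }

    // Share of failed operations, in percent, eq. (6.2).
    static double failurePercentage(long failedOps, long performedOps) {
        return 100.0 * failedOps / performedOps;
    }
}
```

For example, 19000 successful operations in 1000 ms give a throughput of 19 operations/ms, and 1000 failures out of 20000 operations give a failure percentage of 5%.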
Before presenting our tests, we want to point out that the Amazon instances do
not provide a constant level of performance: the performance of the Scalaris nodes
varies from one run to the other. Indeed, we do not use the same physical machines
all the time, but virtual machines whose performance can vary over time. We had
lots of tests to run, so it was not possible to make several runs of the same test.
However, because we ran many tests, we could observe when some results deviated
too much from what we had already observed; in that case we restarted the test.
Note that we do not detail all the tests we did on Scalaris, as part of them were done
to familiarise ourselves with the system. We only present the most relevant ones, which
give the reader the broadest view of Scalaris' behaviour.
Connection strategy test
This test is aimed at evaluating the impact that the number of parallel connections
the dispatcher maintains towards a single Scalaris node has on the performance. A
connection is a TCP connection towards a Scalaris node, which can be used to make
sequential requests. The word sequential is important, as concurrent requests using the
same connection trigger errors: Scalaris does not distinguish between different requests,
which thus get mixed if done concurrently. The dispatcher is the node that sends
operations to Scalaris nodes. We have decided to run the dispatcher on a different
machine from the Scalaris nodes because, later, we run our Bwitter nodes on dedicated
machines. Indeed, we believe the overhead of Bwitter could perturb the execution of
Scalaris, which is already quite heavy.
Our guess is that the conflict level (conflictLevel) plays an important role in the
optimal number of connections. We define the conflictLevel of a set of operations as
the probability that a random pair of operations in the set conflict if they occur at the
same time. Having more connections therefore increases the probability that two
conflicting operations occur at the same time, leading to their failure.
We designed a benchmark with a fixed number of nodes (we chose 18, the maximum
number of nodes we could launch in the test environment we were provided) and some
predefined conflict levels. We made the number of connections vary for each value of
conflictLevel. The benchmark consists of 20000 random operations, with as many
reads as writes, operating on a random key inside a given pool of keys. The value
written is always the constant String “test”. The conflictLevel is inversely
proportional to the number of keys on which we work: the smaller the number of
different keys, the higher the chance that two parallel operations work on the same
key and thus conflict. We believe that 20000 operations are enough for small
variations not to influence the overall results.
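A workload of this shape is easy to reproduce. The sketch below is our own code, not taken from the Bwitter benchmark: the textual operation encoding and the assumption that the key-pool size is simply 1/conflictLevel (the proportionality constant is unknown to us) are ours.

```java
import java.util.Random;

// Sketch: generate a half-read, half-write workload on a random key pool
// whose size controls the conflict level -- fewer keys, more conflicts.
public class WorkloadGenerator {
    static final String VALUE = "test";   // the constant value written

    // Assumed pool size for a desired conflict level (inverse proportionality).
    static int poolSize(double conflictLevel) {
        return (int) Math.max(1, Math.round(1.0 / conflictLevel));
    }

    static String[] generate(int nbrOps, double conflictLevel, long seed) {
        Random rnd = new Random(seed);
        int keys = poolSize(conflictLevel);
        String[] ops = new String[nbrOps];
        for (int i = 0; i < nbrOps; i++) {
            String key = "key" + rnd.nextInt(keys);
            // alternate reads and writes so the mix is exactly 50/50
            ops[i] = (i % 2 == 0) ? "read " + key : "write " + key + " " + VALUE;
        }
        return ops;
    }
}
```

With this encoding, a conflictLevel of 0.02 corresponds to a pool of 50 distinct keys shared by all 20000 operations.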
We decided that the best connection strategy is one that is symmetric with respect
to the nodes. This makes sense, as each node is supposed to be equivalent to the
others.
Mathematically speaking, this means that:

|number of connections to n1 − number of connections to n2| ≤ 1,   ∀ n1, n2 ∈ set of nodes
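One simple way to satisfy this constraint is a round-robin assignment of connections to nodes. The sketch below is our own illustration (hypothetical names, not the dispatcher's actual code); it guarantees that any two nodes' connection counts differ by at most one.

```java
// Sketch: distribute connections over nodes so that the symmetric
// connection-strategy constraint |c(n1) - c(n2)| <= 1 always holds.
public class ConnectionBalancer {
    // Returns how many connections each of nbrNodes nodes receives.
    static int[] assign(int nbrConnections, int nbrNodes) {
        int[] perNode = new int[nbrNodes];
        for (int c = 0; c < nbrConnections; c++)
            perNode[c % nbrNodes]++;   // open connections in round-robin order
        return perNode;
    }
}
```

For instance, 25 connections over 18 nodes give 7 nodes two connections each and 11 nodes one connection each.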
We apply this symmetric connection strategy in all our tests. Please also note that,
in order to avoid side effects, we shut down the whole ring between runs and start
with a fresh ring each time. This test uses small Amazon instances for the dispatcher
and the Scalaris nodes. The results are summarised in Figure 6.1 and Figure 6.2.
Figure 6.1: Read and write throughput with respect to the number of connections.
As we can see in Figure 6.2, the conflictLevel has a clear impact on performance.
The number of failed operations increases with the conflictLevel, leading to a lower
throughput, as seen in Figure 6.1. The number of failed operations also increases with
the number of connections: clearly, having fewer connections lowers the number of
failed operations in an environment where operations can conflict.
We can distinguish two parts in Figure 6.1: the part before we reach as many
connections as nodes, and the part after; we call this rupture the break point. In the
first part (except for a conflictLevel equal to 0.1), the number of operations per
second increases almost linearly with the number of connections. We thus deduce the
following property: in normal conditions, where the conflictLevel is not tremendously
high, it is necessary to use as many connections as nodes in order to fully take
advantage of those nodes' power.
Figure 6.2: Failure percentage with respect to the number of connections.
In the second part, the throughput varies with the conflictLevel. When the conflict
level is low, the throughput increases with the number of connections up to a certain
point and then eventually decreases again, below the value measured at the break
point. We believe the increase is due to a heavier load on the Scalaris nodes, while the
decrease can be explained by the growing number of failures observed. Having only
one dispatcher may also not be enough: the network capacity of Amazon small
instances is only moderate, and the traffic towards Scalaris nodes increases with the
throughput. It is thus possible that we have reached the maximum throughput for one
dispatcher. Finally, the throughput does not increase with the number of connections
in cases of very high conflict levels. For example, with a conflictLevel equal to 0.02,
the throughput drops directly after the break point. Concerning the line with a
conflictLevel equal to 0.1, the throughput increase stops even before the break point
and then decreases steadily. This indicates that, if the conflictLevel is really high,
the optimal number of connections is below the number of nodes, despite the fact that
Scalaris could handle more parallel requests.
We thus conclude that, up to a given level of conflict between operations, we must
use at least as many connections as there are nodes. Using more connections also
increases the throughput, but not as drastically, and it depends on the environment in
which we are working.
Connection strategy conclusion: In the light of these tests, we have shown
the crucial influence of the number of connections as well as of the conflictLevel on
the throughput and failure percentage. In a highly conflicting environment, it
might be a good idea to reduce the number of connections a little. However, when
operations conflict rarely, a higher number of connections can significantly increase
performance, because it allows putting a higher load on Scalaris.
Choosing the right number of connections is really difficult, as it requires estimating
the conflict level, which is an application-dependent parameter. Moreover, the results
could have been different for another number of nodes. We finally conclude that we
must use at least as many connections as there are nodes: in most practical situations
the conflictLevel is not high enough to justify going under this number.
Scalability test
Scalaris is claimed to be a scalable system. Although we could have taken this claim
for granted, we wanted to verify it in our own environment, as it is really important
for understanding the next tests.
First scalability test with one dispatcher and small instances: We performed
20000 writes on random keys and then read each of the keys we just wrote. The con-
flictLevel should be close to 0 as keys are chosen randomly using the Math.random()
function from java. We measure the time taken for all the writes and reads with respect
to the number of Scalaris nodes. We make the number of Scalaris nodes vary from 4 to
18, maintaining only one connection per node. As for the connection test, we use small
instances from Amazon for all the nodes (dispatcher and Scalaris nodes). The results
can be found in Figure 6.4.
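The structure of this benchmark can be sketched as follows. This is our own illustration, not the actual test harness: a HashMap stands in for the Scalaris connection, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the scalability benchmark: write N random keys, then read
 *  them all back, timing the whole run. A HashMap stands in for Scalaris. */
public class ScalabilityBenchmark {
    public static void main(String[] args) {
        final int nbrOperations = 20000;
        Map<String, String> store = new HashMap<>(); // stand-in for Scalaris
        String[] keys = new String[nbrOperations];

        long start = System.currentTimeMillis();
        for (int i = 0; i < nbrOperations; i++) {
            // Random keys make the conflictLevel close to 0.
            keys[i] = Double.toString(Math.random());
            store.put(keys[i], "value-" + i);
        }
        for (String key : keys) {
            store.get(key);
        }
        long elapsed = System.currentTimeMillis() - start;

        // Throughput in operations per second (writes + reads).
        double throughput = 2.0 * nbrOperations / Math.max(elapsed, 1) * 1000;
        System.out.println(throughput > 0);
    }
}
```
In the real test, the writes and reads go through the dispatcher's connections to the Scalaris nodes, which is where the number of connections comes into play.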
We can clearly observe that the throughput increases with the number of nodes.
It seems to increase more slowly when the number of nodes becomes higher.
Indeed, of the 70% throughput increase we observe between 4 and 18 nodes,
45% is already obtained between 4 and 8 nodes.
Second scalability test with one dispatcher and medium instances: We were
surprised by the slowdown at the end of the last test. Our assumption is that the
small instances are not powerful enough to handle a ring of that size. We thus reran
the test with medium instances for the Scalaris nodes, the other parameters remaining
the same. The results of this test can be found in Figure 6.4.
We can see a general improvement in performance with more powerful machines,
but again a decrease in scalability with a higher number of nodes. However, this
decrease is not as marked and happens a few nodes later than in the previous case,
around 10 nodes instead of 8. The performance of the machines certainly plays a
role but is probably not the main cause of this decrease. Our guess is that
networking delays come up because we only use one dispatcher.
Figure 6.3: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for small instances and conflict level of 0.
Figure 6.4: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for small and medium instances and conflict level of 0.
Third scalability test with 2 dispatchers and small instances: Looking at our
logs, we noticed that the time the nodes spend waiting for a new job once they have
finished the previous one has an impact on scalability. This time increases with the
number of nodes in the ring as the dispatcher must keep more nodes busy. Networking
delays are thus probably the source of this problem. We now want to measure the
magnitude of this impact. Our idea was to add another dispatcher in order to increase
the load on the Scalaris nodes. We performed a series of tests to measure the impact
of having two dispatchers instead of one.
In the first series of runs we have one dispatcher maintaining two connections with
each Scalaris node, while in the second series we have two dispatchers each
maintaining one connection per Scalaris node. Note that we use two connections in
the first case because we want the same number of parallel requests in both tests.
In order to widen our view of the Scalaris behaviour, we opted for a conflictLevel
equal to 0,004. We thus do 20000 Scalaris operations in total, 20000 for the single
dispatcher and 10000 for each dispatcher when we use two, with as many reads
as writes, and make them overlap. Our results can be found in Figures 6.5 and 6.6.
Figure 6.5: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one and two dispatchers on small instances and conflict level of 0,004.
As we can see in Figure 6.5, the throughput does not seem to be much affected
by the addition of a second dispatcher, even though we can notice a clear difference
once the ring has more than 8 nodes. The difference, however, seems too small to
conclude that the scalability issues are due to the increasing time the nodes spend
waiting. Surprisingly, we see in Figure 6.6 that the failure percentage is always
higher with the single dispatcher.
Figure 6.6: Fail percentage for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one and two dispatchers on small instances and conflict level of 0,004.
Final scalability test with 4 dispatchers and small instances: Finally, we want
to see if a single dispatcher with higher network capacity is a better choice than
several small dispatchers with medium network capacity. We invite you to consult
Table 6.1 to recall the specifications of small and large instances. As you can see,
the large instance offers far better performance than the small instance in every
domain. We decided to make a final test with a really small conflictLevel equal to
0,00007 and made the number of nodes vary from 8 to 16. Once more we chose another
conflict level in order to widen our view. We again do 20000 operations in total:
with a single dispatcher it performs all 20000 operations, while with 4 dispatchers
each performs 5000 operations. The single dispatcher has 4 connections per node
and the 4 dispatchers use one connection per node. Every dispatcher is connected to
every Scalaris node. The results are shown in Figure 6.7. We do not show the failure
percentages because they are nearly equal to 0 and their variation is not relevant.
Our first observation is that the performance of all the configurations increases
linearly, meaning that they all scale correctly. We can then observe that using a
small or a large dispatcher has no effect on the performance. This means that a small
instance is powerful enough to manage at least 16 × 4 connections to Scalaris, and
that there are special conditions in the Amazon cloud that limit the networking
performance with one dispatcher. We believe the 4 small dispatchers outperform the
other two configurations because they can send new jobs to the Scalaris nodes more
quickly than a single dispatcher can. This confirms the results we obtained in the
previous test.
Figure 6.7: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one small, one large and four small dispatchers and conflict level of 0,00007.
We thus reach the conclusion that, while increasing the number of connections to
Scalaris can increase the performance, it is sometimes necessary to have several
dispatchers to put enough load on Scalaris. We can finally observe that with 4
dispatchers the throughput approximately doubles as the number of nodes doubles,
indicating really good scalability.
Comparison with the Scalaris developers' scalability tests: We discussed
with Florian Schintke, a member of the Scalaris development team, about their
scalability tests. They use a different approach from ours and do not perform any
conflicting operations. For instance, they make the number of nodes vary and have
10 clients per node. Each client begins by initializing a random key and then does
1000 increments on this key. The probability of conflict between operations is thus
infinitesimally small. They also used more powerful machines than ours and were not
working on the Amazon cloud. Figure 6.8 shows one of the results Florian Schintke
sent us.
We can clearly see that Scalaris scales correctly. However, their tests are rather
different from ours for several reasons. First, they use a completely different
infrastructure. Secondly, most of our tests work with a conflictLevel, which is
important for us as we know that Bwitter will obviously work with conflicting values.
Finally, we do not run our dispatcher on the same machine as the Scalaris nodes. We
believe it is not realistic for us to have the Bwitter nodes (the equivalent of the
dispatcher in our tests) directly on the Scalaris nodes, as this would perturb
Scalaris nodes that can potentially already be under high load. Furthermore, we would
reduce the benefit gained from the cache by having more Bwitter nodes.
Figure 6.8: Increment Benchmark test of the Scalaris developers.
Final words on scalability and the connection strategy: We have concluded
that Scalaris is scalable, as the performance clearly improves with the number of
nodes. We explain the performance slowdown at high numbers of nodes by the fact
that the load we put on the Scalaris nodes is not high enough.
To increase the load we have three possibilities: increase the number of connections,
use several dispatchers, or improve the networking performance of the environment.
Using several dispatchers gives slightly better results than having only one.
Therefore, we believe that beyond a certain number of connections managed by a
dispatcher it is a good idea to add another one to get better scalability. We were
limited in the number of machines at our disposal to do all the tests we wanted.
We believe the results would have been more explicit if we could have reached a
higher maximum number of nodes.
Scalability is also limited by the conflictLevel. The higher the conflictLevel, the
fewer connections and parallel requests we can use without the number of failures
exploding, as shown by the connection test.
Elasticity test
Test description: Until now we worked with a constant number of nodes during each
test. In order to react to flash crowds, we need Scalaris to be elastic enough so
that the throughput can be increased quickly. The detection of the flash crowd is
not part of the test and we consider that the flash crowd starts at the beginning
of the test. Afterwards, we have to decide what the best strategy is to handle this
flash crowd. To determine it, we observe the throughput as well as the failure
percentage during the whole test. The final throughput reached is also important
for us, as well as the total number of operations performed during the whole test,
in order to determine which behaviour is the best during the churn period.
Parameters: We have observed that Scalaris scales well from 6 to 18 nodes, and we
are going to test different ways to get from 6 to 18 nodes under high load. We will
use one dispatcher to dispatch a constant number of parallel requests to Scalaris.
This dispatcher is also responsible for adding the new nodes to the ring. Note that
it takes between 45 and 200 seconds to start a new node using the Amazon API. The
dispatcher periodically samples the number of operations correctly done as well as
the number of failures. This allows us to plot the evolution of the throughput and
failure percentage with respect to time. We now present the different strategies we
will try. Each strategy is defined by a number of nodes to add at each adding point
and a constant time between adding points. For each strategy we wait one minute
before adding the first node so that we can observe what is happening before and
after.
(1) We do nothing in order to have a standard measure to compare with the other
results.
(2) One node added after one minute and then no more.
(3) One node added every minute until we reach eighteen nodes.
(4) Two nodes added every minute until we reach eighteen nodes.
(5) Two nodes added every two minutes until we reach eighteen nodes.
(6) Six nodes added every five minutes until we reach eighteen nodes.
(7) Twelve nodes added after one minute.
We believe that with those strategies we have covered almost all possible behaviours:
doing nothing, adding nodes regularly, and adding lots of nodes at the same time but
waiting longer before the next addition. We must point out that those strategies are
targets; it may not be possible to add nodes as quickly as planned, so we will most
probably observe jitter in the node starting times. We summarize below the
parameters of the test.
• 1 connection per node
• nbrInitialData = 2000
• 15 minutes of test
• conflictLevel = 1/250 (so all the operations work on a pool of 250 keys)
• 6 nodes running initially
• 1 minute before adding the first node(s)
• Large instance dispatcher
• Small instance Scalaris nodes
• Successful and failed operations sampled every 20 seconds
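The conflictLevel parameter above can be realized by drawing every operation's key from a fixed pool whose size is the inverse of the desired conflict level. A minimal sketch of this idea follows; the class and method names are ours, not from the Bwitter code.

```java
import java.util.Random;

/** Draws keys from a fixed pool so that concurrent operations collide
 *  with a tunable probability (hypothetical helper, not Bwitter code). */
public class ConflictingKeyGenerator {
    private final String[] pool;
    private final Random random = new Random();

    /** conflictLevel = 1/poolSize, e.g. 1/250 gives a pool of 250 keys. */
    public ConflictingKeyGenerator(int poolSize) {
        pool = new String[poolSize];
        for (int i = 0; i < poolSize; i++) {
            pool[i] = "key-" + i;
        }
    }

    public String nextKey() {
        return pool[random.nextInt(pool.length)];
    }

    public static void main(String[] args) {
        ConflictingKeyGenerator gen = new ConflictingKeyGenerator(250);
        System.out.println(gen.nextKey().startsWith("key-"));
    }
}
```
With 250 keys and many parallel requests, two concurrent transactions regularly touch the same key, which is exactly the contention this test is designed to exercise.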
According to the Scalaris developers, at the time of writing, nodes buffer the
requests that arrive while they are being inserted in the ring and start responding
to them as soon as they are correctly inserted. The parameter nbrInitialData is
special: it is aimed at simulating pre-existing content on the Scalaris nodes.
Indeed, in order to maintain the replication factor, new Scalaris nodes must
retrieve the values they become responsible for when they are added to the ring.
This adds an overhead during each insertion of nodes in the ring. We wanted to take
this overhead into account and be able to tune it with the parameter nbrInitialData.
Before the test starts we add nbrInitialData key/value pairs to the ring. The keys
are random and the value is always the same: a constant String of 360448 random
characters. We have chosen nbrInitialData equal to 2000, which means that quite a
lot of data must be transferred to the Scalaris nodes before starting the test. We
have observed that the initialization phase takes approximately 5 minutes. We have
several tasks running on the dispatcher: one responsible for checking that
operations are correctly done, one that sends time statistics, and the management
of the Scalaris Connection Manager and the Nodes Manager, which are both heavy
tasks. This is why we have chosen to use a large dispatcher.
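The initialization step can be sketched as below. This is an illustration only: a HashMap stands in for the Scalaris ring, and the class name is ours.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Sketch of the nbrInitialData initialization: store key/value pairs with
 *  random keys and one constant large value (a map stands in for Scalaris). */
public class InitialDataLoader {
    public static void main(String[] args) {
        final int nbrInitialData = 2000;
        final int valueLength = 360448;

        // One constant value of 360448 random characters, built once.
        Random random = new Random();
        StringBuilder sb = new StringBuilder(valueLength);
        for (int i = 0; i < valueLength; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        String constantValue = sb.toString();

        Map<String, String> ring = new HashMap<>(); // stand-in for Scalaris
        for (int i = 0; i < nbrInitialData; i++) {
            ring.put(Double.toString(Math.random()), constantValue);
        }
        System.out.println(ring.size() <= nbrInitialData
                && constantValue.length() == valueLength);
    }
}
```
Because the value is large, every node joining the ring afterwards must transfer a sizable share of this data to take over its key range, which is the overhead the parameter is meant to model.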
Remarks on the test environment: Before getting to the results, we make two
general remarks. First, the Amazon cluster is unstable; sometimes some machines are
not reachable (ping not working). Secondly, the Ubuntu AMI we are using sometimes
does not initialize the SSH keys correctly and SSH is thus not working; we spotted
this problem really late and could not correct it, as it would have required
modifying the AMI we used and it was too late to redo all the tests. When faced
with one of those two problems, we are forced to reboot the machine on the fly,
which is quicker than launching a new one but still takes some time and CPU. We
consider this overhead part of the test. This was not a problem in the previous
tests, as the launching of the ring was done during the initialisation phase. This
is indeed the first test where we need to launch a new machine at run time.
Scalaris elasticity test results: Figure 6.9 shows the evolution of the throughput
for the different strategies; the numbers in the legend of the graph correspond to
the numbering of the strategies presented above. The throughput is computed from
the data collected every 20 seconds: the throughput at time x is the average
throughput between x-20 seconds and x. Blue points on the graph mark the moments
when we begin to start new instances; note that the number of instances started
depends on the strategy. Red points mark the moments when Scalaris is started on
the nodes, that is, when the command is launched, not when the node is effectively
inserted in the ring.
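The windowed throughput computation described above can be sketched as follows; the class and method names are ours, and the sampler works on cumulative success counters as the dispatcher would record them.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the dispatcher's periodic sampling: every 20 seconds we record
 *  the operation counters and derive the throughput of the last window. */
public class ThroughputSampler {
    static final int SAMPLE_PERIOD_SECONDS = 20;

    /** Converts cumulative success counts into per-window throughputs:
     *  the value at time x is the average over [x-20s, x]. */
    static List<Double> throughputs(long[] cumulativeSuccesses) {
        List<Double> result = new ArrayList<>();
        for (int i = 1; i < cumulativeSuccesses.length; i++) {
            long opsInWindow = cumulativeSuccesses[i] - cumulativeSuccesses[i - 1];
            result.add((double) opsInWindow / SAMPLE_PERIOD_SECONDS);
        }
        return result;
    }

    public static void main(String[] args) {
        // Example: counters sampled at t = 0s, 20s, 40s, 60s.
        long[] samples = {0, 4000, 9000, 15000};
        System.out.println(throughputs(samples)); // ops/s per 20s window
    }
}
```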
We first comment each strategy separately.
(1) During this strategy we do not add any node and thus keep the ring size at the
initial value of 6 nodes. As you can see the throughput stays stable during the
whole test.
(2) In this strategy we add only one node. We start the adding procedure 60 seconds
after the beginning of the test. We can see that this procedure has an impact on
the performance. Indeed, the graph shows that the throughput decreases during
the insertion, but this is not due to Scalaris churn: Scalaris is not started on the
node before 120 seconds. At that time, the throughput increases by a small amount
and stays stable until the end of the test. The node was thus quickly operational
and we did not notice a performance drop after Scalaris started.
(3) We tried to add one node every minute but could only start 6 of the 12 nodes
planned. Indeed, launching one node correctly takes an amount of time that varies
from approximately 45 seconds to 3 minutes. The throughput is more chaotic as we
regularly add nodes, and, as we saw in the previous strategy, the throughput drops
between the moment we start inserting a new node and the moment Scalaris is
effectively started on it. Again, as observed in strategy (2), the throughput
increases directly after Scalaris is started on the node. Finally, we could not
observe the stabilization because nodes are added too regularly and not all the
nodes were added.
(4) Here we add two nodes every minute. This time we could add 10 nodes out of 12.
The throughput once again increases regularly with the addition of nodes while
being perturbed by it. It finishes at a higher value than (3) simply because more
nodes could be reached by the end. The throughput reaches a pretty high value but
is not stable at the end of the test.
(5) We could only add 6 nodes here, which is nearly the same as with strategy (3).
However, here we added two nodes at a time (where we added one in (3)) and waited
twice as long between additions (120s instead of 60s). We can observe some
periodicity in the additions and see that this strategy regularly reaches the same
throughput as the third one. This is confirmed at the end of the test, where both
eventually reach the same throughput with the same number of nodes.
(6) We increased the number of nodes per addition to 6. We first add 6 nodes at 60s;
they were ready at 160s and we directly see a high increase in the throughput and
a quick stabilization. We observe the same behavior at the second addition of 6
nodes and finally reach a stable throughput around 560 ops/s. Surprisingly, this
strategy does not reach the same throughput as (7). Indeed, the last node addition
was done at 560s and, as we have seen, the throughput stays stable from that time
on, showing no indication that it will ever increase. Our guess is that the physical
placement of the machines creates special conditions limiting the number of
messages that can be exchanged between nodes and lowering the throughput. This is
indeed possible as each test is run with different nodes.

Figure 6.9: Throughput with respect to time for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.

Figure 6.10: Failure percentage with respect to time for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.
(7) In this last strategy we add 12 nodes directly at 60s; Scalaris is started on
those nodes at 120s. Between 60s and 120s, we see a drop in throughput that seems
proportional to the number of nodes added, which is normal as the amount of work
involved in starting nodes grows with their number. This drop is of about 25% of
the throughput. However, as soon as the node startup is finished and Scalaris is
booted, the throughput explodes and quickly reaches a stable value at 630 ops/s.
We can confirm that this value corresponds to the stable throughput for 18 nodes:
it is indeed close to the average throughput of 650 ops/s we obtained in the
connection strategy test of Section 6.2.2.
We now summarize the results obtained by observing the throughput evolution for
each strategy. First, we notice that during the adding period (during which we
launch the nodes on Amazon, periodically call the Amazon API to check the instance
states, send the necessary files, and retrieve from the nodes the information
necessary to launch Scalaris) the performance is lowered by a factor proportional
to the number of nodes. However, launching several nodes at the same time is less
time consuming, as Amazon starts all the nodes in parallel and the time waited per
node is thus divided by the number of nodes. Secondly, after Scalaris is started on
the nodes, and despite our fairly large initial data, the nodes are almost instantly
ready to operate: in all the strategies the throughput increases directly after
Scalaris is started. We believe this is because the amount of initial data is too
small to observe any performance drop. Moreover, this throughput is quite stable.
We must also note that several strategies could neither reach 18 nodes nor
stabilize because the test was too short. This is not a problem, as other strategies
have already shown better results and reached the best stable state possible,
namely (7); the conclusions would thus not have been different. Finally, we decided
that the last strategy was the best according to the throughput evolution, as it
allows us to quickly reach a very high and stable throughput with only minor
disturbance.
We now look at the average throughput of each strategy during the test in
Figure 6.11. This criterion is important in order to know which strategy maintains
the best average service during the 15 minutes we have to react to the flash crowd.
It is obvious that the last strategy outperforms the others, which is not surprising
given the evolution of the throughput we just observed. We still have to look at
the failure percentage evolution, as it may give some indication of Scalaris's
instabilities.

We can see in Figure 6.10 that the failure percentage grows with the number of
nodes. As for the throughput, we observe an increase in the failure percentage
after nodes are added which is proportional to the number of nodes added. This is
what we observed in all our tests: increasing the number of connections increases
the number of failures. There is thus no reason to penalize the solutions with
higher failure percentages.
Figure 6.11: Mean throughput results for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.
Conclusion: We conclude that the best strategy is to add all the nodes at the same
time: it is the quickest way to increase the throughput, it gives the best average
throughput over 15 minutes, and it does not present a failure percentage higher
than usual for this number of connections. The results are very encouraging, as it
was indeed possible to go from 6 to 18 nodes in only two minutes with only a loss
of approximately 25% while the nodes were starting. Moreover, as soon as Scalaris
is started on the nodes, the throughput reaches a value close to what we obtained
before in Section 6.2.2. It would have been interesting to test with higher values
of nbrInitialData to try to observe a loss of performance during the insertion of
Scalaris nodes in the ring, but we lacked the time to perform those tests.
6.3 Bwitter tests
Now that we have looked at the performance of Scalaris, we can study Bwitter with
those results in mind. As explained previously, we have implemented two different
approaches: the pull and the push. We are going to test and comment on both in this
section. However, we will focus on the push approach, as it is the one we finally
selected as the best approach; a later section will be dedicated to the pull
approach. Therefore, unless we explicitly specify otherwise, we are talking about
the push approach.
We will start by showing the impact of the application cache we use to solve the
popular value problem. We then make a test to show the influence of
nbrOfFollowersPerChunk, the number of followers per chunk of the Topost set.
Then we test the scalability and elasticity of the system we have implemented.
6.3.1 Experiment measures discussion
In this section we explain which data we measured during our tests, in order to
clarify the rest of the experiment section.
Measures taken
The following tests are aimed at determining the best design and parameter choices.
We thus want to measure the performance of the different configurations we propose,
but we are also interested in determining how successfully the operations were
performed. We do two types of operations, reading and posting tweets, which have
different success conditions and restart strategies. We detail them below.
First, we discuss the tweet posting operation. This operation is considered to fail
only when the first step of the algorithm fails. Indeed, performing this step
correctly ensures that the tweet will eventually be posted to all the lines,
assuming the recovery mechanism is triggered or another tweet is posted by the same
user. If the first step fails, we restart the operation at the test level and do
not count it as another operation; if any of the remaining steps fails, we do not
trigger the recovery mechanism. This means that all the tweets posted during the
tests are always stored in the system but might not be posted to all the lines.
However, we have noticed that a negligible number of SRs have aborted, which
indicates that tweets are successfully posted to the lines most of the time.
Secondly, concerning the reading of tweets, we do not abort the whole operation if
one tweet is not available. This should almost never happen because, as shown in
the previous tests, concurrent reads do not conflict. Moreover, tweets are
frequently read from the cache, lowering the probability of failure even more. We
restart the operation only if an error occurs when accessing the line containing
the tweet references.
We now describe the most relevant measures we took during our tests. We took more
measures in order to help us understand some results and to verify that everything
was working correctly; however, those mostly do not help in understanding the
results and would only clutter the text.
• Time:
We measure the total time in milliseconds needed to perform the requested number
of operations.
• SR run:
This is the number of SRs that were performed during the whole test. Indeed, the
tweet posting operations are split into various SRs. We take this measure in order
to compare it with the number of restarted SRs and the number of aborted SRs. A
restarted SR is not counted in the SRs run.
• SR restarted:
This is the number of SRs that were restarted by Scalaris Workers; remember that
they restart an SR a given number of times, which we have fixed to 10, before
aborting. We use this value in conjunction with the SRs aborted and the SRs run in
order to compute the failure percentage.
• SR aborted:
This is the number of SRs that were aborted by Scalaris Workers. When an SR is
aborted, the Bwitter operation that created the SR gets an exception. If the number
of aborted SRs is low, we can be sure that the Bwitter operations were successfully
performed. In fact, the number of aborted SRs is extremely low: we got
approximately two aborted operations in total during the tests presented here.
This is mainly due to our aggressive restart strategy: as just stated, we restart
a failed operation 10 times before aborting it. We thus do not present this measure
in our results.
• Cache hits:
This indicates the number of times a read was successfully performed from the
cache. Each cache hit avoids a transactional read on Scalaris.
• Cache miss:
This indicates the number of times the cache was accessed and no entry was found.
This is usually pretty low compared to the cache hits, as we frequently access the
same data because the simulated network is small.
You could wonder why we did not measure the failures at the Bwitter level. In fact,
we did not get any failure of any Bwitter operation during our tests. We thus
decided to measure the failures at the layer below, the Scalaris Connection Manager
layer. This measure is precise enough to compare the degree of failures between the
different tests. As was the case with Scalaris, we represent our results in terms
of throughput and failure percentage.
• Throughput:
Our Bwitter tests generally consist of a given number of operations. By an
operation we mean one of the two described above: posting a tweet or reading
tweets. Depending on the test settings, those operations can be more or less heavy.
The throughput is the number of operations per second achieved by the tested
configuration. We believe this is the best way to determine which configuration is
best for a given test, as it fairly measures the global throughput of the whole
system.

Throughput = (number of Bwitter operations successfully performed) / (measured total time)
• Failure percentage:
The failure percentage is the number of restarted SRs divided by the total number
of SR executions (SRs run plus SRs restarted). We only take the restarted SRs into
account because, as said, the number of aborted SRs is negligible.

Failure percentage = SRs restarted / (SRs run + SRs restarted) × 100
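The two metrics can be computed directly from the raw counters, as sketched below; the class and method names are ours, mirroring the measures just described.

```java
/** Sketch of how the two reported metrics are derived from the raw counters
 *  (the names are ours, mirroring the measures described above). */
public class TestMetrics {
    /** Operations per second over the whole test. */
    static double throughput(long successfulOperations, long totalTimeMillis) {
        return successfulOperations / (totalTimeMillis / 1000.0);
    }

    /** Restarted SRs relative to all SR executions, in percent;
     *  aborted SRs are negligible and thus ignored. */
    static double failurePercentage(long srsRun, long srsRestarted) {
        return 100.0 * srsRestarted / (srsRun + srsRestarted);
    }

    public static void main(String[] args) {
        // Example: 20000 operations in 40 seconds, 500 restarts for 9500 runs.
        System.out.println(throughput(20000, 40000));     // 500.0 ops/s
        System.out.println(failurePercentage(9500, 500)); // 5.0 %
    }
}
```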
Measuring the time
The time is always measured with the System.currentTimeMillis() method from Java.
We use the following code pattern to measure the time taken by a piece of code.

long timeAtStart = System.currentTimeMillis();
codeWeWantToProfile();
long executionTime = System.currentTimeMillis() - timeAtStart;
This method does not take into account that we are working in a concurrent
environment. Imagine we want to measure the time taken by one operation. When the
test consists only of operations of the same type, this is not a problem: we can
measure the total time of the test and divide it by the number of operations
performed. However, if we mix operations of different types (for example posting
tweets and reading tweets) we cannot use this method. Indeed, a posting thread can
be preempted by a reading thread, and some time spent in the reading thread will
then be accounted to the preempted posting thread. We did not solve this problem
and therefore did not measure the time taken by a single operation. Ultimately, we
are more interested in the time taken to perform a given number of operations than
in the mean time of one type of operation.
6.3.2 Push design tests
The parameters
All the tests are based on a simulation of Bwitter's use. Between tests we restart
Bwitter and Scalaris in order to avoid side effects from previous tests. This is
time consuming because Scalaris is not persistent and we need to initialize Bwitter
with some data so that the tests are as realistic as possible.
We have two phases: the initialization phase and the main phase. In the first
phase, we create the users and one line for each of them, and we add the owner of
the line to it. We also add a number of followers to each line in order to simulate
social connections. We use a hash function to choose which users a given user
should follow. Finally, each user posts some tweets to create data on the lines.
This phase is never taken into account in the results we present. In order to have
comparable results, the initialization phase is exactly the same for all the tests.
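A hash-based follow relation of this kind can be sketched as follows. This is our own illustration with a hypothetical hash; the point is that the graph is deterministic, so every test run starts from the same social network.

```java
/** Sketch of a deterministic follow relation for initialization: a hash of
 *  (follower, i) picks which users each user follows. This is our own
 *  illustration, not the exact function used by Bwitter. */
public class FollowGraphBuilder {
    /** Returns the id of the i-th user followed by the given user. */
    static int followedUser(int userId, int i, int nbrUsers) {
        // Simple deterministic hash; skip over the user itself.
        int target = Math.floorMod(31 * userId + 17 * (i + 1), nbrUsers);
        return target == userId ? (target + 1) % nbrUsers : target;
    }

    public static void main(String[] args) {
        int nbrUsers = 2000, usersFollowed = 50;
        // User 0 follows the same 50 users on every run (reproducible tests).
        boolean deterministic = true;
        for (int i = 0; i < usersFollowed; i++) {
            deterministic &= followedUser(0, i, nbrUsers)
                    == followedUser(0, i, nbrUsers);
        }
        System.out.println(deterministic);
    }
}
```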
In the second phase, we do the two kinds of operations previously described: post
a tweet and read tweets. We decided to only read the tweets contained in the head
chunk, as this is what users usually want to access. The second phase is finished
after a predefined number of operations have been successfully performed. We fixed
this number to 20000 because, as was the case for Scalaris, we feel that 20000
operations are significant enough that small variations do not influence the
overall results.
The throughput is computed based on this phase. In contrast to the first phase,
this second phase is not static: the operations are performed in a different order
each time the test is run, and the number of operations of each type varies a
little. We made this choice because we wanted to avoid creating an artificial
pattern by fixing the order of the operations, and because we believe it is the
best way to simulate real use of Bwitter. Below we detail the parameters we use
for the social network simulation, Scalaris and Bwitter. Some values are fixed and
others are variable; we will not repeat the fixed parameters in each test.
Therefore, if you need more information about a particular parameter, please refer
to this section. In the tests we only detail parameters that are not fixed or that
differ from the values given here.
We could not find any precise numbers about Twitter's use. We thus decided to
create two different social networks that, according to us, should be close to
reality. The parameters associated with those two configurations are given in
Table 6.2.
                                     Heavy network   Light network
Number of users                      2000            4000
Lines per user                       1               1
Users followed                       50              25
Tweets per user at beginning         1               1
Users followed / Number of users     0,025           0,00625

Table 6.2: Social network parameters, part 1.
It is not possible to simulate a network as big as Twitter; we were thus forced to
simulate a smaller network. However, the initialization phase for those two networks is
already quite long. The names we have chosen for those two networks are significant: the
heavy network is denser than the light one. The heavy network overestimates the
real complexity of a network like Twitter in order to avoid presenting better results than
a real-world network would give. Indeed, we have chosen nbrUsers and nbrFollowers in
order to have a dense network, which complicates the task of Bwitter. You can notice
that the ratio (users followed / number of users) is quite high. This ratio of 0,025
means that each user follows 2,5% of all the users in the network, which implies a quite
high level of conflict between concurrent operations. This ratio is the equivalent of the
conflict level in the Bwitter tests.
We believe the light network is closer to reality, because it is absurd to imagine
that each user follows 2,5% of the users in the network. We thus designed this
other network with a smaller ratio (users followed / number of users), equal to
0,00625, to see how our application reacts to different levels of conflict. We now detail
the parameters related to Scalaris, grouped in Table 6.3.
Scalaris node type            Small instance
Number of Scalaris nodes      Varies from 4 to 18
Connections per node          Usually one, can vary during the tests
Number of trials per SR       10
Number of parallel requests   Usually 20, varies with the total number of
                              connections to Scalaris nodes

Table 6.3: Scalaris parameters.
We can use a maximum of 20 nodes during the experiments, taking into
account both Scalaris nodes and Bwitter nodes; however, we use at most 19 for
historical reasons. In order to maintain a high load during all our tests, we constantly
make 20 operations in parallel. If we use a higher number of connections per node, we
increase this value so that it is always higher than the number of connections to Scalaris
nodes. Finally, we have configured the Scalaris Connection Manager so that each SR
is retried 10 times before being aborted. We now present the Bwitter parameters
grouped in Table 6.4.
Dispatcher / Bwitter node type Small or Large instance
tweetchunksize 30
nbrOfFollowersPerChunk 20
Table 6.4: Bwitter application parameters.
We have two Bwitter application parameters to fix, namely tweetchunksize
and nbrOfFollowersPerChunk. They are likely to have an impact on the results
of our tests, as they influence the number of tweets read and the number of
operations involved in a write. We have chosen them so that the first tweet chunk
contains a decent amount of tweets, in order to have relevant tests. With a value
of 30, we estimate the number of tweets in the head chunk at the start of the test to
be 20. Indeed, each user should have around 50 tweets in his line after the initialisation
phase.
Real system with stars and fans
In order to stick as closely as possible to reality, we have decided to populate our
system with two kinds of users: stars and fans. Indeed, in Twitter, some users have
far more followers than users they follow, while the others follow more people than
they have followers. We fixed the number of stars in the system at 10%, the rest of the
users being fans. For each user in the system, 75% of the users he follows are stars. An
example of a simulated network can be seen in Figure 6.12.
Figure 6.12: Simulated social network with social connections between users, each user follows 3 users. Left) Random following pattern. Right) Nodes 2 and 4 are stars and each user has a 2/3 probability per connection to follow a star.
Furthermore, users tend to do more reads than posts when visiting social networks.
We took this behaviour into account too, by making the ratio of read operations to
the total number of operations configurable. We use the parameters listed in Table 6.5
for all the tests.
Stars percentage                         10% of users are stars
Percentage of stars among users followed 75% of the users followed
Read percentage                          80% of the operations are reads

Table 6.5: Social network parameters, part 2.
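A minimal sketch of how the simulator could draw operations and followed users according to Table 6.5. The function and variable names are ours, not Bwitter's, and only illustrate the stated percentages:

```python
import random

def pick_operation(rng: random.Random) -> str:
    """80% of the operations are reads, the remaining 20% are posts."""
    return "read" if rng.random() < 0.80 else "post"

def pick_followee(stars: list[int], fans: list[int], rng: random.Random) -> int:
    """75% of the users a given user follows are stars, 25% are fans."""
    pool = stars if rng.random() < 0.75 else fans
    return rng.choice(pool)

# 10% of the users are stars, the rest are fans (heavy network: 2000 users).
users = list(range(2000))
stars, fans = users[:200], users[200:]
```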
Cache influence
With this test we show that a cache mechanism is not optional but in fact crucial
for the performance of the system. We made two runs of our Bwitter simulation,
one with the cache and one without it. The parameters used for the two
runs are the ones we just fixed, except for those listed in Table 6.6.
Type of social network Heavy network
Dispatcher / Bwitter node Large instance
Number of Scalaris nodes 18
Connections per node 1
Table 6.6: Parameters changed for the cache test.
We have put the test results in Table 6.7. Remember that we have set a time to live
of 1 minute for the elements in the cache. Those elements thus stay at most
1 minute in the cache before being ejected, meaning that a deleted tweet can remain
visible for at most 1 minute. The cache is big enough to keep all the cacheable
elements of the test. This would probably not be the case in a real
situation; when we must remove an element from a full cache, we use a least
recently used strategy, as explained in section 3.2.3.
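As an illustration, a cache combining the 1-minute time to live with LRU eviction could look like this. It is a sketch of the behaviour described above, not the actual Bwitter code, and all names are ours:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Cache with a time to live and LRU eviction (illustrative sketch)."""

    def __init__(self, max_size: int = 1024, ttl: float = 60.0):
        self.max_size = max_size
        self.ttl = ttl                            # seconds; the tests use 1 minute
        self._data: OrderedDict = OrderedDict()   # key -> (value, expiry)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None                           # cache miss
        value, expiry = entry
        if time.monotonic() > expiry:             # expired: eject the element
            del self._data[key]
            return None
        self._data.move_to_end(key)               # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)        # evict the least recently used
        self._data[key] = (value, time.monotonic() + self.ttl)
```

Note that an expired element is only ejected lazily, on access; a deleted tweet can therefore remain visible until its entry expires, as described above.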
                                          Without cache   With cache
Time taken for all the operations (s)     863 s           492 s
Throughput (ops/s)                        23,15 ops/s     40,59 ops/s
Failure percentage                        1,32%           3,18%
Cache hits                                /               250704
Cache misses                              /               4431

Table 6.7: Performance comparison with and without application cache.
The cache is obviously the quicker option: it nearly doubles the number of
operations performed per second. This noticeable performance improvement is
explained entirely by the frequent access to the cache. The cache is mainly used to
access tweets and passwords. When tweets are in the cache, we avoid X transactions to
Scalaris, where X is the number of tweets read in one read operation. We saw during
the previous tests that reading a value from Scalaris takes approximately 1,5 ms with
18 nodes, while the cache statistics indicate that the mean time for an access to the
cache is 0,006 ms. It is thus theoretically 250 times faster! Given that we have
250704 hits, we save (1,5 − 0,006) × 250704 ms ≈ 374551 ms ≈ 375 s over the whole test.
The difference between the two test times is 371 s; the cache is thus indeed the main
factor improving the performance.
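This back-of-the-envelope computation can be checked directly with the measured figures from the text:

```python
# Measured values from the text: mean Scalaris read time with 18 nodes and
# mean cache access time, both in milliseconds, plus the cache hit count.
scalaris_read_ms = 1.5
cache_read_ms = 0.006
cache_hits = 250704

saved_s = (scalaris_read_ms - cache_read_ms) * cache_hits / 1000
print(round(saved_s))  # ≈ 375 s, close to the measured 371 s difference
```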
A side effect of using the cache is a higher failure percentage, which goes from
1,32% to 3,18%. This is still a very good result, meaning that almost all the Scalaris
operations were correctly performed. The increase is probably due to a higher number
of concurrent postings caused by the cache: the read operations are a lot quicker,
so more tweet-posting operations run concurrently than without the cache, which
implies more conflicts. Indeed, in a quick test we observed a failure percentage of 0
when only reading tweets, and on the contrary a failure percentage of 30% when only
posting tweets. Our assumption that more concurrent tweet postings are responsible
for this increase in failure percentage is thus reasonable.
In conclusion, the cache improves the global performance. The tweet-reading
algorithm mainly benefits from the cache, making reads even faster, which was our
goal. We could probably optimize the cache usage further, but decided not to focus on this
part. The following tests will thus all use the cache described here.
Number of followers in a chunk of the topost set.
Before starting the scalability tests, we were curious to know the practical influence
of nbrOfFollowersPerChunk on the performance of our system. We first list
some theoretical elements that should help us understand the results. Then, we run a
simulation to see whether they hold in a real test.
Recall that the higher the nbrOfFollowersPerChunk, the higher the
number of keys involved in a write transaction, but the lower the number of necessary
transactions. Moreover, transactions involving more keys are in general more likely to
fail. Using Equation 5.2 from our theoretical analysis, we compute that we need
respectively 174, 110, 102 and 98 Scalaris operations to do a single write
for values of nbrOfFollowersPerChunk of 1, 5, 10 and 20. Those results are displayed
in Figure 6.13.
nbOp = 8 + nbrFollowers × (2 + 3/nbrTweetsPerChunk + 2/nbrOfFollowersPerChunk)
     = 8 + 40 × (2 + 3/20 + 2/nbrOfFollowersPerChunk)
     = 8 + 80 + 6 + 80/nbrOfFollowersPerChunk
     = 94 + 80/nbrOfFollowersPerChunk                                        (6.3)
Figure 6.13: Number of Scalaris operations needed to perform a Bwitter “post tweet” operation with respect to the number of followers per chunk.
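Equation 6.3 is easy to check numerically. A direct transcription, with the test's parameter values (40 followers, 20 tweets per chunk) as defaults, reproduces the operation counts quoted above:

```python
def nb_ops(nbr_of_followers_per_chunk: int,
           nbr_followers: int = 40,
           nbr_tweets_per_chunk: int = 20) -> float:
    """Scalaris operations for one Bwitter "post tweet" (Equation 6.3)."""
    return 8 + nbr_followers * (2
                                + 3 / nbr_tweets_per_chunk
                                + 2 / nbr_of_followers_per_chunk)

# 94 + 80/n for n = 1, 5, 10, 20 gives 174, 110, 102 and 98 operations.
```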
We now want to evaluate the impact of this parameter in practice. We simulate
a social network with a higher level of conflict than the two we already presented, in
order to have a clearer view of this impact. The conflict levels of the heavy and light
networks we presented are 0,025 and 0,00625 respectively; in this test it is equal to
0,06. We measured the time needed to perform 10000 operations with different values of
nbrOfFollowersPerChunk. We summarize the simulation parameters in Table 6.8
and present our results in Figures 6.14 and 6.15.
Bwitter node / Dispatcher          Small
Number of Scalaris nodes           10
Number of Bwitter operations       10000
Number of users                    700
Users followed                     40
Users followed / Number of users   0,06

Table 6.8: Parameters changed for the Topost set influence test.
Figure 6.14: Time measured to perform 10000 Bwitter operations with respect to the number of followers per chunk, results for small instances and conflict level of 0,06.
We can see that the time drops a lot between one and five followers per chunk. This
is not surprising, as the number of operations per posted tweet decreases a lot
between one and five, as shown in Figure 6.13. The time difference is entirely explained
by the lower number of operations needed to post one tweet. Indeed, the cost of
the read operation stays the same whatever the value of nbrOfFollowersPerChunk.
Following this reasoning, the time should continue to drop between 5 and 20, but it
seems to stagnate and even to increase slightly at 20. We can explain this by looking
at Figure 6.15, which plots the failure percentage. It shows a big increase in
Figure 6.15: Failure percentage for 10000 Bwitter operations with respect to the number of followers per chunk, results for small instances and conflict level of 0,06.
the failure percentage between 5 and 20. Indeed, as we mentioned in the introduction
of this section, the bigger nbrOfFollowersPerChunk, the bigger the number of keys
involved per transaction during a tweet posting. And larger transactions induce more
conflicts and thus more failures. The advantage of having fewer structures to manage
at higher values of nbrOfFollowersPerChunk thus seems to be offset by the number
of failures, which also increases with nbrOfFollowersPerChunk; this is why we
observe this stagnation at the end of the graph.
In conclusion, we should not use too small a value for nbrOfFollowersPer-
Chunk, as in this case the number of operations increases a lot and the time
explodes. On the other hand, we should not use too high a value either, as it quickly
increases the number of failures and thus the time. This is why we decided to use
a value of 20 for nbrOfFollowersPerChunk in all our following tests: it seems
a good compromise. It is maybe not the best choice, but at least it seems to be a wise
one.
Scalability tests
With this test we evaluate the scalability of our application. We run our simulation
with the parameters described at the beginning of this section for different numbers of
nodes. We use the heavy network we have presented. We do not know for sure what
the best connection strategy would be, as the degree of conflict of our simulation is hard
to evaluate. We thus test with a small dispatcher with one connection per node (1),
with a small dispatcher with two connections per node (2), and with a
large dispatcher and one connection per node (3), as a small dispatcher is maybe not
powerful enough to handle the Bwitter tasks and lots of Scalaris connections.
Our results are grouped in Figures 6.16 and 6.17. The first shows the throughput
with respect to the number of nodes and the second plots the failure percentage.
Figure 6.16: Throughput for 20000 Bwitter operations on a heavy network with respect to the number of Scalaris nodes, results for one small dispatcher with one connection per node, one small dispatcher with two connections per node and one large dispatcher with one connection per node.
From Figure 6.16 we can see that (2) does not scale well. Indeed, the throughput
first increases until 12 nodes and then decreases to a level below the throughput reached
at 4 nodes. The failure percentage is more than twice that of (1) and (3), which
seems to indicate that there are too many connections toward Scalaris. A simulation
with a smaller conflict level could have benefited from a higher number of connections,
but we did not test it.
The throughput of (1) grows at a regular pace until 14 nodes and then seems to slow
down. We observed the same behavior during the Scalaris scalability tests, but it
is more obvious here. The failure percentage grows linearly with the number of nodes,
which is normal.
Configuration (3) gives far better results in terms of throughput than (1) and (2). It
grows very well until 16 nodes and suddenly falls at 18 nodes. However, the gap between
14 and 16 seems higher than usual; we thus believe this situation was created by
exceptional conditions. We deduce from the observation of the throughputs of (1), (2)
and (3) that a small dispatcher cannot handle both Bwitter tasks and Scalaris-related
Figure 6.17: Failure percentage for 20000 Bwitter operations on a heavy network with respect to the number of Scalaris nodes, results for one small dispatcher with one connection per node, one small dispatcher with two connections per node and one large dispatcher with one connection per node.
tasks. We have indeed observed with Amazon's basic monitoring tools that the CPU
as well as the network were used much more during these Bwitter tests than during
the Scalaris tests. This is not surprising, as the values and keys used are bigger than
during the Scalaris tests and Bwitter performs various additional tasks. It therefore
indicates that it is necessary to use more powerful machines than Amazon's small
instances for the Bwitter nodes. The failure percentage grows slowly and is nearly
the same as for (1) until we reach 12 nodes. From 12 nodes until 18 nodes, (3)
sees its failure percentage growing faster. This is probably because it has more CPU
and network capacity and can thus run more transactions in parallel, which creates
more conflicts. However, this seems to indicate that the gain of adding one node will
decrease slowly as the number of nodes becomes bigger. This is not surprising
and does not indicate a scalability problem. Indeed, during this test we increased the
number of parallel operations while keeping the number of users stable. Normally, the
number of machines grows with the size of the social network and thus the number of
users, but a user should not follow more users simply because there are more users in
the network.
In conclusion, Bwitter is scalable, but the Bwitter nodes need to be powerful
enough to handle the necessary number of connections toward Scalaris while performing
the Bwitter tasks. We now make a final scalability test with a simulated social network
with a smaller conflict level which, we believe, is closer to reality. We only run the tests
with one large dispatcher and one connection per node. The parameters changed for
this test are in Table 6.9.
Bwitter node / Dispatcher Large
Number of Scalaris nodes 4→ 18
Connections per node 1
Network type Heavy and Light network
Table 6.9: Parameters changed for the push scalability test
This means that we now have a conflict level of 25/4000 = 0,00625. We show in
Figures 6.18 and 6.19 the results of this test as well as the results for the denser
social network, so that we can more easily compare the two.
Figure 6.18: Throughput for 20000 Bwitter operations with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
As expected, we observe better performance with a smaller conflict level. The
failure percentage increases much more slowly than before, which explains the tremendous
gain in performance. Looking at the two Bwitter scalability tests, we can see a pretty
clear correlation between the failure percentage and the conflict level.
With 18 nodes and this conflict level we finally reach 66 ops/s, which means around 13
tweets posted/s and 53 reads/s. If we make a small computation and assume a user
posts 3 tweets a day and reads his tweets 12 times a day, we estimate that we can
handle 380162 users with only 19 machines. This is obviously an overestimate and not
precise, but even a quarter of this number would be a good result.
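The estimate works out as follows; using the rounded 66 ops/s gives 380160, and the 380162 quoted above presumably comes from the unrounded measurement:

```python
throughput_ops_s = 66            # measured with 18 nodes on the light network
ops_per_user_per_day = 3 + 12    # 3 posts + 12 reads per user per day

ops_per_day = throughput_ops_s * 24 * 3600
supported_users = ops_per_day // ops_per_user_per_day
print(supported_users)  # 380160 users
```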
During those tests we observed good scalability properties for the large dispatchers,
Figure 6.19: Failure percentage for 20000 Bwitter operations with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
the small dispatchers lacked resources. As for the Scalaris scalability test,
we saw that a high conflict level reduces the throughput and lowers the gain obtained
from adding a machine. We now test Bwitter's elasticity.
Elasticity tests
The scalability tests of Bwitter have shown good scalability from 4 to 18
nodes for both the heavy and the light network. However, we have decided to use the
light one: the throughput increases faster with the number of nodes with this
network, so we believe it is easier to observe elasticity with it. Concerning
the nbrInitialData, defined during the elasticity tests on Scalaris, we have decided to
increase its value to 20000. Indeed, the Scalaris elasticity test did not seem to show
any instability after adding nodes, so we decided to try to increase the impact of
the churn. The initialisation phase is very long: it takes approximately 45
minutes to post the initial data and an additional 40 minutes to initialize Bwitter-related
data such as followers, tweets and so on. It was thus not possible to push the amount of
initial data much higher, though we would have liked to. We keep the seven strategies we
defined during the elasticity tests on Scalaris and start with 6 initial nodes. However,
the results should be quite different, because we used much more initial data and Bwitter
adds an important CPU and network overhead compared to the Scalaris operations we
did before. We present the results in Figures 6.20 and 6.21. As for the last elasticity
test, we present the evolution of the throughput as well as the failure percentage; we
Figure 6.20: Throughput with respect to time, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and the light network.
Figure 6.21: Failure percentage with respect to time, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and the light network.
also indicate with blue dots the moment we start the machines on Amazon and with red dots
the moment at which Scalaris is started on the nodes. We also indicate the final number
of nodes reached by each strategy in Table 6.10.
Strategy 1 2 3 4 5 6 7
Nodes added 0 1 5 8 8 12 12
Table 6.10: Number of nodes inserted in the ring at the end of the test.
First, you can observe that the throughput is much more unstable than during the
Scalaris elasticity test. The first reason is that the measure we take is much more
volatile. Secondly, we have put a lot more initial data in the system. This may
slow down Scalaris at times, delaying some read or post-tweet operations so that
they finish in the next sample when we take the measures, thus creating a big
gap between two measures. Thirdly, Bwitter operations are much heavier than the
operations we did during the Scalaris scalability test, which may also have an impact
on the results.
We will not discuss each strategy in detail as we did for Scalaris. Instead we make
some general comments. We can observe that the first strategy's throughput varies
a lot (between 20 and 30) all along the test, which means that even without adding
any node the throughput is quite variable. We can also see that, as for the Scalaris
elasticity test, between the moment we start instances on Amazon and the moment
Scalaris is started on the nodes, the throughput slows down. The addition of
nodes is once again directly effective, and the throughput in general increases. We also
observe that most of the strategies had not stabilized at the end of the test and that
their throughputs still vary a lot. But, as expected, the strategies that added the most
nodes during the test reached the highest throughput. Since the throughput varies a lot,
it is not representative to choose a strategy according to the final throughput; we
thus turn to the average throughput, represented in Figure 6.22, which is much
easier to analyze.
As we can see, strategies 6 and 7, which reach the highest number of nodes at the
end, also have the highest average throughput. Strategies 4 and 5 have a similar
average throughput, but 4 has a higher one because it adds its nodes before 5 and can
thus benefit sooner from the new nodes.
As we can see in Figure 6.21, the failure percentage also varies a lot, and when it
reaches a peak the throughput naturally drops. We see that when the
number of nodes grows, the failure percentage also varies much more. We suppose the
peaks are an effect of Scalaris' stabilisation algorithm, which runs periodically.
So, once again, our conclusion is that the quicker you add nodes, the quicker you
increase the throughput and the higher the average throughput you obtain during the tests.
However, we can observe that strategy (7) was less stable than it was during the Scalaris
elasticity test. So perhaps, if we could have performed elasticity tests with more nodes,
we would have observed that adding all the nodes at the same time is not a good
idea. To conclude, we can say that, with our current resources, adding all the
nodes at the same time seems to be the best strategy.
Figure 6.22: Average throughput, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and the light network.
6.3.3 Pull scalability test
In this final section we test the scalability of the pull approach. We use exactly the
same parameters as those described at the beginning of this section. As for the other
scalability tests, we make the number of nodes vary from 4 to 18, use one connection
per node and make 20000 Bwitter operations. We simulate the heavy and the light
networks. Those parameters are summarized in Table 6.11.
Bwitter node / Dispatcher Large
Number of Scalaris nodes 4→ 18
Connections per node 1
Number of Bwitter operations 20000
Network type Heavy and Light network
Users followed 40
Table 6.11: Parameters changed for the pull scalability test
The heavy network should give much worse results than the other one. Indeed,
from the theoretical analysis, we know that the complexity of the read operation grows
linearly with the number of users followed in the pull approach. We read one chunk (here
one time frame) as we did for the push. We have set the time frame to one day,
which is a reasonable choice for a real application, so all the tweets posted end up
in the same chunk. It may thus seem unfair compared to the push approach,
which flushes the head when it is full, while in the pull we are forced to read all the
tweets that were posted during the day. However, because we use a cache, this side
effect is strongly mitigated. Indeed, most of the tweets are in the cache, and its read
access is really quick. The pull and push simulations are thus comparable. As the large
dispatcher gave better results for the push approach, we decided to make this test
with a large dispatcher as well. Concerning the Scalaris nodes we use, as usual, the small
instances. We put the throughput and the failure percentage for the heavy and the
light network in Figures 6.23 and 6.24.
Figure 6.23: Throughput for 20000 Bwitter operations with the pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
The pull approach shows excellent scalability for the two networks. Indeed, the
throughput increases almost perfectly linearly with the number of nodes. This good behavior
is due to the failure percentage, which grows extremely slowly with the number of nodes.
This seems to indicate that it could handle a really high number of nodes. The low failure
percentage is the consequence of the low number of writes involved in the pull version
of the post tweet. Remember that the pull only writes the tweet reference in one place,
and that when followers read their tweets they do not make any write. Operations
in the pull thus cause almost no conflicts at all. The failure percentage for the light
network seems to increase a lot at 16 nodes, but it is only a visual effect: it
only increases by approximately 0,05.
For the same parameters, namely those described at the beginning of this section,
we go from around 18000 to 250000 Scalaris operations. This is due to the high number of
reads and the low number of writes in our test. As predicted, the reads require many more
operations when using the pull design.
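The structural difference between the two designs can be summarized by this simplified sketch, our own abstraction over a plain dictionary rather than the actual Bwitter data structures:

```python
# Push: posting fans the tweet reference out to every follower's line,
# so a read only touches the reader's own line.
def post_push(store: dict, tweet_ref: str, followers: list[str]) -> None:
    for follower in followers:                 # O(number of followers) writes
        store.setdefault(("line", follower), []).append(tweet_ref)

def read_push(store: dict, reader: str) -> list[str]:
    return store.get(("line", reader), [])     # a single read

# Pull: posting writes the reference once, on the poster's own line,
# but a read must aggregate the lines of every user followed.
def post_pull(store: dict, poster: str, tweet_ref: str) -> None:
    store.setdefault(("line", poster), []).append(tweet_ref)

def read_pull(store: dict, followees: list[str]) -> list[str]:
    tweets: list[str] = []
    for user in followees:                     # O(users followed) reads
        tweets.extend(store.get(("line", user), []))
    return tweets
```

With our 80%-read workload this is exactly the trade-off observed: the pull post almost never conflicts, while its reads multiply the number of Scalaris operations.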
Figure 6.24: Failure percentage for 20000 Bwitter operations with the pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
6.3.4 Conclusion: Pull versus Push
We want to say a few final words about the pull and the push approach. We can
only compare the scalability tests, because we did not perform an elasticity test for the
pull. The results can be directly compared because we used the same parameters for
the two approaches. We put, as usual, the throughput and the failure percentage for the
push and the pull for the two networks we tested. They are shown in Figures 6.25 and
6.26.
We can observe that the push approach outperforms the pull in terms of throughput
for both network types. The throughput also increases faster with the number of
nodes in the push approach. However, we can see that the throughput increase for
the push, as already observed, seems to diminish when we reach a high number of nodes.
This is not the case with the pull approach, which grows more steadily. The push
approach will thus probably reach a scalability limit sooner than the pull.
Figure 6.25: Throughput for 20000 Bwitter operations with the push and pull approaches with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
Concerning the failure percentage, it is much higher in the push approach
and increases much more quickly with the number of nodes. This explains why the increase
of the throughput in the push approach slows down with the number of nodes. The
pull approach does not present this problem, as its failure percentage grows very slowly
and is ridiculously low.
We thus conclude that the two approaches have their pros and cons. The push
approach presents much better performance, but at the cost of a higher failure percentage.
The two approaches scale well, but the pull does not seem to slow down. This seems to
indicate that the pull would be the most appropriate for a very high number of nodes.
However, this last conclusion is purely hypothetical; we would need much larger scale
tests in order to confirm this intuition.
Figure 6.26: Failure percentage for 20000 Bwitter operations with the push and pull approaches with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
6.4 Conclusion
In this section we have shown how to configure Amazon and Scalaris. We
performed a series of tests concluding that Scalaris running on Amazon is indeed
scalable and elastic. We then performed a series of tests on Bwitter, for both the push
and pull approaches, and demonstrated that both scale very well.
The push approach presents a quicker increase in performance, but with a failure
percentage that grows much faster. Finally, we showed that our system, based on the
push approach, was able to significantly improve its performance within 15 minutes
while facing a high load.
Chapter 7
Conclusion
Our goal was to design and develop a scalable and elastic implementation of a
social network application on top of a key/value datastore. Looking at the results
detailed in the previous chapter, we are confident we have reached our goal. Indeed,
we developed an implementation of our pull and push designs, and they both showed
good scalability results. The elasticity was only tested for the push approach, and we
showed it was possible to quickly improve performance while assuring a good level of
service. All those tests were performed under real-world conditions using Amazon's
Elastic Compute Cloud infrastructure. The implementation was realized with the goal
of being as close as possible to a real social network application; we thus took care
to protect user data and to avoid security flaws.
During our work with Beernet and its main developer Boris Mejıas, we identified
the basic requirements that allow different services to run on the same DHT without
interfering with each other. These led to the discovery of some potential improvements
for Beernet's API, which are now implemented in version 0.9. This new API allows
users to protect their data and grant limited rights to it by using a system of secrets.
Before testing Bwitter, we also heavily tested Scalaris in order to understand the
future Bwitter test results. We first showed the importance of choosing the right
number of connections. Afterwards, we studied its scalability in depth and tried different
strategies in order to evaluate the elasticity of Scalaris on Amazon's EC2. It was shown
to be highly scalable and elastic.
Besides this work, we have also co-written an article, along with Peter Van Roy and
Boris Mejıas, entitled “Designing an Elastic and Scalable Social Network Application”.
In this article we detail some of the observations and design decisions developed
in this master thesis. This article, which can be found in Chapter 10 of our annexes, has
been accepted for The Second International Conference on Cloud Computing, GRIDs,
and Virtualization1, organized by IARIA and held from the 25th to the 30th of September
2011 in Rome, Italy.
1CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html,last accessed 13/08/2011
7.1 Further work
On multiple occasions during the tests, we concluded that it would have been
interesting to perform the tests with more nodes in order to have a better idea of the
scalability and the elasticity. Indeed, during this work our tests were limited to 20
machines. Hence, while Bwitter displayed good performance in this environment,
it would have been interesting to increase the number of machines in order to approach
a more realistic setting.
We also believe the flash crowd detection mechanism is an interesting subject to
study. Indeed, during our research, we noticed that there are sometimes telltale
behaviours in the network before a high peak of activity. It would thus be interesting
to design a mechanism based on those social behaviours in order to predict heavy
loads and allocate machines before the peak occurs.
We did not study downscaling elasticity in our work because, according to the
Scalaris developers, their system does not yet handle graceful shutdowns in version
0.3.0. It would thus be interesting to observe and test Bwitter on Scalaris once this
feature is implemented in order to study its behaviour.
We did not address load balancing between Bwitter nodes, but it could be
interesting to develop an algorithm that detects which requests should be forwarded to
which Bwitter node in order to share the load between them. Following the same idea,
some requests, such as tweets posted by stars, are quite heavy; it might also be a good
idea to split this work between the Bwitter nodes and not only between the Scalaris
nodes.
Finally, the load balancer of the Scalaris Connection Manager could be improved to
decide which SRs should be executed, so as to reduce the conflicts between SRs
executed concurrently.
Bibliography
[1] Apache. Apache hbase, frontpage. http://hbase.apache.org, 2011. [Online; accessed
28-June-2011].
[2] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H.
Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Sto-
ica, and Matei Zaharia. Above the clouds: A berkeley view of cloud computing.
Technical Report UCB/EECS-2009-28, EECS Department, University of Califor-
nia, Berkeley, Feb 2009. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/
EECS-2009-28.html.
[3] Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, and Ion
Stoica. Looking up data in p2p systems. Commun. ACM, 46:43–48, February
2003. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/606272.606299. URL
http://doi.acm.org/10.1145/606272.606299.
[4] Shea Bennett. Twitter passes 300 million users, seeing 9.2 new registrations per sec-
ond. (allegedly.). http://www.mediabistro.com/alltwitter/twitter-300-million-users
b9026, 2011. [Online; accessed 28-June-2011].
[5] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking and Applica-
tions. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. ISBN
0123742145, 9780123742148.
[6] Nicholas Carlson. Facebook has more than 600 million
users, goldman tells clients. http://www.businessinsider.com/
facebook-has-more-than-600-million-users-goldman-tells-clients-2011-1, 2011.
[Online; accessed 28-June-2011].
[7] Rick Cattell. Scalable sql and nosql data stores. ACM SIGMOD Record, 39(4),
dec 2010.
[8] Chris Clayton. Standard cloud taxonomies and windows
azure. http://blogs.msdn.com/b/cclayton/archive/2011/06/07/
standard-cloud-taxonomies-and-windows-azure.aspx, 2011. [Online; accessed
26-July-2011].
[9] Technology Expert. Twitter proves itself again, in chilean earthquake. http://
www.tech-ex.net/2010/02/twitter-proves-itself-again-in-chilean.html, 2010. [Online;
accessed 28-June-2011].
[10] Code Futures. Database sharding. http://www.codefutures.com/
database-sharding/, 2011. [Online; accessed 28-June-2011].
[11] Ali Ghodsi. Distributed k-ary System: Algorithms for Distributed Hash Tables.
PhD thesis, KTH – Royal Institute of Technology, Stockholm, Sweden, dec 2006.
[12] Ali Ghodsi, Luc Alima, and Seif Haridi. Symmetric replication for structured
peer-to-peer systems. In Gianluca Moro, Sonia Bergamaschi, Sam Joseph, Jean-
Henry Morin, and Aris Ouksel, editors, Databases, Information Systems, and
Peer-to-Peer Computing, volume 4125 of Lecture Notes in Computer Science,
pages 74–85. Springer Berlin / Heidelberg, 2007. URL http://dx.doi.org/10.1007/
978-3-540-71661-7 7. 10.1007/978-3-540-71661-7 7.
[13] Ali Ghodsi, Luc Onana Alima, and Seif Haridi. Symmetric replication for
structured peer-to-peer systems. In Proceedings of the 2005/2006 interna-
tional conference on Databases, information systems, and peer-to-peer computing,
DBISP2P’05/06, pages 74–85, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN
978-3-540-71660-0. URL http://portal.acm.org/citation.cfm?id=1783738.1783748.
[14] Jim Gray and Leslie Lamport. Consensus on transaction commit. ACM Trans.
Database Syst., 31:133–160, March 2006. ISSN 0362-5915. doi: http://doi.acm.
org/10.1145/1132863.1132867. URL http://doi.acm.org/10.1145/1132863.1132867.
[15] Sameh El-Ansary and Seif Haridi. An overview of structured overlay networks.
Handbook on Theoretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless, and
Peer-to-Peer Networks, 2005.
[16] Abigail Hauslohner. Is egypt about to have a facebook revolution? http://www.
time.com/time/world/article/0,8599,2044142,00.html, 2011. [Online; accessed 28-
June-2011].
[17] Bill Heil and Mikolaj Piskorski. New twitter research: Men follow men and nobody
tweets. http://blogs.hbr.org/cs/2009/06/new_twitter_research_men_follo.html, 2009.
[Online; accessed 28-June-2011].
[18] Rachelle Matherne. Social media coverage of the haiti earthquake. http://sixestate.
com/social-media-coverage-of-the-haiti-earthquake/, 2010. [Online; accessed 28-
June-2011].
[19] Boris Mejías and Peter Van Roy. Beernet: Building self-managing decentralized
systems with replicated transactional storage. IJARAS: International Journal of
Adaptive, Resilient, and Autonomic Systems, 1(3):1–24, July-Sept 2010. ISSN
1947-9220. doi: 10.4018/jaras.2010070101.
[20] MySQL. Mysql cluster. http://www.mysql.com/products/cluster/, 2011. [Online;
accessed 28-June-2011].
[21] John Naughton. Yet another facebook revolution: why are we so surprised? http://
www.guardian.co.uk/technology/2011/jan/23/social-networking-rules-ok, 2011. [On-
line; accessed 28-June-2011].
[22] Peter Mell and Timothy Grance. The nist definition of cloud computing (draft).
Recommendations of the National Institute of Standards and Technology, 2011.
[23] Programming Languages and Distributed Computing Research Group, UCLou-
vain. Beernet: pbeer-to-pbeer network. http://beernet.info.ucl.ac.be, 2009.
[24] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker.
A scalable content-addressable network. In Proceedings of the 2001 conference
on Applications, technologies, architectures, and protocols for computer commu-
nications, SIGCOMM ’01, pages 161–172, New York, NY, USA, 2001. ACM.
ISBN 1-58113-411-8. doi: http://doi.acm.org/10.1145/383059.383072. URL http:
//doi.acm.org/10.1145/383059.383072.
[25] Redis. Redis. http://redis.io/, 2011. [Online; accessed 28-June-2011].
[26] Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy,
Scott Shenker, Ion Stoica, and Harlan Yu. Opendht: a public dht service and its
uses. SIGCOMM Comput. Commun. Rev., 35:73–84, August 2005. ISSN 0146-
4833. doi: http://doi.acm.org/10.1145/1090191.1080102. URL http://doi.acm.org/
10.1145/1090191.1080102.
[27] Alex Rodriguez. Restful web services: The basics. https://www.ibm.com/
developerworks/webservices/library/ws-restful/, 2008. [Online; accessed 13-August-
2011].
[28] Antony Rowstron and Peter Druschel. Storage management and caching in past,
a large-scale, persistent peer-to-peer storage utility. SIGOPS Oper. Syst. Rev., 35:
188–201, October 2001. ISSN 0163-5980. doi: http://doi.acm.org/10.1145/502059.
502053. URL http://doi.acm.org/10.1145/502059.502053.
[29] Thorsten Schütt, Florian Schintke, and Alexander Reinefeld. Scalaris: reliable
transactional p2p key/value store. In Proceedings of the 7th ACM SIGPLAN work-
shop on ERLANG, ERLANG ’08, pages 41–48, New York, NY, USA, 2008. ACM.
ISBN 978-1-60558-065-4. doi: http://doi.acm.org/10.1145/1411273.1411280. URL
http://doi.acm.org/10.1145/1411273.1411280.
[30] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Bal-
akrishnan. Chord: A scalable peer-to-peer lookup service for internet applica-
tions. SIGCOMM Comput. Commun. Rev., 31:149–160, August 2001. ISSN 0146-
4833. doi: http://doi.acm.org/10.1145/964723.383071. URL http://doi.acm.org/
10.1145/964723.383071.
[31] Chunqiang Tang, Zhichen Xu, and Mallik Mahalingam. psearch: information
retrieval in structured overlays. SIGCOMM Comput. Commun. Rev., 33:89–94,
January 2003. ISSN 0146-4833. doi: http://doi.acm.org/10.1145/774763.774777.
URL http://doi.acm.org/10.1145/774763.774777.
[32] G. Tselentis, J. Domingue, A. Galis, A. Gavras, and D. Hausheer. Towards the
Future Internet: A European Research Perspective. IOS Press, Amsterdam, The
Netherlands, The Netherlands, 2009. ISBN 1607500078, 9781607500070.
[33] Twitter. #numbers. http://blog.twitter.com/2011/03/numbers.html, 2011. [Online;
accessed 28-June-2011].
[34] Guido Urdaneta, Guillaume Pierre, and Maarten Van Steen. A survey of dht
security techniques. ACM Comput. Surv., 43:8:1–8:49, February 2011. ISSN 0360-
0300. doi: http://doi.acm.org/10.1145/1883612.1883615. URL http://doi.acm.org/
10.1145/1883612.1883615.
[35] Harry Wallop. Japan earthquake: how twitter and facebook
helped. http://www.telegraph.co.uk/technology/twitter/8379101/
Japan-earthquake-how-Twitter-and-Facebook-helped.html, 2011. [Online; ac-
cessed 28-June-2011].
[36] Evan Weaver. Improving running components. http://www.slideshare.net/Eweaver/
improving-running-components-at-twitter, 2009. [Online; accessed 28-June-2011].
[37] Wikipedia. Trusted platform module. http://en.wikipedia.org/wiki/Trusted_
Platform_Module, 2011. [Online; accessed 28-June-2011].
[38] Wikipedia. Trusted computing. http://en.wikipedia.org/wiki/Trusted_computing#
Remote_attestation, 2011. [Online; accessed 28-June-2011].
[39] Wikipedia. Partition (database). http://en.wikipedia.org/wiki/Partition_(database),
2011. [Online; accessed 28-June-2011].
[40] Ethan Zuckerman. The first twitter revolution? http://www.foreignpolicy.com/
articles/2011/01/14/the_first_twitter_revolution, 2011. [Online; accessed 28-June-
2011].
Part II
The Annexes
Chapter 8
Beernet Secret API
8.1 Without replication
8.1.1 Put
put(S:Secret K:Key V:Val)
Stores the triplet (Hash(Secret) Key Val) at the peer responsible for Hash(Key).
This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key or there is a triplet stored
previously by a put operation.
• there is no triplet (Hash(Secret1) Key Val1) stored at the peer responsible for
Hash(Key) such that Hash(Secret) ≠ Hash(Secret1).
• the value has successfully been stored.
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret, Beernet will assume it is equivalent to
put(S:NO_SECRET K:Key V:Val).
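The secret check above can be sketched in plain Java. This is an in-memory model written only for illustration: the names SecretStore and Entry, and the use of Java's hashCode, are our assumptions, not Beernet's actual implementation, which runs on a replicated DHT written in Oz.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative in-memory model of the secret-protected put/delete semantics.
public class SecretStore {
    public static final String NO_SECRET = "NO_SECRET"; // reserved value

    // A stored triplet: only the hash of the secret is kept, as in Beernet.
    record Entry(int secretHash, Object value) {}

    private final Map<String, Entry> store = new HashMap<>();

    /** Returns true ("commit") iff the key is free or the caller presents the
     *  same secret as the previous put; otherwise false ("abort"). */
    public boolean put(String secret, String key, Object value) {
        int h = Objects.hashCode(secret);
        Entry old = store.get(key);
        if (old != null && old.secretHash() != h) {
            return false; // abort: a triplet with a different secret exists
        }
        store.put(key, new Entry(h, value));
        return true; // commit
    }

    /** Same check on delete: only the holder of the secret may delete. */
    public boolean delete(String secret, String key) {
        Entry old = store.get(key);
        if (old == null || old.secretHash() != Objects.hashCode(secret)) {
            return false; // abort
        }
        store.remove(key);
        return true; // commit
    }
}
```

Calling put without a secret then behaves like put with NO_SECRET: anyone who also omits the secret can overwrite the value, which is why an application should protect its data with real secrets.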
8.1.2 Delete
delete(S:Secret K:Key)
Deletes the triplet (Hash(Secret1) Key Val) stored at the peer responsible for
Hash(Key). This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is a triplet (Hash(Secret1) Key Val) stored by a put operation at the peer
responsible for Hash(Key).
• Hash(Secret) = Hash(Secret1)
• the triplet has successfully been deleted
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret, Beernet will assume it is equivalent to
delete(S:NO_SECRET K:Key).
8.2 With replication
8.2.1 Write
write(S:Secret K:Key V:Val)
Stores the triplet (Hash(Secret) Key Val) at the majority of replicas; updating the
value gives the triplet a new version number. This operation can have two results,
“commit” or “abort”.
The operation returns “commit” if
• there is nothing stored associated with the key Key or there is a triplet stored
previously by a write operation at the majority of the replicas.
• there is no triplet (Hash(Secret1) Key Val1) with Hash(Secret) ≠ Hash(Secret1)
stored in the majority of the replicas
• the triplet has been correctly stored in the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret, Beernet will assume it is equivalent to
write(S:NO_SECRET K:Key V:Val).
8.2.2 CreateSet
createSet(SS:SSecret K:Key S:Secret)
Stores the triplet (Hash(SSecret) Key Hash(Secret)) at the majority of replicas.
This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key in the majority of the replicas
• there is no triplet (Hash(SSecret1) Key Hash(Secret1)) stored in the majority of
the replicas yet
• the triplet has been correctly stored in the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for SSecret or Secret, Beernet will set those values to
NO_SECRET.
8.2.3 Add
add(S:Secret K:Key SV:SValue V:Val)
Adds the quadruplet (Hash(Secret) Key Hash(SValue) Val) to the set referenced by the key Key
in the majority of the replicas. This operation can have two results, “commit” or
“abort”.
The operation returns “commit” if:
• there is no triplet (Hash(SSecret1) Key Hash(Secret1)) stored at the majority of
the replicas with Hash(Secret1) ≠ Hash(Secret)
• there is no quadruplet (Hash(Secret2) Key Hash(SValue2) Val) with
Hash(SValue2) ≠ Hash(SValue) stored in the majority of the replicas
• the quadruplet has successfully been stored in the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
Note that if no triplet (Hash(SSecret1) Key Hash(Secret1)) was previously stored at
this key by createSet, Beernet will treat the operation as equivalent to
createSet(SS:NO_SECRET K:Key S:Secret) followed by add(S:Secret K:Key
SV:SValue V:Val), where NO_SECRET is a reserved value of Beernet.
If no value is specified for Secret or SValue, Beernet will set those values to
NO_SECRET.
8.2.4 Remove
remove(S:Secret K:Key SV:SValue V:Val)
If no value is provided for Val, this means we are dealing with a key/value pair and
not a key/value set, and so SValue is not evaluated. It deletes the triplet (Hash(Secret1)
Key Val1) stored at the majority of the replicas. This operation can have two results,
“commit” or “abort”.
The operation returns “commit” if:
• there is a triplet (Hash(Secret1) Key Val1) stored with a write operation at the
majority of the replicas.
• Hash(Secret) = Hash(Secret1)
• the triplet has successfully been deleted from the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If a value is provided for Val, this means we are dealing with a value in a set and
SValue will be checked. It deletes the quadruplet (Hash(Secret1) Key Hash(SValue1)
Val1) stored at the majority of the replicas. This operation can have two results,
“commit” or “abort”.
The operation returns “commit” if:
• there is a quadruplet (Hash(Secret1) Key Hash(SValue1) Val1) stored with an
add operation and there is a triplet (Hash(SSecret1) Key Hash(Secret1)) stored
with a createSet operation at the majority of the replicas.
• Val = Val1
• Hash(Secret) = Hash(Secret1)
• Hash(SValue) = Hash(SValue1) or Hash(SValue) = Hash(SSecret1)
• the quadruplet has successfully been deleted from the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret or SValue, Beernet will set those values to
NO_SECRET.
8.2.5 DestroySet
destroySet(SS:SSecret K:Key)
Deletes the triplet (Hash(SSecret1) Key Hash(Secret1)) and all the quadruplets
(Hash(Secret1) Key Hash(SValue1) Val) at the majority of replicas. This operation
can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is a triplet (Hash(SSecret1) Key Hash(Secret1)) stored at the majority of
the replicas
• Hash(SSecret) = Hash(SSecret1)
• the triplet and quadruplets have successfully been deleted at the majority of the
replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for SSecret, Beernet will assume it is equivalent to
destroySet(SS:NO_SECRET K:Key).
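The interplay between createSet, add and destroySet described above can be sketched as a small in-memory model. This is purely our own illustration in Java: the class and member names are invented, a single map stands in for Beernet's majority of DHT replicas, and Java's hashCode stands in for Beernet's hash function.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of Beernet's secret-protected sets (our names, not Beernet's).
public class SecretSets {
    public static final String NO_SECRET = "NO_SECRET";

    record SetMeta(int ssecretHash, int secretHash) {} // stored by createSet
    record Item(int svalueHash, Object value) {}       // stored by add

    private final Map<String, SetMeta> meta = new HashMap<>();
    private final Map<String, List<Item>> sets = new HashMap<>();

    /** SSecret guards the set itself; Secret guards additions to it. */
    public boolean createSet(String ssecret, String key, String secret) {
        if (meta.containsKey(key)) return false; // abort: key already used
        meta.put(key, new SetMeta(ssecret.hashCode(), secret.hashCode()));
        sets.put(key, new ArrayList<>());
        return true; // commit
    }

    public boolean add(String secret, String key, String svalue, Object value) {
        if (!meta.containsKey(key)) {
            // No createSet was done for this key: Beernet performs an
            // implicit createSet with NO_SECRET as the set secret first.
            createSet(NO_SECRET, key, secret);
        }
        if (meta.get(key).secretHash() != secret.hashCode()) {
            return false; // abort: wrong set secret
        }
        sets.get(key).add(new Item(svalue.hashCode(), value));
        return true; // commit
    }

    /** Only the holder of SSecret may destroy the set and its items. */
    public boolean destroySet(String ssecret, String key) {
        SetMeta m = meta.get(key);
        if (m == null || m.ssecretHash() != ssecret.hashCode()) return false;
        meta.remove(key);
        sets.remove(key);
        return true; // commit
    }
}
```

Note how the model separates the two levels of protection: knowing Secret lets a user add to a set, while destroying the set requires the stronger SSecret.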
Chapter 9
Bwitter API
9.1 User management
9.1.1 createUser
public void createUser(String userName, String password, String realName)
Creates a user with his personal information.
Parameters:
• userName - the userName of the user, may not contain spaces.
• password - the password of the user, has to be at least 8 characters long and must
contain at least one number and one special character (not from the 26 letters of
the alphabet).
• realName - the full name of the user, must contain a first and last name.
Throws:
• UserAlreadyUsed - if there already exists a user with this userName.
• PassWordTooWeak - if the password does not meet the requirements.
• UserNameInvalid - if either the userName or realName does not meet the require-
ments.
• ActionNotDoneException - if there was another problem during the operation.
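The password rule above could be checked, for instance, as follows. This is only our interpretation of the rule, reading “special character” as one that is neither a letter nor a digit; it is not Bwitter's actual validation code.

```java
// Hypothetical check for the password rule stated above: at least 8
// characters, at least one digit, and at least one character that is
// neither a letter nor a digit (our reading of "special character").
public class PasswordRule {
    public static boolean isStrong(String password) {
        if (password == null || password.length() < 8) return false;
        boolean hasDigit = password.chars().anyMatch(Character::isDigit);
        boolean hasSpecial =
            password.chars().anyMatch(c -> !Character.isLetterOrDigit(c));
        return hasDigit && hasSpecial;
    }
}
```

A createUser implementation would throw PassWordTooWeak whenever such a check returns false.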
9.1.2 deleteAccount
public boolean deleteAccount(String userName, String password)
Deletes the account of the user along with his lists and lines. Also deletes all the tweets
this user posted.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ActionNotDoneException - if there was another problem during the operation.
9.2 Tweets
9.2.1 postTweet
public void postTweet(String userName, String password, String msg)
Posts the message so that it is displayed in all the lines following the user.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• msg - a String containing the message
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.2.2 reTweet
public void reTweet(String userName, String password, String tweetID)
Posts the referenced tweet as a retweet so that it is displayed in all the lines following
the user.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• tweetID - reference of the tweet to retweet
Throws:
• ActionAlreadyPerformed - if this action has already been performed previously.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.2.3 reply
public void reply(String userName, String password, String msg, String tweetID)
Posts a new tweet with msg as message so that it is displayed in all the lines following
the user. The new tweet contains a reference to its parent tweet, referenced by tweetID,
and is added to the parent's children.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• msg - a String containing the message
• tweetID - reference of the tweet to which to reply
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.2.4 deleteTweet
public void deleteTweet(String userName, String password, int tweetnbr)
Deletes the tweet of the user with the specified number.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• tweetnbr - number of the tweet to delete.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3 Lines
9.3.1 addUser
public void addUser(String userName, String password, String lineName,
String newFollowingUserName)
Adds the specified user to the specified line. From now on, all the tweets posted by the
specified user will be displayed in the specified line.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line to which the user should be added.
• newFollowingUserName - name of the user that should be added.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.2 removeUser
public void removeUser(String userName, String password, String lineName,
String followingUserName)
Removes the specified user from the specified line. From now on, the tweets posted by
the specified user will no longer be displayed in the specified line.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line from which the user should be removed
• followingUserName - name of the user that should be removed.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.3 allUsersFromLine
public Collection<String> allUsersFromLine(String userName, String lineName)
Retrieves all the users followed in the specified line owned by the specified user.
Parameters:
• lineName - name of the line
• userName - name of the user owning the line
Returns:
A Collection of Strings containing all the userNames of the users followed in the
specified line.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.4 allTweet
public Collection<Tweet> allTweet(String userName)
Retrieves all the tweets from the specified user. Should only be used for testing the
application.
Parameters:
• userName - name of the user.
Returns:
A LinkedList of all the Tweets of the user ordered chronologically.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.5 getTweetsFromLine
public TweetChunk getTweetsFromLine(String userName, String lineName, int cNbr,
String date)
Retrieves the tweets from the chunk with the number equal to cNbr from the line
lineName of the user userName that were posted after date. If date is null all the
tweets from the chunk are returned. If cNbr is negative the last chunk from the line is
returned.
Parameters:
• lineName - name of the line.
• userName - name of the user owning the line.
• cNbr - number of the chunk of the line you want to read. The chunks are ordered
from oldest to most recent, with the most recent chunk having the highest number.
• date - String representing the limit date with the format “05/06/11 15 h 26 min
03 s GMT”
Returns:
A TweetChunk containing a LinkedList of Tweets ordered chronologically and the
number of the chunk in which they are stored.
Throws:
• ParseException - if the date has not the correct format and could not be parsed.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
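The quoted date format can be parsed with java.text.SimpleDateFormat. The exact pattern below, including the day/month order, is our guess from the single example given, not Bwitter's actual code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical parser for limit dates such as "05/06/11 15 h 26 min 03 s GMT".
public class LimitDate {
    private static SimpleDateFormat newFormat() {
        // Quoted sections ('h', 'min', 's') are literal text in the pattern.
        SimpleDateFormat f =
            new SimpleDateFormat("dd/MM/yy HH 'h' mm 'min' ss 's' z", Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f;
    }

    /** Throws ParseException when the string does not match the format,
     *  mirroring the ParseException listed above. */
    public static Date parse(String s) throws ParseException {
        return newFormat().parse(s);
    }

    public static String format(Date d) {
        return newFormat().format(d);
    }
}
```

With this pattern the example string round-trips: parsing it and formatting the result yields the original string.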
9.3.6 createLine
public void createLine(String userName, String password, String lineName)
Creates a new line with the specified name for specified user as an owner.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the new line to create.
Throws:
• LineAlreadyExists - if the user already has a line with the same name.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.7 deleteLine
public void deleteLine(String userName, String password, String lineName)
Deletes the specified line owned by the user.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line to be deleted; note that the userline and timeline
cannot be deleted.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.8 getLineNames
public Collection<String> getLineNames(String userName)
Retrieves the names of all the lines of the user.
Parameters:
• userName - the userName of the owner of the lines.
Returns:
A LinkedList of Strings containing the names of all the lines.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4 Lists
9.4.1 addTweetToList
public void addTweetToList(String userName, String password, String listname,
String tweetID)
Adds the referenced tweet to the specified list.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the user’s list.
• tweetID - reference to the tweet to add to the list.
Throws:
• ActionAlreadyPerformed - if the tweet has already been added to the list previ-
ously.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.2 removeTweetFromList
public void removeTweetFromList(String userName, String password, String listname,
String tweetID)
Removes the referenced tweet from the specified list.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the user’s list.
• tweetID - reference of the tweet to remove from the list.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.3 getTweetsFromList
public TweetChunk getTweetsFromList(String userName, String listname,
int cNbr, String date)
Retrieves the tweets from the chunk with the number equal to cNbr from the list
listname of the user userName that were posted after date. If date is null all the tweets
from the chunk are returned. If cNbr is negative the last chunk from the list is returned.
Parameters:
• listName - name of the list
• userName - name of the user owning the list.
• cNbr - number of the chunk of the list you want to read. The chunks are ordered
from oldest to most recent, with the most recent chunk having the highest number.
• date - String representing the limit date with the format “05/06/11 15 h 26 min
03 s GMT”
Returns:
A TweetChunk containing a LinkedList of Tweets ordered chronologically and the
number of the chunk in which they are stored.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.4 createList
public void createList(String userName, String password, String listname)
Creates a new list with the specified name and the user as an owner.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the new list to create
Throws:
• ListAlreadyExists - if the user already has a list with the same name.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.5 deleteList
public void deleteList(String userName, String password, String listname)
Deletes the specified list.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the list to be deleted; note that the favorite list cannot be
deleted.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.6 getListNames
public Collection<String> getListNames(String userName)
Retrieves the names of all the lists of the user.
Parameters:
• userName - name of the user owning the lists.
Returns:
A LinkedList of Strings containing the names of all the lists.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
Chapter 10
The paper
During the course of our project we co-wrote an article, together with Peter
Van Roy and Boris Mejías, entitled “Designing an Elastic and Scalable Social Network
Application”.
The contents of this paper were based on our second implementation of Bwitter,
which we detail in Section 5.1.2. It is thus not fully representative of our final
implementations and design choices.
This article has been accepted for The Second International Conference on Cloud
Computing, GRIDs, and Virtualization1, organized by the IARIA and held from the
25th to the 30th of September 2011 in Rome, Italy.
The submitted version of this paper can be found next page.
1 CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011
Designing an Elastic and Scalable Social Network Application
Xavier De Coster, Matthieu Ghilain, Boris Mejías, Peter Van Roy
ICTEAM institute
Université catholique de Louvain
Louvain-la-Neuve, Belgium
{decoster.xavier,ghilainm}@gmail.com {boris.mejias,peter.vanroy}@uclouvain.be
Abstract—Central server-based social networks can suffer from overloading caused by social trends, making the service momentarily unavailable and preventing users from accessing it when they most want it. Central server-based social networks are not adapted to face rapid growth of data or flash crowds. In this work we present a way to design a scalable, elastic and secure Twitter-like social network application built on top of Beernet, a transactional key/value datastore. By being scalable and elastic the application avoids both overloading and wasting resources by scaling up and down quickly.
Keywords-Scalable; elastic; social network; design.
I. INTRODUCTION
Social networks are an increasingly popular way for people to interact and express themselves. People can now create content and easily share it with other people. The servers of those services can only handle a given number of requests at the same time, so if there are too many requests the server can become overloaded. Social networks thus have to predict the amount of load they will have to face in order to have enough resources at their disposal. Statically allocating resources based on the mean utilisation of the service would lead to waste during slack periods and overloading during peak periods. Twitter (http://www.twitter.com) shows the “Fail Whale” graphic whenever overloading occurs. This is a tricky situation, as this load is related to many social factors, some of which are impossible to predict. For instance, we want to be able to handle the high number of people sending Christmas or New Year wishes, but also those reacting to natural disasters. This is why we turn towards scalable and elastic solutions, allowing the system to add and remove resources on the fly in order to fit the required load. In this work we focus on the design of a social network with an elastic and scalable infrastructure: Bwitter, a secure Twitter-like social network built on Beernet [1], a scalable key/value store. In the next section we give an overview of the basic required operations for a social network. We then explain why we chose Beernet for this project in Section III, and how to run multiple services on top of it in Section IV; in that section we also discuss some possible improvements for DHTs in order to increase their security and offer a richer application programming interface. We then take a closer look at the design of our application in Section V. In Section VI we compare two types of architectures on which this social network can run: one fully distributed, based on peer-to-peer, and one centralised, based on the cloud. We then finish with the implementation of our prototype in Section VII and a small conclusion in Section VIII.
II. A QUICK OVERVIEW OF REQUIRED OPERATIONS
Bwitter is designed to be a secure social network based on Twitter. Twitter is a microblogging system, and while it looks relatively simple at first sight, it hides some complex functionalities. We included almost all of those in Bwitter and added some others. We will only depict here the relevant functionalities that will help us analyse the design of the system and the differences between a centralised and a decentralised architecture.
A. Nomenclature

There are only a few core concepts on which our application is based. A tweet is basically a short message with additional meta-information. It contains a message of up to 140 characters, the author’s username and a timestamp of when it was posted. If the tweet is part of a discussion, it keeps a reference to the tweet it is an answer to and also keeps references towards the tweets that are replies to it. A user is anybody who has registered in the system. A few pieces of information about the user are kept in memory by the application, such as her complete name and her password, used for authentication. A line is a collection of tweets and users. The owner of the line can define which users she wants to associate with the line. The tweets posted by those users will be displayed in this line. This allows a user to have several lines, each with different topics and associated users.
B. Basic operations

1) Post a tweet: A user can publish a message by posting a tweet. The application will post the tweet in the lines to which the user is associated. This way all the users following her have the tweet displayed in their line.
2) Retweet a tweet: When a user likes a tweet from another user, she can decide to share it by retweeting it. This will have the effect of “sending” the retweet to all the lines to which the user is associated. The retweet will be displayed in the lines as if the original author posted it, but with the retweeter’s name indicated.
3) Reply to a tweet: A user can decide to reply to a tweet. This will include a reference to the reply tweet inside the initial tweet. Additionally, a reply keeps a reference to the tweet to which it responds. This allows the whole conversation tree to be built.
4) Create a line: A user can create additional lines with custom names to regroup specific users.
5) Add and remove users from a line: A user can associate a new user to a line; from then on, all the tweets this newly added user posts will be included in the line. A user can also remove a user from a line; she will then no longer see the tweets of this user in her line and will not receive her new tweets either.
6) Read tweets: A user can read the tweets from a line in packs of 20 tweets. She can also refresh the tweets of a line to retrieve the tweets that have been posted since her last refresh.
III. WHY BEERNET?
Beernet [2] is a transactional, scalable and elastic peer-to-peer key/value data store built on top of a DHT. Peers in Beernet are organized in a relaxed Chord-like ring [3] and keep O(log(N)) fingers for routing. This relaxed ring is more fault tolerant than a traditional ring, and its robust join and leave algorithms for handling churn make Beernet a good candidate for building an elastic system. Any peer can perform lookup and store operations for any key in O(log(N)), where N is the number of peers in the network. The key distribution is done using a consistent hash function, roughly distributing the load among the peers. These two properties are a strong advantage for the scalability of the system compared to solutions like client/server.
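The consistent-hashing key distribution mentioned above can be sketched as follows. This is only an illustration of the responsibility rule, with plain long ids and a linear ring walk; Beernet's actual ring additionally keeps O(log(N)) fingers so lookups do not scan all peers, and none of these names belong to Beernet's API.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch of consistent hashing: each peer owns the keys between
// its predecessor's id and its own id on the ring.
public class ConsistentHash {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addPeer(String name, long id) {
        ring.put(id, name);
    }

    // The responsible peer is the first peer at or after the key's hash,
    // wrapping around to the smallest id at the end of the ring.
    public String responsible(long keyHash) {
        SortedMap<Long, String> tail = ring.tailMap(keyHash);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

Because keys are hashed before placement, each peer receives a roughly equal slice of the key space, which is the load-spreading property the text relies on.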
Beernet provides transactional storage with strong consistency, using different data abstractions. Fault tolerance is achieved through symmetric replication, which has several advantages, not detailed here, compared to the leaf-set and successor-list replication strategies [4]. In every transaction, a dynamically chosen transaction manager (TM) guarantees that if the transaction is committed, at least a majority of the replicas of an item store the latest value of the item. A set of replicated TMs guarantees that the transaction does not rely on the survival of the TM leader. Transactions can involve several items. If the transaction is committed, all items are modified. Updates are performed using optimistic locking.
With respect to data abstractions, Beernet provides not only key/value pairs, as in Chord-like networks, but also key/value sets, as in OpenDHT-like networks [5]. The combination of these two abstractions provides more possibilities for designing and building the database, as we will explain in Section V. Moreover, key/value sets are lock-free in Beernet, providing better performance.
We opted for Beernet because of those native data abstractions and its elasticity and scalability properties. However, any scalable and elastic key/value store providing transactional storage with strong consistency and those data abstractions could be used too.
IV. RUNNING MULTIPLE SERVICES ON BEERNET
Multiple services running on the same DHT can conflict with each other. We will now discuss two mechanisms designed to avoid those conflicts.
A. Protecting data with Secrets
Early in the process, we elicited a crucial requirement: the integrity of the data posted by the users on Bwitter must be preserved. A classical mechanism, though not without flaws, is to use a capability-based approach. Data is stored at randomly generated keys so that other applications and users of Beernet cannot erase others’ values, because they do not know at which keys these values are stored. But in Bwitter, some information must be available to everybody and thus keys must be known by all users, meaning that we cannot use random keys. For example, any user must be able to retrieve the user profile of another user, and must thus know the key at which it is stored. The problem is that Beernet does not allow any form of authentication, so key/value pairs are left unprotected, meaning that anybody able to make requests to Beernet can modify or delete any previously stored data.
We make a first and naive assumption that services running on Beernet are bug-free and respectful of each other. They thus check at each write operation that nothing else is stored at the given key, otherwise they cancel the operation. Thanks to the transactional support of Beernet, the check and the write can be done atomically. This way we avoid race conditions where process A reads, then process B reads, both concluding that there is nothing at a given key and both writing a value, leading to the loss of one of the two writes.
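The atomic check-and-write above can be sketched with a minimal in-memory store. `writeIfAbsent` is an illustrative stand-in for a Beernet transaction, showing only why the check and the write must happen in one atomic step; it is not Beernet's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of atomic check-and-write: the write succeeds only if nothing is
// stored at the key yet, and the check cannot be interleaved with another
// writer's check, so the lost-update race described in the text is avoided.
public class CheckAndWrite {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Atomically: check the key is empty, then write.
    public boolean writeIfAbsent(String key, String value) {
        return store.putIfAbsent(key, value) == null;
    }

    public String read(String key) {
        return store.get(key);
    }
}
```

If two processes race on the same key, exactly one `writeIfAbsent` returns true; the loser observes the existing value instead of silently overwriting it.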
This assumption is not realistic and adds complexity to the code of each application running on Beernet. We thus relax it and assume that Beernet is running in a safe environment like the cloud, which implies that no malicious node can be added to Beernet. We allow any application to make requests directly to any Beernet node from the Internet. We designed a mechanism called “secrets” to protect key/value pairs and key/value sets stored on Beernet, enriching the existing Beernet API.
Applications can now associate secrets to the key/value pairs and key/value sets they store. This secret is not mandatory; if no secret is provided, a “public” secret is automatically added. The secret is needed to modify or delete what is stored at the protected key. For instance, we could have the following situation: a first request stores the value foo at the key bar using the secret ASecret, then another request tries to store another value at key bar using a secret different from ASecret. Because the secrets are different, Beernet rejects the second request, which thus has no effect on the data store. A similar mechanism has been implemented for sets, allowing the protection of the set as a whole to be dissociated from that of the values it contains.
Secrets are implemented in Beernet and have been tested through our Bwitter application. A similar but weaker mechanism is proposed by OpenDHT [5]. Complete information concerning the new secret API can be found on Bwitter’s web site (http://bwitter.dyndns.org/).
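As a rough model of the secrets behaviour in the foo/bar/ASecret example (an illustration only, not Beernet's implementation; class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of secret-protected writes: a put is accepted only if the caller
// presents the secret the key was first stored with; otherwise the request
// is rejected and the store is unchanged. Reads are not protected.
public class SecretStore {
    private static final String PUBLIC = "public"; // default when no secret given
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, String> secrets = new HashMap<>();

    public synchronized boolean put(String key, String value, String secret) {
        String s = (secret == null) ? PUBLIC : secret;
        String owner = secrets.get(key);
        if (owner != null && !owner.equals(s)) {
            return false; // wrong secret: rejected, no effect on the store
        }
        secrets.put(key, s);
        values.put(key, value);
        return true;
    }

    public synchronized String get(String key) {
        return values.get(key);
    }
}
```

The paper's example then reads: storing foo at bar with ASecret succeeds, a second write with a different secret is rejected, and a write presenting ASecret again succeeds.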
B. Dictionaries
At the moment in Beernet, as in all key/value stores we know of, there is only one key space. This can cause problems if multiple services use the same key. For instance, two services might design their database to store user profiles at a key equal to the username of the user. This means they cannot both have a user with the same username. This problem cannot be solved with the secrets mechanism we proposed. We thus propose to enhance the current Beernet API with multiple dictionaries. A dictionary has a unique name and refers to a key space in Beernet. A new application can create a dictionary as it starts using Beernet. It can later create new dictionaries at run-time as needed, which allows developers to build more efficient and robust implementations. Dictionaries can be efficiently created on the fly in O(log(N)), where N is the number of peers in the Beernet network. Moreover, dictionaries do not degrade the storing and reading performance of Beernet. If two applications need to share data, they just have to use the same dictionary. This has not yet been implemented, but the API and algorithms are currently being designed. An open problem is how to prevent malicious applications from accessing the dictionary of another application.
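A minimal sketch of dictionaries as independent key spaces (class and method names are hypothetical; the real design would distribute each dictionary over the ring rather than hold it in one map):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of multiple dictionaries: the same key can coexist in different
// dictionaries without conflict, which is the property the text motivates.
public class Dictionaries {
    private final Map<String, Map<String, String>> spaces = new HashMap<>();

    public void put(String dict, String key, String value) {
        spaces.computeIfAbsent(dict, d -> new HashMap<>()).put(key, value);
    }

    public String get(String dict, String key) {
        Map<String, String> space = spaces.get(dict);
        return space == null ? null : space.get(key);
    }
}
```

Two services can now both store a profile under the key "alice", each in its own dictionary, resolving the collision described above.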
V. DESIGN PROCESS
We will now present our design choices and explain how we relieve machines hosting popular values.
A. Main directions
We will start by discussing the main design choices we made for our implementation.
1) Make reads cheap: While designing the construction mechanism of the lines, we were faced with the following choice: either push the information and put the burden on the write, making the “post tweet” operation add a reference to the tweet in the lines of each follower; or pull the information and build the lines when a user wants to read them, by fetching all the tweets posted by the users she follows and reordering them. As people do more reads than writes on social networks, based on the assumption that each posted tweet is read at least once, we opted to make reads cheaper than writes.
2) Do not store full tweets in the lines, but references: There is no need to replicate the whole tweet inside each line, as a tweet could potentially contain a lot of information and should be easy to delete. To delete a tweet, the application only has to edit the stored tweet and does not need to go through every line that could contain it. When loading the tweet, the application can see whether it has been deleted or not.
3) Minimise the changes to an object: We want the objects to be as static as possible to enable caching. This is why we do not store potentially dynamic information inside the objects, but rather a pointer to a place where the information can be found. For instance, tweets are only modified when we delete them; if there is a reply to a tweet, the ID of the new child is stored in a separate set.
4) Do not make users load unnecessary things: Loading the whole line each time we want to see the new tweets would result in an unnecessarily high number of messages exchanged and would consume a lot of bandwidth. This is why we decided to cut lines, which in fact are just big sorted sets, into subsets of x tweets each, organised in a linked-list fashion, where x is a tunable parameter. This way the user can load tweets in chunks of x tweets. The first subset contains all the references to the tweets posted since the last time the user retrieved the line; it can thus be much larger than x tweets. This is not a problem, as users generally want to check all the new tweets when they consult a line. The cutting is then done as follows: the application removes the x oldest references from the first set, posts them in a new subset, and repeats the operation until the loaded first set is smaller than x.
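The cutting step just described can be sketched as follows, with subsets modelled as plain Java lists (an assumption made for illustration; in Bwitter they are key/value sets stored in Beernet):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of line cutting: the first subset (references sorted newest
// first) is trimmed to fewer than x entries by repeatedly moving its x
// oldest references into a new subset prepended to the chain of older
// subsets, preserving the newest-first order across the chain.
public class LineCutter {
    public static void cut(List<String> firstSet, LinkedList<List<String>> chain, int x) {
        while (firstSet.size() >= x) {
            // Copy the x oldest references (at the tail of the list)...
            List<String> subset = new ArrayList<>(
                firstSet.subList(firstSet.size() - x, firstSet.size()));
            chain.addFirst(subset);
            // ...then remove them from the first set and repeat.
            firstSet.subList(firstSet.size() - x, firstSet.size()).clear();
        }
    }
}
```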
5) Retrieving tweets in order: Due to the cutting mechanism and delays in the network, we cannot be sure that each reference contained in a subset is strictly newer than the references stored in the next subset. So we also retrieve the tweet references from the next subset and select only the 20 newest references before fetching the tweets.
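The merge-and-select step can be sketched as below; the `Ref` record and the generalised page size are illustrative assumptions, standing in for the fixed pack of 20 in the text:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of ordered retrieval: because subset boundaries are not strict
// timestamp boundaries, references from two adjacent subsets are merged
// and only the newest pageSize refs are kept before fetching the tweets.
public class OrderedRetrieval {
    public record Ref(String tweetId, long timestamp) {}

    public static List<Ref> newest(List<Ref> firstSubset, List<Ref> nextSubset, int pageSize) {
        List<Ref> merged = new ArrayList<>(firstSubset);
        merged.addAll(nextSubset);
        merged.sort(Comparator.comparingLong(Ref::timestamp).reversed());
        return merged.subList(0, Math.min(pageSize, merged.size()));
    }
}
```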
6) Filtering the references: When a user is dissociated from a line, we do not want our application to still display the tweets she posted previously. We decided not to scan the whole line to remove all the references added by this user, but rather to remove the user from the list of users associated with the line and filter the references based on this list before fetching the corresponding tweets.
7) Only encrypt sensitive data: Most of the data in Twitter is not private, so there would be no point in encrypting it. Only sensitive data, such as the passwords of the users, should be protected by encryption when stored.
8) Modularity: Even if our whole design and architecture rely on the features and API offered by Beernet, it is always better to be modular and define clear interfaces, so that a whole layer can easily be replaced by another. For instance, any other DHT could easily be used, provided it supports the same data abstractions or they can be simulated.
B. Improving overall performance by adding a cache
1) The popular value problem: Given the properties of the DHT, a key/value pair is mapped to one node or to f nodes, where f is the replication factor, depending on the desired redundancy level. This implies that if a key is frequently requested, the nodes responsible for it can be overloaded while the rest of the network is mostly idle, and adding additional machines is not going to improve the situation. It is not uncommon on Twitter to have wildly popular tweets that are retweeted by thousands of users. In the worst case, retweets can be seen as an exponential phenomenon, as all the users following the retweeter are susceptible to retweet it too.
2) Use an application cache as the solution: Adding nodes will not solve the problem, because the number of nodes responsible for a key/value pair will not change. In order to reduce the number of requests, we have decided to add a cache with an LRU replacement strategy at the application level. This solves the retweet problem because now the application, which is in charge of several users, will have the tweet in its cache as soon as one of its users reads the popular tweet. The tweet will stay in the cache because the users frequently make requests to read it. This way we reduce the load put on the nodes responsible for the tweet.
We now have to take into account that values are not immutable: they can be deleted and modified. A naive solution would be to actively poll Beernet to detect changes to the key/value pairs stored in the cache. This would be quite inefficient, as there are several values, like tweets, that almost never change. In order to avoid polling, we need a mechanism that warns us when a change is made to a key/value pair stored in the cache. Beernet, as described in [1], allows an application to register to a key/value pair and to receive a notification when this value is updated. Our application cache will thus register to each key/value pair that it holds, and when it receives a notification from Beernet indicating that a pair has been updated, it will update its corresponding copy. This mechanism has the big advantage of removing unnecessary requests. Notifications are asynchronous, so the copies in the cache can have different values at a given moment, leading to an eventual consistency model for reads. On the other hand, writes do not go through the cache but directly to Beernet, which keeps strong consistency for writes inside Beernet. This is an acceptable trade-off, as we do not need strong consistency for reads inside a social network.
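A minimal sketch of such an application-level LRU cache, using `LinkedHashMap` in access order. The `onUpdate` hook stands in for the Beernet update notification, and the capacity and loader function are assumed parameters, not part of Bwitter's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the application cache: reads are served from the cache with
// LRU eviction; a store notification refreshes the cached copy so that
// active polling is unnecessary.
public class TweetCache {
    private final int capacity;
    private final Map<String, String> cache;

    public TweetCache(int capacity) {
        this.capacity = capacity;
        // Access-order LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > TweetCache.this.capacity;
            }
        };
    }

    // Read-through: serve from the cache, otherwise load from the store.
    public synchronized String read(String key, java.util.function.Function<String, String> load) {
        return cache.computeIfAbsent(key, load);
    }

    // Invoked when the store notifies us that a cached pair was updated.
    public synchronized void onUpdate(String key, String newValue) {
        if (cache.containsKey(key)) {
            cache.put(key, newValue);
        }
    }
}
```

Writes still go directly to the store; only reads pass through the cache, giving the eventual-consistency-for-reads model described above.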
VI. ARCHITECTURE
We will present two different scalable architectures for our application. In both architectures, our application is decomposed into three loosely coupled layers: from top to bottom, the Graphical User Interface (GUI), Bwitter, which handles the operations described in Section II, and the key/value data store. For this last layer we use Beernet, but it could be replaced by any key/value store with similar properties. As a reminder, the data store must provide read/write operations on values and sets, as well as implementing the secrets we described before. This architecture is very modular: each layer can be changed provided it respects the API of the layer above. We now have to decide where Beernet will run. We have two options: either let the Beernet nodes run on the users’ machines, or run them in the cloud, leading to two radically different architectures: the completely decentralised architecture and the cloud-based architecture.
A. Completely decentralised architecture
In a fully decentralised architecture, the user runs a Beernet node and the Bwitter application on her machine. The Bwitter application makes requests directly to this local Beernet node. Ideally, this local Beernet node should not be restricted to the Bwitter application but should also be accessible to other applications. The problem with this approach is that the user can bypass the protection mechanisms enforced at a higher level by accessing the low-level DHT functions of Beernet. Usually this is not a problem, as untrusted users would not know at which keys the data is stored and thus cannot compromise it. But in our case the data has to be at known keys so that the application can dynamically retrieve it. This means that any user understanding how our application works would be able to delete, edit or forge lines, users, tweets and references. This would be a security nightmare.
We tried to tackle this problem with the secrets mechanism we designed to enrich Beernet’s interface. While this prevented users from editing or deleting data they did not create themselves, we could not prevent them from forging elements. To avoid this, we needed a way to authenticate every piece of data posted by a user. There are cryptographic mechanisms to enforce this and ways to efficiently manage the keys, but they are outside the scope of this paper.
Even with those mechanisms in place, we have to enforce security at the DHT level. Beernet uses encryption for the communication between nodes to avoid leaking confidential information. But anyone could add modified Beernet nodes behaving maliciously. Aside from the usual attacks [6], a corrupted node could be modified to reveal all the secrets inside the requests going through it. We thus have to make sure that the code running the Beernet node is not modified, so we need a mechanism that enforces remote attestation, as described in [7]. This can be done by using a TPM, which provides cryptographic code signatures in hardware, on the users’ machines, in order to be able to prove to other Beernet nodes that the client’s node is trustworthy. Until a Beernet node has a way to tell for sure that it can trust another Beernet node, we are in a dead end. Indeed, anyone stealing the secret of another user can erase any data posted by that user.
Assuming that a Twitter session time is short, this can be a problem if our application is the only one running on top of Beernet. Indeed, it will result in nodes frequently joining and leaving the network with short connection times. Each of those changes in the topology of Beernet will modify the keys for which the nodes are responsible, triggering key/value pair reallocation, itself leading to important and undesirable churn. This would not be an ideal environment for a DHT.
B. Cloud-based architecture
With this architecture, the Bwitter and Beernet nodes run in the cloud, which is an adequate environment for scalable and elastic applications. We can thus easily add or remove Bwitter and Beernet nodes to meet the demand, increasing the efficiency of the network. A Bwitter node is a machine running Bwitter, but generally also a Beernet node. This solution also allows us to keep a stable DHT, as nodes are not subject to high churn as was the case in the first architecture we presented.
Using this solution, we do not have all the security issues we had with the fully decentralised architecture. This is because the users no longer have direct access to the Beernet nodes but have to go through a Bwitter node, and can only perform the operations defined in Section II. Furthermore, the communication channel between the GUI and the Bwitter node can guarantee the authenticity of the server and the encryption of the data being transmitted, for instance using HTTPS. Bwitter requires users to be authenticated to access or modify their data. In doing so, we provide data integrity and authenticity because, for instance, Bwitter does not allow a user to delete a tweet that she did not post, or to post a tweet using the username of someone else. The security problem concerning possible revelations of user secrets due to a malicious node is not relevant anymore, as our DHT is fully under our control.
The cloud-based architecture is thus more secure and stable, which is why we finally chose to implement this solution. We now take a closer look at how the layer stack is built. Note that in spite of our research we did not find any information about Twitter’s current architecture, so we are not able to compare both architectures.
As said before, the Beernet layer runs in the cloud. This layer is monitored in order to detect flash crowds, and Beernet nodes are added and removed on the fly to meet the demand.
The intermediate layer, also running in the cloud, is Bwitter; it communicates with Beernet and the GUIs. This layer can be put on the same machine as a Beernet node or on another machine. Normally there should be fewer Bwitter nodes than Beernet nodes. One Bwitter node is associated with a Beernet node, but can be re-linked to another Beernet node if it goes down. Each Bwitter node should be connected to a different Beernet node in order to share the load. In practice, the Bwitter nodes will not be accessible directly; they will be accessed through a fast and transparent reverse proxy that will be in charge of load balancing between Bwitter nodes. At the moment, Bwitter nodes use sessions to identify the users, so the reverse proxy is forced to keep track of the sessions in order to be able to map the same client to the same Bwitter node. We plan to change this behaviour to offer a completely RESTful Bwitter API.
The top layer is the GUI. It connects to a Bwitter node using a secure connection channel that guarantees the authenticity of the Bwitter node and encrypts all the communications between the GUI and the Bwitter node. Multiple GUI modules can connect to the same Bwitter node. The GUI layer is the only one running on the client machine.
C. Elasticity
We previously explained that to prevent the Fail Whale error, the system needs to scale up, allocating more resources to be able to answer an increase in user requests. Once the load of the system gets back to normal, the system needs to scale down to release unused resources. We briefly explain how a ring-based key/value store needs to handle elasticity in terms of data management. We are currently working on making the elastic behaviour more efficient in Beernet.
1) Scale up: When a node j joins the ring between peers i and k, it takes over part of the responsibility of its successor, more specifically all keys from i to j. Therefore, data migration is needed from peer k to peer j. The migration involves not only the data associated with keys in the range ]i, j], but also the replicated items symmetrically matching that range. Other NoSQL databases such as HBase (http://hbase.apache.org) do not trigger any data migration upon adding new nodes to the system, showing better performance when scaling up.
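The responsibility rule for a join can be sketched as a membership test on the half-open ring interval ]i, j]: the keys that migrate from k to j are exactly those whose hash falls in that interval. Plain long ids stand in for hashed keys here; this is an illustration, not Beernet's code.

```java
// Sketch of the migration rule when node j joins between i and k:
// keys in ]i, j] move from the old responsible k to the new node j.
public class RingJoin {
    // True if key belongs to the half-open ring interval ]from, to].
    public static boolean inRange(long key, long from, long to) {
        if (from < to) {
            return key > from && key <= to;
        }
        return key > from || key <= to; // the interval wraps around the ring
    }
}
```

Note that with symmetric replication the same test is applied to each of the f ranges symmetrically matching ]i, j], not just the primary one.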
2) Scale down: There are two ways of removing nodes from the system: by gently leaving and by failing. It is very reasonable to consider gentle leaves in cloud environments, because the system explicitly decides to reduce its size. In that case, it is assumed that the leaving peer j has enough time to migrate all its data to its successor, who becomes the new responsible for the key range ]i, j], i being the predecessor.
Scaling down due to the failure of peers is much more complicated, because the new responsible for the missing key range needs to recover the data from the remaining replicas. The difficulty comes from the fact that the value of the application keys is unknown, since the hash function is not bijective. Therefore, the peer needs to perform a range query, as in Scalaris [8], but based on the hash keys. Another complication is that there are no replica sets based on key ranges, but only on each single key.
VII. IMPLEMENTATION
We have implemented a prototype based on our cloud-based architecture. Sources are freely available at http://bwitter.dyndns.org. We will now detail how we actually implemented it. A full schema of our implementation can be seen in Figure 1.
As explained, our architecture has three main layers. The DHT layer is implemented using Beernet, built in Oz v1.3.2 (http://www.mozart-oz.org/) and enhanced with the secrets mechanism. Beernet is accessible through a socket API, which we used to communicate with the Bwitter layer. An alternative version of the data store layer, used for testing the application, is also made available at http://bwitter.dyndns.org.
Figure 1. Implementation structure scheme
At the top of the Bwitter layer is a Tomcat 7.0 application server (http://tomcat.apache.org) using Java servlets from Java EE. The Bwitter layer is connected to the bottom layer using sockets to communicate with an Oz agent controlling Beernet. The Bwitter nodes are accessible remotely via an HTTP API; eventually we would like to make it fully conform to REST. The Tomcat servers are not accessed directly; they are accessed through a reverse proxy server, in this case nginx (http://wiki.nginx.org), which is reported to support 10k concurrent connections. This nginx server is in charge of serving static content as well as doing load balancing for the Tomcat servers. The load balancing is performed so that messages of the same session are always mapped to the same Tomcat server; this is necessary as authentication is needed to perform some of the Bwitter operations, and we did not want to share the state of the user sessions between the Bwitter nodes for performance reasons. The connection to the web-based API is performed using HTTPS to meet the secure channel requirement of our architecture.
The last layer is the GUI. We decided to implement it as a Rich Internet Application (RIA), using the Adobe Flex technology (http://www.adobe.com/products/flex). This GUI uses the web API we developed to access Bwitter.
VIII. CONCLUSION
Our goal was to build a new system able to withstand flash crowds by relying on an elastic and scalable architecture. This allows us to add resources to face heavier traffic and to avoid wasting resources otherwise.
While the prototype is not yet totally finished, our whole design is scalable, meaning we do not have single absurdly huge operations due to the high number of users one might follow or be followed by. We avoid overloading specific machines because we do not rely on any global keys, and we use our cache mechanism to prevent the retweet problem. Some preliminary scalability tests have been done on Amazon and are encouraging.
During the implementation we also came across two potentially important improvements for key/value stores, namely duplicating the key space using multiple dictionaries and the protection of data via secrets, the latter now implemented in Beernet’s latest release.
REFERENCES
[1] B. Mejías and P. Van Roy, “Beernet: Building self-managing decentralized systems with replicated transactional storage,” IJARAS: International Journal of Adaptive, Resilient, and Autonomic Systems, vol. 1, no. 3, pp. 1–24, July–Sept 2010.
[2] Programming Languages and Distributed Computing Research Group, UCLouvain, “Beernet: pbeer-to-pbeer network,” 2009. [Online]. Available: http://beernet.info.ucl.ac.be
[3] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” SIGCOMM Comput. Commun. Rev., vol. 31, pp. 149–160, August 2001.
[4] A. Ghodsi, L. O. Alima, and S. Haridi, “Symmetric replication for structured peer-to-peer systems,” in Proceedings of the 2005/2006 International Conference on Databases, Information Systems, and Peer-to-Peer Computing, ser. DBISP2P’05/06. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 74–85.
[5] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu, “OpenDHT: A public DHT service and its uses,” SIGCOMM Comput. Commun. Rev., vol. 35, pp. 73–84, August 2005.
[6] G. Urdaneta, G. Pierre, and M. van Steen, “A survey of DHT security techniques,” ACM Computing Surveys, vol. 43, no. 2, Jan 2011.
[7] Wikipedia, “Trusted computing,” http://en.wikipedia.org/wiki/Trusted_computing#Remote_attestation, 2011. [Online; accessed 28-June-2011].
[8] T. Schütt, F. Schintke, and A. Reinefeld, “Scalaris: Reliable transactional p2p key/value store,” in Proceedings of the 7th ACM SIGPLAN Workshop on ERLANG, ser. ERLANG ’08. New York, NY, USA: ACM, 2008, pp. 41–48.