Universite catholique de Louvain
Louvain School of Engineering
Computing Science Engineering Department
Designing an elastic and scalable social network application
Promoter:
Pr. Peter Van Roy
Readers:
Pr. Marc Lobelle
Boris Mejías
Master's thesis presented to obtain the degree of
Master in Computer Engineering, option
networking and security, by
Xavier De Coster and Matthieu Ghilain.
Louvain-la-Neuve
Academic year 2010 - 2011
Acknowledgments
The Bwitter team would like to thank Pr. Peter Van Roy for his help and insightful
comments.
We also want to thank Boris Mejías for his guidance and availability during the
whole project.
We thank Florian Schintke, member of the Scalaris development team, for his help
during our analysis of Scalaris and the numerous answers he provided to our questions.
We also thank Quentin Hunin for his support and constructive feedback during the
last few weeks of writing.
Finally, we would also like to thank our families, our friends and our girlfriends,
Ines and Lorraine, for their unconditional support and encouragement.
Abstract
The amount of traffic on web-based social networks is very difficult to predict. In
order to avoid wasting resources during low-traffic periods or being overloaded during
peak periods, it is desirable to adapt the amount of resources dedicated to the service.
In this work we detail the design and implementation of our own social network
application, called Bwitter. Our first goal is to make Bwitter's performance scale with
the number of machines we dedicate to it. Our second goal follows from the first: we
want Bwitter to be elastic, so that it can react to flash crowds by adding resources to
handle the load without suspending its services. To achieve the desired scalability and
elasticity, Bwitter is implemented on a scalable key/value datastore with transactional
capabilities running on the cloud.
In our tests we study the behaviour of Bwitter using the Scalaris datastore, with
both running on Amazon's Elastic Compute Cloud. We show that the performance of
Bwitter increases almost linearly with the amount of resources we allocate to it. Bwitter
is also able to improve its performance significantly in a matter of minutes.
Contents
I The Project
1 Introduction
1.1 Social networks
1.2 Scalable Data Stores
1.3 The Cloud
1.4 The Bwitter project
1.4.1 Twitter
1.4.2 Bwitter
1.4.3 Contributions
1.5 Roadmap
2 State-of-the-art
2.1 Scalable datastores
2.1.1 Key/value Stores
2.1.2 Document Stores
2.1.3 Extensible Record Stores
2.1.4 Relational Databases
2.2 Peer-to-peer systems
2.3 DHT
2.4 Study of scalable key/value stores properties
2.4.1 Network topology
2.4.2 Storage abstraction
2.4.3 Replication strategy and consistency model
2.4.4 Transactions
2.4.5 Churn
2.4.6 Security
2.5 The Cloud
2.6 Conclusion
3 The Architecture
3.1 The requirements
3.1.1 Non-Functional requirements
3.1.2 Functional requirements
3.1.3 Conclusion
3.2 Architecture
3.2.1 Open peer-to-peer architecture
3.2.2 Cloud Based architecture
3.2.3 The popular value problem
3.2.4 Conclusion
4 The Datastore
4.1 The datastore choice
4.1.1 Identifying what we need
4.1.2 Our two choices
4.2 General Design
4.3 Design of the datastore
4.3.1 Key uniqueness
4.3.2 Push approach design details
4.3.3 The Pull Variation
4.3.4 Conclusion
4.4 Running multiple services using the same datastore
4.4.1 The unprotected data problem
4.4.2 Key already used problem
4.4.3 Conclusion
5 Algorithms and Implementation
5.1 Implementation of the cloud based architecture
5.1.1 Open peer-to-peer implementation
5.1.2 First cloud based implementation
5.1.3 Final cloud based implementation
5.2 Nodes Manager
5.3 Scalaris Connections Manager
5.3.1 Failure handling
5.4 Bwitter Request Handler
5.4.1 The push approach
5.4.2 The pull approach
5.4.3 Theoretical comparison of Pull and Push approach
5.5 Conclusion
6 Experiments
6.1 Working with Amazon
6.1.1 Choosing the right instance type
6.1.2 Choosing an AMI
6.1.3 Instance security group
6.1.4 Constructing Scalaris AMI
6.2 Working with Scalaris
6.2.1 Launching a Scalaris ring
6.2.2 Scalaris performance analysis
6.3 Bwitter tests
6.3.1 Experiment measures discussion
6.3.2 Push design tests
6.3.3 Pull scalability test
6.3.4 Conclusion: Pull versus Push
6.4 Conclusion
7 Conclusion
7.1 Further work
II The Annexes
8 Beernet Secret API
8.1 Without replication
8.1.1 Put
8.1.2 Delete
8.2 With replication
8.2.1 Write
8.2.2 CreateSet
8.2.3 Add
8.2.4 Remove
8.2.5 DestroySet
9 Bwitter API
9.1 User management
9.1.1 createUser
9.1.2 deleteAccount
9.2 Tweets
9.2.1 postTweet
9.2.2 reTweet
9.2.3 reply
9.2.4 deleteTweet
9.3 Lines
9.3.1 addUser
9.3.2 removeUser
9.3.3 allUsersFromLine
9.3.4 allTweet
9.3.5 getTweetsFromLine
9.3.6 createLine
9.3.7 deleteLine
9.3.8 getLineNames
9.4 Lists
9.4.1 addTweetToList
9.4.2 removeTweetFromList
9.4.3 getTweetsFromList
9.4.4 createList
9.4.5 deleteList
9.4.6 getListNames
10 The paper
Part I
The Project
Chapter 1
Introduction
Web 2.0 offers many new services to Internet users, who can now share, generate
and upload content online faster and more easily than ever before. All these services
require computing, bandwidth and storage resources. Predicting the required amount
of those resources can be tricky, especially if a service wants to avoid wasting them
while still being able to face high usage peaks. We are going to take a closer look at
the scalability and elasticity of perhaps the most famous of these Web 2.0 services,
namely social networks.
1.1 Social networks
Social networks such as Facebook and Twitter are an increasingly popular way for
people to interact and express themselves. Facebook, for instance, has 600 million active
users [6]. People can now create content and easily share it with others. Social networks
are now a means of communication in their own right, used by politicians, artists and
brands to easily reach large communities and promote themselves or their products.
They also allow people to quickly organise social events, from barbecues to nationwide
revolutions such as those in Tunisia [21, 40] and Egypt [16].
Social networks are also a powerful communication tool during natural disasters.
Twitter and Facebook were very useful for getting updates from relatives and friends
when the mobile phone networks and some telephone landlines collapsed in the hours
following the magnitude 8.9 earthquake in Japan. The US State Department even used
Twitter to publish emergency numbers [35]. Other examples are the Haiti [18] and
Chile [9] earthquakes, which were covered in real time thanks to social networks, with
photos sent to the rest of the world directly via Twitter.
It is thus critical that social networks do not crash when their users need them
most. However, the servers of these social networks can only handle a given number of
simultaneous requests; if there are too many, the servers become overloaded. A typical
result of overloading is Twitter suspending its services and displaying the "Fail Whale"
shown in Figure 1.1.
Figure 1.1: "Lifting a Dreamer", aka the Fail Whale, illustration by Yiying Lu displayed when Twitter is overloaded.
Avoiding overload efficiently is a tricky problem, as the load is related to many
social factors, some of which are impossible to predict. For instance, we want to be able
to handle the large number of people sending Christmas or New Year wishes, but also
those reacting to natural disasters. This is why we turn towards scalable and elastic
solutions, allowing the system to add and remove resources on the fly in order to fit
the required load.
Social networks are also platforms where users share personal information meant
to be seen only by specific peers. Other personal information, such as contact details,
is sometimes stored in the system too. More and more users are beginning to worry
about who ultimately has access to this information and what can be done with it. It
is thus important to have a system that is secure and enforces the privacy of the end
user.
1.2 Scalable Data Stores
Web 2.0 called for a different kind of database than the previous Relational
Database Management System (RDBMS) solutions: data stores able to host huge
amounts of data and handle many parallel requests at the same time. There are now
numerous scalable and elastic storage solutions answering this demand. These scalable
data stores can store increasingly more data and handle more requests as we allocate
more resources to them, because they have been built to share their load over the
machines allocated to them. These data stores also have elastic properties, allowing
them to add or remove resources to gracefully scale up or down without having to be
rebooted. This elasticity is crucial for scaling up to face sudden increases in traffic, but
also for scaling down when the hype is over, in order to avoid wasting resources.
As our work revolves around the scalability and elasticity of social network applications,
we are bound to work with these scalable data stores. Many different kinds of
scalable data stores exist, and we present them in our state-of-the-art in Chapter 2.
1.3 The Cloud
The cloud is a phenomenon that is hard to ignore these days, as most web applications
tend to rely on it to provide their services. The cloud refers to on-demand
resources such as storage, bandwidth and processing power, but also to on-demand
services such as e-mail or word processing [2]. Computation can thus be transferred
from the user's machine, as was the case in the past, to the machines forming the cloud.
This allows users with very little computational power or storage to still execute heavy
calculations or store huge amounts of data. A typical analogy for the usage of cloud
resources is public utilities such as water or electricity: specialised companies provide
those services at a fraction of what it would cost us to deploy and maintain all the
required infrastructure ourselves.
The cloud is thus the ideal platform if we do not want to invest in costly hardware
and maintenance. This is especially true if we do not know beforehand whether our
service will be successful. We can start small and pay for only a small amount of
resources. If our service is popular, we can easily grow by requesting more resources,
and thus pay a higher price. But if our service does not manage to attract many people,
we have not wasted money investing in powerful servers. Furthermore, the resources
the cloud offers are elastic, meaning we can increase or decrease them on the fly and
only pay for the amount we really need to keep our service going.
We use the scalability and elasticity properties of the cloud throughout this work,
which is why we detail it further in our state-of-the-art in Chapter 2.
1.4 The Bwitter project
Bwitter is a lighter version of Twitter, the famous social network. Since some
readers might be unfamiliar with Twitter, we introduce it briefly before going further
with the description of Bwitter.
1.4.1 Twitter
Twitter is a micro-blogging system that allows users to post small text messages
of 140 characters called tweets. An enormous number of tweets is posted each day:
according to Twitter themselves [33], 177 million tweets were posted in March 2011,
and the record is 6,939 tweets per second, set 4 seconds after midnight in Japan on
New Year's Day.
Users can choose to display the messages of other users they find interesting by
following them. In Figure 1.2 you can see the home screen of Twitter with user
Zulag (aka Xavier De Coster, co-author of this Master's thesis) logged in; the messages
of the users he follows are displayed as a stream in his "Timeline".
Figure 1.2: Home screen of Twitter’s web interface.
Twitter offers additional functionality on top of this message posting: for instance,
a user can reply to or retweet (share) any message he wants. He can also address a
message directly to another user by starting his message with "@destinationUser". The
main difference between Twitter, and now also Google+ 1, and other social networks
such as Facebook is the asymmetry of the social connections. The connection does not
go in both directions: a user A can follow a user B without user B having to follow
user A. This is unlike the Facebook system, where two users become "Friends" and
automatically see each other's updates. This behaviour encourages Twitter to be used
as a place where fans can follow their favourite stars, so much so that 10% of the user
accounts generate 90% of the traffic [17].
Tweets can also be tagged by users using hashtags. Tweets containing a hashtag
are automatically added to a group of tweets associated with this hashtag.
1https://plus.google.com/, last accessed 13/08/2011
4
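The asymmetric follow relation described above can be modelled as a directed graph. The sketch below is ours, not part of Twitter or Bwitter; the names and functions are illustrative:

```python
# A follow relation as a directed graph: follows[a] is the set of users
# that user a follows. Edges are one-way, unlike Facebook's symmetric
# "Friend" relation, so A can follow B without B following A.
follows = {"A": set(), "B": set()}

def follow(follower, followee):
    follows[follower].add(followee)

def timeline_sources(user):
    # The users whose tweets appear in `user`'s timeline.
    return follows[user]

follow("A", "B")                      # A follows B ...
assert "B" in timeline_sources("A")
assert "A" not in timeline_sources("B")  # ... without B following A
```

This one-way edge is what lets a star accumulate millions of followers without following anyone back.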
1.4.2 Bwitter
We decided to develop Bwitter as an elastic and scalable social network application
and to study how it behaves when faced with flash crowds and heavy traffic.
Bwitter is an open-source version of Twitter based on a scalable data store and
developed to run on a highly elastic cloud architecture. Bwitter thus offers functionality
similar to Twitter's. We chose Twitter because it is one of the more basic social
networks and because it is now incredibly famous. The data store used by Bwitter is a
key/value store, which we present in detail in Chapter 4. Bwitter was designed so that
other services could run on the same data store without interfering with each other's
data.
Bwitter is developed in multiple loosely coupled layers, allowing for maximal
modularity. We added an optional cache layer on top of the key/value data store
in order to maximise performance. Bwitter manages the cloud machines on which
it runs as well as the data store nodes, restarting them when needed. During the
implementation we took advantage of existing and proven technologies, leading to an
efficient and robust implementation.
1.4.3 Contributions
The main contributions of this work are:
• The design of a scalable social network for microblogging.
• Improvements to Beernet's API.
• Helping to improve the bootstrapping of Scalaris and studying its behaviour on
the Amazon Elastic Compute Cloud.
During the development of Bwitter we identified some potential improvements in
one of the datastores we were using, namely Beernet [19, 23], and designed a new API
for protecting and managing the rights to the stored data. This new API, supporting
secrets, is now implemented and supported in Beernet version 0.9.
In order to further understand the behaviour of Bwitter, we ran performance tests
with the Scalaris [29] data store on Amazon's Elastic Compute Cloud (EC2), testing
its scalability and elasticity. We also studied the impact of machine resources, the
number of parallel requests and conflicting operations on Scalaris' performance. During
our discussions with the developers of Scalaris we helped them locate an instability in
the booting of their system.
Ultimately, we implemented two different designs for Bwitter and tested both on
Amazon's EC2, showing very good scalability and elasticity properties. During the
course of the development we presented a demo of our project at the Beernet stand of
the "Foire du Libre"2, held on the 6th of April at Louvain-la-Neuve. We have also
co-written an article, along with Peter Van Roy and Boris Mejías, entitled "Designing
an Elastic and Scalable Social Network Application", in which we detail some of the
observations and design decisions developed in this master's thesis. This article, which
can be found in Chapter 10 of our annexes, has been accepted for the Second
International Conference on Cloud Computing, GRIDs, and Virtualization3, organised
by IARIA and held from the 25th to the 30th of September 2011 in Rome, Italy.
2The "Foire du Libre" is a fair celebrating open source software, organised by Louvain-li-Nux: http://www.louvainlinux.be/foire-du-libre/, last accessed 05/08/2011
1.5 Roadmap
We start with our state-of-the-art in Chapter 2, where we discuss the different
technologies we used and explored during the development of Bwitter, such as scalable
data stores and cloud services.
We then identify the main requirements of our project and discuss the general
architecture of Bwitter in Chapter 3. We explain why we chose to base it on the
cloud instead of letting it run in the wild on an open peer-to-peer system. In this
chapter we also explain how a cache could solve potential problems due to values being
too popular.
The next step is an in-depth look at the data store we are going to use, in
Chapter 4. We detail our main objectives in terms of data representation and explain
how we decided to store the different data abstractions we use in our data store. We
also look at how we can avoid conflicts between two different applications using the
same data store.
We detail the different modules composing the Bwitter system in Chapter 5,
highlighting their purpose and the main algorithms developed to implement them. We
also compare more thoroughly the two different approaches, push and pull, for our
application's most crucial functions, posting and reading tweets. We end this chapter
with a global overview of the implemented architecture and detail how the different
modules fit together.
We carry on with a series of experiments in Chapter 6. We start by testing
Scalaris and measuring the impact of a few chosen parameters on its performance,
scalability and elasticity. We then measure the performance, scalability and elasticity
of Bwitter and compare the results for the push and pull approaches.
We finish this master's thesis with a conclusion in Chapter 7, where we reflect on
the achieved work, the lessons learned and the further improvements that could be
made to our application.
In the annexes you will find the new API we designed for Beernet, the API of
Bwitter and a section for our mathematical demonstrations.
3CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011
Chapter 2
State-of-the-art
In this chapter we take a look at the relevant technologies that could be useful
to the Bwitter project. We start with the different existing scalable datastores, in
order to decide which kind is most appropriate for our application. From there we take
a closer look at peer-to-peer systems and their lookup performance, and further study
the properties of Distributed Hash Tables (DHTs). Finally, we give an overview of the
different services the cloud has to offer.
2.1 Scalable datastores
We start our state-of-the-art with a section about scalable datastores. As our
application is going to rely heavily on a datastore, it is important to understand the
different kinds available today, as well as their pros and cons [7].
Several kinds of scalable datastores are available, each with its own specificities,
but four main classes can be put forward: key/value stores, document stores,
extensible record stores and relational databases. We are going to compare the
functionality they provide and the way they achieve scalability.
Most of these datastores do not provide ACID properties, but BASE properties.
ACID stands for Atomicity, Consistency, Isolation, Durability; BASE stands for
Basically Available, Soft state, Eventually consistent. This eventual consistency is often
said to be a consequence of Eric Brewer's CAP theorem [29], which states that a system
can have only two out of three of the following properties: consistency, availability, and
partition tolerance. Most scalable datastores decide to give up consistency, but some
opt for more complex trade-offs.
2.1.1 Key/value Stores
These are the simplest kind of datastore: they store values at user-defined indexes
called keys and behave as hash tables. They are very useful if you need to look up
objects based on only one attribute; otherwise you might want a more complex
datastore. Some key/value stores provide a key/set abstraction allowing multiple values
to be stored at a single key. Key/value stores all support insert, delete and lookup
operations, but they also generally provide a persistence mechanism and additional
functionality such as versioning, locking and transactions. Replication can be
synchronous or asynchronous; the second option allows faster operations, but some
updates may be lost on a crash and consistency cannot be guaranteed. Their scalability
is ensured through key distribution over nodes, and some present ACID properties. In
conclusion, this solution, by its simplicity, scales easily, but this simplicity comes at
the cost of poor data structure abstractions. Notable examples are Scalaris, Riak,
Voldemort, Redis and Beernet.
Figure 2.1: Data organisation in a key/value datastore.
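The operations listed above can be summarised in a minimal in-memory model. This is our own sketch; the class and method names are illustrative and do not correspond to any particular store's API:

```python
# Minimal in-memory model of a key/value store with the key/set
# abstraction mentioned above. Real stores (Scalaris, Riak, ...) add
# replication, persistence and, for some, transactions on top of this.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # insert or overwrite the value stored at `key`
        self._data[key] = value

    def get(self, key):
        # lookup is possible only by the single key attribute
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def add_to_set(self, key, value):
        # key/set abstraction: multiple values under one key
        self._data.setdefault(key, set()).add(value)

store = KVStore()
store.put("user:42:name", "alice")
store.add_to_set("user:42:followers", "bob")
store.add_to_set("user:42:followers", "carol")
```

Note that everything beyond get-by-key (e.g. "all users in city X") must be built by the application on top of such keys, which is the "poor data structure abstractions" cost mentioned above.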
2.1.2 Document Stores
These systems store documents and index them. A document can be seen as an
object whose attribute names are dynamically defined for each document at runtime.
These attributes are not necessarily predefined in a global schema, unlike, for instance,
SQL, which imposes defining the schema beforehand. Moreover, the attributes can be
complex, meaning that nested and composite values are allowed. It is possible to
explicitly define indexes to speed up searches.
Replication is asynchronous in order to increase the speed of operations. Often
scalability is ensured by reading only one replica, thus sacrificing strong consistency,
but some document stores, like MongoDB, can obtain scalability without that
compromise. MongoDB allows parts of a collection to be split across several nodes in
order to increase scalability, instead of relying on replication. This technique is called
sharding.
Figure 2.2: Data organisation in a document store.
A popular abstraction, called domain, database, collection or bucket depending on
the document store, is often provided to allow the user to group documents together.
Users can query collections based on multiple attribute-value constraints. Document
stores are useful for storing different kinds of objects and for making queries on
attributes those objects share. Other notable examples are CouchDB, SimpleDB and
TerraStore.
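A query over such a collection can be sketched as a filter on attribute-value constraints. The `query` helper and the sample fields below are hypothetical, not the API of any specific document store:

```python
# Schema-less documents: fields vary per document and are not declared
# in a global schema. A collection query keeps the documents matching
# every given attribute-value constraint.
users = [
    {"name": "alice", "city": "Brussels", "age": 25},
    {"name": "bob", "city": "Brussels"},          # no "age" field at all
    {"name": "carol", "city": "Rome", "age": 30},
]

def query(collection, **constraints):
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in constraints.items())]

# Multiple constraints at once, something a plain key/value store
# cannot answer without application-side indexes:
brussels_25 = query(users, city="Brussels", age=25)
```

A real store would consult an index rather than scan the whole collection, but the queryable-attributes model is the same.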
2.1.3 Extensible Record Stores
These systems, also known as wide-column stores and probably motivated by
Google's success with BigTable, store extensible records. Extensible records are
hybrids between tuples, the simple rows of relational tables with predefined attribute
names, and documents, which have attribute names defined on a per-record basis.
Indeed, extensible record stores have families of attributes defined in a global schema,
but inside these families new attributes can be defined at run-time.
The extensible record store data model relies on rows and columns that can be
partitioned vertically and horizontally across nodes to ensure scalability. Rows are
split across nodes based on the primary key; usually they are grouped by key range
rather than randomly. Columns of a table are distributed across nodes based on
user-defined "column groups", regrouping attributes that are usually best stored
together on the same node. For instance, all attributes of an employee concerning his
address (address, city, country) will be placed in one column group, and all the
attributes concerning the means of contacting him (email, phone number, fax number)
will be stored in another column group.
Like document stores, extensible record stores are useful for storing different kinds
of objects and for making queries on shared attributes. Moreover, they can provide
higher throughput, at the cost of a bit more complexity for the programmer when
defining the column groups. Notable examples are HBase, Cassandra and HyperTable.
Figure 2.3: Data organisation in an extensible record store.
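The employee example above can be sketched with nested records. The group and field names are illustrative, not any particular store's schema:

```python
# Extensible records: the attribute families ("column groups") are fixed
# in a global schema, but the attributes inside a family are created at
# run-time, per record.
schema = {"employee": ["address_group", "contact_group"]}  # global schema

record = {
    "address_group": {"address": "1 Main St", "city": "Rome", "country": "IT"},
    "contact_group": {"email": "a@example.eu", "phone": "+32 2 000 0000"},
}

# A new attribute can appear inside an existing family without any
# schema change -- this is the "extensible" part:
record["contact_group"]["fax"] = "+32 2 000 0001"

# A store places each column group of a row on the node serving it, so
# the address fields are always fetched together in one access.
```

Vertical partitioning then means each family can live on a different node, while horizontal partitioning splits rows by primary-key range.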
2.1.4 Relational Databases
These systems store, index, and query tuples via the well-known SQL interface.
They offer less flexibility than document stores and extensible record stores because
tuples are fixed by a global schema defined during the design of the database.
Moreover, the classical relational database model is not well suited to scalability [29].
Several solutions to scale the database have been proposed [13], but they all suffer
from disadvantages. A classical solution is to use a master/slave approach to
distribute the work: the slaves handle the reads and the master server is responsible
for the writes. The first drawback is eventual consistency: each slave has its own copy
of the data, and even if we normally have near real-time replication, we do not have
the strong consistency that is sometimes needed. The second immediate drawback is
that the master server quickly becomes a bottleneck when the number of writes
increases.
Cluster computing solutions improve on this by using the same data for several
nodes, with only one node responsible for writing. They thus provide strong
consistency, but the bottleneck problem remains.
Finally, the shared-nothing architecture, introduced by Google [10], should scale to
an arbitrary number of nodes because each node shares nothing at all with the other
nodes. In this approach, each node is responsible for a different part of the database
and has its own memory, disk and CPU. To divide the database, an operation sometimes
called sharding, we split the tables into several non-overlapping tables and dispatch
these tables to different shards, which thus share nothing, so that the load
is divided between them. Usually the tables are cut horizontally. This
means that different rows are assigned to different shards according to a partition
criterion [39] based on the value of a primary key. The partition criterion can be
range partitioning (the shard is responsible for a range of keys), list partitioning
(the shard is responsible for a given list of keys) or hash partitioning (the hash
of the key determines the shard responsible for the key).
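To make the three partition criteria concrete, here is a small, hypothetical sketch (not tied to any particular database; the function names, shard counts and key formats are our own illustrative assumptions) of how a row's primary key could be mapped to a shard under each scheme:

```python
import hashlib

def hash_partition(key: str, n_shards: int) -> int:
    """Hash partitioning: the hash of the key determines the shard."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

def range_partition(key: int, boundaries: list[int]) -> int:
    """Range partitioning: shard i holds keys below boundaries[i]."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # last shard takes all remaining keys

def list_partition(key: str, shard_lists: dict[int, set]) -> int:
    """List partitioning: each shard owns an explicit list of keys."""
    for shard, keys in shard_lists.items():
        if key in keys:
            return shard
    raise KeyError(key)
```

Note that `hash_partition` uses a stable cryptographic hash so that every node computing the mapping agrees on it across runs.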
To achieve redundancy, each shard is replicated; in MySQL Cluster [20], for example,
each shard is replicated twice. But to implement this solution correctly, several
challenges have to be solved. In particular, how do we partition the data into multiple
non-overlapping shards with the load fairly divided between them? The answer to
this question is closely related to the application area. The splitting is natural if, for
example, the table to split contains data about American and European customers, but
in most cases it can be quite tricky.
Figure 2.4: In a relational database data can be subdivided and accessed via fixed fields.
These relational databases have demonstrated improved horizontal scalability, provided
that the operations do not span many nodes. While they are not as scalable as the
previously mentioned datastores, they might become so in the near future. The appeal
of relational databases is obvious: they have a well-established user base and community
support, which means that many tools already exist and are ready to be used with them.
Furthermore, they offer ACID properties, which generally makes life easier for the
programmer. Notable examples are MySQL Cluster, VoltDB, Clustrix, ScaleDB, ScaleBase and
NimbusDB.
2.2 Peer-to-peer systems
We also decided to take a close look at peer-to-peer (P2P) systems. They are an
interesting alternative to classical client/server systems because they allow a more
efficient use of resources such as bandwidth, CPU and memory. This is because every peer
is equivalent in the application and has a dual client/server role, and can therefore
serve content like a classical server, sharing the load between the members of the
network. Moreover, because of this dual role, the availability of the content increases
with the network size, which favours the scalability of the system, a property we are
very much interested in. P2P systems also have the crucial property that they have no
central point of failure, nor a central point of coordination, which often becomes a
bottleneck when the system needs to grow. These properties are extremely important
in distributed computing because they increase the robustness of the system as well as
its scalability.
There are three main categories of P2P systems [31], which vary according to their
topologies and their look-up performance. The first and oldest relies on a central index
maintaining a mapping between file references and the peers holding the files. This
index is managed by central servers that provide the look-up service. This contradicts
what we just said about peer equivalence and implies that this generation is not a true
peer-to-peer system. A peer wanting to access some file must first connect to this
server to find the peers responsible for the data, and can then connect directly to a
peer holding it. This is shown in Figure 2.5. This is the solution developed by
Napster, the famous file-sharing system.
Figure 2.5: P2P system relying on a central index to look up files: A) Searching Node(0) asks the Central Server (CS) where it can find a given file. B) The CS gives theaddress of node 3 to node 0. C) Node 0 retrieves the file directly from node 3.
The second category does not rely on any server to perform queries and has an
unstructured topology: the connections between the peers in the network are established
arbitrarily. In this category of P2P systems, there is no relation between a node and
the data for which it is responsible. It follows that the look-up mechanism must be a
flooding-like mechanism. In Gnutella, the flooding algorithm has a limited scope in
order to limit the number of messages exchanged. It can therefore happen that a value
present in the network is not found: a query may fail to reach the peer holding the
value because the flooding diameter was too small. This is illustrated in Figure 2.6.
Figure 2.6: P2P system using flooding to look up files: A) Searching Node (0) floods thenetwork with a request for a file B) A query reaches node 2 which hosts a correspondingfile and responds directly to 0. Note that if a query has a time to live of 2 and if nodes1, 2 and 3 host a corresponding file, only nodes 1 and 2 will respond to 0 as 3 is toofar away from 0.
In order to guarantee look-up consistency, the flooding diameter must be N, with N
being the number of peers in the network, but this does not scale to large systems.
To resolve this problem, the third generation of P2P systems moved from an unstructured
to a structured topology, drastically improving look-up performance. Distributed hash
tables (DHTs) are the most frequent abstraction used by P2P systems with a structured
topology. We take a closer look at them in the next section.
2.3 DHT
DHTs were designed to solve the look-up problem present in many P2P systems [3].
They provide the same operations to store and retrieve key/value pairs as a classical
hash table. A key is what identifies a value; the value is the data you want to
associate with this key. As an example, consider a movie named “Why DHTs are fun.avi”:
the key would logically be the title of the film and the value the file itself.
Each peer in a DHT system can handle key look-ups and key/value pair storing
requests, avoiding the bottleneck of central servers. Another problem addressed by
those systems is the partitioning of the responsibility for key/value pairs between the
peers. Each key/value pair and each peer has an identifier. The identifier domain can
be anything; taking the example of a Chord-like DHT, the identifier is an integer
between 0 and N, where N is a chosen parameter. Those identifiers are used to determine
which peers are responsible for which key/value pairs. Each peer is responsible for an
interval, computed from its identifier and those of the other peers in the network.
Taking the example of Chord again, a peer is responsible for all the identifiers between
its own identifier and the identifier of the next peer in the network (the latter
excluded). A peer stores all the key/value pairs with an identifier in its interval.
The identifiers are most often computed using a consistent hash function. Assuming
each peer has an associated IP address, its identifier is computed by applying this
function to its IP address. Some systems allow a peer to choose its identifier. The
identifier of a key/value pair is computed by taking the hash of its key. The use of a
consistent hash function to compute identifiers allows a roughly fair division of the
key space between peers, which is a crucial point for scalability. Moreover, this kind
of hash function has the advantage that adding a peer to the system does not cause many
identifiers to be remapped to other peers, which improves the elasticity of the system.
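The mechanism can be shown with a toy sketch (the ring size, addresses and key names are illustrative assumptions of ours): each peer owns the interval starting at its own identifier, so the peer responsible for a key is the one with the greatest identifier not exceeding the key's identifier, wrapping around the ring.

```python
import hashlib
from bisect import bisect_right

RING = 2 ** 16  # illustrative identifier space of size 2**16

def ident(name: str) -> int:
    """Consistent hash of a peer address or a key into the ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def responsible(peer_ids: list[int], key_id: int) -> int:
    """Peer whose interval [own id, next peer's id) contains key_id."""
    ring = sorted(peer_ids)
    i = bisect_right(ring, key_id) - 1
    return ring[i]  # i == -1 wraps around to the last peer

peers = [ident(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
owner = responsible(peers, ident("alice/tweet/1"))
# When a fourth peer joins, only the keys between its predecessor and
# itself change owner; every other key stays where it was.
```

The final comment is precisely the elasticity property mentioned above: a join or leave only remaps the keys of one interval.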
DHTs, as said in the point on peer-to-peer systems, are the third generation of
peer-to-peer systems. Compared to the previous generations, they mainly solve the
scalability problems of the look-up mechanism. Indeed, we now have a relation between
the key of a value and a peer, which permits better look-up performance by routing the
look-up request to the responsible peer instead of flooding the network, which was not
scalable.
2.4 Study of scalable key/value stores properties
Bwitter is built on top of a key/value datastore. Key/value datastores are systems
that implement a DHT and offer other services on top of it. We now compare possible
design choices when implementing systems offering a DHT abstraction. The comparison is
based on the following criteria: consistency model, replication strategy, storage
abstraction, network topology, churn, transactional support and finally security.
2.4.1 Network topology
The network topology refers to how peers are organized in the network; there may be
important differences between the various DHT implementations. It is also a crucial
design point because it deeply influences the performance of the look-up mechanism as
well as the fault tolerance of the network. We will take a look at some important
network topologies.
In Chord-like topologies [30], nodes are organized in a ring (see Figure 2.7) and
keep a list of successors and predecessors as well as a routing table, which is filled
with fingers chosen according to various policies. We call a finger a reference to
another peer in the system, usually the IP address of that peer. The size of the
routing table varies among systems: Chord keeps log2(N) fingers, where N is the number
of nodes in the system, while DKS, a generalization of Chord, keeps logk(N) fingers,
where k is a predefined constant. This is a trade-off between better look-up
performance and bigger routing tables. We summarize the most common choices in
Table 2.1. Each Chord node also keeps log2(N) successors in its successor list in
order to recover from node failures. This topology is widespread because it allows for
efficient routing as well as easy self-organization upon joins, leaves and failures.
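As an illustration of the finger policy just described, the following sketch (8-bit identifiers and a hand-picked peer list, both assumptions of ours) computes the finger targets of a Chord node and resolves each target to its successor on the ring:

```python
# Chord-style fingers: finger i of node n points to the first peer whose
# identifier is at or after (n + 2**i) mod 2**M, giving log-hop routing.
M = 8  # identifiers live in [0, 2**8)

def finger_targets(n: int) -> list[int]:
    return [(n + 2 ** i) % 2 ** M for i in range(M)]

def successor(peer_ids: list[int], target: int) -> int:
    """First peer at or after target, wrapping around the ring."""
    ring = sorted(peer_ids)
    for p in ring:
        if p >= target:
            return p
    return ring[0]

peers = [5, 40, 90, 160, 220]
fingers = [successor(peers, t) for t in finger_targets(5)]
# Node 5's fingers skip exponentially far around the ring.
```

Because the targets double in distance, a look-up can always halve the remaining distance to the responsible peer, which is where the log2(N)-hop bound comes from.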
The Beernet topology is similar to Chord but differs on one crucial point. In Chord,
nodes must be connected to their direct predecessor; in Beernet they only need to know
the key of their predecessor, creating a branch when a node cannot reach its direct
predecessor. This property is the reason why the topology of Beernet is called a
relaxed ring (see Figure 2.7): when a node does not have a link toward its predecessor,
the ring is not perfect. This topology is more resilient because it makes fewer
assumptions while preserving consistent look-up. You can find more information about
the Beernet topology in [19].
Scalaris currently relies on a Chord topology too. The Scalaris team is working on
another Chord-like topology called Chord#, which is very much like classic Chord except
that it stores keys in lexicographical order. Furthermore, the routing is done not in
the key space but in the node space. This enables range queries and allows the
application to choose where to place the data in the ring [32].
Number of fingers    Look-up performance
O(1)                 O(N)
O(log(N))            O(log(N)/log(log(N)))
O(log(N))            O(log(N)) (more common)
O(√N)                O(1)

Table 2.1: Number of fingers versus look-up performance for N nodes in the network.
Chord, like Beernet, does not take advantage of the underlying physical topology.
Pastry [28], Tapestry, and Kademlia [15] also assume a circular key space but try to
tackle this problem by keeping a list of nodes that they can reach with low latency.
They choose their fingers giving preference to nodes in that list.
We finally detail the topology of CAN [24] because it differs significantly from the
topology of the other DHTs. Nodes are organized so that they divide a virtual
d-dimensional Cartesian coordinate space, each node being responsible for a part of
this space. To join the network, a node, which we call A, chooses a random point in
the space. It then contacts the node responsible for this point, called B. Finally, B
splits its zone in two, giving A one half of the zone it was responsible for. Nodes
only maintain routes towards their immediate neighbours. In CAN, two nodes are
neighbours if their zones touch along d − 1 dimensions. To picture this, imagine a
square (2 dimensions) divided into rectangular chunks, which correspond to zones: two
nodes are neighbours if the rectangles they are responsible for have an edge in common.
According to the results in [24], for a d-dimensional space partitioned into n zones,
the average routing path length is (d/4)(n^(1/d)) hops and thus grows as O(n^(1/d)).
Observe that the average path length decreases as the number of dimensions increases,
but this comes at the cost of a higher space complexity for maintaining routing tables.
Moreover, each join and leave becomes more costly as the number of dimensions
increases: the number of neighbours of a node increases, and with it the complexity of
maintaining routing table consistency. As for Chord, the overlay topology is not
linked to the physical topology of the nodes. You can see an example of how nodes are
organized in CAN for d = 2 in Figure 2.7, where each rectangle represents a zone
controlled by a node.
Figure 2.7: From left to right, the ring overlay (CHORD) , the relaxed ring overlay(Beernet) and a 2-dimension CAN overlay.
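The CAN neighbour relation can be made concrete with a small sketch of our own (zones modelled as axis-aligned boxes): two zones are neighbours when their intervals overlap in d − 1 dimensions and merely touch in the remaining one.

```python
def are_neighbours(a, b):
    """a, b: one (low, high) interval per dimension of each zone."""
    overlaps = sum(1 for (al, ah), (bl, bh) in zip(a, b)
                   if al < bh and bl < ah)          # strict overlap
    abuts = sum(1 for (al, ah), (bl, bh) in zip(a, b)
                if ah == bl or bh == al)            # touching faces
    return overlaps == len(a) - 1 and abuts == 1

# Two zones of a 2-d space sharing a vertical edge are neighbours:
left = [(0.0, 0.5), (0.0, 1.0)]
right = [(0.5, 1.0), (0.0, 1.0)]
# Two zones touching only at a corner are not:
low_left = [(0.0, 0.5), (0.0, 0.5)]
up_right = [(0.5, 1.0), (0.5, 1.0)]
```

The corner case shows why the "d − 1 overlapping dimensions" condition matters: zones meeting only at a point share no face, so no route is kept between them.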
2.4.2 Storage abstraction
As mentioned before, key/value stores support all the operations provided by classical
hash tables on key/value pairs, namely look-up, store and delete. To be clear, a key
is uniquely associated with a value: storing another value under the same key erases
any previously stored value. Beernet, Redis [25] and OpenDHT [26] also allow working
with key/set pairs, where each key is associated with a set that can contain multiple
values; a look-up on such a key returns all the values in the set. OpenDHT only works
with key/set pairs, which leads to more complex algorithms for the applications using
it.
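The difference between the two abstractions can be shown with a toy in-memory model (our own illustration, not the API of any of the systems named above):

```python
from collections import defaultdict

class KeyValueStore:
    """Key/value abstraction: a put overwrites the previous value."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value      # silently erases the old value
    def get(self, key):
        return self._data.get(key)

class KeySetStore:
    """Key/set abstraction (as in OpenDHT): a put adds to the set."""
    def __init__(self):
        self._data = defaultdict(set)
    def put(self, key, value):
        self._data[key].add(value)   # values accumulate
    def get(self, key):
        return set(self._data[key])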
2.4.3 Replication strategy and consistency model
In order to provide redundancy, these systems often provide replication services.
These vary according to the guarantees they offer: improved reliability of the system
and/or availability. Replication is done by storing a value at k different nodes
instead of only one; k is called the replication factor.
Beernet and Scalaris offer symmetric replication with strong consistency using a
transactional layer built on top of their DHT implementation. Strong consistency
means that read operations always return the latest correctly written value; this is
achieved by always writing to and reading from a majority of the replica set. In
symmetric replication [12], each node identifier is associated with a set of (k − 1)
other node identifiers, which we call the replica set. When using replication, a
key/value pair is stored at the node responsible for the identifier of the key and at
all the nodes responsible for an identifier inside the replica set. Nodes maintain
routes toward the nodes with symmetric identifiers so that they can directly contact
any of the replicas of the key/value pairs they are responsible for. Strong
consistency between replicas does not come for free: each time a value is accessed, a
majority of the replicas must be contacted. In such a scheme, it is thus not possible
to increase the availability of the content through replication. We address this
problem in section 3.2.3. Beernet currently does not handle the restoration of the
replication factor when a node fails abruptly.
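The replica set of symmetric replication [12] can be computed directly. A minimal sketch, assuming an identifier space of size N with the replication factor k dividing N (the concrete numbers below are illustrative): the replicas of identifier x sit at the k points symmetric around the ring.

```python
def replica_set(x: int, n: int, k: int) -> list[int]:
    """Symmetric replication: replicas of identifier x in a ring of
    size n with replication factor k (k must divide n)."""
    step = n // k
    return [(x + i * step) % n for i in range(k)]

# With n = 16 and k = 4, key identifier 3 is replicated at 3, 7, 11, 15.
# The scheme is symmetric: every member of a replica set computes the
# same set, so any surviving replica can restore the replication factor.
```

This invertibility is exactly what the hash-based schemes discussed next lack.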
CAN does not have consistency problems because it works with immutable content,
meaning that values cannot be updated. This is a clear limitation when implementing
a social network where updates are frequent. CAN proposes replication through what it
calls realities. A node, when joining the network, joins r coordinate spaces and is in
charge of a different zone in each of them; each coordinate space is called a reality.
When a key/value pair is added, it is added in all the realities. Because the nodes
are in charge of different zones in different realities, different nodes are in charge
of the newly added pair. To create these realities, a different hash function is
applied to map the node to different coordinates in each reality. This strategy, like
every strategy relying on different hash functions, has two major drawbacks compared
to symmetric replication [12]. First, the inverse of the hash function is not
computable, so it is not possible to recover the original key before hashing, while
this is needed to fetch the value from the remaining replicas. Moreover, because of
the distribution properties of hash functions, and even if the inverse could be found,
the other replicas would be spread all over the remaining nodes. This forces the node
in charge of restoring the replication factor to contact a multitude of nodes. In
conclusion, because we cannot invert the hash function, the replication degree of
pairs decreases at each node failure.
Pastry uses a different approach based on leaf sets, which is close to the successor
set approach. Like CAN, Pastry assumes that values are immutable, so there is no
consistency problem between the replicas, but this comes at the cost of not being able
to update values. Pastry stores the replicas at the nodes whose identifiers are
closest to the value's key: if the replication factor is k, there are k/2 replicas
before and k/2 after the key. In the successor set approach, all the replicas are
stored at the k successors of the key. Both strategies make it possible to maintain
the replication factor because the other replicas can be found, contrary to CAN's
strategy. But the algorithms to maintain the replication factor are expensive compared
to the cheap symmetric replication strategy [12].
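As a toy illustration of leaf-set placement (ignoring ring wrap-around for simplicity, with made-up identifiers), the k replicas of a key go to the k nodes numerically closest to it:

```python
def leaf_set_replicas(node_ids: list[int], key: int, k: int) -> list[int]:
    """The k nodes whose identifiers are numerically closest to the key,
    nearest first (roughly k/2 below the key and k/2 above it)."""
    return sorted(node_ids, key=lambda n: abs(n - key))[:k]

# With k = 4, the replicas of key 31 land on the two nodes just below it
# and the two nodes just above it:
# leaf_set_replicas([10, 20, 30, 40, 50], 31, 4) -> [30, 40, 20, 50]
```

Because the replicas cluster around the key, a surviving neighbour can locate them all, which is what makes restoring the replication factor possible here, unlike in CAN.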
2.4.4 Transactions
While not many key/value datastores offer transactions, they are a crucial feature.
“A transaction is a group of operations that have the following properties: atomic,
consistent, isolated, and durable (ACID)” 1. A transaction can have two outcomes:
abort or commit. When a transaction commits, we can be sure that all the operations
inside the transaction have been performed successfully. On the other hand, if a
transaction aborts, we know none of the operations have been performed. We know of
only two key/value datastores that implement transactions: Beernet and Scalaris.
Transactions are usually achieved using a two-phase commit (2PC) algorithm. The two
phases are the validation phase and the write phase. Both phases are supervised by a
Transaction Manager (TM), while all the nodes responsible for the involved items
become Transaction Participants (TPs). During the validation phase, the TM tries to
lock the involved resources on every TP. If the TM receives an abort message, the
operation is aborted. Otherwise, the TM sends a commit message to all the TPs, making
the update permanent and releasing the locks.
Figure 2.8: Two-Phase Commit protocol (left) reaching termination and (right) notreaching termination, image taken from [19].
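The two phases can be sketched in a few lines. This is a deliberately simplified, failure-free model of our own (not Beernet's or Scalaris's code); real systems must also survive TM and TP crashes, which is exactly the problem discussed next.

```python
class Participant:
    """A TP holding one item involved in the transaction."""
    def __init__(self):
        self.locked = False
        self.committed = False
    def prepare(self) -> bool:       # validation phase: vote
        if self.locked:
            return False             # item already locked: vote abort
        self.locked = True
        return True
    def commit(self):                # write phase
        self.committed = True
        self.locked = False
    def abort(self):
        self.locked = False

def two_phase_commit(tps: list) -> str:
    """One abort vote aborts everything; otherwise all TPs commit."""
    if all(tp.prepare() for tp in tps):
        for tp in tps:
            tp.commit()
        return "commit"
    for tp in tps:
        tp.abort()
    return "abort"
```

Note that if the coordinator running `two_phase_commit` died between the two phases, every prepared participant would stay locked forever, which is the failure scenario of Figure 2.8.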
A serious problem arises if the TM fails during this operation: the locks would never
be released, as you can see in Figure 2.8. This is why some systems, such as Beernet
and Scalaris, decided to add replicated Transaction Managers (RTMs) that can take over
in case the TM fails. This transaction algorithm is based on the Paxos consensus
algorithm.
Beernet adds a phase to the 2PC algorithm before registering the locks. In the first
phase, the client, who is the original TM, performs its read and write operations
without taking any locks. In a second phase, and before committing the transaction, it
registers with a set of replicated transaction managers that can, as said before, take
over the transaction if the main TM fails. It then runs the prepare phase of the 2PC,
sending a message to all the TPs in order to take the locks on the required items. The
TPs send their votes to each of the RTMs, which then send their results to the main
TM. The TM can decide to commit or abort the transaction once a majority of the RTMs
have voted the same way. When the TM has taken its decision, it sends a final message
to the TPs so that they can release the locks.
1 MSDN, What is a transaction? http://msdn.microsoft.com/en-us/library/aa366402(VS.85).aspx, last accessed 13/08/2011.
This algorithm is said to be eager because modifications are done optimistically,
before any locks are requested, in the read phase. It makes the assumption that the
majority of the TPs and the TM survive the transaction. You can find more details
about this algorithm in Jim Gray and Leslie Lamport's article “Consensus on
Transaction Commit” [14].
2.4.5 Churn
We define the churn as John Buford, Heather Yu and Eng K. did it in “P2P Net-
working and Applications” [5]. The churn is ’the arrival and departure of peers to and
from the overlay, which changes the peer population of the overlay’. This is important
for DHTs who want to have good elastic properties to handle high rate of churn.
Let us take a look at how a classical Chord-like DHT handles joining nodes. As in
any peer-to-peer network, a joining node needs to know how to contact a node already
in the network. It first contacts this node, which routes it toward the node
responsible for inserting it in the ring. This last node is the successor of the
joining node in the ring, i.e. the node whose identifier follows the identifier of the
joining node. There are then two steps to perform to enter the ring: contact the
successor to warn it that its predecessor has changed, and contact the predecessor to
warn it that the joining node is its new successor. This is not robust, as a failure
of one of the nodes, or a networking problem that prevents the new node from reaching
its predecessor, can create a broken ring. Beernet solves this problem with its
relaxed ring, as explained when discussing the network topology in section 2.4.1. It
adds a phase to this protocol during which the joining node signals to the successor
that it has correctly contacted the predecessor. The successor can then remove its
pointer to the old predecessor, as you can see in Figure 2.9. This algorithm therefore
maintains look-up consistency and tolerates network failures. After joining the ring,
the new node has to retrieve the key/value pairs it is responsible for. It can do so
by contacting its successor, which was in charge of those values before.
When a node wants to leave the ring, the opposite operations are performed. If it is
a gentle leave, the node sends its data to the nodes now responsible for the values it
hosted, and it tells its neighbours to update their pointers. However, if it is an
abrupt leave, the other nodes have to detect the absence of the node and execute more
complex algorithms to find the remaining nodes responsible for the data the missing
node hosted. This operation varies a lot according to the replication strategy, as
explained in the point on replication strategies. In any case, it is a heavy and
complex operation that should be avoided, if possible, by leaving the network gently.
Figure 2.9: The join algorithm: A) Q contacts the successor R. B) R accepts the insertion and replies with P's address; R now considers Q as predecessor but keeps P in its predecessor list; Q contacts the predecessor P. C) Q tells P he is the new successor and P accepts it. D) Q tells R the insertion was successful and R drops P from its predecessor list. Image taken from [19].
It is thus clear that, while those mechanisms ensure the survivability of the system
in an environment where nodes can fail or disconnect abruptly, performance will be
better if the nodes leave gently.
2.4.6 Security
There are numerous known attacks against DHT-based systems [34]. Many DHTs are able
to work under the assumption that the number of malicious nodes stays lower than a
certain fraction f of the total number of nodes. In a Sybil attack, a malicious user
inserts many malicious nodes into the system in order to exceed that limit. Once the
attacker has enough malicious nodes in the system, it can easily interfere with the
routing and replication algorithms. In an Eclipse attack, a malicious node can
“eclipse” a correct node by manipulating the neighbours that point to that node so
that they skip it, meaning no one can access it anymore. These attacks can lead to
routing and storage disruption if malicious nodes work together to deny requests or to
return different values than the ones expected.
Ensuring security in such systems when they run in open environments such as the
Internet is thus a tough challenge. Note that these attacks are only possible if the
DHT accepts nodes from untrusted users. While most of the DHT-based systems we know
do not currently provide such a security level, we have good reasons to believe these
issues are being worked on. Still, we need to keep them in mind when designing our
architecture.
If a malicious user has access to the datastore, he can also try to delete, edit, or
forge data, causing damage to the application using that data. Such attacks are
generally avoided by using capability-based security. The idea is that if the attacker
does not know where to look, he will not be able to find the data, as it is stored at
unguessable keys.
OpenDHT goes even further and offers a secret mechanism, allowing users to associate
a secret with a given value. Anyone who wants to delete that value has to provide that
secret. Note that in OpenDHT you cannot replace one value by another, as multiple
values can be stored at a given key: doing a put on a key only adds the value to the
set.
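Both ideas, unguessable keys and delete secrets, can be illustrated with a toy in-memory store. The function names are ours and this is not OpenDHT's actual interface, just a sketch of the principle:

```python
import hashlib
import secrets

def new_capability() -> str:
    """A random, unguessable key: knowing it IS the access right."""
    return secrets.token_hex(20)        # 160 unguessable bits

def store(dht: dict, capability: str, value: str, secret: str):
    """Store the value with a digest of the deletion secret."""
    dht[capability] = (value, hashlib.sha256(secret.encode()).hexdigest())

def delete(dht: dict, capability: str, secret: str) -> bool:
    """Delete only succeeds when the right secret is presented."""
    value, digest = dht.get(capability, (None, None))
    if digest == hashlib.sha256(secret.encode()).hexdigest():
        del dht[capability]
        return True
    return False                        # wrong secret: refused
```

Storing only a digest of the secret means that even a node hosting the pair cannot forge a deletion request for it.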
2.5 The Cloud
Bwitter is intended to run on the cloud in order to take advantage of its scalable
and elastic nature. Everyone has heard about the cloud, but many different definitions
exist. We therefore state here the definition of the cloud that we are going to use
throughout this work, namely the National Institute of Standards and Technology (NIST)
definition of cloud computing [22]:
“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network
access to a shared pool of configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction. This cloud model promotes
availability and is composed of five essential characteristics, three service models,
and four deployment models.”
The five essential characteristics mentioned are on-demand self-service, broad
network access, resource pooling, rapid elasticity and measured service. On-demand
self-service means that users can adjust the amount of resources whenever they need
to, without having to go through a service provider's employee. Broad network access
means that the resources can be accessed through a broad range of mechanisms and
devices. Resource pooling means that the provider's resources can be assigned and
reassigned to different clients dynamically in order to meet the clients' requirements
in the most effective way. Rapid elasticity means that resources can be allocated and
removed transparently in order to match the amount of resources the client needs.
Measured service means that the resources provided are monitored and can be recorded
transparently.
The three service models are Cloud Software, Cloud Platform and Cloud Infrastructure
as a Service (SaaS, PaaS, IaaS). In the SaaS case, the client has access to a software
application running on the cloud, but has no access to the underlying cloud
infrastructure; the application can usually be accessed via a web browser. In the PaaS
case, the client is able to deploy applications on the cloud infrastructure and manage
them, but does not manage the infrastructure itself. Finally, in the IaaS case, the
client can manage the basic resources such as network, processing power and storage,
and can furthermore deploy and manage applications. These models are compared in
Figure 2.10.
Figure 2.10: The three service models compared to a classic model, image taken from [8].
The four deployment models are private, community, public and hybrid cloud. A
private cloud is owned by an organisation and used only by it, unlike a community
cloud, which is shared between a few selected organisations. These solutions may
provide better privacy than a public cloud, maintained by an organisation that sells
its services to end users or other organisations. A hybrid cloud is a combination of
at least two clouds that remain distinct entities but are bound together in order to
allow data and application portability.
2.6 Conclusion
In this chapter we have explored the different types of scalable datastores. We
studied DHTs in more depth, particularly those offering transactions. The
technological advancements in those fields make it possible to build an efficient and
robust implementation of a Twitter-like system on top of a peer-to-peer system, taking
advantage of its assumed scalability and elasticity properties. In the next chapter we
describe two possible architectures for Bwitter.
Chapter 3
The Architecture
In this section we present the architecture of our application. The platform on which
an application is based can have an important impact on its architecture. We thus
explore the repercussions of having an application running either on a peer-to-peer
network based on the users' machines, or on a stable cloud-based platform. The two
solutions lead to two radically different architectures in terms of performance and
accessibility of the different layers, as well as in terms of security concerns.
But before developing the architecture, we take a look at the functional and
non-functional requirements of our application.
3.1 The requirements
Bwitter is designed to be a secure social network based on Twitter, and while it
looks relatively simple at first sight, it hides some complex functionality. We
included almost all of Twitter's functionalities in Bwitter and decided to add some
others. We describe the relevant functionalities that will help us analyse the design
of the system, highlight the differences between a centralised and a decentralised
architecture, study the feasibility of overcoming the problems described above and
test the system's behaviour when faced with heavy traffic and flash crowds.
3.1.1 Non-Functional requirements
Product requirements
• Scalability:
We are facing a system that is continuously growing in terms of users [4] but
also in terms of traffic [33]. It is thus crucial that our system's performance
increases almost linearly with the number of machines we allocate to it; this is
known as horizontal scalability. We are not interested here in vertical scalability,
i.e. adding or removing resources (CPU, RAM, disk) from an individual machine, as it
is harder to achieve dynamically and usually more costly than horizontal scaling.
• Elasticity:
As we explained, the load that social network applications must handle varies in
real time for social reasons. They must sometimes face high peaks of demand for short
periods, but do not need the corresponding amount of resources the rest of the time.
A fixed number of nodes is therefore inefficient: to handle peaks of load, you would
have to over-provision the number of nodes in your data center. This is why our
system needs to be able to scale up when demand is high and to scale down easily when
the peak is over, to avoid wasting resources.
• Fault tolerance, availability and integrity:
The system has to be fault tolerant: even if some machines in the system fail, the
whole system must still be able to function. The integrity of the data and the
availability of the service also have to be ensured, as they are major requirements
of every social network.
• Security:
Bwitter must ensure authenticity, integrity and confidentiality of the data posted by users over the whole system. No malicious user should be able to forge, edit or delete data in the system. Finally, Bwitter must forbid access to confidential data such as passwords. These requirements must hold even with Bwitter's code released as open source.
• Lightness of the application:
The end user should only need a fast and light interface performing little calculation. The goal is to be as portable as possible, so that smartphones and other devices with less computing power can also use our application. This implies that the heavy calculations should be done on the server side.
• Performance:
We need good performance for many small reads and writes. Indeed, small values are frequently read, written and updated in social network applications.
Organizational requirements
• Modularity:
Our project should be built from distinct modules, and it should be possible to easily replace one layer with another based on clearly defined interfaces. For instance, the graphical user interface (GUI) module could be desktop based or web based and the main application should not see any difference.
• Open source:
We want our project to be released in the wild with its source code available for
anyone wanting to experiment with it. This also means that the libraries we use
in the development of our system should be open source.
• Use existing technologies:
We do not want to re-invent everything on our own so we decided to use already
developed open source tools during our development.
3.1.2 Functional requirements
Nomenclature
There are only a few core concepts on which our application is based:
• A tweet is basically a short message with additional meta information. It contains a message of up to 140 characters, the author's username and a timestamp of when it was posted. If the tweet is part of a discussion, it keeps a reference to the tweet it answers and also keeps references to the tweets that are replies to it.
• A user is anybody who has registered in the system. A few pieces of information
about the user are kept in the datastore, such as his complete name and the MD5
hash of his password, used for authentication.
• A line is a collection of tweets and users. The owner of the line can define which users he wants to associate with the line. The tweets posted by those users are from then on displayed in this line. This allows a user to have several lines, each with its own theme and associated users.
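To fix ideas, the three concepts above can be rendered as a small data model. The following sketch is illustrative only: the field names are our own assumptions, and in Bwitter these objects are stored as key/value pairs in the datastore rather than as in-memory classes.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tweet:
    author: str                 # author's username
    message: str                # the message, up to 140 characters
    timestamp: float            # when it was posted
    reply_to: Optional[str] = None                     # ref to the tweet it answers
    replies: List[str] = field(default_factory=list)   # refs to reply tweets

    def __post_init__(self):
        if len(self.message) > 140:
            raise ValueError("message exceeds 140 characters")

@dataclass
class User:
    username: str
    full_name: str
    password_md5: str           # MD5 hash of the password, used for authentication

    @classmethod
    def register(cls, username: str, full_name: str, password: str) -> "User":
        return cls(username, full_name,
                   hashlib.md5(password.encode("utf-8")).hexdigest())

@dataclass
class Line:
    owner: str                                          # the line's owner
    name: str                                           # custom line name
    members: List[str] = field(default_factory=list)    # associated users
    tweet_refs: List[str] = field(default_factory=list) # references, not tweets
```

Note that lines hold references to tweets, not the tweets themselves; the reason for this choice is explained in Section 4.2.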
Basic operations
Many different social networks exist today, and while they each have their own particularities, a few core operations to share, publish or discuss content are almost always present. Based on our own use of social networks and on Twitter's functionality, we identified a restricted number of operations our social network had to be capable of.
• Post a tweet:
A user can publish a message by posting a tweet. The application posts the tweet
in the lines to which the user is associated. This way all the users following him
have the tweet displayed in their line.
• Retweet a tweet:
When a user likes a tweet from another user he can decide to share it by retweeting
it. This has the effect of “sending” the retweet to all the lines to which the user
is associated. The retweet is displayed in the lines as if the original author posted
it, but with the retweeter’s name indicated.
• Reply to a tweet:
A user can decide to reply to a tweet. This adds a reference to the reply inside the initial tweet. Additionally, a reply keeps a reference to the tweet to which it responds. This makes it possible to build the whole conversation tree.
• Create a line / a list:
A user can create additional lines / lists with custom names to regroup specific
users / tweets.
• Add and remove users from a line:
A user can associate a new user to a line; from then on, all the tweets this newly added user posts will be included in the line. A user can also remove a user from a line: he will not see that user's tweets in his line anymore and will not receive his new tweets either. Note that if a user re-adds a previously removed user, the tweets posted while the removed user was still associated to the line will re-appear.
• Add and remove a tweet from a list:
A user can store a new tweet into a list to be able to retrieve it later easily. The
user can also decide later to remove this tweet from the list.
• Read tweets:
A user can read the tweets from a line in packs. The size of those packs is a parameter; for example, we can decide to retrieve the tweets in packs of 20. He can also refresh the tweets of a line or a list to retrieve the tweets that have been posted since his last refresh.
3.1.3 Conclusion
We have just presented the requirements of our application as well as its functionalities. The most important requirements are scalability, elasticity, availability and security. The next section details two different possible architectures we elaborated based on the presented requirements.
3.2 Architecture
As previously mentioned, we now present two different scalable architectures for our application. In both architectures, our application is decomposed into three loosely coupled layers, as we can see in Figure 3.1. From top to bottom: the Graphical User Interface (GUI); Bwitter, which handles the operations described in section 3.1.2; and the scalable datastore. The datastore is distributed amongst multiple nodes that we call datastore nodes. In the next chapter, we present Beernet and Scalaris, the two datastores that we have used.
Figure 3.1: Comparison of the architectures. Left: cloud-based architecture. Right: open peer-to-peer architecture.
This architecture is very modular: each layer can be changed as long as it respects the API of the layer above. We now have to decide where the datastore will run. We have two options: either let the datastore nodes run on the users' machines, or run them on the cloud, leading to two radically different architectures: the open peer-to-peer architecture and the cloud-based architecture.
In both architectures we try to achieve a secure solution, as building an insecure application would not be realistic. Indeed, if a malicious user could reveal personal information or steal someone's identity, our application would be both pointless and dangerous. We finally compare the two architectures based on the requirements we elaborated in the previous section.
3.2.1 Open peer-to-peer architecture
In a fully decentralised architecture, the user runs a datastore node and the Bwitter application on his machine. The Bwitter application makes requests directly to this local datastore node. Ideally this local datastore node should not be restricted to the Bwitter application, but should also be accessible to other applications. The problem with this approach is that the user can bypass protection mechanisms enforced at a higher level by accessing the datastore's low-level functions. Usually this is not a problem, as untrusted users do not know at which key the data is stored, so they cannot compromise it. But in our case, the data has to be at known keys so that the application can dynamically retrieve it. This means that any user understanding how our application works would be able to delete, edit or forge lines, users, tweets and references. This would be a security nightmare.
We tried to tackle this problem with the secret mechanism we designed to enrich Beernet's interface, which is presented later. But while it prevents users from editing or deleting data they did not create themselves, we could not prevent them from forging elements. To avoid this we need a way to authenticate every piece of data posted by a user.
This could be done by enforcing authentication at the datastore level, but this is a feature that is not always provided. We could also do it at the application layer. Indeed, assuming that each user has a public/private key pair, we could authenticate all the data posted using asymmetric cryptography. However, this would require a cryptographic operation for each read and write operation. It would also force users to store their private and public keys either on the datastore, or on their local machine, or a mix of both. A possible solution would be to have users store their public key in the datastore at a public location, so anyone needing the public key can retrieve it easily. The private key of a user is stored at a private location that only he can find back, for example using a key that is the hash of his password concatenated with his username. Additionally, a sealed local cache could be maintained on the user's machine containing his private key and the public keys of all the users with whom he has contacts. This cache is useful to avoid constantly reloading all the needed keys each time the user wants to use the application. Furthermore, public keys are values that seldom change. If a cryptographic problem is encountered while using a key from the cache, the key is reloaded from the datastore in order to avoid problems due to cache corruption or a public key changed by its owner.
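The private-key location scheme sketched above can be made concrete as follows. The hash function (SHA-256) and the key prefixes are our own illustrative choices; the scheme only requires that the location be derivable from the username and password alone.

```python
import hashlib

def public_key_location(username: str) -> str:
    # Public keys live at a well-known location derivable from the username,
    # so anyone needing a public key can retrieve it easily.
    return "pubkey:" + username

def private_key_location(username: str, password: str) -> str:
    # Only the owner, who knows the password, can recompute this key,
    # so the private key's storage location stays hidden from other users.
    digest = hashlib.sha256((password + username).encode("utf-8")).hexdigest()
    return "privkey:" + digest
```

Note that hiding the location alone would not suffice in an open datastore: the stored private key should additionally be encrypted, for example with a key derived from the password.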
Even with those mechanisms in place, we have to enforce security at the datastore level. Beernet uses encryption to communicate between nodes to avoid leaking confidential information. But anyone could add modified Beernet nodes behaving maliciously. Aside from the usual attacks presented in our state-of-the-art, a corrupted node could be modified to reveal all the secrets inside the requests going through it. Scalaris faces the same problem, as its code is widely available too. We thus have to make sure that the code running the datastore node is not modified, so we need a mechanism that enforces remote attestation as described in [38]. This can be done by using a Trusted Platform Module (TPM) [37], which provides cryptographic code signature in hardware, on the users' machines, in order to prove to other datastore nodes that the client's node is trustworthy. Until a datastore node has a way to tell for sure that it can trust another datastore node, we are in a dead end. This is especially true for Beernet's new secret mechanism described in section 4.4.1, as anyone stealing the secret of another user can erase any data posted by that user.
Assuming that a Twitter session is short, there could be a problem if our application is the only one running on top of our datastore. Indeed, it would result in nodes frequently joining and leaving the network with short connection times. Each of those changes in the topology of our datastore modifies the keys for which the nodes are responsible, triggering key/value pair reallocations and leading to important and undesirable churn. This would not be an ideal environment for a DHT. Furthermore, as we saw in the state-of-the-art, DHT based datastores, such as Beernet and Scalaris, are still exposed to attacks such as Sybil and Eclipse attacks if they accept malicious nodes.
In our requirements we stated that the system has to be fault tolerant and that the integrity of the data must be preserved. The integrity of the data is guaranteed thanks to replication at the datastore level. Because this environment is not stable, we need a higher replication factor than usual. The impact is twofold. First, peers are responsible for more keys, worsening the already important churn. Secondly, each transaction involves more peers, which degrades the overall performance of the system.
In conclusion, this solution has the advantage of providing free computing power that automatically grows with the number of users. But scalability, elasticity and security are compromised due to the lack of control over the machines and the difficulty of controlling direct access to the datastore by users. We now take a look at the alternative architecture based on the cloud.
3.2.2 Cloud Based architecture
With this architecture the Bwitter and datastore nodes run on a cloud platform. A Bwitter node is a machine running Bwitter but generally also a datastore node. This solution offers good elastic properties assuming we have an efficient cloud service, meaning that we can quickly obtain machines ready for use. We can thus add or remove Bwitter and datastore nodes to meet the demand, optimising our use of the machines. This solution also allows us to keep a stable DHT, as nodes are not subject to high churn as they were in the first architecture we presented. Hence, a lower replication factor is acceptable, which should boost performance. Moreover, communications between nodes should be much quicker in a cloud infrastructure than between nodes spread over the world, which in turn increases performance. Finally, all the nodes are managed by us; there are thus no Eclipse or Sybil attacks possible in this case.
Using this solution we do not have all the security issues we had with the open peer-to-peer architecture. Indeed, the users do not have direct access to the datastore nodes anymore, but have to go through a Bwitter node, which limits their possible actions to the operations defined in section 3.1.2. Furthermore, the communication channel between the GUI and the Bwitter nodes can guarantee the authenticity of the server and the encryption of data being transmitted, for instance using HTTPS. Bwitter requires users to be authenticated to modify their data, thereby providing data integrity and authenticity. For instance, Bwitter does not permit a user to delete a tweet that he did not post, or to post a tweet using someone else's username. The malicious revelation of user secrets by a corrupted node is not relevant anymore, as the datastore is fully under our control.
The cloud based architecture is more secure, more stable and offers obvious advantages for scalability and elasticity. This is why we have finally chosen to implement this solution. We now take a closer look at how the layer stack is built.
The lowest layer, the datastore, runs on the cloud and is hidden from the outside, which means no user can access it directly; all the attacks targeting the datastore are thus avoided. Indeed, all accesses to the datastore are done via Bwitter. This layer is monitored in order to detect overload and, taking advantage of the cloud, datastore nodes are added and removed on the fly to meet the demand.
The intermediate layer, Bwitter, also runs on the cloud and communicates with the datastore nodes and the GUIs. A Bwitter node is connected to several datastore nodes. Bwitter nodes have an internal load balancer that dispatches work fairly over the datastore nodes. This load balancer is the Scalaris Connection Manager (SCM) that we present in the implementation section at 5.3. In practice, the Bwitter nodes are not accessible directly; they are accessed through a fast and transparent reverse proxy that splits the load between Bwitter nodes. We also designed a module that runs in parallel with the SCM and that we call the Node Manager (NM). It is responsible for bootstrapping the ring as well as adding nodes if needed. However, we do not have any module responsible for deciding whether a new node should be launched.
The Bwitter nodes offer a REST-like (Representational State Transfer [27]) API to the higher layer. This means, among other things, that they are completely stateless, which is important because it improves the clarity of the code and makes it easier to produce bug-free code. Being stateless means that the application does not have to keep information for each client. It can thus more easily scale with the number of clients, and requests from the same client can be dispatched to different nodes, suppressing the burden of managing sessions.
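Statelessness can be illustrated with a minimal request handler: every request carries everything needed to process it (here, credentials plus parameters), so any Bwitter node can serve any request without shared session state. The handler below is a hypothetical sketch with an in-memory stand-in for the datastore, not Bwitter's actual API.

```python
import hashlib

USERS = {"alice": hashlib.md5(b"s3cret").hexdigest()}  # username -> password hash
TWEETS = []                                            # stand-in for the datastore

def handle_post_tweet(request: dict) -> dict:
    """Stateless handler: no session is kept between calls; the request
    itself carries the credentials, so any node can process it."""
    user, password = request["user"], request["password"]
    if USERS.get(user) != hashlib.md5(password.encode()).hexdigest():
        return {"status": 401}                         # authentication failed
    TWEETS.append({"author": user, "message": request["message"]})
    return {"status": 201, "tweet_id": len(TWEETS) - 1}
```

Because the handler consults no per-client state, two successive requests from the same user can be served by two different Bwitter nodes behind the reverse proxy.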
Some values are frequently accessed in a social network, so a caching system is crucial to achieve decent performance. We thus decided to add a cache at this level in order to reduce the load on the datastore. We go into more detail about the cache in the next section. A similar cache mechanism in the decentralised architecture would not be useful. Indeed, the advantage of the cache is that it contains values that are likely to be accessed by several users; if there is only one user accessing it, the gain will probably be very small.
The top layer is the GUI. It connects to a Bwitter node using a secure connection channel that guarantees the authenticity of the Bwitter node and encrypts all the communications between them. Multiple GUI modules can, of course, connect to the same Bwitter node. The GUI layer is the only one running on the client machine.
3.2.3 The popular value problem
Describing the problem
Given the properties of our datastores, both based on DHTs, a key/value pair is mapped to f nodes, where f is the replication factor chosen according to the desired redundancy level. This implies that if a key is frequently requested, the nodes responsible for it can be overloaded while the rest of the network is mostly idle, and adding additional machines is not going to improve the situation. It is not uncommon on Twitter to have wildly popular tweets that are retweeted by thousands of users. In the worst cases, retweets can be seen as an exponential phenomenon, as all the users following the retweeter are likely to retweet it too.
The solution: use an application cache
Adding nodes does not solve the problem because the number of nodes responsible for a key/value pair does not change. In order to reduce the number of requests reaching those nodes, we decided to add a cache with a Least Recently Used (LRU) replacement strategy at the application level.
This cache keeps the last values read. With each key/value pair in the cache we keep a timestamp indicating the last time the value was read. When we face a cache miss, we evict from the cache the pair that has the oldest timestamp.
This solves the retweet problem because the application now holds the tweet in its cache from the first request to read it. The tweet stays in the cache because users frequently request it. This way we reduce the load on the nodes responsible for the tweet and automatically increase the availability of popular values.
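The eviction policy just described can be sketched as follows. The capacity is an illustrative parameter, a logical clock stands in for the read timestamps, and the notification hook corresponds to the refresh mechanism for mutable values discussed next.

```python
class LRUCache:
    """Application-level cache: on a miss with a full cache, the pair
    with the oldest read timestamp is evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.values = {}      # key -> cached value
        self.last_read = {}   # key -> logical timestamp of the last read
        self.clock = 0        # logical clock standing in for wall-clock time

    def get(self, key, load_from_datastore):
        self.clock += 1
        if key in self.values:                    # cache hit
            self.last_read[key] = self.clock
            return self.values[key]
        if len(self.values) >= self.capacity:     # miss on a full cache: evict
            oldest = min(self.last_read, key=self.last_read.get)
            del self.values[oldest]
            del self.last_read[oldest]
        value = load_from_datastore(key)          # fetch from the datastore
        self.values[key] = value
        self.last_read[key] = self.clock
        return value

    def notify_update(self, key, new_value):
        # Datastore notification: refresh our replica if we hold the pair.
        if key in self.values:
            self.values[key] = new_value
```

A popular tweet, read over and over, always carries a recent timestamp and therefore never becomes the eviction candidate.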
We have to take into account that values are not immutable: they can be deleted and modified. It is thus necessary to have a mechanism to "refresh" the values inside the cache. A naive solution would be to actively poll the datastore to detect changes to the key/value pairs stored in the cache. This would be quite inefficient, as several kinds of values, like tweets, almost never change. In order to avoid polling, we need a mechanism that warns us when a change is made to a key/value pair stored in the cache. The datastore must thus allow an application to register to a key/value pair and to receive a notification when its value is updated. Our application cache thus registers to each key/value pair that it actually holds, and when it receives a notification from the datastore indicating that a pair has been updated, it updates its corresponding replica. This mechanism has the big advantage of removing unnecessary polling requests. Notifications are asynchronous, so the replicas in the cache can have different values at a given moment, leading to an eventual consistency model for the reads. It is still possible to bypass the cache if strong consistency is needed, but this is application dependent. On the other hand, writes do not go through the cache but directly to the datastore, which keeps the writes strongly consistent inside the datastore. This is an acceptable trade-off, as we do not need strong consistency for most of the reads in Bwitter. For example, it is not a problem to see a deleted tweet in the line of a user for a small period of time.
Beernet, as described in [19], offers such a notification mechanism, making it possible to design an efficient eventually consistent cache. Scalaris however does not provide such a feature, so we needed another solution to avoid active polling. We decided to use a time to live of one minute for the values in the cache, meaning that one minute after being read for the first time, the value is removed from the cache. This way any value read from the cache is at most one minute out of date, which is not a problem.
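The time-to-live variant for Scalaris can be layered on the same idea. A minimal sketch (the one-minute TTL is the value chosen above; the `now` parameter is only there to make the expiry testable):

```python
import time

class TTLCache:
    """Cache without datastore notifications: each entry expires a fixed
    time after it was stored, bounding how stale a cached read can be."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.values = {}   # key -> (value, time the value was stored)

    def get(self, key, load_from_datastore, now=None):
        now = time.monotonic() if now is None else now
        if key in self.values:
            value, stored_at = self.values[key]
            if now - stored_at < self.ttl:
                return value          # still fresh: at most ttl out of date
            del self.values[key]      # expired: drop and reload below
        value = load_from_datastore(key)
        self.values[key] = (value, now)
        return value
```
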
3.2.4 Conclusion
We have presented two different possible architectures: the open peer-to-peer and
the cloud based architecture. We summarize in Table 3.1 the differences between the
two solutions.
Security
  Open peer-to-peer: no control on the DHT, leading to numerous security flaws.
  Cloud based: full control on the DHT, which is hidden from users; the attack surface is much smaller.
DHT control and stability
  Open peer-to-peer: high, uncontrollable and undesirable churn; connections between nodes can be really bad.
  Cloud based: much stabler environment and possible control on the number of nodes in order to scale up and down.
Costs
  Open peer-to-peer: costs are supported by users (maintenance of a DHT node).
  Cloud based: high costs but directly proportional to the resources needed.
Performance
  Open peer-to-peer: number of nodes normally proportional to the number of users, but the "quality" of the nodes is uncertain.
  Cloud based: nodes are well connected and the cloud guarantees their performance; control allows optimisation.
Cache
  Open peer-to-peer: no possible performance improvement using a cache.
  Cloud based: high potential for performance increases using a cache.
Table 3.1: Comparison between the open peer-to-peer architecture and the cloud based architecture.
We have opted for the cloud based architecture as it has numerous advantages over the open peer-to-peer one. From a performance point of view, it has better network properties, less churn and a smaller replication factor, and a cache can be added to boost performance. Moreover, the security requirements are hard to achieve in the open peer-to-peer architecture, while most security problems are solved simply by moving to the cloud architecture. The only obvious advantage of the peer-to-peer solution is that it is free. In the next chapter we take a look at the datastores we are using and how we represent our data in them.
Chapter 4
The Datastore
In this chapter we take a closer look at the datastores we are going to use: Beernet and Scalaris. From there we identify the design guidelines we followed to build the datastore schema, and then detail the schema itself. We end the chapter by discussing the problem of running several services on the same datastore, which brings us to the secret API we designed for Beernet.
4.1 The datastore choice
4.1.1 Identifying what we need
As we saw in the state of the art, there are several types of datastores: key/value stores, document stores, extensible stores and relational databases. We have only a few types of objects to store in our datastore, namely lines, lists, users and tweets. Furthermore, we do not need any complex operations like the joins and queries available in RDBMSs. We want to use a simpler data model to avoid the unnecessary burden of maintaining complex structures. Moreover, we want the most scalable and elastic solution possible, and RDBMS-like systems were shown not to be efficient in those fields.
For all those reasons we opted for key/value stores, and more precisely key/value stores with transactional capabilities. Transactions allow us to pack several operations together and execute them atomically: a transaction either executes all those operations successfully, or none of them if it aborts. This allows us to generate unique keys and maintain the integrity of our data structures.
Suppose we want to store a value Bar at a key Key; nothing guarantees that something else was not already stored at Key. We thus do two operations: operation A, a look-up on Key; operation B, if the response to operation A was "not found", storing Bar at Key. But this is not correct, because nothing guarantees that no other operation C on Key can happen between operations A and B. We must thus run operations A and B in a transaction so that no other operation C can come in between them. During our discussion of the datastore design in section 4.3, we use the transactional support to generate unique IDs using counters that we read and increment atomically.
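The check-then-write pattern and the atomic counters can be illustrated with a toy transactional store. The TxStore class below is our own stand-in: it serialises transactions with a global lock, whereas Beernet and Scalaris provide real distributed transactions with the same atomicity guarantee.

```python
import threading

class TxStore:
    """Toy key/value store whose transactions run under a global lock,
    so no concurrent operation can interleave between a read and a write."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def transaction(self, body):
        with self._lock:
            return body(self._data)

def store_if_absent(store: TxStore, key, value) -> bool:
    # Operations A (lookup) and B (conditional write) in one transaction.
    def body(data):
        if key in data:        # operation A: key already taken
            return False       # some operation C got there first
        data[key] = value      # operation B
        return True
    return store.transaction(body)

def next_unique_id(store: TxStore, counter_key) -> int:
    # Read and increment a counter atomically to generate a unique ID.
    def body(data):
        n = data.get(counter_key, 0) + 1
        data[counter_key] = n
        return n
    return store.transaction(body)
```
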
Persistence is a key requirement that we do not address in Bwitter. Unfortunately, the key/value datastores that fulfil our other requirements do not provide persistence. Scalaris is planning to add this feature, but it is still in the development phase. We could use a parallel datastore for backups, as Twitter does [36], but we do not address this problem.
The datastore must be robust in the sense that it must be capable of handling a lot of churn without failing. This is crucial in the case of the fully distributed architecture. Indeed, machines would not be under our control, and a large number of machines would constantly join and abruptly leave the system. Our datastore should be able to manage those abrupt leaves, which behave like machine failures, to ensure no data is lost. As we decided to go with the cloud based architecture, we work in an environment where the machines provided are not expected to fail abruptly. Robustness is thus still critical, but datastores can afford more complex recovery algorithms, as failures are less likely to happen. Although most of the machine leaves and joins are under control, those operations must be efficient in order to have an elastic application. Handling churn correctly means that the datastore must maintain correct routing between the peers as well as the replication factor.
4.1.2 Our two choices
There are several key/value datastores available, but only two offer transactional capabilities: Beernet and Scalaris. Both fulfil our datastore requirements but differ on some points. We now introduce these two datastores.
Beernet
Beernet [19, 23] is a transactional, scalable and elastic peer-to-peer key/value datastore built on top of a DHT. Peers in Beernet are organised in a relaxed Chord-like ring [30] and keep O(log(N)) fingers for routing, where N is the number of peers in the network. This relaxed ring is more fault tolerant than a traditional ring, and its robust join and leave algorithms for handling churn make Beernet a good candidate for building an elastic system. Any peer can perform lookup and store operations for any key in O(log(N)). The key distribution is done using a consistent hash function, roughly distributing the load among the peers. These two properties are strong advantages for system scalability compared to solutions like the client/server model.
Beernet provides transactional storage with strong consistency, using different data abstractions. Fault-tolerance is achieved through symmetric replication, which has several advantages, not detailed here, over leaf-set and successor-list replication strategies [11]. In every transaction, a dynamically chosen transaction manager (TM) guarantees that if the transaction is committed, at least a majority of the replicas of an item store the latest value of the item. A set of replicated TMs guarantees that the transaction does not rely on the survival of the TM leader. Transactions can involve several items; if the transaction is committed, all items are modified. Updates are performed using optimistic locking.
With respect to data abstractions, Beernet provides not only key/value pairs, as in Chord-like networks, but also key/value sets with non-blocking add operations, as in OpenDHT-like networks [26]. The combination of these two abstractions provides more possibilities for designing and building the datastore, as we explain in Section 4.3. Moreover, key/value sets are lock-free in Beernet, providing better performance for set operations.
Elasticity in Beernet
We previously explained that to prevent overloading, the system needs to scale up
to allocate more resources to be able to answer to an increase of user requests. Once
the load of the system gets back to normal, the system needs to scale down to release
unused resources. We briefly explain how Beernet handles elasticity in terms of data
management.
Scale up: When a node j joins the ring between peers i and k, it takes over part of the responsibility of its successor, more specifically all keys in the range ]i, j]. Therefore, data migration is needed from peer k to peer j. The migration involves not only the data associated with keys in the range ]i, j], but also the replicated items symmetrically matching the range. Other NoSQL datastores, such as HBase [1], do not trigger any data migration when new nodes are added to the system, and thus show better performance when scaling up.
Scale down: There are two ways of removing nodes from the system: by gently leaving and by failing. It is very reasonable to consider gentle leaves in cloud environments, because the system explicitly decides to reduce its size. In that case, it is assumed that the leaving peer j has enough time to migrate all its data to its successor, who becomes the new responsible for the key range ]i, j], i being j's predecessor. Scaling down due to the failure of peers is much more complicated, because the new responsible node for the missing key range needs to recover the data from the remaining replicas. The difficulty comes from the fact that the application keys are unknown, since the hash function is not bijective. Therefore, the peer needs to perform a range query, as in Scalaris [29], but based on the hash keys. Another complication is that replica sets are not based on key ranges, but on each single key.
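For illustration, a common formulation of symmetric replication, which we assume here, places the f replicas of a hashed key k at positions k + i·N/f mod N for i = 0, …, f−1, where N is the size of the identifier space:

```python
def replica_keys(key_hash: int, keyspace: int, f: int):
    """Symmetric replication: f replica positions evenly spaced
    around an identifier circle of size `keyspace`."""
    step = keyspace // f
    return [(key_hash + i * step) % keyspace for i in range(f)]
```

A useful property of this placement is that replica sets are closed: applying the formula to any replica key yields the same set of positions, which helps a peer locate the remaining replicas when recovering data after a failure.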
Scalaris
Much like Beernet, Scalaris offers a transactional, scalable and elastic peer-to-peer key/value datastore, also built on top of a DHT [29]. Scalaris is currently based on a traditional Chord ring, with a possible upgrade to Chord#. While not as fault tolerant as Beernet, Scalaris is a good candidate for building elastic systems too. Lookup and store operations have the same complexity, O(log(N)), where N is the number of peers in the network. Currently the key distribution is done using a hash function, but keys could be lexicographically ordered after the upgrade to Chord#.
As in Beernet, Scalaris provides transactional storage with strong consistency, and fault-tolerance is achieved through symmetric replication. Transactions are handled by a local transaction manager associated with the node to which the user is connected. Transactions are executed optimistically: a transaction is first executed completely on the associated node and then, if it succeeded, stored at the responsible nodes.
Besides the classical key/value pairs, Scalaris also supports key/value lists as a data abstraction. Lists, as opposed to Beernet sets, are not lock-free, and there exists no add operation on lists. In order to add an element to a list atomically, we must, in a single transaction, read the list, add the element to it, and write it back to Scalaris. Lists are thus a convenient abstraction that saves the programmer from developing his own parsing system, but they do not offer any performance improvement.
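The read-modify-write pattern for lists can be sketched as follows. `run_in_transaction` is a hypothetical wrapper standing in for the real Scalaris transaction API, here simulated with a global lock so the three steps cannot be interleaved with another writer.

```python
import threading

_lock = threading.Lock()   # stands in for Scalaris's transaction machinery

def run_in_transaction(body, store):
    # Hypothetical wrapper: executes `body` atomically against `store`.
    with _lock:
        return body(store)

def add_to_list(store: dict, list_key, element):
    """Atomically append: read the list, add the element, write it back.
    Outside a transaction, a concurrent writer could make us lose updates."""
    def body(data):
        current = list(data.get(list_key, []))   # read
        current.append(element)                  # modify
        data[list_key] = current                 # write back
        return current
    return run_in_transaction(body, store)
```
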
Conclusion
Beernet and Scalaris both fit our needs with their elasticity and scalability properties and their native data abstractions. Unfortunately, due to some unexpected problems with Beernet, we were forced to continue with Scalaris alone. This was disappointing, as we were working closely with Boris Mejıas, the developer of Beernet, to further improve his system with the richer API presented in section 4.4.1.
4.2 General Design
The design of the datastore is closely linked to our application requirements. Hence, before going straight into the design of the datastore, we take some time to explain the guidelines we elicited from the requirements to build the datastore's schema. Some choices might be unclear now, but they will be clarified when we present the algorithms in Chapter 5.
Make reads cheap
While designing the lines we had to decide whether to favour reads or writes. If
we privilege reads, we push the information to the lines and put the burden on the
writes: the “post tweet” operation adds a reference to the tweet in the line of each
follower. We call this the push approach. On the other hand, we could privilege the
writes. In that case we pull the information and build the lines each time a user wants
to read them, by fetching all the tweets posted by the users he follows and reordering
them. We call this the pull approach. As people read more than they post on social
networks, and based on the assumption that each posted tweet is read at least once,
we opted to make reads cheaper than writes and thus privileged the push approach.
However, we also study the pull approach and compare it with the push approach when
we present our algorithms in Chapter 5 and our experiments in Chapter 6.
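This trade-off can be made concrete with a back-of-the-envelope cost model. The sketch below is ours and only counts datastore operations; the workload numbers in the test are purely illustrative.

```java
// Sketch: datastore-operation cost per action under the push and pull
// approaches. The follower/followee counts are hypothetical parameters,
// not Bwitter measurements.
public class TimelineCostModel {
    // Push: posting writes one reference per follower; reading a line is one fetch.
    static int pushWriteCost(int followers) { return followers; }
    static int pushReadCost() { return 1; }

    // Pull: posting is one write; reading fetches recent tweets of every followee.
    static int pullWriteCost() { return 1; }
    static int pullReadCost(int followees) { return followees; }

    // Total operations for a workload of p posts and r line reads per user.
    static int pushTotal(int p, int r, int followers) {
        return p * pushWriteCost(followers) + r * pushReadCost();
    }
    static int pullTotal(int p, int r, int followees) {
        return p * pullWriteCost() + r * pullReadCost(followees);
    }
}
```

With 1 post and 10 line reads for a user having 5 followers and 5 followees, push costs 15 operations against 51 for pull, matching the intuition that read-heavy workloads favour the push approach.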
Do not store tweets in the lines but references
There is no need to replicate the whole tweet inside each line, as a tweet could
potentially contain a lot of information and should be easy to delete. Therefore, we
prefer to store references to tweets. To delete a tweet, the application only has to edit
the stored tweet and does not need to go through every line that could contain it.
When loading the tweet, the application can see whether it has been deleted or not.
Minimise the changes to an object
We want the objects to be as immutable as possible to enable caching. This
is why we avoid storing potentially dynamic information inside the objects and rather
keep a pointer to it. For instance, tweets are only modified when we delete them; this
is why a reply to a tweet should not modify the tweet itself.
Do not make users load unnecessary things
Loading the whole line each time we want to see the new tweets would result in
an unnecessarily high number of exchanged messages and would consume a lot of
bandwidth. This is why we decided to cut the lines, which are in fact just big sorted
sets, into subsets of x tweets organised in a linked-list fashion, where x is a tunable
parameter. Set fragmentation is done differently depending on the chosen design of
the datastore; this is explained later in the algorithms section.
Retrieving tweets in order
Users want to retrieve the most recently posted tweets first; tweets are thus dated
to allow ordering. Tweets must therefore be stored so that getting the most recent
ones is easy and efficient. We have built an algorithm that guarantees the correct
ordering of the tweets inside our lines even in the presence of network reordering and
failures.
Filtering the references
When a user is dissociated from a line, we do not want our application to keep
displaying the tweets he posted previously. We decided not to scan the whole line to
remove all the references added by this user. Instead, we remove the user from the list
of users associated with the line, and filter the references against this list before
fetching the corresponding tweets.
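As a sketch, the filtering step can be written as follows; TweetRef and its field names are simplified stand-ins for our real objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the filtering step: drop references whose poster is no longer
// associated with the line, so their tweets are never fetched.
public class ReferenceFilter {
    // Simplified stand-in for the real reference object (see Figure 4.3).
    record TweetRef(String poster, String tweetKey) {}

    static List<TweetRef> filter(List<TweetRef> lineRefs, Set<String> lineUsers) {
        List<TweetRef> kept = new ArrayList<>();
        for (TweetRef ref : lineRefs) {
            if (lineUsers.contains(ref.poster())) {
                kept.add(ref);   // only fetch tweets of still-associated users
            }
        }
        return kept;
    }
}
```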
Only encrypt sensitive data
Most of the data in Twitter is not private, so there would be no point in encrypting
it. Only sensitive data, such as the users' passwords, should be protected by
encryption when stored in the datastore.
Simple data structures
We believe that maintaining complex data structures is not a good idea in a key/value
store. Indeed, maintaining them requires transactions, and those are more
likely to fail if updating a data structure requires accessing a lot of different keys at
the same time.
4.3 Design of the datastore
The design of the datastore is an important part of the project. In our case it
is more complicated than for a classical database because we do not have high-level
data structures like database tables. As a reminder, Beernet and Scalaris both provide
two different data structures, key/value pairs and key/set (or key/list) pairs, the second
one allowing to store multiple values at the same key.
As we wanted an easy way to store and retrieve Java objects from the datastore,
we decided to serialize them. When serialized, Java objects are transformed into strings
conforming to the XML1 format, and are then stored as values in our datastore.
This has the advantage that we can easily recover the Java objects later if needed, or
directly answer Bwitter requests in XML without even deserializing those objects.
Moreover, XML has the advantage of being a widely used format, so a lot of existing
libraries handle it. The process to add something to our datastore is the following:
create a Java object, serialize it, choose a unique key and finally store the key/value
pair in the datastore. We deliberately postpone the discussion of robustness and of the
shared key space, as we dedicate a section to each of those two problems after the
details of our design.
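As an illustration of this serialization step, the standard java.beans.XMLEncoder turns a Java bean into an XML string; this is a sketch with one standard library option, not necessarily the library used in Bwitter.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Sketch: serializing a Java bean to an XML string before storing it as a
// value. The UserProfile bean here is a minimal example, not the real one.
public class XmlValue {
    public static class UserProfile {
        private String realName = "";
        public String getRealName() { return realName; }
        public void setRealName(String n) { realName = n; }
    }

    static String serialize(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(bean);              // emits XML 1.0 output
        }
        return out.toString();
    }

    static Object deserialize(String xml) {
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(xml.getBytes()))) {
            return dec.readObject();            // rebuilds the Java object
        }
    }
}
```

The XML string can be stored as a value directly, and either deserialized later or returned to clients as-is.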
Our first attempt to design a social network on a key/value datastore was based
on references: everything except the user object was stored at random, meaningless
keys. The user profiles contained references to the other objects belonging to the user.
For example, the lines of a user were kept in a user set whose reference was kept in
the user object.
After some thought we decided to drop the random keys and references, and replaced
them with a design based on human-understandable and computable keys. The key
space layout now looks like a file directory. We no longer need to follow a chain of
references to access an object: it can be addressed directly. This also removes the
burden of managing the references, which in turn reduces the number of operations
needed and improves performance. Moreover, the old design had a bigger space
complexity because it had to store references along the whole path from the user profile
to the object itself. Thanks to this simple addressing, it is also easier to write clear
code and avoid bugs. Note that throughout this section, when we talk about keys, the
variable parts of a key are written in bold characters while the static parts are not.
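As a small illustration of computable keys, the helpers below forge a few of the keys of the push design directly from the username; the helper names are ours.

```java
// Sketch: with computable keys, any object can be addressed directly from
// the username and object name, with no reference chain to follow.
public class Keys {
    static String userProfile(String username) {
        return "/user/" + username;
    }
    static String password(String username) {
        return "/user/" + username + "/password";
    }
    static String tweet(String username, long tweetNbr) {
        return "/user/" + username + "/tweet/" + tweetNbr;
    }
}
```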
We have two different datastore designs: one for the push approach and one for the
pull approach. The push approach pushes the posted information to the readers, while
the pull approach retrieves it from the posters. We focus on the push approach, which
we believe is the best suited to our application, and only briefly describe the pull design.
1 http://www.w3.org/XML/, last accessed 14/08/2011
4.3.1 Key uniqueness
For now we assume only Bwitter is running on the datastore. We must still ensure
key uniqueness to avoid unwanted overwriting of data. In order to do so, information
must be kept in the datastore for each key already used, and this information must be
stored at a known location. We separate the datastore into several groups of objects,
for example the tweets of a user, a line of a user, sets of tweets, etc. For each of those
groups we keep track of the number of objects in the group, so that we can forge a
new key for each new object. Each group must have a unique base key from which
we can create new unique keys for the members of the group. As an example, we show
how we add a new tweet to the tweets already posted by a user. We assume that the
tweets of a user are stored under the base key “/user/username/tweets/” (username
being the username of the user), called tweetBase, and that the number of his tweets
is stored at “tweetBase/size”. The following pseudo code adds a new tweet to the
tweets already posted by that user.
addNewTweet(tweet) {
    begin transaction
        x = Read("tweetBase/size")
        x = x + 1
        Write("tweetBase/size", x)
        Write("tweetBase/" + x, tweet)
    end transaction
}
This ensures that we always use unique keys when adding a new object to a group.
The drawback is that all object additions go through the same key, where the number
of objects is stored. Any two parallel transactions that add an object to the same
group thus conflict. It is therefore important to keep this limitation in mind while
designing our data structures.
One problem remains: we just stated that the base keys need to be unique. We
consider that the username of a user must be unique, which allows us to create unique
base keys for each user. This uniqueness can easily be checked when each user
registers.
4.3.2 Push approach design details
Users
The user object, represented in Figure 4.1, contains the real name of the user and his
registration date. Any other personal information could be added to this object later.
We store the user object at “user/username”.
Figure 4.1: User profile object of user “Paul”.
We store the hashed password of the user at the key “user/username/password”.
We use it to authenticate each operation involving a write. We store this value on its
own because it is requested more often than the rest of the user's personal information.
We propose to add a special structure, shown in Figure 4.2, that allows searching for
users. Indeed, searches are not well supported in a key/value store because application
keys are not organised in lexical order on the ring, but according to a hash function.
We thus group in the same key/set (key/list) pair the real names that share some prefix,
making no difference between upper and lower case. The user search tree we propose
is a binary search tree; we made this choice because we know it is an efficient
structure for insertion and retrieval. Leaf nodes contain mappings between real names
and usernames, which allows finding the username of a user from his real name. Indeed,
people do not necessarily know the username of someone, and we identify users by
their username. This structure is therefore crucial for users to easily find people
they know in our system. All the leaf nodes together cover the whole alphabet.
Parent nodes do not contain any search information; they only keep references to their
children. Leaf nodes have an approximate maximum size. When the size of a leaf node
reaches this limit, we add two children to it and split its responsibility interval
between the two. Due to lack of time, we did not develop a formal algorithm for this
search tree; it is thus not present in our implementation.
Figure 4.2: Username search tree.
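Since we left the formal algorithm open, the following is only one possible in-memory sketch of the search tree: leaves hold (real name, username) mappings and split their responsibility interval when they grow past a maximum size.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of one possible realization of the username search tree. Leaves map
// lower-cased real names to usernames; parents keep no search data, only
// children. This is our illustration, not part of the implementation.
public class NameSearchTree {
    private final int maxLeafSize;
    private final Node root;

    public NameSearchTree(int maxLeafSize) {
        this.maxLeafSize = maxLeafSize;
        this.root = new Node('a', 'z');       // one leaf covering the alphabet
    }

    private static class Node {
        char lo, hi;                          // responsibility interval
        Node left, right;                     // null for leaf nodes
        TreeMap<String, String> entries = new TreeMap<>();
        Node(char lo, char hi) { this.lo = lo; this.hi = hi; }
        boolean isLeaf() { return left == null; }
    }

    public void insert(String realName, String username) {
        String key = realName.toLowerCase();
        Node n = root;
        while (!n.isLeaf()) n = (key.charAt(0) <= n.left.hi) ? n.left : n.right;
        n.entries.put(key, username);
        if (n.entries.size() > maxLeafSize && n.lo < n.hi) split(n);
    }

    private void split(Node n) {
        char mid = (char) ((n.lo + n.hi) / 2);
        n.left = new Node(n.lo, mid);
        n.right = new Node((char) (mid + 1), n.hi);
        for (Map.Entry<String, String> e : n.entries.entrySet()) {
            Node child = (e.getKey().charAt(0) <= mid) ? n.left : n.right;
            child.entries.put(e.getKey(), e.getValue());
        }
        n.entries = null;                     // parents keep no search data
    }

    public String lookup(String realName) {
        String key = realName.toLowerCase();
        Node n = root;
        while (!n.isLeaf()) n = (key.charAt(0) <= n.left.hi) ? n.left : n.right;
        return n.entries.get(key);
    }
}
```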
Lines and lists
Lines and lists are really similar, so we only detail lines: lists are simply lines
without any associated users. A line has a set of tweets and a set of associated users.
In practice, and as stated in the main guidelines, tweets are not stored in lines; instead,
we store references to them. A reference contains a date, the username of the poster
(used for filtering), the username of the original poster if it is a retweet, and the key of
the referenced tweet, as can be seen in Figure 4.3.
Figure 4.3: Reference to tweet object to be stored in a line or list.
Sets of usernames are not split like tweet sets because they are always read in their
entirety when used. We also keep a set containing all the line and list names so that
we can easily retrieve them (see Figure 4.4).
Figure 4.4: Left) Lines set of user “Paul”. Right) User set of the “coolpeople” line ofuser “Paul”.
The set of tweets associated with a line or list can become very big. Following
our main design guidelines, we do not store it in one set but as a list of
chunks organised in chronological order from most recent to oldest, as can be seen in
Figure 4.5. The head is at a fixed location (/user/username/line/linename/head),
which allows us to quickly add an element to this set and to read the latest tweets. The
other chunks are located at a fixed base key (/user/username/line/linename) to which
we concatenate a number called chunkNbr. The chunk with chunkNbr equal to
0 is the oldest. The newest chunk has a chunkNbr equal to the value contained at
the key “/user/username/line/linename/size” minus 1. It is thus easy to access any
chunk of the line.
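A sketch of the corresponding key arithmetic (the helper names are ours):

```java
// Sketch: forging the chunk keys of a line. The value stored under "size"
// holds the number of chunks, so the newest non-head chunk is size - 1.
public class LineChunks {
    static String base(String user, String line) {
        return "/user/" + user + "/line/" + line;
    }
    static String headKey(String user, String line) {
        return base(user, line) + "/head";
    }
    static String sizeKey(String user, String line) {
        return base(user, line) + "/size";
    }
    static String chunkKey(String user, String line, int chunkNbr) {
        return base(user, line) + "/" + chunkNbr;
    }
    // Key of the newest full chunk, given the value read at sizeKey.
    static String newestChunkKey(String user, String line, int size) {
        return chunkKey(user, line, size - 1);
    }
}
```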
This may not be obvious at the moment, but the number of tweets in each chunk
matters a lot, as it influences the complexity of the algorithms we present in
the next section.
Figure 4.5: Top) Number of chunks in the “coolpeople” line of user “Paul”. Bottom)The head chunk and two chunks of the “coolpeople” line of user “Paul”.
Topost set
The Topost set, represented in Figure 4.6, contains references to the lines (keys of
the lines in the datastore) in which the user must post references to his tweets. We do
not store the whole key of a line because some parts of it are constant. Instead, we
store what is needed to reconstruct it: the name of the line and the username of the
line's owner.
As was the case for the lines, the Topost set is fragmented using the same technique.
Each of its chunks contains at most nbrOfFollowersPerChunk references;
this is a parameter that has to be tuned and is further discussed in section 6.3.2 of our
experiment chapter. Moreover, each chunk also has a counter, used to implement
the post tweet algorithm robustly. This counter has a value between -1 and the number
of tweets the user has posted, the latter excluded. In Figure 4.6, you can notice that
the tweets of the owner of the Topost set were not correctly posted for all the chunks:
the counter values differ between the chunks, indicating some remaining tweets to
post. We add another counter that remembers the tweet number of the last tweet
that was correctly posted; it is also initialized at -1. In this example, assuming Paul
has already posted 12 tweets, we can see that one tweet still needs to be posted for
chunk 0 and two for chunk 1.
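The bookkeeping implied by these counters can be sketched as follows: a chunk whose counter lags behind the number of the last tweet posted still has tweets to push.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the counter bookkeeping of the Topost set. Counters start at -1;
// a chunk counter equal to the number of the last tweet posted means the
// chunk is up to date. Method names are ours.
public class TopostProgress {
    // Tweets still to push to the lines of one chunk.
    static int remaining(int lastTweetNbr, int chunkCounter) {
        return lastTweetNbr - chunkCounter;
    }

    // Chunk numbers that still have tweets to push.
    static List<Integer> pendingChunks(int lastTweetNbr, int[] chunkCounters) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < chunkCounters.length; i++) {
            if (chunkCounters[i] < lastTweetNbr) pending.add(i);
        }
        return pending;
    }
}
```

In the example of Figure 4.6, a last-tweet counter of 11 with chunk counters 10 and 9 yields one and two remaining tweets respectively.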
Figure 4.6: Different parts of the Topost set of user “Paul”. Top left) Number of chunks in the Topost set. Top right) Global counter of correctly posted tweets. Center) Chunk counters of correctly posted tweets. Bottom) Chunks of the Topost set.
Tweet
The messages the users post are called tweets. As mentioned before, a tweet is a
small message of at most 140 characters. The tweet object contains a message field as
well as a poster field. Moreover, some tweets can be retweeted; to handle this situation
we added an original author field that contains the name of the original author of the
tweet. This field is null if the tweet is not a retweet. Tweets are also dated with second
precision; the time stored in the datastore is the Greenwich Mean Time (GMT) for
the whole system, and it is up to the GUI layer to adapt the time to the local area
when displaying the tweet. A field indicates whether the tweet was deleted by its
owner. Finally, users can reply to tweets. We want to be able to reconstruct the
complete conversation from any tweet; therefore we keep a reference to a potential
parent and to a set of children. An example is given in Figure 4.7: Tweet2 is a
response to Tweet1, Tweet3 is a response to Tweet2, Tweet6 is a response to Tweet4,
and so on. Tweets are stored only once in the datastore; we made this choice in order
to make their deletion easier and to minimize the data stored in the datastore.
Figure 4.7: Conversation Tree.
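The fields described above can be summarised in a class sketch; the field names are illustrative, not the exact ones of our implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the tweet object's fields as described in the text.
public class Tweet {
    String message;                 // at most 140 characters
    String poster;                  // username of the poster
    String originalAuthor;          // null unless the tweet is a retweet
    long dateGmtSeconds;            // posting time, GMT, second precision
    boolean deleted;                // set when the owner deletes the tweet
    String parentKey;               // key of the tweet answered, or null
    List<String> childrenKeys = new ArrayList<>();  // keys of the replies

    boolean isRetweet() { return originalAuthor != null; }
}
```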
The key of a new tweet is the concatenation of the key prefix “/user/username/tweet/”
with the number of tweets already posted by the user. The schema of tweet number 42
posted by the user “Paul” is shown in Figure 4.8.
Figure 4.8: Left) Tweet number 42 object of user “Paul”. Right) Number of tweets ofuser “Paul”.
4.3.3 The Pull Variation
As explained in the introduction, we also decided to experiment with a variation
of the push-based design, and to observe how the system would behave if we pulled
the information instead of pushing it. As this was not our primary goal, we first
focused on making the design of the datastore as efficient as possible with only the
push approach in mind, and afterwards tried to fit the pull variation in. This went
very well, as the pull approach borrows a great majority of the building blocks and
even mechanisms of the push approach.
In the pull variation we store the references only on the owner's side; we explain how
those tweets are retrieved in the algorithms chapter. Furthermore, the references are
kept grouped by timestamp, meaning that the tweets posted during the same time
frame, for instance the same hour, are grouped together. The timestamp is of the
form 05/06/11 15 h 26 min 03 s GMT, with some fields set to zero according to the
chosen time granularity. For instance, if we want the references to be grouped by
hour, we would use a timestamp of the form 05/06/11 15 h 00 min 00 s GMT. The
full key looks like this: /user/username/tweet/timestamp.
We also have to store the subscription date of the user in order to compute
the equivalent of the chunk numbers. This date is stored at the key
user/username/starttime.
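A sketch of how the grouping key can be forged with the standard java.time API, assuming an hourly granularity:

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Sketch: grouping key for the pull variation. The posting time is truncated
// to the chosen granularity (here: the hour) before being embedded in the key.
public class PullKeys {
    // Minutes and seconds are rendered as zeros, matching the truncation.
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("dd/MM/yy HH' h 00 min 00 s GMT'");

    static String timestampKey(String username, ZonedDateTime postedAt) {
        ZonedDateTime gmt = postedAt.withZoneSameInstant(ZoneOffset.UTC)
                                    .truncatedTo(ChronoUnit.HOURS);
        return "/user/" + username + "/tweet/" + FMT.format(gmt);
    }
}
```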
Object                        Type   Key                                       Description
User profile                  Value  /user/username                            User profile with user information
Password                      Value  user/username/password                    Hashed password of the user
Topost set                    Set    /user/username/topost/chunkNbr            Set of lines where the user has to post the references to his tweets
Topost chunk counter          Value  /user/username/topost/chunkNbr/counter    Counter associated to each chunk of the Topost set
Topost set size               Value  /user/username/topost/size                Number of chunks in the Topost set of a user
Last tweet correctly posted   Value  /user/username/topost/lasttweetposted     Tweet number of the last tweet correctly posted
Tweet                         Value  /user/username/tweet/tweetNbr             Tweet object containing the message
Replies to tweet              Set    /user/username/tweet/tweetNbr/children    Replies to the tweet
Tweet counter                 Value  /user/username/tweet/size                 Number of tweets posted by a user
Lines set                     Set    /user/username/linenames                  Names of the lines of the user
Line chunk                    Set    /user/username/line/linename/chunkNbr     Chunk of a line containing tweet references
Line chunk counter            Value  /user/username/line/linename/size         Number of chunks in the line (head not counted)
Line users                    Set    /user/username/line/linename/users        Users associated to a line
Lists set                     Set    /user/username/listnames                  Names of the lists of the user
List chunk                    Set    /user/username/list/listname/chunkNbr     Chunk of a list containing tweet references
List chunk counter            Value  /user/username/list/listname/size         Number of chunks in the list

Table 4.1: Keys used in the datastore for the push design
4.3.4 Conclusion
Table 4.1 summarises the kinds of keys we use in our datastore for the push design.
Every key used is of course unique. Remember that the text in bold is variable while
the rest is static.
Our datastore design was rebuilt several times in order to meet the important
criteria we fixed: simplicity, scalability and clarity. We built a structure for the lines
that allows retrieving the latest tweets easily and in chronological order. We cut
the lines and the Topost set into chunks because both can become very big (billions
of tweets and millions of followers). We also designed a structure to efficiently search
for users in the system.
Concerning which approach is the best between push and pull, the intuition is that
push is the best approach for reads while pull is the best for writes. We compare the
two approaches theoretically when discussing the algorithms in Chapter 5 and test
them in the experiments of Chapter 6.
4.4 Running multiple services using the same datastore
There are numerous situations where multiple applications may want to share the
same datastore. For instance, we could easily imagine a globally distributed datastore
deployed in a peer-to-peer environment being used by multiple applications, exactly as
we suggested in our first architecture. This would encourage users to let their datastore
node run longer and would mitigate the heavy churn problem we would face if those
users only used the datastore for our Bwitter application: they would launch it to
consult their latest tweets or to post a tweet, and then directly close it. Although we
do not face this churn problem in a cloud or any other stable environment, the remark
is also valid there. Indeed, an application's plugins may want to store additional
data that should not interfere with that of the main program, while still being able to
access it.
So while we could limit access to the datastore to Bwitter, this would be a clear
limitation. We are thus going to take a closer look at the problem of sharing the
datastore, and particularly the key space. After some thought, we reduced the problem
of sharing the key space to two smaller problems: keys already used and unfortunate
or malicious data erasing.
We explored different ways to solve those problems at the datastore level. Even
though we did not use those solutions in the end, it is still relevant to expose our work
and conclusions here. Note that this work has only been done on Beernet and not on
Scalaris, due to our privileged collaboration with the developer of Beernet, Boris
Mejías, since the beginning of our project.
4.4.1 The unprotected data problem
Early in the process, we elicited a crucial requirement: the integrity of the data
posted by the users on Bwitter must be preserved. A classical mechanism, though not
without flaws, is to use a capability-based approach: data is stored under randomly
generated keys, so that other applications and users cannot erase the values because
they simply do not know at which keys the values are stored. However, in applications
where content has to be publicly available, we cannot protect all our values simply by
using unguessable keys. For example, Bwitter allows any unknown user to add his
name to the Topost set of another user in order to subscribe to his tweets. This set
must not only be readable by any user but also writable by any user. In practice, we
would use the set abstraction provided by Beernet to implement it. Any user needs
the possibility to add an element to the set, but it should be impossible for anyone
but the creator of the set and the user who added a value to remove that value. The
problem is that Beernet does not allow any form of authentication, so key/value pairs
are unprotected: anybody able to send requests to Beernet can modify and delete any
data previously stored. We detail here several solutions we imagined to solve this
problem.
Safe environment assumption
First, we assume Beernet is running in the cloud and that the nodes are managed
by a different entity than the applications running on top of it. This means that
nobody but this entity can add nodes, and that the communications between the
different nodes cannot be spied on: Beernet inter-node communications take place on
a LAN inaccessible from outside the cloud. Moreover, we assume that the
communications between Beernet nodes and applications are encrypted, so nobody is
able to spy on them.
Cooperation between applications
The most naive solution is to assume that all the applications running on Beernet
are written without bugs and are respectful of each other. This means that each time
an application wants to write a Key1/Value1 pair, it checks that no other Key1/Value2
pair with the same key was already written by another application. Additionally, this
operation has to run in a transaction to avoid race conditions. This should normally
not induce too much performance overhead, because applications usually run a
transaction anyway each time they store a value using the transactional replicated
storage of Beernet. In order to be able to perform this check, each time a value is
stored, information identifying the application that posted the value must be manually
added to the value by the posting application.
This solution makes a strong assumption, and even if this assumption holds, it adds
complexity to the code of each application running on Beernet: applications need to
parse each value they read and add information to each value they post.
Data protected by secrets
We now lift the assumption made in the previous solution: we assume that several
applications running on top of Beernet are not respectful of each other and do not
cooperate. We would like to enable an application A to protect the values it posted
from being overwritten by an application B. This is not possible without the help of
Beernet, because the two applications can access Beernet freely and are not
cooperating. We have thus designed a solution that enhances the API of Beernet: an
application can protect a key/value pair it posted using a secret of its own choosing.
This secret is then needed by any operation that tries to modify or delete the value
associated with the newly protected key. Because Beernet is running in a secure
environment, secrets will not leak from Beernet; a malicious user can still try to guess
an application's secret, but it is the application's responsibility to use secrets that are
hard to discover.
A secret mechanism was developed for OpenDHT [26]: it makes it possible to attach
a removal secret to a value, which is then requested when a delete operation is
performed. In a very similar fashion, Beernet's secret mechanism allows sharing values
with other applications while keeping them protected at the same time. Application A
can now write a value and protect it against editing and deleting using a secret.
Without this secret anyone can still read the value, but nobody can edit or delete it.
But the secret mechanism developed for Beernet goes further: sets are now
protected by secrets too, and offer much more flexibility. Three different secrets can
be used to protect the different parts of a set.
First, the Set Secret is one of the two secrets associated with the set itself when
it is created. It can be seen as a master key allowing its owner to perform any
operation on the set. The creator of the set can thus destroy the set along with all its
contents, insert items into the set, but also delete each item contained in the set
separately.
Secondly, the Write Secret is the other secret associated with the set itself when
it is created. This secret is required to add an item to the set. This way, the creator
of the set can decide to whom he gives the right to add items to his set.
Finally, the Value Secret is associated with a given item in the set. It protects a
single item against editing and deleting, so that only the user who added the item and
the owner of the set can delete it. This secret is chosen by the user who adds the value
to the set.
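To make the three rules concrete, here is a small in-memory model of a secret-protected set. It only mimics the access rules; it is not Beernet code.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the three-secret protection of Beernet sets, to make
// the rules concrete. This only mimics the access checks, not the
// distributed store itself.
public class SecretSet {
    private final String setSecret;    // master secret held by the creator
    private final String writeSecret;  // required to add items
    private final Map<String, String> items = new HashMap<>(); // value -> value secret

    public SecretSet(String setSecret, String writeSecret) {
        this.setSecret = setSecret;
        this.writeSecret = writeSecret;
    }

    // Adding requires the Write Secret; the caller picks a Value Secret.
    public boolean add(String wSecret, String value, String valueSecret) {
        if (!writeSecret.equals(wSecret)) return false;   // abort
        items.put(value, valueSecret);
        return true;                                      // commit
    }

    // Removing requires the item's Value Secret or the Set Secret.
    public boolean remove(String secret, String value) {
        String vSecret = items.get(value);
        if (vSecret == null) return false;
        if (secret.equals(vSecret) || secret.equals(setSecret)) {
            items.remove(value);
            return true;
        }
        return false;
    }

    // Reading is never restricted by a secret.
    public boolean contains(String value) { return items.containsKey(value); }
}
```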
This new way of protecting sets makes it easy to implement numerous applications
based on user-posted content. Comments on blogs, for instance, become extremely
easy. The author of a blog can give other users the permission to add comments to an
entry. All users can then see the comments posted by their peers, but can only edit
the comments they posted themselves. The author can moderate the posted
comments, as he also has the right to delete and edit them. This is only a short and
simplistic example, but we are convinced this new secret mechanism will make the
development of more complex applications much easier.
New semantics using secrets We need three new kinds of fields, one for each secret,
in addition to the existing Key and Val fields. Those new fields are automatically set
to NO SECRET when applications use the functions of the old API, which do not take
any secret. NO SECRET is a reserved value of Beernet indicating the absence of a
secret. As an example, we show the difference for the put function. It used to be:
put(K:Key V:Val)
Stores the value Val associated with the key Key at the node responsible for Hash(Key).
This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key, or there is a value stored
previously by a put operation;
• the value has successfully been stored.
Otherwise the operation returns “abort” and nothing is changed.
The new version is now:
put(S:Secret K:Key V:Val)
Stores the triplet (Hash(Secret) Key Val) at the node responsible for Hash(Key). This
operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key, or there is a triplet stored
previously by a put operation;
• there is no triplet (Secret1 Key Val1) stored at the node responsible for Hash(Key)
such that Hash(Secret) != Hash(Secret1);
• the value has successfully been stored.
Otherwise the operation returns “abort” and nothing is changed.
If no value is specified for Secret, Beernet will assume the call is equivalent to
put(S:NO SECRET K:Key V:Val).
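The semantics above can be modelled locally as follows; this sketch only captures the commit/abort rules, not Beernet's replication or transactions, and the hash function is a stand-in.

```java
import java.util.HashMap;
import java.util.Map;

// Model of the new put(S K V) semantics: a put commits only if the key is
// free or the stored triplet carries the same secret hash. NO_SECRET is the
// reserved "no secret" value.
public class SecretStore {
    public static final String NO_SECRET = "NO_SECRET";
    // key -> {hash(secret), value}
    private final Map<String, String[]> store = new HashMap<>();

    // Stand-in for Beernet's hash function.
    static String hash(String secret) { return String.valueOf(secret.hashCode()); }

    public String put(String secret, String key, String val) {
        String[] triplet = store.get(key);
        if (triplet != null && !triplet[0].equals(hash(secret))) {
            return "abort";                 // protected by a different secret
        }
        store.put(key, new String[]{hash(secret), val});
        return "commit";
    }

    // Old-API put: equivalent to put(NO_SECRET, key, val).
    public String put(String key, String val) { return put(NO_SECRET, key, val); }

    public String get(String key) {
        String[] t = store.get(key);
        return t == null ? null : t[1];     // reading never requires the secret
    }
}
```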
The whole new API of Beernet now contains a secure version of put, write, add,
remove and delete, and also allows explicit set creation. The full semantics of the new
API can be found in the annexes in Chapter 8.
4.4.2 Key already used problem
At the moment, Beernet and Scalaris, like all key/value stores we know of, have
only one key space. This means that multiple services have to share it: if a service
uses a key, another service cannot use it anymore. For some applications, not being
able to use a given key can be very annoying, as keys may have a defined meaning and
the application expects to find a certain type of information at a certain type of key.
This can be solved by designing more complex algorithms at the application level, but
this adds complexity not directly linked to the application, which is, in our view, a
bad idea. Sharing a key space can thus create problems if multiple services want to
use the exact same keys. For instance, if another service decides to store the
usernames of its users at the keys “user/username”, we have a conflict with our
Bwitter application: the applications cannot both have a user with the same
username. This problem cannot be solved with the secrets mechanism we proposed,
because, unlike the unprotected data problem we just presented, the goal is not to
protect the data but to avoid key conflicts between applications. It can, however, be
solved using a capability-based approach.
The simplest way to avoid using the same keys is to prepend a differentiation
number to every key. When an application wants to start using Beernet, it generates
a random root key, for instance 93981452. From then on, the application only uses
keys starting with 93981452. If we can be confident enough that no other application
will use this root key, we can assume that we are working in our own key space. We
can thus design the application accordingly, removing the burden of complex
algorithms to recover from a key already being used. In RFC 4122,2 the authors claim
to be able to generate globally unique identifiers; we could use those identifiers as root
keys, as the chance that such a key would be used twice on the same datastore is
infinitesimal.
This approach is also valid if you want to hide data from some applications or
users: guessing the root key is imaginable, but in practice not possible.
2 Can be found at http://www.ietf.org/rfc/rfc4122.txt
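A sketch of this root-key approach, using an RFC 4122 UUID as the root key:

```java
import java.util.UUID;

// Sketch: carving a private key space out of the shared one by prefixing
// every key with a random root key, here an RFC 4122 UUID generated once
// at first launch of the application.
public class KeySpace {
    private final String rootKey;

    public KeySpace() { this(UUID.randomUUID().toString()); }
    public KeySpace(String rootKey) { this.rootKey = rootKey; }

    // Every application key is prefixed with the root key.
    public String key(String applicationKey) { return rootKey + applicationKey; }
}
```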
4.4.3 Conclusion
In this section we addressed two problems that arise when multiple applications
share the same key space, namely the unprotected data problem and the key already
used problem. The first was solved with the secret mechanism that we designed for
Beernet, now implemented in Beernet version 0.9: key/value pairs and key/value sets
can be protected by a secret needed to modify or delete the values. We even proposed
a finer granularity at the set level: it is possible to create a set controlled by one
person that can nevertheless be read and written by several, while preventing users
other than the managers from modifying or deleting values posted in the set. The
second problem is solved with the capability-based approach. Thanks to those two
mechanisms, we can run multiple applications in parallel on the same Beernet without
any interference between them.
Chapter 5
Algorithms and Implementation
This chapter contains four sections. We first show the implementation of the cloud
based architecture we detailed in section 3.2.2. We then take a closer look at our three
main modules: the Nodes Manager, the Scalaris Connection Manager and the Bwitter
Request Handler. The Nodes Manager is responsible for launching the machines
needed, as well as performing remote operations on those machines. The task of the
Scalaris Connection Manager is to control the access of Bwitter to Scalaris. We finish
this chapter by presenting all the algorithms we designed for the Bwitter Request
Handler. Those algorithms were designed to work with a key/value datastore support-
ing transactions. We also give a theoretical estimation of the number of reads and
writes performed by Bwitter for a given social network.
5.1 Implementation of the cloud based architecture
We did not produce the current implementation of Bwitter directly: we first went
through two other implementations that share several similarities with the current one.
In this section we briefly describe those first two implementations, as they are an
integral part of our project, and finish by detailing the third and final version.
5.1.1 Open peer-to-peer implementation
The first version implemented the open peer-to-peer architecture we presented in
section 3.2.1. In this solution it was necessary to protect data from malicious or unin-
tentional modification at the datastore level. This is why we developed the secrets
mechanism for Beernet described in section 4.4.1. The secrets were used by Bwitter
to protect user data. This version was stateful, meaning that the client had to establish
a session by logging in before being able to use the functions offered by the Beernet
API. This was not really practical because the Beernet nodes had to keep track of all
the connected clients. Moreover, the load balancer had to be configured to always
assign the same client to the same Bwitter node. This first version was never fully
implemented and only reached the draft state.
5.1.2 First cloud based implementation
Along the way we realized, as explained in section 3.2.4, that the cloud architecture
was much better suited to our project. We thus made heavy changes to our imple-
mentation and came up with the second version of our application. Due to unexpected
maturity problems of Beernet, we were not able to test our implementation with it,
so it ran on an emulated DHT.
This implementation was fully operational and even had a functional GUI. It was
presented at the “Foire du Libre” held on the 6th of April at Louvain-la-Neuve1, where
visitors could try it at the Beernet stand.
As time went by, it became apparent that we would not be able to use Beernet for
our implementation, so we decided to switch to Scalaris. Furthermore, after some
preliminary tests on our second implementation, we identified some heavy changes to
be made to our Bwitter API. This was caused by the decision to get rid of the sessions
we were maintaining for our users and to have an API closer to the Representational
State Transfer (REST) principles [27]. This change in the interface, combined with the
need to switch to a new scalable database, made us decide to start a fresh third
implementation.
Figure 5.1: View of our global architecture, highlighting the three main layers: the GUI, Bwitter and Scalaris.
1“Foire du Libre” is a fair celebrating open source software, organised by Louvain-li-Nux: http://www.louvainlinux.be/foire-du-libre/, last accessed 05/08/2011
5.1.3 Final cloud based implementation
We will now present the final version of our application, which implements the cloud
based architecture we detailed in section 3.2.2. Figure 5.1 shows a full representation
of our implementation.
The GUI
We currently do not have a fully functional GUI but a minimal one demonstrating
the important features of our application. Indeed, we focused on the design of other
aspects of our implementation and thus leave the complete implementation of the GUI
as future work. We could not adapt the previous version of the GUI as it was designed
for an old version of our application using a significantly different version of the API.
The GUI was implemented using the Flex technology from Adobe2. This technology
allows the creation of Rich Internet Applications (RIA). We decided to create a GUI
that could be accessed through a web browser so that it could be used directly with
any operating system and even with smart phones. A screenshot of this basic GUI can
be seen in Figure 5.2.
Figure 5.2: The GUI of our second implementation.
2http://www.adobe.com/products/flex/, last accessed 05/08/2011
Bwitter layer
This is our main layer; it contains a Nginx3 load balancer, a Tomcat4 server, the
Bwitter Request Handler (BRH), the Nodes Manager (NM), the Scalaris Connections
Manager (SCM) and a cache system.
The Nginx load balancer is not really a part of our implementation. Indeed, we did
not modify it, and the only thing needed in order to use it is to configure it with the IP
addresses of the Bwitter nodes. As those nodes are stateless, no other special
configuration is needed.
The Tomcat 7.0 application server uses Java servlets from Java EE to offer a web-
based API and relays the requests to the BRH. Those Tomcat servers are accessed
through a reverse proxy server, in this case the Nginx load balancer, which is configured
to support 10k concurrent connections. The Nginx load balancer can be configured to
serve static content, for example the GUI application, as well as to do load balancing
for the Bwitter nodes. The connections of the GUI to the web-based API are performed
over HTTPS in order to guarantee a secure channel.
We currently have the BRH, NM and SCM running on Amazon, they are detailed
in sections 5.2, 5.3 and 5.4.
The cache The SCM uses Ehcache v2.4.05 as cache system in order to increase
performance and mitigate the popular value problem we discussed in section 3.2.3.
Note that we have one cache per Bwitter node and that the caches are not synchronised.
The values in the caches have a time to live of one minute so that they are refreshed
periodically. Values are added to the cache during read operations, not during
write operations. The cache only keeps three different kinds of values in memory:
tweets, passwords and references to tweets; all the other elements are accessed directly
through Scalaris. As previously explained, the tweets were designed to be as immutable
as possible so that they could be included in the cache. The references to tweets are
static too and are used in the posting recovery mechanism. The passwords are values
that are used very often, as for each post we must fetch the hash of the password stored
in the system in order to verify that the password provided is the correct one. The
three elements cited above are only accessed through the cache if they are accessed
via a transaction in which they are the only elements involved; this is done in order to
keep the strong consistency properties in the other cases. For example, in the first
pseudo code below, the two elements would be accessed through Scalaris.
{
begin transaction
tweet t = read(someTweetKey)
write(someKey , someValue)
end transaction
}
3http://nginx.net/, last accessed 08/08/2011
4http://tomcat.apache.org/, last accessed 08/08/2011
5http://ehcache.org/, last accessed 08/08/2011
In the second example below, the tweet can be accessed through the cache because it
is the only element involved in its transaction.
{
begin transaction
tweet t = read(someTweetKey)
end transaction
begin transaction
write(someKey , someValue)
end transaction
}
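The per-node cache policy described above can be sketched as follows. The real system relies on Ehcache; this stand-in sketch uses a plain ConcurrentHashMap with a fixed time to live, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in sketch of the per-node cache policy (the real implementation
// uses Ehcache): entries live for a fixed TTL, are inserted on reads
// only (never on writes), and each Bwitter node has its own,
// unsynchronised instance.
public class TtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Called after a successful datastore read; never on writes.
    public void putFromRead(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns null on a miss or an expired entry, forcing a fresh read
    // from Scalaris, so stale values disappear once the TTL elapses.
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || System.currentTimeMillis() > e.expiresAt) {
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```

With a one-minute TTL this matches the behaviour described above: a popular tweet is served from the local cache for at most a minute before being re-read from Scalaris.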
Scalaris layer
The lowest layer is the Scalaris layer, which is accessed and managed via the SCM
and NM. We started the development of our system with Scalaris version 0.2.3 and
switched to version 0.3.0 when it was released on the 15th of July, as it gave better
performance and corrected some bugs.
5.2 Nodes Manager
The Nodes Manager (NM) was designed to facilitate our tests and to allow us to
easily control nodes. The NM can start Bwitter nodes as well as Scalaris nodes. We
mainly use it to start the Scalaris nodes forming the initial ring for our tests and to
start additional Scalaris nodes during our elasticity tests.
As we will further explain during our experiments in chapter 6, we are working
with the Amazon cloud infrastructure. We made heavy use of the Java API6 Amazon
offers in order to control the nodes, as it is closely linked to the tasks the NM performs.
Indeed, this API allows starting new machines on the cloud and checking the state of
the machines associated with an account. We list below the main tasks the NM performs
and describe briefly how we realized them.
As just said, the NM can be used to start new Bwitter nodes, but we did not design
any mechanism to detect when nodes should be added or removed. There are different
kinds of observable behaviours preceding flash crowds in social networks, and it should
be possible to study them in order to predict flash crowds, but we did not do so. We
rather decided to focus on other aspects of our system.
Start new machines The NM can send commands to Amazon in order to start new
machines of a given type (Scalaris nodes or Bwitter nodes). We must specify the security
group (which indicates which ports must be open on a machine), the location of the
machine (US East, Europe, ...), the type of instance (m1.small, c1.medium, ...) and
finally the security keys used to access the machines remotely. It is also possible to add
6http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/ec2/AmazonEC2.html, last accessed 07/08/2011
tags to machines (in a key/value pair fashion) to identify them more easily. We use
those to make a clear distinction between the Bwitter nodes and the Scalaris nodes.
Wait for machines to be started Once the command to start the machines has
been sent to Amazon, it is necessary to wait for the machines to be running. We do this
by regularly requesting the states of all the instances of the Amazon account and
waiting until all the machines are in the running state. It is important to understand
that the objects returned by Amazon API calls are not updated dynamically. This means
that an object representing an instance may not accurately represent that instance and
must be refreshed regularly to avoid working with old information.
Machines can be in five states: running (the machine is started), shutting-down (the
machine is stopping), pending (the machine is being started), stopped (the machine is
stopped but can be restarted) and terminated (the machine is stopped and cannot be
started anymore).
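The wait-until-running loop can be sketched as follows. In the real Nodes Manager the states come from repeated describeInstances calls to the Amazon API; here a Supplier stands in for that call, and all class and method names are our own:

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of the wait-until-running loop of the Nodes Manager. In the
// real implementation the states come from a fresh Amazon API request
// on every iteration, because objects returned by the API are never
// updated after the fact; here a Supplier stands in for that call.
// Returns the number of polls performed.
public class InstanceWaiter {
    public static int waitAllRunning(Supplier<List<String>> fetchStates, long pollMillis) {
        int polls = 0;
        while (true) {
            polls++;
            // Always re-fetch: a cached instance object would be stale.
            List<String> states = fetchStates.get();
            if (states.stream().allMatch("running"::equals)) {
                return polls;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```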
Check machine reachability Some machines can be unreachable even when they are
in the running state. We do not know the reasons why machines are sometimes
unreachable, but we noticed that machines in Amazon sometimes do not respond to
ping requests even from inside the private LAN of their own security group. In addition,
they sometimes respond to ping requests but not to SSH because they did not correctly
initialize their security keys at boot time. This problem gave us a lot of trouble during
the tests: when it happens it is necessary to reboot the machine and sometimes to
restart the test.
Launch a fresh Scalaris ring Once the machines are launched we still need to start
Scalaris on them. This first requires creating the configuration file, whose main use is
to indicate which nodes are already in the ring. By default we then stop any remaining
instances of Scalaris running on those machines before restarting it. The first node is
launched and the configuration file for the other nodes is built. We then launch the
remaining nodes sequentially, one every 2 seconds. Finally, we wait a small period of
time, proportional to the size of the ring, to let it stabilise; a fresh ring is then launched.
Add nodes to an existing ring This is similar to starting a new ring, except that
several nodes are already in the ring, therefore the configuration file contains several
nodes and not only the first one.
Reboot a node Particularly with version 0.2.3 of Scalaris, which we used at the
beginning of our work, we frequently ended up in situations where a node was correctly
inserted in the ring but it was impossible to perform any write on it. We thus created
a function to restart such a Scalaris node and insert it again in the ring so that the
test could continue normally. This usually happens right after the insertion of the node,
so we do a series of dummy writes to test whether the node is correctly bootstrapped
and, if it is not, we restart it. In version 0.3.0 of Scalaris this bug is nearly nonexistent.
Most of those functions require performing remote actions. In order to send files and
run commands on remote machines we use the Runtime class of Java combined with SSH
and SCP. It was necessary to use the two options “-o UserKnownHostsFile=/dev/null”
and “-o StrictHostKeyChecking=no” with SSH and SCP in order to avoid checking
whether the host was already seen before; otherwise the execution stalls because SSH
waits for an answer that never comes. For example, to stop Scalaris on a machine, we
run the following command using the exec method of the Runtime object.
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i BwitterXM.pem
ubuntu@10.118.130.36 "sudo killall beam"
We use a thread pool in order to run several commands and send files in parallel to
improve the throughput, which is a must because the time to launch a complete ring
can otherwise be very long.
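This parallel remote execution can be sketched as follows, with a hypothetical runCommand function standing in for the actual Runtime.exec SSH invocation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

// Sketch of running remote commands in parallel with a thread pool, as
// the NM does for ssh/scp. runCommand is a stand-in for the real
// Runtime.exec() call; it returns the command's exit code.
public class ParallelCommands {
    public static List<Integer> runAll(List<String> commands,
                                       ToIntFunction<String> runCommand,
                                       int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit every command so they run concurrently.
            List<Future<Integer>> futures = new ArrayList<>();
            for (String cmd : commands) {
                futures.add(pool.submit(() -> runCommand.applyAsInt(cmd)));
            }
            // Collect exit codes, waiting for each command to finish.
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> f : futures) {
                try {
                    exitCodes.add(f.get());
                } catch (Exception e) {
                    exitCodes.add(-1); // treat a failed command as exit code -1
                }
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }
}
```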
In conclusion, the NM allows us to automatically and efficiently launch Scalaris
rings. This was a valuable tool, as we needed to start a new ring for each test we did.
Once a ring is launched we still have to connect to it; we explain how we do this in
the next section, which is dedicated to the Scalaris Connections Manager.
5.3 Scalaris Connections Manager
The Scalaris Connections Manager (SCM) is implemented in a producer/consumer
fashion. The producers are the Bwitter functions. They produce small pieces of work
which use Scalaris functions; we call them Scalaris runnables (SR). SRs typically contain
one Scalaris transaction, but they can contain several provided that the failure of one
of them does not introduce an inconsistent state. The SCM stores them in a blocking
FIFO queue, and the consumers, which we call Scalaris workers (SW) and which are
managed by the SCM, access this queue to execute the SRs. Bwitter functions can
efficiently wait until the result of an SR is computed. Accesses to the SRs are
synchronised, and the SWs notify any function that was waiting for the result of an SR
as soon as it is computed or the execution of the SR is aborted. This design allows the
Bwitter layer running on top of Scalaris to easily make parallel requests to different
Scalaris nodes without taking care of any connections or threads: it suffices to take a
big task, split it into several SRs and push them onto the queue of the SCM.
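This producer/consumer scheme can be sketched in Java roughly as follows. The class and method names are illustrative, and a Callable stands in for an SR:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the SCM pattern: Bwitter functions (producers)
// wrap a Scalaris runnable (SR) in a FutureTask and push it onto a
// blocking FIFO queue; each Scalaris worker (SW) owns one connection
// and drains the queue. Callers block on the task handle until the
// result of their SR is computed.
public class ConnectionsManager {
    private final BlockingQueue<FutureTask<?>> queue = new LinkedBlockingQueue<>();

    // Producer side: enqueue an SR and return a handle to wait on.
    public <T> FutureTask<T> submit(Callable<T> scalarisRunnable) {
        FutureTask<T> task = new FutureTask<>(scalarisRunnable);
        queue.add(task); // unbounded queue: never blocks the producer
        return task;
    }

    // Consumer side: one such loop per SW, each with its own connection.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run(); // run the SR on this SW's connection
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Block until the SR's result is available.
    public static <T> T await(FutureTask<T> task) {
        try {
            return task.get();
        } catch (Exception e) {
            throw new RuntimeException("SR aborted", e);
        }
    }
}
```

Note that, as discussed at the end of this section, an SR must not submit another SR and wait on it: with a single worker that would deadlock.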
We show in the next chapter, in section 6.2.2, that controlling the number of connec-
tions to Scalaris nodes is important to get the best performance. Opening too many
connections increases the degree of conflicts and does not improve the performance;
on the contrary, having too few connections lowers it. We thus want to control the
number of connections to Scalaris and avoid opening and closing them, as this
needlessly consumes resources. Moreover, a connection cannot be used by several
threads: it can only handle one transaction at a time, otherwise unknown errors start
appearing. It is thus crucial to control access to a connection correctly so that only
one SW accesses it at a time.
We solve this problem by associating a dedicated connection with each SW. We could
have solved it differently: the SCM could have managed a pool of connections instead of
a pool of SWs and dispatched arriving work to a new thread with a free connection. We
believe our solution is better because we do not need to create a new thread for each
SR. A thread is created only once, when a SW is created, which greatly reduces the
time used to manage the life cycle of threads. Figure 5.3 shows the architecture of the
whole SCM connected to Scalaris nodes.
Figure 5.3: Scalaris Connections manager connected to Scalaris nodes.
It is possible to call the SCM to add a new SW to the existing ones or to remove
a SW on the fly; this does not need to be statically configured. The SCM always
connects the new SW to the Scalaris node that has the lowest number of connections. It
does so by associating a Scalaris node with a SW; the SW is then responsible for opening
the connection to the Scalaris node. The SWs are responsible for managing the
connection they have opened. They automatically reconnect if the connection is lost and
they also restart a Scalaris node if it has crashed. This must be done carefully because
several SWs can be using the same Scalaris node; we thus synchronized them so that
only one SW is responsible for restarting a dead Scalaris node. The state machine of a
SW can be seen in Figure 5.4; it highlights the different states and the events leading
from one state to another. The SW starts by trying to connect to its Scalaris node; if too
many connection attempts fail, the SW restarts the node and retries. Once the SW is
connected it waits for SR jobs to be run on the Scalaris node, runs them, retrieves the
results and waits for another SR. If the connection with the Scalaris node is lost, the
SW tries to reconnect.
Figure 5.4: State machine of a Scalaris Worker.
An important thing to notice about this design is that an SR must never create another
SR, add it to the blocking queue of the SCM and wait for its result. This situation can
indeed create deadlocks. Take the simple case where we have only one SW: it takes an
SR, called sr1, and executes it. If sr1 creates another SR, called sr2, adds it to the
blocking queue and waits for its result, we have a deadlock. Indeed, no SW will ever
execute sr2, as our only SW is already busy with sr1.
5.3.1 Failure handling
An SR can fail, but as you can see in Figure 5.4, the SW runs the SR several
times before aborting it and taking a new one. This implies that SRs must be designed
in such a way that when they fail they do not introduce partial state in the database
and can thus be restarted without any risk. Being able to restart jobs at this low layer
is important because it simplifies the algorithms that run on top of it: they are not
forced to develop their own strategy to recover from failures of Scalaris operations.
This is needed if we want to avoid aborting high level tasks too often. Those tasks can
be quite complex and contain several SRs, which increases the probability that at least
one SR fails. We only throw an exception at the higher level when the SR has failed
several times. Algorithms running on top of the SCM can then decide whether they
want to completely abort the task they were running or to resend the SR to the SCM.
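The retry policy can be sketched as follows. This is a simplified illustration and the names are our own:

```java
import java.util.concurrent.Callable;

// Sketch of the SW retry policy: an SR leaves no partial state when it
// fails, so it can safely be re-run up to maxTrials times before the
// failure is surfaced to the algorithm above (which may then abort its
// task or resubmit the SR). Assumes maxTrials >= 1.
public class RetryingRunner {
    public static <T> T runWithRetries(Callable<T> sr, int maxTrials) {
        RuntimeException last = null;
        for (int trial = 1; trial <= maxTrials; trial++) {
            try {
                return sr.call();
            } catch (Exception e) {
                // Safe to retry: SRs are restartable without risk.
                last = new RuntimeException("SR failed on trial " + trial, e);
            }
        }
        // Only after maxTrials failures does the exception reach the
        // higher level.
        throw last;
    }
}
```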
5.4 Bwitter Request Handler
In this section we detail the most important algorithms we used in the Bwitter
Request Handler: posting a tweet, reading tweets and deleting tweets.
We have developed two different approaches for posting and reading tweets: the
push and the pull. As represented in Figure 5.5, the push approach (on the left) is
when the user who posts a tweet is responsible for inserting the references inside the
lines of his followers. The pull approach (on the right) is when the user fetches all the
references himself from the lines of the people he follows.
Figure 5.5: Representation of the read (dotted arrows) and write (full arrows) processes. The tweets are posted in lines (rectangles) and read from them. Left) Push design with one line per reader. Right) Pull design with one line per writer.
Note that in the following algorithms we never post the whole tweet object into
the lines of the followers, but rather a reference to it. So when we say we post a tweet
in someone's line, we mean the tweet reference. As explained in the previous chapter,
a reference contains the posting date, the username of the author or retweeter, the
username of the original poster if it is a retweet, and the key of the referenced tweet.
This limits the amount of redundant data stored and makes it possible to easily edit or
delete a tweet, as explained in section 5.4.1 where we detail the operations to delete
a tweet.
All the algorithms we designed use the Scalaris Connection Manager we just pre-
sented. We do not develop any recovery mechanism to handle the failure of the execu-
tion of one of the SRs involved in an algorithm. We assume that the recovery mechanism
of the Scalaris Connection Manager is sufficient, so that failures will be scarce. A failure
of one SR can thus, in some cases, make a higher level function abort, forcing the user
to manually restart the operation.
The algorithms are developed to work on a transactional key/value datastore with
a list abstraction. However, this abstraction is not strictly necessary: we could use a
classical key/value datastore and develop our own parsing to simulate lists.
An important fact to keep in mind is that concurrent reads on the same key do not
conflict in Scalaris: two parallel reads in two different transactions will not lead to an
abort. On the other hand, two parallel transactions, one writing a key and the other
writing or reading the same key, will conflict, and this can lead to an abort.
In this section, we explicitly detail the pseudo code for posting and reading tweets
in the pull and push approaches, and compare the number of Scalaris operations both
need. We do not take into account the fact that some operations can fail. The number
of operations in the worst case can easily be computed from the number of operations
in the normal case by multiplying it by a factor of k, where k is the maximum number
of trials for one SR. We use the notation SR(some piece of code) to indicate that a
piece of code is executed by the SCM inside an SR; this call is thus non-blocking. To
get the result of a piece of code executed inside an SR we write result = SR(some piece
of code); this blocks until the result is computed or an exception is thrown.
5.4.1 The push approach
Post a Tweet
Posting a tweet is a core function of our application; it is thus important to have an
efficient and robust way to post tweets. Our tweet posting algorithm must be able to
handle the failure of a datastore node as well as the failure of an application node
during the posting of the tweet. This algorithm must also scale with the number of
followers, keeping in mind that some users can have millions of followers.
Below you can see the skeleton of the algorithm. It is composed of three main parts:
posting the tweet object, posting the references to the followers' lines and updating
the value of the last tweet correctly posted. Those are detailed in the next subsections.
The algorithm can be adapted for retweets and replies to tweets, but we do not detail
this here as it is very similar.
postNewTweet(posterName , msg){
  // First step
  // Post the tweet object. If this step succeeds the tweet is
  // eventually posted everywhere.
  tweetNbr = SR(posttweet(posterName , msg))
  // Second step
  // Produce the chunkProcessors. Each of them is an SR responsible for
  // posting the remaining tweets to the followers' lines for a given
  // chunk of the Topost set until it reaches tweetNbr.
  SRlist = SR(produceChunkProcessors(posterName , tweetNbr))
  // Add all the chunkProcessors produced to the SCM.
  foreach sr in SRlist {
    add sr to the SCM
  }
  // Wait for the termination of the chunkProcessors and check that none
  // failed (all the tweets until tweetNbr correctly posted for all the
  // followers).
  failed = false
  foreach sr in SRlist {
    try {
      // Block until the result is computed; no real result is returned,
      // we just check that nothing went wrong.
      result = sr.get
    } catch (exception) {
      failed = true
    }
  }
  // Third step
  // If none of the previously launched chunkProcessors has failed, mark
  // the tweets as posted until tweetNbr.
  if(!failed){
    SR(markTheTweetsAsPosted(posterName , tweetNbr))
  }
}
The algorithm starts with the posting of the tweet. If this first step finishes without
errors, we can guarantee that the tweet will eventually be posted on the lines of all the
followers; otherwise we abort the posting of the tweet and the user must restart it
manually. The second step is responsible for pushing the information to the lines of the
followers. This is the heavy part of the job, so we decided to cut it into several
independent SRs that run on several Scalaris nodes in parallel. We added a repair
mechanism which logs the operations successfully performed in order to recover from
failures during this part. Finally, the last step marks the tweet as correctly posted
on all the lines. As just mentioned, only the first step is needed to have the tweet
eventually posted to all the followers. Subsequent executions of the algorithm will
indeed automatically repair previously started work that failed. This repair is done
during the “Post the references” phase detailed later in this section.
Post the tweet object This first step is executed as one SR to guarantee atomicity.
As mentioned, if it succeeds the tweet will eventually be posted on all the lines. A
tweetNbr uniquely identifies one tweet for a given user; as you can see below, it is
attributed at the creation of the tweet.
posttweet(posterName , msg){
  begin transaction
    tweetNbr = read(/user/posterName/tweet/size)
    postingDate = currentDate()
    tweetReference = buildTweetRef(posterName , tweetNbr , postingDate)
    tweet = buildTweet(posterName , tweetNbr , postingDate , msg)
    write(/user/posterName/tweet/tweetNbr/reference , tweetReference)
    write(/user/posterName/tweet/size , tweetNbr+1)
    write(/user/posterName/tweet/tweetNbr , tweet)
  end transaction
  return tweetNbr
}
You can notice that we save the tweet reference in order to easily recover from
failure later.
Post the references We now explain the next step of the posting algorithm, which
posts the references on the lines of the followers. This step repairs any previously
started tweet posting that failed after the first step, and it can also be run for this
purpose only. This repair mechanism is needed because this part of the algorithm is
highly subject to failures: it writes to the line of every follower, so it can potentially
conflict with followers reading their lines and with other posters posting their tweets.
This step is cut in two substeps: the first substep is to create the chunkProcessors and
the second one is to execute them.
Some stars have millions of followers, so it would not be scalable to do the whole
work in one big transaction. Therefore, we split the work into several SRs run on dif-
ferent Scalaris nodes. Remember that the Topost set is cut into several chunks. We
associate one SR with each chunk of the Topost set; it is responsible for posting all the
remaining tweets (usually just the newly posted tweet, if there were no failures before)
to all the followers in its attributed chunk. We call an SR with this precise task
a chunkProcessor. A chunkProcessor stops when it reaches the tweetNbr with
which it was initialized; tweetNbr corresponds to the tweet number of the last tweet
that the chunkProcessor must post to the lines. If a chunkProcessor finishes with-
out error, we are sure that all the tweets up to tweetNbr are correctly posted for this
Topost chunk. The pseudo code below details the creation of the chunkProcessors.
// tweetNbr is the last tweet to post on the lines.
produceChunkProcessors(posterName , tweetNbr){
  // First we check whether the job is already done. This step can be
  // skipped as it is just an optimisation. This test is equivalent to
  // testing that each counter associated with a chunk of the Topost set
  // has a value of at least tweetNbr, but is quicker as only one key
  // must be accessed.
  begin transaction
    lastTweetNbrCorrectlyProcessed = read(/user/posterName/tweet/processed)
  end transaction
  if(lastTweetNbrCorrectlyProcessed >= tweetNbr)
    return new emptyList
  // Read the number of chunks in the Topost set.
  begin transaction
    nbrOfToPostSetChunks =
        read(/user/posterName/topost/size) / topostSetChunkSize + 1
  end transaction
  // Create the different chunkProcessors.
  chunkIndex = 0
  SRlist = new emptyList
  while(chunkIndex < nbrOfToPostSetChunks){
    SRlist.add(new chunkProcessor(posterName , chunkIndex , tweetNbr))
    chunkIndex++
  }
  return SRlist
}
You can notice that in this part of the algorithm we only create the chunkPro-
cessors and do not execute them. They are executed by the postNewTweet skeleton
detailed at the beginning of the section. Indeed, chunkProcessors are SRs, and, as
explained at the end of section 5.3 where we present the SCM, an SR must never wait
for the result of another SR it launched, because this can create deadlocks. We detail
below the algorithm of a chunkProcessor, which shows how we post the references
to all the lines of the followers contained in a chunk.
chunkProcessor(posterName , chunkIndex , tweetNbr){
  while(true){
    begin transaction
      // Compare the value of the current chunkCounter with tweetNbr; if
      // chunkCounter is bigger or equal the job is done.
      chunkCounter = read(/user/posterName/topost/chunkIndex/counter)
      if(chunkCounter >= tweetNbr) return
      // Get the reference corresponding to the next tweet that is not
      // posted. The tweet number of this tweet is equal to
      // chunkCounter+1, as chunkCounter is the last correctly posted
      // tweet for this Topost set chunk.
      tweetReference = read(/user/posterName/tweet/(chunkCounter +1)/reference)
      // Read the Topost chunk containing the references to the
      // followers' lines.
      lineKeys = read(/user/posterName/topost/chunkIndex)
      // Add the reference to all the lines referenced in the chunk; the
      // function called must execute in the same transaction as the
      // current one.
      foreach lineKey in lineKeys {
        addReferenceToLine(tweetReference , lineKey)
      }
      // Log the progress for this chunk so that a failed posting can be
      // repaired later.
      write(/user/posterName/topost/chunkIndex/counter , chunkCounter +1)
    end transaction
    // If chunkCounter has reached tweetNbr exit the loop.
    if(chunkCounter +1 >= tweetNbr) return
  }
}
We must now detail the addReferenceToLine function, which is responsible for posting
a particular reference to the line of a follower. Remember that the lines containing the
references are divided into chunks too. For this part we thus have two choices: either
we cut the line while posting, or we put the burden of cutting at another moment (for
example when a user reads his new tweets).
Triggered cutting In the “triggered cutting” solution we do not cut the line during
the posting. We indeed prefer to do it at another moment in order to ease the post
tweet function, which is already quite subject to failures. The only necessary operation
is thus to add the tweet reference to the head.
addReferenceToLine(tweetReference, lineKey){
    add(lineKey/head, tweetReference)
}
The line must thus be cut at another moment. We chose to do it when a user reads
his tweets. Indeed the only operation needed to check if the head must be cut is to
read the head, which is almost always what the read tweet operation does, as the head
chunk contains the latest tweets. Hence the algorithm presented below must be run
each time a user reads his tweets. Most of the time the algorithm does not impose any
overhead on readings as the head must only be cut when it is full.
By taking advantage of the read tweet operation, we can avoid reading the head
during the cutting mechanism. However, we present below a version of the cut mech-
anism where we read the head in order to show the complete algorithm. In a real
implementation the head would be given as argument.
splitHead(lineKey){
    begin transaction
        headChunk = read(lineKey/head)
        headChanged = false
        nbrOfChunkCreated = 0
        if(headChunk.size <= nbrTweetsPerChunk)
            return
        // Number of chunks in the line excluding the head.
        nbrOfChunkInLine = read(lineKey/nbrchunks)
        // While the head is too big we transfer nbrTweetsPerChunk tweets to a
        // new chunk.
        while(headChunk.size > nbrTweetsPerChunk){
            headChanged = true
            // Remove the nbrTweetsPerChunk oldest tweets from headChunk; this
            // does not modify the datastore, just our local copy.
            newChunk = removeOldest(headChunk, nbrTweetsPerChunk)
            // Write the new chunk in the line.
            write(lineKey/(nbrOfChunkInLine + nbrOfChunkCreated), newChunk)
            nbrOfChunkCreated++
        }
        // If the head has changed, write the new head and update the number of
        // chunks.
        if(headChanged){
            write(lineKey/head, headChunk)
            write(lineKey/nbrchunks, nbrOfChunkInLine + nbrOfChunkCreated)
        }
    end transaction
}
In conclusion, we can observe that posting on the line is very cheap, because we only
have to add an element to a set, but we have to pay the price later to split the line.
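The triggered-cutting pair (append-only posting plus a deferred head split) can be sketched in Python, with a plain dict standing in for the Scalaris key/value store and no transactions shown; all names (`store`, `split_head`, `NBR_TWEETS_PER_CHUNK`) and the chunk size are illustrative, not Bwitter's actual API.

```python
NBR_TWEETS_PER_CHUNK = 3  # example value, kept small for illustration

def add_reference_to_line(store, line_key, tweet_ref):
    # Posting only appends to the head; cutting is deferred to read time.
    store.setdefault(f"{line_key}/head", []).append(tweet_ref)

def split_head(store, line_key):
    # Run when a user reads his tweets: move full groups of the oldest
    # references out of the head into numbered chunks.
    head = store.get(f"{line_key}/head", [])
    if len(head) <= NBR_TWEETS_PER_CHUNK:
        return
    nbr_chunks = store.get(f"{line_key}/nbrchunks", 0)
    created = 0
    while len(head) > NBR_TWEETS_PER_CHUNK:
        # Oldest references are at the front of the head list.
        new_chunk, head = head[:NBR_TWEETS_PER_CHUNK], head[NBR_TWEETS_PER_CHUNK:]
        store[f"{line_key}/{nbr_chunks + created}"] = new_chunk
        created += 1
    store[f"{line_key}/head"] = head
    store[f"{line_key}/nbrchunks"] = nbr_chunks + created

store = {}
for ref in range(7):
    add_reference_to_line(store, "/user/bob/line/main", ref)
split_head(store, "/user/bob/line/main")
```

After seven posts and one split, references 0–5 end up in two numbered chunks and only the newest reference remains in the head, mirroring the splitHead pseudocode above.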
Cutting the line while posting We now present the addReferenceToLine version
where we cut the line while posting. We add the tweet to the head of the line and, if
the head is full, we flush the head and create a new chunk. The overhead for cutting is
thus paid while posting, but not at each post.
addReferenceToLine(tweetReference, lineKey){
    // Read the head.
    headList = read(lineKey/head)
    // Check if the head is full and we need to create a new chunk.
    if(headList.size >= nbrTweetsPerChunk){
        // Replace the head by a fresh one containing only the new tweet.
        newList = new list
        newList.add(tweetReference)
        write(lineKey/head, newList)
        chunkNumber = read(lineKey/nbrchunks)
        // Write the old head to the new chunk and update the number of chunks
        // in the line.
        write(lineKey/chunkNumber, headList)
        write(lineKey/nbrchunks, chunkNumber + 1)
    }
    else{
        headList.add(tweetReference)
        write(lineKey/head, headList)
    }
}
Observe that we usually do not perform more operations than in the triggered cutting,
as we only need to create a new chunk when the head is full. Adding a reference to
a line thus usually takes 1 read and 1 write, and occasionally 2 reads and 3 writes.
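The cut-while-posting variant can be sketched the same way, again with a dict standing in for the datastore and illustrative names; the two branches correspond to the usual 1-read/1-write case and the occasional 2-read/3-write case described above.

```python
NBR_TWEETS_PER_CHUNK = 3  # example value, kept small for illustration

def add_reference_to_line(store, line_key, tweet_ref):
    head = store.get(f"{line_key}/head", [])
    if len(head) >= NBR_TWEETS_PER_CHUNK:
        # Head is full: start a fresh head holding only the new reference,
        # flush the old head into a numbered chunk, and bump the chunk count
        # (the occasional 2-read/3-write case).
        store[f"{line_key}/head"] = [tweet_ref]
        chunk_number = store.get(f"{line_key}/nbrchunks", 0)
        store[f"{line_key}/{chunk_number}"] = head
        store[f"{line_key}/nbrchunks"] = chunk_number + 1
    else:
        # Usual case: one read and one write on the head.
        head.append(tweet_ref)
        store[f"{line_key}/head"] = head

store = {}
for ref in range(7):
    add_reference_to_line(store, "/user/bob/line/main", ref)
```

The resulting layout is the same as with triggered cutting; only the moment at which the cost is paid differs.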
Chronological ordering We have shown how to post tweets on lines in a reliable and
efficient way; however, some tweets might be misplaced due to application failures
during the posts or latency in the network. We propose an improvement to maintain
strong chronological ordering between the tweets in all situations.
The idea is to associate a date to each chunk of a line. This date would be
equal to the posting date of the newest tweet of the previous chunk. This way, when
we add a tweet to a chunk, we check that its posting date is newer than the date
associated with the chunk. If it is not the case, we walk back through
the line, find the first chunk for which it is true, and add the tweet to that chunk.
This means that we can have more than nbrTweetsPerChunk tweets per chunk but
this has no repercussions on the other algorithms. We can adapt the two algorithms
described above to impose chronological ordering as just explained but we do not detail
it here.
This complicates the posting algorithm, which should be as light as possible in order
to achieve the best scalability. We believe that it is not absolutely crucial to have
perfect ordering between the tweets, and that we thus should not make the post
algorithm even more complex.
Mark tweet as correctly posted This is the final step of the algorithm: if every-
thing succeeded before, we can be sure that the tweets of the user up to tweetNbr are
correctly posted on the lines of the followers present in his Topost set at the time of
the posting. We can thus update the lastTweetNbrCorrectlyProcessed variable to
tweetNbr. As already mentioned, this step is not mandatory and could be skipped; it
only permits testing later, more efficiently, that the tweets were correctly posted in the
produceChunkProcessors part of the postTweet algorithm. We must take into account
that several runs of the postNewTweet algorithm can be running concurrently. This
can happen if a user posts two tweets in quick succession or if the recovery part of the
algorithm was called in response to some event. It is thus crucial to test the value of
lastTweetNbrCorrectlyProcessed before erasing it with tweetNbr: indeed, another
run posting a newer tweet (and thus a tweet with a higher tweetNbr than the one we
are working on) may just have written a newer value for lastTweetNbrCorrectlyPro-
cessed.
markTheTweetsAsPosted(posterName , tweetNbr){
begin transaction
lastTweetNbrCorrectlyProcessed = read(/user/posterName/tweet/processed)
if(lastTweetNbrCorrectlyProcessed < tweetNbr)
write(/user/posterName/tweet/processed , tweetNbr)
end transaction
}
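The conditional update above is a monotonic watermark. A minimal sketch, assuming a dict stands in for the store (in Bwitter the read/compare/write runs inside one Scalaris transaction); names are illustrative.

```python
def mark_tweets_as_posted(store, poster, tweet_nbr):
    key = f"/user/{poster}/tweet/processed"
    last = store.get(key, -1)
    # Only advance the watermark: a concurrent run posting a newer tweet may
    # already have written a higher value, which must not be overwritten.
    if last < tweet_nbr:
        store[key] = tweet_nbr

store = {}
mark_tweets_as_posted(store, "alice", 5)  # advances the watermark to 5
mark_tweets_as_posted(store, "alice", 3)  # stale concurrent run: no effect
```

Without the comparison, the second (stale) run would silently move the watermark backwards and later recovery would re-post tweets 4 and 5.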
Theoretical performance analysis This algorithm is heavy, which is normal: in
the push approach we favour the reads and put the burden on the writes. Let us try
to get an idea of how many operations these algorithms need. When we talk about
operations we mean reads and writes on Scalaris. We have observed while testing that
writes and reads on Scalaris take approximately the same time. Adding a value to
a list using Scalaris requires a read and a write, because there are no built-in
operations on sets. We now analyse the three steps of the algorithm.
The first step is the posting of the tweet in the datastore. It requires 4 operations (1
read and 3 writes) in one transaction to post the tweet object, post the tweet reference
and update the tweetNbr.
The second step is the posting of the references in all the lines. This is the heaviest
step, the number of operations depends on the number of followers (nbrFollowers)
that a given user has. We first check the lastTweetNbrProcessed (one read): the
job is done if the check indicates that everything is correctly posted, this can happen
during recovery and concurrent posting. Assuming that we are in the normal situation
where everything goes correctly, we have one tweet to post and all the previous tweets
were correctly posted on all the lines. We read the size of the Topost set (one read)
then we can dispatch the work for each chunk of the topost set. So we need two reads
to create the chunkProcessors.
Each chunk of the Topost set requires one transaction, the size of the transaction
(number of keys it works on) depends on the number of followers per chunk of the
Topost set (nbrOfFollowersPerChunk). The size of the transaction for each chunk is
proportional to nbrOfFollowersPerChunk and the number of transactions is inversely
proportional to nbrOfFollowersPerChunk. We assume we had no failures previously
and that there is only one tweet to post for all the chunks. Each chunkProcessor thus
reads and writes its counter once. This requires 2 × nbrOfTopostSetChunks
operations, or equivalently 2 × nbrFollowers/nbrOfFollowersPerChunk.
We must post the reference on the lines of each follower in the Topost set. The
complexity of this operation depends on whether we cut the line while writing or not.
If we do not cut it, we only need 2 operations to update the head with the new reference.
If we cut while posting, we must sometimes also create a new chunk and flush the head,
which requires 3 additional operations. On average we must cut the line every nbrTweet-
sPerChunk tweets; we thus take the amortised cost of cutting for one posting to
be 3/nbrTweetsPerChunk. Those operations must be done for every follower, so we
finally get 2 × nbrFollowers operations for the posting without cutting and
nbrFollowers × (2 + 3/nbrTweetsPerChunk) operations for the posting with cutting.
Although we do not cut the lines while posting in the first option, we would still
like to compute the overhead of cutting those lines at another moment, for example
while reading the new tweets, as this allows us to avoid the extra read of the head.
For each new chunk created while cutting the head we need to do one write (thus
nbrNewChunk writes in total). We must also flush the head (one write) and we must
update the number of chunks in the line (one read and one write). We thus have 3 +
nbrNewChunk operations. If we consider that a reader reads his tweets regularly,
nbrNewChunk is generally equal to 1.
The last step only requires 2 operations in one transaction, one to read the last-
TweetProcessed and one to update it.
To summarise, the number of operations (nbOp) to perform differs depending on
whether we cut while reading or while writing:
• Cutting while reading:

  nbOp = 8 + 2 × nbrFollowers + 2 × nbrFollowers / nbrOfFollowersPerChunk
       = 8 + nbrFollowers × (2 + 2 / nbrOfFollowersPerChunk)        (5.1)

• Cutting while writing:

  nbOp = 8 + nbrFollowers × (2 + 3 / nbrTweetsPerChunk)
           + 2 × nbrFollowers / nbrOfFollowersPerChunk
       = 8 + nbrFollowers × (2 + 2 / nbrOfFollowersPerChunk + 3 / nbrTweetsPerChunk)        (5.2)
• Difference between the two techniques:

  Diff = nbrFollowers × 3 / nbrTweetsPerChunk        (5.3)
The difference between the two is small but we believe that the overhead introduced
for cutting can have side effects.
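Equations (5.1)–(5.3) can be checked numerically. The sketch below evaluates both cost formulas; the parameter values are arbitrary examples chosen for illustration, not measurements.

```python
def nb_op_triggered(nbr_followers, followers_per_chunk):
    # Cutting while reading, equation (5.1).
    return 8 + nbr_followers * (2 + 2 / followers_per_chunk)

def nb_op_while_posting(nbr_followers, followers_per_chunk, tweets_per_chunk):
    # Cutting while writing, equation (5.2).
    return 8 + nbr_followers * (2 + 2 / followers_per_chunk
                                  + 3 / tweets_per_chunk)

# Example parameters: 1000 followers, 50 followers per Topost chunk,
# 20 tweets per line chunk.
followers, f_per_chunk, t_per_chunk = 1000, 50, 20
diff = (nb_op_while_posting(followers, f_per_chunk, t_per_chunk)
        - nb_op_triggered(followers, f_per_chunk))
# diff equals nbrFollowers * 3 / nbrTweetsPerChunk, equation (5.3):
# here 1000 * 3 / 20 = 150 extra operations per post.
```

For these example values the per-post gap stays small relative to the total cost (about 150 operations out of roughly 2200), which is consistent with the remark above that the difference between the two techniques is small.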
The number of operations involved in the posting of a tweet is mainly influenced
by two parameters that we can control: nbrOfFollowersPerChunk and nbrTweet-
sPerChunk. Increasing nbrOfFollowersPerChunk reduces the burden induced by
the management of the counter associated with each chunk of the Topost set. However,
it also makes transactions more complex because each chunkProcessor works on more
elements. On the other hand increasing nbrTweetsPerChunk means that we cut
the line less often, but we waste resources each time we update a chunk because we
are forced to load a bigger chunk. We ran tests, presented in section 6.3.2 of the
experiments chapter, to observe the impact of the nbrOfFollowersPerChunk parameter.
As pointed out previously, if we want to guarantee strong ordering of the tweets
between the chunks of a line we have to perform more operations. With the automatic
cutting design solution we would need to read one additional structure at each tweet
insertion. In the normal case the tweet is inserted in the head chunk because it is
newer than all the ones previously posted. However, sometimes we have to insert the
tweet in an older chunk. This means that we have to walk back and find the adequate
chunk, involving in the worst case as many operations as there are chunks in the line;
this case is however very unlikely to happen. With the triggered cutting solution we
would not need to do any additional operations during the insertion because we always
insert the tweet in the first chunk. The burden related to the walk back would be
transferred on the split head function.
Delete a tweet
If the full tweet object were posted on all the lines, the delete operation would need
to find back all the lines where the tweet was posted and remove the tweet from all
of them. This would be really impractical for several reasons. First, you
would need to find back where a particular tweet was posted. Indeed, it is not enough
to know the lines where a tweet was posted, you must also find the chunk of the line in
which the tweet was posted. You must thus either maintain this information for each
tweet or walk through all the chunks of the line in order to find and delete the tweet.
This is why we post references to the tweets in the lines. To delete a tweet we only
need to access the tweet object that is located at a given key and mark it as deleted.
The BRH checks the mark when fetching a tweet and discards it if it has been marked
as deleted.
Reading tweets
We will now explain how we fetch tweets from the lines. Users on social networks
usually want to retrieve the latest news and less frequently walk back to find older
posts. We thus assume that users want to retrieve the tweets from the newest to the
oldest. So we do not load the whole line, instead we load only some tweets from it.
Because lines are already cut in chunks it is natural to fetch one chunk of the line at
a time starting with the first chunk of the line, called the head, which contains the
newest tweets. However it is possible to access directly one chunk of the line if needed.
The first chunk of the line can be directly accessed because the head is at a fixed
location. We suppose that the line is already cut when we read it. If we want to access
the chunk that follows the head we have to retrieve the number of chunks in the line,
compute the key of the penultimate chunk and request it.
The next step is to filter the references in order to discard the tweets posted by
users we do not follow anymore. Indeed, we never remove the tweets posted by a user
from a line. It means that all the tweets that were posted while we were following a
user that we do not follow any more will stay forever on the line. It also implies that
if we decide to follow again a user his tweets will reappear on the line.
Chunks only contain references to tweets, we thus still have to fetch the tweets using
the references remaining after the filtering. Once we have retrieved the tweets we filter
the deleted tweets. You can notice that we are forced to load the tweets before we can
filter the dead tweets as the references do not indicate if the tweet is deleted or not.
Once the filtering has been done we can return the pack of tweets remaining.
We present below the pseudo code we have implemented to read nbrTweets tweets
from a line. This code is run in an SR. To avoid complicating the code we did not show
the recovery mechanism inside it. In the implementation, while we are fetching the
tweets we do not abort the operation if we could not fetch a tweet, instead we just skip
it. The SR fails only if other data is not accessible as it is needed to fetch the tweets.
One missing tweet, on the other hand, does not compromise the rest of the operation.
We could also split the SR in two parts if we want to add the cutting mechanism. The
first part of the SR would read the head and split it if needed. Then it would give as
argument the current head to the second part of the algorithm removing the need to
read it again.
getTweetsFromLine(nbrTweets, linename, username){
    refList = read(/user/username/line/linename/head)
    chunkIndex = read(/user/username/line/linename/nbrchunks) - 1
    while(refList.size < nbrTweets && chunkIndex > -1){
        // Read the current chunk.
        refList.add(read(/user/username/line/linename/chunkIndex))
        chunkIndex--
    }
    // Discard the references of users we do not follow anymore.
    users = read(/user/username/line/linename/users)
    filter(refList, users)
    tweets = new tweetList
    foreach tweetRef in refList{
        tweet = read(/user/tweetRef.posterName/tweet/tweetRef.tweetNbr)
        if(! tweet.isDeleted)
            tweets.add(tweet)
    }
    orderTweetsFromNewestToOldest(tweets)
    return tweets
}
Having the pseudo code we can, as we did for the posting algorithm, compute the
number of operations needed on Scalaris. The number of chunks we read depends on
nbrTweets and nbrTweetsPerChunk. We read the number of chunks in the line (1
read). We then read nbrTweets/nbrTweetsPerChunk chunks to get nbrTweets
tweet references. Then, to filter the users, we must retrieve the user list associated
with the line (one read). We must then do nbrTweets reads (minus the number of
tweets associated with users that are no longer on the line) to get the real tweets.
Considering that all the tweets we fetched are posted by users still associated with the
line, the result is:
  nbOp = 2 + nbrTweets + nbrTweets / nbrTweetsPerChunk        (5.4)
The heavy part is thus the fetching of the tweets. Had we posted tweets instead of
references, we would obtain 2 + nbrTweets/nbrTweetsPerChunk operations,
tremendously reducing the number of operations to do (but not the amount of data to
fetch); however, the delete tweet operation would have been much more complex, as
we explained before.
Add a User to a line
We explain here how we add a new follower (newfollowed) to an existing line
(linename). We first check if newfollowed is not already in the set of users associated
with linename (one read). If it is not already present we add it (one write). Once it is
done we add a reference to linename in the Topost set of newfollowed (one read and
one write). We also create an object containing a reference towards the chunk of the
Topost set in which we added the reference to linename so that we can easily remove
this one later. Note that those are not the same chunks as the ones we use to divide
lines. In total we thus have a cost of 3 writes and 2 reads. Sometimes we must also
create a new chunk, in this case we must update the number of chunks and thus add 2
writes and one read.
addUserToLine(username, linename, newfollowed){
    SR(
        begin transaction
            users = read(/user/username/line/linename/users)
            if(newfollowed belongs to users)
                return
            users.add(newfollowed)
            write(/user/username/line/linename/users, users)
            lasttopostchunk = read(/user/newfollowed/topostset/nbrchunks) - 1
            reflist = read(/user/newfollowed/topostset/lasttopostchunk)
            // We must create a new chunk.
            if(reflist.size >= nbrOfFollowersPerChunk){
                lasttopostchunk++
                reflist = new list
                write(/user/newfollowed/topostset/nbrchunks, lasttopostchunk+1)
                // Create the counter associated with the chunk.
                lastTweetNbr = read(/user/newfollowed/tweet/size) - 1
                write(/user/newfollowed/topostset/lasttopostchunk/counter,
                      lastTweetNbr)
            }
            reflist.add(new ref(username, linename))
            write(/user/newfollowed/topostset/lasttopostchunk, reflist)
            // Store the index of the Topost set chunk we posted in, for easy
            // removal later.
            write(/user/username/newfollowed/linename/, lasttopostchunk)
        end transaction
    )
}
Remove a user from line
We now want to remove a user (followingUsername) from a line (linename). We
first remove followingUsername from the set of users associated with linename (one
read and one write). We then read the object (see “Add a user to a line”) containing
the number of the chunk of the Topost set in which we added the reference to linename,
and suppress it (one read and one write). We can then locate the chunk and remove
the reference from it (one read and one write). In total this gives 3 reads and 3 writes.
Note that we do not modify the number of chunks in the Topost set even if a chunk
becomes empty. Indeed, we do not want to remap the keys attributed to already
existing chunks, as those keys depend on this number of chunks.
removeUserFromLine(username, linename, followingUsername){
    SR(
        begin transaction
            users = read(/user/username/line/linename/users)
            if(! followingUsername belongs to users)
                return
            users.remove(followingUsername)
            write(/user/username/line/linename/users, users)
            // Locate the Topost set chunk in which the reference was stored
            // and suppress the locator object.
            topostchunk = read(/user/username/followingUsername/linename/)
            delete(/user/username/followingUsername/linename/)
            // Remove the reference from the chunk.
            reflist = read(/user/followingUsername/topostset/topostchunk)
            reflist.remove(new ref(username, linename))
            write(/user/followingUsername/topostset/topostchunk, reflist)
        end transaction
    )
}
Create a user
The first thing to do when creating a user is to check that there is not already another
user with the desired username registered in the system. To do so we check whether
there is already a value at the key “/user/username”. If there is already a value at
this key we can conclude that a user is already registered with this username
and the user creation is aborted. Otherwise a user object containing all the
information of the user is created and stored at this key.
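The check-then-create logic can be sketched as follows, with a dict standing in for the datastore; in Bwitter the check and the write would run in one transaction, so that two concurrent registrations of the same username cannot both succeed. All names are illustrative.

```python
def create_user(store, username, info):
    key = f"/user/{username}"
    if key in store:
        # A user with this name is already registered: abort the creation.
        return False
    # Store the user object, containing all the information of the user.
    store[key] = {"name": username, **info}
    return True

store = {}
first = create_user(store, "alice", {"mail": "a@example.org"})   # succeeds
second = create_user(store, "alice", {"mail": "b@example.org"})  # aborted
```

The second call returns False and leaves the first user object untouched.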
5.4.2 The pull approach
As it was the case in section 4.3.3 of the design of the datastore chapter, we tried
to re-use as many of the building blocks and mechanisms from the push approach as
possible while still having efficient algorithms. With the pull approach, only
the mechanisms needed to post and retrieve the tweets are heavily modified. Other
basic mechanisms, such as adding a user, are simplified because there is no need to
keep a Topost set up to date anymore and some fields no longer need to be initialised.
Finally, some simple mechanisms, such as deleting tweets, remain exactly the same.
Post a tweet
The tweet itself is posted in much the same way as with the push approach
explained in section 5.4.1. The difference is that the user now posts the tweet
reference at a single location only. This location varies with time: all the tweets
posted during a given time frame are grouped together and accessed via the same
rounded timestamp. We call the set containing all the tweets corresponding to a time
frame a postTimeFrame. The timestamp is rounded to the desired time granularity
by setting some of its fields to 0, as explained in section 4.3.3 of the datastore chapter.
posttweet(posterName, msg){
    SR(
        begin transaction
            tweetNbr = read(/user/posterName/tweet/size)
            tweet = buildTweet(posterName, msg, tweetNbr)
            write(/user/posterName/tweet/tweetNbr, tweet)
            write(/user/posterName/tweet/size, tweetNbr+1)
            postingDate = currentDate()
            tweetReference = buildTweetRef(posterName, tweetNbr, postingDate)
            // Round the posting date to the time granularity to obtain the key
            // of the postTimeFrame (cf. section 4.3.3).
            timestamp = roundToGranularity(postingDate)
            references = read(/user/posterName/tweet/timestamp)
            references.add(tweetReference)
            // Write the reference to the given postTimeFrame.
            write(/user/posterName/tweet/timestamp, references)
        end transaction
        return tweetNbr
    )
}
As expected, the post tweet operation is much lighter in this case, with only 5
operations in total (2 reads and 3 writes). One could wonder why we still post references
instead of tweets. The reason comes from the algorithm to read tweets: as we explain
in the next section, a time frame must be read for each of the followed users. We thus
wanted to limit the size of a time frame.
Reading the tweets
This operation is now heavier as it has to retrieve the references from each author.
We have kept the chunks number format for the sake of simplicity and compatibility
with the existing API. The chunk 0 is the very first chunk associated to the user, the
timestamp of this chunk is the rounded registration time of the user. For example, if
a user registered at 05/06/11 15 h 00 min 00 s GMT and the time granularity is
counted in hours, when he requests to read the chunk 2 he will fetch all the tweets
posted between 05/06/11 17 h 00 min 00 s GMT and 05/06/11 17 h 59 min 59 s
GMT by all the users he is following. If a chunk with a negative value is requested the
latest chunk is returned along with its real chunk number.
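The chunk-number-to-time-frame mapping described above can be sketched as follows, assuming an hourly granularity; `chunk_to_timeframe` is an illustrative name, not Bwitter's actual API.

```python
from datetime import datetime, timedelta

GRANULARITY = timedelta(hours=1)  # example time granularity

def chunk_to_timeframe(start_time, chunk_nbr):
    # Chunk 0 starts at the user's rounded registration time; chunk n covers
    # [start + n * granularity, start + (n + 1) * granularity).
    start = start_time + chunk_nbr * GRANULARITY
    return start, start + GRANULARITY

# User registered on 5 June 2011 at 15:00 GMT; chunk 2 then covers
# 17:00 to 18:00 on the same day, as in the example above.
frame = chunk_to_timeframe(datetime(2011, 6, 5, 15, 0, 0), 2)
```

In the thesis design, the lower bound of this interval is the rounded timestamp used as the key of the postTimeFrame to fetch.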
In the same fashion as what we did with the post tweet in the push approach, we
create a series of smaller tasks. In this case we create one SR per user followed that
is responsible for fetching the tweets of this user. The tweets fetched are the tweets in
the chunk corresponding to the chunk number cNbr which is given as argument to the
function getTweetsFromLine that we describe below.
getTweetsFromLine(username, linename, cNbr){
    // First step
    // Produce a list of SRs to add to the SCM; each SR takes care of one user.
    SRlist = SR(produceLineProcessors(username, linename, cNbr))
    // Second step
    // Add all the SRs produced to the SCM and get their tweets.
    foreach sr in SRlist
        add sr to the SCM
    result = new tweetList
    foreach sr in SRlist {
        try{
            // Block until the result is computed.
            result.add(sr.getTweets)
        } catch (exception) {
            // The tweets of a user could not be retrieved; in this case we
            // abort the reading.
            return null
        }
    }
    chronologicalSort(result)
    return result
}
produceLineProcessors This part creates one lineProcessor per followed user. A
lineProcessor takes as argument the key of the chunk that it must fetch. To compute
the key of the chunk we must convert cNbr to a date because, as already explained,
lines are fragmented according to the time and thus each chunk in a line corresponds
to a specific date.
produceLineProcessors(username, linename, cNbr){
    SR(
        startTime = read(/user/username/starttime)
        dateKey = chunkToDate(startTime, cNbr)
        users = read(/user/username/line/linename/users)
        SRlist = new emptyList
        for(User u: users){
            SRlist.add(new lineProcessor(dateKey, u))
        }
        return SRlist
    )
}
lineProcessor This part fetches the tweets posted by a given user during the
dateKey time frame. Note that no ordering is done at this stage, as all the tweets
are ordered at the end of getTweetsFromLine. As for the previous reading tweets
operation, we do not abort an SR if one of the tweets is not accessible, but rather
ignore the error, as this does not compromise the rest of the operations. This case is
supposed to happen very rarely: tweet objects, once stored, are only modified when
the author wants to delete them; otherwise they are only read, and reads are not
conflicting and should thus not abort.
lineProcessor(dateKey, username){
    SR(
        refList = read(/user/username/tweet/dateKey)
        tweets = new tweetList
        foreach tweetRef in refList{
            tweet = SR(read(/user/tweetRef.posterName/tweet/tweetRef.tweetNbr))
            if(! tweet.isDeleted)
                tweets.add(tweet)
        }
        return tweets
    )
}
Theoretical performance analysis The whole getTweetsFromLine operation is esti-
mated to perform 2 + nbrFollowing + nbrRetrievedTweets basic Scalaris operations,
where nbrFollowing is the number of users followed and nbrRetrievedTweets the
total number of tweets to retrieve. Indeed we need one read to determine dateKey
and one read to determine the users we follow. Then to fetch the references we must for
each user read the chunk of their line corresponding to dateKey, thus nbrFollowing
operations. Finally we must do nbrRetrievedTweets operations to read the tweets
corresponding to the tweet references we just read.
Why not store the built line chunks An alternative approach would be to keep
the work done and to store the built chunk when it has been read. This would avoid
the need to build several times the same chunk of a line. The application could do a
simple check to see if a given chunk has already been built and, if it is the case, retrieve
the references from the chunk previously stored.
This might sound like an interesting optimisation but we have to keep in mind the
way our application is going to be used. Users almost never re-read tweets they have
already read, they usually want to see the last posted tweets. This means that they are
going to load the latest chunk to see if there are new references in it. This implies that
the latest chunk has to be rebuilt from scratch and thus storing the previous tweets
references will not speed up this operation. Furthermore, keeping previously built
chunks of the line increases the number of checks and operations to perform when the
references are not already stored on the follower side, which is precisely the case when
reading previously unread tweets. Finally, the most obvious advantage of not storing
line chunks is that it decreases the space complexity.
We thus decided against this solution because it complicates the implementation,
increases the amount of data to keep in the system and slows down the most used
operations in order to increase the performance of rarely used operations.
5.4.3 Theoretical comparison of Pull and Push approach
We are first going to compare the two approaches based on the complexities we
computed in the previous section. We then try to give an intuition of the impact of
those complexities on the behaviour of Bwitter when used by simulated users.
Summary of the complexities
We are now going to compare the complexity of the push and pull approaches for
the two main operations, postTweet and getTweetsFromLine. Below we present a
summary of the cost of those operations for both the push and pull approaches.
• Push - postTweet

  nbOp = 8 + nbrFollowers × (2 + 3 / nbrTweetsPerChunk + 2 / nbrOfFollowersPerChunk)        (5.5)

• Pull - postTweet

  nbOp = 5        (5.6)

• Push - getTweetsFromLine

  nbOp = 2 + nbrTweets + nbrTweets / nbrTweetsPerChunk        (5.7)

• Pull - getTweetsFromLine

  nbOp = 2 + nbrFollowings + nbrRetrievedTweets        (5.8)
Before we start, here is a reminder of the different terms involved:
• nbrFollowers: number of users the user is followed by (pull/push).
• nbrFollowing: number of users the user follows (pull/push).
• nbrTweets: minimum number of tweets we want to retrieve (push).
• nbrTweetsPerChunk: number of tweets in one chunk (push).
• nbrRetrievedTweets: number of tweets retrieved in one get (pull).
• nbrOfFollowersPerChunk: number of followers in a chunk of the Topost set
(push).
• time granularity: the time frame corresponding to a post chunk (pull).
As announced, the post operation in the push approach is clearly much heavier than
in the pull approach. Indeed, the time to post a tweet using the pull design is constant,
while in the push approach it depends on the number of users that follow you.
Conversely, the read is lighter in the push approach: it depends on the number of
tweets retrieved, as in the pull, but its complexity does not grow with nbrFollowings
as is the case in the pull.
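To make this comparison concrete, the four cost formulas (5.5)–(5.8) can be evaluated for sample user profiles; all parameter values below are arbitrary illustrations, not measurements from Bwitter.

```python
def push_post(nbr_followers, tweets_per_chunk=20, followers_per_chunk=50):
    # Equation (5.5): push post cost grows with the number of followers.
    return 8 + nbr_followers * (2 + 3 / tweets_per_chunk
                                  + 2 / followers_per_chunk)

def pull_post():
    # Equation (5.6): pull post cost is constant.
    return 5

def push_read(nbr_tweets, tweets_per_chunk=20):
    # Equation (5.7): push read cost grows with the tweets retrieved.
    return 2 + nbr_tweets + nbr_tweets / tweets_per_chunk

def pull_read(nbr_followings, nbr_retrieved_tweets):
    # Equation (5.8): pull read cost also grows with the followings.
    return 2 + nbr_followings + nbr_retrieved_tweets

# A "star" with 100000 followers pays heavily per post in the push design,
# while a pull post always costs 5 operations.
star_push_post = push_post(100_000)
# Reading 20 tweets while following 200 users:
push_cost = push_read(20)
pull_cost = pull_read(200, 20)
```

For these sample values the push read is an order of magnitude cheaper than the pull read, while the star's push post is five orders of magnitude more expensive than a pull post, illustrating where each design places its burden.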
The two designs thus each have their own heavy operation: the post for the push and
the read for the pull. However, we believe that the first is more resistant to failures,
because it does not need to succeed immediately and can be recovered later, while the
second must successfully read from all the followings in order to produce a result. If
we further consider that a user does not like to wait, the push is more reasonable: after
the first step of the post we can already tell the user that the operation was a success,
whereas for the read in the pull we must wait until the end of the whole operation
before responding to a request. On the other hand, the push operations involve many
more conflicts, as they do far more writes than the pull operations.
We still have not determined whether, from a complexity perspective, it is better to
use the push or the pull. To this end we would like to compare them according to the
number of followers and followings and the read and write rates.
Theoretical Bwitter simulation
We now simulate the two designs we have presented. The simulation aims at estimating
the global number of operations performed by the system and determining which design
is the best according to an unknown number of followers and read rate. This simulation
does not take into account failures during the algorithms, the size of the data
transferred, or the complexity of the transactions (the number of keys involved in a
transaction).
Description of the problem As we can see, the operations are not comparable as
they stand. Indeed, in the push approach we fetch at least a specified number of
tweets, while in the pull approach we retrieve an arbitrary number of tweets, depending
on the number of tweets posted during a given time frame. The two operations are
thus semantically different, and their complexities naturally depend on different
parameters. We would like to be able to compare them in terms of the total number of
operations done on Scalaris. The main problem is that the parameters of the system
are unknown. Indeed, each user of Bwitter is different; we define a user in terms of
his behaviour, with four parameters describing it:
• postingRate: the rate at which a user posts new tweets.
• readRate: the rate at which a user reads his tweets.
• nbrFollowers: the number of followers a user has.
• nbrFollowings: the number of followings a user has.
Moreover, in order to estimate the number of operations done, we must fix all the
design parameters involved in the complexities, namely nbrTweets, nbrTweetsPerChunk
and nbrFollowersPerChunk for the push, and time granularity and
nbrRetrievedTweets for the pull.
Assumptions The parameters we just described vary a lot between users and are
unknown; indeed, we did not find any precise statistics about the usage of Twitter.
Because we would still like to give an idea of the performance of our two designs
according to those parameters, we have fixed them to values chosen according to the
following assumptions:
(1) Users read their tweets more often than they post a tweet.
(2) Most of the users on Twitter have more followings than followers; we call them fans.
Other users have a lot of followers compared to the number of users they follow; we
call them stars. This means that nbrFollowers for fans is smaller than for stars.
(3) Users are only interested in new tweets, and when a user reads his tweets he reads
all the new tweets.
(4) readRate is the same for all the users and is the average of the read rates of each
user in the real network. Because we cannot compute it as we do not have the
figures, we take it as a parameter of the simulation.
(5) postingRate is the same for all the users and is the average of the posting rates
of each user in the real network. Because we cannot compute it as we do not have
the figures, we take it as a parameter of the simulation.
(6) nbrFollowings is the same for all the users.
The first three assumptions come from observing how Twitter is used. Generally,
users that connect to a Twitter application read their messages more often than they
post new ones. Moreover, 1% of the users of Twitter are responsible for 50% of its
content; this observation motivated our distinction between the star and fan behaviours.
The last three assumptions were made in order to simplify the following development.
Properties of the simulated system We define in this section two properties
that we derived from the assumptions made above. Those properties fix some of the
simulation parameters defined before.
First property: The number of new tweets a user reads when he reads his tweets
is constant and equal to nbrOfNewTweets.

nbrOfNewTweets = (postingRate × nbrFollowings) / readRate
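As a quick numeric illustration (the figures are ours, chosen only for the example, not taken from any measurement): a user with nbrFollowings = 40, a postingRate of 1 tweet per time unit and a readRate of 8 reads per time unit sees

$$\text{nbrOfNewTweets} = \frac{1 \times 40}{8} = 5$$

new tweets each time he reads.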
First, notice that nbrOfNewTweets is constant, as postingRate, nbrFollowings
and readRate are fixed. We derived this property from (3), (4), (5) and (6). It
allows us to fix nbrTweets (push) and nbrRetrievedTweets (pull) to
nbrOfNewTweets. We also decided to read only one chunk (push) and one
postTimeFrame per following (pull). This choice was made in order to simplify the
simulation, which is already rather complex, and it helps us fix the time
granularity and nbrTweetsPerChunk.
Concerning the push approach, it is easy to fix nbrTweets as it is a parameter of
the function call. In order for the tweets to be packed in the same number of chunks
at each read call, it is sufficient to choose nbrTweetsPerChunk to be a multiple of
nbrOfNewTweets, or put differently:
nbrTweetsPerChunk % nbrOfNewTweets = 0
We decided to fix nbrTweetsPerChunk to nbrOfNewTweets. Therefore, we
need to read exactly one chunk in order to have the new tweets at each read operation.
Concerning the pull approach, we cannot directly influence how many tweets are
read when performing a read tweet operation. However, we can fix the time granu-
larity so that:

time granularity % (1/readRate) = 0
This ensures that all the new tweets are always in the last postTimeFrame. We
chose to fix the time granularity to 1/readRate. This gives the smallest chunk
possible (no unused references loaded) while fulfilling the property just stated. Please
note that the choice of the time granularity does not have any direct influence on the
simulation, but we wanted to show that our design can be tuned to meet simulation
constraints.
Second property: Each user has the same number of followers (nbrFollowers
is fixed).
We now argue that this second property is not restrictive. Notice that it is aimed
at simplifying assumption (2): in other words, it claims that there is no distinction
between stars and fans, or any other way of distinguishing users based on their
nbrFollowers. This is in fact not needed. Our simulation estimates the global number
of operations performed by the system according to some user profile, and, thanks to
the two properties below, we can affirm that having some users with more followers
than others has no influence on the total number of operations.
(7) The postingRate and readRate are the same for all users (which is exactly what
we assumed at (4) and (5)).
(8) The complexities of the operations in the two designs are linear with respect
to the number of followers and followings (this can be observed by remembering that
nbrTweetsPerChunk and nbrFollowersPerChunk are constant parameters).
Property (8) states that one more follower for a user only increases the load he puts
on the system by a constant amount (the same for all users) for each operation he
performs. Hence, moving a follower from one user to another does not change the
total load put on the system, provided all users perform the same number of
operations. This last condition is exactly what property (7) states. If (7) were not
true, we could have a system with one user having lots of followers but a posting
rate equal to 0, and another user with few followers and a postingRate different
from 0. The first user would not generate any posting load as he never posts, but
transferring one of his followers to the second user would change the total load put
on the system. To summarise, thanks to (7) and (8), we can always move followers
from users having more followers to users having fewer followers without changing
the total number of operations performed on the network. It is thus not needed to
make a distinction between stars and fans.
In conclusion, the two properties we just defined fix the following relations between
the simulation parameters:
• nbrFollowers = nbrFollowings
• (postingRate × nbrFollowings) / readRate = nbrOfNewTweets = nbrTweets =
nbrRetrievedTweets = nbrTweetsPerChunk
• time granularity = 1/readRate
The simulation We now explain the final details of the simulation. Below are the
formulas we use to simulate Bwitter; the first is for the push design and the second
for the pull design.
Push:

nbOp = postingRate × (8 + nbrFollowers × (2 + 3/nbrNewTweets + 2/nbrOfFollowersPerChunk))
     + readRate × (3 + nbrNewTweets)   (5.9)

Pull:

nbOp = postingRate × 5 + readRate × (nbrFollowings + 2 + nbrNewTweets)   (5.10)
Those formulas compute the number of operations performed with respect to the
readRate, the postingRate and the nbrFollowers. Recall that an operation is a
transactional read or write. Because we do not simulate operations other than reading
and posting tweets, we have a direct relation between the two rates: if we normalise
them, readRate + postingRate = 1. We thus chose to make readRate vary from 0
to 1, with postingRate varying accordingly. We defined nbrUsers as the number of
users in the system. We chose nbrFollowers, which, as already stated, represents the
mean number of followers each user has, and thus also his number of followings as
nbrFollowings = nbrFollowers. Because we had no idea of the value of this number,
we chose some arbitrary values; the higher it is, the more socially connected the users
in our system are. Finally, we must fix the last unknown parameter:
nbrOfFollowersPerChunk. This parameter is only present in the push design; the
number of operations that must be done in a write operation decreases as it increases.
The problem is that it is difficult to fix a value for it: we cannot neglect its influence,
but we cannot decently set it very high either, as the number of keys involved in the
transactions while posting grows linearly with it. We thus made a compromise and set
it to 20. We summarise below the values of the parameters.
• nbrUsers = 100
• nbrFollowers = 10, 30, 70
• nbrOfFollowersPerChunk = 20
• (postingRate × nbrFollowings) / readRate = nbrOfNewTweets = nbrTweets =
nbrRetrievedTweets = nbrTweetsPerChunk
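The two formulas can be evaluated directly. The Java sketch below is our own illustration (the class and method names are hypothetical, not part of the Bwitter code base): it computes equations (5.9) and (5.10) under the normalisation readRate + postingRate = 1 and the relations above, and searches for the read rate at which the push becomes cheaper than the pull.

```java
// Sketch: evaluate equations (5.9) and (5.10) and locate the crossover point.
public class PushPullSim {
    static final int FOLLOWERS_PER_CHUNK = 20;   // the compromise value chosen above

    // First property: nbrOfNewTweets = postingRate * nbrFollowings / readRate,
    // with nbrFollowings = nbrFollowers and postingRate = 1 - readRate.
    static double newTweets(double readRate, int followers) {
        return (1.0 - readRate) * followers / readRate;
    }

    // Equation (5.9): operations generated per user in the push design.
    static double pushOps(double readRate, int followers) {
        double p = 1.0 - readRate, n = newTweets(readRate, followers);
        return p * (8 + followers * (2 + 3.0 / n + 2.0 / FOLLOWERS_PER_CHUNK))
             + readRate * (3 + n);
    }

    // Equation (5.10): operations generated per user in the pull design.
    static double pullOps(double readRate, int followers) {
        double p = 1.0 - readRate, n = newTweets(readRate, followers);
        return p * 5 + readRate * (followers + 2 + n);
    }

    // Smallest read rate (on a 0.01 grid) where push needs fewer operations;
    // NaN if pull is always the cheaper design.
    static double crossover(int followers) {
        for (double r = 0.01; r < 1.0; r += 0.01)
            if (pushOps(r, followers) < pullOps(r, followers)) return r;
        return Double.NaN;
    }
}
```

For nbrFollowers = 70 this sketch finds the crossover around readRate ≈ 0.7, consistent with the asymptote discussed below, and for very small follower counts it finds no crossover at all.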
We have plotted the results of our simulation in Figure 5.6. Lines go in pairs (one
push and one pull); lines with the same weight correspond to the same number of
followers. We have indicated the relevant intersections with big red dots.
Figure 5.6: Number of Scalaris operations with respect to the read rate: comparison between the pull and the push approach for nbrFollowers = 10, 30 and 70.
First, you can observe that all the lines of the pull approach are parallel; this means
that nbrFollowers influences the number of operations by a constant amount whatever
the readRate. We can also see that the number of operations in the pull approach
does not vary much with the readRate, which may be a surprising observation at first.
Even more surprisingly, it decreases slowly as the readRate grows.
Secondly, we can see that, as expected, as the readRate increases the push ap-
proach becomes more and more interesting. When nbrFollowers is smaller, we need
a higher readRate before the push approach becomes more interesting than the pull.
If you observe the red dots, an asymptote seems to appear, indicating that below some
readRate the push approach is never the better choice. We thus plotted the curve
defined by the intersections of the pull and push lines in Figure 5.7 to confirm this
intuition, keeping on the plot the lines already shown before to better visualise what
the curve represents. The curve shows the intersections for nbrFollowers between
4 and 300; nbrFollowers values smaller than 4 give intersections at a readRate
bigger than 1, which does not make sense.
Figure 5.7: Intersection of the push/pull lines for nbrFollowers between 4 and 300.
This curve can be used to determine which design is theoretically the best according
to nbrFollowers and readRate. We can observe an asymptote around readRate =
0.7; we did the math for nbrFollowers = 30000 and obtained readRate = 0.672.
We can also note that once nbrFollowers is higher than 70 the black curve becomes
nearly vertical. This means that for a readRate bigger than 0.672 and nbrFollowers
bigger than 70, the push approach is theoretically always the best in terms of the
number of Scalaris operations performed.
Conclusion
In conclusion, we have compared the push and the pull theoretically according to
an unknown mean nbrFollowers and an unknown readRate. We have seen that we
can find a value of the readRate under which the pull approach is always the best.
However, above this value, and if nbrFollowers is bigger than 70, the push approach
is the best. It seems safe to assume that we are in the second case for social networks
like Twitter.
Moreover, the read algorithm in the pull is heavier, and one must wait for its
termination in order to respond to a given call, which is not the case for the posting
in the push. Based on those observations, we believe that the push approach is the
better adapted for a social network like Twitter. We will see in the next chapter
whether the practical tests confirm this conclusion.
5.5 Conclusion
In this chapter we detailed the main modules of our implementation. The NM
is a powerful tool that allows us to manage the different machines we need to run
Scalaris nodes, and the SCM allows us to easily dispatch work on those nodes. The
BRH is the module on which we spent the most time and attention in order to design
the simplest and fastest algorithms, and we believe we have minimised the complexity
of our most used algorithms. Finally, our theoretical comparison between the push and
the pull approaches reinforces our belief that the push approach is probably better
adapted to our application. In the next chapter we run tests on Scalaris and on
Bwitter's pull and push variations.
Chapter 6
Experiments
This chapter details the experiments we ran on Scalaris and Bwitter. The first part
describes the Amazon Elastic Compute Cloud, the platform on which we did all our
tests. We then detail the tests performed on Scalaris and Bwitter; for both we run
scalability and elasticity tests. We start, in the second section of this chapter, with
Scalaris, as the results of the Bwitter tests are heavily influenced by those of Scalaris.
Bwitter is tested in the third part: we study the influence of a cache and of the
nbrOfFollowersPerChunk parameter for the push, then test the scalability and
elasticity of our Bwitter push solution. Finally, we study the scalability of the pull
approach and finish with the conclusion.
6.1 Working with Amazon
We did not want to simulate the cloud platform ourselves, as we felt it would not
reflect the way our application would ultimately be used. We thus decided to work
with the Amazon Elastic Compute Cloud (Amazon EC2), because it is a professional
and realistic work environment.
6.1.1 Choosing the right instance type
An instance is a virtual machine running on a physical machine; it is characterised
by four attributes: CPU, network capabilities (sometimes called IO capacity), RAM
and storage capacity. The last attribute is the least interesting to us, as none of our
tests use persistent storage. While working on the Amazon cloud infrastructure, we
used four kinds of instances: the standard micro, the standard small, the standard
large and the high CPU medium instance. The micro instance is the smallest possible
Amazon instance: it provides minimal CPU and IO capacity and can consume up to
2 EC2 Compute Units for short periods of burst. This is not enough to run Scalaris
correctly. According to Amazon, an EC2 Compute Unit is equivalent to the CPU
capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. You can find
more information about Amazon instances and EC2 Compute Units on the Amazon
website1. We show the description of the different Amazon instance types in Table 6.1.
              | Standard Micro      | Standard Small     | Standard Large     | High CPU Medium
--------------+---------------------+--------------------+--------------------+--------------------
Memory        | 613 MB              | 1.7 GB             | 7.5 GB             | 1.7 GB
Compute Units | Up to 2 EC2         | 1 EC2 Compute      | 4 EC2 Compute      | 5 EC2 Compute
              | Compute Units (for  | Unit (1 virtual    | Units (2 virtual   | Units (2 virtual
              | short periodic      | core with 1 EC2    | cores with 2 EC2   | cores with 2.5 EC2
              | bursts)             | Compute Unit)      | Compute Units each)| Compute Units each)
Storage       | EBS storage only    | 160 GB instance    | 850 GB instance    | 350 GB instance
              |                     | storage            | storage            | storage
Platform      | 32-bit or 64-bit    | 32-bit             | 64-bit             | 32-bit
I/O Perf      | Low                 | Moderate           | High               | Moderate
API name      | t1.micro            | m1.small           | m1.large           | c1.medium
Table 6.1: Characteristics of the different Amazon instance types we use during thetests.
The small instance is just above the micro; it provides moderate IO performance and
fixed CPU. Small instances were well suited to run up to 18 Scalaris nodes, but showed
some CPU and IO limitations when we used a high number of connections and/or nodes.
As for the micro, the characteristics of the small instance can be found in Table 6.1.
Most of the tests use small instances to run the Scalaris nodes, as they are rather cheap
and efficient, but we could have benefited from instances with higher CPU and network
capabilities, as shown later.
1Amazon EC2 FAQs, http://aws.amazon.com/ec2/faqs/, last accessed 27/07/2011
We also use the large instance, which has better network performance than the two
others, and the high CPU medium instance, which has the same network performance
but much higher CPU performance. Those two instances are used for special tests,
when we suspect that some behaviours can be explained by the lack of performance of
the previous instances.
At first, we tried to work with the micro machines, but they turned out not to be
powerful enough to support Scalaris and the operations we wanted to perform. Those
preliminary measurements are thus not relevant, and we only detail our experiments
and results with the other instances we presented.
6.1.2 Choosing an AMI
Instances need an associated Amazon Machine Image (AMI). AMIs can use two
kinds of storage: AMI storage and the Elastic Block Store (EBS). The first does not
allow the user to stop and restart the machine: once the machine is stopped, all the
modifications done are lost. The second works like a normal personal computer: you
can restart the machine and the changes done before are still present. We use the EBS
solution because it allows us to easily create custom images from existing AMIs and
store them, which is not possible with the classical AMI storage.
6.1.3 Instance security group
Amazon instances all belong to a security group. This security group defines several
firewall settings for the instances. For the sake of simplicity, we have allowed all the
TCP connections as well as all the ICMP messages between the nodes.
6.1.4 Constructing Scalaris AMI
We started from the AMI with ID ami-06ad526f, a 32-bit image of Ubuntu
11.04 (Natty Narwhal)2. The first step is to install all the packages needed to build
Scalaris: the Java JDK, Erlang, make, svn and ant. We ran the following commands
to install the required packages.
sudo apt-get install erlang
sudo apt-get install make
sudo apt-get install openjdk-6-jdk
sudo apt-get install ant
sudo apt-get install subversion
We then installed the latest version (0.3.0) of Scalaris, downloaded from the SVN.
svn checkout http://scalaris.googlecode.com/svn/trunk/
cd /home/ubuntu/trunk/
sudo ./configure
sudo make install
sudo make install java
2Can be found at http://uec-images.ubuntu.com/releases/11.04/release/ last accessed 27/07/2011
We also slightly modified the start scripts of Scalaris and added some scripts to
restart Scalaris easily on a machine. Once all those steps were performed, the new
AMI was ready to run Scalaris.
6.2 Working with Scalaris
We now detail the procedure to launch Scalaris and the different tests we did on it
before testing our Bwitter application.
6.2.1 Launching a Scalaris ring
The first thing to do is to modify the “scalaris.local.cfg” file, which is located in the
bin folder of Scalaris. The two important lines shown below must be modified.
{mgmt_server , {{127,0,0,1}, 14194, mgmt_server }}.
{known_hosts , [{{127,0,0,1}, 14195, service_per_vm }]}.
The mgmt_server, known_hosts and service_per_vm identifiers must not be
modified, otherwise Scalaris will not work correctly: nodes do not connect properly
when those values are changed. You must replace the IP address on the first line with
the IP address of the node running the management server (mgmt_server); 14194 is
the port on which the management server runs, and it can be changed. The second
line contains the known hosts, i.e. the other DHT nodes already inserted in the ring;
each known host is identified by an IP address and the port on which it listens. Below
is an example of configuration.
{mgmt_server , {{192,168,1,1}, 14194, mgmt_server}}.
{known_hosts , [{{192,168,1,1}, 14195, service_per_vm},
                {{192,168,1,2}, 14195, service_per_vm},
                {{192,168,1,3}, 14195, service_per_vm},
                {{192,168,1,1}, 14200, service_per_vm}]}.
In this configuration, one node (192.168.1.1) runs the management server and a
DHT node. Launching the nodes is quite simple: the three following commands are
used respectively to run the management server, the first node and another DHT node.
The “scalarisctl” binary is located in the bin folder of Scalaris.
./scalarisctl -n mgmt_server@hostname -p 14195 -y 8000 -m start
./scalarisctl -n FirstNodeName@hostname -p 14195 -y 8000 -s -f start
./scalarisctl -n AnotherNodeName@hostname -p 14195 -y 8000 -s start
Note that each node has a name, which is needed to communicate with Scalaris
nodes. The mapping between a node name and its location (IP address and port) is
done by the epmd server, launched automatically with Scalaris. It is possible to launch
several Scalaris nodes on the same machine; they only need different node names. The
node name is fixed with the “-n” parameter. In fact only the part before the @ is the
true name, but fixing the hostname is important if you want to avoid communication
problems when using the Java API for Scalaris. Indeed, Java does not resolve
hostnames the same way Erlang does, and Scalaris is written in Erlang. Fixing the
hostname thus prevents Erlang from choosing it itself, and using the same hostname
in Java avoids the problem.
The “-p” parameter fixes the port on which the DHT nodes communicate, which is
important for configuring the firewall settings. The “-y” parameter fixes the port on
which the webserver runs; this webserver is not mandatory, but it eases debugging as
you can do put/get operations directly from its webpage. You can also get a visual
representation of the complete ring from the webpage of the management server.
Finally, the parameters “-m”, “-f” and “-s” are used respectively to start the
management server, the first node and a normal DHT node.
6.2.2 Scalaris performance analysis
Before doing any test directly related to Bwitter, we need some important informa-
tion about Scalaris itself in order to understand our future results. Our first analysis
focuses on the connection strategy used to communicate with Scalaris nodes; we then
perform scalability and elasticity tests based on those results. Scalaris is configured
with a replication factor of 4. Scalaris does not allow choosing the consistency level
between replicas and thus always guarantees strong consistency. This means that read
and write operations are always done in a transaction, and will thus conflict if they
work on the same keys. However, concurrent reads of the same value do not conflict,
which is important to keep in mind during the tests.
One important precision is that we only run one Scalaris node per machine. We
decided to do so because the Scalaris developers told us that having more than one
node per machine was less stable and only slightly increases the overall performance
of the system. Moreover, small instances from Amazon might not be powerful enough
to handle more than one instance of Scalaris.
During our tests with Scalaris we take two measures: the time, in milliseconds, taken
to perform 20000 operations, and the number of operations that failed during the test.
We do not apply any restart strategy: if an operation fails, we report it and execute the
next operation. We then compute the throughput and the failure percentage, defined
respectively by equations 6.1 and 6.2. We have chosen to show the throughput as it is
easier to analyse and closer to what we want to measure than the raw time. Moreover,
time can be difficult to interpret on its own and cannot be compared with other tests'
results unless exactly the same number of operations is done. The failure percentage
has the advantage of being easily comparable for other people doing similar tests.
Throughput = number of Scalaris operations successfully performed / measured total time   (6.1)

Failure percentage = (number of operations failed / number of operations performed) × 100   (6.2)
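The two metrics translate directly into code. The helper below is our own sketch (the class and method names are hypothetical); it mirrors equations (6.1) and (6.2).

```java
// Sketch: compute the two benchmark metrics defined by equations (6.1) and (6.2).
public class BenchmarkMetrics {
    // Successful operations per millisecond of measured time, eq. (6.1).
    static double throughput(long successfulOps, long totalTimeMs) {
        return (double) successfulOps / totalTimeMs;
    }

    // Share of failed operations, in percent, eq. (6.2).
    static double failurePercentage(long failedOps, long performedOps) {
        return 100.0 * failedOps / performedOps;
    }
}
```

For example, 19000 successful operations in 1000 ms give a throughput of 19 operations/ms, and 1000 failures out of 20000 operations give a failure percentage of 5%.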
Before presenting our tests, we want to point out that the Amazon instances do
not provide a constant level of performance: the performance of the Scalaris nodes
varies from one run to the other. Indeed, we do not use the same physical machines
all the time, but virtual machines whose performance can vary over time. We had
lots of tests to run, so it was not possible to make several runs of the same test.
However, because we ran many tests, we could observe when some results deviated
too much from what we had already observed; in that case we restarted the test.
Note that we do not detail all the tests we did on Scalaris, as part of them were done
to familiarise ourselves with the system. We only present the most relevant ones, which
give the reader the broadest view of Scalaris' behaviour.
Connection strategy test
This test is aimed at evaluating the impact that the number of parallel connections
the dispatcher maintains towards a single Scalaris node has on the performance. A
connection is a TCP connection towards a Scalaris node, which can be used to make
sequential requests. The word sequential is important, as concurrent requests using the
same connection trigger errors: Scalaris does not distinguish between different requests,
which thus get mixed if done concurrently. The dispatcher is the node that sends
operations to Scalaris nodes. We have decided to run the dispatcher on a different
machine from the Scalaris nodes because, later, we run our Bwitter nodes on dedicated
machines. Indeed, we believe the overhead of Bwitter could perturb the execution of
Scalaris, which is already quite heavy.
Our guess is that the conflict level (conflictLevel) plays an important role in the
optimal number of connections. We define the conflictLevel of a set of operations as
the probability that a random pair of operations in the set conflict if they occur at the
same time. Having more connections therefore increases the probability that two
conflicting operations occur at the same time, leading to their failure.
We designed a benchmark with a fixed number of nodes (we chose 18, the maximum
number of nodes we could launch in the test environment we were provided) and some
predefined conflict levels. We made the number of connections vary for each value of
conflictLevel. The benchmark consists of 20000 random operations, with as many
reads as writes, operating on a random key inside a given pool of keys. The value
written is always the constant String “test”. The conflictLevel is inversely
proportional to the number of keys on which we work: the smaller the number of
different keys, the higher the chance that two parallel operations work on the same
key and thus conflict. We believe that 20000 operations are enough for small
variations not to influence the overall results.
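A workload of this shape is easy to reproduce. The sketch below is our own code, not taken from the Bwitter benchmark: the textual operation encoding and the assumption that the key-pool size is simply 1/conflictLevel (the proportionality constant is unknown to us) are ours.

```java
import java.util.Random;

// Sketch: generate a half-read, half-write workload on a random key pool
// whose size controls the conflict level -- fewer keys, more conflicts.
public class WorkloadGenerator {
    static final String VALUE = "test";   // the constant value written

    // Assumed pool size for a desired conflict level (inverse proportionality).
    static int poolSize(double conflictLevel) {
        return (int) Math.max(1, Math.round(1.0 / conflictLevel));
    }

    static String[] generate(int nbrOps, double conflictLevel, long seed) {
        Random rnd = new Random(seed);
        int keys = poolSize(conflictLevel);
        String[] ops = new String[nbrOps];
        for (int i = 0; i < nbrOps; i++) {
            String key = "key" + rnd.nextInt(keys);
            // alternate reads and writes so the mix is exactly 50/50
            ops[i] = (i % 2 == 0) ? "read " + key : "write " + key + " " + VALUE;
        }
        return ops;
    }
}
```

With this encoding, a conflictLevel of 0.02 corresponds to a pool of 50 distinct keys shared by all 20000 operations.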
We decided that the best connection strategy is one that is symmetric with respect
to the nodes. This makes sense, as each node is supposed to be equivalent to the
others.
Mathematically speaking, this means that:

|number of connections to n1 − number of connections to n2| ≤ 1,   ∀ n1, n2 ∈ set of nodes
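One simple way to satisfy this constraint is a round-robin assignment of connections to nodes. The sketch below is our own illustration (hypothetical names, not the dispatcher's actual code); it guarantees that any two nodes' connection counts differ by at most one.

```java
// Sketch: distribute connections over nodes so that the symmetric
// connection-strategy constraint |c(n1) - c(n2)| <= 1 always holds.
public class ConnectionBalancer {
    // Returns how many connections each of nbrNodes nodes receives.
    static int[] assign(int nbrConnections, int nbrNodes) {
        int[] perNode = new int[nbrNodes];
        for (int c = 0; c < nbrConnections; c++)
            perNode[c % nbrNodes]++;   // open connections in round-robin order
        return perNode;
    }
}
```

For instance, 25 connections over 18 nodes give 7 nodes two connections each and 11 nodes one connection each.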
We apply this symmetric connection strategy in all our tests. Please also note that,
in order to avoid side effects, we shut down the whole ring between runs and start
with a fresh ring each time. This test uses small Amazon instances for the dispatcher
and the Scalaris nodes. The results are summarised in Figure 6.1 and Figure 6.2.
Figure 6.1: Read and write throughput with respect to the number of connections.
As we can see in Figure 6.2, the conflictLevel has a clear impact on performance.
The number of failed operations increases with the conflictLevel, leading to a lower
throughput, as seen in Figure 6.1. The number of failed operations also increases with
the number of connections: clearly, having fewer connections lowers the number of
failed operations in an environment where operations can conflict.
We can distinguish two parts in Figure 6.1: the part before we reach as many
connections as nodes, and the part after; we call this rupture the break point. In the
first part (except for a conflictLevel equal to 0.1), the number of operations per
second increases almost linearly with the number of connections. We thus deduce the
following property: in normal conditions, where the conflictLevel is not tremendously
high, it is necessary to use as many connections as nodes in order to fully take
advantage of those nodes' power.
Figure 6.2: Failure percentage with respect to the number of connections.
In the second part, the throughput varies with the conflictLevel. When the conflict
level is low, the throughput increases with the number of connections up to a certain
point and then eventually decreases again, below the value measured at the break
point. We believe the increase is due to a heavier load on the Scalaris nodes, while the
decrease can be explained by the growing number of failures observed. Having only
one dispatcher may also not be enough: the network capacity of Amazon small
instances is only moderate, and the traffic towards Scalaris nodes increases with the
throughput. It is thus possible that we have reached the maximum throughput for one
dispatcher. Finally, the throughput does not increase with the number of connections
in cases of very high conflict levels. For example, with a conflictLevel equal to 0.02,
the throughput drops directly after the break point. Concerning the line with a
conflictLevel equal to 0.1, the throughput increase stops even before the break point
and then decreases steadily. This indicates that, if the conflictLevel is really high,
the optimal number of connections is below the number of nodes, despite the fact that
Scalaris could handle more parallel requests.
We thus conclude that, up to a given level of conflict between operations, we must
use at least as many connections as there are nodes. Using more connections also
increases the throughput, but not as drastically, and it depends on the environment in
which we are working.
Connection strategy conclusion: In the light of these tests, we have shown
the crucial influence of the number of connections as well as of the conflictLevel on
the throughput and failure percentage. In a highly conflicting environment, it
might be a good idea to reduce the number of connections a little. However, when
operations conflict rarely, a higher number of connections can significantly increase
performance, because it allows putting a higher load on Scalaris.
Choosing the right number of connections is really difficult, as it requires estimating
the conflict level, which is an application-dependent parameter. Moreover, the results
could have been different for another number of nodes. We finally conclude that we
must use at least as many connections as there are nodes: in most practical situations
the conflictLevel is not high enough to justify going under this number.
Scalability test
Scalaris is claimed to be a scalable system. Although we could have taken this claim
for granted, we wanted to verify it in our own environment, as it is really important
for understanding the next tests.
First scalability test with one dispatcher and small instances: We performed
20000 writes on random keys and then read each of the keys we just wrote. The con-
flictLevel should be close to 0 as keys are chosen randomly using the Math.random()
function from java. We measure the time taken for all the writes and reads with respect
to the number of Scalaris nodes. We make the number of Scalaris nodes vary from 4 to
18, maintaining only one connection per node. As for the connection test, we use small
instances from Amazon for all the nodes (dispatcher and Scalaris nodes). The results
can be found in Figure 6.4.
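The structure of this benchmark can be sketched as follows. This is our own illustration, not the actual test harness: a HashMap stands in for the Scalaris connection, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the scalability benchmark: write N random keys, then read
 *  them all back, timing the whole run. A HashMap stands in for Scalaris. */
public class ScalabilityBenchmark {
    public static void main(String[] args) {
        final int nbrOperations = 20000;
        Map<String, String> store = new HashMap<>(); // stand-in for Scalaris
        String[] keys = new String[nbrOperations];

        long start = System.currentTimeMillis();
        for (int i = 0; i < nbrOperations; i++) {
            // Random keys make the conflictLevel close to 0.
            keys[i] = Double.toString(Math.random());
            store.put(keys[i], "value-" + i);
        }
        for (String key : keys) {
            store.get(key);
        }
        long elapsed = System.currentTimeMillis() - start;

        // Throughput in operations per second (writes + reads).
        double throughput = 2.0 * nbrOperations / Math.max(elapsed, 1) * 1000;
        System.out.println(throughput > 0);
    }
}
```
In the real test, the writes and reads go through the dispatcher's connections to the Scalaris nodes, which is where the number of connections comes into play.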
We can clearly observe that the throughput increases with the number of nodes.
It seems to increase more slowly when the number of nodes becomes higher.
Indeed, of the 70% throughput increase we observe between 4 and 18 nodes,
45% is already obtained between 4 and 8 nodes.
Second scalability test with one dispatcher and medium instances: We were
surprised by the slowdown at the end of the last test. Our assumption is that the
small instances are not powerful enough to handle a ring of that size. We thus reran
the test with medium instances for the Scalaris nodes, the other parameters remaining
the same. The results of this test can be found in Figure 6.4.
We can see a general improvement in performance with more powerful machines,
but again a decrease in scalability with a higher number of nodes. However, this
decrease is not as marked and happens a few nodes later than in the previous case,
around 10 nodes instead of 8. The performance of the machines certainly plays a
role but is probably not the main cause of this decrease. Our guess is that
networking delays come up because we only use one dispatcher.
Figure 6.3: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for small instances and conflict level of 0.
Figure 6.4: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for small and medium instances and conflict level of 0.
Third scalability test with 2 dispatchers and small instances: Looking at our
logs, we noticed that the time the nodes spend waiting for a new job once they have
finished the previous one has an impact on scalability. This time increases with the
number of nodes in the ring as the dispatcher must keep more nodes busy. Networking
delays are thus probably the source of this problem. We now want to measure the
magnitude of this impact. Our idea was to add another dispatcher in order to increase
the load on the Scalaris nodes. We performed a series of tests to measure the impact
of having two dispatchers instead of one.
In the first series of runs we have one dispatcher maintaining two connections with
each Scalaris node, while in the second series we have two dispatchers each
maintaining one connection per Scalaris node. Note that we use two connections in
the first case because we want the same number of parallel requests in both tests.
In order to widen our view of the Scalaris behaviour, we opted for a conflictLevel
equal to 0,004. We thus do 20000 Scalaris operations in total, 20000 for the single
dispatcher and 10000 for each dispatcher when we use two, with as many reads
as writes, and make them overlap. Our results can be found in Figures 6.5 and 6.6.
Figure 6.5: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one and two dispatchers on small instances and conflict level of 0,004.
As we can see in Figure 6.5, the throughput does not seem to be much affected
by the addition of a second dispatcher, even though we can notice a clear difference
once the ring has more than 8 nodes. The difference, however, seems too small to
conclude that the scalability issues are due to the increasing time the nodes spend
waiting. Surprisingly, we see in Figure 6.6 that the failure percentage is always
higher with the single dispatcher.
Figure 6.6: Fail percentage for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one and two dispatchers on small instances and conflict level of 0,004.
Final scalability test with 4 dispatchers and small instances: Finally, we want
to see if a single dispatcher with higher network capacity is a better choice than
several small dispatchers with medium network capacity. We invite you to consult
Table 6.1 to recall the specifications of small and large instances. As you can see,
the large instance offers far better performance than the small instance in every
domain. We decided to make a final test with a really small conflictLevel equal to
0,00007 and made the number of nodes vary from 8 to 16. Once more we chose another
conflict level in order to widen our view. We again do 20000 operations in total:
with a single dispatcher it performs all 20000 operations, while with 4 dispatchers
each performs 5000 operations. The single dispatcher has 4 connections per node
and the 4 dispatchers use one connection per node. Every dispatcher is connected to
every Scalaris node. The results are shown in Figure 6.7. We do not show the failure
percentages because they are nearly equal to 0 and their variation is not relevant.
Our first observation is that the performance of all the configurations increases
linearly, meaning that they all scale correctly. We can then observe that using a
small or a large dispatcher has no effect on the performance. This means that a small
instance is powerful enough to manage at least 16 × 4 connections to Scalaris, and
that there are special conditions in the Amazon cloud that limit the networking
performance with one dispatcher. We believe the 4 small dispatchers outperform the
other two configurations because they can send new jobs to the Scalaris nodes more
quickly than a single dispatcher can. This confirms the results we obtained in the
previous test.
Figure 6.7: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one small, one large and four small dispatchers and conflict level of 0,00007.
We thus reach the conclusion that, while increasing the number of connections to
Scalaris can increase the performance, it is sometimes necessary to have several
dispatchers to put enough load on Scalaris. We can finally observe that with 4
dispatchers the throughput approximately doubles as the number of nodes doubles,
indicating really good scalability.
Comparison with the Scalaris developers' scalability tests: We discussed
with Florian Schintke, a member of the Scalaris development team, about their
scalability tests. They use a different approach from ours and do not perform any
conflicting operations. For instance, they make the number of nodes vary and have
10 clients per node. Each client begins by initializing a random key and then does
1000 increments on this key. The probability of conflict between operations is thus
infinitesimally small. They also used more powerful machines than ours and were not
working on the Amazon cloud. Figure 6.8 shows one of the results Florian Schintke
sent us.
We can clearly see that Scalaris scales correctly. However, their tests are rather
different from ours for several reasons. First, they use a completely different
infrastructure. Secondly, most of our tests work with a conflictLevel, which is
important for us as we know that Bwitter will obviously work with conflicting values.
Finally, we do not run our dispatcher on the same machine as the Scalaris nodes. We
believe it is not realistic for us to have the Bwitter nodes (the equivalent of the
dispatcher in our tests) directly on the Scalaris nodes, as this would perturb
Scalaris nodes that can potentially already be under high load. Furthermore, we would
reduce the benefit gained from the cache by having more Bwitter nodes.
Figure 6.8: Increment Benchmark test of the Scalaris developers.
Final words on scalability and the connection strategy: We have concluded
that Scalaris is scalable, as the performance clearly improves with the number of
nodes. We explain the performance slowdown at high numbers of nodes by the fact
that the load we put on the Scalaris nodes is not high enough.
To increase the load we have three possibilities: increase the number of connections,
use several dispatchers, or improve the networking performance of the environment.
Using several dispatchers gives slightly better results than having only one.
Therefore, we believe that beyond a certain number of connections managed by a
dispatcher it is a good idea to add another one to get better scalability. We were
limited in the number of machines at our disposal to do all the tests we wanted.
We believe the results would have been more explicit if we could have reached a
higher maximum number of nodes.
Scalability is also limited by the conflictLevel. The higher the conflictLevel, the
fewer connections and parallel requests we can use without the number of failures
exploding, as shown by the connection test.
Elasticity test
Test description: Until now we worked with a constant number of nodes during each
test. In order to react to flash crowds, we need Scalaris to be elastic enough so
that the throughput can be increased quickly. The detection of the flash crowd is
not part of the test and we consider that the flash crowd starts at the beginning
of the test. Afterwards, we have to decide what the best strategy is to handle this
flash crowd. To determine it, we observe the throughput as well as the failure
percentage during the whole test. The final throughput reached is also important
for us, as well as the total number of operations performed during the whole test,
in order to determine which behaviour is the best during the churn period.
Parameters: We have observed that Scalaris scales well from 6 to 18 nodes, and we
are going to test different ways to get from 6 to 18 nodes under high load. We will
use one dispatcher to dispatch a constant number of parallel requests to Scalaris.
This dispatcher is also responsible for adding the new nodes to the ring. Note that
it takes between 45 and 200 seconds to start a new node using the Amazon API. The
dispatcher periodically samples the number of operations correctly done as well as
the number of failures. This allows us to plot the evolution of the throughput and
failure percentage with respect to time. We now present the different strategies we
will try. Each strategy is defined by a number of nodes to add at each adding point
and a constant time between adding points. For each strategy we wait one minute
before adding the first node so that we can observe what is happening before and
after.
(1) We do nothing in order to have a standard measure to compare with the other
results.
(2) One node added after one minute and then no more.
(3) One node added every minute until we reach eighteen nodes.
(4) Two nodes added every minute until we reach eighteen nodes.
(5) Two nodes added every two minutes until we reach eighteen nodes.
(6) Six nodes added every five minutes until we reach eighteen nodes.
(7) Twelve nodes added after one minute.
We believe that with those strategies we have covered almost all possible behaviours:
doing nothing, adding nodes regularly, and adding lots of nodes at the same time but
waiting longer before the next addition. We must point out that those strategies are
targets; it may not be possible to add nodes as quickly as planned, so we will most
probably observe jitter in the node starting times. We summarize below the
parameters of the test.
• 1 connection per node
• nbrInitialData = 2000
• 15 minutes of test
• conflictLevel = 1/250 (so all the operations work on a pool of 250 keys)
• 6 nodes running initially
• 1 minute before adding the first node(s)
• Large instance dispatcher
• Small instance Scalaris nodes
• Successful and failed operations sampled every 20 seconds
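The conflictLevel parameter above can be realized by drawing every operation's key from a fixed pool whose size is the inverse of the desired conflict level. A minimal sketch of this idea follows; the class and method names are ours, not from the Bwitter code.

```java
import java.util.Random;

/** Draws keys from a fixed pool so that concurrent operations collide
 *  with a tunable probability (hypothetical helper, not Bwitter code). */
public class ConflictingKeyGenerator {
    private final String[] pool;
    private final Random random = new Random();

    /** conflictLevel = 1/poolSize, e.g. 1/250 gives a pool of 250 keys. */
    public ConflictingKeyGenerator(int poolSize) {
        pool = new String[poolSize];
        for (int i = 0; i < poolSize; i++) {
            pool[i] = "key-" + i;
        }
    }

    public String nextKey() {
        return pool[random.nextInt(pool.length)];
    }

    public static void main(String[] args) {
        ConflictingKeyGenerator gen = new ConflictingKeyGenerator(250);
        System.out.println(gen.nextKey().startsWith("key-"));
    }
}
```
With 250 keys and many parallel requests, two concurrent transactions regularly touch the same key, which is exactly the contention this test is designed to exercise.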
According to the Scalaris developers, at the time of writing, nodes buffer the
requests that arrive while they are being inserted in the ring and start responding
to them as soon as they are correctly inserted. The parameter nbrInitialData is
special: it is aimed at simulating pre-existing content on the Scalaris nodes.
Indeed, in order to maintain the replication factor, new Scalaris nodes must
retrieve the values they become responsible for when they are added to the ring.
This adds an overhead during each insertion of nodes in the ring. We wanted to take
this overhead into account and be able to tune it with the parameter nbrInitialData.
Before the test starts we add nbrInitialData key/value pairs to the ring. The keys
are random and the value is always the same: a constant String of 360448 random
characters. We have chosen nbrInitialData equal to 2000, which means that quite a
lot of data must be transferred to the Scalaris nodes before starting the test. We
have observed that the initialization phase takes approximately 5 minutes. We have
several tasks running on the dispatcher: one responsible for checking that
operations are correctly done, one that sends time statistics, and the management
of the Scalaris Connection Manager and the Nodes Manager, which are both heavy
tasks. This is why we have chosen to use a large dispatcher.
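The initialization step can be sketched as below. This is an illustration only: a HashMap stands in for the Scalaris ring, and the class name is ours.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Sketch of the nbrInitialData initialization: store key/value pairs with
 *  random keys and one constant large value (a map stands in for Scalaris). */
public class InitialDataLoader {
    public static void main(String[] args) {
        final int nbrInitialData = 2000;
        final int valueLength = 360448;

        // One constant value of 360448 random characters, built once.
        Random random = new Random();
        StringBuilder sb = new StringBuilder(valueLength);
        for (int i = 0; i < valueLength; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        String constantValue = sb.toString();

        Map<String, String> ring = new HashMap<>(); // stand-in for Scalaris
        for (int i = 0; i < nbrInitialData; i++) {
            ring.put(Double.toString(Math.random()), constantValue);
        }
        System.out.println(ring.size() <= nbrInitialData
                && constantValue.length() == valueLength);
    }
}
```
Because the value is large, every node joining the ring afterwards must transfer a sizable share of this data to take over its key range, which is the overhead the parameter is meant to model.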
Remarks on the test environment: Before getting to the results, we make two
general remarks. First, the Amazon cluster is unstable; sometimes some machines are
not reachable (ping not working). Secondly, the Ubuntu AMI we are using sometimes
does not initialize the SSH keys correctly and SSH is thus not working; we spotted
this problem really late and could not correct it, as it would have required
modifying the AMI we used and it was too late to redo all the tests. When faced
with one of those two problems, we are forced to reboot the machine on the fly,
which is quicker than launching a new one but still takes some time and CPU. We
consider this overhead part of the test. This was not a problem in the previous
tests, as the launching of the ring was done during the initialisation phase. This
is indeed the first test where we need to launch a new machine at run time.
Scalaris elasticity test results: Figure 6.9 shows the evolution of the throughput
for the different strategies; the numbers in the legend of the graph correspond to
the numbering of the strategies presented above. The throughput is computed from
the data collected every 20 seconds: the throughput at time x is the average
throughput between x-20 seconds and x. Blue points on the graph mark the moments
when we begin to start new instances; note that the number of instances started
depends on the strategy. Red points mark the moments when Scalaris is started on
the nodes, that is, when the command is launched, not when the node is effectively
inserted in the ring.
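The windowed throughput computation described above can be sketched as follows; the class and method names are ours, and the sampler works on cumulative success counters as the dispatcher would record them.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the dispatcher's periodic sampling: every 20 seconds we record
 *  the operation counters and derive the throughput of the last window. */
public class ThroughputSampler {
    static final int SAMPLE_PERIOD_SECONDS = 20;

    /** Converts cumulative success counts into per-window throughputs:
     *  the value at time x is the average over [x-20s, x]. */
    static List<Double> throughputs(long[] cumulativeSuccesses) {
        List<Double> result = new ArrayList<>();
        for (int i = 1; i < cumulativeSuccesses.length; i++) {
            long opsInWindow = cumulativeSuccesses[i] - cumulativeSuccesses[i - 1];
            result.add((double) opsInWindow / SAMPLE_PERIOD_SECONDS);
        }
        return result;
    }

    public static void main(String[] args) {
        // Example: counters sampled at t = 0s, 20s, 40s, 60s.
        long[] samples = {0, 4000, 9000, 15000};
        System.out.println(throughputs(samples)); // ops/s per 20s window
    }
}
```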
We first comment each strategy separately.
(1) During this strategy we do not add any node and thus keep the ring size at the
initial value of 6 nodes. As you can see the throughput stays stable during the
whole test.
(2) In this strategy we add only one node. We start the adding procedure 60 seconds
after the beginning of the test. We can see that this procedure has an impact on
the performance. Indeed, the graph shows that the throughput decreases during
the insertion, but this is not due to Scalaris churn: Scalaris is not started on the
node before 120 seconds. At that time, the throughput increases by a small amount
and stays stable until the end of the test. The node was thus quickly operational
and we did not notice a performance drop after Scalaris started.
(3) We tried to add one node every minute but could only start 6 of the 12 nodes
planned. Indeed, launching one node correctly takes an amount of time that varies
from approximately 45 seconds to 3 minutes. The throughput is more chaotic as we
regularly add nodes, and, as we saw in the previous strategy, the throughput drops
between the moment we start inserting a new node and the moment Scalaris is
effectively started on it. Again, as observed in strategy (2), the throughput
increases directly after Scalaris is started on the node. Finally, we could not
observe the stabilization because nodes are added too regularly and not all the
nodes were added.
(4) Here we add two nodes every minute. This time we could add 10 nodes out of 12.
The throughput once again increases regularly with the addition of nodes while
being perturbed by it. It finishes at a higher value than (3) simply because more
nodes could be reached by the end. The throughput reaches a pretty high value but
is not stable at the end of the test.
(5) We could only add 6 nodes here, which is nearly the same as with strategy (3).
However, here we added two nodes at a time (where we added one in (3)) and waited
twice as long between additions (120s instead of 60s). We can observe some
periodicity in the additions and see that this strategy regularly reaches the same
throughput as the third one. This is confirmed at the end of the test, where both
eventually reach the same throughput with the same number of nodes.
(6) We increased the number of nodes per addition to 6. We first add 6 nodes at 60s;
they were ready at 160s and we directly see a high increase in the throughput and
a quick stabilization. We observe the same behavior at the second addition of 6
nodes and finally reach a stable throughput around 560 ops/s. Surprisingly, this
strategy does not reach the same throughput as (7). Indeed, the last node addition
was done at 560s and, as we have seen, the throughput stays stable from that time
on, showing no indication that it will ever increase. Our guess is that the physical
placement of the machines creates special conditions limiting the number of
messages that can be exchanged between nodes and lowering the throughput. This is
indeed possible as each test is run with different nodes.

Figure 6.9: Throughput with respect to time for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.

Figure 6.10: Failure percentage with respect to time for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.
(7) In this last strategy we add 12 nodes directly at 60s; Scalaris is started on
those nodes at 120s. Between 60s and 120s, we see a drop in throughput that seems
proportional to the number of nodes added, which is normal as the amount of work
involved in starting nodes grows with their number. This drop is of about 25% of
the throughput. However, as soon as the node startup is finished and Scalaris is
booted, the throughput explodes and quickly reaches a stable value at 630 ops/s.
We can confirm that this value corresponds to the stable throughput for 18 nodes:
it is indeed close to the average throughput of 650 ops/s we obtained in the
connection strategy test of Section 6.2.2.
We now summarize the results obtained by observing the throughput evolution for
each strategy. First, we notice that during the adding period (during which we
launch the nodes on Amazon, periodically call the Amazon API to check the instance
states, send the necessary files, and retrieve from the nodes the information
necessary to launch Scalaris) the performance is lowered by a factor proportional
to the number of nodes. However, launching several nodes at the same time is less
time consuming, as Amazon starts all the nodes in parallel and the time waited per
node is thus divided by the number of nodes. Secondly, after Scalaris is started on
the nodes, and despite our fairly large initial data, the nodes are almost instantly
ready to operate: in all the strategies the throughput increases directly after
Scalaris is started. We believe this is because the amount of initial data is too
small to observe any performance drop. Moreover, this throughput is quite stable.
We must also note that several strategies could neither reach 18 nodes nor
stabilize because the test was too short. This is not a problem, as other strategies
have already shown better results and reached the best stable state possible,
namely (7); the conclusions would thus not have been different. Finally, we decided
that the last strategy was the best according to the throughput evolution, as it
allows us to quickly reach a very high and stable throughput with only minor
disturbance.
We now look at the average throughput of each strategy during the test in
Figure 6.11. This criterion is important in order to know which strategy maintains
the best average service during the 15 minutes we have to react to the flash crowd.
It is obvious that the last strategy outperforms the others, which is not surprising
given the evolution of the throughput we just observed. We still have to look at
the failure percentage evolution, as it may give some indication of Scalaris's
instabilities.

We can see in Figure 6.10 that the failure percentage grows with the number of
nodes. As for the throughput, we observe an increase in the failure percentage
after nodes are added which is proportional to the number of nodes added. This is
what we observed in all our tests: increasing the number of connections increases
the number of failures. There is thus no reason to penalize the solutions with
higher failure percentages.
Figure 6.11: Mean throughput results for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.
Conclusion: We conclude that the best strategy is to add all the nodes at the same
time: it is the quickest way to increase the throughput, it gives the best average
throughput over 15 minutes, and it does not present a failure percentage higher
than usual for this number of connections. The results are very encouraging, as it
was indeed possible to go from 6 to 18 nodes in only two minutes with only a loss
of approximately 25% while the nodes were starting. Moreover, as soon as Scalaris
is started on the nodes, the throughput reaches a value close to what we obtained
before in Section 6.2.2. It would have been interesting to test with higher values
of nbrInitialData to try to observe a loss of performance during the insertion of
Scalaris nodes in the ring, but we lacked the time to perform those tests.
6.3 Bwitter tests
Now that we have looked at the performance of Scalaris, we can study Bwitter with
those results in mind. As explained previously, we have implemented two different
approaches: the pull and the push. We are going to test and comment on both in this
section. However, we will focus on the push approach, as it is the one we finally
selected as the best approach; a later section will be dedicated to the pull
approach. Therefore, unless we explicitly specify otherwise, we are talking about
the push approach.
We will start by showing the impact of the application cache we use to solve the
popular value problem. We then make a test to show the influence of
nbrOfFollowersPerChunk, the number of followers per chunk of the Topost set.
Then we test the scalability and elasticity of the system we have implemented.
6.3.1 Experiment measures discussion
In this section we explain which data we measured during our tests, in order to
clarify the rest of the experiment section.
Measures taken
The following tests are aimed at determining the best design and parameter choices.
We thus want to measure the performance of the different configurations we propose,
but we are also interested in determining how successfully the operations were
performed. We do two types of operations, reading and posting tweets, which have
different success conditions and restart strategies. We detail them below.
First, we discuss the tweet posting operation. This operation is considered to fail
only when the first step of the algorithm fails. Indeed, performing this step
correctly ensures that the tweet will eventually be posted to all the lines,
assuming the recovery mechanism is triggered or another tweet is posted by the same
user. If the first step fails, we restart the operation at the test level and do
not count it as another operation; if any of the remaining steps fails, we do not
trigger the recovery mechanism. This means that all the tweets posted during the
tests are always stored in the system but might not be posted to all the lines.
However, we have noticed that a negligible number of SRs have aborted, which
indicates that tweets are successfully posted to the lines most of the time.
Secondly, concerning the reading of tweets, we do not abort the whole operation if
one tweet is not available. This should almost never happen because, as shown in
the previous tests, concurrent reads do not conflict. Moreover, tweets are
frequently read from the cache, lowering the probability of failure even more. We
restart the operation only if an error occurs when accessing the line containing
the tweet references.
We now describe the most relevant measures we took during our tests. We took more
measures in order to help us understand some results and to verify that everything
was working correctly; however, those mostly do not help in understanding the
results and would only clutter the text.
• Time:
We measure the total time in milliseconds needed to perform the requested number
of operations.
• SR run:
This is the number of SRs that were performed during the whole test. Indeed, the
tweet posting operations are split into various SRs. We take this measure in order
to compare it with the number of restarted SRs and the number of aborted SRs. A
restarted SR is not counted in the SRs run.
• SR restarted:
This is the number of SRs that were restarted by Scalaris Workers; remember that
they restart an SR a given number of times, which we have fixed to 10, before
aborting. We use this value in conjunction with the SRs aborted and the SRs run in
order to compute the failure percentage.
• SR aborted:
This is the number of SRs that were aborted by Scalaris Workers. When an SR is
aborted, the Bwitter operation that created the SR gets an exception. If the number
of aborted SRs is low, we can be sure that the Bwitter operations were successfully
performed. In fact, the number of aborted SRs is extremely low: we got
approximately two aborted operations in total during the tests presented here.
This is mainly due to our aggressive restart strategy: as just stated, we restart
a failed operation 10 times before aborting it. We thus do not present this measure
in our results.
• Cache hits:
This indicates the number of times a read was successfully performed from the
cache. Each cache hit avoids a transactional read on Scalaris.
• Cache miss:
This indicates the number of times the cache was accessed and no entry was found.
This is usually pretty low compared to the cache hits, as we frequently access the
same data because the simulated network is small.
You could wonder why we did not measure the failures at the Bwitter level. In fact,
we did not get any failure of any Bwitter operation during our tests. We thus
decided to measure the failures at the layer below, the Scalaris Connection Manager
layer. This measure is precise enough to compare the degree of failures between the
different tests. As was the case with Scalaris, we represent our results in terms
of throughput and failure percentage.
• Throughput:
Our Bwitter tests generally consist of a given number of operations. By an
operation we mean one of the two described above: posting a tweet or reading
tweets. Depending on the test settings, those operations can be more or less heavy.
The throughput is the number of operations per second achieved by the tested
configuration. We believe this is the best way to determine which configuration is
best for a given test, as it fairly measures the global throughput of the whole
system.

Throughput = (number of Bwitter operations successfully performed) / (measured total time)
• Failure percentage:
The failure percentage is the number of restarted SRs divided by the total number
of SR executions (SRs run plus SRs restarted). We only take the restarted SRs into
account because, as said, the number of aborted SRs is negligible.

Failure percentage = SRs restarted / (SRs run + SRs restarted) × 100
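The two metrics can be computed directly from the raw counters, as sketched below; the class and method names are ours, mirroring the measures just described.

```java
/** Sketch of how the two reported metrics are derived from the raw counters
 *  (the names are ours, mirroring the measures described above). */
public class TestMetrics {
    /** Operations per second over the whole test. */
    static double throughput(long successfulOperations, long totalTimeMillis) {
        return successfulOperations / (totalTimeMillis / 1000.0);
    }

    /** Restarted SRs relative to all SR executions, in percent;
     *  aborted SRs are negligible and thus ignored. */
    static double failurePercentage(long srsRun, long srsRestarted) {
        return 100.0 * srsRestarted / (srsRun + srsRestarted);
    }

    public static void main(String[] args) {
        // Example: 20000 operations in 40 seconds, 500 restarts for 9500 runs.
        System.out.println(throughput(20000, 40000));     // 500.0 ops/s
        System.out.println(failurePercentage(9500, 500)); // 5.0 %
    }
}
```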
Measuring the time
The time is always measured with the System.currentTimeMillis() method from Java.
We use the following code pattern to measure the time taken by a piece of code.

long timeAtStart = System.currentTimeMillis();
codeWeWantToProfile();
long executionTime = System.currentTimeMillis() - timeAtStart;
This method does not take into account that we are working in a concurrent
environment. Imagine we want to measure the time taken by one operation. When the
test consists only of operations of the same type, this is not a problem: we can
measure the total time of the test and divide it by the number of operations
performed. However, if we mix operations of different types (for example posting
tweets and reading tweets) we cannot use this method. Indeed, a posting thread can
be preempted by a reading thread, and some time spent in the reading thread will
then be accounted to the preempted posting thread. We did not solve this problem
and therefore did not measure the time taken by a single operation. Ultimately, we
are more interested in the time taken to perform a given number of operations than
in the mean time of one type of operation.
6.3.2 Push design tests
The parameters
All the tests are based on a simulation of Bwitter's use. Between tests we restart
Bwitter and Scalaris in order to avoid side effects from previous tests. This is
time consuming because Scalaris is not persistent and we need to initialize Bwitter
with some data so that the tests are as realistic as possible.
We have two phases: the initialization phase and the main phase. In the first
phase, we create the users and one line for each of them, and we add the owner of
the line to it. We also add a number of followers to each line in order to simulate
social connections. We use a hash function to choose which users a given user
should follow. Finally, each user posts some tweets to create data on the lines.
This phase is never taken into account in the results we present. In order to have
comparable results, the initialization phase is exactly the same for all the tests.
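A hash-based follow relation of this kind can be sketched as follows. This is our own illustration with a hypothetical hash; the point is that the graph is deterministic, so every test run starts from the same social network.

```java
/** Sketch of a deterministic follow relation for initialization: a hash of
 *  (follower, i) picks which users each user follows. This is our own
 *  illustration, not the exact function used by Bwitter. */
public class FollowGraphBuilder {
    /** Returns the id of the i-th user followed by the given user. */
    static int followedUser(int userId, int i, int nbrUsers) {
        // Simple deterministic hash; skip over the user itself.
        int target = Math.floorMod(31 * userId + 17 * (i + 1), nbrUsers);
        return target == userId ? (target + 1) % nbrUsers : target;
    }

    public static void main(String[] args) {
        int nbrUsers = 2000, usersFollowed = 50;
        // User 0 follows the same 50 users on every run (reproducible tests).
        boolean deterministic = true;
        for (int i = 0; i < usersFollowed; i++) {
            deterministic &= followedUser(0, i, nbrUsers)
                    == followedUser(0, i, nbrUsers);
        }
        System.out.println(deterministic);
    }
}
```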
In the second phase, we do the two kinds of operations previously described: post
a tweet and read tweets. We decided to only read the tweets contained in the head
chunk, as this is what users usually want to access. The second phase is finished
after a predefined number of operations have been successfully performed. We fixed
this number to 20000 because, as was the case for Scalaris, we feel that 20000
operations are significant enough that small variations do not influence the
overall results.
The throughput is computed based on this phase. In contrast to the first phase,
this second phase is not static: the operations are performed in a different order
each time the test is run, and the number of operations of each type varies a
little. We made this choice because we wanted to avoid creating an artificial
pattern by fixing the order of the operations, and because we believe it is the
best way to simulate real use of Bwitter. Below we detail the parameters we use
for the social network simulation, Scalaris and Bwitter. Some values are fixed and
others are variable; we will not repeat the fixed parameters in each test.
Therefore, if you need more information about a particular parameter, please refer
to this section. In the tests we only detail parameters that are not fixed or that
differ from the values given here.
We could not find any precise numbers about Twitter's use. We thus decided to
create two different social networks that, according to us, should be close to
reality. The parameters associated with those two configurations are given in
Table 6.2.
                                     Heavy network   Light network
Number of users                      2000            4000
Lines per user                       1               1
Users followed                       50              25
Tweets per user at beginning         1               1
Users followed / Number of users     0,025           0,00625

Table 6.2: Social network parameters, part 1.
It is not possible to simulate a network as big as Twitter; we were thus forced to
simulate a smaller network. However, the initialization phase for those two networks is
already quite long. The names we have chosen for those two networks are significant: the
heavy network is denser than the light one. The heavy network overestimates the
real complexity of a network like Twitter in order to avoid presenting better results than
a real-world network would give. Indeed, we have chosen nbrUsers and nbrFollowers in
order to have a dense network, which complicates the task of Bwitter. You can notice
that the ratio (users followed / number of users) is quite high. This ratio of 0,025
means that each user follows 2,5% of all the users in the network, which implies a quite
high level of conflict between concurrent operations. This ratio is the equivalent of the
conflict level in the Bwitter tests.
We believe the light network is closer to reality, because it is absurd to imagine
that each user follows 2,5% of the users in the network. We thus designed this
other network with a smaller ratio (users followed / number of users), equal to
0,00625, to see how our application reacts to different levels of conflict. We now detail
the parameters related to Scalaris, grouped in Table 6.3.
Scalaris node type            Small instance
Number of Scalaris nodes      Varies from 4 to 18
Connections per node          Usually one, can vary during the tests
Number of trials per SR       10
Number of parallel requests   Usually 20, varies with the total number of
                              connections to Scalaris nodes

Table 6.3: Scalaris parameters.
We can use a maximum of 20 nodes during the experiments, taking into
account both Scalaris nodes and Bwitter nodes; however, we use at most 19 for
historical reasons. In order to maintain a high load during all our tests, we constantly
make 20 operations in parallel. If we use a higher number of connections per node, we
increase this value so that it is always higher than the number of connections to Scalaris
nodes. Finally, we have configured the Scalaris Connection Manager so that each SR
is retried 10 times before being aborted. We now present the Bwitter parameters
grouped in Table 6.4.
Dispatcher / Bwitter node type Small or Large instance
tweetchunksize 30
nbrOfFollowersPerChunk 20
Table 6.4: Bwitter application parameters.
We have two Bwitter application parameters to fix, namely tweetchunksize
and nbrOfFollowersPerChunk. They are likely to have an impact on the results
of our tests, as they influence the number of tweets read and the number of
operations involved in a write. We have chosen them so that the first tweet chunk
contains a decent amount of tweets, in order to have relevant tests. With a value
of 30, we estimate the number of tweets in the head chunk at the start of the test to
be 20. Indeed, each user should have around 50 tweets in his line after the initialisation
phase.
Real system with stars and fans
In order to stick as closely as possible to reality, we have decided to populate our
system with two kinds of users: stars and fans. Indeed, in Twitter, some users have
far more followers than users they follow, while the others follow more people than
they have followers. We fixed the number of stars in the system at 10%, the rest of the
users being fans. For each user in the system, 75% of the users he follows are stars. An
example of a simulated network can be seen in Figure 6.12.
Figure 6.12: Simulated social network with social connections between users, each user follows 3 users. Left) Random following pattern. Right) Nodes 2 and 4 are stars and each user has a 2/3 probability per connection to follow a star.
Furthermore, users tend to do more reads than posts when visiting social networks.
We took this behaviour into account too, by making the ratio of read operations to
the total number of operations configurable. We use the parameters listed in Table 6.5
for all the tests.
Stars percentage                         10% of users are stars
Percentage of stars among users followed 75% of the users followed
Read percentage                          80% of the operations are reads

Table 6.5: Social network parameters, part 2.
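A minimal sketch of how the simulator could draw operations and followed users according to Table 6.5. The function and variable names are ours, not Bwitter's, and only illustrate the stated percentages:

```python
import random

def pick_operation(rng: random.Random) -> str:
    """80% of the operations are reads, the remaining 20% are posts."""
    return "read" if rng.random() < 0.80 else "post"

def pick_followee(stars: list[int], fans: list[int], rng: random.Random) -> int:
    """75% of the users a given user follows are stars, 25% are fans."""
    pool = stars if rng.random() < 0.75 else fans
    return rng.choice(pool)

# 10% of the users are stars, the rest are fans (heavy network: 2000 users).
users = list(range(2000))
stars, fans = users[:200], users[200:]
```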
Cache influence
With this test we show that a cache mechanism is not optional but in fact crucial
for the performance of the system. We made two runs of our Bwitter simulation,
one with the cache and one without it. The parameters used for the two
runs are the ones we just fixed, except for those listed in Table 6.6.
Type of social network Heavy network
Dispatcher / Bwitter node Large instance
Number of Scalaris nodes 18
Connections per node 1
Table 6.6: Parameters changed for the cache test.
We have put the test results in Table 6.7. Remember that we have set a time to live
of 1 minute for the elements in the cache. Those elements thus stay at most
1 minute in the cache before being ejected, meaning that a deleted tweet can remain
visible for at most 1 minute. The cache is big enough to keep all the cacheable
elements of the test. This would probably not be the case in a real
situation; when we must remove an element from a full cache, we use a least
recently used strategy, as explained in section 3.2.3.
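As an illustration, a cache combining the 1-minute time to live with LRU eviction could look like this. It is a sketch of the behaviour described above, not the actual Bwitter code, and all names are ours:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Cache with a time to live and LRU eviction (illustrative sketch)."""

    def __init__(self, max_size: int = 1024, ttl: float = 60.0):
        self.max_size = max_size
        self.ttl = ttl                            # seconds; the tests use 1 minute
        self._data: OrderedDict = OrderedDict()   # key -> (value, expiry)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None                           # cache miss
        value, expiry = entry
        if time.monotonic() > expiry:             # expired: eject the element
            del self._data[key]
            return None
        self._data.move_to_end(key)               # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)        # evict the least recently used
        self._data[key] = (value, time.monotonic() + self.ttl)
```

Note that an expired element is only ejected lazily, on access; a deleted tweet can therefore remain visible until its entry expires, as described above.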
                                          Without cache   With cache
Time taken for all the operations (s)     863 s           492 s
Throughput (ops/s)                        23,15 ops/s     40,59 ops/s
Failure percentage                        1,32%           3,18%
Cache hits                                /               250704
Cache misses                              /               4431

Table 6.7: Performance comparison with and without application cache.
The cache is obviously the quicker option: it nearly doubles the number of
operations performed per second. This noticeable performance improvement is
explained entirely by the frequent access to the cache. The cache is mainly used to
access tweets and passwords. When tweets are in the cache, we avoid X transactions to
Scalaris, where X is the number of tweets read in one read operation. We saw during
the previous tests that reading a value from Scalaris takes approximately 1,5 ms with
18 nodes, while the cache statistics indicate that the mean time for an access to the
cache is 0,006 ms. It is thus theoretically 250 times faster! Given that we have
250704 hits, we save (1,5 − 0,006) × 250704 ms ≈ 374551 ms ≈ 375 s over the whole test.
The difference between the two test times is 371 s; the cache is thus indeed the main
factor improving the performance.
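This back-of-the-envelope computation can be checked directly with the measured figures from the text:

```python
# Measured values from the text: mean Scalaris read time with 18 nodes and
# mean cache access time, both in milliseconds, plus the cache hit count.
scalaris_read_ms = 1.5
cache_read_ms = 0.006
cache_hits = 250704

saved_s = (scalaris_read_ms - cache_read_ms) * cache_hits / 1000
print(round(saved_s))  # ≈ 375 s, close to the measured 371 s difference
```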
A side effect of using the cache is a higher failure percentage, which goes from
1,32% to 3,18%. This is still a very good result, meaning that almost all the Scalaris
operations were correctly performed. The increase is probably due to a higher number
of concurrent postings caused by the cache: the read operations are a lot quicker,
so more tweet-posting operations run concurrently than without the cache, which
implies more conflicts. Indeed, in a quick test we observed a failure percentage of 0
when only reading tweets, and on the contrary a failure percentage of 30% when only
posting tweets. Our assumption that more concurrent tweet postings are responsible
for this increase in failure percentage is thus reasonable.
In conclusion, the cache improves the global performance. The tweet-reading
algorithm mainly benefits from the cache, making reads even faster, which was our
goal. We could probably optimize the cache usage further, but decided not to focus on this
part. The following tests will thus all use the cache described here.
Number of followers in a chunk of the topost set.
Before starting the scalability tests, we were curious to know the practical influence
of nbrOfFollowersPerChunk on the performance of our system. We first list
some theoretical elements that should help us understand the results. Then, we run a
simulation to see whether they hold in a real test.
Recall that the higher the nbrOfFollowersPerChunk, the higher the
number of keys involved in a write transaction, but the lower the number of necessary
transactions. Moreover, transactions involving more keys are in general more likely to
fail. Using Equation 5.2 from our theoretical analysis, we compute that we need
respectively 174, 110, 102 and 98 Scalaris operations to do a single write
for values of nbrOfFollowersPerChunk of 1, 5, 10 and 20. Those results are displayed
in Figure 6.13.
nbOp = 8 + nbrFollowers × (2 + 3/nbrTweetsPerChunk + 2/nbrOfFollowersPerChunk)
     = 8 + 40 × (2 + 3/20 + 2/nbrOfFollowersPerChunk)
     = 8 + 80 + 6 + 80/nbrOfFollowersPerChunk
     = 94 + 80/nbrOfFollowersPerChunk                                        (6.3)
Figure 6.13: Number of Scalaris operations needed to perform a Bwitter “post tweet” operation with respect to the number of followers per chunk.
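Equation 6.3 is easy to check numerically. A direct transcription, with the test's parameter values (40 followers, 20 tweets per chunk) as defaults, reproduces the operation counts quoted above:

```python
def nb_ops(nbr_of_followers_per_chunk: int,
           nbr_followers: int = 40,
           nbr_tweets_per_chunk: int = 20) -> float:
    """Scalaris operations for one Bwitter "post tweet" (Equation 6.3)."""
    return 8 + nbr_followers * (2
                                + 3 / nbr_tweets_per_chunk
                                + 2 / nbr_of_followers_per_chunk)

# 94 + 80/n for n = 1, 5, 10, 20 gives 174, 110, 102 and 98 operations.
```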
We now want to evaluate the impact of this parameter in practice. We simulate
a social network with a higher level of conflict than the two we already presented, in
order to have a clearer view of this impact. The conflict levels of the heavy and light
networks we presented are 0,025 and 0,00625 respectively; in this test it is equal to
0,06. We measured the time needed to perform 10000 operations with different values of
nbrOfFollowersPerChunk. We summarize the simulation parameters in Table 6.8
and present our results in Figures 6.14 and 6.15.
Bwitter node / Dispatcher          Small
Number of Scalaris nodes           10
Number of Bwitter operations       10000
Number of users                    700
Users followed                     40
Users followed / Number of users   0,06

Table 6.8: Parameters changed for the Topost set influence test.
Figure 6.14: Time measured to perform 10000 Bwitter operations with respect to the number of followers per chunk, results for small instances and conflict level of 0,06.
We can see that the time drops a lot between one and five followers per chunk. This
is not surprising, as the number of operations per posted tweet decreases a lot
between one and five, as shown in Figure 6.13. The time difference is entirely explained
by the lower number of operations needed to post one tweet. Indeed, the cost of
the read operation stays the same whatever the value of nbrOfFollowersPerChunk.
Following this reasoning, the time should continue to drop between 5 and 20, but it
seems to stagnate and even to increase slightly at 20. We can explain this by looking
at Figure 6.15, which plots the failure percentage. It shows a big increase in
Figure 6.15: Failure percentage for 10000 Bwitter operations with respect to the number of followers per chunk, results for small instances and conflict level of 0,06.
the failure percentage between 5 and 20. Indeed, as we mentioned in the introduction
of this section, the bigger nbrOfFollowersPerChunk, the bigger the number of keys
involved per transaction during a tweet posting. And larger transactions induce more
conflicts and thus more failures. The advantage of having fewer structures to manage
at higher values of nbrOfFollowersPerChunk thus seems to be offset by the number
of failures, which also increases with nbrOfFollowersPerChunk; this is why we
observe this stagnation at the end of the graph.
In conclusion, we should not use too small a value for nbrOfFollowersPer-
Chunk, as in this case the number of operations increases a lot and the time
explodes. On the other hand, we should not use too high a value either, as it quickly
increases the number of failures and thus the time. This is why we decided to use
a value of 20 for nbrOfFollowersPerChunk in all our following tests: it seems
a good compromise. It is maybe not the best choice, but at least it seems to be a wise
one.
Scalability tests
With this test we evaluate the scalability of our application. We run our simulation
with the parameters described at the beginning of this section for different numbers of
nodes. We use the heavy network we have presented. We do not know for sure what
the best connection strategy would be, as the degree of conflict of our simulation is hard
to evaluate. We thus test with a small dispatcher with one connection per node (1),
with a small dispatcher with two connections per node (2), and with a
large dispatcher and one connection per node (3), as a small dispatcher is maybe not
powerful enough to handle the Bwitter tasks and lots of Scalaris connections.
Our results are grouped in Figures 6.16 and 6.17. The first shows the throughput
with respect to the number of nodes and the second plots the failure percentage.
Figure 6.16: Throughput for 20000 Bwitter operations on a heavy network with respect to the number of Scalaris nodes, results for one small dispatcher with one connection per node, one small dispatcher with two connections per node and one large dispatcher with one connection per node.
From Figure 6.16 we can see that (2) does not scale well. Indeed, the throughput
first increases until 12 nodes and then decreases to a level below the throughput reached
at 4 nodes. The failure percentage is more than twice that of (1) and (3), which
seems to indicate that there are too many connections toward Scalaris. A simulation
with a smaller conflict level could have benefited from a higher number of connections,
but we did not test it.
The throughput of (1) grows at a regular pace until 14 nodes and then seems to slow
down. We observed the same behavior during the Scalaris scalability tests, but it
is more obvious here. The failure percentage grows linearly with the number of nodes,
which is normal.
Configuration (3) gives far better results in terms of throughput than (1) and (2). It
grows very well until 16 nodes and suddenly falls at 18 nodes. However, the gap between
14 and 16 seems higher than usual; we thus believe this situation was created by
exceptional conditions. We deduce from the observation of the throughputs of (1), (2)
and (3) that a small dispatcher cannot handle both Bwitter tasks and Scalaris-related
Figure 6.17: Failure percentage for 20000 Bwitter operations on a heavy network with respect to the number of Scalaris nodes, results for one small dispatcher with one connection per node, one small dispatcher with two connections per node and one large dispatcher with one connection per node.
tasks. We have indeed observed with Amazon's basic monitoring tools that the CPU
as well as the network were used much more during these Bwitter tests than during
the Scalaris tests. This is not surprising, as the values and keys used are bigger than
during the Scalaris tests and Bwitter performs various additional tasks. It therefore
indicates that it is necessary to use more powerful machines than Amazon's small
instances for the Bwitter nodes. The failure percentage grows slowly and is nearly
the same as for (1) until we reach 12 nodes. From 12 nodes until 18 nodes, (3)
sees its failure percentage growing faster. This is probably because it has more CPU
and network capacity and can thus run more transactions in parallel, which creates
more conflicts. However, this seems to indicate that the gain of adding one node will
decrease slowly as the number of nodes becomes bigger. This is not surprising
and does not indicate a scalability problem. Indeed, during this test we increased the
number of parallel operations while keeping the number of users stable. Normally, the
number of machines grows with the size of the social network and thus the number of
users, but a user should not follow more users simply because there are more users in
the network.
In conclusion, Bwitter is scalable, but the Bwitter nodes need to be powerful
enough to handle the necessary number of connections toward Scalaris while performing
the Bwitter tasks. We now make a final scalability test with a simulated social network
with a smaller conflict level which, we believe, is closer to reality. We only run the tests
with one large dispatcher and one connection per node. The parameters changed for
this test are in Table 6.9.
Bwitter node / Dispatcher Large
Number of Scalaris nodes 4→ 18
Connections per node 1
Network type Heavy and Light network
Table 6.9: Parameters changed for the push scalability test
This means that we now have a conflict level of 25/4000 = 0,00625. We show in
Figures 6.18 and 6.19 the results of this test as well as the results for the denser
social network, so that we can more easily compare the two.
Figure 6.18: Throughput for 20000 Bwitter operations with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
As expected, we observe better performance with a smaller conflict level. The
failure percentage increases much more slowly than before, which explains the tremendous
gain in performance. Looking at the two Bwitter scalability tests, we can see a pretty
clear correlation between the failure percentage and the conflict level.
With 18 nodes and this conflict level we finally reach 66 ops/s, which means around 13
tweets posted/s and 53 reads/s. If we make a small computation and assume a user
posts 3 tweets a day and reads his tweets 12 times a day, we estimate that we can
handle 380162 users with only 19 machines. This is obviously an overestimate and not
precise, but even a quarter of this number would be a good result.
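The estimate works out as follows; using the rounded 66 ops/s gives 380160, and the 380162 quoted above presumably comes from the unrounded measurement:

```python
throughput_ops_s = 66            # measured with 18 nodes on the light network
ops_per_user_per_day = 3 + 12    # 3 posts + 12 reads per user per day

ops_per_day = throughput_ops_s * 24 * 3600
supported_users = ops_per_day // ops_per_user_per_day
print(supported_users)  # 380160 users
```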
During those tests we observed good scalability properties for the large dispatchers,
Figure 6.19: Failure percentage for 20000 Bwitter operations with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
the small dispatchers lacked resources. As for the Scalaris scalability test,
we saw that a high conflict level reduces the throughput and lowers the gain obtained
from adding a machine. We now test Bwitter's elasticity.
Elasticity tests
The scalability tests of Bwitter have shown good scalability from 4 to 18
nodes for both the heavy and the light network. However, we have decided to use the
light one: the throughput increases faster with the number of nodes with this
network, so we believe it is easier to observe elasticity with it. Concerning
the nbrInitialData, defined during the elasticity tests on Scalaris, we have decided to
increase its value to 20000. Indeed, the Scalaris elasticity test did not seem to show
any instability after adding nodes, so we decided to try to increase the impact of
the churn. The initialisation phase is very long: it takes approximately 45
minutes to post the initial data and an additional 40 minutes to initialize Bwitter-related
data such as followers, tweets and so on. It was thus not possible to push the amount of
initial data much higher, though we would have liked to. We keep the seven strategies we
defined during the elasticity tests on Scalaris and start with 6 initial nodes. However,
the results should be quite different, because we used much more initial data and Bwitter
adds an important CPU and network overhead compared to the Scalaris operations we
did before. We present the results in Figures 6.20 and 6.21. As for the last elasticity
test, we present the evolution of the throughput as well as the failure percentage; we
Figure 6.20: Throughput with respect to time, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and the light network.
Figure 6.21: Failure percentage with respect to time, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and the light network.
also indicate with blue dots the moment we start the machines on Amazon and with red dots
the moment at which Scalaris is started on the nodes. We also indicate the final number
of nodes reached by each strategy in Table 6.10.
Strategy 1 2 3 4 5 6 7
Nodes added 0 1 5 8 8 12 12
Table 6.10: Number of nodes inserted in the ring at the end of the test.
First, you can observe that the throughput is much more unstable than during the
Scalaris elasticity test. The first reason is that the measure we take is much more
volatile. Secondly, we have put a lot more initial data in the system. This may
slow down Scalaris at times, delaying some read or post-tweet operations so that
they finish in the next sample when we take the measures, thus creating a big
gap between two measures. Thirdly, Bwitter operations are much heavier than the
operations we did during the Scalaris scalability test, which may also have an impact
on the results.
We will not discuss each strategy in detail as we did for Scalaris. Instead we make
some general comments. We can observe that the first strategy's throughput varies
a lot (between 20 and 30) all along the test, which means that even without adding
any node the throughput is quite variable. We can also see that, as for the Scalaris
elasticity test, between the moment we start instances on Amazon and the moment
Scalaris is started on the nodes, the throughput slows down. The addition of
nodes is once again directly effective, and the throughput in general increases. We also
observe that most of the strategies had not stabilized at the end of the test and that
their throughputs still vary a lot. But, as expected, the strategies that added the most
nodes during the test reached the highest throughput. Since the throughput varies a lot,
it is not representative to choose a strategy according to the final throughput; we
thus turn to the average throughput, represented in Figure 6.22, which is much
easier to analyze.
As we can see, strategies 6 and 7, which reach the highest number of nodes at the
end, also have the highest average throughput. Strategies 4 and 5 have a similar
average throughput, but 4 has a higher one because it adds its nodes before 5 and can
thus benefit sooner from the new nodes.
As we can see in Figure 6.21, the failure percentage also varies a lot, and when it
reaches a peak the throughput naturally drops. We see that when the
number of nodes grows, the failure percentage also varies much more. We suppose the
peaks are an effect of Scalaris' stabilisation algorithm, which runs periodically.
So, once again, our conclusion is that the quicker you add nodes, the quicker you
increase the throughput and the higher the average throughput you obtain during the tests.
However, we can observe that strategy (7) was less stable than it was during the Scalaris
elasticity test. So perhaps, if we could have performed elasticity tests with more nodes,
we would have observed that adding all the nodes at the same time is not a good
idea. To conclude, we can say that, with our current resources, adding all the
nodes at the same time seems to be the best strategy.
Figure 6.22: Average throughput, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and the light network.
6.3.3 Pull scalability test
In this final section we test the scalability of the pull approach. We use exactly the
same parameters as those described at the beginning of this section. As for the other
scalability tests, we make the number of nodes vary from 4 to 18, use one connection
per node and make 20000 Bwitter operations. We simulate the heavy and the light
networks. Those parameters are summarized in Table 6.11.
Bwitter node / Dispatcher Large
Number of Scalaris nodes 4→ 18
Connections per node 1
Number of Bwitter operations 20000
Network type Heavy and Light network
Users followed 40
Table 6.11: Parameters changed for the pull scalability test
The heavy network should give much worse results than the other one. Indeed,
from the theoretical analysis, we know that the complexity of the read operation grows
linearly with the number of users followed in the pull approach. We read one chunk (here
one time frame) as we did for the push. We have set the time frame to one day,
which is a reasonable choice for a real application, so all the tweets posted end up
in the same chunk. It may thus seem unfair compared to the push approach,
which flushes the head when it is full, while in the pull we are forced to read all the
tweets that were posted during the day. However, because we use a cache, this side
effect is strongly mitigated. Indeed, most of the tweets are in the cache, and its read
access is really quick. The pull and push simulations are thus comparable. As the large
dispatcher gave better results for the push approach, we decided to make this test
with a large dispatcher as well. Concerning the Scalaris nodes we use, as usual, the small
instances. We put the throughput and the failure percentage for the heavy and the
light network in Figures 6.23 and 6.24.
Figure 6.23: Throughput for 20000 Bwitter operations with the pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
The pull approach shows excellent scalability for the two networks. Indeed, the
throughput increases almost perfectly linearly with the number of nodes. This good behavior
is due to the failure percentage, which grows extremely slowly with the number of nodes.
This seems to indicate that it could handle a really high number of nodes. The low failure
percentage is the consequence of the low number of writes involved in the pull version
of the post tweet. Remember that the pull only writes the tweet reference in one place,
and that when followers read their tweets they do not make any write. Operations
in the pull thus cause almost no conflicts at all. The failure percentage for the light
network seems to increase a lot at 16 nodes, but it is only a visual effect: it
only increases by approximately 0,05.
For the same parameters, namely those described at the beginning of this section,
we go from around 18000 to 250000 Scalaris operations. This is due to the high number of
reads and the low number of writes in our test. As predicted, the reads require many more
operations when using the pull design.
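The structural difference between the two designs can be summarized by this simplified sketch, our own abstraction over a plain dictionary rather than the actual Bwitter data structures:

```python
# Push: posting fans the tweet reference out to every follower's line,
# so a read only touches the reader's own line.
def post_push(store: dict, tweet_ref: str, followers: list[str]) -> None:
    for follower in followers:                 # O(number of followers) writes
        store.setdefault(("line", follower), []).append(tweet_ref)

def read_push(store: dict, reader: str) -> list[str]:
    return store.get(("line", reader), [])     # a single read

# Pull: posting writes the reference once, on the poster's own line,
# but a read must aggregate the lines of every user followed.
def post_pull(store: dict, poster: str, tweet_ref: str) -> None:
    store.setdefault(("line", poster), []).append(tweet_ref)

def read_pull(store: dict, followees: list[str]) -> list[str]:
    tweets: list[str] = []
    for user in followees:                     # O(users followed) reads
        tweets.extend(store.get(("line", user), []))
    return tweets
```

With our 80%-read workload this is exactly the trade-off observed: the pull post almost never conflicts, while its reads multiply the number of Scalaris operations.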
Figure 6.24: Failure percentage for 20000 Bwitter operations with the pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
6.3.4 Conclusion: Pull versus Push
We want to say a few final words about the pull and the push approach. We can
only compare the scalability tests, because we did not perform an elasticity test for the
pull. The results can be directly compared because we used the same parameters for
the two approaches. We put, as usual, the throughput and the failure percentage for the
push and the pull for the two networks we tested. They are shown in Figures 6.25 and
6.26.
We can observe that the push approach outperforms the pull in terms of throughput
for both network types. The throughput also increases faster with the number of
nodes in the push approach. However, we can see that the throughput increase for
the push, as already observed, seems to diminish when we reach a high number of nodes.
This is not the case with the pull approach, which grows more steadily. The push
approach will thus probably reach a scalability limit sooner than the pull.
Figure 6.25: Throughput for 20000 Bwitter operations with the push and pull approaches with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
Concerning the failure percentage, it is much higher in the push approach
and increases much more quickly with the number of nodes. This explains why the increase
of the throughput in the push approach slows down with the number of nodes. The
pull approach does not present this problem, as its failure percentage grows very slowly
and is ridiculously low.
We thus conclude that the two approaches have their pros and cons. The push
approach presents much better performance, but at the cost of a higher failure percentage.
The two approaches scale well, but the pull does not seem to slow down. This seems to
indicate that the pull would be the most appropriate for a very high number of nodes.
However, this last conclusion is purely hypothetical; we would need much larger scale
tests in order to confirm this intuition.
Figure 6.26: Failure percentage for 20000 Bwitter operations with the push and pull approaches with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.
6.4 Conclusion
In this section we have shown how to configure Amazon and Scalaris. We
performed a series of tests concluding that Scalaris running on Amazon is indeed
scalable and elastic. We then performed a series of tests on Bwitter, for both the push
and pull approaches, and demonstrated that both scale very well.
The push approach presents a quicker increase in performance, but with a failure
percentage that grows much faster. Finally, we showed that our system, based on the
push approach, was able to significantly improve its performance within 15 minutes
while facing a high load.
Chapter 7
Conclusion
Our goal was to design and develop a scalable and elastic implementation of a
social network application on top of a key/value datastore. Looking at the results
detailed in the previous chapter, we are confident we have reached our goal. Indeed,
we developed an implementation of our pull and push designs, and they both showed
good scalability results. The elasticity was only tested for the push approach, and we
showed it was possible to quickly improve performance while assuring a good level of
service. All those tests were performed under real-world conditions using Amazon's
Elastic Compute Cloud infrastructure. The implementation was realized with the goal
of being as close as possible to a real social network application; we thus took care
to protect user data and to avoid security flaws.
During our work with Beernet and its main developer Boris Mejıas, we identified
the basic requirements that allow different services to run on the same DHT without
interfering with each other. These led to the discovery of some potential improvements
for Beernet's API, which are now implemented in version 0.9. This new API allows
users to protect their data and grant limited rights to it by using a system of secrets.
Before testing Bwitter, we also heavily tested Scalaris in order to understand the
future Bwitter test results. We first showed the importance of choosing the right
number of connections. Afterwards, we studied its scalability in depth and tried different
strategies in order to evaluate the elasticity of Scalaris on Amazon's EC2. It was shown
to be highly scalable and elastic.
Besides this work, we have also co-written an article, along with Peter Van Roy and
Boris Mejıas, entitled “Designing an Elastic and Scalable Social Network Application”.
In this article we detail some of the observations and design decisions developed
in this master thesis. This article, which can be found in Chapter 10 of our annexes, has
been accepted for The Second International Conference on Cloud Computing, GRIDs,
and Virtualization1, organized by IARIA and held from the 25th to the 30th of September
2011 in Rome, Italy.
1CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html,last accessed 13/08/2011
7.1 Further work
On multiple occasions during the tests, we concluded that it would have been
interesting to perform the tests with more nodes in order to have a better idea of the
scalability and the elasticity. Indeed, during this work our tests were limited to 20
machines. Hence, while Bwitter displayed good performance in this environment,
it would have been interesting to increase the number of machines in order to approach
a more realistic setting.
We also believe the flash crowd detection mechanism is an interesting subject to
study. Indeed, during our research, we noticed that there are sometimes telltale
behaviours in the network before a high peak of activity. It would thus be interesting
to design a mechanism based on those social behaviours in order to predict heavy
loads and allocate machines before the peak occurs.
We did not study downscaling elasticity in our work because, according to the
Scalaris developers, their system does not yet handle graceful shutdowns in version
0.3.0. It would thus be interesting to observe and test Bwitter on Scalaris once this
feature is implemented in order to study its behaviour.
We did not address load balancing between Bwitter nodes, but it could be
interesting to develop an algorithm that detects which requests should be forwarded to
which Bwitter node in order to share the load between them. Following the same idea,
some requests, such as tweets posted by stars, are quite heavy; it might also be a good
idea to split this work between the Bwitter nodes and not only between the Scalaris
nodes.
Finally, the load balancer of the Scalaris Connection Manager could be improved to
decide which SRs should be executed, so as to reduce the conflicts between SRs
executed concurrently.
Bibliography
[1] Apache. Apache hbase, frontpage. http://hbase.apache.org, 2011. [Online; accessed
28-June-2011].
[2] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H.
Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Sto-
ica, and Matei Zaharia. Above the clouds: A berkeley view of cloud computing.
Technical Report UCB/EECS-2009-28, EECS Department, University of Califor-
nia, Berkeley, Feb 2009. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/
EECS-2009-28.html.
[3] Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, and Ion
Stoica. Looking up data in p2p systems. Commun. ACM, 46:43–48, February
2003. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/606272.606299. URL
http://doi.acm.org/10.1145/606272.606299.
[4] Shea Bennett. Twitter passes 300 million users, seeing 9.2 new registrations per sec-
ond. (allegedly.). http://www.mediabistro.com/alltwitter/twitter-300-million-users
b9026, 2011. [Online; accessed 28-June-2011].
[5] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking and Applica-
tions. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. ISBN
0123742145, 9780123742148.
[6] Nicholas Carlson. Facebook has more than 600 million
users, goldman tells clients. http://www.businessinsider.com/
facebook-has-more-than-600-million-users-goldman-tells-clients-2011-1, 2011.
[Online; accessed 28-June-2011].
[7] Rick Cattell. Scalable sql and nosql data stores. ACM SIGMOD Record, 39(4),
dec 2010.
[8] Chris Clayton. Standard cloud taxonomies and windows
azure. http://blogs.msdn.com/b/cclayton/archive/2011/06/07/
standard-cloud-taxonomies-and-windows-azure.aspx, 2011. [Online; accessed
26-July-2011].
[9] Technology Expert. Twitter proves itself again, in chilean earthquake. http://
www.tech-ex.net/2010/02/twitter-proves-itself-again-in-chilean.html, 2010. [Online;
accessed 28-June-2011].
[10] Code Futures. Database sharding. http://www.codefutures.com/
database-sharding/, 2011. [Online; accessed 28-June-2011].
[11] Ali Ghodsi. Distributed k-ary System: Algorithms for Distributed Hash Tables.
PhD thesis, KTH – Royal Institute of Technology, Stockholm, Sweden, dec 2006.
[12] Ali Ghodsi, Luc Alima, and Seif Haridi. Symmetric replication for structured
peer-to-peer systems. In Gianluca Moro, Sonia Bergamaschi, Sam Joseph, Jean-
Henry Morin, and Aris Ouksel, editors, Databases, Information Systems, and
Peer-to-Peer Computing, volume 4125 of Lecture Notes in Computer Science,
pages 74–85. Springer Berlin / Heidelberg, 2007. URL http://dx.doi.org/10.1007/
978-3-540-71661-7 7. 10.1007/978-3-540-71661-7 7.
[13] Ali Ghodsi, Luc Onana Alima, and Seif Haridi. Symmetric replication for
structured peer-to-peer systems. In Proceedings of the 2005/2006 interna-
tional conference on Databases, information systems, and peer-to-peer computing,
DBISP2P’05/06, pages 74–85, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN
978-3-540-71660-0. URL http://portal.acm.org/citation.cfm?id=1783738.1783748.
[14] Jim Gray and Leslie Lamport. Consensus on transaction commit. ACM Trans.
Database Syst., 31:133–160, March 2006. ISSN 0362-5915. doi: http://doi.acm.
org/10.1145/1132863.1132867. URL http://doi.acm.org/10.1145/1132863.1132867.
[15] Sameh El-Ansary and Seif Haridi. An overview of structured overlay networks.
Handbook on Theoretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless, and
Peer-to-Peer Networks, 2005.
[16] Abigail Hauslohner. Is egypt about to have a facebook revolution? http://www.
time.com/time/world/article/0,8599,2044142,00.html, 2011. [Online; accessed 28-
June-2011].
[17] Bill Heil and Mikolaj Piskorski. New twitter research: Men follow men and nobody
tweets. http://blogs.hbr.org/cs/2009/06/new_twitter_research_men_follo.html, 2009.
[Online; accessed 28-June-2011].
[18] Rachelle Matherne. Social media coverage of the haiti earthquake. http://sixestate.
com/social-media-coverage-of-the-haiti-earthquake/, 2010. [Online; accessed 28-
June-2011].
[19] Boris Mejías and Peter Van Roy. Beernet: Building self-managing decentralized
systems with replicated transactional storage. IJARAS: International Journal of
Adaptive, Resilient, and Autonomic Systems, 1(3):1–24, July-Sept 2010. ISSN
1947-9220. doi: 10.4018/jaras.2010070101.
[20] MySQL. Mysql cluster. http://www.mysql.com/products/cluster/, 2011. [Online;
accessed 28-June-2011].
[21] John Naughton. Yet another facebook revolution: why are we so surprised? http://
www.guardian.co.uk/technology/2011/jan/23/social-networking-rules-ok, 2011. [On-
line; accessed 28-June-2011].
[22] Peter Mell and Timothy Grance. The nist definition of cloud computing (draft).
Recommendations of the National Institute of Standards and Technology, 2011.
[23] Programming Languages and Distributed Computing Research Group, UCLou-
vain. Beernet: pbeer-to-pbeer network. http://beernet.info.ucl.ac.be, 2009.
[24] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker.
A scalable content-addressable network. In Proceedings of the 2001 conference
on Applications, technologies, architectures, and protocols for computer commu-
nications, SIGCOMM ’01, pages 161–172, New York, NY, USA, 2001. ACM.
ISBN 1-58113-411-8. doi: http://doi.acm.org/10.1145/383059.383072. URL http:
//doi.acm.org/10.1145/383059.383072.
[25] Redis. Redis. http://redis.io/, 2011. [Online; accessed 28-June-2011].
[26] Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy,
Scott Shenker, Ion Stoica, and Harlan Yu. Opendht: a public dht service and its
uses. SIGCOMM Comput. Commun. Rev., 35:73–84, August 2005. ISSN 0146-
4833. doi: http://doi.acm.org/10.1145/1090191.1080102. URL http://doi.acm.org/
10.1145/1090191.1080102.
[27] Alex Rodriguez. Restful web services: The basics. https://www.ibm.com/
developerworks/webservices/library/ws-restful/, 2008. [Online; accessed 13-August-
2011].
[28] Antony Rowstron and Peter Druschel. Storage management and caching in past,
a large-scale, persistent peer-to-peer storage utility. SIGOPS Oper. Syst. Rev., 35:
188–201, October 2001. ISSN 0163-5980. doi: http://doi.acm.org/10.1145/502059.
502053. URL http://doi.acm.org/10.1145/502059.502053.
[29] Thorsten Schütt, Florian Schintke, and Alexander Reinefeld. Scalaris: reliable
transactional p2p key/value store. In Proceedings of the 7th ACM SIGPLAN work-
shop on ERLANG, ERLANG ’08, pages 41–48, New York, NY, USA, 2008. ACM.
ISBN 978-1-60558-065-4. doi: http://doi.acm.org/10.1145/1411273.1411280. URL
http://doi.acm.org/10.1145/1411273.1411280.
[30] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Bal-
akrishnan. Chord: A scalable peer-to-peer lookup service for internet applica-
tions. SIGCOMM Comput. Commun. Rev., 31:149–160, August 2001. ISSN 0146-
4833. doi: http://doi.acm.org/10.1145/964723.383071. URL http://doi.acm.org/
10.1145/964723.383071.
[31] Chunqiang Tang, Zhichen Xu, and Mallik Mahalingam. psearch: information
retrieval in structured overlays. SIGCOMM Comput. Commun. Rev., 33:89–94,
January 2003. ISSN 0146-4833. doi: http://doi.acm.org/10.1145/774763.774777.
URL http://doi.acm.org/10.1145/774763.774777.
[32] G. Tselentis, J. Domingue, A. Galis, A. Gavras, and D. Hausheer. Towards the
Future Internet: A European Research Perspective. IOS Press, Amsterdam, The
Netherlands, The Netherlands, 2009. ISBN 1607500078, 9781607500070.
[33] Twitter. #numbers. http://blog.twitter.com/2011/03/numbers.html, 2011. [Online;
accessed 28-June-2011].
[34] Guido Urdaneta, Guillaume Pierre, and Maarten Van Steen. A survey of dht
security techniques. ACM Comput. Surv., 43:8:1–8:49, February 2011. ISSN 0360-
0300. doi: http://doi.acm.org/10.1145/1883612.1883615. URL http://doi.acm.org/
10.1145/1883612.1883615.
[35] Harry Wallop. Japan earthquake: how twitter and facebook
helped. http://www.telegraph.co.uk/technology/twitter/8379101/
Japan-earthquake-how-Twitter-and-Facebook-helped.html, 2011. [Online; ac-
cessed 28-June-2011].
[36] Evan Weaver. Improving running components. http://www.slideshare.net/Eweaver/
improving-running-components-at-twitter, 2009. [Online; accessed 28-June-2011].
[37] Wikipedia. Trusted platform module. http://en.wikipedia.org/wiki/Trusted_
Platform_Module, 2011. [Online; accessed 28-June-2011].
[38] Wikipedia. Trusted computing. http://en.wikipedia.org/wiki/Trusted_computing#
Remote_attestation, 2011. [Online; accessed 28-June-2011].
[39] Wikipedia. Partition (database). http://en.wikipedia.org/wiki/Partition_(database),
2011. [Online; accessed 28-June-2011].
[40] Ethan Zuckerman. The first twitter revolution? http://www.foreignpolicy.com/
articles/2011/01/14/the_first_twitter_revolution, 2011. [Online; accessed 28-June-
2011].
Part II
The Annexes
Chapter 8
Beernet Secret API
8.1 Without replication
8.1.1 Put
put(S:Secret K:Key V:Val)
Stores the triplet (Hash(Secret) Key Val) at the peer responsible for Hash(Key).
This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key or there is a triplet stored
previously by a put operation.
• there is no triplet (Hash(Secret1) Key Val1) stored at the peer responsible for
Hash(Key) such that Hash(Secret) ≠ Hash(Secret1).
• the value has successfully been stored.
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret, Beernet will assume it is equivalent to
put(S:NO_SECRET K:Key V:Val).
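The secret check above can be sketched in plain Java. This is an in-memory model written only for illustration: the names SecretStore and Entry, and the use of Java's hashCode, are our assumptions, not Beernet's actual implementation, which runs on a replicated DHT written in Oz.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative in-memory model of the secret-protected put/delete semantics.
public class SecretStore {
    public static final String NO_SECRET = "NO_SECRET"; // reserved value

    // A stored triplet: only the hash of the secret is kept, as in Beernet.
    record Entry(int secretHash, Object value) {}

    private final Map<String, Entry> store = new HashMap<>();

    /** Returns true ("commit") iff the key is free or the caller presents the
     *  same secret as the previous put; otherwise false ("abort"). */
    public boolean put(String secret, String key, Object value) {
        int h = Objects.hashCode(secret);
        Entry old = store.get(key);
        if (old != null && old.secretHash() != h) {
            return false; // abort: a triplet with a different secret exists
        }
        store.put(key, new Entry(h, value));
        return true; // commit
    }

    /** Same check on delete: only the holder of the secret may delete. */
    public boolean delete(String secret, String key) {
        Entry old = store.get(key);
        if (old == null || old.secretHash() != Objects.hashCode(secret)) {
            return false; // abort
        }
        store.remove(key);
        return true; // commit
    }
}
```

Calling put without a secret then behaves like put with NO_SECRET: anyone who also omits the secret can overwrite the value, which is why an application should protect its data with real secrets.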
8.1.2 Delete
delete(S:Secret K:Key)
Deletes the triplet (Hash(Secret1) Key Val) stored at the peer responsible for
Hash(Key). This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is a triplet (Hash(Secret1) Key Val) stored by a put operation at the peer
responsible for Hash(Key).
• Hash(Secret) = Hash(Secret1)
• the triplet has successfully been deleted
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret, Beernet will assume it is equivalent to
delete(S:NO_SECRET K:Key).
8.2 With replication
8.2.1 Write
write(S:Secret K:Key V:Val)
Stores the triplet (Hash(Secret) Key Val) at the majority of replicas; updating the
value gives the triplet a new version number. This operation can have two results,
“commit” or “abort”.
The operation returns “commit” if
• there is nothing stored associated with the key Key or there is a triplet stored
previously by a write operation at the majority of the replicas.
• there is no triplet (Hash(Secret1) Key Val1) with Hash(Secret) ≠ Hash(Secret1)
stored in the majority of the replicas
• the triplet has been correctly stored in the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret, Beernet will assume it is equivalent to
write(S:NO_SECRET K:Key V:Val).
8.2.2 CreateSet
createSet(SS:SSecret K:Key S:Secret)
Stores the triplet (Hash(SSecret) Key Hash(Secret)) at the majority of replicas.
This operation can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is nothing stored associated with the key Key in the majority of the replicas
• there is no triplet (Hash(SSecret1) Key Hash(Secret1)) stored in the majority of
the replicas yet
• the triplet has been correctly stored in the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for SSecret or Secret, Beernet will set those values to
NO_SECRET.
8.2.3 Add
add(S:Secret K:Key SV:SValue V:Val)
Adds the quadruplet (Hash(Secret) Key Hash(SValue) Val) to the set referenced by the key Key
in the majority of the replicas. This operation can have two results, “commit” or
“abort”.
The operation returns “commit” if:
• there is no triplet (Hash(SSecret1) Key Hash(Secret1)) stored at the majority of
the replicas with Hash(Secret1) ≠ Hash(Secret)
• there is no quadruplet (Hash(Secret2) Key Hash(SValue2) Val) with
Hash(SValue2) ≠ Hash(SValue) stored in the majority of the replicas
• the quadruplet has successfully been stored in the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
Note that if no triplet (Hash(SSecret1) Key Hash(Secret1)) was previously stored at
this key by createSet, Beernet will treat the operation as equivalent to
createSet(SS:NO_SECRET K:Key S:Secret) followed by add(S:Secret K:Key
SV:SValue V:Val), where NO_SECRET is a reserved value of Beernet.
If no value is specified for Secret or SValue, Beernet will set those values to
NO_SECRET.
8.2.4 Remove
remove(S:Secret K:Key SV:SValue V:Val)
If no value is provided for Val, this means we are dealing with a key/value pair and
not a key/value set, and so SValue is not evaluated. It deletes the triplet (Hash(Secret1)
Key Val1) stored at the majority of the replicas. This operation can have two results,
“commit” or “abort”.
The operation returns “commit” if:
• there is a triplet (Hash(Secret1) Key Val1) stored with a write operation at the
majority of the replicas.
• Hash(Secret) = Hash(Secret1)
• the triplet has successfully been deleted from the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If a value is provided for Val, this means we are dealing with a value in a set and
SValue will be checked. It deletes the quadruplet (Hash(Secret1) Key Hash(SValue1)
Val1) stored at the majority of the replicas. This operation can have two results,
“commit” or “abort”.
The operation returns “commit” if:
• there is a quadruplet (Hash(Secret1) Key Hash(SValue1) Val1) stored with an
add operation and there is a triplet (Hash(SSecret1) Key Hash(Secret1)) stored
with a createSet operation at the majority of the replicas.
• Val = Val1
• Hash(Secret) = Hash(Secret1)
• Hash(SValue) = Hash(SValue1) or Hash(SValue) = Hash(SSecret1)
• the quadruplet has successfully been deleted from the majority of the replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for Secret or SValue, Beernet will set those values to
NO_SECRET.
8.2.5 DestroySet
destroySet(SS:SSecret K:Key)
Deletes the triplet (Hash(SSecret1) Key Hash(Secret1)) and all the quadruplets
(Hash(Secret1) Key Hash(SValue1) Val) at the majority of replicas. This operation
can have two results, “commit” or “abort”.
The operation returns “commit” if:
• there is a triplet (Hash(SSecret1) Key Hash(Secret1)) stored at the majority of
the replicas
• Hash(SSecret) = Hash(SSecret1)
• the triplet and quadruplets have successfully been deleted at the majority of the
replicas
Otherwise the operation returns “abort” and nothing changed.
If no value is specified for SSecret, Beernet will assume it is equivalent to
destroySet(SS:NO_SECRET K:Key).
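The interplay between createSet, add and destroySet described above can be sketched as a small in-memory model. This is purely our own illustration in Java: the class and member names are invented, a single map stands in for Beernet's majority of DHT replicas, and Java's hashCode stands in for Beernet's hash function.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of Beernet's secret-protected sets (our names, not Beernet's).
public class SecretSets {
    public static final String NO_SECRET = "NO_SECRET";

    record SetMeta(int ssecretHash, int secretHash) {} // stored by createSet
    record Item(int svalueHash, Object value) {}       // stored by add

    private final Map<String, SetMeta> meta = new HashMap<>();
    private final Map<String, List<Item>> sets = new HashMap<>();

    /** SSecret guards the set itself; Secret guards additions to it. */
    public boolean createSet(String ssecret, String key, String secret) {
        if (meta.containsKey(key)) return false; // abort: key already used
        meta.put(key, new SetMeta(ssecret.hashCode(), secret.hashCode()));
        sets.put(key, new ArrayList<>());
        return true; // commit
    }

    public boolean add(String secret, String key, String svalue, Object value) {
        if (!meta.containsKey(key)) {
            // No createSet was done for this key: Beernet performs an
            // implicit createSet with NO_SECRET as the set secret first.
            createSet(NO_SECRET, key, secret);
        }
        if (meta.get(key).secretHash() != secret.hashCode()) {
            return false; // abort: wrong set secret
        }
        sets.get(key).add(new Item(svalue.hashCode(), value));
        return true; // commit
    }

    /** Only the holder of SSecret may destroy the set and its items. */
    public boolean destroySet(String ssecret, String key) {
        SetMeta m = meta.get(key);
        if (m == null || m.ssecretHash() != ssecret.hashCode()) return false;
        meta.remove(key);
        sets.remove(key);
        return true; // commit
    }
}
```

Note how the model separates the two levels of protection: knowing Secret lets a user add to a set, while destroying the set requires the stronger SSecret.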
Chapter 9
Bwitter API
9.1 User management
9.1.1 createUser
public void createUser(String userName, String password, String realName)
Creates a user with his personal information.
Parameters:
• userName - the userName of the user, may not contain spaces.
• password - the password of the user, has to be at least 8 characters long and must
contain at least one number and one special character (not from the 26 letters of
the alphabet).
• realName - the full name of the user, must contain a first and last name.
Throws:
• UserAlreadyUsed - if there already exists a user with this userName.
• PassWordTooWeak - if the password does not meet the requirements.
• UserNameInvalid - if either the userName or realName does not meet the require-
ments.
• ActionNotDoneException - if there was another problem during the operation.
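The password rule above could be checked, for instance, as follows. This is only our interpretation of the rule, reading “special character” as one that is neither a letter nor a digit; it is not Bwitter's actual validation code.

```java
// Hypothetical check for the password rule stated above: at least 8
// characters, at least one digit, and at least one character that is
// neither a letter nor a digit (our reading of "special character").
public class PasswordRule {
    public static boolean isStrong(String password) {
        if (password == null || password.length() < 8) return false;
        boolean hasDigit = password.chars().anyMatch(Character::isDigit);
        boolean hasSpecial =
            password.chars().anyMatch(c -> !Character.isLetterOrDigit(c));
        return hasDigit && hasSpecial;
    }
}
```

A createUser implementation would throw PassWordTooWeak whenever such a check returns false.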
9.1.2 deleteAccount
public boolean deleteAccount(String userName, String password)
Deletes the account of the user along with his lists and lines. Also deletes all the tweets
this user posted.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ActionNotDoneException - if there was another problem during the operation.
9.2 Tweets
9.2.1 postTweet
public void postTweet(String userName, String password, String msg)
Posts the message so that it is displayed in all the lines following the user.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• msg - a String containing the message
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.2.2 reTweet
public void reTweet(String userName, String password, String tweetID)
Posts the referenced tweet as a retweet so that it is displayed in all the lines following
the user.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• tweetID - reference of the tweet to retweet
Throws:
• ActionAlreadyPerformed - if this action has already been performed previously.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.2.3 reply
public void reply(String userName, String password, String msg, String tweetID)
Posts a new tweet with msg as message so that it is displayed in all the lines following
the user. The new tweet contains a reference to its parent tweet, referenced by tweetID,
and is added to the parent's children.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• msg - a String containing the message
• tweetID - reference of the tweet to which to reply
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.2.4 deleteTweet
public void deleteTweet(String userName, String password, int tweetnbr)
Deletes the tweet of the user with the specified number.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• tweetnbr - number of the tweet to delete.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3 Lines
9.3.1 addUser
public void addUser(String userName, String password, String lineName,
String newFollowingUserName)
Adds the specified user to the specified line. From now on, all the tweets posted by the
specified user will be displayed in the specified line.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line to which the user should be added.
• newFollowingUserName - name of the user that should be added.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.2 removeUser
public void removeUser(String userName, String password, String lineName,
String followingUserName)
Removes the specified user from the specified line. From now on, the tweets posted by
the specified user will no longer be displayed in the specified line.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line from which the user should be removed
• followingUserName - name of the user that should be removed.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.3 allUsersFromLine
public Collection<String> allUsersFromLine(String userName, String lineName)
Retrieves all the users followed in the specified line owned by the specified user.
Parameters:
• lineName - name of the line
• userName - name of the user owning the line
Returns:
A Collection of Strings containing all the userNames of the users followed in the
specified line.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.4 allTweet
public Collection<Tweet> allTweet(String userName)
Retrieves all the tweets from the specified user. Should only be used for testing the
application.
Parameters:
• userName - name of the user.
Returns:
A LinkedList of all the Tweets of the user ordered chronologically.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.5 getTweetsFromLine
public TweetChunk getTweetsFromLine(String userName, String lineName, int cNbr,
String date)
Retrieves the tweets from the chunk with the number equal to cNbr from the line
lineName of the user userName that were posted after date. If date is null all the
tweets from the chunk are returned. If cNbr is negative the last chunk from the line is
returned.
Parameters:
• lineName - name of the line.
• userName - name of the user owning the line.
• cNbr - number of the chunk of the line you want to read. The chunks are ordered
from oldest to most recent, with the most recent chunk having the highest number.
• date - String representing the limit date with the format “05/06/11 15 h 26 min
03 s GMT”
Returns:
A TweetChunk containing a LinkedList of Tweets ordered chronologically and the
number of the chunk in which they are stored.
Throws:
• ParseException - if the date has not the correct format and could not be parsed.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
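The quoted date format can be parsed with java.text.SimpleDateFormat. The exact pattern below, including the day/month order, is our guess from the single example given, not Bwitter's actual code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical parser for limit dates such as "05/06/11 15 h 26 min 03 s GMT".
public class LimitDate {
    private static SimpleDateFormat newFormat() {
        // Quoted sections ('h', 'min', 's') are literal text in the pattern.
        SimpleDateFormat f =
            new SimpleDateFormat("dd/MM/yy HH 'h' mm 'min' ss 's' z", Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f;
    }

    /** Throws ParseException when the string does not match the format,
     *  mirroring the ParseException listed above. */
    public static Date parse(String s) throws ParseException {
        return newFormat().parse(s);
    }

    public static String format(Date d) {
        return newFormat().format(d);
    }
}
```

With this pattern the example string round-trips: parsing it and formatting the result yields the original string.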
9.3.6 createLine
public void createLine(String userName, String password, String lineName)
Creates a new line with the specified name for specified user as an owner.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the new line to create.
Throws:
• LineAlreadyExists - if the user already has a line with the same name.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.7 deleteLine
public void deleteLine(String userName, String password, String lineName)
Deletes the specified line owned by the user.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line to be deleted; note that the userline and timeline
cannot be deleted.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.3.8 getLineNames
public Collection<String> getLineNames(String userName)
Retrieves the names of all the lines of the user.
Parameters:
• userName - the userName of the owner of the lines.
Returns:
A LinkedList of Strings containing the names of all the lines.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4 Lists
9.4.1 addTweetToList
public void addTweetToList(String userName, String password, String listname,
String tweetID)
Adds the referenced tweet to the specified list.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the user’s list.
• tweetID - reference to the tweet to add to the list.
Throws:
• ActionAlreadyPerformed - if the tweet has already been added to the list previ-
ously.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.2 removeTweetFromList
public void removeTweetFromList(String userName, String password, String listname,
String tweetID)
Removes the referenced tweet from the specified list.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the user’s list.
• tweetID - reference of the tweet to remove from the list.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.3 getTweetsFromList
public TweetChunk getTweetsFromList(String userName, String listname,
int cNbr, String date)
Retrieves the tweets from the chunk with the number equal to cNbr from the list
listname of the user userName that were posted after date. If date is null all the tweets
from the chunk are returned. If cNbr is negative the last chunk from the list is returned.
Parameters:
• listName - name of the list
• userName - name of the user owning the list.
• cNbr - number of the chunk of the list you want to read. The chunks are ordered
from oldest to most recent, with the most recent chunk having the highest number.
• date - String representing the limit date with the format “05/06/11 15 h 26 min
03 s GMT”
Returns:
A TweetChunk containing a LinkedList of Tweets ordered chronologically and the
number of the chunk in which they are stored.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.4 createList
public void createList(String userName, String password, String listname)
Creates a new list with the specified name and the user as an owner.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the new list to create
Throws:
• ListAlreadyExists - if the user already has a list with the same name.
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.5 deleteList
public void deleteList(String userName, String password, String listname)
Deletes the specified list.
Parameters:
• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the list to be deleted; note that the favorite list cannot be
deleted.
Throws:
• BadCredentials - if the provided userName does not exist or if the password does
not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
9.4.6 getListNames
public Collection<String> getListNames(String userName)
Retrieves the names of all the lists of the user.
Parameters:
• userName - name of the user owning the lists.
Returns:
A LinkedList of Strings containing the names of all the lists.
Throws:
• ValueNotFound - if a critical value needed to perform the operation could not be
retrieved.
• ActionNotDoneException - if there was another problem during the operation.
Chapter 10
The paper
During the course of our project we co-wrote an article, together with Peter
Van Roy and Boris Mejías, entitled “Designing an Elastic and Scalable Social Network
Application”.
The contents of this paper were based on our second implementation of Bwitter,
which we detail in Section 5.1.2. It is thus not fully representative of our final
implementations and design choices.
This article has been accepted for The Second International Conference on Cloud
Computing, GRIDs, and Virtualization1, organized by the IARIA and held from the
25th to the 30th of September 2011 in Rome, Italy.
The submitted version of this paper can be found next page.
1 CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011
Designing an Elastic and Scalable Social Network Application
Xavier De Coster, Matthieu Ghilain, Boris Mejías, Peter Van Roy
ICTEAM institute
Université catholique de Louvain
Louvain-la-Neuve, Belgium
{decoster.xavier,ghilainm}@gmail.com {boris.mejias,peter.vanroy}@uclouvain.be
Abstract—Central server-based social networks can suffer from overloading caused by social trends, making the service momentarily unavailable and preventing users from accessing it when they most want it. Central server-based social networks are not adapted to face rapid growth of data or flash crowds. In this work we present a way to design a scalable, elastic and secure Twitter-like social network application built on top of Beernet, a transactional key/value datastore. By being scalable and elastic the application avoids both overloading and wasting resources by scaling up and down quickly.
Keywords-Scalable; elastic; social network; design.
I. INTRODUCTION
Social networks are an increasingly popular way for people to interact and express themselves. People can now create content and easily share it with other people. The servers of those services can only handle a given number of requests at the same time, so if there are too many requests the server can become overloaded. Social networks thus have to predict the amount of load they will have to face in order to have enough resources at their disposal. Statically allocating resources based on the mean utilisation of the service would lead to waste during slack periods and overloading during peak periods. Twitter (http://www.twitter.com) shows the “Fail Whale” graphic whenever overloading occurs. This is a tricky situation, as this load is related to many social factors, some of which are impossible to predict. For instance, we want to be able to handle the high number of people sending Christmas or New Year wishes, but also those reacting to natural disasters. This is why we turn towards scalable and elastic solutions, allowing the system to add and remove resources on the fly in order to fit the required load. In this work we focus on the design of a social network with an elastic and scalable infrastructure: Bwitter, a secure Twitter-like social network built on Beernet [1], a scalable key/value store. In the next section we give an overview of the basic required operations for a social network. We then explain why we chose Beernet for this project in Section III, and how to run multiple services on top of it in Section IV; in that section we also discuss some possible improvements for DHTs in order to increase their security and offer a richer application programming interface. We then take a closer look at the design of our application in Section V. In Section VI we compare two types of architectures on which this social network can run: one fully distributed, based on peer-to-peer, and one centralised, based on the cloud. We then finish with the implementation of our prototype in Section VII and a small conclusion in Section VIII.
II. A QUICK OVERVIEW OF REQUIRED OPERATIONS
Bwitter is designed to be a secure social network based on Twitter. Twitter is a microblogging system, and while it looks relatively simple at first sight, it hides some complex functionalities. We included almost all of those in Bwitter and added some others. We will only depict here the relevant functionalities that will help us analyse the design of the system and the differences between a centralised and a decentralised architecture.
A. Nomenclature

There are only a few core concepts on which our application is based. A tweet is basically a short message with additional meta-information. It contains a message of up to 140 characters, the author’s username and a timestamp of when it was posted. If the tweet is part of a discussion, it keeps a reference to the tweet it is an answer to and also keeps references towards the tweets that are replies to it. A user is anybody who has registered in the system. A few pieces of information about the user are kept in memory by the application, such as her complete name and her password, used for authentication. A line is a collection of tweets and users. The owner of the line can define which users she wants to associate with the line. The tweets posted by those users will be displayed in this line. This allows a user to have several lines, each with different topics and associated users.
B. Basic operations

1) Post a tweet: A user can publish a message by posting a tweet. The application will post the tweet in the lines to which the user is associated. This way all the users following her have the tweet displayed in their line.
2) Retweet a tweet: When a user likes a tweet from another user, she can decide to share it by retweeting it. This will have the effect of “sending” the retweet to all the lines to which the user is associated. The retweet will be displayed in the lines as if the original author posted it, but with the retweeter’s name indicated.
3) Reply to a tweet: A user can decide to reply to a tweet. This will include a reference to the reply tweet inside the initial tweet. Additionally, a reply keeps a reference to the tweet to which it responds. This allows the whole conversation tree to be built.
4) Create a line: A user can create additional lines with custom names to regroup specific users.
5) Add and remove users from a line: A user can associate a new user to a line; from then on, all the tweets this newly added user posts will be included in the line. A user can also remove a user from a line; she will then no longer see the tweets of this user in her line and will not receive her new tweets either.
6) Read tweets: A user can read the tweets from a line in packs of 20 tweets. She can also refresh the tweets of a line to retrieve the tweets that have been posted since her last refresh.
III. WHY BEERNET?
Beernet [2] is a transactional, scalable and elastic peer-to-peer key/value data store built on top of a DHT. Peers in Beernet are organized in a relaxed Chord-like ring [3] and keep O(log(N)) fingers for routing. This relaxed ring is more fault tolerant than a traditional ring, and its robust join and leave algorithms for handling churn make Beernet a good candidate for building an elastic system. Any peer can perform lookup and store operations for any key in O(log(N)), where N is the number of peers in the network. The key distribution is done using a consistent hash function, roughly distributing the load among the peers. These two properties are a strong advantage for the scalability of the system compared to solutions like client/server.
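The consistent-hashing key distribution mentioned above can be sketched as follows. This is only an illustration of the responsibility rule, with plain long ids and a linear ring walk; Beernet's actual ring additionally keeps O(log(N)) fingers so lookups do not scan all peers, and none of these names belong to Beernet's API.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch of consistent hashing: each peer owns the keys between
// its predecessor's id and its own id on the ring.
public class ConsistentHash {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addPeer(String name, long id) {
        ring.put(id, name);
    }

    // The responsible peer is the first peer at or after the key's hash,
    // wrapping around to the smallest id at the end of the ring.
    public String responsible(long keyHash) {
        SortedMap<Long, String> tail = ring.tailMap(keyHash);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

Because keys are hashed before placement, each peer receives a roughly equal slice of the key space, which is the load-spreading property the text relies on.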
Beernet provides transactional storage with strong consistency, using different data abstractions. Fault tolerance is achieved through symmetric replication, which has several advantages, not detailed here, compared to the leaf-set and successor-list replication strategies [4]. In every transaction, a dynamically chosen transaction manager (TM) guarantees that if the transaction is committed, at least a majority of the replicas of an item store the latest value of the item. A set of replicated TMs guarantees that the transaction does not rely on the survival of the TM leader. Transactions can involve several items. If the transaction is committed, all items are modified. Updates are performed using optimistic locking.
With respect to data abstractions, Beernet provides not only key/value pairs, as in Chord-like networks, but also key/value sets, as in OpenDHT-like networks [5]. The combination of these two abstractions provides more possibilities for designing and building the database, as we will explain in Section V. Moreover, key/value sets are lock-free in Beernet, providing better performance.
We opted for Beernet because of those native data abstractions and its elasticity and scalability properties. However, any scalable and elastic key/value store providing transactional storage with strong consistency and those data abstractions could be used too.
IV. RUNNING MULTIPLE SERVICES ON BEERNET
Multiple services running on the same DHT can conflict with each other. We will now discuss two mechanisms designed to avoid those conflicts.
A. Protecting data with Secrets
Early in the process, we elicited a crucial requirement: the integrity of the data posted by the users on Bwitter must be preserved. A classical mechanism, though not without flaws, is to use a capability-based approach. Data is stored at randomly generated keys so that other applications and users of Beernet cannot erase others’ values, because they do not know at which keys these values are stored. But in Bwitter, some information must be available to everybody and thus keys must be known by all users, meaning that we cannot use random keys. For example, any user must be able to retrieve the user profile of another user, and must thus know the key at which it is stored. The problem is that Beernet does not allow any form of authentication, so key/value pairs are left unprotected, meaning that anybody able to make requests to Beernet can modify or delete any previously stored data.
We make a first and naive assumption that services running on Beernet are bug-free and respectful of each other. They thus check at each write operation that nothing else is stored at the given key, otherwise they cancel the operation. Thanks to the transactional support of Beernet, the check and the write can be done atomically. This way we avoid race conditions where process A reads, then process B reads, both concluding that there is nothing at a given key and both writing a value, leading to the loss of one of the two writes.
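The atomic check-and-write above can be sketched with a minimal in-memory store. `writeIfAbsent` is an illustrative stand-in for a Beernet transaction, showing only why the check and the write must happen in one atomic step; it is not Beernet's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of atomic check-and-write: the write succeeds only if nothing is
// stored at the key yet, and the check cannot be interleaved with another
// writer's check, so the lost-update race described in the text is avoided.
public class CheckAndWrite {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Atomically: check the key is empty, then write.
    public boolean writeIfAbsent(String key, String value) {
        return store.putIfAbsent(key, value) == null;
    }

    public String read(String key) {
        return store.get(key);
    }
}
```

If two processes race on the same key, exactly one `writeIfAbsent` returns true; the loser observes the existing value instead of silently overwriting it.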
This assumption is not realistic and adds complexity to the code of each application running on Beernet. We thus relax it and assume that Beernet is running in a safe environment like the cloud, which implies that no malicious node can be added to Beernet. We allow any application to make requests directly to any Beernet node from the Internet. We designed a mechanism called “secrets” to protect key/value pairs and key/value sets stored on Beernet, enriching the existing Beernet API.
Applications can now associate secrets to the key/value pairs and key/value sets they store. This secret is not mandatory; if no secret is provided, a “public” secret is automatically added. The secret is needed to modify or delete what is stored at the protected key. For instance, we could have the following situation: a first request stores the value foo at the key bar using the secret ASecret, then another request tries to store another value at key bar using a secret different from ASecret. Because the secrets are different, Beernet rejects the second request, which thus has no effect on the data store. A similar mechanism has been implemented for sets, allowing the protection of the set as a whole to be dissociated from that of the values it contains.
Secrets are implemented in Beernet and have been tested through our Bwitter application. A similar but weaker mechanism is proposed by OpenDHT [5]. Complete information concerning the new secret API can be found on Bwitter’s web site (http://bwitter.dyndns.org/).
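As a rough model of the secrets behaviour in the foo/bar/ASecret example (an illustration only, not Beernet's implementation; class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of secret-protected writes: a put is accepted only if the caller
// presents the secret the key was first stored with; otherwise the request
// is rejected and the store is unchanged. Reads are not protected.
public class SecretStore {
    private static final String PUBLIC = "public"; // default when no secret given
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, String> secrets = new HashMap<>();

    public synchronized boolean put(String key, String value, String secret) {
        String s = (secret == null) ? PUBLIC : secret;
        String owner = secrets.get(key);
        if (owner != null && !owner.equals(s)) {
            return false; // wrong secret: rejected, no effect on the store
        }
        secrets.put(key, s);
        values.put(key, value);
        return true;
    }

    public synchronized String get(String key) {
        return values.get(key);
    }
}
```

The paper's example then reads: storing foo at bar with ASecret succeeds, a second write with a different secret is rejected, and a write presenting ASecret again succeeds.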
B. Dictionaries
At the moment in Beernet, as in all key/value stores we know of, there is only one key space. This can cause problems if multiple services use the same key. For instance, two services might design their database to store user profiles at a key equal to the username of the user. This means they cannot both have a user with the same username. This problem cannot be solved with the secrets mechanism we proposed. We thus propose to enhance the current Beernet API with multiple dictionaries. A dictionary has a unique name and refers to a key space in Beernet. A new application can create a dictionary as it starts using Beernet. It can later create new dictionaries at run-time as needed, which allows developers to build more efficient and robust implementations. Dictionaries can be efficiently created on the fly in O(log(N)), where N is the number of peers in the Beernet network. Moreover, dictionaries do not degrade the storing and reading performance of Beernet. If two applications need to share data, they just have to use the same dictionary. This has not yet been implemented, but the API and algorithms are currently being designed. An open problem is how to prevent malicious applications from accessing the dictionary of another application.
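A minimal sketch of dictionaries as independent key spaces (class and method names are hypothetical; the real design would distribute each dictionary over the ring rather than hold it in one map):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of multiple dictionaries: the same key can coexist in different
// dictionaries without conflict, which is the property the text motivates.
public class Dictionaries {
    private final Map<String, Map<String, String>> spaces = new HashMap<>();

    public void put(String dict, String key, String value) {
        spaces.computeIfAbsent(dict, d -> new HashMap<>()).put(key, value);
    }

    public String get(String dict, String key) {
        Map<String, String> space = spaces.get(dict);
        return space == null ? null : space.get(key);
    }
}
```

Two services can now both store a profile under the key "alice", each in its own dictionary, resolving the collision described above.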
V. DESIGN PROCESS
We will now present our design choices and explain how we relieve machines hosting popular values.
A. Main directions
We will start by discussing the main design choices we made for our implementation.
1) Make reads cheap: While designing the construction mechanism of the lines, we were faced with the following choice: either push the information and put the burden on the write, making the “post tweet” operation add a reference to the tweet in the lines of each follower; or pull the information and build the lines when a user wants to read them, by fetching all the tweets posted by the users she follows and reordering them. As people do more reads than writes on social networks, based on the assumption that each posted tweet is read at least once, we opted to make reads cheaper than writes.
2) Do not store full tweets in the lines, but references: There is no need to replicate the whole tweet inside each line, as a tweet could potentially contain a lot of information and should be easy to delete. To delete a tweet, the application only has to edit the stored tweet and does not need to go through every line that could contain it. When loading the tweet, the application can see whether it has been deleted or not.
3) Minimise the changes to an object: We want the objects to be as static as possible to enable caching. This is why we do not store potentially dynamic information inside the objects, but rather a pointer to a place where the information can be found. For instance, tweets are only modified when we delete them; if there is a reply to a tweet, the ID of the new child is stored in a separate set.
4) Do not make users load unnecessary things: Loading the whole line each time we want to see the new tweets would result in an unnecessarily high number of messages exchanged and would consume a lot of bandwidth. This is why we decided to cut lines, which in fact are just big sorted sets, into subsets of x tweets each, organised in a linked-list fashion, where x is a tunable parameter. This way the user can load tweets in chunks of x tweets. The first subset contains all the references to the tweets posted since the last time the user retrieved the line; it can thus be much larger than x tweets. This is not a problem, as users generally want to check all the new tweets when they consult a line. The cutting is then done as follows: the application removes the x oldest references from the first set, posts them in a new subset, and repeats the operation until the loaded first set is smaller than x.
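The cutting step just described can be sketched as follows, with subsets modelled as plain Java lists (an assumption made for illustration; in Bwitter they are key/value sets stored in Beernet):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of line cutting: the first subset (references sorted newest
// first) is trimmed to fewer than x entries by repeatedly moving its x
// oldest references into a new subset prepended to the chain of older
// subsets, preserving the newest-first order across the chain.
public class LineCutter {
    public static void cut(List<String> firstSet, LinkedList<List<String>> chain, int x) {
        while (firstSet.size() >= x) {
            // Copy the x oldest references (at the tail of the list)...
            List<String> subset = new ArrayList<>(
                firstSet.subList(firstSet.size() - x, firstSet.size()));
            chain.addFirst(subset);
            // ...then remove them from the first set and repeat.
            firstSet.subList(firstSet.size() - x, firstSet.size()).clear();
        }
    }
}
```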
5) Retrieving tweets in order: Due to the cutting mechanism and delays in the network, we cannot be sure that each reference contained in a subset is strictly newer than the references stored in the next subset. So we also retrieve the tweet references from the next subset and select only the 20 newest references before fetching the tweets.
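The merge-and-select step can be sketched as below; the `Ref` record and the generalised page size are illustrative assumptions, standing in for the fixed pack of 20 in the text:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of ordered retrieval: because subset boundaries are not strict
// timestamp boundaries, references from two adjacent subsets are merged
// and only the newest pageSize refs are kept before fetching the tweets.
public class OrderedRetrieval {
    public record Ref(String tweetId, long timestamp) {}

    public static List<Ref> newest(List<Ref> firstSubset, List<Ref> nextSubset, int pageSize) {
        List<Ref> merged = new ArrayList<>(firstSubset);
        merged.addAll(nextSubset);
        merged.sort(Comparator.comparingLong(Ref::timestamp).reversed());
        return merged.subList(0, Math.min(pageSize, merged.size()));
    }
}
```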
6) Filtering the references: When a user is dissociated from a line, we do not want our application to still display the tweets she posted previously. We decided not to scan the whole line to remove all the references added by this user, but rather to remove the user from the list of users associated with the line and filter the references based on this list before fetching the corresponding tweets.
7) Only encrypt sensitive data: Most of the data in Twitter is not private, so there would be no point in encrypting it. Only sensitive data, such as the passwords of the users, should be protected by encryption when stored.
8) Modularity: Even if our whole design and architecture rely on the features and API offered by Beernet, it is always better to be modular and define clear interfaces, so that a whole layer can easily be replaced by another. For instance, any other DHT could easily be used, provided it supports the same data abstractions or they can be simulated.
B. Improving overall performance by adding a cache
1) The popular value problem: Given the properties of the DHT, a key/value pair is mapped to one node or to f nodes, where f is the replication factor, depending on the desired redundancy level. This implies that if a key is frequently requested, the nodes responsible for it can be overloaded while the rest of the network is mostly idle, and adding additional machines is not going to improve the situation. It is not uncommon on Twitter to have wildly popular tweets that are retweeted by thousands of users. In the worst case, retweets can be seen as an exponential phenomenon, as all the users following the retweeter are susceptible to retweet it too.
2) Use an application cache as the solution: Adding nodes will not solve the problem, because the number of nodes responsible for a key/value pair will not change. In order to reduce the number of requests, we have decided to add a cache with an LRU replacement strategy at the application level. This solves the retweet problem because now the application, which is in charge of several users, will have the tweet in its cache as soon as one of its users reads the popular tweet. The tweet will stay in the cache because the users frequently make requests to read it. This way we reduce the load put on the nodes responsible for the tweet.
We now have to take into account that values are not immutable: they can be deleted and modified. A naive solution would be to actively poll Beernet to detect changes to the key/value pairs stored in the cache. This would be quite inefficient, as there are several values, like tweets, that almost never change. In order to avoid polling, we need a mechanism that warns us when a change is made to a key/value pair stored in the cache. Beernet, as described in [1], allows an application to register to a key/value pair and to receive a notification when this value is updated. Our application cache will thus register to each key/value pair that it holds, and when it receives a notification from Beernet indicating that a pair has been updated, it will update its corresponding copy. This mechanism has the big advantage of removing unnecessary requests. Notifications are asynchronous, so the copies in the cache can have different values at a given moment, leading to an eventual consistency model for reads. On the other hand, writes do not go through the cache but directly to Beernet, which keeps strong consistency for writes inside Beernet. This is an acceptable trade-off, as we do not need strong consistency for reads inside a social network.
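A minimal sketch of such an application-level LRU cache, using `LinkedHashMap` in access order. The `onUpdate` hook stands in for the Beernet update notification, and the capacity and loader function are assumed parameters, not part of Bwitter's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the application cache: reads are served from the cache with
// LRU eviction; a store notification refreshes the cached copy so that
// active polling is unnecessary.
public class TweetCache {
    private final int capacity;
    private final Map<String, String> cache;

    public TweetCache(int capacity) {
        this.capacity = capacity;
        // Access-order LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > TweetCache.this.capacity;
            }
        };
    }

    // Read-through: serve from the cache, otherwise load from the store.
    public synchronized String read(String key, java.util.function.Function<String, String> load) {
        return cache.computeIfAbsent(key, load);
    }

    // Invoked when the store notifies us that a cached pair was updated.
    public synchronized void onUpdate(String key, String newValue) {
        if (cache.containsKey(key)) {
            cache.put(key, newValue);
        }
    }
}
```

Writes still go directly to the store; only reads pass through the cache, giving the eventual-consistency-for-reads model described above.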
VI. ARCHITECTURE
We will present two different scalable architectures for our application. In both architectures, our application is decomposed into three loosely coupled layers: from top to bottom, the Graphical User Interface (GUI), Bwitter, which handles the operations described in Section II, and the key/value data store. For this last layer we use Beernet, but it could be replaced by any key/value store with similar properties. As a reminder, the data store must provide read/write operations on values and sets, as well as implementing the secrets we described before. This architecture is very modular: each layer can be changed provided it respects the API of the layer above. We now have to decide where Beernet will run. We have two options: either let the Beernet nodes run on the users’ machines, or run them in the cloud, leading to two radically different architectures: the completely decentralised architecture and the cloud-based architecture.
A. Completely decentralised architecture
In a fully decentralised architecture, the user runs a Beernet node and the Bwitter application on her machine. The Bwitter application makes requests directly to this local Beernet node. Ideally, this local Beernet node should not be restricted to the Bwitter application but should also be accessible to other applications. The problem with this approach is that the user can bypass the protection mechanisms enforced at a higher level by accessing the low-level DHT functions of Beernet. Usually this is not a problem, as untrusted users would not know at which keys the data is stored and thus cannot compromise it. But in our case the data has to be at known keys so that the application can dynamically retrieve it. This means that any user understanding how our application works would be able to delete, edit or forge lines, users, tweets and references. This would be a security nightmare.
We tried to tackle this problem with the secrets mechanism we designed to enrich Beernet’s interface. While this prevented users from editing or deleting data they did not create themselves, we could not prevent them from forging elements. To avoid this, we needed a way to authenticate every piece of data posted by a user. There are cryptographic mechanisms to enforce this and ways to efficiently manage the keys, but they are outside the scope of this paper.
Even with those mechanisms in place, we have to enforce security at the DHT level. Beernet uses encryption for the communication between nodes to avoid leaking confidential information. But anyone could add modified Beernet nodes behaving maliciously. Aside from the usual attacks [6], a corrupted node could be modified to reveal all the secrets inside the requests going through it. We thus have to make sure that the code running the Beernet node is not modified, so we need a mechanism that enforces remote attestation, as described in [7]. This can be done by using a TPM, which provides cryptographic code signatures in hardware, on the users’ machines, in order to be able to prove to other Beernet nodes that the client’s node is trustworthy. Until a Beernet node has a way to tell for sure that it can trust another Beernet node, we are in a dead end. Indeed, anyone stealing the secret of another user can erase any data posted by that user.
Assuming that a Twitter session time is short, this can be a problem if our application is the only one running on top of Beernet. Indeed, it will result in nodes frequently joining and leaving the network with short connection times. Each of those changes in the topology of Beernet will modify the keys for which the nodes are responsible, triggering key/value pair reallocation, itself leading to important and undesirable churn. This would not be an ideal environment for a DHT.
B. Cloud-based architecture
With this architecture, the Bwitter and Beernet nodes run in the cloud, which is an adequate environment for scalable and elastic applications. We can thus easily add or remove Bwitter and Beernet nodes to meet the demand, increasing the efficiency of the network. A Bwitter node is a machine running Bwitter, but generally also a Beernet node. This solution also allows us to keep a stable DHT, as nodes are not subject to high churn as was the case in the first architecture we presented.
Using this solution, we do not have all the security issues we had with the fully decentralised architecture. This is because the users no longer have direct access to the Beernet nodes but have to go through a Bwitter node, and can only perform the operations defined in Section II. Furthermore, the communication channel between the GUI and the Bwitter node can guarantee the authenticity of the server and the encryption of the data being transmitted, for instance using HTTPS. Bwitter requires users to be authenticated to access or modify their data. In doing so, we provide data integrity and authenticity because, for instance, Bwitter does not allow a user to delete a tweet that she did not post, or to post a tweet using the username of someone else. The security problem concerning possible revelations of user secrets due to a malicious node is not relevant anymore, as our DHT is fully under our control.
The cloud-based architecture is thus more secure and stable, which is why we finally chose to implement this solution. We now take a closer look at how the layer stack is built. Note that in spite of our research we did not find any information about Twitter’s current architecture, so we are not able to compare both architectures.
As said before, the Beernet layer runs in the cloud. This layer is monitored in order to detect flash crowds, and Beernet nodes are added and removed on the fly to meet the demand.
The intermediate layer, also running in the cloud, is Bwitter; it communicates with Beernet and the GUIs. This layer can be put on the same machine as a Beernet node or on another machine. Normally there should be fewer Bwitter nodes than Beernet nodes. One Bwitter node is associated with a Beernet node, but can be re-linked to another Beernet node if it goes down. Each Bwitter node should be connected to a different Beernet node in order to share the load. In practice, the Bwitter nodes will not be accessible directly; they will be accessed through a fast and transparent reverse proxy that will be in charge of load balancing between Bwitter nodes. At the moment, Bwitter nodes use sessions to identify the users, so the reverse proxy is forced to keep track of the sessions in order to be able to map the same client to the same Bwitter node. We plan to change this behaviour to offer a completely RESTful Bwitter API.
The top layer is the GUI. It connects to a Bwitter node using a secure connection channel that guarantees the authenticity of the Bwitter node and encrypts all the communications between the GUI and the Bwitter node. Multiple GUI modules can connect to the same Bwitter node. The GUI layer is the only one running on the client machine.
C. Elasticity
We previously explained that to prevent the Fail Whale error, the system needs to scale up, allocating more resources to be able to answer an increase in user requests. Once the load of the system gets back to normal, the system needs to scale down to release unused resources. We briefly explain how a ring-based key/value store needs to handle elasticity in terms of data management. We are currently working on making the elastic behaviour more efficient in Beernet.
1) Scale up: When a node j joins the ring between peers i and k, it takes over part of the responsibility of its successor, more specifically all keys from i to j. Therefore, data migration is needed from peer k to peer j. The migration involves not only the data associated with keys in the range ]i, j], but also the replicated items symmetrically matching that range. Other NoSQL databases such as HBase (http://hbase.apache.org) do not trigger any data migration upon adding new nodes to the system, showing better performance when scaling up.
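The responsibility rule for a join can be sketched as a membership test on the half-open ring interval ]i, j]: the keys that migrate from k to j are exactly those whose hash falls in that interval. Plain long ids stand in for hashed keys here; this is an illustration, not Beernet's code.

```java
// Sketch of the migration rule when node j joins between i and k:
// keys in ]i, j] move from the old responsible k to the new node j.
public class RingJoin {
    // True if key belongs to the half-open ring interval ]from, to].
    public static boolean inRange(long key, long from, long to) {
        if (from < to) {
            return key > from && key <= to;
        }
        return key > from || key <= to; // the interval wraps around the ring
    }
}
```

Note that with symmetric replication the same test is applied to each of the f ranges symmetrically matching ]i, j], not just the primary one.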
2) Scale down: There are two ways of removing nodes from the system: by gently leaving and by failing. It is very reasonable to consider gentle leaves in cloud environments, because the system explicitly decides to reduce its size. In that case, it is assumed that the leaving peer j has enough time to migrate all its data to its successor, who becomes the new responsible for the key range ]i, j], i being the predecessor.
Scaling down due to the failure of peers is much more complicated, because the new responsible for the missing key range needs to recover the data from the remaining replicas. The difficulty comes from the fact that the value of the application keys is unknown, since the hash function is not bijective. Therefore, the peer needs to perform a range query, as in Scalaris [8], but based on the hash keys. Another complication is that there are no replica sets based on key ranges, but only on each single key.
VII. IMPLEMENTATION
We have implemented a prototype based on our cloud-based architecture. Sources are freely available at http://bwitter.dyndns.org. We will now detail how we actually implemented it. A full schema of our implementation can be seen in Figure 1.
As explained, our architecture has three main layers. The DHT layer is implemented using Beernet, built in Oz v1.3.2 (http://www.mozart-oz.org/) and enhanced with the secrets mechanism. Beernet is accessible through a socket API, which we used to communicate with the Bwitter layer. An alternative version of the data store layer, used for testing the application, is also made available at http://bwitter.dyndns.org.
Figure 1. Implementation structure scheme
At the top of the Bwitter layer is a Tomcat 7.0 application server (http://tomcat.apache.org) using Java servlets from Java EE. The Bwitter layer is connected to the bottom layer using sockets to communicate with an Oz agent controlling Beernet. The Bwitter nodes are accessible remotely via an HTTP API; eventually we would like to make it fully conform to REST. The Tomcat servers are not accessed directly; they are accessed through a reverse proxy server, in this case nginx (http://wiki.nginx.org), which is reported to support 10k concurrent connections. This nginx server is in charge of serving static content as well as doing load balancing for the Tomcat servers. The load balancing is performed so that messages of the same session are always mapped to the same Tomcat server; this is necessary as authentication is needed to perform some of the Bwitter operations, and we did not want to share the state of the user sessions between the Bwitter nodes for performance reasons. The connection to the web-based API is performed using HTTPS to meet the secure channel requirement of our architecture.
The last layer is the GUI. We decided to implement it as a Rich Internet Application (RIA), using the Adobe Flex technology (http://www.adobe.com/products/flex). This GUI uses the web API we developed to access Bwitter.
VIII. CONCLUSION
Our goal was to build a new system able to withstand flash crowds by relying on an elastic and scalable architecture. This allows us to add resources to face heavier traffic and to avoid wasting resources otherwise.
While the prototype is not yet totally finished, our whole design is scalable, meaning we do not have single absurdly huge operations due to the high number of users one might follow or be followed by. We avoid overloading specific machines because we do not rely on any global keys, and we use our cache mechanism to prevent the retweet problem. Some preliminary scalability tests have been done on Amazon and are encouraging.
During the implementation we also came across two potentially important improvements for key/value stores, namely duplicating the key space using multiple dictionaries and the protection of data via secrets, the latter now implemented in Beernet’s latest release.
REFERENCES
[1] B. Mejías and P. Van Roy, “Beernet: Building self-managing decentralized systems with replicated transactional storage,” IJARAS: International Journal of Adaptive, Resilient, and Autonomic Systems, vol. 1, no. 3, pp. 1–24, July–Sept 2010.
[2] Programming Languages and Distributed Computing Research Group, UCLouvain, “Beernet: pbeer-to-pbeer network,” 2009. [Online]. Available: http://beernet.info.ucl.ac.be
[3] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” SIGCOMM Comput. Commun. Rev., vol. 31, pp. 149–160, August 2001.
[4] A. Ghodsi, L. O. Alima, and S. Haridi, “Symmetric replication for structured peer-to-peer systems,” in Proceedings of the 2005/2006 International Conference on Databases, Information Systems, and Peer-to-Peer Computing, ser. DBISP2P’05/06. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 74–85.
[5] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu, “OpenDHT: A public DHT service and its uses,” SIGCOMM Comput. Commun. Rev., vol. 35, pp. 73–84, August 2005.
[6] G. Urdaneta, G. Pierre, and M. van Steen, “A survey of DHT security techniques,” ACM Computing Surveys, vol. 43, no. 2, Jan 2011.
[7] Wikipedia, “Trusted computing,” http://en.wikipedia.org/wiki/Trusted_computing#Remote_attestation, 2011. [Online; accessed 28-June-2011].
[8] T. Schütt, F. Schintke, and A. Reinefeld, “Scalaris: Reliable transactional p2p key/value store,” in Proceedings of the 7th ACM SIGPLAN Workshop on ERLANG, ser. ERLANG ’08. New York, NY, USA: ACM, 2008, pp. 41–48.