Investigate potential performance improvements in a Real Time Clearing system using Content Delivery Network-based technology

Henrik Erskérs

Degree Project in Computing Science Engineering, 30 ECTS Credits
Spring 2019
Supervisor: Jan Erik Moström
External supervisor: Pål Forsberg
Examiner: Henrik Björklund
Master of Science Programme in Computing Science and Engineering, 300 ECTS Credits

Source: umu.diva-portal.org/smash/get/diva2:1366932/FULLTEXT01.pdf (2019-10-31)

Abstract

The demands that customers and users place on systems are constantly increasing, calling for faster and more robust solutions. Clearing technology in the financial industry is no exception: the system needs to handle information quickly and be responsive. This thesis investigates potential performance improvements from applying Content Delivery Network (CDN) based technology to Cinnober's Real Time Clearing system (RTC). The work is based on constructing a functional CDN node and examining the impact it has on the system with regard to handling reference data: could this approach improve the performance and scalability of the system in terms of reference data handling? Based on results gathered by comparing the node implementation with the original system, there are clear indications of performance improvements. The results show that with the CDN node implementation, fetch times for reference data improved by a factor of six. Using the results gathered in the thesis, a simulation was created to estimate the effect of a fully scaled CDN. The simulation indicated that the implementation could reduce accumulated latency by 44 minutes over a day of use.


Acknowledgements

First of all I would like to thank Cinnober for giving me the opportunity to write the thesis at their office, and for the help they provided in developing it. A special thanks to my supervisor Pål Forsberg at Cinnober for helping me with all the uncertainties and problems I encountered during the thesis. I would also like to thank my supervisor Jan Erik Moström at the university for helping me with questions regarding the thesis and giving feedback on the report. Lastly I would like to thank Linn, family and friends for proofreading the thesis and discussing different ideas and questions.


Abbreviations

CDN - Content Delivery Network
RTC - Real Time Clearing system
SharedData - A collection of reference data in the system
DDoS - Distributed Denial of Service
CSP - Cloud Service Provider
PoP - Point of presence
Legacy code - Code that is not written from scratch; functionality is added on an existing code base


Contents

1 Introduction
  1.1 Background
    1.1.1 Real Time Clearing System
  1.2 Purpose
  1.3 Goal
  1.4 Limitations
    1.4.1 Testing

2 Methods
  2.1 Research
  2.2 Simulations
  2.3 Testing
    2.3.1 Scripts
    2.3.2 Scenarios
    2.3.3 Performance testing
    2.3.4 Monitoring tests
  2.4 Environment setups
  2.5 Analysis and discussion
  2.6 Evaluation methods

3 Theory
  3.1 Cloud
  3.2 Content Delivery Network
    3.2.1 CDN Node
    3.2.2 Scaling
    3.2.3 Benefits
    3.2.4 Disadvantages
  3.3 Data caching
    3.3.1 Caching Strategies
    3.3.2 Eviction policies
  3.4 Reference Data
    3.4.1 Account
    3.4.2 Tradable Instruments

4 Implementations
  4.1 Node implementation
  4.2 Monitoring implementations
    4.2.1 Monitoring by logging
    4.2.2 Monitoring by process observations
  4.3 Simulations and tests
    4.3.1 CDN node active in Stockholm
    4.3.2 CDN node active in Stockholm and Washington D.C.

5 Result
  5.1 Scenarios
    5.1.1 Time to retrieve data from SharedData with 100% miss (no caching)
    5.1.2 Time to retrieve data from SharedData with CDN node
    5.1.3 Test the miss ratio on the caching solution
  5.2 Measure the load on the origin server (SharedData)

6 Discussion
  6.1 Analyze results
  6.2 Fully scaled CDN implementation
    6.2.1 Content Delivery Network Scenario

7 Conclusion

8 Future work


1 Introduction

The traffic load on the global internet increases every year, with a higher amount of data circulating and users demanding faster response times from services. This thesis looks at one approach to reducing server load and improving the throughput of user requests.

1.1 Background

A connection to the internet is something many think of as a human right. We adapt more and more things to the internet, and with that the amount of traffic over the internet is constantly increasing. Total internet traffic has seen dramatic growth during the past two decades. In 1992 the global internet carried approximately 100 GB of traffic per day; ten years later, in 2002, the figure was 100 GB per second. By 2017 the traffic amounted to 46,600 GB per second [4]. The trend indicates that it will continue to grow: the Cisco Visual Networking Index: Forecast and Trends estimates that traffic will reach 150,700 GB per second by the year 2022 [4]. Since most of the transferred data is static [4], different technologies have emerged to reduce the load on the networks; one of them is the Content Delivery Network, or CDN for short. The technology aims to handle the increasing traffic by eliminating unnecessary paths on the network when retrieving data.

1.1.1 Real Time Clearing System

Cinnober has a system for clearing of financial transactions, TRADExpress™ Real Time Clearing (RTC), which is used by clearing houses, a type of financial institution. Given a trade on an exchange, the clearing house inserts itself as the counterparty to both the buyer and the seller. In other words, the clearing house acts as the buyer for the seller and the seller for the buyer. In doing so, each trading party is exposed only to the clearing house, so traders do not need to care about who they are trading with. RTC is a distributed system that spans anywhere from 5 to 50 Java processes depending on the deployment. Its software design aims to excel in robustness, low latency and high throughput for a clearing house's real-time operations and calculations.


1.2 Purpose

RTC contains a master of reference data called SharedData, which clients of the system currently access directly. To relieve stress on RTC and increase throughput, a content delivery network solution is to be investigated. The purpose of the project is to apply the basics of CDN technology to a local point in the system, in a local sandbox environment, and see what the effects are in terms of performance and robustness: how can CDN technology be applied to a clearing system to improve throughput and reduce stress on the system? The thesis will investigate different CDN approaches, determine which solution is best suited to the problem, and tailor that solution to the RTC system by modifying the caching to meet the requirements of the system and obtain optimal performance.

1.3 Goal

The goal of this thesis is to implement a solution based on CDN technology and investigate the difference between the solutions with regard to the system's throughput and load. The thesis aims to address the following questions:

• Can a CDN implementation on the RTC system improve the throughput of reference data? By how much?

• Can a CDN implementation relieve stress on the RTC system?

• Can the implementation improve robustness and scalability of the system?

• Is the CDN implementation beneficial for the RTC system?

1.4 Limitations

The thesis will only cover the implementation of one edge node (CDN node). The reason is that Cinnober does not want to execute their code on various cloud platforms without further investigation of what is deployed in the cloud and how. Based on that, Cinnober agreed that this thesis should cover the impact of one node rather than a complete network. This limitation is also based on available resources, such as time and relevance. With one complete node, a network can later be created; but because of the limited resources and the added complications of such an implementation, the focus lies on implementing one node and testing whether the solution fulfills the goals.

1.4.1 Testing

Because of the limitations, the geographical significance of the edge nodes cannot be tested in a real environment. Thus the node created will be tested in different scenarios to simulate the geographical impact.


2 Methods

This section will cover the different methods used to implement the solution.

2.1 Research

To answer the questions formulated in section 1.3 about the thesis goal, literature studies will be used to validate claims about the more abstract questions, such as robustness, scalability, and the general benefits to the system of using this technology.

2.2 Simulations

As mentioned in section 1.4, the thesis has limitations. Because of them it was not possible to set up the system in a complete cloud-based environment; instead the implementation was created and tested in a local environment. Simulations are applied to the results to demonstrate the effects of a cloud environment. The simulations are based on modifying the users' locations and the geographical impact on retrieval times towards the system. Latency statistics [8], which provide geographical ping statistics for the world, are used in the calculations: the values measured in the different tests are recalculated with latency added based on where the user is located, to illustrate the response times for users interacting with the system. Because the requests sent to the system when requesting reference data are small, ping requests do not differ much in size from them and therefore provide a good measurement of the latency. Thus ping statistics were an ideal option for simulating some of the measured values. The purpose of the simulations is to provide a better understanding of what the results would look like if the system used CDN technology.
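The recalculation described above can be sketched as follows. This is a minimal illustration with made-up round-trip times; the real values would come from the ping statistics referenced in [8], and `estimate` is a hypothetical helper, not part of the thesis code.

```java
// Sketch of the latency recalculation: locally measured retrieval times are
// shifted by the round-trip ping time to a user's region, to estimate what a
// remote user would observe.
import java.util.Map;

public class LatencySimulation {
    // Assumed round-trip ping times in ms from the test site to each region
    // (illustrative values only).
    static final Map<String, Double> RTT_MS = Map.of(
            "Stockholm", 5.0,
            "Washington D.C.", 110.0,
            "Singapore", 180.0);

    // Estimated retrieval time for a user in `region`, given a locally
    // measured time in ms: add the round-trip latency to that region.
    static double estimate(double measuredMs, String region) {
        return measuredMs + RTT_MS.get(region);
    }

    public static void main(String[] args) {
        double measured = 2.0; // locally measured fetch time in ms
        for (String region : RTT_MS.keySet()) {
            System.out.printf("%s: %.1f ms%n", region, estimate(measured, region));
        }
    }
}
```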

2.3 Testing

To answer the questions about throughput, different tests were constructed to compare the results of using the implementation with not using it. To test the relief of stress on the system, different types of stress tests were made to measure the effects of the implementation. During the stress tests, both the load on the master and the load on the CDN node were measured. All of the tests and scenarios were run both with and without the implementation, to better compare the solutions and draw conclusions. For a more detailed description of the testing methods, see the sections below.

2.3.1 Scripts

Scripts were used to simulate many users interacting with the system and to run the different scenarios. Monitoring scripts were created to measure the load of different services during the scenarios, to clarify the effects of the implementation.

2.3.2 Scenarios

To get a better understanding of the performance of the implementation, a number of scenarios were created. The scenarios' purpose was to find and simulate common interactions and use cases on the system.

Time to retrieve data from SharedData

The speed of retrieving data from SharedData is key to making the system as fast and responsive as possible. Therefore this test was created to measure the time to retrieve data with and without the CDN node implementation, e.g. 10,000 requests to retrieve accounts, where each request is timed. The scenario gives an indication of whether the implementation improves the retrieval speed of the system.
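A test of this shape can be sketched as below. `fetchAccount` is a hypothetical stand-in for the actual SharedData or CDN-node request; a real test would call the RTC API there.

```java
// Minimal sketch of the timing scenario: issue N requests against a data
// source and record the elapsed time of each request.
import java.util.ArrayList;
import java.util.List;

public class RetrievalTiming {
    // Hypothetical request; stands in for a SharedData/CDN-node call.
    static String fetchAccount(int id) {
        return "account-" + id;
    }

    // Time `n` requests and return the per-request durations in nanoseconds.
    static List<Long> timeRequests(int n) {
        List<Long> durations = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            fetchAccount(i);
            durations.add(System.nanoTime() - start);
        }
        return durations;
    }

    public static void main(String[] args) {
        List<Long> d = timeRequests(10_000);
        double avgMs = d.stream().mapToLong(Long::longValue).average().orElse(0) / 1e6;
        System.out.printf("average: %.4f ms over %d requests%n", avgMs, d.size());
    }
}
```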

Simulate the effects of a fully scaled CDN

Because this thesis does not cover a fully scaled CDN implementation, the purpose of this scenario is to simulate how one would look and perform. Based on the data provided by the other scenarios, this scenario simulates the potential effect if the implementation were fully scaled. The scenario provides a better understanding of how the CDN would operate in terms of performance and cost.

Test the miss ratio on the caching solution

For the CDN node that was created, this scenario investigates the impact of the miss ratio of the cache: how large a percentage of the content can be stored in the node, and how the performance varies with the amount cached. To test this, the CDN node logs the current caching percentage for each request it receives, and the user process times each request. With this information, the time for each request can be correlated with the caching percentage. The scenario provides indicators of how to optimize the caching percentage for best performance, and also a foundation for weighing the cost of storing more data in the cache against the performance improvements.

Measure the load on the origin server (SharedData)

Because SharedData is a central part of the system, it is important to keep its load as low as possible. The scenario investigates the impact of the implementation on the load of SharedData. It monitors the system and inspects the logs of the origin server to determine what kinds of requests it receives and how long it takes to handle them. The scenario was created to help answer one goal of the thesis: "Can the implementation relieve stress on the system?"

2.3.3 Performance testing

One of the questions the thesis set out to answer concerns the performance aspect of the implementation. Financial systems place heavy emphasis on performance: if information is not updated and distributed instantly, people could unknowingly take risky positions based on old information. Therefore a lot of emphasis has been put on creating tests that illustrate and measure performance with and without the implementation.

2.3.4 Monitoring tests

One important aspect of running the tests is getting the relevant data out of them. To secure the test data provided by the test scenarios, two different gathering methods were used.

Monitoring by logging

The first way of obtaining information from the tests was to add logging at various locations in the system, from logging the current caching percentage to the time to serve one request. Most of the logging was done in the node implementation or in modified Java classes in the RTC system.

Monitoring by process observation

The second way of collecting information from the different test scenarios was process observation. When running the complete system, many of the processes that make up the system affect each other. To get a good understanding of how the implementation affected the different processes in the system, they were monitored during the scenarios.

2.4 Environment setups

The tests described in this chapter were run in a local environment with the following hardware:

• Memory: 32 GB

• Processor: Intel® Xeon® CPU E5-2630 0 @ 2.30 GHz × 12

• Disk: 235.2 GB

• Operating system: Ubuntu 18.10 (64-bit)

The different environments described in the scenarios are simulated, e.g. with simulated delay, both in the form of latency corresponding to geographical distance and as different bandwidth levels.


2.5 Analysis and discussion

The results from the different test scenarios will be analyzed to better understand the effects of the implementation, and how Cinnober can use that information to decide whether to move forward with the implementation or not. The discussion covers how to interpret the results from the different scenarios, what information can be used moving forward, and aspects that the scenarios have not taken into account.

2.6 Evaluation methods

The results will be evaluated partly based on the performance of the solution in different aspects compared to the original implementation. The more abstract results and conclusions will be evaluated based on correlation with literature studies and papers.


3 Theory

3.1 Cloud

"Cloud" can mean different things; it is a broad categorization of many different functionalities. People without a computing science background might think of the cloud as something like image storage, or backups for computers and mobile devices. That is not wrong: one big aspect of the cloud is storage, but that is just scratching the surface of what the cloud really is. Instead of thinking of the cloud as storage, think of it as data centers located around the globe, containing large quantities of servers that can be used as you like. Companies like Google, Amazon and Microsoft offer access to their data centers, often with a "pay-as-you-go" model where you only pay for the resources you use [2]. This provides a more robust and scalable option for companies than buying and maintaining the servers themselves.

3.2 Content Delivery Network

A content delivery network (CDN) is a highly distributed, cloud-based platform of servers optimized to deliver content to users as fast as possible. These networks are commonly used for hosting websites as well as static content such as images and videos. The popularity of these networks continues to grow, and the majority of internet traffic is served through CDNs, including traffic for sites like Facebook, Netflix and Amazon [5]. The basic idea behind the technology is simple. The system illustration in Figure 3.1 shows how a company could set up its network: there are numerous CDN nodes, or PoPs (points of presence), in different geographical locations, and these nodes are the key component of the technology. When a user interacts with the system, the user connects to a CDN node that is geographically close. The CDN node works as an extension of the origin server, storing the most commonly requested information, and its contents change dynamically based on the patterns of the user requests towards that node. Using this structure, a company like Netflix can reach users globally and scale to new regions just by adding additional CDN nodes in those regions. The CDN nodes handle most of the incoming traffic and are supposed to provide faster response times for users requesting information. The structure also reduces load on the origin servers, because the network eliminates an overflow of connections to the origin server.

3.2.1 CDN Node

The key components in a content delivery network are the edge nodes, or PoPs. These nodes are essentially lightweight images of the origin server. Their purpose is to store content that is frequently requested by users, and to reduce the connections and traffic load towards the origin server. If the edge node does not contain the information the user is requesting, it uses the private network to retrieve the information from the origin server (see Figure 3.1).

Figure 3.1: Illustration of a CDN structure. Source: http://www.liberaldictionary.com/wp-content/uploads/2019/02/cdn-3983.jpg

3.2.2 Scaling

One of the key aspects of a CDN, or of having a service running in the cloud, is the ability to scale the service. Depending on incoming traffic the service can scale up or down to match user requests and reduce the cost of the hosted service [12]. Say that Cinnober has its system live in Europe and in Asia; depending on the time of day, traffic in the regions will differ. This enables Cinnober to optimize the resources spent on the system: when it is night in Europe, traffic is low and the service can be scaled down there, and those resources can be put in Asia, where it is morning and traffic is much higher.


3.2.3 Benefits

Running a service on a content delivery network is in most cases beneficial. Things a CDN can contribute include the following [1]:

• Protect the website/service from DDoS attacks

• Reduce bandwidth consumption

• Handle high traffic load

• Improve load speed

These benefits could be very useful for the system. One study found that as little as one second of delay when waiting for a service or website decreases customer satisfaction by 16%, and 40% of users will abandon a website if it takes more than three seconds to load [9]. The network can also help improve the fault tolerance of the system: users can continue to use the system even if the node they are connected to shuts down. The users are redirected to another node or to the origin, and would not notice any difference besides the added latency of connecting to the closest available node.

3.2.4 Disadvantages

As with all technologies, there are downsides to using a CDN. A few of them are [3]:

• Additional complexity of the system

• Additional cost

• Geo-location limitations

Of course, adding new things will usually add complexity to the system, but that does not have to be a bad thing; depending on what you get in return, one needs to decide whether the solution is beneficial to the system or not. When the new technology is a CDN, the cost needs to be taken into account as well: having your system, or parts of it, executing on a CDN is not free. One option is to use a CDN provider, such as Akamai or Cloudflare, which handles the scaling and geographical distribution of your program. Another option is to run the service through a Cloud Service Provider (CSP) such as Google Cloud or Amazon AWS, where you decide how the scaling should work and at which geographical locations instances of the service should be located. But by using a CSP or a CDN company, there is a limitation on the geographical reach of your deployment, where you are limited by their range of deployment areas. For example, by creating your own CDN implementation of some aspect of the system and deploying it on Google Cloud, you have limited reach in a country like Russia [7], and that needs to be taken into account when deciding how to set up a CDN, or whether it is beneficial at all.


3.3 Data caching

The data in a cache is generally stored in fast-access memory such as RAM (random-access memory). The primary purpose of caching is to increase data retrieval performance by reducing the need to access the underlying, slower storage layer. The benefits of caching are increased read throughput, reduced load on the backend, and the elimination of database hotspots [13]. The average memory reference time is [10]:

T = m × Tm + Th + E    (3.1)

Hit ratio = H / (H + M)    (3.2)

where
m = miss ratio = 1 − (hit ratio)
Tm = the time to make a main memory access when there is a miss
Th = the latency: the time to reference the cache
E = various secondary effects, such as queuing effects in multiprocessor systems
H = cache hits
M = cache misses

We want to achieve as small an m as possible, to minimize the time spent accessing main memory.
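As a small numeric illustration of equations (3.1) and (3.2), with made-up values for the access counters and timings:

```java
public class CacheMath {
    // Average memory reference time, equation (3.1): T = m*Tm + Th + E.
    static double avgReferenceTime(double m, double tm, double th, double e) {
        return m * tm + th + e;
    }

    // Hit ratio, equation (3.2): H / (H + M).
    static double hitRatio(long hits, long misses) {
        return (double) hits / (hits + misses);
    }

    public static void main(String[] args) {
        long hits = 90, misses = 10;                  // made-up counters
        double m = 1.0 - hitRatio(hits, misses);      // miss ratio = 0.1
        // Assume Tm = 100 ns main-memory access, Th = 5 ns cache latency, E = 0:
        System.out.println("T = " + avgReferenceTime(m, 100.0, 5.0, 0.0) + " ns");
    }
}
```

Note how a hit ratio of 0.9 already brings the average close to the cache latency itself, which is the entire point of keeping m small.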

3.3.1 Caching Strategies

There are many different types of caching strategies; below are the two most interesting for this project.

Cache-Aside

This strategy is also referred to as lazy loading, because it loads data lazily on the first read. In this strategy the application talks with both the cache and the database (see Figure 3.2). The design works as follows: when the service requests some data, it first checks whether the cache has it, and depending on the answer there are two possible scenarios (depicted in Figures 3.2 and 3.3). Either the cache returns the data and the service can return it instantly, or the cache does not have the data; in that case the service retrieves the data from the database (or wherever the data is stored) and then saves it to the cache.
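The flow above can be sketched as follows. This is a minimal illustration, not the thesis implementation; the `database` map is a hypothetical stand-in for the real storage layer.

```java
// Minimal sketch of cache-aside: the service itself checks the cache and,
// on a miss, loads from the backing store and populates the cache.
import java.util.HashMap;
import java.util.Map;

public class CacheAside {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database;

    CacheAside(Map<String, String> database) {
        this.database = database;
    }

    String get(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;                 // hit: return the cached copy
        }
        value = database.get(key);        // miss: the service queries the database
        if (value != null) {
            cache.put(key, value);        // ...and fills the cache itself
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("account:1", "Alice");
        CacheAside service = new CacheAside(db);
        System.out.println(service.get("account:1")); // miss, then cached
        System.out.println(service.get("account:1")); // hit
    }
}
```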

Read Through

This strategy is similar to cache-aside; both are lazy loading, meaning the data is only loaded once it is requested. The main difference is that with read-through the service always goes through the cache, as depicted in Figure 3.4. The service asks the cache for some data, and depending on whether the cache has it or not, the cache either returns it directly or requests it from the database; if the data is not present in the cache, the request simply has a prolonged response time. In short, with cache-aside the service is responsible for communicating with the cache and the database, while with read-through the cache is responsible for communicating with the database. A benefit of this strategy is that the cached data cannot diverge from that of the database.
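In code, the contrast with cache-aside is that the caller only ever sees the cache. A minimal sketch, where the `loader` function stands in for the database lookup:

```java
// Minimal sketch of read-through: the service only talks to the cache, and
// the cache itself loads missing entries from the backing store.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ReadThroughCache {
    private final Map<String, String> entries = new HashMap<>();
    private final Function<String, String> loader; // e.g. a database lookup

    ReadThroughCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // The single entry point: on a miss, the cache loads and stores the value.
    String get(String key) {
        return entries.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        Map<String, String> db = Map.of("instrument:42", "OMXS30 future");
        ReadThroughCache cache = new ReadThroughCache(db::get);
        System.out.println(cache.get("instrument:42")); // miss: loaded via loader
        System.out.println(cache.get("instrument:42")); // hit: served from cache
    }
}
```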


Figure 3.2: Communication path in a hit scenario using cache-aside

Figure 3.3: Communication path in a miss scenario using cache-aside

Figure 3.4: Communication path on cache hit and miss in the read-through strategy


3.3.2 Eviction policies

Every cache has a finite amount of memory at its disposal, and if the cache uses up its memory pool, something has to be removed. The rules governing what is removed are called eviction policies; two of them are described below.

Least Recently Used (LRU)

This caching algorithm keeps recently used items near the top of the cache. Whenever an item is accessed, LRU places it at the top. When the cache limit has been reached, the items accessed least recently are removed, starting from the bottom of the cache. This can be an expensive algorithm, as it needs to keep "age bits" that show exactly when each item was accessed.
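In Java, this policy can be sketched very compactly on top of the standard library; this is an illustrative example, not the eviction code used in the thesis:

```java
// A compact LRU cache sketch using java.util.LinkedHashMap in access order:
// the map moves each accessed entry to the "top", and removeEldestEntry
// evicts the least recently used entry once capacity is exceeded.
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict when over capacity
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");        // "a" is now most recently used
        cache.put("c", "3");   // evicts "b", the least recently used
        System.out.println(cache.keySet()); // [a, c]
    }
}
```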

Least Frequently Used (LFU)

The LFU algorithm uses a counter to keep track of how many times each entry has been accessed, so that when entries are removed from the cache, the entry with the lowest count is removed first.
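A simple, deliberately unoptimized sketch of this counter-based policy (a production LFU would track frequencies more efficiently than a linear scan):

```java
// LFU sketch: each get/put bumps a per-key counter, and eviction removes
// the key with the lowest count.
import java.util.HashMap;
import java.util.Map;

public class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> counts = new HashMap<>();

    public LfuCache(int capacity) {
        this.capacity = capacity;
    }

    public V get(K key) {
        if (!values.containsKey(key)) return null;
        counts.merge(key, 1L, Long::sum); // count the access
        return values.get(key);
    }

    public void put(K key, V value) {
        if (!values.containsKey(key) && values.size() >= capacity) {
            // Evict the least frequently used key.
            K victim = counts.entrySet().stream()
                    .min(Map.Entry.comparingByValue())
                    .get().getKey();
            values.remove(victim);
            counts.remove(victim);
        }
        values.put(key, value);
        counts.merge(key, 1L, Long::sum);
    }

    public boolean contains(K key) {
        return values.containsKey(key);
    }
}
```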

3.4 Reference Data

Reference data is information used to structure and constrain other information. An example in mathematics is π; another is the structure of calendars, such as a list of valid months and days of the week. This form of data rarely changes and can therefore be cached to improve retrieval. In computer science, reference data is often defined as a special subset of master data, where master data is where all the information is stored and changed. This subset is used for classification, like postal codes, financial hierarchies or countries. The two primary types of reference data looked at during the project are described below.

3.4.1 Account

The accounts that exist within a clearing system do not change often. Each account follows the same structure (see Table 3.1); only the content within them differs. Because of this, the account structure is going to be cached.

3.4.2 Tradable Instruments

The different types of tradable instruments of a clearing house rarely change, and they share the structure shown in Table 3.2; so instead of retrieving that information every time, it will be cached.


Table 3.1 The structure of an account

Account
• State
• Clearing Member Code
• Trading Member Code
• Name
• ID
• Classification
• Omnibus/ISA
• Gross/Net
• Automatic Give Up Trading Member Code
• Automatic Give Up Trading Member
• Automatic Give Up Customer Confirmation Code
• Automatic Give Up Supplementary Code

Table 3.2 The structure of a tradable instrument

Tradable Instrument
• State
• Instrument ID
• Name
• Symbol
• Underlying ID
• Instrument Type
• Currency


4 Implementations

The implementations described below are based on the information provided in sections 2.3, 3.2.1 and 3.3, which describe the different aspects that the solution needs to consist of and handle. The implementation can be broken up into two parts: the implementation of the CDN node itself, and the code to monitor the node and the overall system.

4.1 Node implementation

Because the implementation was done on legacy code provided by Cinnober, a complete node was not implemented during the thesis. Instead, modifications were made to the collection of classes that make up their API instance, and the classes that were missing for the implementation to work were created. When working with legacy code there can be limitations constraining the possibilities of adding new functionality to the system, and some approaches were not optimal because of the structure of the existing code base. The API instance was modified to be usable as a CDN node, primarily by creating a caching layer, depicted in Figure 4.1. As seen in Figure 4.1, the caching layer intercepts the data flow of the API instance, and because of the added complexity of adding functionality on top of legacy code, the read-through strategy, seen in Figure 3.4, was the best option when implementing the caching layer.
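The read-through behaviour can be sketched as follows. The class name `ReadThroughCache` and the origin callback are hypothetical stand-ins for the modified API instance and the call to the SharedData node; they are not names from the RTC code base:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Read-through strategy sketch: the node answers from its cache on a hit
// and only forwards the request to the origin server on a miss, storing
// the fetched value before returning it.
public class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> origin; // stands in for the SharedData call

    public ReadThroughCache(Function<K, V> origin) { this.origin = origin; }

    public V get(K key) {
        // computeIfAbsent implements read-through: miss -> fetch -> store
        return cache.computeIfAbsent(key, origin);
    }

    public static void main(String[] args) {
        int[] originCalls = {0};
        ReadThroughCache<String, String> node = new ReadThroughCache<>(id -> {
            originCalls[0]++;                  // simulated trip to the origin
            return "account-data-for-" + id;
        });
        node.get("ACC-1"); // miss: fetched from the origin, then cached
        node.get("ACC-1"); // hit: served from the node's cache
        System.out.println(originCalls[0]);    // 1
    }
}
```

The appeal of read-through on a legacy code base is that callers keep a single entry point; the caching layer decides transparently whether the origin is contacted.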

4.2 Monitoring implementations

One important aspect of the work was the monitoring functionality. Even though it did not contribute to the implementation of the system itself, it made it possible to observe the effects of the implementation.

4.2.1 Monitoring by logging

Logging runtime information was the biggest source of information gathering; this was done by writing relevant information from different Java classes to a file. The logged files could later be used when running the simulations, by modifying the values based on the scenario. The logged files were also used to compare different measured values against each other.

4.2.2 Monitoring by process observations

The other source of information gathering was process monitoring scripts. These scripts monitored different key processes in the system to observe how they behaved while the different scenarios were run.


Figure 4.1: Data flow of the implemented CDN node

4.3 Simulations and tests

To understand how the system behaves in its original state, the scenarios mentioned in section 2.3 were run on the system. The results gathered from these tests provide a solid reference point to be compared with the results from the CDN node implementation. With the two test cases described below, the goal is to provide a solid foundation for illustrating the effects of adding one CDN node, by showing the retrieval times as well as the load on the origin server, and the potential benefits of being able to add new CDN nodes in different geographical locations.

4.3.1 CDN node active in Stockholm

To illustrate the effect of an active CDN node, one node was started in Stockholm. The scenarios were run to investigate the differences in system response time. The test illustrates the performance for users based in Stockholm and Washington D.C. The performance from Washington D.C. is calculated from the results measured in Stockholm with a latency delay added, as mentioned in section 2.2. The test shows the impact of running one CDN node in the system.


4.3.2 CDN node active in Stockholm and Washington D.C.

This test is a simulated extension of the previously described test. In this test another CDN node is active in Washington D.C., which becomes the access point to the system for users located in that area. The goal of the test is to illustrate the effect on retrieval time for users located in the Washington D.C. area.


5 Result

The main goal of the different scenarios is to obtain results that can be used to answer the goals set for the master thesis, or to provide data from which conclusions can be drawn. This chapter covers the measured values obtained from testing the different scenarios described in section 2.3.

5.1 Scenarios

The results from the different scenarios tested are described and analysed below.

5.1.1 Time to retrieve data from SharedData with 100% miss (no caching)

When running tests without any modifications to the system, we can see in Figure 5.1 that retrieving 10 000 Accounts from SharedData takes around 15-17 seconds, which means that retrieving one Account takes around 0.0016 seconds, or 1.6 milliseconds (ms). This assumes that the user interacting with the system is located in Stockholm, where the node is active. When comparing the impact of the user location, seen in Figure 5.2, we can clearly see in Figure 5.2a the added latency when sending from Washington D.C.: the average fetching time increases from 1.6 ms to 103.9 ms per request.

Figure 5.1: The retrieval of reference data without caching


(a) From Washington D.C. (simulated). (b) From Stockholm.

Figure 5.2: The retrieval of reference data without caching

5.1.2 Time to retrieve data from SharedData with CDN node

This scenario runs the same test suite as the scenario without caching, but with the CDN node implementation active in different locations.

Active CDN node in Stockholm

When running a CDN node in Stockholm, we can see that the difference is immense compared with not running the implementation. This is shown in Figure 5.4, where Figures 5.2b and 5.3b are plotted against each other to clearly illustrate the effect. The fetching time for the reference data drops from an average of 15-17 seconds for fetching 10 000 Accounts to an average just under 3 seconds. This means that the fetching time per account is roughly reduced from 1.6 milliseconds to 0.2638 milliseconds, a speedup by a factor of about 6. Comparing the simulated times from Washington D.C. in Figures 5.2a and 5.3a, we can see that with the active node in Stockholm there are improvements, but they are not significant.

(a) Washington D.C. (simulated). (b) Stockholm.

Figure 5.3: The retrieval of reference data with a CDN node active in Stockholm.


Figure 5.4: The retrieval of reference data with and without caching.

Active CDN node in Stockholm and Washington D.C

Another CDN node was then added, this time in Washington D.C. By adding this node, the retrieval time is significantly reduced, as shown in Figure 5.5. We can see that the retrieval time is higher at the start of the simulation and then decreases steadily. The reason for the higher retrieval time at the start is cache misses: the node needs to retrieve the data from the origin server located in Stockholm. By adding the additional node in Washington D.C., the average retrieval time drops from 1025 seconds to around 5 seconds, meaning the average retrieval time is reduced by a factor of about 205.

5.1.3 Test the miss ratio of the caching solution

In this scenario, the correlation between the retrieval time for the reference data and the cache hit ratio is investigated. From the tests we can see in Figure 5.6 how the retrieval time for the reference data correlates with the cache hit ratio. The figure shows that the retrieval time decreases drastically as the cache hit ratio increases: at a hit rate of 86% the retrieval time is approximately 1.2 ms, while at a hit rate of 98-100% it is around 0.2-0.25 ms, a time reduction by a factor of about 6. The results were gathered by measuring retrieval time from Stockholm towards the CDN node at the same location.
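The shape of this correlation follows the standard average-access-time relation (cf. [10]): average time = hit ratio × hit cost + (1 − hit ratio) × miss cost. The sketch below uses the approximate hit and miss costs from sections 5.1.1-5.1.2 as illustrative inputs; it is a simple model and is not expected to reproduce the measured curve in Figure 5.6 exactly:

```java
// Average retrieval time as a function of cache hit ratio. The hit and
// miss costs are illustrative (roughly 0.26 ms on a hit, 1.6 ms on a
// miss, taken from the measurements in sections 5.1.1-5.1.2).
public class HitRatioModel {
    static double averageTimeMs(double hitRatio, double hitMs, double missMs) {
        return hitRatio * hitMs + (1.0 - hitRatio) * missMs;
    }

    public static void main(String[] args) {
        for (double ratio : new double[] {0.86, 0.98, 1.00}) {
            System.out.printf("hit ratio %.0f%% -> %.3f ms%n",
                    ratio * 100, averageTimeMs(ratio, 0.26, 1.6));
        }
    }
}
```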


Figure 5.5: The retrieval of reference data with CDN node in Stockholm andWashington D.C.

Figure 5.6: The retrieval of reference data in comparison to the hit ratio of thecache

5.2 Measure the load on the origin server (SharedData)

When creating the scenario of measuring the impact on the origin server with and without the CDN implementation, a couple of different aspects were investigated. When monitoring the origin server, which in this case is the SharedData node, we can see that the load on the server is considerably higher when the CDN node is not active. In Figure 5.7 we can see that the CPU usage spikes when the different requests arrive at the server, and reaches a fairly stable level of around 150-200%. This puts strain on the server and increases the risk of slower handling of external requests that arrive while the original requests are being processed. We can also see in the figure that the CPU load when the CDN node is active is significantly lower, averaging around 2-4%, as highlighted in Figure 5.8. The origin server, which in this case is called SharedData, contains all the reference data for the clearing system, so reserving CPU capacity is essential to ensure that resources are available when needed.

Figure 5.7: CPU usage of SharedData with and without CDN Node.

The memory consumption of the SharedData node is expected to be fairly constant when it is accessed through an active CDN node, because the information is stored at the edge node and fewer requests reach the origin server. The exact behaviour depends on the available storage capacity of the edge node and the size of the information stored at the origin server.


Figure 5.8: CPU usage of SharedData with and without CDN Node.

Figure 5.9: Memory usage of SharedData with and without CDN Node.


When a user wants to retrieve information about an account, the user sends a getAccount request to the access point of the system. The access point then propagates the request down to the SharedData node, which contains all the reference data in the clearing system. The getAccount request can in turn trigger two additional requests, getTradingMembers and getClearingMembers, depending on the type of account requested. All these requests are sent to the SharedData node and processed; Figure 5.10 shows the logs for the incoming requests sent during a scenario. Figure 5.11, the moving average of the first 100 requests from Figure 5.10, shows that the different requests vary between 0.15 and 0.4 milliseconds, with a few deviations. Given these request-handling durations, it is clear that performance improvements can be made to reduce the time spent handling requests. When the correlation between requests is taken into account, the time begins to add up, since one request can add around one extra millisecond. One millisecond might not sound like much, but keeping in mind that one request can trigger two or more additional requests, the extra delay starts to stack up.

Figure 5.10: The times for the different requests handled by the SharedData node.


Figure 5.11: The moving average of the first 100 requests on the SharedData node.


6 Discussion

This chapter analyzes the results and discusses how a fully scaled implementation would look in a cloud environment.

6.1 Analyze results

From the results in chapter 5 we can see the impact of the implementation on the system. The results reflect the impact of one node at one access point for users interacting with the system. This means that the results do not depict a complete CDN implementation, but illustrate the effects on the system with one CDN node active, where the CDN node covers a geographical area as seen in Figure 3.1 (bottom right corner, where the node covers the area of Australia). A single CDN node can vary in composition depending on the geographical area and the traffic load in that area: at one location the node can be small, handling the traffic and caching for that area, while in areas where the system load is high the CDN node would be much larger to handle all the user requests. Depending on the load on the node, it scales the number of active virtual machine instances handling user requests up and down to provide a better experience for the users.

6.2 Fully scaled CDN implementation

As mentioned in section 6.1 above, a single PoP or CDN node can be constructed in different ways in terms of scale. The scale of the node can be determined based on budget restrictions or the load on the system. If the node is structured to keep costs down, it would not scale up when experiencing high traffic load; the consequence would be increased response times on requests. Even so, the implementation would probably perform better than the previous implementation without the scalable edge nodes. Another approach is to treat each node as a scalable cluster of virtual machines. This way the implementation will always meet the requirements of the users, but the solution would cost significantly more than the previously mentioned one.

6.2.1 Content Delivery Network Scenario

From the results in chapter 5 and section 6.1, we can construct a fully scaled implementation of a CDN, using the gathered results as reference points for the potential implementation. We need to start by looking at what resources the implementation needs in order to function properly.

Background

The goal of this scenario is to get a better understanding of what a fully scaled CDN implementation on Cinnober's RTC system, with regard to handling reference data, would look like. Cinnober has created clearing systems for many well-known markets, and the structure of a clearing house can of course vary, as can the number of clearing members, trading members and tradable instruments. To give some sense of the volumes at the clearing houses: one clearing house has roughly 45 000 tradable instruments, 5 000 accounts, 100 clearing members and 400 trading members. All this information is reference data, and information that could potentially be handled by a CDN.

Cloud deployment

The focus during the thesis has been to ignore existing packaged CDN solutions and instead look at how Cinnober could construct their own CDN based on their needs. Looking at two big leaders among cloud service providers, Google and Amazon, both offer the geographical coverage that would be of interest in this scenario, where the CDN nodes could be located in a wide range of locations suitable for the clearing system [6, 7]. Spot prices (2019-04-16) for a standard virtual machine (VM) with one virtual CPU (vCPU) and 3.75 GB of memory are around $0.0523/hour at Google [11], and a virtual machine with two vCPUs and 3.75 GB of memory costs around $0.0554/hour at Amazon [2].

System interactions

To understand how the information stored in the system is used by the users, we need to lay out some basic behaviors of users interacting with the system and how they correlate with fetching reference data. Between 2 000 and 5 000 users interact with the system daily. This might not seem like a lot of users, but when looking at how one user's presence in the system correlates with how much reference data needs to be requested, we begin to see the potential for improvement. When a user interacts with the clearing house user interface, reference data is requested constantly: whenever a user loads the interface, all the accounts and tradable instruments need to be fetched. That is just the initial visit to the user interface; the user may then want to look at a wide range of instruments or accounts that operate within the clearing house.

Setup

By assuming a user's interaction patterns during a day, we can get a better understanding of the effect the solution could have on the system. Users can of course interact differently with the system, and a user's interaction patterns can vary on a daily basis, but for this scenario we assume that all users interact in the same way. The calculations below concern the cost of running this implementation for one day. Assume that the clearing house is located in central Europe; we then set up the CDN nodes in Europe (London), Asia (Tokyo, Singapore) and the USA (Virginia) [6, 7]. With these four nodes we cover a large geographical area. The base cost for running these instances would be:

TC = Instances × 24 × HP    (6.1)

where TC is the total cost and HP is the hourly spot price of the VM instance. From equation 6.1 we get a total cost of $5.3184 for one day. This value does not take into account potential scaling of the different edge nodes.
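As a check, equation 6.1 can be evaluated directly for the four nodes at the quoted AWS spot price:

```java
// Equation 6.1 evaluated for the four-node setup (London, Tokyo,
// Singapore, Virginia) at the quoted AWS spot price of $0.0554/hour.
public class DailyCost {
    static double totalCost(int instances, double hourlyPrice) {
        return instances * 24 * hourlyPrice; // TC = Instances x 24 x HP
    }

    public static void main(String[] args) {
        System.out.printf("$%.4f%n", totalCost(4, 0.0554)); // $5.3184
    }
}
```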

Calculations

Given the information from section 6.2.1, we assume that 1 000 of the users interacting with the system are located in the USA, and that each user interacts with the system as follows during one hour of the day:

• Website refresh, resulting in fetching all reference data.

• Fetches 10 000 instruments.

• Checking the latest trades = 10 000 rows, where each row triggers two additional requests.

• Checking trade history → fetches all accounts and instruments, and each row in the 10 000 results triggers six additional requests.

Based on that information, we construct the following equation for a user's behavior during one hour of the day:

TR = 1×FA + 10000×I + 10000×T + 10000×TH + 1×FA + 10000×P    (6.2)

Row = Account×3 + Instrument×3    (6.3)

Row = Account + Instrument    (6.4)

Row = Account + TradingMember + ClearingMember    (6.5)

where
TR = total number of requests
FA = fetching all reference data
I = instrument
TH = trade history, where each row is defined by equation 6.3
T = trade, where each row is defined by equation 6.4
P = positions, where each row is defined by equation 6.5

The users' total requests during one hour amount to around 220 000 requests/hour, or 1 980 000 requests/day. Using the data provided in chapter 5 and inserting the measured request times with and without the CDN node, we get the following values: without the CDN implementation, 3168 seconds (52.8 minutes) of retrieval time during one day; with the CDN node implementation, 522.3 seconds (8.7 minutes). With the implementation, the system would thus reduce latency and waiting time for the users by roughly 44 minutes during one day of work.
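These figures can be reproduced with a short calculation. Two details are assumptions inferred from the stated totals rather than given explicitly: a nine-hour trading day, and FA consisting of all 45 000 instruments plus all 5 000 accounts:

```java
// Reproduces the request counts and latency figures of section 6.2.1.
// Assumptions: FA = 45 000 instruments + 5 000 accounts, and a 9-hour day
// (both inferred from the stated 220 000/hour and 1 980 000/day totals).
public class LatencySaving {
    static int requestsPerHour() {
        int fetchAll = 45_000 + 5_000;  // FA (assumed composition)
        return 2 * fetchAll             // FA appears twice in equation 6.2
                + 10_000                // instruments (I)
                + 10_000 * 2            // trades (T): account + instrument per row
                + 10_000 * 6            // trade history (TH): 3 accounts + 3 instruments per row
                + 10_000 * 3;           // positions (P): account + trading + clearing member
    }

    public static void main(String[] args) {
        int perDay = requestsPerHour() * 9;      // assumed nine-hour day
        double noCdnSec = perDay * 1.6e-3;       // 1.6 ms/request without CDN
        double cdnSec = perDay * 0.2638e-3;      // 0.2638 ms/request with CDN
        System.out.println(requestsPerHour());   // 220000
        System.out.println(perDay);              // 1980000
        System.out.printf("saved: %.1f min%n", (noCdnSec - cdnSec) / 60.0); // saved: 44.1 min
    }
}
```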


Scenario limitations

It is important to note that the users' interactions with the system are approximated based on limited knowledge, so the scenario could be an underestimation just as much as an overestimation.


7 Conclusion

The purpose of the thesis was to find answers to the questions and goals described in section 1.3. What conclusions can be drawn based on chapters 5 and 6, and can all the questions be answered?

The first question the thesis wanted to answer was whether a CDN implementation on the RTC system can improve the throughput of reference data. This is clearly shown in chapter 5, where the implementation outperforms the system without it by a factor of 6. Based on the fully scaled scenario described in section 6.2, we can also see what this would mean during a complete day of operation.

When it comes to whether the implementation can relieve stress on the system, we can clearly see in Figure 5.7 that the load on the SharedData node decreased drastically with the implementation, which is a very important aspect of the solution. The SharedData node communicates with other parts of the system, and it is important that the node is not slowed down by reference data requests when business-crucial communication needs to be propagated to it.

The question of whether the solution is beneficial to the RTC system may be the most complex one to answer conclusively. Based on the results and the information provided in the discussion chapter, the solution would increase throughput and relieve stress on the system, which would be very beneficial. So in theory this implementation would be very beneficial relative to the cost of the cloud resources needed. Whether the implementation is beneficial to the RTC system as a whole depends on how you look at it: from a technical standpoint it would improve the system in regards to scalability, throughput and stress relief, but it would also add complexity to the RTC system as a whole, as well as costs to set up and maintain. This is something Cinnober needs to investigate further to determine whether the solution is cost effective and provides a business advantage for the company.

The thesis answered the questions it set out to answer, and regarding the question of benefit it provides a basis for deciding whether the company wants to investigate this form of solution further.


8 Future work

One question we might ask at this stage is: are we done here? Is the work accomplished? Yes and no. With respect to what the master thesis set out to accomplish, the answer is yes: I managed to find the answers I was looking for when I started the thesis and set the goals for the project. On the other hand, this paper is only a pilot study to find out whether the approach and the potential solution to the problem are feasible to implement, and if so, what the benefit of such a solution would be. With that answer I have also motivated that the work is not completely done. Future work, for example in the form of another master thesis extending this one, could create a more technical implementation based on this thesis, investigating the optimal design of a fully scaled content delivery network for handling the reference data in Cinnober's RTC system.


Bibliography

[1] Akamai. What are the benefits of a CDN. url: https://www.akamai.com/us/en/cdn/what-are-the-benefits-of-a-cdn.jsp.

[2] Amazon EC2 Pricing - Amazon Web Services. url: https://aws.amazon.com/ec2/pricing/ (visited on 04/16/2019).

[3] Josh Carlyle. What Are the Advantages and Disadvantages of Using a CDN? Nov. 2018. url: https://www.colocationamerica.com/blog/cdn-advantages-and-disadvantages.

[4] Cisco Visual Networking Index: Forecast and Trends, 2017–2022. Jan. 2019. url: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html.

[5] Cloudflare. What is a CDN? url: https://www.cloudflare.com/learning/cdn/what-is-a-cdn/.

[6] Global Cloud Infrastructure — Regions Availability Zones — AWS. url:https://aws.amazon.com/about-aws/global-infrastructure/.

[7] Global Locations - Regions Zones — Google Cloud. url: https://cloud.google.com/about/locations/.

[8] Global Ping Statistics. url: https://wondernetwork.com/pings.

[9] Viki Green. "Impact of slow page load time on website performance". In: Medium (Jan. 24, 2016). url: https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a (visited on 02/04/2019).

[10] Orion Sky Lawlor. Performance Modeling: Amdahl, AMAT, and Alpha-Beta. url: https://www.cs.uaf.edu/2011/spring/cs641/lecture/04_05_modeling.html.

[11] Pricing — Compute Engine Documentation — Google Cloud. url: https://cloud.google.com/compute/pricing (visited on 04/16/2019).

[12] Scaling Based on CPU or Load Balancing Serving Capacity — Compute En-gine Documentation — Google Cloud. url: https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing.

[13] What is Caching and How it Works — AWS. url: https://aws.amazon.com/caching/.
