
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Reinforcement Learning for Uplink Power Control

ALAN GORAN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Reinforcement Learning for Uplink Power Control

ALAN GORAN

Examiner: Joakim Jaldén
Academic Supervisor: Marie Maros

Industrial Supervisor at Ericsson: Roy Timo

Master's Thesis
School of Electrical Engineering and Computer Science

Royal Institute of Technology, SE-100 44 Stockholm, Sweden
Stockholm, Sweden 2018


Abstract

Uplink power control is a resource management function that controls the signal's transmit power from a user device, i.e. a mobile phone, to a base-station tower. It is used to maximize the data-rates while reducing the generated interference.

Reinforcement learning is a powerful learning technique that has the capability not only to teach an artificial agent how to act, but also to create the possibility for the agent to learn through its own experiences by interacting with an environment.

In this thesis we have applied reinforcement learning to uplink power control, enabling an intelligent software agent to dynamically adjust the user devices' transmit powers. The agent learns to find suitable transmit power levels for the user devices by choosing a value for the closed-loop correction signal in uplink. The purpose is to investigate whether or not reinforcement learning can improve the uplink power control in the new 5G communication system.

The problem was formulated as a multi-armed bandit at first, and then extended to a contextual bandit. We implemented three different reinforcement learning algorithms for the agent to solve the problem. The performance of the agent using each of the three algorithms was evaluated by comparing the performance of the uplink power control with and without the agent. With this approach we could discover whether the agent improves the performance or not. From simulations, it was found that the agent is in fact able to find a value for the correction signal that improves the average data-rate, or throughput measured in Mbps, of the user devices' connections. However, it was also found that the agent does not make a significant contribution regarding the interference.


Referat

Förstärkningsinlärning för Uplink Effektstyrning

Uplink power control is a resource management function that controls the transmission power of the signal from a user device, that is, a mobile phone, to a base-station tower. It is used to maximize the data rates while reducing the generated interference.

Reinforcement learning is a powerful learning technique. It has the ability not only to teach an artificial agent how to act, but also to create the possibility for the agent to learn on its own, by interacting with an environment and gathering its own experience.

In this thesis we have applied reinforcement learning to uplink power control, which enables an intelligent software agent to dynamically adjust the transmission power of the user devices' signals. The agent learns to find suitable transmission powers for the devices by choosing a value for the closed-loop correction signal in the uplink. The purpose is to investigate whether reinforcement learning can improve uplink power control in the new 5G communication system.

The problem was first formulated as a multi-armed bandit and was then extended to a contextual bandit. We implemented three different reinforcement learning algorithms for the agent to solve the problem. The performance of the agent with each of the three algorithms was evaluated by comparing the performance of the uplink power control with and without the agent. With this approach we could determine whether the agent improves the performance or not. From our simulations it was found that the agent can in fact find a value for the correction signal that improves the data rate, the throughput measured in Mbps, of the user devices' average connections. However, it was also found that the agent does not make a significant contribution regarding the interference.


Contents

1 Introduction
    1.1 Research Question
    1.2 Outline/Overview

2 Uplink Power Control
    2.1 Power Control in Telecommunications
    2.2 Open Loop And Closed Loop Power Control
    2.3 Deployment of UEs in Different Environments
    2.4 Thesis Problem

3 Reinforcement Learning
    3.1 Machine Learning
    3.2 Reinforcement Learning
    3.3 Multi-Armed Bandits
    3.4 Exploration And Exploitation Dilemma
    3.5 Finite Markov Decision Process
        3.5.1 Reward Function And Goals

4 Method
    4.1 Non-Associative Search
        4.1.1 A k-Armed Bandit Problem in Uplink Power Control
    4.2 Implemented Methods
        4.2.1 Epsilon Greedy Algorithm
        4.2.2 Upper Confidence Bound Action Selection Algorithm
        4.2.3 Gradient Bandit Algorithms
    4.3 Associative Search
    4.4 Reward Functions
        4.4.1 Local Reward
        4.4.2 Global Reward
    4.5 Simulation Setup
        4.5.1 Evaluating The Performances

5 Results
    5.1 Experiment 1: RL Using Local Reward
        5.1.1 Results
        5.1.2 Interpretation of Results
    5.2 Experiment 2: RL Using Global Reward
        5.2.1 Results
        5.2.2 Interpretation of Results
    5.3 Comparison between Local and Global Results
    5.4 Experiment 3: Interference
        5.4.1 Results
        5.4.2 Interpretation of Results

6 Discussion and Conclusion
    6.1 Discussion
        6.1.1 Reinforcement Learning
        6.1.2 Global Reward Function Failure
        6.1.3 Other Reward Functions
        6.1.4 Interference Not Found
    6.2 Conclusion
    6.3 Future Work
    6.4 Ethics

Bibliography


Abbreviations

AI Artificial Intelligence

ML Machine Learning

MDP Markov Decision Processes

RL Reinforcement Learning

UCB Upper Confidence Bound

UPC Uplink Power Control

UE User Equipment

SINR Signal-to-Interference and Noise Ratio

CDF Cumulative Distribution Function


Chapter 1

Introduction

The world is on the verge of a new age of interconnection as the fifth generation of wireless systems, known as 5G, is expected to launch in 2020. With its cutting-edge network technology, 5G offers connections that are tens of times faster and more reliable than current ones. By providing the infrastructure needed to carry large amounts of data, the technology will bring us a big step closer to powering the Internet of Things [1].

Ericsson is one of the leading companies in telecommunications, and it is at Ericsson that this thesis has been carried out. With this thesis we wish to improve one particular area, namely the scheduling decisions in future 5G base-stations. More specifically, the study is about Uplink Power Control (UPC).

Uplink power control is the controlling of a signal's transmission power, a signal that is transmitted from a User Equipment (UE), i.e. a mobile phone, to a base-station tower. UPC is generally used to maximize the quality of service while limiting the generated interference [2].

In this thesis we study the potential of Machine Learning (ML) in Uplink Power Control. The type of ML that we study and implement is called Reinforcement Learning (RL). With RL algorithms we will attempt to train a system to find the optimal transmission power for each UE. Transmission power control is an important part of communication systems, and improving it allows us to reduce the interference between different UE signals, as well as the UEs' energy consumption.

The aim is to provide better data-rates to the users by dynamically adjusting the users' transmit powers. The Signal-to-Interference and Noise Ratio (SINR) is a reasonable first-order approximation to the user's data rate. This approximation is used throughout the field. The idea is to use the SINR to train an RL-agent that sets the transmit power.

Reinforcement learning allows us to create intelligent learners and decision makers called agents. The agents can interact with an environment, which consists of everything outside of the agent's control [3]. Through this agent-environment interaction, the agent can learn to choose better actions in order to maximize some notion of cumulative reward. This maximization of the reward through the agent-environment interaction is the underlying theory of reinforcement learning. This type of machine learning is inspired by behaviourist psychology. For example, when a child is taking his first steps in an environment, the most effective way for him to learn how to walk is through his direct sensing connection to the environment. By training this connection, he gathers a complex set of information about the consequences of his actions and, through that, learns to achieve his goals. However, humans' biological perception of and connection to their surroundings are profoundly more advanced than what we are capable of providing a software agent with. Finding an optimal solution to a problem with RL is therefore not as simple as one might assume. Nevertheless, RL has shown very good results in recent years. It got a lot of attention when Google's DeepMind developed an RL system, called AlphaGo, that for the first time ever beat the human champion in the game of Go [4]. An AI system that can beat a human champion in Go was previously not thought to be possible because of the game's large search space and the difficulty of predicting the board moves.

1.1 Research Question

The thesis will study the use of reinforcement learning in uplink power control in 5G. The purpose is to investigate this promising technology in the context of power control. We want to discover whether or not it is favourable to set the control parameters in uplink using RL algorithms.

Reducing the interference between the different UEs while achieving better data-rates in the uplink transmission is the main goal. Lower interference, or higher SINR, is the measure that defines how well the system performs.

1.2 Outline/Overview

The following two chapters contain the theory for both uplink power control and reinforcement learning. In chapter 2, we go through the background of uplink power control and identify the problem we aim to solve. In chapter 3, the necessary reinforcement learning background is explained, as well as how it can be a solution to the uplink power control problem.

In chapter 4, we put those theories to use and explain all the methods that were implemented in this thesis. In addition, we also go through the difficulties that were encountered when implementing the algorithms. The results of the implemented methods are then summarized in chapter 5. Finally, we end this thesis with chapter 6, which consists of a discussion, conclusions, future work and ethical aspects of the research.


Chapter 2

Uplink Power Control

The problem we are studying in this thesis is about Uplink Power Control (UPC). We aim to improve the transmission power control between two nodes. In this chapter, we will explain the theory behind UPC as well as the thesis problem. It is necessary to understand the background of UPC before attempting to improve it. Specifically, we want to know about the functionality of UPC in 5G and the trade-offs between transmitting with more power and increased interference.

This chapter will also cover how user devices, or User Equipment (UE), differ in their behaviour when deployed in different environments, e.g., a dense city versus a sparse rural area.

2.1 Power Control in Telecommunications

Power control is the process of controlling the transmission power of a channel link between two nodes. Controlling the transmission power is an attempt to improve a communication signal and achieve a better quality of service. Power control is implemented on communication devices such as cell phones or any other kind of UE. There are two kinds of channel links in telecommunication, downlink and uplink. A communication signal is called downlink when it goes from the base-station to the UEs, and it is called uplink when it goes from the UEs to the base-station, see Figure 2.1. In this thesis, we are focusing on power control for uplink only.

Figure 2.1: An illustration separating the downlink and uplink communication signals between a base-station and a UE. The arrows represent the wireless communication signals.

Power control is needed to prevent too much unwanted interference between different UEs' communication signals, see Figure 2.2. It is undeniable that if there is only one mobile device on a site and there are no other devices to interfere with, then the higher the transmission power of the mobile device, the better the quality of the signal. However, this is not a good strategy when there are other UE signals within range. The idea is to decrease the output transmission power to an optimal value when other mobile devices are around. Decreasing the power level will reduce the interference between the devices. A good analogy is imagining a room with only two persons in it: the louder they speak (higher transmission power) the better they will hear each other, but if there are multiple people in the room and everyone is speaking loudly, then hearing each other will be hard (more interference). Another advantage of decreasing the transmission power is that the mobile devices' energy consumption will be reduced.

2.2 Open Loop And Closed Loop Power Control

In uplink power control, we combine two mechanisms: open loop and closed loop. The open loop's objective is to compensate for the path loss. Path loss is the reduction in power density of an electromagnetic wave as the wave travels through space. The open loop generates a transmit power depending on the downlink path loss estimate, see Equation 2.1. The disadvantage of open-loop control is that its performance can be subject to errors in estimating the path loss [5]. Therefore a closed-loop control is introduced. The closed loop consists of a correction function δ, which is feedback (i.e., measurements and exchange of control data) from the base-station to the UE in order to fine-tune the UE's transmission power [5]. The UE's transmit power is set according to

P = \min\{P_{CMAX},\ P_0 + \alpha PL_{DL} + 10\log_{10} M + \Delta_{MCS} + \delta\},    (2.1)

where the open-loop power control comprises the first part of the formula, P_0 + α·PL_DL + 10·log10(M) + Δ_MCS, and the δ term alone makes it a closed-loop control. The terms stand for the following (a small numerical sketch of the formula is given after the list):

Figure 2.2: An illustration of the interference between two UEs. Transmitter 1 and transmitter 2 are the respective transmitters on UE 1 and UE 2, while receiver 1 and receiver 2 are the receivers of the base-station. The receivers are physically located on the same tower, but they are separated here in order to make the illustration clearer. The straight arrows between the transmitters and receivers represent the desired uplink communication signal from the UEs to the base-station. The dotted arrows represent the same transmitted signal picked up by the unintended receiver, transmitter 1 to receiver 2 and transmitter 2 to receiver 1, which causes interference.

• P_CMAX is the maximum power allowed for UE transmission, which is 23 dBm (200 mW) for UE power class 3 according to the 3GPP standard [6],

• P_0 represents the power offset that is the minimum required power received at the base-station in uplink,

• PL_DL is the estimated downlink path loss calculated by the UE,

• α ∈ [0, 1] is a parameter used to configure the path loss compensation,

• M is the instantaneous Physical Uplink Shared Channel (PUSCH) bandwidth during a subframe, expressed in resource blocks,

• the 10·log10(M) term is used to increase the UE transmission power for larger resource block allocations,

• Δ_MCS increases the UE transmit power when transferring a large number of bits per Resource Element. This links the UE transmit power to the Modulation and Coding Scheme (MCS),


• δ is the correction parameter that is used to manually increase and decrease the transmission power.
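To make Equation 2.1 concrete, the following is a minimal Python sketch of how the open-loop terms and the closed-loop correction δ combine into the transmit power P. The function name and the numerical parameter values are illustrative assumptions, not values from the thesis.

import math

def uplink_tx_power_dbm(p0, alpha, pl_dl, m_rbs, delta_mcs, delta, p_cmax=23.0):
    """Transmit power P of Equation 2.1 (power terms in dBm, gains/losses in dB).

    p0        -- power offset P_0
    alpha     -- fractional path-loss compensation factor, 0 <= alpha <= 1
    pl_dl     -- estimated downlink path loss PL_DL
    m_rbs     -- number of allocated PUSCH resource blocks M
    delta_mcs -- MCS-dependent offset Delta_MCS
    delta     -- closed-loop correction delta (the value the RL agent will choose)
    p_cmax    -- maximum allowed UE transmit power, 23 dBm for power class 3
    """
    open_loop = p0 + alpha * pl_dl + 10 * math.log10(m_rbs) + delta_mcs
    return min(p_cmax, open_loop + delta)

# Illustrative values only: P_0 = -90 dBm, alpha = 0.8, 120 dB path loss,
# 10 resource blocks, no MCS offset, delta = +1 dB  ->  17.0 dBm.
print(uplink_tx_power_dbm(p0=-90, alpha=0.8, pl_dl=120, m_rbs=10, delta_mcs=0, delta=1))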

2.3 Deployment of UEs in Different Environments

User equipment is used in various places in our environment. We want to stay connected to fast internet in sparse rural areas as much as in dense cities. The deployment of the UEs plays a crucial role in the power control problem. The methods that decide the values of the parameters in Equation 2.1 are dependent on the UE deployment, which makes it very hard to find one method that fits all deployments. This is where reinforcement learning becomes a topic of interest. Designing an agent that can quickly adapt to different environments and learn to maximize the quality of service may be a good solution.

2.4 Thesis Problem

We want to investigate whether or not using reinforcement learning algorithms will improve the power control, which in this context means reducing interference while achieving higher uplink data rates. The approach is to design a system using reinforcement learning that can control the outputted transmission power P by choosing different values for the δ parameter in Equation 2.1. Instead of having δ set to a single predetermined value, the system will train an agent to select different "best" values of δ for every UE. The agent will learn which δ values, also called actions, provide better results and which do not. When it finds a more desirable action, it will repeatedly exploit that action; more on this is explained in chapter 3. The agent will run at the base-station and feed the result of the RL algorithm, which is the value of δ, as control information to the UEs.

The parameter δ can optionally be either absolute or cumulative according to the 3GPP standard [6]. Only the first option will be investigated in this thesis. The absolute values the δ parameter can take, according to [6], consist of four values, see Table 2.1.

Actions    Absolute δ [dB]
a0         -4
a1         -1
a2          1
a3          4

Table 2.1: A mapping between the agent’s actions and the corresponding δ values [6]

An RL-agent is trained by evaluating the actions that were previously executed. We therefore need a representation, or a measurement, of the environment that tells us how well a given action performed. This measurement is needed for the reward function, which is more thoroughly explained in subsection 3.5.1. What we choose to use for evaluating the performance is the Signal-to-Interference and Noise Ratio (SINR), which is a first-order approximation to the UE's data rate. Specifically, we can estimate the maximum transmitted data rate, also called throughput, by using the Additive White Gaussian Noise (AWGN) channel capacity formula from information theory,

\text{throughput} = \log(1 + \text{SINR}).    (2.2)
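As an illustration, the sketch below maps the agent's actions to the δ values of Table 2.1 and turns a measured SINR into the throughput of Equation 2.2. A base-2 logarithm is assumed here so that the result is in bits per channel use; the thesis does not state the base of the logarithm.

import math

# The four absolute delta values of Table 2.1, indexed by the agent's actions.
DELTA_DB = {"a0": -4, "a1": -1, "a2": 1, "a3": 4}

def throughput(sinr_linear):
    """AWGN capacity proxy of Equation 2.2; the SINR is given on a linear scale."""
    return math.log2(1.0 + sinr_linear)  # bits per channel use

# Example: action a3 (+4 dB) and a measured SINR of 15 (about 11.8 dB)
# gives log2(16) = 4 bits per channel use.
print(DELTA_DB["a3"], throughput(15.0))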


Chapter 3

Reinforcement Learning

This chapter is dedicated to the theory behind reinforcement learning (RL). In this thesis, we have the problem explained in chapter 2 that we want to solve by using several reinforcement learning algorithms. The currently existing solution to that problem uses different control methods to adjust the strength of the uplink signals. Because of the rapid developments in the field of reinforcement learning, it is a good idea to try to improve the current solution into a more advanced, or intelligent, solution.

In order to implement RL algorithms, it is necessary to first acquire a general understanding of reinforcement learning, its advantages and disadvantages, as well as to broaden our knowledge about the processes of reinforcement learning. We also describe what category of "learning" reinforcement learning falls under and the differences between it and other types of learning methods. The background knowledge needed to comprehend how reinforcement learning can be useful for solving a problem in uplink power control is provided in this chapter.

3.1 Machine Learning

Machine learning is a type of Artificial Intelligence (AI) that allows software programs to predict outcomes of an action without being specifically programmed. Machine learning is mainly broken down into supervised learning, unsupervised learning and reinforcement learning.

Supervised learning is learning from a data set that is equipped with labeled examples supplied by an experienced supervisor. Each provided data point describes a situation with a specification, or label, of the optimal action to take [3]. This type of machine learning is often used for categorizing objects. For example, we can train a program to identify cars in images by using a data set of images with labels that tell whether there is a car in each image or not. In order for the program to understand whether there is a car in an unfamiliar image, we need to have the program go through the labeled data set and get some sort of understanding of what a car looks like in an image.

Unsupervised learning is learning by training on data that is neither classified nor labeled. Unsupervised learning algorithms use iterative approaches to go through the data and arrive at a conclusion. It is about obtaining structural aspects of the data while the system learns. Since the data set is unlabeled, the trained system has no evaluation of the accuracy of its output, a feature that separates unsupervised learning from supervised learning and reinforcement learning. This type of machine learning is useful when labeled data is scarce or unavailable, or when the tasks are too complex for supervised learning.

The other type of machine learning is Reinforcement Learning. Reinforcement learning shares some similarities with unsupervised learning. In particular, it does not train on or depend on examples of correct behaviour [3]. However, the difference is that reinforcement learning's goal is to maximize a reward signal instead of finding structures in unfamiliar data.

Reinforcement learning is the main subject of this thesis and it is explained in detail throughout the rest of the chapter. However, since the other kinds of learning are not within the scope of this thesis, they are not further explained.

3.2 Reinforcement Learning

Reinforcement learning is the study of self-learning, and its most essential feature is that, during training, the actions are evaluated after they have already been taken. In fact, in the initial steps the system will not know for sure whether a given action is good or bad before executing that action. It does, however, estimate the outcome of all possible actions before choosing one of them. This makes reinforcement learning efficient when it comes to making real-time decisions in an unknown environment [3]. For that reason, using RL may be a good method to solve our power control problem explained in section 2.4.

In reinforcement learning, there are learners and decision makers called agents. The agent's objective is to pick actions; every good action it takes is rewarded and every bad action is punished, or negatively rewarded. We can therefore argue that this type of machine learning is learning by doing, meaning an RL-agent has to take uncertain actions in the beginning that might lead to bad results, but it will eventually learn and improve its performance after some exploration of the environment. An environment in reinforcement learning is defined as a set of states that the agent interacts with and tries to influence through its action choices. The environment involves everything other than what the agent can control [3].


3.3 Multi-Armed Bandits

The multi-armed bandit is one of many different reinforcement learning problems. An easy way to explain the multi-armed bandit is through the classic example of a gambler at a row of slot machines in a casino. Imagine a gambler playing the slot machines. Since each machine has a fixed payout probability, the gambler has to decide which machines to play and how many times to play each machine, as well as making decisions about whether or not to continue playing the current machine or try a different one [7]. The gambler's goal is to discover a policy that maximizes the total payout, or the reward in RL terms. The actions in this problem are the pulls of the levers attached to the different slot machines. By repeatedly pulling the levers, the gambler can observe the associated rewards and learn from these observations. When a sequence of trials is done, the gambler will be able to estimate the produced reward for each action, helping ensure that optimal action choices are taken to fulfill the goal.

The problem that the gambler faces at each lever pull is the trade-off between the exploitation of a machine that delivers the highest payout based on the information the gambler already has, and the exploration of other machines to gather more information about the payout probability distributions of the machines.

3.4 Exploration And Exploitation Dilemma

A great challenge in reinforcement learning is the trade-off between exploration and exploitation. For an agent to obtain higher rewards, or take more correct actions, it must exploit actions that have already been tried out in the past and proved to deliver good rewards. However, it is also necessary for an agent to explore new actions and take the risk of possibly choosing bad actions in order to learn and make better decisions in the future.

The dilemma can be very complicated depending on the environment within which the agent is operating. In a simple static environment, we may want the agent to eventually stop exploring, because after a reasonable amount of exploration there may not be anything left to explore, and taking actions for exploration purposes will only lead to bad performance. But in a more dynamic environment that is constantly subject to change, we do not want to stop exploring. This means there is a balancing problem: how often should we exploit and how often should we explore? This trade-off between exploration and exploitation is found in all reinforcement learning algorithms, but the method for balancing it varies from one algorithm to another.

The explore-exploit trade-off methods we look into in this thesis are epsilon greedy, Upper-Confidence-Bound (UCB) and gradient bandit. We will explain these methods in chapter 4.


3.5 Finite Markov Decision Process

Markov Decision Processes (MDP) are the fundamental formalism for reinforcement learning problems as well as other learning problems in stochastic domains. MDPs are frequently used to model stochastic planning problems, game-playing problems, autonomous robot control problems, etc. MDPs have in fact become the standard formalism for learning sequential decision making [8].

MDPs do not differ a lot from Markov processes. A Markov process is a stochastic process in which the past of the process is not important if you know the current state, because all the past information that could be useful for predicting the future states is included in the present state. What separates MDPs from Markov processes is the fact that MDPs have an agent that makes decisions which influence the system over time. The agent's actions, given the current state, will affect the future states. Since the other types of Markov chains, the discrete-time and continuous-time Markov chains, do not include such a decision-making agent, MDPs can be seen as an extension of the Markov chain.

In RL problems, an environment is modelled through a set of states, and these states are controlled through actions. The objective is to maximize a representation of each action's performance called the reward. The decision maker, called the agent, is supposed to choose the best action to execute in a given state of the environment. When this process is repeated, it is known as a Markov decision process, see Figure 3.1.

A finite MDP is defined as a tuple (S, A, R), where S is a finite set of states, A a finite set of actions and R a reward function depending only on the previous state-action pair. The MDP provides a mathematical framework for solving decision-making problems. The mathematical definition of the MDP is given by

p(r \mid s, a) := \Pr\{R_t = r \mid S_{t-1} = s, A_{t-1} = a\},    (3.1)

which reads as the probability of producing the reward r given that the agent handles state s by executing action a. It is through this definition that the agent makes decisions about which actions to take.
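As an aside, the conditional probability in Equation 3.1 can be estimated empirically by counting observed (state, action, reward) triples. The toy sketch below does exactly that; the state and action names and the 0.7 success probability are purely illustrative and not taken from the thesis.

import random
from collections import defaultdict

counts = defaultdict(int)   # (s, a, r) -> number of times reward r was observed
totals = defaultdict(int)   # (s, a)    -> number of times action a was taken in s

def record(s, a, r):
    counts[(s, a, r)] += 1
    totals[(s, a)] += 1

def p_hat(r, s, a):
    """Empirical estimate of Pr{R_t = r | S_{t-1} = s, A_{t-1} = a}."""
    return counts[(s, a, r)] / totals[(s, a)] if totals[(s, a)] else 0.0

# Simulate a state-action pair whose reward is 1 with probability 0.7, else 0.
for _ in range(10_000):
    record("ue_1", "a3", 1 if random.random() < 0.7 else 0)

print(round(p_hat(1, "ue_1", "a3"), 2))  # approximately 0.7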

The states are some representation of the environment's situation. What the states specifically represent depends on what the designer chooses them to be; in this thesis the states represent the different user equipment on the site. The gambler's multi-armed bandit problem explained in section 3.3 is simply a one-state Markov decision process, since the gambler is sitting at only one row of slot machines.

The actions, on the other hand, are simply a set of abilities that the agent can choose from in order to reach its goal, which is defined in the designed policy π. A set of actions can vary between low-level controls, such as the power applied to a hammer machine that strikes down a spike into a piece of wood, and high-level decisions, such as whether or not a self-driving car should crash into a wall or drive down a cliff. The actions can basically be any decision we want to learn how to make, whereas the states are any information that can be useful to know in making those decisions [3].

The MDP framework is useful in RL problems where the actions taken by the agent affect not only the immediate reward but also the following future situations, or states. Hence the chosen action will also influence the future rewards [3]. The agent will in theory learn which specific action works best, or produces the maximum reward, in each state. However, in order to maximize the reward in the long run, it will also have to choose specific actions that may not maximize the reward at a given time step or in a given state, but will result in a greater overall reward in future time steps or future states. So there is a trade-off between favoring immediate rewards and future rewards that is resolved through various RL methods.

Figure 3.1: A diagram of state, reward and action in reinforcement learning that shows the interaction between a learning agent and an environment. A_t denotes the action chosen by the agent based on the policy π, R_{t+1} denotes the reward (or the consequence) of the taken action, and S_{t+1} denotes the state in which the agent finds itself after the action was taken [3].

The process diagram for the MDP is shown in Figure 3.1. It starts with the agent deciding to take an action A_t, chosen from a set of possible actions A based on a predesigned policy π. This occurs at every discrete time step t = 1, 2, 3, ..., T, where T is the final time step. The action A_t is normally chosen conditioned on knowing the state S_t; however, the state of the environment is usually not known at the initial starting point. The first action A_1 is therefore randomly chosen in most cases. After the chosen action A_t is executed, we are in the next time step and the consequence of the taken action is fed back to the agent in the form of a reward R_{t+1}. The reward system is used to evaluate the actions, and it tells the agent how good the performance of a chosen action was. With the reward, the agent updates the likelihood of action A_t being picked in the future. Additionally, when action A_t is executed, the agent will find itself in a different state S_{t+1}; this information is read from the environment and sent to the agent in order for the agent to choose the next action A_{t+1} on that basis. The process then repeats itself. With this closed-loop agent-environment interface the system becomes autonomous, see Figure 3.1.

The MDP framework is probably not sufficient to solve all decision-learning problems, but it is very useful and applicable in many cases [3]. It is capable of reducing any goal-directed learning problem to three signals exchanged between the agent and the environment: a signal that represents the choices made by the agent, called actions, a signal that represents the basis on which the choices are made, called states, and a signal that represents the agent's goal, called reward [3].
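To make the loop in Figure 3.1 concrete, the following is a minimal sketch of the agent-environment interaction. The env and agent objects and their method names are hypothetical interfaces chosen for illustration, not the implementation used in this thesis.

def run_episode(env, agent, num_steps):
    """One pass through the closed agent-environment loop of Figure 3.1."""
    state = env.reset()                        # initial state S_1
    for t in range(num_steps):
        action = agent.select_action(state)    # choose A_t based on the policy pi
        next_state, reward = env.step(action)  # environment returns S_{t+1} and R_{t+1}
        agent.update(state, action, reward)    # learn from the consequence of A_t
        state = next_state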

3.5.1 Reward Function And Goals

The goal in a reinforcement learning problem is to maximize the total amount of reward that is generated from an environment and passed along to an agent. The goal is not to maximize the immediate reward, but to maximize the cumulative reward in the future. Therefore the reward signal is one of the vital parts of any reinforcement learning problem, and using the reward to formulate a goal is a very specific feature of reinforcement learning [3].

The best way to explain how reward functions work by using reward signals is through examples. Imagine a robot that we want to use during autumn for picking up fallen leaves in a garden. A reasonable reward function would give it a reward of zero most of the time, a reward of +1 for every leaf it picks up and stores in its basket, and a reward of -1 every time it makes a mistake, like crashing into something or breaking a vase. Since the agent is designed to maximize the reward, the goal will therefore be to pick up as many leaves as possible while avoiding accidents like crashing or breaking vases.

Formulating the goals using only the reward signal might sound a little limited for more complicated situations, but it has in fact proved to be a very suitable way to do so [3]. Using the reward signal, one can create a complex reward function that shapes the main goal as well as subgoals for the agent. The reward function must also make sure that the agent is rewarded properly to achieve the end goal in the right way. Achieving the end goal the right way may be trickier than one thinks. The agent will always learn to maximize its reward, but this does not always mean that it is doing what the designer wants it to do. There are many examples where reward functions fail and cause the agent to learn to make matters worse. This failure in reward function design is common and has been seen in many RL problems.

There is something called the cobra effect; it occurs when an attempt to solve a problem actually makes matters much worse, as an unexpected consequence [9]. An example of the cobra effect was found in the work done in [10], where the authors mention how a flaw in the reward function made the agent learn to do something other than the intended goal. In their work, reinforcement learning is used on a robot arm that is meant to learn how to pick up a block and move it as far away as possible, in other words, to fully stretch the arm before putting down the block. So the reward function was designed to reward the agent based on how far the block got from the robot arm's base. The designer thought that with this reward function the agent would learn to pick the block up and move it as far away as possible before setting it down, which makes sense. But instead of doing that, the agent learned to maximize the reward by hitting the block as hard as it could and shooting it as far away as possible. This happened because the block got further away and the agent was rewarded better than by picking it up and setting it down. After further training time, it even learned to pick the block up and throw it. The reinforcement learning algorithm worked, in the sense that it did in fact maximize the reward. The goal was, however, not achieved at all because of the poorly designed reward function. This shows how important it is to formulate the reward function correctly in an RL problem. The reward function is a way to tell the agent precisely what you want to achieve, not how you want to achieve it [3].


Chapter 4

Method

There are many different reinforcement learning methods with which to tackle a reinforcement learning problem. We will, however, only explain the methods that were implemented in this thesis, which are the epsilon greedy (ε-greedy), Upper-Confidence-Bound (UCB) and gradient bandit algorithms. We will go through the algorithms in detail, how they were implemented and what difficulties were encountered when the theory was put into practice.

The reason for choosing the mentioned RL algorithms to solve our RL problem is that our problem is formulated as a multi-armed bandit, and according to Sutton's book on reinforcement learning [3], a multi-armed bandit problem can be solved through ε-greedy, UCB and gradient bandit algorithms. All three of them were implemented in this project.

This chapter starts with a simplified setting of the uplink power control problem, one that does not involve learning to handle more than one situation, or one UE. We formulate the problem as a non-associative search called a multi-armed bandit. With this setting we can avoid the complexity of a full reinforcement learning problem, yet still be able to evaluate how the implemented methods work. The problem was later reformulated into a more general reinforcement learning problem, an associative search, that is, one where the solution methods involve learning to handle more than one situation.

The designed reward functions for the RL agent are stated and explained in a section of their own in this chapter. As mentioned in the previous chapter, the reward function plays a big role in reinforcement learning applications. It is therefore necessary to clarify what the different reward functions represent and why they are designed that way.

We will end this chapter by briefly explaining how the simulation works and which of its parameters need to be set before running a simulation. A detailed explanation of Ericsson's simulation is left out because it is not within the scope of this thesis. The simulation is only used to evaluate our implemented work.


4.1 Non-Associative Search

A non-associative search in our problem means handling the power transmission for only one UE. Since the power control problem was buried inside Ericsson's existing uplink power control implementation, it was reasonable to formulate the problem as a non-associative multi-armed bandit problem. That way, it was easier to manipulate the already implemented power control code, and then try out the chosen methods, evaluate the results and identify further problems.

4.1.1 A k-Armed Bandit Problem in Uplink Power Control

In reinforcement learning, there are three important terms: states, actions and rewards. In order to draw up the power control problem as a k-armed bandit, we first need to identify what these terms correspond to.

We define the states as the different UEs; each UE is a state of its own. Since we are only considering one UE for now, the use of states is insignificant. The states play a bigger role in section 4.3, where the problem is reformulated as an associative search. We will therefore explain the role of states in more detail in section 4.3.

Considering the power control problem explained in section 2.4, we are repeatedly faced with a choice among k different values for the δ parameter; we call this choice the action selection. Each possible choice is an action the agent can choose. When an action is selected and executed, a reward that depends on that action is returned. In our case, the reward is equal to the throughput. If we, for example, select the action δ = 4 dB from Table 2.1 and transmit the power signal, then the SINR is measured and from that we get the throughput according to Equation 2.2 in chapter 2.

The throughput is directly considered to be the reward for that action. It is a linear value measured in megabits per second (Mbps), so the higher the throughput, the better the connection between a UE and a base-station. The objective here is to maximize the reward and eventually have the transmitted uplink power converge.

When the actions, states and rewards are clarified and we know what they stand for in our power control problem in section 2.4, we can propose three different algorithms to solve it. The algorithms are explained below.

4.2 Implemented Methods

The first step for all of the implemented algorithms is the same: choose and execute a random action. The agent's first decision on what action to take is simply a blind guess, because at this stage the agent has no knowledge about the environment. Every action is equally weighted and the agent is purely exploring.


After the first step, the agent will get a reward for the executed action and, along with that, a reference point against which to compare the rewards of other actions.

By repeating the action selection process and storing/accumulating the reward for each selected action, we can get the system to "learn" and become somewhat intelligent. The variable that stores the reward information is called the action-value; it is used to maximize the reward by concentrating the action selection on the choices that have proven to provide the system with a higher reward, which means higher SINR and higher throughput.

The action selected at time step t can be denoted A_t, and the reward for that action R_{t+1}, because the reward is returned in the next time step. The action-value for an action a is then the expected reward given that a is the selected action A_t [3],

Q_t(a) = \mathbb{E}[R_{t+1} \mid A_t = a].    (4.1)

The action-values are estimates of the future made based on previous rewards; we do not know their values with certainty until the actions are executed. If we knew the values before the execution, the problem would be irrelevant to solve, because we would always choose the action that we know would produce the highest reward value. The estimated value of action a at time step t is denoted in the implementation as Q_t(a).

After every action has been chosen and executed, there should be at least one action that performed better than the rest, or produced the highest reward. However, just because that action produced the highest reward once does not mean it is the optimal choice in the future. We therefore need to explore other actions every now and then, which leads to the exploit-explore dilemma explained in section 3.4. Every implemented algorithm handles this dilemma differently.

4.2.1 Epsilon Greedy Algorithm

In epsilon greedy, the action selection comes down to two choices: greedy and non-greedy. That choice depends on a predetermined constant called epsilon, ε, which is essentially a probability that resolves the exploit-explore dilemma mentioned in section 3.4. Since the constant ε is a probability, it can take values in the range 0 ≤ ε ≤ 1, and it represents the exploration probability, which is explained in more detail below.

Action selection

Considering the estimates of all the action-values Q, there should always be at least one action whose estimate is at least as high as all the others. This action is the greedy action, and when it is selected, it means that the algorithm is exploiting the current knowledge and not exploring to gain new knowledge. If there are two or more action-values that are equal and are together the highest in the list, then one of them is randomly chosen. All the other actions that are not greedy actions are called non-greedy actions. If the agent chooses a non-greedy action, it means that the system is exploring and gaining new knowledge about the environment. The remaining question is: how does the agent actually decide between greedy and non-greedy actions?

Suppose we set ε to some value in the range 0 ≤ ε ≤ 1; the algorithm would then decide on choosing greedy or non-greedy actions according to the probability distribution

A \leftarrow \begin{cases} \text{greedy action} & \text{with probability } 1 - \varepsilon \text{ (breaking ties randomly)} \\ \text{non-greedy action} & \text{with probability } \varepsilon \end{cases}    (4.2)

where setting epsilon to a higher value will lead the algorithm to explore more and exploit less, and vice versa.

Exploitation is the way to go in order to maximize the reward/SINR at a given time step; however, exploration can lead to a better total reward in future steps. Assume that at a given time step t there is a greedy action (with the highest estimated action-value), and that this action has been tried out multiple times before time step t, which makes us certain that the estimate is reliable, while on the other hand there are several other actions that are non-greedy (have lower estimated action-values) but whose estimates carry great uncertainty. This implies that among these non-greedy actions there may be an action that would produce a higher reward than the greedy action, but the agent does not know this yet, or which one of them it is. It may therefore be wise to explore these lower-valued actions and discover actions that may produce a better reward than the greedy action.

Once the agent has chosen whether to take a greedy or a non-greedy action, it will then need to separate the greedy action from the non-greedy ones. The agent does this through

\text{greedy\_action} = \arg\max_a \left[ Q_t(a) \right],    (4.3)

where greedy_action is an array that stores the highest estimated action-values, while the rest of the estimated action-values are stored in another array called nongreedy_actions. For the greedy actions, if two or more actions achieve the maximum Q_t, meaning they have the same value, then the agent chooses randomly between these actions.
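The arg max with random tie-breaking of Equation 4.3 can be written compactly as follows; this is a minimal sketch and the function name is ours, not from the thesis.

import random

def argmax_random_tiebreak(q_values):
    """Index of the largest action-value estimate Q_t(a), breaking ties at random."""
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(candidates)

print(argmax_random_tiebreak([0.2, 0.7, 0.7, 0.1]))  # prints 1 or 2, chosen at random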

Action-value estimation and update

To make sure that the reinforcement learning agent learns to select the optimal actions, it has to be updated with a reward after every execution of the previously selected action. Updating the agent in this case is the same as estimating the values of actions after every action taken.

There are several ways to estimate the values of the actions. A natural way to do the estimation is by computing the average of every reward produced in every time step that a given action a was chosen. For this, the agent needs to store the number of times each action has previously been selected, N_t(a), and the respective rewards at each time step,

Q_t(a) = \frac{\text{sum of rewards when } a \text{ was selected}}{\text{number of times } a \text{ has been selected}},    (4.4)

which, by the law of large numbers, means that Q_t(a) will converge to a specific value when the denominator goes to infinity [3], provided that the expected reward of each action does not fluctuate over time. This method of estimating the action-value is called sample-average in [3], because every estimate is an average of the sample of relevant rewards. It is one of the simpler methods for estimating action values, and not necessarily the best one; it does, however, serve its purpose.

In the implementation, this method requires a lot of memory and computation because the number of saved rewards grows over time. The implementation of the estimation method was therefore modified and improved into a method called the incremental action-value.

Let the reward produced after the i-th selection of a given action be denoted R_i, and the action-value estimate after the action has been selected n times be denoted Q_n. Equation 4.4 is then equivalent to

Q_n = \frac{R_1 + R_2 + \dots + R_{n-1}}{n-1} = \frac{1}{n-1} \sum_{i=1}^{n-1} R_i.    (4.5)

The incremental action-value method is a more reasonable method to implement in code. Instead of maintaining a record of all the rewards produced for every action in every time step, and carrying out larger and larger computations as the number of iterations increases, we can simply use an incremental formula that updates the averages with a small computation every time a new reward needs to be processed. The formula for computing the new updated average of all n rewards was therefore converted to


\begin{aligned}
Q_{n+1} &= \frac{1}{n} \sum_{i=1}^{n} R_i \\
        &= \frac{1}{n} \Big( R_n + \sum_{i=1}^{n-1} R_i \Big) \\
        &= \frac{1}{n} \Big( R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_i \Big) \\
        &= \frac{1}{n} \Big( R_n + (n-1) Q_n \Big) \\
        &= Q_n + \frac{1}{n} \big[ R_n - Q_n \big].
\end{aligned}    (4.6)

This implementation requires memory only for the old estimate Q_n and the number of previous selections n of each action, and only a small computation of Equation 4.6 every time the agent is updated.
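As a quick sanity check (with purely illustrative reward values), the incremental update of Equation 4.6 reproduces the sample average of Equation 4.4:

rewards = [2.0, 5.0, 3.0, 4.0]

q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (r - q) / n        # Q_{n+1} = Q_n + (R_n - Q_n) / n

print(q, sum(rewards) / len(rewards))  # both print 3.5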

The pseudocode for the implemented bandit algorithm using ε-greedy action selection is shown in Algorithm 1.

Algorithm 1 Epsilon greedy algorithm applied to a bandit problem [3]

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0
Select initial action:
    A ← random selection
Repeat:
    R ← execute action A and return the reward
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) [R − Q(A)]
    A ← { arg max_a Q(a)    with probability 1 − ε (breaking ties randomly)
        { a random action   with probability ε

To further improve Algorithm 1, the initial values of Q(a) can be changed from 0to an optimistic high initial value. The optimistic initialization method was used bySutton (1996) [11], it is an easy way to provide the agent with some prior knowledgeabout the expected reward level. This method makes the agent bias to the actionsthat have not been tried out yet in the beginning steps of the learning process.Since the non-selected actions are biased by their high initial values the ε-greedyagent will most likely pick those actions first. This way the agent tries out all theactions in its first iterations. The bias is immediately removed once all the actions


have been taken at least once, because their action-value estimates are then replaced with values lower than the optimistic initial value. The advantage of this method is that the agent does not miss out on trying all of the actions at the start, which helps it converge to the optimal action choice faster.
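A minimal Python sketch of Algorithm 1 with optimistic initial values is given below; the action list, the reward source get_reward and the initial value q_init = 5.0 are illustrative assumptions and not taken from the actual implementation.

```python
# Sketch of the epsilon-greedy bandit with optimistic initial estimates.
import random

def epsilon_greedy_bandit(actions, get_reward, steps=4000, epsilon=0.1, q_init=5.0):
    Q = {a: q_init for a in actions}   # optimistic initial estimates
    N = {a: 0 for a in actions}        # selection counts
    for _ in range(steps):
        if random.random() < epsilon:
            A = random.choice(actions)                                # explore
        else:
            best = max(Q.values())
            A = random.choice([a for a in actions if Q[a] == best])   # exploit, ties broken randomly
        R = get_reward(A)              # e.g. the measured throughput for the chosen delta
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]      # incremental update, Equation 4.6
    return Q
```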

4.2.2 Upper Confidence Bound Action Selection Algorithm

A greedy action selection method selects only the "best" action at a given time step, but the best action according to the agent's current knowledge at that time step is not necessarily the actual best action. Therefore, exploring and taking the risk of selecting a bad action from time to time can be beneficial. The ε-greedy method forces the agent to sometimes (depending on the probability value of ε) take non-greedy actions, which lets the agent gain more knowledge. There is, however, room for improvement. The non-greedy action in the ε-greedy algorithm is selected randomly, with no consideration of how nearly greedy the action is nor how uncertain its action-value estimate is. A substantial improvement in these exploration steps is to select among the non-greedy actions by taking into account their estimation uncertainties as well as how close they are to the maximum (the greedy action). This method of picking non-greedy actions according to their potential for being the optimal one, instead of blindly picking an action, is called Upper Confidence Bound (UCB) action selection. The UCB method was developed by Auer, Cesa-Bianchi and Fischer in 2002 [12].

In order to take both the action-value estimation Q(a) and its uncertainty into account, we will select the actions according to

$$A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\, \right] \text{ [3]}, \tag{4.7}$$

where ln t is the natural logarithm of t, Nt(a) is the number of times that action a has been selected before time step t, and c is a constant larger than 0 that controls the degree of exploration. Nt(a) is updated (incremented by 1, similarly to Algorithm 1) before Equation 4.7 is evaluated, which means that the value of Nt(a) is never equal to zero.

The square-root term in the UCB action selection method is a way to estimate the uncertainty, or the variance, in the estimation of the action-value [3]. By taking that into account and establishing a confidence level c, the maximized estimate is an upper bound on the true action-value of a given action a. If an action a has been selected much less often than the other actions, then the square-root term of action a will be much larger than that of the other actions. This favors action a by making it more likely to be selected, since its total value may become higher than the other actions' total values. The natural logarithm in the numerator of the uncertainty term makes the term grow more slowly over time, but it is still unbounded. The UCB action selection equation guarantees that all


actions are eventually selected, but actions that have low action-value estimates, or that have been selected many times relative to the other actions, will over time be selected less frequently.
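A minimal sketch of the UCB selection rule in Equation 4.7 is shown below, assuming every action has already been selected at least once so that Nt(a) > 0; the default c = 50 mirrors Table 4.1, and the function name is illustrative.

```python
# Sketch of the UCB action selection rule in Equation 4.7.
import math

def ucb_select(Q, N, t, c=50.0):
    """Return the action maximizing Q(a) + c * sqrt(ln(t) / N(a))."""
    def ucb_value(a):
        return Q[a] + c * math.sqrt(math.log(t) / N[a])
    return max(Q, key=ucb_value)
```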

4.2.3 Gradient Bandit Algorithms

Another possible way to find the optimal action in this problem is the gradient bandit algorithm. The gradient bandit, unlike the two previously discussed algorithms, does not use action-value estimates to select its actions. It instead learns a numerical preference for every possible action, Ht : A → R. The action preferences are updated and recorded after each iteration using the throughput between two nodes (UE and base station) as the reward; this is stated in Equation 4.9 further down.

In this algorithm, an action that has a higher preference than the other actions is more likely to be selected and executed. All the actions have a selection probability, which is determined through a soft-max distribution over the action preferences.

In probability theory, a soft-max function (also called the normalized exponential function [13]) is used to represent a categorical distribution, which is a probability distribution over K different possible outcomes. It is calculated using the exponential values of the action preferences H, see Equation 4.8.

The probability of an action a being selected at time step t is calculated as follows,

$$P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}. \tag{4.8}$$

This calculation is done for all the actions in every iteration. The total probability of all actions should always be equal to one, except for a very small deviation caused by rounding due to the finite floating-point precision a computer can handle. Furthermore, the actions' initial preferences are all set to zero so that all actions have an equal probability of being selected at the start.

When an action, At = a, is selected and executed at time step t, the simulation returns a reward, Rt, which we can then use to update the preferences of the actions. The agent also keeps a record of the average reward produced over all previous time steps, up to and including the current time step t. The average reward, R̄t, is calculated just like in Equation 4.5 and is used as a reference point to indicate whether the returned reward Rt was smaller or greater than the average. If Rt is greater than the average reward R̄t, then that action's preference is increased and the preferences of all the other actions are


decreased for the next iteration, and vice versa if Rt is smaller than R̄t, see Equation 4.9. Increasing or decreasing the action preferences correspondingly affects the selection probabilities of the respective actions through the calculation in Equation 4.8. The action-preference update step for this algorithm is as follows,

$$\begin{aligned}
H_{t+1}(A_t) &= H_t(A_t) + \alpha (R_t - \bar{R}_t)\bigl(1 - P(A_t)\bigr), && \text{for the selected action } a = A_t,\\
H_{t+1}(a) &= H_t(a) - \alpha (R_t - \bar{R}_t)P(a), && \text{for all } a \neq A_t,
\end{aligned} \tag{4.9}$$

where α > 0 is a constant, predefined step-size parameter [3].

The gradient bandit algorithm is expected to work better than the ε-greedy and UCB algorithms because of the way it computes the likelihood of actions being chosen. An advantage of the gradient algorithm is that for every selected and executed action, all of the action preferences are updated. When the agent is doing well, producing higher and higher rewards, the gradient algorithm increases the likelihood of choosing the action that is causing the better performance while decreasing the likelihood of the remaining actions. This causes the agent to keep exploiting a given action as long as the produced rewards keep improving, and to turn its focus to exploring other actions when they do not. In contrast, the ε-greedy algorithm explores a fixed fraction ε of the time, and neither it nor the UCB algorithm adapts its exploration to how well the agent is currently performing.
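A minimal Python sketch of one gradient bandit step, combining the soft-max distribution of Equation 4.8 with the preference update of Equation 4.9, could look as follows; the reward source get_reward is a placeholder for the measured throughput and the names are illustrative.

```python
# Sketch of one step of the gradient bandit algorithm (Equations 4.8 and 4.9).
import math
import random

def softmax_probs(H):
    """Soft-max distribution over the action preferences, Equation 4.8."""
    exps = {a: math.exp(h) for a, h in H.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def gradient_bandit_step(H, R_bar, t, get_reward, alpha=0.1):
    """Sample an action, observe a reward and update preferences and baseline."""
    probs = softmax_probs(H)
    actions = list(probs)
    A = random.choices(actions, weights=[probs[a] for a in actions])[0]
    R = get_reward(A)
    R_bar += (R - R_bar) / t            # incremental average reward (baseline)
    for a in H:                         # preference update, Equation 4.9
        if a == A:
            H[a] += alpha * (R - R_bar) * (1 - probs[a])
        else:
            H[a] -= alpha * (R - R_bar) * probs[a]
    return A, R, R_bar
```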

4.3 Associative Search

Associative search, also called contextual bandits [3], introduces the concept of states. The states are representations of the environment, and in this thesis we designed the states to represent each individual UE. In the non-associative search, the agent completely ignores the state of the environment because it is only dealing with one UE. In that case the use of states is irrelevant, unless we design the states to represent something else.

We will reformulate the problem as an associative search, which means there are now multiple UEs on the site, and we use the states to let the RL agent distinguish between the different UEs. The state lets the agent know which UE it is currently dealing with. This is a useful feature since the agent needs to learn which values of δ work best for each UE. Each UE has different properties, and the optimal transmission power will not be the same for every UE. The different properties can, for example, be being near to or far away from the base-station, or being located in a crowded or uncrowded area. The agent faces much bigger challenges when there are multiple UEs on the site, such as interference and quality of service. It has to find a way to minimize the interference while maximizing the overall throughput between the UEs and the base-station. It has to take into account that


the UEs' transmission powers are coupled, meaning that changing the transmit power of one UE changes not only the throughput of that UE but also the throughput of the other UEs.

Figure 4.1: An illustration of the associative search problem. The agent on the left side of the figure commands the UEs to use the chosen action At. After the UE's uplink transmission is done, the throughput is measured and returned to the agent through the reward Rt+1. At the same time, the agent is informed through the state St+1 which UE it needs to select an action for next. The indices n and m denote the number of UEs and actions respectively. On the right side of the figure is the base-station tower that the UEs are connecting to.

The Markov Decision Process (MDP) framework explained in section 3.5 is used for this associative search problem. With an MDP the agent can use the states to keep track of the environment's situation and try to find an optimal transmission power, or δ value, for each of the UEs, see Figure 4.1. The agent still uses the implemented methods explained in section 4.2 for selecting and rewarding the actions, only now it keeps separate, parallel learning processes, one per UE. Additionally, the agent now conditions its action choices on the states. Thus the action-value, or the estimated reward, given in Equation 4.1 is updated to

$$Q_t(s, a) = \mathbb{E}\bigl[R_{t+1} \mid S_t = s, A_t = a\bigr] \text{ [3]}, \tag{4.10}$$

where s and a stand for the given state of the environment and the selected action, respectively.
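As a sketch, the bandit estimates can be extended to this contextual setting by keeping one table of estimates per state (here, per UE). The helper names below are illustrative; the action set is the set of δ values listed in Table 2.1.

```python
# Sketch of per-state (per-UE) action-value tables for the contextual bandit.
from collections import defaultdict

ACTIONS = [-4, -1, 1, 4]  # candidate delta values (closed-loop correction signal)

# Q[state][action] and N[state][action], created lazily the first time a UE is seen.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
N = defaultdict(lambda: {a: 0 for a in ACTIONS})

def update(state, action, reward):
    """Incremental update of Q(s, a), keyed by the UE the agent is handling."""
    N[state][action] += 1
    Q[state][action] += (reward - Q[state][action]) / N[state][action]
```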

4.4 Reward Functions

So far in this chapter, we have assumed that the reward value is always provided. However, as explained in subsection 3.5.1, the reward function is more complicated


than it appears. For that reason, different reward functions were designed and tried out.

Reward functions are especially important in this thesis because the only way to let the agent know that the different user equipments are coupled is through the reward function. The coupling is that an action chosen for one UE affects the throughput of other UEs through interference.

The reward function was designed and gradually improved as the results were evaluated. The different reward functions are explained in the subsections below.

4.4.1 Local Reward

Local reward is simply another name for the throughput between the base-station and a given UE. It is the first reward function that was designed in this thesis, and it rewards the action-values with the throughput directly, see Equation 2.2.

Using only the local reward, the agent will learn to take actions that maximize the throughput of the currently handled UE only, and not the overall throughput. The agent will disregard how the actions may affect the other UEs. This reward function essentially assumes that there is one UE on the site and nothing else to interfere with. The local reward is useful for analyzing how well the overall design of the system works, and it may possibly be enough to reach our goal of maximizing the total throughput. However, we want the agent to also take the other UEs into account when choosing actions, which is why we introduce the global reward.

4.4.2 Global Reward

Global reward is the average throughput of all the UEs that the agent handles simultaneously. The interference between the UEs only occurs when they are being handled simultaneously, so computing the average throughput, or the average local reward, is a way to measure that interference. The global reward is designed to shape the agent's goal. Through the global reward we can let the agent know that raising the overall throughput of all the UEs is what we want, and not just the throughput of each UE for itself.

Assume that the agent sets the δ parameter to a value that maximizes the throughput of a given UE but also causes the throughput of other UEs to drop. Using the global reward we expect the agent to understand, through the reduction of the reward, that this is not a good choice, and to eventually try something else.
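A minimal sketch of the two reward functions is shown below; throughput(ue) is a placeholder for the simulator's throughput measurement, not an actual API.

```python
# Sketch of the local and global reward functions described above.

def local_reward(ue, throughput):
    """Reward is simply the throughput of the UE the agent is currently handling."""
    return throughput(ue)

def global_reward(scheduled_ues, throughput):
    """Reward is the average throughput of all UEs handled simultaneously."""
    rates = [throughput(ue) for ue in scheduled_ues]
    return sum(rates) / len(rates)
```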


4.5 Simulation Setup

We run the RL algorithms on an Ericsson-developed simulation. The simulation is equipped with the features needed to replicate a real-life scenario, where we can simulate a site, or sites, consisting of a base-station and a number of UEs connected to that base-station. The simulation is set up so that the environment is dynamic; some of the UEs' properties change during a simulation run, e.g. the distances of the UEs to the base-station.

In the simulation settings, we can manipulate the number of propagation frames, which is the number of times the power control program schedules each UE during a simulation. Accordingly, the number of propagation frames is the number of times that the agent can choose actions for the UEs, which means setting the δ parameter in Equation 2.1 to one of the values stated in Table 2.1. The number of propagation frames was usually set to 4000 for our experiments.

Another parameter that needs to be set prior to running the simulation is the path loss compensation factor α, mentioned in Equation 2.1 in chapter 2. The α value was set to 1 for our experiments, which corresponds to full path loss compensation for the transmitted uplink signal.

Additionally, we have to define the number of sites and the total number of UEs we want to use in the simulation. A site stands for a base-station and all the UEs that are handled by that base-station. When we set the number of sites and UEs, the simulation program randomly drops the UEs at different locations on the sites. The UEs could therefore be located far away from each other as well as very close to each other. This randomness plays a role in the interference, making every simulation different from the next. For that reason, we set up the system so that every time we want to simulate something, the same simulation is repeated exactly 50 times and the results are averaged before being evaluated. We repeat the process 50 times because the results stabilize after this number of runs.
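As a sketch of this evaluation loop, assuming a hypothetical wrapper run_simulation around the simulator (it is not an actual API), the repetition and averaging can be expressed as:

```python
# Sketch of repeating the same simulation 50 times and averaging the results.
import numpy as np

def averaged_run(run_simulation, config, repetitions=50):
    """Run the same simulation repeatedly and average the per-UE throughputs."""
    results = [run_simulation(config) for _ in range(repetitions)]  # each result: per-UE throughputs
    return np.mean(results, axis=0)
```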

The agent's ability to handle multiple UEs simultaneously depends on the number of sectors. Each site is divided into three sectors, see Figure 4.2, which means that the site's base-station is capable of scheduling three UEs simultaneously. Accordingly, the agent can handle three UEs at each time step. The number of sectors can be changed, but we have no reason to change it in this project.

The RL algorithms' constant parameters, namely the exploration probability ε, the degree of exploration c and the step size α used in the ε-greedy, UCB and gradient methods respectively, were set to the values that produced the best performance. The parameter values could likely be tuned further, but for most of our simulations the values stated in Table 4.1 were used. These values were decided after multiple trials.


Figure 4.2: An illustration of a site, where the physical area of the site is divided into three sectors around the base-station. At every time step, the base-station schedules one UE in each sector.

RL parameter    Value
ε               0.1
c               50
α               0.1

Table 4.1: The RL constants used in the ε-greedy, UCB and gradient methods.

4.5.1 Evaluating The Performances

We evaluate the algorithms' performance by plotting and analyzing the Cumulative Distribution Function (CDF) of the average throughput of all UEs. The CDF describes the probability that the average user throughput takes a value less than or equal to x. Figure 4.3 shows an example of the CDF of the user throughput when the simulation is done without the RL algorithms. The results presented in chapter 5 are analyzed by comparing them to a similar non-RL simulation like the one shown in Figure 4.3.


[Figure 4.3 plot: C.D.F. [%] versus user throughput [Mbps]; single curve: user throughput using power control without the RL algorithm.]

Figure 4.3: A plot showing the CDF of the average throughput of all UEs. The number of sites and UEs are set to 1 and 15 respectively, and the δ parameter is set to zero because the RL agent is deactivated.

In the plot shown above, we want the graph to be as vertical and as far to the right as possible; it is through these two attributes that we evaluate the RL methods. Since the graph represents the CDF of the average throughput of the UEs, the more vertical the graph is, the more "fair" the service is. When a graph is vertical, more UEs are served with the same, or nearly the same, number of Mbps. Conversely, if the graph is more horizontal, then some UEs are served with higher throughput and some with lower, which is something we want to avoid as much as possible because it is an "unfair" service. Additionally, we want the graph to be located as far to the right as possible, since that means that the UEs are served with higher throughput.
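For reference, a minimal sketch of how such an empirical CDF can be computed and plotted; matplotlib is assumed to be available, and the throughput array is a placeholder.

```python
# Sketch of plotting the empirical CDF of per-UE throughput, as in Figure 4.3.
import numpy as np
import matplotlib.pyplot as plt

def plot_throughput_cdf(throughputs_mbps, label):
    """Plot the empirical CDF (in percent) of the average per-UE throughputs."""
    x = np.sort(np.asarray(throughputs_mbps))
    cdf = 100.0 * np.arange(1, len(x) + 1) / len(x)
    plt.plot(x, cdf, label=label)
    plt.xlabel("User throughput [Mbps]")
    plt.ylabel("C.D.F. [%]")
    plt.legend()
```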


Chapter 5

Results

The results of the implemented methods explained in chapter 4 are presented and evaluated in this chapter. It is on the basis of these results that we draw our conclusions about whether or not reinforcement learning improves the uplink power control.

This chapter is divided into multiple experiments. Every experiment consists of three parts: an explanation of the experiment, the resulting performance and an interpretation of the result. The first two experiments demonstrate the performance of two types of reward functions with the different methods. The third experiment presents the results regarding interference when the implemented RL algorithms are used.

5.1 Experiment 1: RL Using Local Reward

The local reward is equal to the throughput between a given UE and the base-station, which is computed using the SINR, see Equation 2.2. Using the local reward as the reward function in the three algorithms explained in chapter 4 produced the results shown below. For this experiment, the number of propagation frames was set to 4000 for all simulations.

Figures 5.1a, 5.1b and 5.1c show the performance of the ε-greedy, UCB and gradient methods respectively. These figures compare the RL methods with the performance of the non-RL power control, where the δ parameter is permanently set to zero. Additionally, Figure 5.2 shows the comparison between the three methods using the local reward.


5.1.1 Results

[Figure 5.1 plots: (a) Epsilon greedy algorithm, (b) UCB algorithm, (c) Gradient algorithm. Each panel shows the C.D.F. [%] versus user throughput [Mbps] for "No RL algorithm (delta = 0)" and the respective RL method.]

Figure 5.1: Each sub-figure demonstrates the performance of a different RL algorithm using local reward. The plots show the CDF of the average throughput of all UEs in two simulations, with and without RL algorithms. For every plotted simulation, the number of sites and UEs are set to 1 and 15 respectively. The δ parameter is set to zero in the non-RL simulations and set to a value chosen by the RL agent in the RL simulations.


[Figure 5.2 plot: C.D.F. [%] versus user throughput [Mbps]; curves: No RL algorithm (delta = 0), Epsilon greedy, UCB, Gradient.]

Figure 5.2: This plot shows the graphs in Figure 5.1 merged together to compare the three algorithms' performances when using local reward. The performances of the ε-greedy, UCB and gradient methods are so similar that their graphs lie on top of each other on the right side of the plot.

5.1.2 Interpretation of Results

The results in Figure 5.1 show that all three algorithms increase the throughput of the UEs compared to a simulation done without any RL algorithm. In fact, the three algorithms perform almost exactly the same, as shown in Figure 5.2. The reason behind that similarity is that there is a value for the δ parameter that performs better than the other values, and all three algorithms are able to find it. These results indicate that there is a way for the agent to achieve a higher throughput, and it is able to find and exploit it, which leads to better performance.

The results in Figure 5.1 are achieved when the reward function is set to the local reward. In the next experiment, the performance of the algorithms is shown when using the global reward.

5.2 Experiment 2: RL Using Global Reward

The global reward is the average throughput of all the simultaneously handled UEs. Using the global reward as the reward function in the three algorithms explained in chapter 4 produced the results shown below. The number of propagation frames was set to 4000 for this experiment as well.

Figures 5.3a, 5.3b and 5.3c show the performance of the ε-greedy, UCB and gradient algorithms respectively. The figures compare the RL algorithms with the performance of the non-RL power control, where the δ parameter is permanently set to zero. Additionally, Figure 5.4 shows the comparison between the three methods using the global reward.


5.2.1 Results

[Figure 5.3 plots: (a) Epsilon greedy algorithm, (b) UCB algorithm, (c) Gradient algorithm. Each panel shows the C.D.F. [%] versus user throughput [Mbps] for "No RL algorithm (delta = 0)" and the respective RL method.]

Figure 5.3: Each sub-figure demonstrates the performance of a different RL algorithm using global reward. The plots show the CDF of the average throughput of all UEs in two simulations, with and without RL algorithms. For every plotted simulation, the number of sites and UEs are set to 1 and 15 respectively. The δ parameter is set to zero in the non-RL simulations and set by the RL agents in the RL simulations.


[Figure 5.4 plot: C.D.F. [%] versus user throughput [Mbps]; curves: No RL algorithm (delta = 0), Epsilon greedy, UCB, Gradient.]

Figure 5.4: This plot shows the graphs in Figure 5.3 merged together to compare the three methods' performances when using global reward. The performances of the ε-greedy, UCB and gradient methods are almost identical. The graphs represent the CDF of the average throughput of the UEs. For all four simulations, the number of sites and UEs are set to 1 and 15 respectively. The δ parameter is constantly set to zero in the non-RL simulations and set to a value chosen by the RL agent in the RL simulations.

5.2.2 Interpretation of Results

Figure 5.3 confirms that the algorithms work with the global reward as well, since there are clear increases in throughput between the RL and non-RL simulations. If we take a close look at the comparison between the performances of the ε-greedy, UCB and gradient algorithms in Figure 5.4, we can tell that the gradient algorithm performs slightly better than ε-greedy and UCB. These results are achieved when the RL parameters are set according to Table 4.1; changing those parameters will of course lead to different results, but not necessarily better ones.

5.3 Comparison between Local and Global Results

Setting the reward function to either the local reward or the global reward seems to maximize the throughput properly. However, the question we want to ask is which reward function makes the RL agents perform better. Does the global reward help the agent perform better than it does with the local reward, as expected in subsection 4.4.2? To answer that, a plot is presented comparing the local and global reward using the ε-greedy method, see Figure 5.5. This plot shows that the ε-greedy algorithm using the local reward performs almost the same as, but slightly better than, the same algorithm using the global reward; the same result is obtained with the other algorithms as well. This is not the result we were hoping for. The global reward was created to let the agent know that we aim to maximize the total throughput of the UEs and not each UE for itself, but it fails to do that.


[Figure 5.5 plot: C.D.F. [%] versus user throughput [Mbps]; curves: Epsilon greedy with local reward, Epsilon greedy with global reward.]

Figure 5.5: This plot shows the comparison between the local and global reward when running the simulation with the same RL algorithm, ε-greedy. The graphs represent the CDF of the average throughput of the UEs. For both simulations, the number of sites and UEs are set to 1 and 15 respectively.

5.4 Experiment 3: Interference

Up until this point we have only evaluated the results of the different algorithms and reward functions. In this experiment, we turn our focus to the agent's performance regarding the interference between the UEs. We run different simulation settings without changing the RL algorithm or the reward function; they are set to the gradient algorithm and the local reward respectively. The number of propagation frames was again set to 4000 for all simulation runs in this experiment.

In order to draw conclusions about the interference, the simulations shown in Figure 5.6 had to be set up differently from the previous experiments. As mentioned in section 4.5, a site with three sectors is only able to schedule three UEs at a time. This means that only the three simultaneously scheduled UEs can interfere with each other, which is not enough to draw a conclusion about whether the RL algorithm improves the interference or not. We need a simulation setup that causes more interference, so we use multiple sites and reduce the distances between the sites, leaving the UEs' connections more vulnerable to interference.

We ran the simulations using a single site in the previous two experiments, but in this experiment we increase that to 9 sites and then to 19 sites. Since each site consists of a base-station with three sectors, the total number of sectors becomes 27 and then 57. The number of UEs used in the simulations was equal to the number of sectors, giving each sector one UE to deal with. Additionally, the distance between the sites is normally (and in the previous experiments) set to 200 meters, but we add another scenario where the distance is reduced to 70 meters in order to cause even more interference.


In Figure 5.6, the simulations mentioned above are found on the right hand side of the plot. On the left hand side of the plot we have the dashed graphs; they are the same simulations as the ones on the right side, except that they are operating without the RL agents, with the delta parameter constantly set to δ = 0. The RL-deactivated runs are included so that we can compare how the same simulations perform without our RL agent making decisions for the delta parameter.

5.4.1 Results

[Figure 5.6 plot: C.D.F. [%] versus user throughput [Mbps]; curves: No RL 19/57/57 (200 m distance), Gradient 19/57/57 (200 m distance), No RL 9/27/27 (200 m distance), Gradient 9/27/27 (200 m distance), No RL 19/57/57 (70 m distance), Gradient 19/57/57 (70 m distance).]

Figure 5.6: The plot shows the CDF of the average throughput of all UEs in the simulations. It demonstrates the performance of our gradient RL method using local reward in three different scenarios. The simulations are separated from each other by the number of sites (base-stations) and the inter-distance between the base-stations. In the legend, the type of RL method, the number of sites/sectors/UEs and the inter-distances are stated. The three dashed graphs on the left hand side of the plot are the RL-deactivated simulations that have the same setup as the corresponding graphs on the right hand side of the plot.

5.4.2 Interpretation of Results

The results in Figure 5.6 show six different simulations. We have three different simulation setups, each of which is run twice, with and without the RL agent. If we look at the three RL-activated simulations on the right hand side of the plot, we can observe that even though we increase the number of sites from 9 to 19, the performance of the RL agent does not get much worse. The simulation that scores the highest throughput is the one with 19 sites and 70 meter inter-distance between the sites. This simulation was expected to perform worse than the other two, since increasing the number of sites and decreasing the inter-distance between the sites should lead to more interference. The fact that the RL agent's performance is not affected by increasing the number of sites or reducing the inter-distance between the sites suggests, at first glance, that our RL agent does indeed improve the interference.


However, if we look at the corresponding RL-deactivated simulations on the left, we find that this conclusion is not correct. All of the RL-deactivated simulations perform almost the same despite increasing the number of sites and reducing the inter-distance between the sites. This shows that the changes made to the number of sites and the inter-distance between the sites do not really cause more interference. The interference may be influenced by these changes in reality, but it is not in our simulations; this issue is discussed in subsection 6.1.4. There is really no interference problem for our RL agent to solve in this experiment. The agent does, however, find the delta value that produces the highest throughput, just like in the previous experiments.

Furthermore, the reason why one of the RL-activated simulations performs better than the other two may be that it has a shorter inter-distance between the sites. This means that, on average, the UEs are pushed closer to the base-stations, which leads to lower path loss for the transmitted signal and therefore to higher throughput. This might not be the only explanation, but one thing we can be sure of is that it is not the RL agent detecting the interference and working around it by manipulating the transmission power levels of the UEs.


Chapter 6

Discussion and Conclusion

In order to draw conclusions about implementing reinforcement learning in uplink power control, we need to discuss the advantages and disadvantages of our implemented system. In this chapter, we will go through the parts of our design that worked and the parts that did not. We will discuss the reasons behind the failed and successful attempts at solving the thesis problem, as well as explain the reasons why reinforcement learning became the topic of this thesis.

It is important to end the thesis by presenting the ideas that were never implemented but could possibly lead to better results if they were implemented in the future. These ideas are explained in the Future Work section. Furthermore, as a final section, the ethical aspects of our thesis are discussed in the Ethics section.

6.1 Discussion

We have four main discussion topics in this thesis. These discussions are found in the subsections below.

6.1.1 Reinforcement Learning

Our approach to the uplink power control problem was to solve it using machine learning. As mentioned in chapter 3, there are multiple types of machine learning, so why did we choose RL? Supervised and unsupervised machine learning need data to train on and learn from, and the learned information is then applied to unseen data. Reinforcement learning, on the other hand, is the type of machine learning that learns by interacting with an environment. For our project, there were no available data to train on, but an environment was available through Ericsson's simulation, which made RL the obvious choice. In addition, a major advantage of RL is that it does not rely on knowing what the correct actions are. This makes RL an even better candidate for solving our uplink power control problem, where there is no right or wrong transmit power level; all we know is that we want better results.


6.1.2 Global Reward Function Failure

The result in Figure 5.5 shows that the global reward function does not perform better than the local reward function, contrary to what was expected. The global reward was designed to improve the performance of the agent, but it failed to do that. After further investigation, I found that this is because the global reward, which is the average throughput of the simultaneously handled UEs, is not computed over the same set of UEs in every iteration when there is more than one UE per sector. For example, in a site with three sectors we have UE1, UE2 and UE3 belonging to sectors 1, 2 and 3 respectively. The global reward is then equal to the average throughput of those three UEs every time the agent rewards any one of them. If we introduce another user device, UE4, that also belongs to sector 1, then the global reward will in the first iteration be equal to the average throughput of UE1, UE2 and UE3, and in the second iteration the average throughput of UE4, UE2 and UE3. Since the two iterations do not handle the same UEs they are, in RL terms, two different states. However, we have not designed the system to handle them as two different states, and this causes confusion in the agent's learning process.

If we set up the simulation so that there is only one UE per sector, the results show that the performance of the agent using the global reward is the same as when using the local reward. With the global reward as our reward function, the performance only gets worse as we increase the number of deployed UEs. For that reason, the global reward function that we have designed does not work. To solve the problem, a different reward function or a different approach should be designed; more suggestions are given in section 6.3.

6.1.3 Other Reward Functions

Apart from the mentioned global and local rewards, multiple other reward functions were designed and tried out. Their results are not included in this report because their designs were not good enough to be compared with the global and local rewards. Be that as it may, a reward function that came close to competing with the global and local rewards was

$$\begin{aligned}
\text{reward} &= \text{LocalReward} - \beta \times \text{punishment},\\
\text{punishment} &= \text{LastGlobalReward} - \text{GlobalReward},
\end{aligned} \tag{6.1}$$

where β is a constant that controls the degree of punishment, and the punishment is the difference between the previous global reward and the current global reward (the average throughput of all simultaneously handled UEs). The punishment term was designed so that the agent rewards the executed action choices based not only on the local reward but also on the global reward. This way, if an action produces a high local reward, or throughput, but damages the connections of the other UEs, then the punishment term will penalize that action and reduce its likelihood of being


picked in the future. The designed reward function is a good approach, but because of the unreliability of the global reward mentioned in subsection 6.1.2, the expected performance was not achieved.
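A minimal sketch of the combined reward in Equation 6.1 is shown below; the value beta = 0.5 and the throughput(ue) helper are illustrative placeholders, not values or APIs from the thesis.

```python
# Sketch of the local reward with a punishment term, Equation 6.1.

def punished_reward(ue, scheduled_ues, last_global, throughput, beta=0.5):
    """Local reward minus a penalty for any drop in the global reward."""
    local = throughput(ue)
    global_now = sum(throughput(u) for u in scheduled_ues) / len(scheduled_ues)
    punishment = last_global - global_now    # positive if the global reward dropped
    return local - beta * punishment, global_now
```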

6.1.4 Interference Not Found

There have been different speculations about why the interference did not increase when more sites were added and the inter-distances between them were reduced in experiment 3. To be really sure, however, a deeper study of Ericsson's simulation program is required.

One of the reasons may be that the Interference Rejection Combining (IRC) receivers do a very good job of cancelling the interfering signals in the simulations. The added number of sites may therefore not have the expected impact on the interference.

We also have to consider the possibility of flaws in the simulation software, because the program is not completely finished and is still under development. During the time this project was carried out, a couple of errors were found in the simulation program, and there may be more undiscovered errors in it that can cause issues like this.

6.2 Conclusion

Reinforcement learning is a good solution for multi-armed bandit problems. All three implemented algorithms successfully achieve the goal of maximizing the throughput in uplink. However, formulating the thesis problem as a multi-armed bandit problem, and then as a contextual bandit problem when the states were introduced, and trying to solve it through an RL agent that only has power over one parameter of the uplink power control, is an inadequate approach.

Using our implemented reward functions, the interference problem does not fall within the agent's reach and is hence not solvable by the agent. The reason is that the RL agents are designed under the assumption that increasing the number of user devices, or UEs, in an environment also increases the interference between them. The simulation we use in this thesis does not, however, support that assumption. Therefore, a reward function that correctly captures the interference problem could not be found in this thesis.

Be that as it may, the RL agents did in fact improve the uplink power control. If we separate the project goal into two different goals, the first being maximizing the throughput and the second being improving the interference, then we can conclude that our designed reinforcement learning application is well suited for achieving the first goal. It is, however, not suited to fulfill our second goal.

The main conclusion about reinforcement learning drawn from our study is that a strong contributing factor to the success of a reinforcement learning application


is how well the reward function outlines the problem and how well it evaluates the progress in solving the problem.

6.3 Future Work

The research in this thesis can be extended in many ways. There are a few ideas that could be implemented and further investigated; we mention these ideas in this section.

In our implementations, the states have always represented the different UEs. This means that the only information about the environment that the agent receives prior to making decisions about a UE's transmit power is which UE it is making decisions for. While that is a reasonable representation of the environment, there are other, possibly better, ways to represent it. One idea is to cluster the UEs based on their distances to the base-station. The distance of a UE to its base-station can be estimated with the help of the reduction in power density, or path loss, of the transmitted signal.

Instead of having the states represent each individual UE, we can design the states to represent the location of the UEs as either near or far away from the base-station, S ∈ {near, far}. This way of representing the environment significantly reduces the number of states. If we, for example, have 100 UEs on a site, we only need two states to represent that environment, as opposed to our current design, which would have resulted in 100 different states.

Using this design for the states, the agent only needs to learn which actions are good in two different scenarios: the UE being near to or far away from the base-station. This may lead to better performance in terms of convergence time; the fewer the states, the faster the learning process.
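A minimal sketch of this state design is given below, assuming a path loss threshold; the 100 dB value is an illustrative assumption and is not taken from the thesis.

```python
# Sketch of mapping a UE to a "near" or "far" state based on its path loss.

def ue_state(path_loss_db: float, threshold_db: float = 100.0) -> str:
    """Return "near" or "far" depending on the UE's path loss to its base-station."""
    return "near" if path_loss_db < threshold_db else "far"
```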

Another idea that may possibly lead to better performance is accumulating the δ value, or the closed-loop correction signal, instead of only setting it to the absolute values δ ∈ {−4, −1, 1, 4}. The accumulated δ value can be calculated with

$$\delta_t = \delta_{t-1} + \theta, \tag{6.2}$$

where θ is the parameter selected by the agent, which can take the values θ ∈ {−1, 0, 1}. The agent will then not be limited to a few specific values for the closed-loop correction signal. It will instead be able to set the transmission power to any value under the upper limit of 23 dBm.
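A minimal sketch of the accumulated correction signal in Equation 6.2, kept per UE; the θ action set {−1, 0, 1} is taken from the text above, and the helper names are illustrative.

```python
# Sketch of accumulating the closed-loop correction signal per UE, Equation 6.2.

THETA_ACTIONS = [-1, 0, 1]   # the values theta may take
delta = {}                   # current accumulated delta per UE

def step_delta(ue, theta):
    """Apply delta_t = delta_{t-1} + theta for the given UE."""
    assert theta in THETA_ACTIONS
    delta[ue] = delta.get(ue, 0) + theta
    return delta[ue]
```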

Finally, the last future work suggestion we present is to apply the RL algorithms not only to the δ parameter but also to the path loss compensation factor α, see Equation 2.1 in chapter 2. Reinforcement learning could potentially be even more useful for finding a good value for the α parameter than for the δ parameter.


6.4 Ethics

There are many ethical aspects of machine learning and AI that are frequently discussed nowadays, as newer and more advanced algorithms are being developed. Questioning the ethical aspects of newly built products is fundamentally a matter of moral principles, and the first step in demonstrating good intentions on the developer's part is to examine the effects of the new product on the social conditions of our lives.

If we give our decision-making algorithms more authority and control, a valid ethical issue is whether or not someone should be held responsible if things go wrong, and if so, who. Our society is built on rules that make sure that members of society do not harm each other or the environment we live in. When a member commits an action that breaks those rules, he or she is punished accordingly. This works to some extent when it comes to humans, but with AI products this punishment method makes no sense. Therefore, we can argue that a person should take responsibility for the actions taken by the machine; it could be the company manufacturing the product, the person developing it, or maybe even the government department that approves the product. If such a product is used for applications that have a great impact on human lives, then this is an issue that needs to be settled before it is launched.

If the RL implementation explained in this thesis is ever to be launched, there are a number of possible consequences that need to be considered. Manipulating the transmission power can lead to battery damage in the user devices; the agent may learn to maximize its reward by doing something other than what we intended; or it could possibly even cause more interference instead of reducing it. These are just some examples of the unwanted consequences that can be caused by such a product.


Bibliography

[1] D. Miorandi, S. Sicari, F. D. Pellegrini, and I. Chlamtac, "Internet of things: Vision, applications and research challenges," Ad Hoc Networks, vol. 10, no. 7, pp. 1497–1516, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1570870512000674

[2] A. Simonsson and A. Furuskar, "Uplink power control in LTE - overview and performance, subtitle: Principles and benefits of utilizing rather than compensating for SINR variations," in 2008 IEEE 68th Vehicular Technology Conference, Sept 2008, pp. 1–5.

[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2017. [Online]. Available: http://incompleteideas.net/book/bookdraft2017nov5.pdf

[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.

[5] R. Mullner, C. F. Ball, K. Ivanov, J. Lienhart, and P. Hric, "Contrasting open-loop and closed-loop power control performance in UTRAN LTE uplink by UE trace analysis," in 2009 IEEE International Conference on Communications, June 2009, pp. 1–6.

[6] 3GPP, "NR; Physical layer procedures for control," 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 38.213.f00, December 2017, version 15.0.0. [Online]. Available: http://www.3gpp.org/DynaReport/38213.htm

[7] R. Weber, "On the Gittins index for multiarmed bandits," Annals of Applied Probability, vol. 2, pp. 1024–1033, 1992. [Online]. Available: https://projecteuclid.org/euclid.aoap/1177005588

[8] M. van Otterlo, "Reinforcement learning and Markov decision processes," 2009. [Online]. Available: https://www.semanticscholar.org/paper/Reinforcement-Learning-and-Markov-Decision-Otterlo/a446aeee58179a425b839bd0bff9562159b317db

[9] L. Brickman, Preparing the 21st Century Church. Xulon Press, 2002. [Online]. Available: https://books.google.se/books?id=R6ocCjZIrrUC

[10] (2017, Nov) Deep reinforcement learning models: Tips & tricks for writing reward functions. [Online; accessed 05-June-2018]. Available: https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0

[11] R. S. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference. MIT Press, Cambridge, MA, 1996, pp. 1038–1044.

[12] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2, pp. 235–256, May 2002. [Online]. Available: https://doi.org/10.1023/A:1013689704352

[13] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.


www.kth.se
TRITA-EECS-EX-2018:787