Reinforcement Learning for Mobile Computing

Page 1:

Sungkyunkwan University

Copyright 2000-2019 Networking Laboratory

Reinforcement Learning for Mobile Computing

Prepared by D.T. Le and H. Choo

Page 2:

Contents

Reinforcement Learning (RL)

Q-Learning

► Case Study 1: RL in Broadcast Schedule

Deep Q-Learning

► Case Study 2: RL in Resource Management

Page 3:

Reinforcement Learning (RL): Example #1 – A blind man’s buff game

Page 4:

Reinforcement Learning (RL): Example #2 – An intelligent network

Page 5:

Reinforcement Learning (RL): Definition

Agent-based learning: the agent learns by interacting with its environment

Learning by trial and error

Page 6:

Reinforcement Learning (RL): Characteristics

Difference from supervised learning:
No training data; wrong decisions are not corrected explicitly
Learning happens online, based on implicit feedback in the form of reward/cost values
► Your action influences the state of the world, which determines the reward

Main challenge:
Exploration vs. Exploitation

Page 7:

Reinforcement Learning (RL): Exploration versus exploitation

Two reasons to take an action in RL
► Exploration: gather more information that might lead us to better decisions in the future
  How do we know if there is not a pot of gold around the corner?
  We typically need to take actions that do not seem the best according to the current model
► Exploitation: make the best decision given current information

Managing the trade-off between exploration and exploitation is a critical issue in RL

Basic intuition behind most approaches (a small sketch follows after this list):
► Explore more when knowledge is weak
► Exploit more as we gain knowledge
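As an illustration of this intuition, here is a minimal Python sketch of an 𝜀-greedy action selector whose exploration rate decays as experience accumulates; the decay schedule and the episode loop are illustrative assumptions, not part of the slides.

```python
# Minimal epsilon-greedy sketch: explore a lot at first, exploit more later.
# The decay schedule and the 1000-episode loop are assumptions for illustration.
import random

epsilon, epsilon_min, decay = 1.0, 0.05, 0.995

def choose_action(q_values, epsilon):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])   # exploit

for episode in range(1000):
    # ... interact with the environment via choose_action(q_values, epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)   # knowledge grows, so exploit more
```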

Page 8:

Reinforcement Learning (RL): Success stories

Minsky’s Stochastic Neural Analogy Reinforcement Computer (SNARC) – 1951

Matching the world’s best players in backgammon (Tesauro, 1992–95)

Helicopter autopilot (Ng et al., 2006)

Human-level performance in Atari games through deep Q-networks (DeepMind, 2013–15)

AlphaGo beats a top Go player (DeepMind, 2016)

Self-taught AlphaGo Zero beats AlphaGo 100 to 0 (DeepMind, 2017)

Page 9:

Reinforcement Learning (RL): Markov Decision Process (MDP)

Mathematical formulation of the RL problem

Defined by the tuple (𝒮, 𝒜, ℝ, ℙ, 𝛾)
► 𝒮: set of possible states
► 𝒜: set of possible actions
► ℝ: distribution of reward for a given (state, action) pair
► ℙ: transition probability, i.e. distribution over the next state for a given (state, action) pair
► 𝛾: discount factor

[Figure: MDP of a robot trying to walk – states, actions, rewards, and transition probabilities; slow and fast actions]

Page 10:

Reinforcement Learning (RL): Markov Decision Process (MDP)

How to handle the randomness (initial state, transition probability, …)?
→ Answer: the expected value 𝔼
► The expected value is the average value of a random variable over a large number of experiments
► e.g. value of the reward when the robot executes a fast action in the Moving state:
  𝔼[reward of fast action when moving] = 0.2 × (−1) + 0.8 × 2 = 1.4

[Figure: MDP of a robot trying to walk – slow and fast actions with their transition probabilities and rewards]
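This calculation can be checked with a couple of lines of Python; the (probability, reward) pairs are the ones quoted on the slide.

```python
# Expected reward of the fast action in the Moving state,
# using the (probability, reward) outcomes quoted on the slide.
outcomes = [(0.2, -1.0), (0.8, 2.0)]
expected_reward = sum(p * r for p, r in outcomes)
print(round(expected_reward, 2))  # 1.4
```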

Page 11:

Reinforcement Learning (RL): The task (1/2)

To learn a policy 𝜋 that selects an action for a state
𝜋: 𝒮 → 𝒜, with 𝜋(𝑠_𝑡) = 𝑎_𝑡

Cumulative discounted reward
𝑟_0 + 𝛾𝑟_1 + 𝛾^2 𝑟_2 + ⋯ = Σ_{𝑡≥0} 𝛾^𝑡 𝑟_𝑡
► 𝑟_𝑡 = ℝ(𝑠_𝑡, 𝑎_𝑡): the reward when we perform action 𝑎_𝑡 in state 𝑠_𝑡
► 𝑠_{𝑡+1} = ℙ(𝑠_𝑡, 𝑎_𝑡): the next state when we perform action 𝑎_𝑡 in state 𝑠_𝑡
► 0 < 𝛾 ≤ 1: an immediate reward is worth more than a future reward

What would happen if 𝛾 = 0?
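The short sketch below evaluates the cumulative discounted reward for a made-up reward sequence, and also answers the 𝛾 = 0 question: only the immediate reward 𝑟_0 survives. The reward values are illustrative assumptions.

```python
# Cumulative discounted reward sum_{t>=0} gamma**t * r_t for a short episode.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]           # r_0, r_1, r_2, r_3 (made-up values)
print(discounted_return(rewards, 0.9))   # 1 + 0 + 0.9**2 * 2 + 0.9**3 * 3
print(discounted_return(rewards, 0.0))   # 1.0 -- only the immediate reward counts
```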

Page 12:

Reinforcement Learning (RL): The task (2/2)

What would be the optimal policy 𝜋∗?
→ Answer: the policy that maximizes the expected cumulative reward

𝜋∗ = argmax_𝜋 𝔼[Σ_{𝑡≥0} 𝛾^𝑡 𝑟_𝑡 | 𝜋]

E.g. reach one of the terminal states (greyed out in the figure) in the least number of actions

Page 13:

Reinforcement Learning (RL): Value function and Q-value function

How good is state 𝑠 under policy 𝜋?
→ Answer: the expected cumulative reward following policy 𝜋 from state 𝑠

𝑉^𝜋(𝑠) = 𝔼[Σ_{𝑡≥0} 𝛾^𝑡 𝑟_𝑡 | 𝑠_0 = 𝑠, 𝜋]

How good is a (state, action) pair under policy 𝜋?
→ Answer: the expected cumulative reward from taking action 𝑎 in state 𝑠 and then following the policy

𝑄^𝜋(𝑠, 𝑎) = 𝔼[Σ_{𝑡≥0} 𝛾^𝑡 𝑟_𝑡 | 𝑠_0 = 𝑠, 𝑎_0 = 𝑎, 𝜋]

Page 14:

Reinforcement Learning (RL): Optimal policy

The optimal value function and Q-value function are:

𝑉∗(𝑠) = max_𝜋 𝔼[Σ_{𝑡≥0} 𝛾^𝑡 𝑟_𝑡 | 𝑠_0 = 𝑠, 𝜋]

𝑄∗(𝑠, 𝑎) = max_𝜋 𝔼[Σ_{𝑡≥0} 𝛾^𝑡 𝑟_𝑡 | 𝑠_0 = 𝑠, 𝑎_0 = 𝑎, 𝜋]

When the environment is fully observable and its model is known, i.e. you know 𝑟_𝑡 = ℝ(𝑠_𝑡, 𝑎_𝑡) and 𝑠_{𝑡+1} = ℙ(𝑠_𝑡, 𝑎_𝑡), the optimal values can be computed by model-based learning through value iteration and policy iteration (see the Appendix)
► However, this is unrealistic in many real problems, e.g., what is the reward if a robot is exploring Mars and decides to take a right turn?
→ We can circumvent this problem by exploring how the world reacts to our actions through a state-action function → Q-learning (model-free learning)

Page 15:

Reinforcement Learning (RL): Bellman Equation

𝑄∗ satisfies the following Bellman equation

𝑄∗(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄∗(𝑠′, 𝑎′) | 𝑠, 𝑎]

► 𝑟 = ℝ(𝑠, 𝑎): the reward when we perform action 𝑎 in state 𝑠
► 𝑠′, 𝑎′: the next state and action, respectively

Intuition: if the optimal Q-values for the next time-step are known, then the optimal strategy is to take the action that maximizes the expected value of 𝑟 + 𝛾𝑄∗(𝑠′, 𝑎′)
► The optimal policy 𝜋∗ corresponds to taking the best action in any state as specified by 𝑄∗

Page 16:

Q-Learning

Main idea: “explore” all possibilities of state-action pairs (the Q-table) and estimate the 𝑄(𝑠, 𝑎) value

𝑄(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′) | 𝑠, 𝑎]

► 𝑄(𝑠, 𝑎): Q-value of the state-action pair (𝑠, 𝑎)

Episode: the sequence of states visited between an initial state and a terminal state

Q-learning: updating a Q-table
► Initially, all Q-values are zero, as the agent has no knowledge of the environment
► In each episode, the agent selects an action, executes the action, observes the feedback (reward and the next state), and updates the Q-table from the feedback (a minimal sketch follows after this list)
► 𝑄_𝑖 will converge to 𝑄∗ as 𝑖 → ∞
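Below is a minimal, hedged sketch of this loop in Python on a tiny made-up chain environment (states 0 to 3, with a reward for reaching the last state). The environment and the 𝛾, 𝛼, 𝜀 values are illustrative assumptions; the update uses the learning-rate variant shown later in the case study.

```python
# Tabular Q-learning sketch on a made-up 4-state chain (not from the slides).
import random
from collections import defaultdict

N_STATES, ACTIONS = 4, [0, 1]             # action 0 = step left, 1 = step right
gamma, alpha, epsilon = 0.9, 0.5, 0.1
Q = defaultdict(float)                    # Q-table; missing entries read as 0

def step(state, action):
    """Deterministic chain: reaching the last state gives reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy selection over the Q-table (ties broken at random).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: (Q[(state, a)], random.random()))
        next_state, reward, done = step(state, action)
        # Update toward the Bellman target r + gamma * max_a' Q(s', a').
        target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

print(Q[(0, 1)], Q[(0, 0)])   # stepping right from state 0 should score higher
```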

Page 17:

Q-Learning: Case study – One-to-all broadcast

Problem statement
► Given a multi-hop wireless network consisting of 𝑁 nodes, including one source node
► The source broadcasts a message to all nodes in the shortest time and with the minimum number of transmissions

Simplified problem (broadcast backbone construction)
► Find a broadcast backbone that covers as many nodes as possible
► Intuition: the more nodes are covered by each broadcaster, the fewer transmissions are required to cover the whole network

Page 18:

Q-Learning: Case study – Learning model

Given a network 𝐺 = (𝑉, 𝐸)
► 𝑉: the set of nodes, with 𝑣_0 ∈ 𝑉 the source node
► 𝐸: the set of communication links between nodes

Construct a broadcast backbone 𝐵 = (𝑉_𝐵, 𝐸_𝐵)
► Initially, 𝑣_0 ∈ 𝑉_𝐵 and 𝐸_𝐵 = ∅

Use the current broadcaster as the state
► All uncovered neighbors of the broadcaster are added into 𝑉_𝐵

An action corresponds to a selection of the next broadcaster
► Selecting one node among the nodes in 𝑉_𝐵

The immediate reward of an action is the number of nodes newly covered by the selected broadcaster

An episode starts from state 𝑣_0 and continues until a broadcaster covers the furthest nodes

Page 19:

Q-Learning: Case study – Epsilon greedy approach

All nodes are divided into layers 𝐿_0, …, 𝐿_𝑙 according to their distances to the source node

Let 𝑣 ∈ 𝑉_𝐵 be the current broadcaster, covering nodes in 𝐿_𝑘 (0 ≤ 𝑘 ≤ 𝑙)

The action is selecting a next broadcaster for the next layer 𝐿_{𝑘+1}
► With probability 𝜀 (*):
  𝑣′ ← random(𝑁(𝑣) ∩ 𝐿_𝑘)
► Otherwise:
  𝑣′ ← argmax_{𝑣′ ∈ 𝑁(𝑣) ∩ 𝐿_𝑘} 𝑄(𝑣, 𝑣′)

Select a node from the backbone to cover the most uncovered nodes

(*) Epsilon greedy approach
► At the beginning, explore the environment and randomly choose actions
► When the Q-table is ready, the agent will start to exploit the environment and start taking better actions

Page 20:

Q-Learning: Case study – Epsilon greedy approach (cont.)

Immediate reward:
𝑟(𝑣, 𝑣′) = |𝑁(𝑣′) ∩ 𝐿_{𝑘+1}|

Learning value: state-action value
► Recall the Bellman equation: 𝑄_{𝑖+1}(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄_𝑖(𝑠′, 𝑎′) | 𝑠, 𝑎]
→ Implemented Q-value update:
  𝑄(𝑣, 𝑣′) = 𝑟 + 𝛾 max_{𝑣′′} 𝑄(𝑣′, 𝑣′′)
► Another implementation (try at home):
  𝑄_{𝑖+1}(𝑠, 𝑎) = (1 − 𝛼) 𝑄_𝑖(𝑠, 𝑎) + 𝛼 (𝑟 + 𝛾 max_{𝑎′} 𝑄_𝑖(𝑠′, 𝑎′))
  𝛼: learning rate

(A sketch of this selection and update in Python follows.)
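The following is a hedged Python sketch of the selection and update rule above, on a small made-up topology; the adjacency, layers, 𝛾 and 𝜀 values are illustrative assumptions, not the network used in the example slides that follow.

```python
# Epsilon-greedy next-broadcaster selection and Q-update for the broadcast
# backbone case study. The graph, layers, gamma and epsilon are made-up inputs.
import random

adj = {"A": {"B", "C"}, "B": {"A", "D", "E"}, "C": {"A", "E"},
       "D": {"B"}, "E": {"B", "C"}}
layers = {"A": 0, "B": 1, "C": 1, "D": 2, "E": 2}
gamma, epsilon = 0.1, 0.2
Q = {(v, u): 0.0 for v in adj for u in adj[v]}     # Q-table over neighbor pairs

def next_broadcaster(v):
    """Pick the next broadcaster among v's neighbors in the next layer."""
    candidates = [u for u in adj[v] if layers[u] == layers[v] + 1]
    if not candidates:
        return None                                   # v already covers the last layer
    if random.random() < epsilon:
        return random.choice(candidates)              # explore
    return max(candidates, key=lambda u: Q[(v, u)])   # exploit

def update(v, v_next):
    """Q(v, v') = r + gamma * max_{v''} Q(v', v''), with r = |N(v') in the next layer|."""
    r = sum(1 for w in adj[v_next] if layers[w] == layers[v_next] + 1)
    future = max((Q[(v_next, w)] for w in adj[v_next]), default=0.0)
    Q[(v, v_next)] = r + gamma * future

# One training episode starting from the source A:
v = "A"
while v is not None:
    v_next = next_broadcaster(v)
    if v_next is not None:
        update(v, v_next)
    v = v_next
print(Q)
```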

Page 21:

Q-Learning: Example – Initially

[Legend: node ID; network link; layer information; source; backbone node]

[Q-table over node pairs A–M: all entries initialized to 0]

[Figure: example network – source A in layer L0; B, C, D in L1; E, F, G, H in L2; I, J, K, L, M in L3]

Page 22:

Q-Learning: Example – Episode 1 (𝑣 = 𝐴)

[Q-table: all entries 0 except 𝑄(𝐴, 𝐵) = 2]

• 𝑣′ ∈ {𝐵, 𝐶, 𝐷}
• 𝑣′ ← argmax 𝑄(𝑣, 𝑣′) = 𝐵
• 𝑟(𝐴, 𝐵) = |𝑁(𝐵) ∩ 𝐿_2| = 2
• 𝑄(𝐴, 𝐵) = 𝑟 + 𝛾 max_{𝑣′′} 𝑄(𝐵, 𝑣′′) = 2

[Figure: example network with the current backbone highlighted]

Page 23:

Q-Learning: Example – Episode 1 (𝑣 = 𝐵)

[Q-table: all entries 0 except 𝑄(𝐴, 𝐵) = 2 and 𝑄(𝐵, 𝐸) = 2]

• 𝑣′ ∈ {𝐸, 𝐹}
• 𝑣′ ← argmax 𝑄(𝑣, 𝑣′) = 𝐸
• 𝑟(𝐵, 𝐸) = |𝑁(𝐸) ∩ 𝐿_3| = 2
• 𝑄(𝐵, 𝐸) = 𝑟 + 𝛾 max_{𝑣′′} 𝑄(𝐸, 𝑣′′) = 2

[Figure: example network with the current backbone highlighted]

Page 24:

Q-Learning: Example – Episode 2 (𝑣 = 𝐴)

[Q-table: all entries 0 except 𝑄(𝐴, 𝐵) = 2, 𝑄(𝐴, 𝐶) = 3, and 𝑄(𝐵, 𝐸) = 2]

• 𝑣′ ∈ {𝐵, 𝐶, 𝐷}
• 𝑣′ ← argmax 𝑄(𝑣, 𝑣′) = 𝐵
• 𝑣′ ← random(𝑁(𝐴) ∩ 𝐿_1) = 𝐶   (exploration step)
• 𝑟(𝐴, 𝐶) = |𝑁(𝐶) ∩ 𝐿_2| = 3
• 𝑄(𝐴, 𝐶) = 𝑟 + 𝛾 max_{𝑣′′} 𝑄(𝐶, 𝑣′′) = 3

[Figure: example network with the current backbone highlighted]

Page 25:

Q-Learning: Example – Episode 2 (𝑣 = 𝐶)

[Q-table: all entries 0 except 𝑄(𝐴, 𝐵) = 2, 𝑄(𝐴, 𝐶) = 3, 𝑄(𝐵, 𝐸) = 2, and 𝑄(𝐶, 𝐹) = 2]

• 𝑣′ ∈ {𝐹, 𝐺, 𝐻}
• 𝑣′ ← random(𝑁(𝐶) ∩ 𝐿_2) = 𝐹
• 𝑟(𝐶, 𝐹) = |𝑁(𝐹) ∩ 𝐿_3| = 2
• 𝑄(𝐶, 𝐹) = 𝑟 + 𝛾 max_{𝑣′′} 𝑄(𝐹, 𝑣′′) = 2

[Figure: example network with the current backbone highlighted]

Page 26:

Q-Learning: Example – Episode 3 (𝑣 = 𝐴)

[Q-table: all entries 0 except 𝑄(𝐴, 𝐵) = 2, 𝑄(𝐴, 𝐶) = 3.2, 𝑄(𝐵, 𝐸) = 2, and 𝑄(𝐶, 𝐹) = 2]

• 𝑣′ ∈ {𝐵, 𝐶, 𝐷}
• 𝑣′ ← argmax 𝑄(𝑣, 𝑣′) = 𝐶 ☺☺☺
• 𝑟(𝐴, 𝐶) = |𝑁(𝐶) ∩ 𝐿_2| = 3
• 𝑄(𝐴, 𝐶) = 𝑟 + 𝛾 max_{𝑣′′} 𝑄(𝐶, 𝑣′′) = 3 + 𝛾 · 2 = 3.2 (i.e. 𝛾 = 0.1 in this example)

[Figure: example network with the current backbone highlighted]

Page 27:

Q-Learning: Limitations

Not scalable
► Must compute 𝑄(𝑠, 𝑎) for every state-action pair
  E.g. the Q-table size is 200 × 200 when broadcasting in a network of 200 nodes
► Good for low-dimensional state and action spaces, but infeasible in many real problems
  E.g. on-demand provisioning of multi-dimensional resources (CPU, RAM, bandwidth, …)

Solution: use a function approximator to estimate 𝑄∗(𝑠, 𝑎)
→ When the function approximator is a deep neural network → Deep Q-learning

Page 28:

Page 29:

Deep Q-Learning – Neural Network

Objective: find a Q-value function that satisfies the Bellman equation

𝑄∗(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄∗(𝑠′, 𝑎′) | 𝑠, 𝑎]

A deep neural network (Q-network) is used to approximate the optimal Q-value
𝑄(𝑠, 𝑎; 𝜃) ≈ 𝑄∗(𝑠, 𝑎)
► 𝜃: approximation parameters (weights)

[Embedded excerpt from the DeepRM paper (“Resource Management with Deep Reinforcement Learning”): motivation for RL-based resource management, background on RL and on representing the policy with a DNN, and policy gradient methods; includes Figure 1: Reinforcement Learning with policy represented via DNN.]

Page 30:

Deep Q-Learning – Neural Network

Objective: find a Q-value function that satisfies the Bellman equation

𝑄∗(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄∗(𝑠′, 𝑎′) | 𝑠, 𝑎]

A deep neural network (Q-network) is used to approximate the optimal Q-value
𝑄(𝑠, 𝑎; 𝜃) ≈ 𝑄∗(𝑠, 𝑎)
► 𝜃: approximation parameters (weights)
► Forward pass: loss function 𝐿_𝑖(𝜃_𝑖) = 𝔼[(𝑦_𝑖 − 𝑄(𝑠, 𝑎; 𝜃_𝑖))^2],
  where 𝑦_𝑖 = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′; 𝜃_{𝑖−1}) | 𝑠, 𝑎]
► Backward pass: gradient update (with respect to 𝜃)
  ∇_{𝜃_𝑖} 𝐿_𝑖(𝜃_𝑖) = 𝔼[(𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′; 𝜃_{𝑖−1}) − 𝑄(𝑠, 𝑎; 𝜃_𝑖)) ∇_{𝜃_𝑖} 𝑄(𝑠, 𝑎; 𝜃_𝑖)]
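For illustration, here is a hedged PyTorch sketch of a single forward/backward update step with this loss; the network sizes, 𝛾, and the random minibatch are assumptions, and 𝜃_{𝑖−1} is represented by a frozen copy of the Q-network.

```python
# One DQN update step: forward pass computes the MSE loss against the target
# y_i built from a frozen copy of the network, backward pass updates theta_i.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # illustrative sizes, not from the slides

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())      # theta_{i-1}: frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A random minibatch of transitions (s, a, r, s') just to exercise the step.
s = torch.randn(32, state_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, state_dim)

# Forward pass: target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}).
with torch.no_grad():
    y = r + gamma * target_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
loss = nn.functional.mse_loss(q_sa, y)                 # L_i(theta_i)

# Backward pass: gradient step on theta_i.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```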

Page 31:

Deep Q-Learning – Neural Network (cont.)

(Same objective and forward/backward passes as the previous page.)

Iteratively try to reduce the mean-squared error, i.e. to make the Q-value 𝑄(𝑠, 𝑎; 𝜃_𝑖) close to the target value 𝑦_𝑖

Page 32:

Deep Q-Learning – Neural Network: Case study – Radio resource allocation

Given a downlink cellular network [1] with
► 𝐹 frequency sub-bands: 𝑓 ∈ {1, …, 𝐹}
► 𝐾 base stations (BSs): 𝑘 ∈ {1, …, 𝐾}
► 𝑃_𝑘: the maximum total power of cell 𝑘
► 𝒰_𝑘: the set of users associated with cell 𝑘

Problem: find sub-band and power allocations to maximize the network throughput

[1] K. I. Ahmed, H. Tabassum and E. Hossain, “Deep Learning for Radio Resource Allocation in Multi-Cell Networks,” IEEE Network. doi:10.1109/MNET.2019.1900029

Page 33:

Deep Q-Learning – Neural Network: Case study – Radio resource allocation (cont.)

Power allocation
► 𝑃_{𝑘,𝑓} denotes the power allocated to sub-band 𝑓 in cell 𝑘
  Σ_{𝑓∈𝐹} 𝑃_{𝑘,𝑓} ≤ 𝑃_𝑘, ∀𝑘 ∈ 𝐾

Sub-band allocation
► 𝐴_{𝑘,𝑓} denotes the user who is allocated sub-band 𝑓 in cell 𝑘
  𝐴_{𝑘,𝑓} = argmax_{𝑢∈𝒰_𝑘} 𝐵 log(1 + 𝛼 SINR_{𝑢,𝑘,𝑓})
  𝐵 is the bandwidth of each sub-band (MHz)
  𝛼 = −1.5 / log(5 · BER) ≈ 0.2829
  • BER: Bit Error Rate (assumed to be 10^−6)
  SINR: Signal to Interference plus Noise Ratio
  SINR_{𝑢,𝑘,𝑓} = 𝑃_{𝑘,𝑓} 𝐺_{𝑢,𝑘,𝑓} / (𝜂_𝑢 + Σ_{𝑙≠𝑘} 𝑃_{𝑙,𝑓} 𝐺_{𝑢,𝑙,𝑓})
► Sub-band 𝑓 of cell 𝑘 is allocated to the user 𝑢 ∈ 𝒰_𝑘 that maximizes the throughput of sub-band 𝑓 in cell 𝑘 (a small numeric sketch follows)
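The per-sub-band utility above can be exercised with a short Python sketch; all powers, gains, and noise values below are made-up inputs, and the logarithm base is assumed to be 2.

```python
# SINR of user u on sub-band f of cell k, and the resulting utility
# B * log(1 + alpha * SINR). All numeric inputs are illustrative assumptions.
import math

B = 1.0            # sub-band bandwidth
alpha = 0.2829     # value quoted on the slide for BER = 1e-6

P_kf, G_serving = 2.0, 0.8                 # serving power P_{k,f} and gain G_{u,k,f}
noise = 0.1                                # eta_u
interferers = [(1.5, 0.05), (1.0, 0.02)]   # (P_{l,f}, G_{u,l,f}) for l != k

sinr = (P_kf * G_serving) / (noise + sum(p * g for p, g in interferers))
utility = B * math.log(1 + alpha * sinr, 2)   # log base 2 assumed here
print(round(sinr, 3), round(utility, 3))
```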

Page 34:

Deep Q-Learning – Neural Network: Case study – Radio resource allocation (cont.)

(Same power and sub-band allocation formulas as the previous page, with two annotations:)
► Σ_{𝑙≠𝑘} 𝑃_{𝑙,𝑓} 𝐺_{𝑢,𝑙,𝑓}: sub-band 𝑓 in all neighbor cells that can interfere with cell 𝑘
► 𝐵 log(1 + 𝛼 SINR_{𝑢,𝑘,𝑓}): utility (throughput) of sub-band 𝑓 in cell 𝑘

Page 35:

Deep Q-Learning – Neural Network: Case study – Radio resource allocation (cont.)

Network throughput (utility)

𝑈 = Σ_{𝑘∈{1,…,𝐾}} Σ_{𝑢∈𝒰_𝑘} Σ_{𝑓=1}^{𝐹} [𝕀(𝐴_{𝑘,𝑓} = 𝑢) 𝐵 log(1 + 𝛼 SINR_{𝑢,𝑘,𝑓})]

► 𝕀(𝐴_{𝑘,𝑓} = 𝑢) is the indicator function, equal to 1 if sub-band 𝑓 in cell 𝑘 is allocated to user 𝑢; more details in [1]

Problem: find 𝑃_{𝑘,𝑓} for all 𝑘 ∈ 𝐾 and 𝑓 ∈ 𝐹 to maximize 𝑈
► 𝐴_{𝑘,𝑓} can be determined from each given 𝑃_{𝑘,𝑓}

Page 36:

Deep Q-Learning – Neural Network: Case study – Radio resource allocation (cont.)

(Same utility 𝑈 as the previous page; the summand 𝕀(𝐴_{𝑘,𝑓} = 𝑢) 𝐵 log(1 + 𝛼 SINR_{𝑢,𝑘,𝑓}) is the utility (throughput) of sub-band 𝑓 in cell 𝑘.)

Problem: find 𝑃_{𝑘,𝑓} and 𝐴_{𝑘,𝑓} for all 𝑘 ∈ 𝐾 and 𝑓 ∈ 𝐹 to maximize 𝑈

Page 37:

Deep Q-Learning – Neural Network: Case study – Learning model [2]

Let 𝐶_{𝑢,𝑘} denote the channel quality indicator (CQI) vector of a user 𝑢 in cell 𝑘 over all frequency sub-bands
► E.g. for 𝐹 = 3, a particular CQI vector is (16, 9, 8)

Use the 𝐶_{𝑢,𝑘} of all users in the network as the state
► For 𝐾 cells, 𝐹 sub-bands, and 𝑈 users, the state size is 𝐾 × 𝐹 × 𝑈

An action corresponds to a power allocation for all frequency sub-bands
► For 𝐾 cells, 𝑃 power levels, and 𝐹 sub-bands, the action space size is 𝐾 × 𝑃^𝐹 (see the enumeration sketch below)

The immediate reward of an action is 1 if the current network throughput is greater than the previous one

An episode starts from an initial state and continues as long as the network throughput increases
► If the throughput achieved by executing the current action is less than or equal to the previous one, the state is terminal

[2] K. I. Ahmed, H. Tabassum and E. Hossain, “A Deep Q-Learning Method for Downlink Power Allocation in Multi-Cell Networks,” arXiv preprint arXiv:1904.13032 (2019)

Page 38:

Deep Q-Learning – Neural Network: Case study – Experience replay approach

Page 39:

Deep Q-Learning – Neural Network: Case study – Experience replay approach

First, initialize a replay memory 𝐷 with transitions (𝑠_𝑡, 𝑎_𝑡, 𝑟_𝑡, 𝑠_{𝑡+1}), i.e. experiences generated randomly using an 𝜀-greedy policy

Randomly select minibatches of transitions from 𝐷 to train the DNN

Q-values obtained by the trained DNN are used to obtain new experiences, and these experiences are stored in the memory pool 𝐷

Makes DNN training more efficient by using both old and new experiences

Allows the agent to remember and reuse experiences (i.e. transitions from the past)

By using experience replay, transitions are more independent (correlations among observations can be removed); a minimal sketch follows
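Below is a minimal Python sketch of such a replay memory; the capacity and minibatch size are illustrative assumptions.

```python
# Replay memory D holding (s_t, a_t, r_t, s_{t+1}, done) transitions and
# sampling uniform random minibatches for training.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory()
# After every interaction, call memory.push(s, a, r, s_next, done);
# once len(memory) >= 32, train the Q-network on memory.sample(32).
```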

Page 40:

Deep Q-Learning – Neural Network: Case study – Performance evaluation

The total network throughput of the DRL scheme is compared with that of different power allocation schemes, including the near-optimal Genetic Algorithm (GA) [3], WMMSE [4], random power allocation, and maximum power allocation

[3] S. Sivanandam and S. Deepa, “Genetic algorithm optimization problems,” in Introduction to Genetic Algorithms. Springer, 2008, pp. 165–209.

[4] Q. Shi et al., “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Trans. on Signal Processing, vol. 59, no. 9, pp. 4331–4340.

Page 41:

Outlook

Very promising for complex resource allocation problems

Advanced DQL models

► Double Deep Q-Learning (DDQL): to overcome overestimation of action values in a stochastic environment
► DQL with Prioritized Experience Replay: replay important transitions more frequently
► Dueling DQL: separately estimate the state and action value functions and then combine them to obtain the Q-value
► Asynchronous Multi-step DQL: use multiple agents to train the DNN in parallel
► Distributional DQL: update the Q function based on the distribution of the reward instead of its expectation (modified Bellman equation)
► DQL with Noisy Nets: biases and weights are perturbed during training by a parametric function of the noise
► Rainbow DQL: integrates the above solutions into a single learning agent
► Deep Deterministic Policy Gradient Q-Learning: for continuous actions / high-dimensional action spaces
► Deep Recurrent Q-Learning (DRQN): for partially observable environments
► Deep SARSA Learning: determine optimal policies in an online fashion

Page 42:

Appendix: Value iteration

Value iteration computes the optimal state value function by iteratively improving the estimate of V(s)
► The algorithm initializes V(s) to arbitrary random values
► It repeatedly updates the Q(s, a) and V(s) values until they converge

Value iteration is guaranteed to converge to the optimal values (a minimal sketch follows)
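A minimal sketch of value iteration on a tiny made-up MDP; all transition probabilities, rewards, and 𝛾 below are illustrative assumptions.

```python
# Value iteration on a 2-state, 2-action MDP. P[s][a] is a list of
# (probability, next_state, reward) triples; all numbers are made up.
gamma, theta = 0.9, 1e-6
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
V = {s: 0.0 for s in P}

while True:
    delta = 0.0
    for s in P:
        # Q(s, a) = sum_{s'} p * (r + gamma * V(s')) for each action a.
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}
        best = max(q.values())
        delta = max(delta, abs(best - V[s]))
        V[s] = best                      # V(s) <- max_a Q(s, a)
    if delta < theta:                    # stop once the values have converged
        break

policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)
```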

Page 43:

Appendix: Policy iteration

Policy iteration re-defines the policy at each step and computes the value function according to this new policy, until the policy converges

Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm (a minimal sketch follows)
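A minimal sketch of policy iteration on the same style of tiny made-up MDP as above; all numbers are illustrative assumptions.

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
gamma = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
policy = {s: 0 for s in P}               # start from an arbitrary policy
V = {s: 0.0 for s in P}

stable = False
while not stable:
    # Policy evaluation: sweep V under the current policy until it settles.
    for _ in range(1000):
        V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]]) for s in P}
    # Policy improvement: act greedily with respect to the evaluated V.
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        if best != policy[s]:
            policy[s], stable = best, False

print(policy, V)
```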