
Transcript of "Learning to Make Decisions Optimally for Self-Driving Networks" (APNet 2020)

Page 1

Learning to Make Decisions Optimally for Self-Driving Networks

Song Chong, Graduate School of AI & School of EE

KAIST

August 3, 2020

Page 2

AI Decision-Making Meets Network Autonomy

Page 3

Knowledge-Defined Networking [SIGCOMM’17]


[Figure: knowledge-defined networking loop: experience, simulation, and generative models produce knowledge, which in turn drives intelligence]

Page 4

Network Control

• Online sequential decision-making problems in dynamical systems
  • Large-scale systems => curse of complexity
  • System models are unknown and stochastic => curse of uncertainty
• It is mostly about resource management
  • Congestion control, wireless scheduling, bitrate adaptation in video streaming
  • Network function placement, resource management in datacenters
• Solved today mostly using meticulously designed heuristics
  • Painstakingly test and tune the heuristics for good performance
  • Repeated whenever the workload, environment, or metric of interest changes

Self-Driving Networks: Can networks learn to operate on their own decisions, with very little human intervention, i.e., directly from experience interacting with the environment?

Page 5

Reinforcement Learning: Complex Decision Making for Dynamical Systems under Uncertainty

Goal: Learn a policy $\pi$ to generate $a_0, a_1, \cdots$ maximizing the expected return

  $\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right], \quad \gamma \in [0,1)$

Page 6


Page 7

Previous Works on RL-based Network Control


• Resource management in datacenter [HotNets’16]

• Adaptive bitrate video streaming [SIGCOMM’17-1] [ICML WKSHPS’19]

• Scheduling for data processing clusters [SIGCOMM’19]

• Network resource scheduling [TNET’19]

• DVFS of CPU/GPU in mobile systems [SenSys’20]

• Resource management in edge computing [NetworkMag’19]

• Network function embedding [INFOCOM WKSHPS’19]

• Cognitive network management [CommMag’18]

Page 8

Learning & Approximating Value of Action

Page 9

Markov Decision Process

• Mathematical model for sequential decision-making problems
  • Transition probability $P_{ss'}^{a} = \Pr(s' \mid s, a)$
  • Policy function $a = \pi(s)$
  • Reward function $r = r(s, a, s')$ with discount factor $\gamma \in [0,1)$

Trajectory: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} s_4 \cdots$

Return: $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots$

Optimal policy $\pi^*$: $\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$
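To make these definitions concrete, here is a minimal Python sketch that rolls out a fixed policy on a small made-up MDP and Monte Carlo estimates the expected discounted return; the transition probabilities, rewards, and policy below are purely illustrative and not from the talk.

```python
import random

# Hypothetical 2-state MDP (illustrative numbers only):
# P[s][a] lists (s_next, probability); R[(s, a, s_next)] is the reward r(s, a, s').
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 0.8), (0, 0.2)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
R = {(0, 0, 0): 0.0, (0, 0, 1): 1.0, (0, 1, 1): 2.0, (0, 1, 0): 0.0,
     (1, 0, 0): 0.0, (1, 0, 1): 1.0, (1, 1, 1): 3.0}

gamma = 0.9
policy = {0: 1, 1: 1}                      # deterministic policy a = pi(s)

def rollout(s, horizon=100):
    """Simulate the MDP under `policy` and return the discounted return."""
    ret = 0.0
    for t in range(horizon):
        a = policy[s]
        next_states, probs = zip(*P[s][a])
        s_next = random.choices(next_states, weights=probs)[0]
        ret += (gamma ** t) * R[(s, a, s_next)]
        s = s_next
    return ret

# Monte Carlo estimate of E_pi[sum_t gamma^t r_t] starting from s = 0
print(sum(rollout(0) for _ in range(1000)) / 1000)
```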

Page 10

Action Value

• $Q$-function to measure the value-to-go of action $a$ for a given state $s$:

  $Q(s, a) \triangleq r(s, a) + \mathbb{E}_{\pi^*}\!\left[\sum_{t=1}^{\infty} \gamma^{t} r_{t}\right]$, where $r(s, a) \triangleq \sum_{s'} P_{ss'}^{a}\, r(s, a, s')$

  (the trajectory starts at $s_0 = s$ with $a_0 = a$ and thereafter follows $a_t = \pi^*(s_t)$)

• Optimal policy $\pi^*$ for state $s$ is then determined by

  $\pi^*(s) = \arg\max_{a \in A} Q(s, a)$

Page 11

Computing Action Value

• Bellman equation (necessary and sufficient condition for optimality)

  $Q(s, a) = \sum_{s'} P_{ss'}^{a}\left[r(s, a, s') + \gamma \max_{a' \in A} Q(s', a')\right], \quad \forall s \in S, \forall a \in A$

  written compactly as $Q = TQ$

• The fixed-point equation $Q = TQ$ always has a unique solution, and the iteration $Q_{k+1} \leftarrow TQ_k$ (the value iteration method, a.k.a. dynamic programming) converges geometrically to it for any $Q_0$ [Bertsekas'17]

• However, in real-world applications solving this equation is not that simple
  • Curse of uncertainty: $P_{ss'}^{a}$ and $r(s, a, s')$ may not be known
  • Curse of complexity: the cardinality of the sets $S$ and $A$ is prohibitively large
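Before the learning-based fixes on the next slides, here is a minimal value-iteration sketch in Python; it assumes $P$ and $r$ are fully known (exactly the assumption the curses above call into question), and the two-state MDP is made up for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (purely illustrative):
# P[s, a, s'] = Pr(s' | s, a), R[s, a, s'] = r(s, a, s')
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 3.0]]])
gamma = 0.9

Q = np.zeros((2, 2))                                     # arbitrary Q_0
for k in range(200):                                     # Q_{k+1} <- T Q_k
    Q_new = (P * (R + gamma * Q.max(axis=1))).sum(axis=2)
    if np.abs(Q_new - Q).max() < 1e-8:                   # geometric convergence
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)                               # greedy policy from Q
print(Q, pi_star)
```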

Page 12

Learning Action Value

• The value iteration method requires that $P_{ss'}^{a}$ and $r(s, a, s')$ are known:

  $Q_{k+1}(s, a) \leftarrow \sum_{s'} P_{ss'}^{a}\left[r(s, a, s') + \gamma \max_{a' \in A} Q_k(s', a')\right], \quad \forall s \in S, \forall a \in A$

• $Q$-learning: a stochastic value iteration method [Cambridge'89]
  • Monte Carlo averaging over samples:

    $Q_{k+1}(s, a) \leftarrow \frac{1}{k+1} \sum_{t=0}^{k}\left[r_t + \gamma \max_{a' \in A} Q_k(s_t', a')\right]$

  • Stochastic approximation: upon the $(k+1)$-th sample $(s_k, a_k, r_k, s_k') = (s, a, r, s')$,

    $Q_{k+1}(s, a) \leftarrow Q_k(s, a) + \alpha_k\big(r + \gamma \max_{a' \in A} Q_k(s', a') - Q_k(s, a)\big)$

    where the term in parentheses is the temporal-difference (TD) error

• Converges if total asynchronism holds and $\alpha_k$ decreases as $k \to \infty$ such that $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$, by asynchronous convergence theory [Tsitsiklis'94]
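A minimal Python sketch of the stochastic-approximation update above; the toy transitions and the $1/k$ step size (which satisfies the two conditions on $\alpha_k$) are illustrative assumptions, not part of the slide.

```python
import random
from collections import defaultdict

gamma = 0.9
Q = defaultdict(float)            # Q[(s, a)]; unseen pairs default to 0
visits = defaultdict(int)

def q_update(s, a, r, s_next, actions):
    """One Q-learning step on the sample (s, a, r, s')."""
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]  # satisfies sum alpha_k = inf, sum alpha_k^2 < inf
    td_error = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

# Toy usage on made-up transitions (purely illustrative):
actions = [0, 1]
s = 0
for _ in range(1000):
    a = random.choice(actions)                 # exploration is discussed on later slides
    s_next = random.choice([0, 1, 2])          # stand-in for the environment's response
    r = 1.0 if s_next == 2 else 0.0            # hypothetical reward
    q_update(s, a, r, s_next, actions)
    s = s_next
```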

Page 13

Q-Learning in Action

1. Learn the Q-table $Q$ on samples $(s, a, r, s')$:

   $Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a)\big)$

2. Choose the action from the Q-table $Q$:

   $\pi^*(s) = \arg\max_{a \in A} Q(s, a)$

• Exploitation-exploration tradeoffs
  • $\varepsilon$-greedy policy
  • UCB (Upper Confidence Bound) policy

[Figure: agent-environment loop: the agent sends action $a$ to the environment and observes state $s$ and reward $r$]

Example Q-table $Q$:

            a = 1   a = 2
    s = 0    1.1     1.9
    s = 1    3.0     4.1
    s = 2    5.1     6.0
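Using the example Q-table shown on this slide, step 2 (choosing the action) is just a per-state argmax over the table; a small Python sketch:

```python
# Q-table from the slide: rows are states s = 0, 1, 2; columns are actions a = 1, 2.
Q_table = {
    0: {1: 1.1, 2: 1.9},
    1: {1: 3.0, 2: 4.1},
    2: {1: 5.1, 2: 6.0},
}

def greedy_action(s):
    """pi*(s) = argmax_a Q(s, a), read directly from the learned table."""
    return max(Q_table[s], key=Q_table[s].get)

print([greedy_action(s) for s in Q_table])   # -> [2, 2, 2]
```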

Page 14

Action-Value Approximation

• Q-learning breaks the curse of uncertainty but does not break the curse of complexity
  • Prohibitively large number of state-action pairs to learn and store values
  • E.g., AlphaGo: $10^{170}$ states [Nature'16], Atari Breakout: $256^{28{,}224}$ states [Nature'15], resource management in a Google datacenter: $2^{100}$ state-action pairs [HotNets'16]

• Approximate $Q(s, a)$ by a function defined in a lower-dimensional feature space and parameterized by $\theta$:

  $Q(s, a) \approx \hat{Q}(\phi(s, a); \theta)$

[Figure: state-action pair $(s, a)$ → feature extraction mapping → feature vector $\phi(s, a)$ → neural network (e.g., a convolutional neural network) → Q approximation $\hat{Q}(\phi(s, a); \theta)$]
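To make the idea of parameterizing $Q$ in a feature space concrete, here is a deliberately simple linear sketch; the feature map `phi`, its dimension, and the state-action encoding are made up for illustration (DQN itself uses a neural network, as on the next slide).

```python
import numpy as np

def phi(s, a, dim=8):
    """Hypothetical feature extraction mapping for a state-action pair."""
    rng = np.random.default_rng(hash((s, a)) % (2**32))  # fixed pseudo-features per (s, a)
    return rng.standard_normal(dim)

theta = np.zeros(8)          # parameter vector, far smaller than |S| x |A| table entries

def q_hat(s, a):
    """Q(s, a) ~= Q_hat(phi(s, a); theta) = theta . phi(s, a)."""
    return theta @ phi(s, a)

print(q_hat(3, 1))
```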

Page 15

Deep Q-Network (DQN) [Nature'15]

• DQN = Q-learning + Deep Neural Network (DNN)
• Approximate the Q-function by a DNN with parameter $\theta$ in a feature space:

  $Q(s, a) \approx \hat{Q}(\phi(s), a; \theta), \quad \forall s \in S, \forall a \in A$

[Figure: Q-network with parameter $\theta$: state $s$ goes in, the action values $\hat{Q}(\phi(s), a_1; \theta), \hat{Q}(\phi(s), a_2; \theta), \dots, \hat{Q}(\phi(s), a_n; \theta)$ come out, and the action is $\arg\max_{a \in A} \hat{Q}(\phi(s), a; \theta)$]
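A minimal PyTorch sketch of such a Q-network; the use of a plain MLP, the layer sizes, and the feature and action dimensions are assumptions for illustration (the DQN paper uses a convolutional network over image inputs).

```python
import torch
import torch.nn as nn

n_features, n_actions = 16, 4            # hypothetical feature and action dimensions

# Q-network: maps the feature vector phi(s) to one action value per action.
q_net = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

phi_s = torch.randn(1, n_features)       # stand-in for phi(s)
with torch.no_grad():
    q_values = q_net(phi_s)              # [Q_hat(phi(s), a_1; theta), ..., Q_hat(phi(s), a_n; theta)]
action = q_values.argmax(dim=1).item()   # greedy action: argmax_a Q_hat(phi(s), a; theta)
print(q_values, action)
```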

Page 16

NN Approximation Breaks the Curse of Complexity

[Figure: a Q-table $Q(s, a)$ replaced by a Q-network $\hat{Q}(\phi(s, a); \theta)$]

Page 17

Projected Bellman Equation

• Bellman equation: $Q = TQ$, i.e.,

  $Q(s, a) = \sum_{s'} P_{ss'}^{a}\left[r(s, a, s') + \gamma \max_{a' \in A} Q(s', a')\right], \quad \forall s \in S, \forall a \in A$

• Projected Bellman equation: $\hat{Q} = \Pi T \hat{Q}$

[Figure: $T\hat{Q}$ is projected onto the parametric family $\{\hat{Q}(\phi(s), a; \theta) \mid \theta \in \Theta\}$, yielding $\Pi T\hat{Q}$]

  $\theta = \arg\min_{\theta \in \Theta} \big\| \hat{Q} - T\hat{Q} \big\|_{\xi}^{2} = \arg\min_{\theta \in \Theta} \mathbb{E}_{(s,a) \sim \xi}\Big[\Big(\hat{Q}(\phi(s), a; \theta) - \sum_{s'} P_{ss'}^{a}\big[r(s, a, s') + \gamma \max_{a' \in A} \hat{Q}(\phi(s'), a'; \theta)\big]\Big)^{2}\Big]$

  (the inner sum over $s'$ is $\mathbb{E}_{s'|s,a}[\cdot]$; $\xi$ is the PDF of $(s, a)$)

Page 18

Monte Carlo Averaging Breaks the Curse of Uncertainty

• Loss function to be minimized over $\theta$

  $L(\theta) = \mathbb{E}_{(s,a,s')}\Big[\big(\hat{Q}(\phi(s), a; \theta) - \big(r(s, a, s') + \gamma \max_{a' \in A} \hat{Q}(\phi(s'), a'; \theta)\big)\big)^{2}\Big]$

  (the first term is the prediction, the second the target)

• Monte Carlo averaging over a batch of samples $\{(s, a, r, s')\}$, with a fixed target parameter $\theta^{-}$; solve by the Stochastic Gradient Descent (SGD) method

  $\theta \leftarrow \arg\min_{\theta} \sum_{\{(s,a,r,s')\}} \Big(r + \gamma \max_{a' \in A} \hat{Q}(\phi(s'), a'; \theta^{-}) - \hat{Q}(\phi(s), a; \theta)\Big)^{2}$

• Experience replay (replay buffer)
  • Removes sample correlations, enhances exploration, and improves sample efficiency
  • Possible because Q-learning is an off-policy learning method

• Target Q-network
  • A moving target can cause instability during learning
  • Fix the target by evaluating the target action value via a previously learned Q-network
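Putting the pieces of this slide together, here is a sketch of one DQN-style training step in PyTorch; the replay buffer is filled with random placeholder samples, and the network shape, batch size, and hyperparameters are illustrative assumptions rather than the paper's exact setup.

```python
import random
import torch
import torch.nn as nn

n_features, n_actions, gamma = 16, 4, 0.99

q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())            # fixed target parameters theta^-
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Replay buffer of (phi(s), a, r, phi(s')) samples; filled with random data here.
replay = [(torch.randn(n_features), random.randrange(n_actions),
           random.random(), torch.randn(n_features)) for _ in range(1000)]

def train_step(batch_size=32):
    batch = random.sample(replay, batch_size)              # break sample correlations
    phi_s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    phi_s2 = torch.stack([b[3] for b in batch])

    q_pred = q_net(phi_s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_hat(phi(s), a; theta)
    with torch.no_grad():                                         # fixed target, no gradient
        q_target = r + gamma * target_net(phi_s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)               # squared TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                              # SGD-style update of theta

for k in range(100):
    train_step()
    if k % 20 == 0:
        target_net.load_state_dict(q_net.state_dict())            # periodically refresh theta^-
```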

Page 19

Deep Q-Network in Action

[Figure: DQN in action, with an online Q-network $\hat{Q}(\phi(s), a; \theta)$ and a target Q-network $\hat{Q}(\phi(s), a; \theta^{-})$]

Page 20

Exploitation-Exploration Tradeoffs

• $\varepsilon$-greedy policy

  $\pi^*(s) = \begin{cases} \arg\max_{a \in A} Q(s, a), & \text{with probability } 1 - \varepsilon \\ \text{a random action}, & \text{with probability } \varepsilon \end{cases}$

• UCB (Upper Confidence Bound) policy
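A small Python sketch of both exploration rules for a single state; the UCB form with an exploration constant c and per-action visit counts is the usual textbook variant, shown here only for illustration.

```python
import math
import random

Q = {0: 0.0, 1: 0.0, 2: 0.0}          # action values for one state
N = {0: 0, 1: 0, 2: 0}                # per-action visit counts

def epsilon_greedy(eps=0.1):
    """argmax_a Q(s, a) with probability 1 - eps, a random action with probability eps."""
    if random.random() < eps:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def ucb(c=2.0):
    """Pick the action with the largest optimistic estimate Q(a) + c * sqrt(ln t / N(a))."""
    t = sum(N.values()) + 1
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / (N[a] + 1e-9)))
```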

Page 21

RL Is Not a Panacea for All Our Problems

Page 22

Max-Weight Scheduling

• State-of-the-art scheduling algorithm
  • Throughput optimal
  • Myopic policy
  • Suffers poor delay performance
  • Minimizes the conditional queue drift $\mathbb{E}\!\left[\sum_{n \in N}\big(q_n(t+1)^2 - q_n(t)^2\big) \,\Big|\, q(t)\right]$

[Figure: per-user arrivals feed per-user queues; the scheduler serves them with per-user capacities drawn from an "unknown" capacity region]
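As a point of reference for the RL-based alternative discussed next, here is a minimal Python sketch of one common form of the max-weight rule for a pick-one-user scheduler (serve the user maximizing queue length times instantaneous rate); the queue lengths and rates are made up.

```python
import numpy as np

def max_weight_schedule(q, rates):
    """Max-weight rule: serve the user n maximizing q_n(t) * r_n(t),
    where r_n(t) is the instantaneous service rate offered to user n."""
    weights = np.asarray(q) * np.asarray(rates)
    return int(np.argmax(weights))

q = [5, 2, 9]             # per-user queue lengths (illustrative)
rates = [1.0, 3.0, 0.5]   # per-user instantaneous rates (illustrative)
print(max_weight_schedule(q, rates))   # weights = [5.0, 6.0, 4.5] -> user 1
```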

Page 23

$Q^+$-UCB: Beyond Max-Weight [TNET'19]

• RL-based algorithm considering return-to-go

• Guarantee throughput and delay optimality

• Guarantee max-weight algorithm performance during learning phase

• Sample-efficient exploration

[Figure: performance vs. training iteration, with curves for the RL-based algorithm, the max-weight algorithm, and the goal]

Page 24

Joint Throughput and Delay Optimality

Reward design:

  $r(s_t, a_t, s_{t+1}) = -\sum_{n \in N} q_n(t+1) \;-\; \nu \sum_{n \in N}\big(q_n(t+1)^2 - q_n(t)^2\big), \quad \nu > 0$

  The queue-drift term (as in max-weight) yields throughput optimality; the queue-backlog term yields delay optimality.

[Figure: evaluation result annotated with 40.8%]
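The reward above is easy to compute from two successive queue-length vectors; a small Python sketch (the queue values and the choice of ν are illustrative):

```python
def reward(q_t, q_t1, nu=0.1):
    """r(s_t, a_t, s_{t+1}) = -sum_n q_n(t+1) - nu * sum_n (q_n(t+1)^2 - q_n(t)^2), nu > 0."""
    backlog = sum(q_t1)                                        # queue-backlog (delay) term
    drift = sum(b * b - a * a for a, b in zip(q_t, q_t1))      # queue-drift (throughput) term
    return -backlog - nu * drift

print(reward([4, 2, 7], [3, 3, 6]))   # example with made-up queue lengths
```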

Page 25

Delay during Learning Phase

[Figure: $Q^+$-UCB vs. Max-Weight, delay over the initial 100 iterations]

Page 26

DVFS and Thermal Throttling for Mobile Devices

• Dynamic Voltage and Frequency Scaling (DVFS)
  • Dynamically adjust the voltage-frequency (VF) level of the processor to improve energy efficiency
• Thermal throttling
  • Lower the processor temperature by setting a very low VF level when the processor overheats
• OS-level rule-based control
• Limitations of existing techniques
  • Application-agnostic
  • No cooperation between CPU and GPU due to independent governors
  • Agnostic about CPU-GPU thermal coupling and thermal environments
  • Need predictive management

Page 27

zTT: Learning-based DVFS with Zero Thermal Throttling [SenSys’20]

• Application-aware DVFS
  • Real-time prediction of resource requirements (CPU, GPU) for mobile applications
  • Maximize user QoE
  • Minimize power consumption
• Prevent overheating
  • Predict the thermal headroom
  • Perform DVFS within the thermal headroom and avoid thermal throttling
  • Adapt to changes in the thermal environment
• Purpose of learning
  • Model learning: learn the transition probabilities of system states
  • Environment learning: predict temperature for a given CPU and GPU clock-frequency combination
  • Application learning: learn the CPU and GPU resource requirements of a given application
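For illustration only: the talk does not spell out zTT's exact state or reward definition, but a learning-based DVFS agent along these lines could encode its decision problem roughly as follows (all fields, thresholds, and weights below are hypothetical).

```python
from dataclasses import dataclass

@dataclass
class State:
    cpu_clock: int       # current CPU VF level index
    gpu_clock: int       # current GPU VF level index
    temperature: float   # processor temperature (deg C)
    fps: float           # observed application frame rate (QoE proxy)

def reward(state: State, power_w: float, target_fps: float = 60.0,
           temp_limit: float = 75.0, beta: float = 0.1, penalty: float = 10.0):
    """Hypothetical reward: favor QoE (hitting the target frame rate),
    subtract a power term, and heavily penalize exceeding the thermal limit."""
    qoe = min(state.fps / target_fps, 1.0)
    thermal = penalty if state.temperature > temp_limit else 0.0
    return qoe - beta * power_w - thermal

s = State(cpu_clock=3, gpu_clock=2, temperature=68.0, fps=57.0)
print(reward(s, power_w=2.4))
```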

Page 28

References

• [Bertsekas'17] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 1, 4th Ed., Athena Scientific, 2017
• [Bertsekas'12] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2: Approximate Dynamic Programming, 4th Ed., Athena Scientific, 2012
• [Cambridge'89] C. J. C. H. Watkins, "Learning from delayed rewards", Ph.D. thesis, Cambridge University, 1989
• [Tsitsiklis'94] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning", Machine Learning, 1994
• [Nature'15] V. Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015
• [HotNets'16] H. Mao et al., "Resource management with deep reinforcement learning", ACM HotNets, 2016
• [SIGCOMM'17] A. Mestres et al., "Knowledge-defined networking", ACM SIGCOMM, 2017
• [SIGCOMM'17-1] H. Mao et al., "Neural Adaptive Video Streaming with Pensieve", ACM SIGCOMM, 2017
• [TNET'19] J. Bae, J. Lee, and S. Chong, "Learning to schedule network resources throughput and delay optimally using $Q^+$-learning", submitted to IEEE/ACM Trans. on Networking, 2019
• [SenSys'20] S. Kim, K. Lee, and S. Chong, "zTT: Learning-based DVFS with Zero Thermal Throttling for Mobile MPSoCs", submitted to ACM SenSys, 2020

Page 29

• [ICML WKSHPS'19] H. Mao et al., "Real-world Video Adaptation with Reinforcement Learning", ICML Workshop, 2019

• [SIGCOMM’19] H. Mao et al., “Learning Scheduling Algorithms for Data Processing Clusters”, ACM SIGCOMM 2019

• [NetworkMag’19] D. Zeng et al., “Resource Management at the Network Edge: A Deep Reinforcement Learning Approach,” IEEE Network, vol. 33, no. 3, pp. 26–33, May 2019

• [INFOCOM WKSHPS’19] M. Dolati et al., “Deep-ViNE: Virtual Network Embedding with Deep Reinforcement Learning,” IEEE INFOCOM Workshop 2019

• [CommMag’18] S. Ayoubi et al., “Machine Learning for Cognitive Network Management,” IEEE Communications Magazine, vol. 56, no. 1, pp. 158–165, Jan. 2018
