Risk, Reward & Reinforcement
Transcript of Risk, Reward & Reinforcement
John Moody
Department of Computer Science
OGI School of Science & Engineering
Oregon Health & Science University

Machine Learning, Statistics & Discovery
AMS Workshop, Snowbird, Utah, June 25, 2003
Copyright 2003 – John Moody
Goals of This Talk
• Introduce Reinforcement Learning
• Present Direct Reinforcement
  – Contrast w/ Value Function RL Methods
  – Causal, Non-Markovian, Partially-Observed
• Describe Risk-Averse Reinforcement
• Demonstrate application to
  – A Competitive Game
  – Trading & Asset Allocation
Preview: S&P-500 / T-Bill Asset Allocation
[Figure: "RRL−Trader System vs Q−Trader System" — equity curves (log scale) for Buy and Hold, RRL−Trader, and Q−Trader, with the RRL−Trader and Q−Trader positions (−1 to 1), 1970–1990.]
What is Reinforcement Learning?
RL Considers:
• A Goal-Directed Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility

RL is a Dynamic Learning Paradigm:
• Trial & Error Discovery of Strategy
• Actions result in Reinforcement

Time Plays a Critical Role:
• Rewards depend on sequences of actions
• Rewards may be delayed or received over time
Reinforcement vs. Supervised Learning
Sound Bites:
• "Learning from Examples" (SL)
• "Learning by Trial and Error" (RL)

Distinctions:
• Static (SL) vs. Dynamic (RL)
• Feedback: "Instructive" (SL) vs. "Evaluative" (RL)
• SL usually ignores the larger problem (goals, utility)
• RL agents take action, and may influence the environment

Characteristics of RL Applications:
• Dynamical model of the world is not known
• Labeled examples are expensive or unavailable
• Temporal credit assignment problem
Origins of Reinforcement Learning
• Psychology and Animal Behavior
  – Thorndike's "Law of Effect", Animal Intelligence (1911)
  – Skinner's "Operant Conditioning", The Behavior of Organisms (1938)
  – "Trial and Error" Learning & "Reinforcement" Theories
• Computational Intelligence
  – Turing, "Computing Machinery and Intelligence" (1950)
  – Minsky, "Neural-Analog Reinforcement" (1954)
  – Farley & Clark's Policy Gradient Learner (1954)
  – Samuel's Checkers Program (1959)
• Operations Research & Control Engineering
  – Bellman, Dynamic Programming (1957)
Modern RL
• Value Function Methods
  – Sutton's "Temporal Difference" TD(λ) (1988)
  – Watkins' "Q-Learning" (1989)
  – Tesauro's "TD-Gammon" (1994)
• Actor-Critic Methods
  – Barto, Sutton & Anderson (1983)
  – Werbos' Taxonomy (1992)
  – Konda & Tsitsiklis, NIPS*1999 (2000)
• Direct Reinforcement: Policy Gradient & Policy Search
  – Williams' "REINFORCE" (1988, 1992)
  – Moody, et al: "RRL" and Finance (1996 – present)
  – Baxter & Bartlett: "Direct Gradient-Based RL" (1999)
  – Ng & Jordan: "Pegasus: Policy Search" (2000)
  – NIPS*2000 Workshop: "Learn the Policy or Learn the Value-Function?"

Is a paradigm shift occurring?
Dynamic Programming: Discrete Time Stochastic Control

Markov Decision Process (MDP): Agent operating with discrete states x, taking actions a, receiving rewards R.

MDP System Model:
Transition probability $P_{xy}(a)$ for actions $a : x_t \to y_{t+1}$
Distribution $P_R(x,a)$ of rewards $R(x,a)$

Value Function and Policy:
$$V^\pi(x) = E\left\{ \sum_{t=0}^{\infty} \gamma^t R[x_t, \pi(x_t)] \;\Big|\; x_0 = x \right\}, \qquad \text{"policy"}\; a_t = \pi(x_t),\; t = 0, \ldots, \infty,\; \text{or stochastic policy } P_\pi(x,a)$$

Goal: find the optimal $V$ and hence $\pi$.
Dynamic Programming, cont.

Bellman's Recursion Equation:
$$V^\pi(x) = \sum_a P_\pi(x,a) \sum_y P_{xy}(a) \left\{ E[R(x,a)] + \gamma V^\pi(y) \right\}$$

The optimal policy satisfies:
$$V^*(x) = \max_a \left\{ E[R(x,a)] + \gamma \sum_y P_{xy}(a) V^*(y) \right\}$$

The Optimal Policy is defined implicitly:
$$V^*(x) = \max_\pi V^\pi(x) \quad \text{and} \quad \pi^*(x) = \arg\max_\pi V^\pi(x)$$

Finding $\pi^*$ requires also determining $V^*$, knowing the System Model $P_{xy}(a)$, $P_R(x,a)$, and computing expectations!
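To make the Bellman backup concrete, here is a minimal value-iteration sketch (not from the talk) for a hypothetical 3-state, 2-action MDP with known $P_{xy}(a)$ and expected rewards; all numbers are illustrative.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions.
# P[a, x, y] = transition probability P_xy(a); ER[x, a] = E[R(x, a)].
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],   # action 1
])
ER = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])
gamma = 0.95

# Value iteration: repeatedly apply the Bellman optimality backup
# V*(x) = max_a { E[R(x,a)] + gamma * sum_y P_xy(a) V*(y) }.
V = np.zeros(3)
for _ in range(1000):
    Q = ER + gamma * np.einsum('axy,y->xa', P, V)   # Q[x, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # optimal policy, defined implicitly from V*
print("V* =", V, "pi* =", pi_star)
```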
Reinforcement Learning: Beyond Dynamic Programming
RL Algorithms offer approximate solutions to:
• Dynamic Programming Problems
• Stochastic Control Problems

RL Algorithms:
• Do not require a model of the system
  – Learn via simulation or live trial & error experience
• Avoid Bellman's "curse of dimensionality"
  – By using "function approximation" to smooth the state space
• Learn on-line:
  – Stochastic optimization
  – "Exploration" vs. "Exploitation"
Q-Learning: Adaptive DP

Q-Function: state ⊗ action → value
$$V^*(x) = \max_a Q^*(x,a), \qquad Q^*(x,a) = E[R(x,a)] + \gamma \sum_y P_{xy}(a) \max_b Q^*(y,b)$$

Q-Learning (Watkins 1989): estimates $\hat{Q}^*$ iteratively via simulation, without a system model $P_{xy}(a)$, $P_R(x,a)$:
$$\Delta \hat{Q}(x,a) = \eta \left[ R(x,a) + \gamma \max_b \hat{Q}(y,b) - \hat{Q}(x,a) \right]$$

The Optimal Policy is defined implicitly!
$$\hat{a}^*(x) = \arg\max_b \hat{Q}^*(x,b)$$

Problems: representations & robustness.
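As an illustration of Watkins' update, a minimal tabular Q-learning sketch follows; the toy environment simulator `step` and its reward structure are hypothetical stand-ins, since Q-learning needs only sampled transitions, not the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
eta, gamma, epsilon = 0.1, 0.95, 0.1

# step(x, a) -> (y, r) is a hypothetical simulator; Q-learning only
# needs sampled transitions, never P_xy(a) or P_R(x, a) explicitly.
def step(x, a):
    y = rng.integers(n_states)               # stand-in stochastic dynamics
    r = float(x == a) + rng.normal(0, 0.1)   # stand-in stochastic reward
    return y, r

x = 0
for t in range(10000):
    # "Exploration" vs. "Exploitation": epsilon-greedy action choice
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[x].argmax())
    y, r = step(x, a)
    # Watkins' update: Delta Q(x,a) = eta * [ r + gamma * max_b Q(y,b) - Q(x,a) ]
    Q[x, a] += eta * (r + gamma * Q[y].max() - Q[x, a])
    x = y

policy = Q.argmax(axis=1)   # a*(x) = argmax_b Q(x, b), defined implicitly
```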
Direct Reinforcement

Represent / learn the policy (observation → action) directly, without learning a value function!

Motivation:
• Simpler, more natural problem representations
• Only the local gradient of the value function matters
  – Local estimates of performance are often available
• Solve non-Markovian problems
• Find solutions to problems with only "partial observability"
• Seek a "good" policy, not an "optimal" policy

RRL (Recurrent Reinforcement Learning): a "policy gradient" algorithm
Learning via Direct Reinforcement

DR Agent:
• "Partially Observes": information $I_t$, not the full state $S_t$
• "Non-Markovian": Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Takes action $F_t$, receives reward $R_t(F_t, F_{t-1}; S_t)$
• Causal performance function $U_t(R_t, R_{t-1}, \ldots, R_1)$ (generally path-dependent)
• Learn the policy $F(\theta_t; F_{t-1}, I_t)$ by varying $\theta_t$

GOAL: Maximize performance $U_T$ or marginal performance $D_t \equiv \Delta U_t = U_t - U_{t-1}$
Recurrent Reinforcement Learning (RRL)

Deterministic gradient (batch):
$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\}$$

with recursion:
$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{dF_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta}$$

Stochastic gradient (on-line):
$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$

stochastic recursion:
$$\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{dF_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}}$$

Stochastic parameter update (on-line):
$$\Delta \theta_t = \rho \, \frac{dU_t(\theta_t)}{d\theta_t}$$

Constant $\rho$: adaptive learning. Declining $\rho$: stochastic approximation.
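A minimal sketch of the on-line RRL update for a one-layer tanh trader follows; the immediate reward $R_t = F_{t-1} r_t - \delta |F_t - F_{t-1}|$ with $U_t = R_t$, the choice of inputs, and all constants are illustrative assumptions, not the talk's exact trader.

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.normal(0.0002, 0.01, size=5000)   # hypothetical per-period returns
delta, rho = 0.002, 0.1                    # transaction cost, learning rate

theta = np.zeros(3)        # [bias, weight on r_{t-1}, recurrent weight on F_{t-1}]
F_prev, dF_prev = 0.0, np.zeros(3)

for t in range(1, len(r)):
    x = np.array([1.0, r[t - 1], F_prev])   # inputs I_t (assumed)
    F = np.tanh(theta @ x)                   # recurrent policy F_t

    # Immediate reward; with U_t = R_t we have dU_t/dR_t = 1.
    R = F_prev * r[t] - delta * abs(F - F_prev)

    # Stochastic recursion: dF_t/dtheta ~ (1 - F^2) * (x + u * dF_{t-1}/dtheta),
    # where u = theta[2] is the recurrent weight on F_{t-1}.
    dF = (1.0 - F**2) * (x + theta[2] * dF_prev)

    # dR_t/dF_t and dR_t/dF_{t-1} for the simple-return reward above
    dR_dF = -delta * np.sign(F - F_prev)
    dR_dFprev = r[t] + delta * np.sign(F - F_prev)

    # On-line update: Delta theta_t = rho * dU_t/dtheta_t
    theta += rho * (dR_dF * dF + dR_dFprev * dF_prev)

    F_prev, dF_prev = F, dF
```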
RL Algorithms Compared

Q-Learning:
• Learn Q-Function
• Value = Q(Action)
• Q: state ⊗ action → value
• Action: $F = \arg\max_b N(x, b, \theta)$

Properties:
• Bellman's Equation: A-Causal
• MDP Assumption
• Complex representations
• Curse of Dimensionality
• Computations expensive
• Policies often unstable

Direct Reinforcement:
• Learn Policy F
• Use local performance estimates
• F: observations → action
• Action: $F = N(x, \theta)$

Properties:
• Causal: Forward in Time
• Recurrent, Partially Observable
• Enables simpler representations
• Reduces the Curse of Dimensionality
• More efficient in practice
• Yields more robust policies
The Oracle Problem
(How to Turn an Easy Problem into a Harder Problem)
Problem Description
• Binary actions: $A_t \in \{-1, +1\}$
• Rewards for {incorrect, correct}: $R_t \in \{-1, +1\}$
• Vector of inputs: $X_t$
  – Oracle input: $X^1_t$ (tells the agent the correct action $A_t$)
  – Noise inputs: $X^2, X^3, \ldots, X^N$ (boolean random variables)

Complexity of DR Agent
• Perceptron Policy Learner: $A_t = \mathrm{sign}(W \cdot X_t)$
• Optimal policy w/ a single threshold: $A_t = \mathrm{sign}(X^1_t)$

Complexity of Q Agent
• The optimal $Q(A_t, X^1_t)$ is an XOR-type function of its arguments (sign pattern $+, -, -, +$).
• Its representation requires at least two thresholds.
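A toy sketch of the Oracle setup with a reward-modulated perceptron update; the update rule is an assumption about the learner's details, not necessarily the one used in the talk. Note that $R_t A_t$ equals the correct action $\mathrm{sign}(X^1_t)$ whether the trial was right or wrong, so the reward-driven update recovers supervised perceptron learning — one reason the DR representation is easy here.

```python
import numpy as np

rng = np.random.default_rng(2)
N, eta = 5, 0.1
W = rng.normal(0, 0.1, size=N)

for trial in range(1000):
    X = rng.choice([-1.0, 1.0], size=N)   # X[0] is the oracle input
    A = np.sign(W @ X)                     # perceptron policy A_t = sign(W . X_t)
    R = 1.0 if A == np.sign(X[0]) else -1.0   # reward for {correct, incorrect}
    # Reward-modulated update: R*A equals the correct action sign(X[0])
    # in both cases, so this is equivalent to perceptron learning.
    W += eta * R * A * X

# The learned policy needs only a single threshold on the oracle input:
print(W)   # the weight on X[0] should dominate the noise inputs
```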
The Oracle Simulation
• Measure how many trials are required to learn the representation
• Convergence criteria:
  – RRL: Correct policy for all possible input vectors
  – Q-Learner: MSE < 0.01 for all possible input vectors
  – Maximum of 30,000 trials per run
• Repeated runs for multiple learning rates
  – Choose the learning rate with quickest convergence on average
• 50 random initializations; N = 1, 2, 3, 4, 5, 10 inputs

Results:
                          RRL    Q-Learner
  – Min # trials:           1         1150
  – Max # trials:          39        29350
  – # Runs Non-Converged:   0           15
The Oracle: Simulation Results

[Figure: "Oracle Simulation: RRL vs Q−Learner" — box plots of log(# Trials), on a 10^0 to 10^4 scale, for RRL and Q-Learner at N = 3, 5, 10 inputs (RRL−3, Q−3, RRL−5, Q−5, RRL−10, Q−10).]
RoShamBo (RPS)
Rules: Rock beats Scissors. Paper beats Rock. Scissors beats Paper.

"The rules are simple, but the game itself is as complex as the mind of your opponent." (www.worldrps.com)

Character of human throws:
• Rock: commonly perceived as the most aggressive throw
• Paper: considered the most subtle throw
• Scissors: often perceived as clever or crafty

Competitions: World Championship, Computer Olympiad
RPS Player Representation

Softmax representation for probability(action):
$$F(a) = \frac{\exp[f(a)]}{\sum_{b=1}^{m} \exp[f(b)]} \qquad \text{for } a = 1, \ldots, m$$

Inputs & Weights:
• Opponent's two previous moves (weights $V$, $A$)
• Player's two previous moves (recurrent; weights $W$, $B$)

Learning Algorithm: Stochastic Direct Reinforcement (SDR), a generalization of RRL.
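A minimal sketch of such a softmax player; the linear score layout over one-hot encodings of the four previous moves is an assumption (the slide names weight groups V, A, W, B but not their exact wiring).

```python
import numpy as np

rng = np.random.default_rng(3)
ACTIONS = ['R', 'P', 'S']           # m = 3 throws

def softmax(f):
    # F(a) = exp[f(a)] / sum_b exp[f(b)], numerically stabilized
    e = np.exp(f - f.max())
    return e / e.sum()

def one_hot(move):
    v = np.zeros(3)
    v[ACTIONS.index(move)] = 1.0
    return v

# Hypothetical linear scores: weights on the opponent's two previous
# moves (V, A) and, recurrently, on the player's own two previous moves (W, B).
V = rng.normal(0, 0.1, size=(3, 3)); A = rng.normal(0, 0.1, size=(3, 3))
W = rng.normal(0, 0.1, size=(3, 3)); B = rng.normal(0, 0.1, size=(3, 3))

def policy(opp_prev1, opp_prev2, own_prev1, own_prev2):
    f = (V @ one_hot(opp_prev1) + A @ one_hot(opp_prev2)
         + W @ one_hot(own_prev1) + B @ one_hot(own_prev2))
    p = softmax(f)
    return rng.choice(ACTIONS, p=p), p   # stochastic policy: sample a throw

move, probs = policy('R', 'P', 'S', 'R')
```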
Simulated Competition

'Human Champions' Dataset: The "Great Eight" Gambits are the eight most widely used by recent champions.

1. Avalanche (RRR)  2. Bureaucrat (PPP)  3. Crescendo (PSR)  4. Dénouement (RSP)
5. Fistfull o' Dollars (RPP)  6. Paper Dolls (PSS)  7. Scissor Sandwich (PSP)  8. Toolbox (SSS)

"Great Eight Plus Four" yields equal probability of R, P, S.

Training a 'Dynamic Opponent' using SDR: learn by playing against "Great Eight Plus Four".

Learning to beat the opponent using Recurrent SDR: learn by playing against the Dynamic Opponent. The Opponent tries to predict the Player's move, and vice versa.
Three Model Players: Learning Curves

[Figure: "The fraction of Wins, Draws, and Losses with SDR" vs. number of games (up to 2×10^4), for three players:
– No recurrence: Wins 0.357, Draws 0.383, Losses 0.260
– Only recurrence: Wins 0.422, Draws 0.305, Losses 0.273
– All weights: Wins 0.463, Draws 0.298, Losses 0.239]
[Wins – Losses] for Three Model Players

[Figure: box plots of (Wins − Losses), roughly 0.10–0.24, for No Recurrence, Only Recurrence, and All Weights. Summary stats over 10 simulations.]
Three Players: {Losses, Draws, Wins}

[Figure: "The fraction of Wins, Draws, and Losses with SDR" — box plots of the loss, draw, and win fractions (roughly 0.2–0.5) for No recurrence, Only recurrence, and All weights. Summary stats over 10 simulations.]
Risk Averse Reinforcement

Sources of Uncertainty in RL:
• Stochastic Rewards
• Stochastic Environment
• Stochastic Policies

The standard RL framework takes expectations of the above!
• "Optimal Policies" are only "optimal" in expectation

Partial Observability: another source of uncertainty.

Risk: distributions of trajectories, rewards, outcomes
• Probability of Poor Performance
• Risk of Ruin
Asset Management as a Microcosm for Reinforcement Learning Research

• Simulations
  – Bang-Bang Control Problem: Simple Buy / Sell Decisions ("Long" / "Short" Positions)
  – Transaction Costs → Recurrent, Non-Markovian Representation
• Uncertain Environment:
  – Details of markets / economy are Unobservable
  – Prices / fundamentals / economic data are Very Noisy
  – Markets react quickly to Unpredictable News / Events
• Challenging Problem:
  – Efficient Markets Theory: You Can't Beat the Market.
  – Competitive Game: Trading opportunities will be discovered, exploited and eliminated by others.
  – Prediction Accuracy Limited: "1/2 + ε"
Trading based on Forecasts

[Diagram: Input Series → Forecasting System ($\theta$) → Forecasts → (Bottleneck) → Trading Rules ($\theta'$) → Trades / Portfolio Weights → Profits / Losses $U(\theta, \theta')$. The Target Series drives Supervised Learning of Error($\theta$); Transaction Costs enter at the trading stage.]

Four limitations:
• Two sets of parameters
• Forecast error is not Utility
• The forecaster ignores transaction costs
• Information bottleneck
Learning to Trade via Direct Reinforcement

[Diagram: Input Series and Transaction Costs → Trading System ($\theta$) → Trades / Portfolio Weights → Profits / Losses $U(\theta)$; $U(\theta)$ is fed back (with a delay) for Reinforcement Learning. No forecast bottleneck intervenes between inputs and actions.]

Four advantages:
• One set of parameters
• A single utility function
• U includes transaction costs
• Direct mapping from inputs to actions
Structure of Traders

• Single Asset
  – Price series $z_t$
  – Return series: simple returns $r_t = z_t - z_{t-1}$, or rates of return $r_t = \frac{z_t}{z_{t-1}} - 1$
• Traders
  – Discrete position size $F_t \in \{-1, 0, 1\}$
  – Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Information Set: $I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots;\; y_t, y_{t-1}, y_{t-2}, \ldots \}$
  – The full system state is not known
Returns, Profit & Wealth for Traders

• Simple Trading Returns and Profit:
$$R_t = \mu \left( F_{t-1} r_t - \delta \, |F_t - F_{t-1}| \right), \qquad P_T = \sum_{t=1}^{T} R_t$$

• Compounded Trading Returns and Wealth:
$$R_t = \left( 1 + F_{t-1} r_t \right) \left( 1 - \delta \, |F_t - F_{t-1}| \right) - 1, \qquad W_T = W_0 \prod_{t=1}^{T} (1 + R_t)$$

• Transaction Costs: represented by $\delta$.
• Risk Free Rate: suppressed for simplicity ($r^f_t = 0$).
• Note: Market impact: $\delta$ = function(trade size).
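These return definitions translate directly into code; a short sketch follows, with hypothetical values for $\mu$, $\delta$, and the toy position and return series.

```python
import numpy as np

def simple_profit(F, r, mu=1.0, delta=0.002):
    """P_T = sum_t R_t with R_t = mu * (F_{t-1} r_t - delta * |F_t - F_{t-1}|)."""
    F = np.asarray(F, dtype=float); r = np.asarray(r, dtype=float)
    R = mu * (F[:-1] * r[1:] - delta * np.abs(np.diff(F)))
    return R, R.sum()

def compound_wealth(F, r, delta=0.002, W0=1.0):
    """W_T = W_0 prod_t (1 + R_t), R_t = (1 + F_{t-1} r_t)(1 - delta |F_t - F_{t-1}|) - 1."""
    F = np.asarray(F, dtype=float); r = np.asarray(r, dtype=float)
    R = (1 + F[:-1] * r[1:]) * (1 - delta * np.abs(np.diff(F))) - 1
    return R, W0 * np.prod(1 + R)

# Usage with a toy long/short position series and rates of return:
F = [0, 1, 1, -1, 0]                    # F_t in {-1, 0, 1}
r = [0.0, 0.01, -0.005, 0.02, 0.01]
R, P_T = simple_profit(F, r)
```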
Balancing Reward with Risk: Financial Performance Measures

Performance Functions:
• Path independent: $U_t = U(W_t)$ (Standard Utility Functions)
• Path dependent: $U_t = U(W_t, W_{t-1}, \ldots, W_1, W_0)$
• In general: $U_t = U(R_t, R_{t-1}, \ldots; W_0)$

Performance Ratios:
• Sharpe Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$
• Downside Deviation Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$
• Sterling Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Draw\text{-}Down}(R_t)}$

For Learning:
• Per-Period Returns: $R_t$
• Marginal Performance: $D_t \equiv \Delta U_t = U_t - U_{t-1}$
Maximizing the Sharpe Ratio

Sharpe Ratio:
$$SR_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$$

Exponential Moving Average Sharpe Ratio:
$$S_t(\eta) = \frac{A_t}{K_\eta \left( B_t - A_t^2 \right)^{1/2}}$$

with time scale $\eta^{-1}$ and
$$A_t = A_{t-1} + \eta \left( R_t - A_{t-1} \right), \qquad B_t = B_{t-1} + \eta \left( R_t^2 - B_{t-1} \right), \qquad K_\eta = \left( \frac{1 - \eta/2}{1 - \eta} \right)^{1/2}$$

Motivation: the EMA Sharpe ratio
• emphasizes recent patterns;
• is causal & can be updated incrementally.
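The incremental update is simple to state in code; a minimal sketch (the toy return stream is illustrative):

```python
import numpy as np

class EMASharpe:
    """Exponential moving average Sharpe ratio, updated incrementally.

    A_t = A_{t-1} + eta * (R_t   - A_{t-1})
    B_t = B_{t-1} + eta * (R_t^2 - B_{t-1})
    S_t = A_t / (K_eta * sqrt(B_t - A_t^2)),  K_eta = sqrt((1 - eta/2) / (1 - eta))
    """
    def __init__(self, eta=0.01):
        self.eta = eta
        self.K = np.sqrt((1 - eta / 2) / (1 - eta))
        self.A = 0.0
        self.B = 0.0

    def update(self, R):
        self.A += self.eta * (R - self.A)
        self.B += self.eta * (R * R - self.B)
        var = self.B - self.A ** 2
        return self.A / (self.K * np.sqrt(var)) if var > 0 else 0.0

# Usage: feed per-period trading returns R_t one at a time.
sr = EMASharpe(eta=0.01)
for R in np.random.default_rng(4).normal(0.001, 0.01, size=1000):
    S_t = sr.update(R)
```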
Differential Sharpe Ratio for Adaptive Optimization

Expand $S_t(\eta)$ to first order in $\eta$:
$$S_t(\eta) \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2)$$

Define the Differential Sharpe Ratio as:
$$D_t \equiv \frac{dS_t}{d\eta} = \frac{B_{t-1} \, \Delta A_t - \tfrac{1}{2} A_{t-1} \, \Delta B_t}{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}}$$

where $\Delta A_t = R_t - A_{t-1}$ and $\Delta B_t = R_t^2 - B_{t-1}$.

Motivation for DSR:
• isolates the contribution of $R_t$ to $S_t$ ("marginal utility" $U_t$);
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
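A sketch of the DSR computation, carrying the moving averages $A_t$, $B_t$ forward; the warm-start values below are hypothetical, used only to keep the variance estimate positive at the start.

```python
import numpy as np

def dsr_update(R, A_prev, B_prev, eta=0.01):
    """One step of the Differential Sharpe Ratio.

    D_t = (B_{t-1} * dA - 0.5 * A_{t-1} * dB) / (B_{t-1} - A_{t-1}^2)^(3/2),
    with dA = R_t - A_{t-1} and dB = R_t^2 - B_{t-1}.
    Returns (D_t, A_t, B_t) so the moving averages carry forward.
    """
    dA = R - A_prev
    dB = R * R - B_prev
    denom = (B_prev - A_prev ** 2) ** 1.5
    D = (B_prev * dA - 0.5 * A_prev * dB) / denom if denom > 0 else 0.0
    return D, A_prev + eta * dA, B_prev + eta * dB

# Usage: D_t serves as the instantaneous utility for the on-line learner.
A, B = 0.0, 0.01   # hypothetical warm-start values
for R in np.random.default_rng(5).normal(0.001, 0.01, size=1000):
    D, A, B = dsr_update(R, A, B)
```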
Long / Short Trader Simulation

• Learns from scratch and on-line
• Moving average Sharpe Ratio with η = 0.01

[Figure: four panels vs. time (0–10,000): Price; trading Signal (−1 to 1); cumulative Profit (%, 0–200); moving-average Sharpe Ratio (−0.2 to 0.1).]
Trader Simulation (summary stats)

Effects of transaction costs on performance: 100 runs; costs = 0.2, 0.5, and 1%.

[Figure: box plots vs. transaction cost (0.2%, 0.5%, 1%) for Trading Frequency (%), Cumulative Sum of Profits (%), and Sharpe Ratio.]
Asset Allocation Example: S&P-500 Index and 3-Month T-Bill

[Figure: top — "S&P 500 Index With Divs. Reinvested" (log scale); bottom — Annualized Yields (%), 1970–1990, for the Treasury Bill Yield and the S&P 500 Dividend Yield.]
Maximizing the Differential Sharpe Ratio: S&P-500 / T-Bill Asset Allocation

[Figure: "RRL−Trader System vs Q−Trader System" — equity curves (log scale) for Buy and Hold, RRL−Trader, and Q−Trader, with the RRL−Trader and Q−Trader positions (−1 to 1), 1970–1990.]
Gaining Economic Insights by Opening Up the "Black Box"

Which of the 85 economic / financial input series for the S&P-500 / T-Bill trader are most important?

Relative sensitivity of input $i$:
$$S_i = \frac{\left| \dfrac{dF}{dx_i} \right|}{\max_j \left| \dfrac{dF}{dx_j} \right|}$$

Each year, average the sensitivity for each input.

Note: Sensitivity Analysis is straightforward for Direct RL, but not for Q-Learning.
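A finite-difference sketch of this sensitivity measure; the differentiable stand-in trader $F(x) = \tanh(w \cdot x)$ and its toy weights are illustrative, not the actual RRL trader.

```python
import numpy as np

def relative_sensitivities(F, x, h=1e-5):
    """S_i = |dF/dx_i| / max_j |dF/dx_j|, via central finite differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (F(x + e) - F(x - e)) / (2 * h)
    return np.abs(grad) / np.abs(grad).max()

# Usage with a stand-in differentiable trader output F(x) = tanh(w . x):
w = np.array([0.8, -0.1, 0.3])
S = relative_sensitivities(lambda x: np.tanh(w @ x), x=np.array([0.2, 0.5, -0.4]))
```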
S&P-500: Three Most Important Variables

85 series: learned relationships are nonstationary over time.

[Figure: "Sensitivity Analysis: Average on RRL−Trader Committee" — normalized absolute sensitivity (0–1) vs. date (1970–1995) for the Yield Curve Slope, the 6-Month Diff. in AAA Bond Yield, and the 6-Month Diff. in T-Bill Yield.]
Minimizing Downside Risk

Downside Deviation (with target return $\theta$):
$$DD_T = \left( \frac{1}{T} \sum_{t=1}^{T} \min\{R_t - \theta,\, 0\}^2 \right)^{1/2}$$

$N^{th}$-Degree Downside Deviation, via the Lower Partial Moment:
$$LPM_T(n) = \frac{1}{T} \sum_{t=1}^{T} \max\{\theta - R_t,\, 0\}^n, \qquad DD_T(n) = \left[ LPM_T(n) \right]^{1/n}$$

Downside Deviation Ratio:
$$DDR_T = \frac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$$
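These definitions in code (a sketch; the target return $\theta$ defaults to 0, and the toy negatively skewed return series is illustrative):

```python
import numpy as np

def downside_deviation(R, theta=0.0, n=2):
    """n-th degree downside deviation DD_T(n) = LPM_T(n)^(1/n),
    with LPM_T(n) = (1/T) * sum_t max(theta - R_t, 0)^n."""
    R = np.asarray(R, dtype=float)
    lpm = np.mean(np.maximum(theta - R, 0.0) ** n)
    return lpm ** (1.0 / n)

def downside_deviation_ratio(R, theta=0.0):
    """DDR_T = Average(R_t) / DownsideDeviation(R_t)."""
    dd = downside_deviation(R, theta)
    return np.mean(R) / dd if dd > 0 else np.inf

# Usage on a toy, negatively skewed return series:
rng = np.random.default_rng(6)
R = rng.normal(0.001, 0.01, size=1000) - 0.02 * (rng.random(1000) < 0.02)
ddr = downside_deviation_ratio(R)
```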
Artificial Price Series

[Figure: artificial price series, price ranging roughly 0.6–1.6 over 10,000 time steps.]
Performance Results (cont'd)

[Figure: top — Draw-Down vs. time (0–250) for the DDR Trader and SR Trader; bottom — Moving Average Deviations (Downside and Standard) vs. time.]
Draw-Down Comparison

[Figure: "Log Histograms of Maximum DrawDowns" — log10(Count) vs. Maximum DrawDown (−0.12 to 0) for the DDR and SR traders.]
Position Comparison

[Figure: Percent of Time in Position (Short, Neutral, Long) for the SR and DDR traders, trading negatively skewed returns.]
British Pound: Return_A = 15%, SR_A = 2.3, DDR_A = 3.3

[Figure: four panels vs. time (0–8000): Price (~1.50–1.56); trading Signal (−1 to 1); Equity (1.00–1.08); moving-average Sharpe Ratio (−0.2 to 0.2).]
Comments on the British Pound Results
The Simulations are Suggestive, not Conclusive:
• BP performance better than Deutschmark or Yen
• Price data are Reuters "indicative" quotes, not transactions
• Market microstructure effects could influence profitability.
• We spent little time designing the trader.
• From a real-world standpoint, the work is preliminary.

Efficacy of RRL for FX confirmed by Carl Gold (Caltech):
• More extensive simulations w/ Olsen 30-minute FX quotes
• IEEE CIFEr 2003 Proceedings

Looking Ahead:
• Further analysis of microstructure / transaction costs is needed. FX broker transaction prices would help.
• A true test requires live trading.
Closing Remarks

• Direct Reinforcement — Advantages over Value Function RL:
  – Natural representations
  – Efficiency & robustness
  – Causal algorithms and nonstationary problems
• Recurrence
  – Naturally occurs in real-world problems
  – Must abandon the MDP framework
• Risk Averse Reinforcement
  – Conventional RL considers only expected rewards
  – Risk: the distribution of rewards matters
  – Robust policies for the real world must be low risk!
• Interesting research opportunities for Statisticians and Computer Scientists

URL: www.cse.ogi.edu/~moody
Some References

URL: www.cse.ogi.edu/~moody

Papers:

John Moody and Matthew Saffell, 'Learning to Trade via Direct Reinforcement', Special Issue on Financial Engineering, IEEE Transactions on Neural Networks, 12(4):875-889, July 2001.

John Moody and Matthew Saffell, 'Minimizing Downside Risk via Stochastic Dynamic Programming', in "Computational Finance 1999", Y. S. Abu-Mostafa, B. LeBaron, A. W. Lo, and A. S. Weigend, editors, MIT Press, Cambridge, MA, pp. 403-416, 2000.

John Moody and Matthew Saffell, 'Reinforcement Learning for Trading', in Advances in Neural Information Processing Systems, S. Solla, M. Kearns and D. Cohn, eds., v. 11, pp. 917-923, MIT Press, 1999.

Moody, J., Wu, L., Liao, Y. & Saffell, M., 'Performance Functions and Reinforcement Learning for Trading Systems and Portfolios', Journal of Forecasting, v. 17, pp. 441-470, 1998.