
Page 1

Imitation Learning from Imperfect Demonstration

Yueh-Hua Wu1,2, Nontawat Charoenphakdee3,2, Han Bao3,2, Voot Tangkaratt2, Masashi Sugiyama2,3

1National Taiwan University

2RIKEN Center for Advanced Intelligence Project

3The University of Tokyo

Poster #47


Page 2

Introduction

Imitation learning

learning from demonstration instead of a reward function

Demonstration

a set of decisions (state-action pairs x)

Collected demonstrations may be imperfect:
- Driving: traffic violations
- Playing basketball: technical fouls


Page 3

Motivation

Confidence: how optimal a state-action pair x is (a value between 0 and 1)

A semi-supervised setting: the demonstration is only partially equipped with confidence

How?

- Crowdsourcing: N(1)/(N(1) + N(0)), i.e., the fraction of labelers who judge x optimal (e.g., 7 of 10 votes gives r = 0.7)
- Digitized score: 0.0, 0.1, 0.2, ..., 1.0


Page 4

Generative Adversarial Imitation Learning [1]

One-to-one correspondence between a policy π and its distribution of demonstrations [2]

Utilize generative adversarial training

$$\min_\theta \max_w \; \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x \sim p_{\mathrm{opt}}}[\log(1 - D_w(x))]$$

$D_w$: discriminator, $p_{\mathrm{opt}}$: demonstration distribution of $\pi_{\mathrm{opt}}$, and $p_\theta$: trajectory distribution of the agent $\pi_\theta$
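As a concrete illustration, here is a minimal PyTorch sketch of one discriminator update under this objective; the network architecture, input dimension, and batch tensors are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Illustrative discriminator D_w over state-action pairs x; the input
# dimension 8 and hidden width 64 are arbitrary assumptions.
disc = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
opt_w = torch.optim.Adam(disc.parameters(), lr=3e-4)

def discriminator_step(agent_batch, expert_batch):
    """One ascent step on E_{p_theta}[log D_w(x)] + E_{p_opt}[log(1 - D_w(x))]."""
    d_agent = torch.sigmoid(disc(agent_batch))    # D_w(x) for x ~ p_theta
    d_expert = torch.sigmoid(disc(expert_batch))  # D_w(x) for x ~ p_opt
    value = torch.log(d_agent).mean() + torch.log(1 - d_expert).mean()
    opt_w.zero_grad()
    (-value).backward()  # maximize the value by descending its negation
    opt_w.step()
```

The outer minimization over θ is then carried out with a policy-gradient method (TRPO in the GAIL paper), treating $-\log D_w(x)$ as the agent's reward.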


Page 5

Problem Setting

Humans switch to non-optimal policies when they make mistakes or are distracted

$$p(x) = \alpha \underbrace{p(x \mid y = +1)}_{p_{\mathrm{opt}}(x)} + (1 - \alpha) \underbrace{p(x \mid y = -1)}_{p_{\mathrm{non}}(x)}$$

Confidence: $r(x) \triangleq \Pr(y = +1 \mid x)$

Unlabeled demonstration: $\{x_i\}_{i=1}^{n_u} \sim p$

Demonstration with confidence: $\{(x_j, r_j)\}_{j=1}^{n_c} \sim q$
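For intuition, a toy NumPy sketch of this generative story; the one-dimensional Gaussians standing in for $p_{\mathrm{opt}}$ and $p_{\mathrm{non}}$, and all constants, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_u, n_c = 0.7, 1000, 200  # mixture weight and sample sizes (made up)

def gauss_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def sample_mixture(n):
    y = np.where(rng.random(n) < alpha, 1.0, -1.0)  # latent optimality label
    return rng.normal(loc=y, scale=1.0), y          # p_opt = N(+1,1), p_non = N(-1,1)

x_u, _ = sample_mixture(n_u)  # unlabeled demonstrations {x_i} ~ p
x_c, _ = sample_mixture(n_c)  # demonstrations that also receive confidence
# exact confidence r(x) = Pr(y = +1 | x), by Bayes' rule in this toy model
r_c = alpha * gauss_pdf(x_c, 1.0) / (
    alpha * gauss_pdf(x_c, 1.0) + (1 - alpha) * gauss_pdf(x_c, -1.0))
```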


Page 6

Proposed Method 1: Two-Step Importance Weighting Imitation Learning

Step 1: estimate confidence by learning a confidence scoring function g

Unbiased risk estimator (come to Poster #47 for details):

$$R_{\mathrm{SC},\ell}(g) = \underbrace{\mathbb{E}_{x,r \sim q}\big[r\,\ell(g(x))\big]}_{\text{risk for optimal}} + \underbrace{\mathbb{E}_{x,r \sim q}\big[(1 - r)\,\ell(-g(x))\big]}_{\text{risk for non-optimal}}$$
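An empirical version of this estimator is a weighted average over the confidence-labeled sample; a minimal PyTorch sketch with the logistic loss $\ell(z) = \log(1 + e^{-z})$, where the linear scoring model `g` is an illustrative stand-in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

g = nn.Linear(1, 1)  # illustrative confidence scoring function g

def semi_conf_risk(x_c, r_c):
    """Empirical R_SC(g) with the logistic loss l(z) = log(1 + exp(-z))."""
    z = g(x_c).squeeze(-1)
    risk_opt = (r_c * F.softplus(-z)).mean()       # r * l(g(x))
    risk_non = ((1 - r_c) * F.softplus(z)).mean()  # (1 - r) * l(-g(x))
    return risk_opt + risk_non
```

With the logistic loss, the pointwise minimizer satisfies $\sigma(g(x)) = r(x)$, so the sigmoid of the learned score can serve as the estimated confidence for the unlabeled demonstrations.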

Theorem

For δ ∈ (0, 1), with probability at least 1 − δ over repeated sampling of data for training g,

$$R_{\mathrm{SC},\ell}(g) - R_{\mathrm{SC},\ell}(g^*) = \mathcal{O}_p\Big(\underbrace{n_c^{-1/2}}_{\text{\# of confidence data}} + \underbrace{n_u^{-1/2}}_{\text{\# of unlabeled data}}\Big)$$

Step 2: employ importance weighting to reweight the GAIL objective

Importance weighting: since $p_{\mathrm{opt}}(x) = r(x)\,p(x)/\alpha$ by Bayes' rule, the expectation over $p_{\mathrm{opt}}$ in the GAIL objective can be rewritten as a confidence-weighted expectation over the unlabeled distribution $p$:

$$\min_\theta \max_w \; \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x \sim p}\Big[\frac{r(x)}{\alpha}\log(1 - D_w(x))\Big]$$
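Relative to the plain GAIL update sketched earlier, only the second term changes; a hedged sketch, where the `disc` network, batch tensors, and `r_hat` estimates are assumptions carried over from that sketch:

```python
import torch

def iw_discriminator_value(disc, agent_batch, unlabeled_batch, r_hat, alpha):
    """Inner max of the reweighted objective: the expectation over p_opt is
    replaced by an r(x)/alpha-weighted expectation over unlabeled data ~ p.
    `r_hat` holds the step-1 confidence estimates for `unlabeled_batch`."""
    d_agent = torch.sigmoid(disc(agent_batch)).squeeze(-1)
    d_unlab = torch.sigmoid(disc(unlabeled_batch)).squeeze(-1)
    return (torch.log(d_agent).mean()
            + ((r_hat / alpha) * torch.log(1 - d_unlab)).mean())
```

Since $\mathbb{E}_{x \sim p}[r(x)] = \Pr(y = +1) = \alpha$, one natural estimate of α is the average predicted confidence over the unlabeled data.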


Page 7

Proposed Method 2: GAIL with Imperfect Demonstration and Confidence

Mix the agent's demonstrations with the non-optimal ones

$$p' = \alpha\,p_\theta + (1 - \alpha)\,p_{\mathrm{non}}$$

Matching $p'$ to $p$ forces $p_\theta = p_{\mathrm{opt}}$, while benefiting from the large amount of unlabeled data.

Objective:

$$V(\theta, D_w) = \underbrace{\mathbb{E}_{x \sim p}\big[\log(1 - D_w(x))\big]}_{\text{risk for P class}} + \underbrace{\alpha\,\mathbb{E}_{x \sim p_\theta}\big[\log D_w(x)\big] + \mathbb{E}_{x,r \sim q}\big[(1 - r)\log D_w(x)\big]}_{\text{risk for N class}}$$
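A minimal sketch of this value function, under the same illustrative assumptions as the earlier discriminator sketches:

```python
import torch

def gail_conf_value(disc, unlabeled_batch, agent_batch, conf_batch, r, alpha):
    """V(theta, D_w): the unlabeled mixture ~ p plays the positive class, while
    the alpha-weighted agent data plus (1 - r)-weighted confidence data play
    the negative class, so the discriminator effectively compares p with
    p' = alpha * p_theta + (1 - alpha) * p_non."""
    d_p = torch.sigmoid(disc(unlabeled_batch)).squeeze(-1)  # x ~ p
    d_a = torch.sigmoid(disc(agent_batch)).squeeze(-1)      # x ~ p_theta
    d_c = torch.sigmoid(disc(conf_batch)).squeeze(-1)       # (x, r) ~ q
    return (torch.log(1 - d_p).mean()
            + alpha * torch.log(d_a).mean()
            + ((1 - r) * torch.log(d_c)).mean())
```

As with the GAIL objective above, the discriminator ascends this value while the agent's policy parameters θ descend it.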


Page 8

Setup

Confidence is given by a classifier trained with the demonstration mixture labeled as optimal (y = +1) and non-optimal (y = −1)


Page 9

Results: Higher Average Return of the Proposed Methods

Environment: MuJoCo
Proportion of labeled data: 20%


Page 10

Results: Unlabeled Data Helps

More unlabeled data results in lower variance and better performance

The proposed methods are robust to noise in the confidence scores

(a) Number of unlabeled data. The number in the legend indicates the proportion of the original unlabeled data.

(b) Noise influence. The number in the legend indicates the standard deviation of the Gaussian noise.


Page 11

Conclusion

Two approaches that utilize both unlabeled and confidence-labeled demonstrations are proposed

Our methods are robust to noisy labelers

The proposed approaches can be generalized to other IL and IRL methods

Poster #47


Page 12

References

[1] Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.

[2] Syed, Umar, Michael Bowling, and Robert E. Schapire. "Apprenticeship learning using linear programming." Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
