Transcript of: Some Recent Insights on Transfer Learning (skk2175/Talks/transferLearning.pdf)

Page 1:

Some Recent Insights on Transfer Learning

P → Q?

Samory Kpotufe, Columbia University

Based on work with Guillaume Martinet, and (ongoing) Steve Hanneke

Page 2:

Transfer Learning:

Given data {(X_i, Y_i)} drawn i.i.d. from P, produce a classifier for (X, Y) ∼ Q.

Case study: Apple Siri’s voice assistant

- Initially trained on data from American English speakers ...

- Could not understand 30M+ nonnative speakers in the US!

Costly Solution ≡ 5+ years acquiring more data and retraining!

A Main Practical Goal:

Cheaply transfer ML software between related populations.


Page 5:

Transfer is of general relevance:

AI for Judicial Systems

- Source Population: prison inmates

- Target Population: everyone arrested

Over 60% inaccurate risk assessments on minorities (2016 ProPublica study)

Main Issue: Good Target data is hard or expensive to acquire

AI in medicine, genomics, the insurance industry, smart cities, ...


Page 9:

Many heuristics ... but no mature theory or principles

Page 10:

Basic questions remain largely unanswered:

Suppose: h is trained on source data ∼ P, to be transferred to target Q.

• Is there enough information in source P about target Q?

• If not, how much new data should we collect, and how?

• Would unlabeled target data suffice? Or help at least?



Page 15:

Formal Setup:

Covariate-Shift: P_X ≠ Q_X but P_{Y|X} = Q_{Y|X}.

For P_{Y|X} ≠ Q_{Y|X}: [Scott 19], [Blanchard et al. 19], [Cai & Wei 19].

Given: source data {(X_i, Y_i)} ∼ P^{n_P}, target data {(X_i, Y_i)} ∼ Q^{n_Q}.

Goal: h with small target error err_Q(h) = E_Q 1(h(X) ≠ Y).

What is the best err_Q(h) achievable in terms of n_P, n_Q?

Depends on distance(P_X → Q_X)
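To make the covariate-shift setup concrete, here is a minimal numerical sketch (my illustration, not from the talk): P_X and Q_X differ, the regression function η(x) = E[Y|x] is shared, and a threshold classifier fit by ERM on source data alone is evaluated on target data. The specific distributions (Beta(2,5) vs. Beta(5,2)) and the threshold class are assumptions chosen only for illustration.

```python
# Minimal covariate-shift sketch (illustration only, not from the talk).
# P_X and Q_X differ, but eta(x) = E[Y|x] is shared between source and target.
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Shared conditional P(Y=1|x) = Q(Y=1|x); the Bayes rule is 1{x > 0.5}.
    return 1.0 / (1.0 + np.exp(-10 * (x - 0.5)))

def sample(n, dist):
    # Source P_X concentrates near 0, target Q_X concentrates near 1.
    x = rng.beta(2, 5, n) if dist == "P" else rng.beta(5, 2, n)
    y = (rng.random(n) < eta(x)).astype(int)
    return x, y

def fit_threshold(x, y):
    # Crude ERM over threshold classifiers h_t(x) = 1{x > t}.
    ts = np.linspace(0, 1, 201)
    errs = [np.mean((x > t).astype(int) != y) for t in ts]
    return ts[int(np.argmin(errs))]

def err_Q(t, n_test=100_000):
    xq, yq = sample(n_test, "Q")
    return np.mean((xq > t).astype(int) != yq)

xs, ys = sample(2000, "P")      # labeled source data only
t_hat = fit_threshold(xs, ys)
print(f"threshold fit on P: {t_hat:.2f}, target error err_Q: {err_Q(t_hat):.3f}")
```

Because η is shared, a classifier fit purely on P can already do well on Q here; the talk's question is how the required sample sizes scale with the shift between P_X and Q_X.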



Page 23:

However, the right notion of distance(P_X → Q_X) remains unclear ...

Basic intuition: Transfer is easiest if P has sufficient mass in regions of large Q-mass.

[Figure: P and Q put different proportions of mass on the two clusters of points, e.g., different proportions of English accents]

Many foundational results quantify this intuition ...

Page 24:

Common notions of distance(P_X → Q_X):

• Extensions of TV: differences |P_X(A) − Q_X(A)| over suitable sets A (e.g., the dA-divergence / Y-discrepancy of S. Ben-David, M. Mohri, ...)

• Density ratios: the ratio dQ_X/dP_X over the data space (e.g., Sugiyama, Belkin, Jordan, Wainwright, ...)

How well do these notions measure transferability?
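As a quick empirical illustration (not from the talk) of how these quantities behave, the sketch below computes crude histogram estimates of the TV distance and the density ratio dQ_X/dP_X from samples; the particular distributions are assumptions. Note the ratio estimate can blow up, even to infinity on bins P never hits, while the two populations still overlap substantially, which previews the pessimism discussed on the next slides.

```python
# Rough empirical sketch (not from the talk): histogram estimates of two of
# these distances between samples from P_X and Q_X.
import numpy as np

rng = np.random.default_rng(1)
xp = rng.beta(2, 5, 50_000)          # source covariates ~ P_X
xq = rng.beta(5, 2, 50_000)          # target covariates ~ Q_X

bins = np.linspace(0, 1, 51)
p_hist, _ = np.histogram(xp, bins=bins)
q_hist, _ = np.histogram(xq, bins=bins)
p_mass, q_mass = p_hist / p_hist.sum(), q_hist / q_hist.sum()

tv = 0.5 * np.abs(p_mass - q_mass).sum()              # total-variation estimate
# Bins where P has no samples make the ratio estimate infinite.
ratio = np.where(p_mass > 0, q_mass / np.maximum(p_mass, 1e-12), np.inf)
print(f"TV(P_X, Q_X) ≈ {tv:.2f}")
print(f"max density ratio dQ_X/dP_X ≈ {ratio.max():.1f}")
```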


Page 28:

Seminal results: (TV, dA, Y-disc, dQ/dP, KL, Rényi, Wasserstein)

The closer P_X is to Q_X =⇒ the easier Transfer is ...

However: P_X far from Q_X ⇏ Transfer is Hard

[Figure examples: large TV, dA, Y-disc ≈ 1/2; large dQ/dP, KL-div ≈ ∞]

These notions can be pessimistic in measuring transferability ...


Page 34:

We propose a new distance(P → Q) shown to control transfer ...

Page 35:

Relating source P to target Q [Kpo., Martinet, COLT 18]

Main intuition: P_X needs mass in regions of significant Q_X mass.

Transfer exponent γ ≥ 0:

For all x in the support of Q_X and all r ∈ (0, 1]:   P_X(B(x, r)) ≥ C · r^γ · Q_X(B(x, r))

γ captures a continuum of easy to hard transfer ...
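The definition suggests a crude way to probe γ from samples, sketched below (my illustration; not an estimator from the paper): for probe points x drawn from Q_X, compare empirical ball masses under P and Q_X over a range of radii and read off a slope on a log-log scale.

```python
# Rough numerical probe (assumption: not the paper's procedure) of the transfer
# exponent gamma, i.e., the smallest gamma with P(B(x,r)) >= C * r^gamma * Q_X(B(x,r)).
import numpy as np

rng = np.random.default_rng(2)
xp = rng.beta(2, 5, 100_000)     # samples from P_X
xq = rng.beta(5, 2, 100_000)     # samples from Q_X

def ball_mass(sample, x, r):
    return np.mean(np.abs(sample - x) <= r)

radii = np.logspace(-2, -0.5, 10)
centers = rng.choice(xq, 50)      # probe points x "in the support of Q_X"

# The condition says log P(B)/Q(B) >= log C + gamma * log r; a crude slope fit
# over the worst probe point at each radius gives a feel for gamma.
log_ratio = np.array([[np.log(ball_mass(xp, x, r) + 1e-12)
                       - np.log(ball_mass(xq, x, r) + 1e-12)
                       for r in radii] for x in centers])
worst = log_ratio.min(axis=0)                       # worst-case center per radius
gamma_hat, _ = np.polyfit(np.log(radii), worst, 1)  # slope ~ gamma
print(f"crude transfer-exponent estimate: gamma ≈ {gamma_hat:.2f}")
```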


Page 42:

First let's look at the extremes γ = ∞ or γ = 0

Notice that γ := γ(P → Q) is asymmetric

(unlike TV, dA, Y-discrepancy, Wasserstein, ...)


Page 45:

The continuum 0 < γ < ∞:

• γ ≡ How fast P_X shifts mass away from Q_X-dense regions.

• γ ≡ How fast f_Q/f_P goes to ∞.

• γ ≡ Difference in support dimension.

Page 49:

Optimistic: γ is often small when other measures are not

• Large TV, dA, Y-disc ≈ 1/2, but here typically γ ≈ 0

• Large dQ_X/dP_X, KL-div ≈ ∞, but γ = 1 = dim(P_X) − dim(Q_X)

Page 52:

γ captures performance limits (minimax rates) under transfer ...

Page 53:

Performance depends on γ + hardness of Q:

[Figure: easy to hard target Q_{X,Y} classification; panels "Easy Q_{X,Y}" and "Hard Q_{X,Y}"]

Essential: Noise in Q_{Y|X}, and Q_X-mass near the decision boundary


Page 56:

Parametrizing easy to hard Q_{X,Y}:

Setup: X ∈ compact X, Y ∈ {0, 1}

Noise conditions on η(x) = E[Y | x]:

• Smoothness: |η(x) − η(x′)| ≤ λ ρ(x, x′)^α.

• Noise margin: Q_X(x : |η(x) − 1/2| < t) ≤ C t^β.

Two types of regularity on Q_X:

• Near-uniform mass: for any ball B_r of radius r, Q_X(B_r) ≥ C r^d.

• Support regularity: the support X_Q has r-cover size ≤ C r^{−d}.

Here d acts like the intrinsic dimension of Q_X, for X ∈ IR^D.

Classification is easiest with large α, β and small d ... so is transfer
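For intuition about the noise-margin parameter β, here is a small numerical check (an illustrative construction of mine, not from the talk) that a synthetic η satisfies Q_X(|η − 1/2| < t) ≤ C t^β, with C = 2^β for this particular construction.

```python
# Check that a synthetic eta obeys the noise-margin condition
# Q_X(|eta(x) - 1/2| < t) <= C * t^beta  (illustrative assumption, not from the talk).
import numpy as np

rng = np.random.default_rng(3)
beta = 2.0

def eta(x):
    # |eta(x) - 1/2| = 0.5 * |2(x - 1/2)|^(1/beta), so under Q_X = Uniform[0,1]
    # the margin mass is exactly (2t)^beta, i.e., the condition holds with C = 2^beta.
    return 0.5 + 0.5 * np.sign(x - 0.5) * np.abs(2 * (x - 0.5)) ** (1.0 / beta)

xq = rng.random(1_000_000)               # Q_X = Uniform[0, 1]
for t in (0.2, 0.1, 0.05):
    mass = np.mean(np.abs(eta(xq) - 0.5) < t)
    print(f"t = {t:.2f}:  Q_X(|eta - 1/2| < t) = {mass:.4f}   vs   t^beta = {t**beta:.4f}")
```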


Page 66:

Minimax rates of Transfer:

Given: labeled source and target data {(X_i, Y_i)} ∼ P^{n_P} × Q^{n_Q}.

Excess error: E_Q(h) ≡ err_Q(h) − inf_h err_Q(h).

Theorem. Define h on {(X_i, Y_i)}, even with knowledge of P_X, Q_X:

    inf_h  sup_{dist(P,Q)=γ}  E[ E_Q(h) ]  ≈  ( n_P^{d_0/(d_0 + γ/α)} + n_Q )^{−(β+1)/d_0},

where d_0 = 2 + d/α for near-uniform Q_X, and d_0 = 2 + β + d/α otherwise.

Immediate message: Transfer is easiest as γ → 0, hardest as γ → ∞ ...
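A direct numerical reading of the theorem (constants ignored) is sketched below: the quantity n_P^{d_0/(d_0 + γ/α)} acts as the number of target samples that n_P source samples are "worth", and it collapses quickly as γ grows. The parameter values are arbitrary choices for illustration.

```python
# Numerical reading of the minimax rate above (illustration; constants ignored).
def transfer_rate(n_P, n_Q, gamma, alpha, beta, d, near_uniform=True):
    d0 = 2 + d / alpha if near_uniform else 2 + beta + d / alpha
    effective_target = n_P ** (d0 / (d0 + gamma / alpha))  # source data "converted" into target samples
    return (effective_target + n_Q) ** (-(beta + 1) / d0), effective_target

for gamma in (0, 1, 4, 16):
    rate, eff = transfer_rate(n_P=10_000, n_Q=0, gamma=gamma, alpha=1, beta=1, d=2)
    print(f"gamma = {gamma:2d}:  n_P = 10000 is worth ~{eff:7.0f} target samples,  rate ≈ {rate:.3f}")
```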


Page 72:

Simulations with increasing γ:

n_S ≡ source sample size (no target samples used)

Transfer requires more source data for larger γ values.


Page 76:

Lower-bound analysis:

Main Ingredients:

- Established techniques based on Fano’s inequality.

- h has access to (P,Q) samples, but has to do well on just Q ...


Page 79:

Upper-bound analysis:

Rate achieved by k-NN on combined source and target data.

Main ingredient: show that NN distances depend on ρ.

One open problem we had to solve:

Rates for Vanilla k-NN without uniform density assumption
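A minimal sketch of the pooled k-NN classifier this analysis refers to, using scikit-learn on synthetic covariate-shift data; the paper's exact choice of k and any weighting of source versus target points may differ, and the data-generating distributions here are assumptions.

```python
# Minimal sketch of k-NN on the combined source and target samples
# (scikit-learn; the paper's choice of k and any sample weighting may differ).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

def eta(x):                        # shared regression function under covariate shift
    return 1.0 / (1.0 + np.exp(-10 * (x[:, 0] - 0.5)))

def sample(n, a, b):               # covariates Beta(a, b), labels drawn from eta
    x = rng.beta(a, b, (n, 1))
    y = (rng.random(n) < eta(x)).astype(int)
    return x, y

Xp, yp = sample(5000, 2, 5)        # source ~ P
Xq, yq = sample(200, 5, 2)         # small labeled target ~ Q
Xte, yte = sample(20_000, 5, 2)    # target test set

pooled = KNeighborsClassifier(n_neighbors=25).fit(np.vstack([Xp, Xq]), np.hstack([yp, yq]))
source_only = KNeighborsClassifier(n_neighbors=25).fit(Xp, yp)
print("target error, pooled P+Q :", np.mean(pooled.predict(Xte) != yte))
print("target error, source only:", np.mean(source_only.predict(Xte) != yte))
```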


Page 83:

Many new messages:

Recall Minimax Rates ≈ ( n_P^{d_0/(d_0 + γ/α)} + n_Q )^{−(β+1)/d_0}

- Fast rates O(1/n) are possible even with large γ.

- Target data is most beneficial when n_Q ≳ n_P^{d_0/(d_0 + γ/α)}.

- Unlabeled data does not improve rates beyond constants ...


Page 88:

All good but ...

Can we automatically decide how much target data to sample? (Ongoing work ...)


Page 91:

γ yields insights into adaptive sampling: (ongoing work)

Setup: n_P labeled samples from P, n_Q unlabeled samples from Q.

Adaptive Sampling:

Sample in low-confidence regions A ⊂ X with large γ(A).

(γ(A) ← compares P_X and Q_X in region A)

Essentially, label in Q-massive regions with few samples from P ...

The above refines a procedure of [Berlind, Urner, 15]
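Below is a toy sketch of a region-based label-request rule in this spirit (my own simplification; it is not the procedure from the ongoing work or from [Berlind, Urner, 15]): partition X into regions, score each region by how Q-massive it is relative to the available P samples, and spend the labeling budget proportionally.

```python
# Toy sketch of a region-based label-request rule in this spirit
# (my simplification; not the procedure from the ongoing work or [Berlind, Urner, 15]).
import numpy as np

rng = np.random.default_rng(5)
xp = rng.beta(2, 5, 2000)          # labeled source covariates ~ P_X
xq = rng.beta(5, 2, 2000)          # unlabeled target covariates ~ Q_X

bins = np.linspace(0, 1, 21)       # regions A: a fixed partition of X = [0, 1]
p_cnt, _ = np.histogram(xp, bins)
q_cnt, _ = np.histogram(xq, bins)

# Score each region by how Q-massive it is relative to available P samples:
# a large score plays the role of a large local gamma(A), i.e., P gives little
# coverage where Q lives.
score = q_cnt / (p_cnt + 1.0)

budget = 300                       # total target labels we are willing to request
requests = np.round(budget * score / score.sum()).astype(int)
for a, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
    if requests[a] > 0:
        print(f"region [{lo:.2f}, {hi:.2f}): request {requests[a]:3d} target labels "
              f"(P count {p_cnt[a]:4d}, Q count {q_cnt[a]:4d})")
```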


Page 96:

Initial experiments with CIFAR-10:

Transfer Setup: Target Q has 50% images of cats and dogs.

n_P = 20K, n_Q = 6K unlabeled Q samples

h: deep NN with 10 layers, and k-NN on top.
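A generic sketch of the "deep features + k-NN on top" pipeline is below. The talk does not give the architecture or training details, so `extract_features` is a hypothetical stand-in for the 10-layer network's embedding (here it just flattens pixels so the example runs end to end); any pretrained feature extractor would play the same role.

```python
# Sketch of the "deep features + k-NN on top" pipeline (details assumed, not from the talk).
# `extract_features` is a hypothetical stand-in for the 10-layer network's embedding.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def extract_features(images: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder: map raw images to an embedding space.
    # Here we simply flatten pixels so the example is runnable.
    return images.reshape(len(images), -1).astype(np.float32)

def knn_on_top(source_imgs, source_labels, target_imgs, k=10):
    feats = extract_features(source_imgs)
    clf = KNeighborsClassifier(n_neighbors=k).fit(feats, source_labels)
    return clf.predict(extract_features(target_imgs))

# Tiny synthetic stand-in for CIFAR-10-shaped data, just to exercise the pipeline.
rng = np.random.default_rng(6)
src = rng.random((200, 32, 32, 3)); src_y = rng.integers(0, 10, 200)
tgt = rng.random((20, 32, 32, 3))
print(knn_on_top(src, src_y, tgt)[:10])
```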


Page 100:

Results

Best performance achieved after relatively few label requests ...


Page 102:

Quick Summary and some New Directions ...

Page 103:

Quick Summary:

• γ captures a more optimistic view of transferability P → Q.

• Unlabeled data can only improve constants in the rates.

• Adaptive sampling is possible with no knowledge of γ.



Page 112:

New direction: refining γ ...

Often in practice, a family H of predictors is fixed (NN, trees, SVMs, neural nets, ...)

Intuition:

Consider regions of X most relevant to H (with S. Hanneke)

In particular:

Consider disagreements between classifiers

γ :  P_X(h ≠ h′) ≳ Q_X(h ≠ h′)^{1+γ}

[Figure: the set A where two half-spaces disagree]

This yields H specific performance limits ...
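To make the disagreement-based quantity concrete, here is a toy numerical probe (my illustration, not the paper's definition verbatim): estimate P_X(h ≠ h′) and Q_X(h ≠ h′) for two specific half-spaces from samples, and read off the implied exponent from this single pair.

```python
# Toy probe (illustration only) of the H-specific exponent for two half-spaces:
# compare P_X and Q_X masses of the disagreement set {x : h(x) != h'(x)}.
import numpy as np

rng = np.random.default_rng(7)
Xp = rng.normal([0.0, 0.0], 1.0, (200_000, 2))   # source covariates ~ P_X
Xq = rng.normal([2.0, 0.0], 0.5, (200_000, 2))   # target covariates ~ Q_X

def halfspace(w, b):
    return lambda X: (X @ w + b > 0).astype(int)

h  = halfspace(np.array([1.0, 0.0]), -1.0)       # decision boundary x1 = 1.0
hp = halfspace(np.array([1.0, 0.0]), -1.3)       # decision boundary x1 = 1.3

disagree_P = np.mean(h(Xp) != hp(Xp))            # P_X(h != h')
disagree_Q = np.mean(h(Xq) != hp(Xq))            # Q_X(h != h')
# If P_X(h != h') ~ Q_X(h != h')^(1+gamma), then 1+gamma ~ log(P-mass)/log(Q-mass).
exponent = np.log(disagree_P) / np.log(disagree_Q)
print(f"P-mass {disagree_P:.4f}, Q-mass {disagree_Q:.4f}, implied 1+gamma ≈ {exponent:.2f}")
```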


Page 114:

New Messages: [Kpo. & Hanneke, 19]

Near-optimal heuristics for bounded VC classes: (no need to estimate γ)

No classification noise:

ERM on combined source and target data is minimax-optimal.

Any level of noise:

Minimize R_Q(h) subject to R_P(h) ≤ min_{h′} R_P(h′) + ∆_{n_P}(h)

(Hard to implement in general ...)
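For a finite, low-VC class the constrained rule can be run by brute force. The sketch below does this for 1-D threshold classifiers; the slack ∆_{n_P} is set to a crude VC-type term, which is my assumption for illustration rather than the paper's choice.

```python
# Brute-force sketch of the constrained rule over 1-D threshold classifiers
# (illustration; the slack Delta is a crude VC-type term, an assumption here).
import numpy as np

rng = np.random.default_rng(8)

def eta(x): return 1.0 / (1.0 + np.exp(-10 * (x - 0.5)))
def sample(n, a, b):
    x = rng.beta(a, b, n); return x, (rng.random(n) < eta(x)).astype(int)

xs, ys = sample(5000, 2, 5)     # source ~ P
xt, yt = sample(100, 5, 2)      # small labeled target ~ Q

thresholds = np.linspace(0, 1, 401)            # hypothesis class H
R_P = np.array([np.mean((xs > t) != ys) for t in thresholds])   # empirical source risks
R_Q = np.array([np.mean((xt > t) != yt) for t in thresholds])   # empirical target risks

delta = np.sqrt(np.log(len(xs)) / len(xs))     # crude stand-in for Delta_{n_P}(h)
feasible = R_P <= R_P.min() + delta            # near-minimizers of the source risk
best = thresholds[feasible][np.argmin(R_Q[feasible])]
print(f"chosen threshold: {best:.3f}  (source-only ERM: {thresholds[np.argmin(R_P)]:.3f})")
```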


Page 118:

Somehow we are still just scratching the surface of what's possible ...

Results extend beyond covariate-shift to P_{Y|X} ≠ Q_{Y|X}

Mostly Open:

• More complex transfer regimes?

• Multitask, curriculum, lifelong, fairness, robustness?

Thanks!
