Visually grounded cross-lingual keyword spotting in speech ...

58
Visually grounded cross-lingual keyword spotting in speech SLTU, August 2018 Herman Kamper 1 and Michael Roth 2 1 E&E Engineering, Stellenbosch University, South Africa 2 Saarland University, Germany http://www.kamperh.com/

Transcript of Visually grounded cross-lingual keyword spotting in speech ...

Page 1: Visually grounded cross-lingual keyword spotting in speech ...

Visually grounded cross-lingualkeyword spotting in speech

SLTU, August 2018

Herman Kamper1 and Michael Roth2

1E&E Engineering, Stellenbosch University, South Africa2Saarland University, Germany

http://www.kamperh.com/

Page 2: Visually grounded cross-lingual keyword spotting in speech ...

Advances in speech recognition

• Addiction to labels: 2000 hours transcribed speech audio;∼350M/560M words text [Xiong et al., TASLP’17]

• Very different from the “supervision” infants use to learn language• Sometimes not possible, e.g., for unwritten languages

1 / 12

Page 3: Visually grounded cross-lingual keyword spotting in speech ...

Advances in speech recognition

• Addiction to labels: 2000 hours transcribed speech audio;∼350M/560M words text [Xiong et al., TASLP’17]

• Very different from the “supervision” infants use to learn language• Sometimes not possible, e.g., for unwritten languages

1 / 12

Page 4: Visually grounded cross-lingual keyword spotting in speech ...

Advances in speech recognition

• Addiction to labels: 2000 hours transcribed speech audio;∼350M/560M words text [Xiong et al., TASLP’17]

• Very different from the “supervision” infants use to learn language

• Sometimes not possible, e.g., for unwritten languages

1 / 12

Page 5: Visually grounded cross-lingual keyword spotting in speech ...

Advances in speech recognition

• Addiction to labels: 2000 hours transcribed speech audio;∼350M/560M words text [Xiong et al., TASLP’17]

• Very different from the “supervision” infants use to learn language• Sometimes not possible, e.g., for unwritten languages

1 / 12

Page 6: Visually grounded cross-lingual keyword spotting in speech ...

Images as weak labels for speech

Can we use images as weak labels in low-resource settings?

Play

• Maybe we cannot use this type of data for full ASR, but maybe it canbe used for other tasks?

• Goal: Use this type of data for cross-lingual keyword spotting

2 / 12

Page 7: Visually grounded cross-lingual keyword spotting in speech ...

Images as weak labels for speechCan we use images as weak labels in low-resource settings?

Play

• Maybe we cannot use this type of data for full ASR, but maybe it canbe used for other tasks?

• Goal: Use this type of data for cross-lingual keyword spotting

2 / 12

Page 8: Visually grounded cross-lingual keyword spotting in speech ...

Images as weak labels for speechCan we use images as weak labels in low-resource settings?

Play

• Maybe we cannot use this type of data for full ASR, but maybe it canbe used for other tasks?

• Goal: Use this type of data for cross-lingual keyword spotting

2 / 12

Page 9: Visually grounded cross-lingual keyword spotting in speech ...

Images as weak labels for speechCan we use images as weak labels in low-resource settings?

Play

• Maybe we cannot use this type of data for full ASR, but maybe it canbe used for other tasks?

• Goal: Use this type of data for cross-lingual keyword spotting

2 / 12

Page 10: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual keyword spotting

kuwaka

kuwaka

Swahili speech corpus

Written query:

burning(English)

3 / 12

Page 11: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 12: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 13: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 14: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 15: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 16: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 17: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech

X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 18: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech

X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech

X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 19: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech

X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech

X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 20: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual word prediction from images

VGG

hat man

shirtyvis

VGG

hat man

shirtyvis

0.85 0.8 0.9

X

VGG

hat man

shirtyvis f(X)

max

conv

max

feedfwd

X

VGG

hat man

shirtyvis f(X)Loss

max

conv

max

feedfwd

`

X

f(X)

max

conv

max

feedfwd

X

f(X)

max

conv

max

feedfwd

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

Swahili speech X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

Swahili speech

X

f(X)

max

conv

max

feedfwdma

nhat

f(X) ∈ RW is vector ofword probabilities

I.e., a cross-lingualspoken bag-of-words(BoW) classifier

Swahili speech

4 / 12

Page 21: Visually grounded cross-lingual keyword spotting in speech ...

Experimental details• Goal: Use visual grounding for cross-lingual keyword spotting

• Proof-of-concept: Use English speech with German queries:

X

Loss

max

conv

max

feedfwd

`

VGG-16

f(X)Feld Hu

ndespr

ingt

English speech

German (text) tags

yde

Cross-lingual keyword spotter

I

• Data: 8000 images with 5 English spoken captions (∼37 hours)• Weak labels: German visual tagger trained on German Multi30k

5 / 12

Page 22: Visually grounded cross-lingual keyword spotting in speech ...

Experimental details• Goal: Use visual grounding for cross-lingual keyword spotting• Proof-of-concept: Use English speech with German queries

:

X

Loss

max

conv

max

feedfwd

`

VGG-16

f(X)Feld Hu

ndespr

ingt

English speech

German (text) tags

yde

Cross-lingual keyword spotter

I

• Data: 8000 images with 5 English spoken captions (∼37 hours)• Weak labels: German visual tagger trained on German Multi30k

5 / 12

Page 23: Visually grounded cross-lingual keyword spotting in speech ...

Experimental details• Goal: Use visual grounding for cross-lingual keyword spotting• Proof-of-concept: Use English speech with German queries:

X

Loss

max

conv

max

feedfwd

`

VGG-16

f(X)Feld Hu

ndespr

ingt

English speech

German (text) tags

yde

Cross-lingual keyword spotter

I

• Data: 8000 images with 5 English spoken captions (∼37 hours)• Weak labels: German visual tagger trained on German Multi30k

5 / 12

Page 24: Visually grounded cross-lingual keyword spotting in speech ...

Experimental details• Goal: Use visual grounding for cross-lingual keyword spotting• Proof-of-concept: Use English speech with German queries:

X

Loss

max

conv

max

feedfwd

`

VGG-16

f(X)Feld Hu

ndespr

ingt

English speech

German (text) tags

yde

Cross-lingual keyword spotter

I

• Data: 8000 images with 5 English spoken captions (∼37 hours)• Weak labels: German visual tagger trained on German Multi30k

5 / 12

Page 25: Visually grounded cross-lingual keyword spotting in speech ...

Predictions on test data

f(X1)

f(X2)

f(X3)

Given German keyword:

fw(Xi) = Pθ(w|Xi): score for whether (English) speech Xi

contains translation of given (German) keyword w

‘Hunde’ English speech collection

(want to search)correspondsto dim. w

Evaluation: Does predicted keyword occur in reference translation?

6 / 12

Page 26: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

•• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

•• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 27: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: Fahrrad

Output (in top 10):•• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

•• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 28: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• Play

• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

•• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 29: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• man riding a bicycle on a foggy day

• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

•• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 30: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• man riding a bicycle on a foggy day• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

•• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 31: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• man riding a bicycle on a foggy day• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)

Output (in top 10):•• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 32: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• man riding a bicycle on a foggy day• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

• Play

• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 33: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• man riding a bicycle on a foggy day• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

• a woman in black and red listens to an ipod walks down the street

• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 34: Visually grounded cross-lingual keyword spotting in speech ...

Example predictions (top retrievals)Task: Given written German keyword, find utterances in an unseenEnglish speech collection containing that keyword

Input: FahrradOutput (in top 10):

• man riding a bicycle on a foggy day• a biker does a trick on a ramp• a person is doing tricks on a bicycle in a city

Input: Straße (street)Output (in top 10):

• a woman in black and red listens to an ipod walks down the street• people on the city street walk past a puppet theater• an asian woman rides a bicycle in front of two cars

7 / 12

Page 35: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual keyword spotting performance

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

8 / 12

Page 36: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)

Output:• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 37: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field

• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 38: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)

• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 39: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 40: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)

Output:• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 41: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass

• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 42: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)

• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 43: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 44: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)

Output:• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 45: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors

• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 46: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean

• a small group of people sitting together outside ∗ (3)

9 / 12

Page 47: Visually grounded cross-lingual keyword spotting in speech ...

A few more example predictionsInput: Feld (field)Output:

• a team of baseball players in blue uniforms walking together on field• a brown and black dog running through a grassy field ∗ (1)• two small children walk away in a field

Input: grun(en) (green)Output:

• boy wearing a green and white soccer uniform running through the grass• a girl is screaming as she comes off the water slide ∗ (3)• a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)Output:

• a large crowd of people ice skating outdoors• a surfer catching a large wave in the ocean• a small group of people sitting together outside ∗ (3)

9 / 12

Page 48: Visually grounded cross-lingual keyword spotting in speech ...

Error analysis by annotator

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

10 / 12

Page 49: Visually grounded cross-lingual keyword spotting in speech ...

Error analysis by annotator

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

10 / 12

Page 50: Visually grounded cross-lingual keyword spotting in speech ...

Error analysis by annotator

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

0 20 40 60 80 100

P@10 (%)

DETextPrior

DEVisionCNN

XVisionSpeechCNN

XBoWCNN

10 / 12

Page 51: Visually grounded cross-lingual keyword spotting in speech ...

Cross-lingual keyword spotting

kuwaka

kuwaka

motoWritten query:

burning(English)

11 / 12

Page 52: Visually grounded cross-lingual keyword spotting in speech ...

Conclusions and future work

• Visual grounding makes it possible to perform cross-lingual keywordspotting without any parallel speech and text or translations

• Future: Apply approach to a truly low-resource language

• Perform error analysis on larger scale

• Visual tagger improvements: language-agnostic visual recognition

12 / 12

Page 53: Visually grounded cross-lingual keyword spotting in speech ...

Conclusions and future work

• Visual grounding makes it possible to perform cross-lingual keywordspotting without any parallel speech and text or translations

• Future: Apply approach to a truly low-resource language

• Perform error analysis on larger scale

• Visual tagger improvements: language-agnostic visual recognition

12 / 12

Page 54: Visually grounded cross-lingual keyword spotting in speech ...

Conclusions and future work

• Visual grounding makes it possible to perform cross-lingual keywordspotting without any parallel speech and text or translations

• Future: Apply approach to a truly low-resource language

• Perform error analysis on larger scale

• Visual tagger improvements: language-agnostic visual recognition

12 / 12

Page 55: Visually grounded cross-lingual keyword spotting in speech ...

Conclusions and future work

• Visual grounding makes it possible to perform cross-lingual keywordspotting without any parallel speech and text or translations

• Future: Apply approach to a truly low-resource language

• Perform error analysis on larger scale

• Visual tagger improvements: language-agnostic visual recognition

12 / 12

Page 56: Visually grounded cross-lingual keyword spotting in speech ...

https://github.com/kamperh/

Page 57: Visually grounded cross-lingual keyword spotting in speech ...

Training: Visually grounded model

X

Loss

max

conv

max

feedfwd

`

VGG-16

f(X)Feld Hu

ndespr

ingt

English speech

German (text) tags

yde

Cross-lingual keyword spotter

I

Page 58: Visually grounded cross-lingual keyword spotting in speech ...

Testing: Cross-lingual keyword spotting

f(X1)

f(X2)

f(X3)

Given German keyword:

fw(Xi) = Pθ(w|Xi): score for whether (English) speech Xi

contains translation of given (German) keyword w

‘Hunde’ English speech collection

(want to search)correspondsto dim. w