Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train:...

41
Bleaching Text: Abstract Features for Cross-lingual Gender Prediction. Rob van der Goot, Nikola Ljubeˇ si´ c, Ian Matroos, Malvina Nissim & Barbara Plank

Transcript of Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train:...

Page 1: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text: Abstract Features for Cross-lingual

Gender Prediction.

Rob van der Goot, Nikola Ljubesic, Ian Matroos,Malvina Nissim & Barbara Plank

Page 2: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text: Abstract Features for Cross-lingual

Gender Prediction.

Rob van der Goot, Nikola Ljubesic, Ian Matroos,Malvina Nissim & Barbara Plank

Page 3: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Gender Prediction

The task of predicting gender based only on text.

Page 4: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Gender Prediction

2000 2018

Perf

orm

ance

Open Vocabulary

Page 5: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Gender Prediction

2000 2018

Featu

res

Open Vocabulary

Page 6: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Gender Prediction

2000 2018

Modelin

g D

ata

sets

Open Vocabulary

Page 7: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Gender Prediction

SVM with word/char n-grams performs best!

Page 8: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Gender Prediction

SVM with word/char n-grams performs best!

I Winner PAN 2017 shared task on author profiling:

I Words: 1-2 grams

I Characters: 3-6 grams

Page 9: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

https://www.brewbound.com/news/power-hour-craft-beer-growth-opportunity-lies-female-consumers

https://www.craftbrewingbusiness.com/news/survey-women-drinking-beer-men-drinking-less/

https://www.nzherald.co.nz/business/news/article.cfm?c_id=3&objectid=11802831

Page 10: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

However, how would this lexicalized approach work acrossdifferent:

I time-spans

I domains

I languages???

Page 11: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Cross-lingual Gender Prediction

I Train a model on source language(s) and evaluate ontarget language.

Page 12: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Cross-lingual Gender Prediction

I Dataset: TwiSty corpus (Verhoeven et al., 2016) +English

I 200 tweets per user, 850 - 8,112 users per language

Page 13: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Cross-lingual Gender Prediction

Train:

FR EN NL PT ESTest Language

50

60

70

80

90

Acc

ura

cy

Page 14: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Cross-lingual Gender Prediction

USER Jaaa moeten we zeker doen

Page 15: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Page 16: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Page 17: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Page 18: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Freq 0 5 2 5 0 5 1 0

Page 19: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04

Page 20: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!

Page 21: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!PunctA w w w w w w wp jjjj

Page 22: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!PunctA w w w w w w wp jjjjShape ull l ll ll ull ll llx xx

Page 23: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Original Massacred a bag of Doritos for lunch!

Freq 0 5 2 5 0 5 1 0Length 09 01 03 02 07 03 06 04PunctC w w w w w w w!PunctA w w w w w w wp jjjjShape ull l ll ll ull ll llx xxVowels cvccvccvc v cvc vc cvcvcvc cvc cvccco oooo

Page 24: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

I No tokenization

I Replace usernames and URLs

I Use concatenation of the bleached representations

I Tuned in-language

I 5-grams perform best

Page 25: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Train:

FR EN NL PT ESTest Language

50

60

70

80

90

Acc

ura

cy

Lexicalized

Bleached

Page 26: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Trained on all other languages:

EN NL FR PT ESTest Language

50

60

70

80A

ccura

cy

Lexicalized

Bleached

Page 27: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleaching Text

Most predictive features

Male Female

1 W W W W ”W” USER E W W W2 W W W W ? 3 5 1 5 23 2 5 0 5 2 W W W W4 5 4 4 5 4 E W W W W5 W W, W W W? LL LL LL LL LX6 4 4 2 1 4 LL LL LL LL LUU7 PP W W W W W W W W *-*8 5 5 2 2 5 W W W W JJJ9 02 02 05 02 06 W W W W &W;W

10 5 0 5 5 2 J W W W W

Page 28: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Human Experiments

I Are humans able to predict gender based only on text forunknown languages?

Page 29: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Human Experiments

I 20 tweets per user (instead of 200)

I 6 annotators per language pair

I Each annotating 100 users

I 200 users per language pair, so 3 predictions per user

Page 30: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Human Experiments

I 20 tweets per user (instead of 200)

I 6 annotators per language pair

I Each annotating 100 users

I 200 users per language pair, so 3 predictions per user

Page 31: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Human Experiments

Page 32: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Human Experiments

NL NL NL PT FR NL

Test Language

50

60

70

80

90

Acc

ura

cyLexicalized

Bleached

Humans

(note that the classifier had acces to 200 tweets)

Page 33: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Conclusions

I Lexical models break down when used cross-language

I Bleaching text improves cross-lingual performance

I Humans performance is on par with our bleachedapproach

Page 34: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Thanks for your attention

Page 35: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Cross-lingual Embeddings

EN NL FR PT ESTest Language

50

60

70

80

90A

ccura

cyLexicalized

Bleached

Embeddings

See: Plank (2017) & Smith et al. (2017)

Page 36: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Lexicalized Cross-language

Test → EN NL FR PT ES

Train

EN 52.8 48.0 51.6 50.4NL 51.1 50.3 50.0 50.2FR 55.2 50.0 58.3 57.1PT 50.2 56.4 59.6 64.8ES 50.8 50.1 55.6 61.2

Avg 51.8 52.3 53.4 55.3 55.6

Page 37: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

In-language performance

EN NL FR PT ESTest Language

60

70

80

90A

ccura

cy

Lexicalized

Bleached

Page 38: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Bleached + Lexicalized

EN NL FR PT ESTest Language

50

60

70

80A

ccura

cy

Bleached

Bleached+lex

Page 39: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Unigrams vs fivegrams

EN NL FR PT ESTest Language

50

60

70

80A

ccura

cy

Unigram

Fivegram

Page 40: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Number of unique unigrams for Dutch

Feature Size

Lexicalized 281011Bleached 54103

Frequency 8Length 79PunctAgr 107PunctCons 5192Shape 2535Vowels 46198

Page 41: Bleaching Text: Abstract Features for Cross-lingual Gender ...Cross-lingual Gender Prediction Train: FR EN NL PT ES Test Language 50 60 70 80 90 Accuracy. Cross-lingual Gender Prediction

Language to language feature analysis

45

50

55

60

65

70

EN

EN NL FR PT ES

45

50

55

60

65

NL

45

50

55

60

65

FR

TR

AIN

45

50

55

60

65

PT

45

50

55

60

65

ES

LegendvowelsshapepunctCpunctAlengthfrequencyall

TEST