Language Variety Identification using Distributed Representations of Words and Documents

Marc Franco-Salvador, Francisco Rangel, Paolo Rosso, Mariona Taulé, and M. Antònia Martí

[email protected], [email protected], [email protected],{mtaule,amarti}@ub.edu

Introduction

“Author profiling aims to identify the linguistic profile of an author on the basis of his writing style.”

“Language variety identification is an author profiling subtask which aims to detect lexical and semantic variations in order to classify different varieties of the same language.”

Example

The same sentence in varieties of Spanish:

“Estaba haciendo el tonto con mi perro y perdí el móvil” (ES-SP)

“Estaba haciendo boludeces con mi perro y extravié el celular” (ES-AR)

“Estaba haciendo el pendejo con mi perro y extravié el celular” (ES-MX)

Translation:

“I was goofing around with my dog and I lost my mobile” (EN)

Related work

● Zampieri and Gebre (2012) investigated varieties of Portuguese applying different features such as word and character n-grams.

● Sadat et al. (2014) differentiated between six different varieties of Arabic in blogs and forums using character n-grams.

● Maier and Gómez-Rodríguez (2014) employed meta-learning to classify tweets from Argentina, Chile, Colombia, Mexico and Spain.

● Kríž et al. (2015) employed cross-entropy to detect English texts written for non-native English speakers.

------------------------------------------------------------------------------------------

● Fabra-Boluda et al. (2015) NLEL_UPV_Autoritas participation at Discrimination between Similar Languages (DSL) 2015 shared task

● Franco-Salvador et al. (2015) applied distributed representations of words and documents to classify different varieties of European languages.

Related work

Tasks on language variety identification:

– Workshop on Language Technology for Closely Related Languages and Language Variants at EMNLP 2014.

– VarDial Workshop at COLING 2014 - Applying NLP Tools to Similar Languages, Varieties and Dialects.

– LT4VarDial - Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, and the DSL shared task (Zampieri et al., 2014, 2015), at RANLP.

Proposed approach - motivation

The distributed representations of words capture many linguistic regularities (Mikolov et al., 2013b):

vector('Paris') - vector('France') + vector('Italy')

is very close to

vector('Rome')

vector('king') - vector('man') + vector('woman')

is very close to

vector('queen')
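The arithmetic above can be tried out in a minimal sketch. The 4-dimensional vectors below are hand-made toy values (assumed dimensions: capital, France, Italy, Spain), not embeddings learned by any model, and `analogy` is a hypothetical helper name. It forms vector(a) - vector(b) + vector(c) and returns the remaining word whose vector has the highest cosine similarity to the result.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Toy vectors invented for illustration: [capital, France, Italy, Spain]
toy_vectors = {
    "Paris":  [1.0, 1.0, 0.0, 0.0],
    "France": [0.0, 1.0, 0.0, 0.0],
    "Italy":  [0.0, 0.0, 1.0, 0.0],
    "Rome":   [1.0, 0.0, 1.0, 0.0],
    "Spain":  [0.0, 0.0, 0.0, 1.0],
    "Madrid": [1.0, 0.0, 0.0, 1.0],
}

closest = analogy("Paris", "France", "Italy", toy_vectors)
```

With real embeddings the result is only "very close to" the expected word, so a nearest-neighbour search like this one is the usual way to read off the answer.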

Le and Mikolov (2014) employed distributed representations of sentences to classify the polarity of subjective text.

Distributed representation models

● Continuous bag-of-words (CBOW) model (Mikolov et al., 2013b, 2013c).

– It learns to predict a word in a text from its surrounding context (bag-of-words representation).

– It is fast and performs best on syntactic accuracy.

● Continuous skip-gram model (Mikolov et al., 2013b, 2013c).

– It learns to predict a word in a text from a nearby word. Distant words have less impact on the prediction.

– It performs considerably better on semantic accuracy.
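The difference between the two models can be seen in the training examples each one consumes. The sketch below is illustrative only: the function names and the fixed symmetric window are assumptions, and it leaves out details of the original tools such as the down-weighting of distant words.

```python
def cbow_examples(tokens, window):
    """CBOW: the bag of surrounding words is used to predict the centre word."""
    examples = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, target))
    return examples

def skipgram_pairs(tokens, window):
    """Skip-gram: each word is used to predict each nearby word separately."""
    pairs = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["perdí", "el", "móvil"]
cbow = cbow_examples(tokens, window=1)
pairs = skipgram_pairs(tokens, window=1)
```

CBOW yields one example per word position, while skip-gram yields one pair per (word, neighbour) combination, which is one reason CBOW trains faster.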

Skip-gram model

The objective of the model is to maximize the average of the log probability:
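Assuming the formulation of Mikolov et al. (2013c), with a training sequence of words $w_1, \dots, w_T$ and a context size $c$, this objective is usually written as:

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
```

Each term scores how well the word at position $t$ predicts one of its neighbours at offset $j$.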

Conditional probability should be estimated using the softmax function [Barto, 1998]:
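Again assuming the notation of Mikolov et al. (2013c), where $v_w$ and $v'_w$ are the input and output vectors of word $w$, $w_I$ is the input word, $w_O$ the output word and $W$ the vocabulary size, the softmax estimate is:

```latex
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The denominator sums over the whole vocabulary, which makes the exact computation expensive for large $W$.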

Reminder:

Alternatives to softmax function

Negative sampling (Mikolov et al. 2013b)

It simplifies the Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012) keeping the vector quality.

“the task is to distinguish the target word from a noise distribution using logistic regression, where there are k negative samples for each word.” (Mikolov et al. 2013b)

Notation: w_O is the target (output) word and P_n(w) is the noise distribution.
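As given in Mikolov et al. (2013c), the negative sampling objective for one (input, target) pair is log σ(v'_{w_O}ᵀ v_{w_I}) plus, for each of the k words w_i drawn from P_n(w), log σ(-v'_{w_i}ᵀ v_{w_I}). A minimal sketch of that quantity follows. The vectors are invented toy values, the negatives are passed in as an already-drawn list rather than sampled from P_n(w), and the function name is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_objective(v_in, v_out_target, v_out_negatives):
    """log sigma(v'_target . v_in) + sum over negatives of log sigma(-v'_neg . v_in)."""
    positive = math.log(sigmoid(dot(v_out_target, v_in)))
    negative = sum(math.log(sigmoid(-dot(v_neg, v_in))) for v_neg in v_out_negatives)
    return positive + negative

# Toy values (assumptions, not learned vectors); k = 2 negative samples.
v_in = [1.0, 0.5]
negatives = [[-0.5, 0.2], [0.1, -1.0]]
aligned_target = [1.0, 0.5]      # points the same way as v_in
opposed_target = [-1.0, -0.5]    # points the opposite way

good = negative_sampling_objective(v_in, aligned_target, negatives)
bad = negative_sampling_objective(v_in, opposed_target, negatives)
```

The objective is higher when the target vector agrees with the input vector and the sampled noise words do not, and it needs only k + 1 dot products instead of a sum over the whole vocabulary.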

Generating distributed vectors of sentences and documents

Two alternatives:

– Average the vectors of the words of a text (“Skip-gram” in the evaluation), e.g.: (vector('I') + vector('love') + vector('the') + vector('capital') + vector('of') + vector('Bulgaria')) / 6

– Use directly the Sentence Vectors variation (“SenVec” in the evaluation)
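The first alternative, averaging, can be sketched in a few lines. The 2-dimensional word vectors are invented toy values and `average_vectors` is a hypothetical helper. Skipping words that have no vector is an assumption, since the slides do not say how out-of-vocabulary words were handled.

```python
def average_vectors(tokens, word_vectors):
    """Return the element-wise mean of the vectors of the known tokens."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# Toy word vectors (assumed values, not learned embeddings).
toy_word_vectors = {
    "I": [1.0, 0.0],
    "love": [0.0, 1.0],
    "the": [1.0, 1.0],
    "capital": [2.0, 0.0],
    "of": [0.0, 2.0],
    "Bulgaria": [2.0, 2.0],
}

doc_vector = average_vectors(
    ["I", "love", "the", "capital", "of", "Bulgaria"], toy_word_vectors
)
```

The resulting vector has the same dimensionality as the word vectors, whatever the length of the text.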

* We classified all the vectors using logistic regression
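To illustrate the classification step, here is a minimal logistic regression trained by batch gradient descent on toy document vectors. It is a sketch under several assumptions: the slides do not say which implementation or settings were used, the task there has five classes while this toy is binary, and the data, learning rate and epoch count are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(weights, bias, features):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if z >= 0 else 0

# Toy 2-dimensional "document vectors" for two invented classes (0 and 1).
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y)
```

In practice the inputs would be the averaged Skip-gram vectors or the SenVec vectors, with one class per language variety.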

Proposed alternatives

Author profiling models:

– Emotion-labeled Graphs (Rangel and Rosso, 2015) (EmoGraphs)

– Information Gain Word-Patterns (Martí et al., 2015) (IG-WP)

EmoGraph of “He estado tomando cursos en línea sobre temas valiosos que disfruto estudiando y que podrían ayudarme a hablar en público” (“I have been taking online courses about valuable subjects that I enjoy studying and might help me to speak in public”)

Information Gain Word-Patterns

Information Gain Word-Patterns (IG-WP) (Martí et al., 2015) obtains lexico-syntactic patterns aiming to represent the content of documents.

The method is based on the pattern-construction hypothesis:

– “those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions”.

Information Gain Word-Patterns

Pattern structure:

Examples:

In the experiments we selected as features the set of 1,000 words that obtained the patterns with the highest information gain.
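For reference, information gain in its standard form is the reduction in class entropy obtained by knowing whether a feature is present: IG = H(Y) - H(Y | X). The sketch below computes it for binary features. The documents and labels are invented toy data, and the slides do not specify how IG-WP computes or aggregates the score, so this is the textbook definition only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """H(labels) minus the entropy remaining after splitting on the feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [l for f, l in zip(feature_present, labels) if f == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: four documents, two varieties (invented for illustration).
labels = ["ES-AR", "ES-AR", "ES-SP", "ES-SP"]
has_celular = [True, True, False, False]   # separates the two classes
has_perro = [True, False, True, False]     # tells us nothing about the class

ig_celular = information_gain(has_celular, labels)
ig_perro = information_gain(has_perro, labels)
```

Ranking features by this score and keeping the top 1,000 is one straightforward way to perform the selection described above.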

Dataset

We introduce the HispaBlogs¹ dataset, a new collection of Spanish blogs from five different countries: Argentina, Chile, Mexico, Peru and Spain.

There are 450 training and 200 test blogs for each language variety.

Each user blog is represented by a set of user posts, with 10 posts per user/blog.

¹ https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs

Evaluation

We measured the accuracy of classification comparing our approaches with several models and baselines.

Author profiling models:

– EmoGraphs

– IG-WP

Baselines:

– Bag-of-words

– Character 4-grams

– TF-IDF 2-grams

– TF-IDF graphs
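Two of these baselines are simple enough to sketch directly. The code below builds bag-of-words and character 4-gram count features with the standard library. Lowercasing and whitespace tokenisation are assumptions, since the slides do not describe the preprocessing used.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def char_ngrams(text, n=4):
    """Count every overlapping character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

bow = bag_of_words("el perro y el móvil")
grams = char_ngrams("perro")
```

Character n-grams capture spelling and morphology below the word level, which is why they are a common baseline for discriminating similar languages.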

Experimental results

Test set confusion matrix (in %) of Skip-gram model
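A minimal sketch of how accuracy and a confusion matrix in % can be computed from gold and predicted labels. Row normalisation (each row sums to 100) is an assumption, and the labels below are invented toy data, not the results of the experiments.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def confusion_matrix_percent(gold, predicted, labels):
    """Row-normalised confusion matrix: matrix[gold][predicted] in percent."""
    counts = Counter(zip(gold, predicted))
    matrix = {}
    for g in labels:
        row_total = sum(counts[(g, p)] for p in labels)
        matrix[g] = {
            p: (100.0 * counts[(g, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix

# Toy labels invented for illustration only.
gold = ["AR", "AR", "ES", "ES"]
predicted = ["AR", "ES", "ES", "ES"]

acc = accuracy(gold, predicted)
matrix = confusion_matrix_percent(gold, predicted, ["AR", "ES"])
```

Reading along a row shows which other varieties a given variety is most often mistaken for.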

Conclusions

● Distributed representations make it possible to obtain competitive results in the task of language variety identification in social media.

● The use of averages of vectors of words (Skip-gram) or vectors of documents (SenVec) provided similar results without significant differences.

Future work

● We will investigate how to apply distributed representations to other author profiling tasks such as age and gender identification.

● We will continue working to improve the current model in order to generate better distributed representations for discriminating between similar languages.

Thank you for your time :)

Questions / feedback?

[email protected]

This work has been published as:

Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015). Language variety identification using distributed representations of words and documents. In Proceedings of the 6th International Conference of CLEF on Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2015), LNCS vol. 9283. Springer-Verlag.

References

Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press.

Fabra-Boluda, R., Rangel, F., Rosso, P. (2015). NLEL_UPV_Autoritas participation at Discrimination between Similar Languages (DSL) 2015 shared task. In: Proc. of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.

Franco-Salvador, M., Rosso, P., & Rangel, F. (2015). Distributed Representations of Words and Documents for Discriminating Similar Languages. In: Proc. of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.

Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1), 307-361.

Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Maier, W., & Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets. LT4CloseLang 2014, 25.

Martí, M.A., Bertran, M., Taulé, M., Salamó, M. (2015). Distributional approach based on syntactic dependencies for discovering constructions. In Computational Linguistics (under review)

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

Morin, F., & Bengio, Y. (2005, January). Hierarchical probabilistic neural network language model. In Proceedings of the international workshop on artificial intelligence and statistics (pp. 246-252).

Rangel, F., & Rosso, P. (2015). On the impact of emotions on author profiling. Information Processing & Management.

Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and Dialects in Social Media. SocialNLP 2014, 22.

Zampieri, M., & Gebre, B. G. (2012). Automatic identification of language varieties: The case of Portuguese. In KONVENS2012-The 11th Conference on Natural Language Processing (pp. 233-237). Österreichischen Gesellschaft für Artificial Intelligence (ÖGAI).

Zampieri, M., Tan, L., Ljubešić, N., & Tiedemann, J. (2014). A report on the DSL shared task 2014. COLING 2014, 58.

Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., & Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
