Using emojis as universal sentence representation for social media data
Alexis Dutot
22/05/2019
Paris NLP S3 Meetup #5
● Introduction
● DeepMoji
● Internal challenges
● Our approach: Unimoji
● Conclusion & perspectives
Introduction
Linkfluence
- Social Media Intelligence company
- Activities: software & market research
- 2 products: Radarly & Search
- 250+ employees across 6 offices
Our day-to-day work

Research:
- Read papers
- Technological watch
- Prototype new features
- Train models
- Science popularization

Production:
- Implement new features to fit into the production pipeline (near real-time inference)
- Build batch computations for AI features not computed in real time
- Enhance the processing pipeline
Our day-to-day work

[Diagram: the tools we use, production environment vs research playground: machine learning & NLP toolkits and programming languages]
Our pipeline

Language detection → NER extraction → Categorization → Opinion mining → Location & user inference

Stats:
● ~ 1,200 documents per second
● > 60 languages
● > 10 platforms (social media & web)
● > 65 models in the pipeline
Our pipeline: opinion mining

- Sentiment analysis: document-level sentiment analysis with 4 classes: positive, negative, neutral and mixed
- Emotion detection: document-level multi-emotion detection with 7 classes: anger, disgust, fear, joy, love, sadness and surprise
Opinion mining

● Initial goal: enhance the sentiment analysis algorithm that was already in the production pipeline
● Challenges:
○ Social media posts are noisy user-generated content: spelling mistakes, grammatical errors, contractions, abbreviations, specific terms, ...
○ Very few annotated corpora are available, each with few examples
○ Most of these corpora are in English and domain-specific

→ Sentiment analysis for social media data is limited by the scarcity of manually annotated data.
How can we overcome this "lack" of manually annotated data?

Use distant supervision methods to make models learn useful text representations (such as emotional content) before modeling these tasks directly:
● Use specific hashtags (#good, #bad, #angry, #fml) to automatically label large volumes of data (Mohammad, 2012)
● Use predefined sets of positive and negative emoticons or emojis for automatic data labelling (Deriu et al., 2016; Tang et al., 2014) → our previous sentiment analysis model
● Pre-train a model to predict emojis given a document, to learn a rich emotional text representation, then fine-tune it on a specific opinion mining task: DeepMoji (Felbo et al., 2017)
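As an illustration of the second approach, here is a minimal sketch of emoticon/emoji-based distant labelling; the marker sets and helper name are illustrative assumptions, not the ones used in our previous production model:

```python
# Illustrative marker sets, not the sets used in production.
POSITIVE_MARKERS = {":)", ":D", "😁", "😍", "👍"}
NEGATIVE_MARKERS = {":(", "😢", "😡", "👎"}

def distant_label(text):
    """Return a noisy sentiment label from predefined marker sets."""
    has_pos = any(m in text for m in POSITIVE_MARKERS)
    has_neg = any(m in text for m in NEGATIVE_MARKERS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # ambiguous or unmarked: drop from the training set

tweets = ["This was soooo FUN !!! 😁", "Worst update ever 😡", "ok."]
labeled = [(t, distant_label(t)) for t in tweets if distant_label(t)]
# [('This was soooo FUN !!! 😁', 'positive'), ('Worst update ever 😡', 'negative')]
```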
DeepMoji
The power of emoji

DeepMoji: leverage the power of emoji to accurately encode the emotional content of texts.

"Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm" (Felbo et al., 2017)

https://deepmoji.mit.edu/
Example: "This was soooo FUN !!! 😁😁" is tokenized as [this, was, soo, fun, !!] and labeled POSITIVE for the target task, or 😁 for pre-training.

→ Build a training set of 1.2B tweets with emojis as noisy labels
→ Pre-train a model to predict an emoji probability distribution given a text
→ Fine-tune this model on a specific opinion mining task (sentiment analysis, emotion detection & sarcasm detection)
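To make the noisy-labelling step concrete, here is a minimal sketch, assuming a small illustrative emoji vocabulary (Felbo et al., 2017, keep the 64 most frequent emojis and create one training example per distinct emoji in a tweet):

```python
import re

# Illustrative subset; the real vocabulary is the 64 most frequent emojis.
EMOJI_VOCAB = ["😁", "😂", "😢", "😡", "❤"]
EMOJI_RE = re.compile("|".join(map(re.escape, EMOJI_VOCAB)))

def to_training_pairs(tweet):
    """Return one (text-without-emojis, emoji-label) pair per distinct emoji."""
    emojis = set(EMOJI_RE.findall(tweet))
    text = EMOJI_RE.sub("", tweet).strip()
    return [(text, e) for e in emojis]

print(to_training_pairs("This was soooo FUN !!! 😁😁"))
# [('This was soooo FUN !!!', '😁')]
```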
The model

● Pre-training: a 2-layer BiLSTM with attention
● Transfer learning: fine-tuning is done with the chain-thaw approach, sequentially fine-tuning one layer at a time
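A minimal Keras sketch of chain-thaw, assuming `model` is a pre-trained network whose last weighted layer is the new task-specific head; the optimizer settings and step counts are illustrative:

```python
# Chain-thaw (Felbo et al., 2017): first train the new head, then each
# pre-trained layer one at a time (first to last), then the whole network.
def chain_thaw(model, x_train, y_train, epochs_per_step=1):
    layers_with_weights = [l for l in model.layers if l.weights]
    head = layers_with_weights[-1]                 # assumed: new task head
    schedule = ([[head]]
                + [[l] for l in layers_with_weights[:-1]]
                + [layers_with_weights])
    for step in schedule:
        for l in layers_with_weights:
            l.trainable = l in step                # thaw only this step
        # Recompile so the new trainable flags take effect.
        model.compile(optimizer="adam", loss="categorical_crossentropy")
        model.fit(x_train, y_train, epochs=epochs_per_step)
```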
Advantages of DeepMoji

● SoTA on 3 opinion mining tasks (before BERT's arrival)
● A really good fit for our use case: opinion mining on social media posts
● Simple, easy-to-read Keras code for running tests and reproducing results
Internal Challenges
Challenges reminder
1. The multilingual problem: perform opinion mining in many (> 60) languages on every social media platform. DeepMoji requires manually annotated data for each target task and each language.

2. The computational problem: handle at least 1,200 documents per second without making hardware costs skyrocket. We assume a Bi-LSTM would not be an option.
Limitations & resources

Research environment:
- Hardware: 4 GTX 1080 Ti GPUs
- Frameworks: Keras + TensorFlow

Production environment:
- CPU-only production instances
- TensorFlow offers an "almost" stable Java API (ONNX and DeepLearning4J were not mature yet)
- The current processing pipeline, built on Apache Storm (JVM), does not handle batching

→ Not ideal for deep learning models
Our idea
[Diagram: three example documents, "Deep Learning is awesome !" (English), "J'adore mon nouvel iPhone" (French: "I love my new iPhone") and "Detesto el fin de la Casa de Papel…" (Spanish: "I hate the ending of La Casa de Papel…"), each fed to a Doc2Emoji model trained on its own language. Each model outputs an emoji probability distribution (e.g. 👍 0.35, ..., 😔 0.002). A single Emoji2Sentiment model, trained on English annotated corpora only, maps each distribution to POSITIVE or NEGATIVE.]
Emojis are universal across languages and are used more and more on social media platforms.
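A hypothetical sketch of how the two stages compose at inference time; the class, model names and `predict` interface are illustrative, not our production code:

```python
class Unimoji:
    def __init__(self, doc2emoji_by_lang, emoji2sentiment):
        self.doc2emoji_by_lang = doc2emoji_by_lang  # {"en": ..., "fr": ...}
        self.emoji2sentiment = emoji2sentiment      # shared, trained on EN

    def predict(self, text, lang):
        # 1) Language-specific model: text -> emoji probability distribution
        emoji_dist = self.doc2emoji_by_lang[lang].predict(text)
        # 2) Shared model: emoji distribution -> sentiment label
        return self.emoji2sentiment.predict(emoji_dist)

# unimoji = Unimoji({"en": en_d2e, "fr": fr_d2e, "es": es_d2e}, e2s)
# unimoji.predict("J'adore mon nouvel iPhone", lang="fr")  # -> POSITIVE
```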
Proof of concept
● Validating the approach: use pre-trained DeepMoji (predicting an emoji probability distribution) + an MLP (predicting sentiment from that distribution)
→ Small loss of accuracy compared to fine-tuned methods (2-5 points): acceptable
● Reproduce DeepMoji pre-training on our own English data
● Issues:
1. 1 epoch takes 12 days (too long)
2. Inference time in production: 50 ms/input (too slow)
Tackling the computational problem

At this point:
● Impossible to use an RNN architecture in production
● Need an alternative...

1. Can we replace the DeepMoji architecture with a computationally cheaper one while preserving a good emotional context representation?
2. Can this emotional context representation using emojis be used to perform multilingual opinion mining tasks?
Our approach: Unimoji
Doc2Emoji
We tried different CNN architectures. The final architecture is a combination of:
● Attentive convolutions (Yin, 2017)
● The 2-layer CNN architecture used by the SwissCheese team, winners of Task 4-A of SemEval 2016 (Deriu et al., 2016)
A simplified sketch of the resulting model follows the diagram below.
[Diagram: the Doc2Emoji architecture we used, built around a light attentive convolution layer, with separate models trained for EN, FR and ES]
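A simplified Keras sketch of such a Doc2Emoji model: a 2-layer CNN with a light attention pooling layer standing in for the attentive convolutions of Yin (2017). Vocabulary size, filter counts and the 64-emoji output are assumptions, not our production settings:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_EMOJIS = 50_000, 128, 80, 64

inp = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)                    # token embeddings
x = layers.Conv1D(200, 3, padding="same", activation="relu")(x)   # CNN layer 1
x = layers.Conv1D(200, 3, padding="same", activation="relu")(x)   # CNN layer 2

# Light attention pooling: score each time step, softmax over time,
# then take the attention-weighted sum of the feature vectors.
scores = layers.Dense(1)(x)                       # (batch, time, 1)
weights = layers.Softmax(axis=1)(scores)          # attention over time
pooled = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

out = layers.Dense(N_EMOJIS, activation="softmax")(pooled)
doc2emoji = keras.Model(inp, out)
doc2emoji.compile(optimizer="adam", loss="categorical_crossentropy")
```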
Statistics:
● Dataset: 512M tweets
● Training: 44 h/epoch (vs 12 days/epoch)
● Production inference: 5 ms/input (vs 50 ms/input)
Our architecture performed almost as well as DeepMoji.
Is this representation accurate enough to solve opinion mining tasks?
[Table: top-1 and top-5 emoji prediction accuracies for the EN, FR and ES Doc2Emoji models]
Emoji2Task
Architecture: a 2-layer neural network

Comparing the quality of the learnt sentence representations: benchmarking against the DeepMoji approach.

1. Can we replace the DeepMoji architecture with a computationally cheaper one while preserving a good emotional context representation?
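A minimal Keras sketch of such an Emoji2Task head, assuming a 64-dimensional emoji probability distribution as input and the 3 sentiment classes used in training below; the hidden size is illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

emoji2sentiment = keras.Sequential([
    keras.Input(shape=(64,)),               # emoji probability distribution
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),  # negative / neutral / positive
])
emoji2sentiment.compile(optimizer="adam",
                        loss="categorical_crossentropy",
                        metrics=["accuracy"])
```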
Emoji2Task
2. Can this emotional context representation using emojis be used to perform multilingual opinion mining tasks?

We trained 3 new Doc2Emoji models: French, German and Simplified Chinese.
Experiments: sentiment analysis & emotion detection.
Multilingual sentiment analysis
Training: SemEval 2016 Task 4-A dataset (3 classes: negative, positive, neutral)
Evaluation: internally annotated data in English, German, French & Chinese
Results (vs our previous algorithm):
● English: ~ +10% accuracy (90%)
● French: ~ +7% accuracy (87%)
● German: ~ +6% accuracy (81%)
● Chinese: ~ -30% accuracy (40%)
The multilingual approach improved results for all languages except Chinese.
→ The emotional context of emojis in Chinese differs from that in English.
Multilingual emotion detection
[Figure: example emoji groupings by emotion: love & sadness, anger & disgust, surprise]
Training: SemEval 2018 Task 1-Ec dataset. We kept only 7 emotions: anger, disgust, fear, joy, love, sadness and surprise (multi-label classification).
Evaluation: internally annotated data in English, German & French
Results:
● English accuracy: 85%
● French accuracy: 80%
● German accuracy: 77%
→ Good enough to validate our approach
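For this multi-label setting, the same Emoji2Task head can be given one sigmoid output per emotion with a binary cross-entropy loss, so several emotions can fire at once; again a sketch under assumed dimensions:

```python
from tensorflow import keras
from tensorflow.keras import layers

emoji2emotion = keras.Sequential([
    keras.Input(shape=(64,)),               # emoji probability distribution
    layers.Dense(64, activation="relu"),
    layers.Dense(7, activation="sigmoid"),  # one independent unit per emotion
])
emoji2emotion.compile(optimizer="adam", loss="binary_crossentropy")
```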
Validating our approach

1. Can we replace the DeepMoji architecture with a computationally cheaper one while preserving a good emotional context representation?
2. Can this emotional context representation using emojis be used to perform multilingual opinion mining tasks?*

* If the emotional context in which emojis are used is not too different from that of the language the Emoji2Task model was trained on.
Conclusion & Perspectives
So far...
● Integrated our Unimoji model for sentiment analysis and emotion detection in 6 languages: French, English, Spanish, Portuguese, German and Italian
● For Simplified Chinese, the Doc2Emoji model was fine-tuned on a Chinese sentiment analysis dataset (improving accuracy by ~20%)
● We plan to add more languages to the model...
Key takeaways
● A 10x faster Doc2Emoji architecture based on CNNs, with a small accuracy loss
● Unimoji is a modular architecture: the Doc2Emoji and Emoji2Task components can be swapped for any model
● 2 opinion mining tasks trained on the same English emoji probability distribution as the emotional representation:
○ Sentiment analysis (improving our inference accuracy)
○ Emotion detection (a new feature!)
● Doc2Emoji can be fine-tuned for any language if a reliable manually annotated dataset is available
● Such a model has limitations: emojis can carry different emotional contexts and different distributions across two languages, ...
What's next?
● Add more languages
● Continue to explore limitations
● Don't focus only on emojis
→ Explore cross-lingual models (LASER, XLM)
● New opinion mining tasks
→ Sarcasm detection, hate detection, optimism/pessimism, ...
Thank you! Questions?

LONDON: 1 Primrose Street, London EC2A 2EX (contact-uk@linkfluence.com)
DÜSSELDORF: Erkrather Straße 234b, 40233 Düsseldorf (kontakt@linkfluence.com)
SHANGHAI: Rm 510-512, 68 Changping Road, near West Suzhou Road, Shanghai (contact-asia@linkfluence.com)
SINGAPORE: Capital Tower #12-01, 168 Robinson Road, 068912 Singapore (contact-asia@linkfluence.com)
PARIS: 5, rue Choron, 75009 Paris (contact@linkfluence.com)
SAN FRANCISCO: 575 Market Street #11, San Francisco CA 94105 (contact@linkfluence.com)