    Multimodal Learning for Hateful Memes Detection

    Yi Zhou*, Zhenhao Chen*
    1 IBM, Singapore    2 The University of Maryland, USA

    [email protected], [email protected]

    Abstract

    Memes are used for spreading ideas through social networks. Although most memes are created for humor, some memes become hateful under the combination of pictures and text. Automatically detecting hateful memes can help reduce their harmful social impact. Unlike conventional multimodal tasks, where the visual and textual information is semantically aligned, the challenge of hateful memes detection lies in its unique multimodal information. The image and text in memes are weakly aligned or even irrelevant, which requires the model to understand the content and perform reasoning over multiple modalities. In this paper, we focus on multimodal hateful memes detection and propose a novel method that incorporates the image captioning process into the memes detection process. We conduct extensive experiments on multimodal meme datasets and illustrate the effectiveness of our approach. Our model achieves promising results on the Hateful Memes Detection Challenge^1.

    1. Introduction

    Automatically filtering hateful memes is crucial for social networks, since memes spread quickly on social media and hateful memes can harm people. Given an image, the multimodal meme detection task is expected to find clues from the sentences in the meme image and associate them with the relevant image regions to reach the final detection. Due to the rich and complicated mixture of visual and textual knowledge in memes, it is hard to identify the implicit knowledge behind multimodal memes efficiently. Driven by the recent advances in neural networks, there have been some works that try to detect offensive or misleading visual and textual content [1]. However, current methods are still far from mature because of the huge gap between the meme image and its text content.

    Hateful memes detection can be treated as a vision-language (VL) task.

    * Equal contribution
    ^1 Code is available at https://github.com/czh4/hateful-memes


    Figure 1: Illustration of our proposed multimodal memes detection approach. It consists of an image captioner, an object detector, a triplet-relation network, and a classifier. Our method considers three different kinds of knowledge extracted from each meme: the image caption, the OCR sentences, and the visual features. The proposed triplet-relation network models the triplet relationships among caption, objects, and OCR sentences, adopting the cross-attention model to learn more discriminative features from cross-modal embeddings.

    Vision and language problems have gained a lot of attention in recent years [2, 3], with significant progress on important problems such as Visual Question Answering (VQA) [4, 5] and Image Captioning [6, 7, 8, 9, 10]. Specifically, the multimodal memes detection task shares the same spirit as VQA, which predicts the answer based on the input image and question. Recent VQA has been boosted by advances in image understanding and natural language processing (NLP) [11, 12]. Most recent VQA methods follow a multimodal fusion framework that encodes the image and sentence and then fuses them for answer prediction. Usually, the given image is encoded with a Convolutional Neural Network (CNN) based encoder, and the sentence is encoded with a Recurrent Neural Network (RNN) based encoder. With the advancement of the Transformer [13], many recent works incorporate multi-head self-attention mechanisms into their methods [14, 15] and achieve a considerable jump in performance. The core of the transformer lies in the self-attention mechanism, which transforms input features into contextualized representations with multi-head attention, making it an excellent framework for sharing information between different modalities.



    While a lot of multimodal learning works focus on the fusion between high-level visual and language features [16, 17, 18, 19], it is hard to apply multimodal fusion methods to memes detection directly, since memes detection is more focused on the reasoning between visual and textual modalities. Modeling the relationships between multiple modalities and exploring the implicit meaning behind them is not an easy task. For example, Fig. 1 shows a cat lying down on a couch, but the sentences in the image are not correlated with the picture. The misaligned semantic information between visual and textual features brings significant challenges for memes detection. Even for a human, such an example is hard to identify. Besides, merely extracting the visual and textual information from an image is insufficient for memes detection.

    Considering the big gap between the visual content and the sentence in a meme image, in this paper we propose a novel approach that generates relevant image descriptions for memes detection. Specifically, we adopt a triplet-relation network to model the relationships between three different inputs. To summarize, our contributions in this paper are twofold. First, we design a Triplet-Relation Network (TRN) that enhances the multimodal relationship modeling between visual regions and sentences. Second, we conduct extensive experiments on the meme detection dataset, which requires highly complex reasoning between image content and sentences. Experimental evaluation of our approach shows significant improvements in memes detection over the baselines. Our best model also ranks high on the hateful memes detection challenge leaderboard.

    2. Related Works

    Hate Speech Detection. Hate speech is a broadly studied problem in network science [20] and NLP [21, 22]. Detecting hate information in language has been studied for a long time. One of the main focuses of hate speech, with very diverse targets, has appeared in social networks [23, 24]. The common method for hate speech detection is to compute a sentence embedding and then feed the embedding into a binary classifier to predict hate speech [25, 26]. Several language-based hate speech detection datasets have been released [27, 28, 29]. However, the task of hate speech detection has been shown to be challenging and subject to undesired bias [30, 31], notably around the definition of hate speech [21], which makes it hard for a machine to understand and detect it. Another direction of hate speech detection targets multimodal speech detection. In [32], they collected a multimodal dataset based on Instagram images and associated comments. In their dataset, the target labels are created by asking Crowdflower workers questions. Similarly, in [33], they also collected a similar dataset from Instagram. In [34], they found that augmenting the sentence with visual features can greatly increase the performance of hate speech prediction. Recently, some larger datasets have been proposed. Gomez et al. [35] introduced a large dataset for multimodal hate speech detection based on Twitter data. The recent dataset introduced in [36] is a larger dataset, which is explicitly designed for multimodal hate speech detection.

    Visual Question Answering. Recently, many VQA datasets have been proposed [37, 38]. Similar to multimodal meme detection, the target of VQA is to answer a question given an image. Most current approaches focus on learning the joint embedding between the images and questions [39, 40, 17]. More specifically, the image and question are passed independently through the image encoder and sentence encoder. The extracted image and question features can then be fused to predict the answer. Usually, those questions are related to the given images, so the challenge of VQA lies in how to reason over the image based on the question. Attention is one of the crucial improvements to VQA [41, 42]. In [42], they first introduce soft and hard attention mechanisms, which model the interactions between image regions and words according to their semantic meaning. In [43], they propose a stacked attention model which increasingly concentrates on the most relevant image regions. In [44], they present a co-attention method, which calculates attention weights on both image regions and words. Inspired by the huge success of the transformer [45] in neural machine translation (NMT), some works have been proposed to use the transformer to learn cross-modality encoder representations [14, 15, 46].

    Despite the considerable success of vision-language pretraining in VQA, memes detection is still hard to solve due to its special characteristics. Applying VQA methods to memes detection directly faces several issues. First, the sentences in a meme are not like questions, which are mostly based on the image content. Second, it is hard to predict the result from the visual modality or the textual modality directly; the model needs to understand the implicit relationships between image content and sentences. Our work adopts image captioning as the bridge which connects the visual information and textual information, and models their relationships with a novel triplet-relation network.

    3. Method

    Fig. 2 shows our proposed framework. The whole system consists of two training paths: image captioning and multimodal learning. The first path (top part) is the image captioning model that maps the image to sentences. The second training path (bottom part) detects the label from the generated caption, the OCR sentences, and the detected objects. In the following, we describe our framework in detail.


    Figure 2: Overview of our proposed hateful memes detection framework. It consists of three components: an image captioner, an object detector, and a triplet-relation network. The top branch shows the training of the image captioning model on image-caption pairs. The bottom part is meme detection; it takes the image caption, OCR sentences, and object detection results as inputs and uses the joint representation of the three inputs to predict the answer.

    3.1. Input Embeddings

    3.1.1 Sentence Embedding

    The motivation for generating an image caption for each meme is that image captioning provides a good solution for image content understanding. The goal of image captioning training is to generate a sentence $S^c$ that describes the content of the image $I$. In particular, we first extract the image feature $f_I$ with an image encoder $P(f_I|I)$, and then decode the visual feature into a sentence with a sentence decoder $P(S|f_I)$. More formally, the image captioning model $P(S|I)$ can be formulated as $P(f_I|I)P(S|f_I)$. During inference, the decoding process can be formulated as:

    $$\hat{S}^c = \arg\max_{S} P(S \mid f_I)\, P(f_I \mid I) \quad (1)$$

    where $\hat{S}^c$ is the predicted image description. The most common way to train the model in Eq. 1 is to minimize the negative log-probability of the target caption words.
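    To make the factorization in Eq. 1 concrete, the following minimal sketch performs greedy (argmax) decoding with a generic encoder/decoder pair. The module and token names (image_encoder, sentence_decoder, bos_id, eos_id) are placeholders rather than the exact interfaces used in our implementation.

```python
import torch

@torch.no_grad()
def greedy_decode(image, image_encoder, sentence_decoder, bos_id, eos_id, max_len=20):
    """Approximate S_hat = argmax_S P(S | f_I) P(f_I | I) with greedy decoding."""
    f_I = image_encoder(image)                      # visual feature f_I from P(f_I | I)
    words = [bos_id]
    for _ in range(max_len):
        tokens = torch.tensor(words).unsqueeze(0)   # (1, t) decoded prefix so far
        logits = sentence_decoder(tokens, f_I)      # (1, t, vocab_size)
        next_id = logits[0, -1].argmax().item()     # most probable next word
        if next_id == eos_id:
            break
        words.append(next_id)
    return words[1:]                                # predicted caption word ids
```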

    As shown in Fig. 1, our model has two kinds of textual inputs: the image caption $S^c$ and the OCR sentence $S^o$. The predicted caption $\hat{S}^c$ is first split into words $\{\hat{w}^c_1, \ldots, \hat{w}^c_{N_c}\}$ by the WordPiece tokenizer [47], where $N_c$ is the number of words. Following recent vision-language pretraining models [46, 14], the textual feature is composed of a word embedding, a segment embedding, and a position embedding:

    $$\hat{w}^c_i = \mathrm{LN}\big(f_{\mathrm{WordEmb}}(\hat{w}^c_i) + f_{\mathrm{SegEmb}}(\hat{w}^c_i) + f_{\mathrm{PosEmb}}(i)\big) \quad (2)$$

    where $\hat{w}^c_i \in \mathbb{R}^{d_w}$ is the word-level feature, $\mathrm{LN}$ denotes layer normalization [48], and $f_{\mathrm{WordEmb}}(\cdot)$, $f_{\mathrm{SegEmb}}(\cdot)$, and $f_{\mathrm{PosEmb}}(\cdot)$ are the embedding functions.

    Each meme also contains textual information, which we can extract with the help of an off-the-shelf OCR system. Formally, we extract $S^o = \{w^o_1, \ldots, w^o_{N_o}\}$ from the given meme image, where $N_o$ is the number of words. We follow the same operations as for the image caption and calculate the feature of each token as:

    $$w^o_i = \mathrm{LN}\big(f_{\mathrm{WordEmb}}(w^o_i) + f_{\mathrm{SegEmb}}(w^o_i) + f_{\mathrm{PosEmb}}(i)\big) \quad (3)$$

    where $w^o_i \in \mathbb{R}^{d_w}$ is the word-level feature for each OCR token. The three embedding functions are shared with Eq. 2.

    Following the method in [46], we insert the special token [SEP] between the two sentences and after the last one. In our design, we concatenate the embeddings of the OCR sentence and the image caption as $\{w^o_{1:N_o}, w_{[SEP]}, \hat{w}^c_{1:N_c}, w_{[SEP]}\}$, where $w_{[SEP]}$ is the word-level embedding of the special token.
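    A minimal sketch of the token embedding in Eqs. 2-3 and the [SEP]-joined input, assuming BERT-style embedding tables; the dimensions and the [SEP] id below are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Word + segment + position embeddings followed by LayerNorm (Eqs. 2-3)."""
    def __init__(self, vocab_size=30522, d_w=768, max_len=512, n_segments=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_w)   # f_WordEmb
        self.seg_emb = nn.Embedding(n_segments, d_w)    # f_SegEmb
        self.pos_emb = nn.Embedding(max_len, d_w)       # f_PosEmb
        self.ln = nn.LayerNorm(d_w)                     # LN

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.ln(self.word_emb(token_ids)
                       + self.seg_emb(segment_ids)
                       + self.pos_emb(positions))

# Illustrative usage: OCR tokens (segment 0) and caption tokens (segment 1),
# concatenated as {w^o, [SEP], w^c, [SEP]} with a hypothetical [SEP] id of 102.
SEP = 102
ocr_ids, cap_ids = [2054, 2079, 2017], [1037, 4937, 2003]       # dummy WordPiece ids
token_ids = torch.tensor([ocr_ids + [SEP] + cap_ids + [SEP]])
segment_ids = torch.tensor([[0] * (len(ocr_ids) + 1) + [1] * (len(cap_ids) + 1)])
features = TextEmbedding()(token_ids, segment_ids)              # (1, 8, 768)
```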

    3.1.2 Image Embedding

    Instead of computing a global representation for each image, we take the visual features of detected objects as the representation of the image. Specifically, we extract and keep a fixed number of semantic region proposals from the pretrained Faster R-CNN [49]^2. Formally, an image $I$ consists of $N_v$ objects, where each object $o_i$ is represented by its region-of-interest (RoI) feature $v_i \in \mathbb{R}^{d_o}$ and its positional feature $p^o_i \in \mathbb{R}^4$ (normalized top-left and bottom-right coordinates). Each region embedding is calculated as follows:

    $$v^o_i = \mathrm{LN}\big(f_{\mathrm{VisualEmb}}(v_i) + f_{\mathrm{VisualPos}}(p^o_i)\big) \quad (4)$$

    ^2 https://github.com/airsplay/py-bottom-up-attention

    where $v^o_i \in \mathbb{R}^{d_v}$ is the position-aware feature for each proposal, and $f_{\mathrm{VisualEmb}}(\cdot)$ and $f_{\mathrm{VisualPos}}(\cdot)$ are two embedding layers.
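    The region embedding in Eq. 4 can be sketched as follows; treating $f_{\mathrm{VisualEmb}}$ and $f_{\mathrm{VisualPos}}$ as linear projections, with $d_o = 2048$ and $d_v = 768$, is an assumption in line with common VL-BERT-style implementations rather than our exact layers.

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Position-aware object embedding: LN(f_VisualEmb(v_i) + f_VisualPos(p_i)) (Eq. 4)."""
    def __init__(self, d_o=2048, d_v=768):
        super().__init__()
        self.visual_emb = nn.Linear(d_o, d_v)   # f_VisualEmb: RoI feature -> d_v
        self.visual_pos = nn.Linear(4, d_v)     # f_VisualPos: normalized box -> d_v
        self.ln = nn.LayerNorm(d_v)

    def forward(self, roi_feats, boxes):
        # roi_feats: (N_v, d_o) Faster R-CNN RoI features
        # boxes:     (N_v, 4) normalized (x1, y1, x2, y2) coordinates
        return self.ln(self.visual_emb(roi_feats) + self.visual_pos(boxes))

regions = RegionEmbedding()(torch.randn(36, 2048), torch.rand(36, 4))   # (36, 768)
```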

    3.2. Triplet-Relation Network

    The target of the triplet-relation network is to model the cross-modality relationships between the image features ($v^o_{1:N_v}$) and the two textual features ($\hat{w}^c_{1:N_c}$ and $w^o_{1:N_o}$). Motivated by the success of the self-attention mechanism [13], we adopt the transformer network as the core module of our TRN.

    Each transformer block consists of three main components: Query ($Q$), Keys ($K$), and Values ($V$). Specifically, let $H^l = \{h_1, \ldots, h_N\}$ be the encoded features at the $l$-th layer. $H^l$ is first linearly transformed into $Q^l$, $K^l$, and $V^l$ with learnable parameters. The output $H^{l+1}$ is calculated with a softmax function that produces a weighted average over the input values. For each transformer layer, we calculate the output of each head as follows:

    $$H^{l+1}_{\mathrm{Self\text{-}Att}} = \mathrm{Softmax}\left(\frac{Q^l (K^l)^{\top}}{\sqrt{d_k}}\right) \cdot V^l \quad (5)$$

    where $d_k$ is the dimension of the Keys and $H^{l+1}_{\mathrm{Self\text{-}Att}}$ is the output representation of each head.
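    Eq. 5 is the standard scaled dot-product attention. The single-head sketch below is self-attention when the query and context features coincide, and becomes cross-attention when the keys and values come from another modality; the head sizes are illustrative.

```python
import math
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One head of Eq. 5: Softmax(Q K^T / sqrt(d_k)) V."""
    def __init__(self, d_model=768, d_k=64):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k)
        self.w_k = nn.Linear(d_model, d_k)
        self.w_v = nn.Linear(d_model, d_k)

    def forward(self, h_query, h_context):
        # Self-attention: h_context is h_query; cross-attention: another modality.
        q, k, v = self.w_q(h_query), self.w_k(h_context), self.w_v(h_context)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return scores.softmax(dim=-1) @ v

h = torch.randn(1, 20, 768)          # 20 contextualized token/region features
out = AttentionHead()(h, h)          # (1, 20, 64) self-attention output
```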

    Note that the input $H^0$ to the TRN is the combination of the two textual features and the visual features. In this paper, we explore two variants of TRN: one-stream [46] and two-stream [14]. One-stream means that we model the visual and textual features together in a single stream. Two-stream means that we use two separate streams for vision and language processing, which interact through co-attentional transformer layers. For each variant, we stack $L_{\mathrm{TRN}}$ of these attention layers, which serve to discover relationships from one modality to another. For meme detection, we take the final representation $h_{[CLS]}$ of the [CLS] token as the joint representation.

    3.3. Learning

    Training the image captioner. For the image captioner, the target is to generate image descriptions close to the ground-truth captions. Following recent image captioning methods, we first train the image encoder and the sentence decoder by minimizing the cross-entropy (XE) loss:

    $$\mathcal{L}_{\mathrm{XE}} = -\sum_i \log p(w^c_i \mid w^c_{0:i-1}, I) \quad (6)$$

    where $p(w^c_i \mid w^c_{0:i-1}, I)$ is the output probability of the $i$-th word in the sentence given by the sentence decoder.
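    Under teacher forcing, Eq. 6 reduces to the usual cross-entropy over next-word predictions; a minimal sketch (with a hypothetical decoder output of per-step vocabulary logits) is:

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, target_ids, pad_id=0):
    """Eq. 6: -sum_i log p(w_i | w_{0:i-1}, I), averaged over non-padding words.

    logits:     (B, T, vocab_size) decoder outputs under teacher forcing
    target_ids: (B, T) ground-truth caption word ids
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1),
                           ignore_index=pad_id)

# Dummy example: a batch of 2 captions of length 5 over a 1,000-word vocabulary.
loss = caption_xe_loss(torch.randn(2, 5, 1000), torch.randint(1, 1000, (2, 5)))
```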

    After training the image captioner with Eq. 6, we further apply a reinforcement learning (RL) loss that takes the CIDEr [50] score as a reward and optimizes the image captioner by minimizing the negative expected reward:

    $$\mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{\tilde{S}^c \sim P_{\theta}}\big[r(\tilde{S}^c)\big] \quad (7)$$

    where $\tilde{S}^c = \{\tilde{w}^c_{1:N_c}\}$ is the sampled caption, $r(\tilde{S}^c)$ is the reward function calculated by the CIDEr metric, and $\theta$ denotes the parameters of the image captioner. Following the SCST method described in [51], the gradient of Eq. 7 can be approximated as:

    $$\nabla_{\theta}\mathcal{L}_{\mathrm{RL}}(\theta) \approx -\big(r(\tilde{S}^c) - r(\hat{S}^c)\big) \cdot \nabla_{\theta} \log p(\tilde{S}^c) \quad (8)$$

    where $\hat{S}^c$ is the image caption predicted by greedy decoding. The relative reward design in Eq. 8 tends to increase the probability of sampled captions $\tilde{S}^c$ that score higher than $\hat{S}^c$.
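    The self-critical update in Eq. 8 can be implemented as a policy-gradient loss whose baseline is the reward of the greedy caption. The sketch below assumes the CIDEr rewards and the per-word log-probabilities of the sampled caption have already been computed.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Eq. 8: gradient ~ -(r(S_sample) - r(S_greedy)) * grad log p(S_sample).

    sample_logprobs: (B, T) log-probabilities of the sampled caption words
    sample_reward:   (B,) CIDEr score of each sampled caption
    greedy_reward:   (B,) CIDEr score of each greedy caption (the baseline)
    """
    advantage = (sample_reward - greedy_reward).detach().unsqueeze(1)   # (B, 1)
    # Minimizing this loss raises log p(S_sample) when the sample beats the baseline.
    return -(advantage * sample_logprobs).sum(dim=1).mean()

loss = scst_loss(torch.rand(4, 12).log(),              # dummy per-word log-probs
                 torch.tensor([0.9, 1.2, 0.4, 0.7]),   # r(sampled captions)
                 torch.tensor([0.8, 0.8, 0.8, 0.8]))   # r(greedy captions)
```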

    Training the classifier. For meme detection training, we feed the joint representation $h_{[CLS]}$ of the language and visual content to a fully-connected (FC) layer, followed by a softmax layer, to get the prediction probability $\hat{y} = \mathrm{softmax}(f_{\mathrm{FC}}(h_{[CLS]}))$. A binary cross-entropy (BCE) loss is used as the final loss function for meme detection:

    $$\mathcal{L}_{\mathrm{BCE}} = -\mathbb{E}_{I \sim D}\big[y \log(\hat{y}) + (1-y)\log(1-\hat{y})\big] \quad (9)$$

    where $I$ is sampled from the training set $D$, and $y$ and $\hat{y}$ represent the ground-truth label and the predicted result for the meme, respectively.
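    A minimal sketch of the classification head over $h_{[CLS]}$ and the loss in Eq. 9: following the description above, it uses a two-way softmax and takes the "hateful" probability as $\hat{y}$; the hidden size is an assumption.

```python
import torch
import torch.nn as nn

class MemeClassifier(nn.Module):
    """FC + softmax head over the joint [CLS] representation."""
    def __init__(self, d_model=768):
        super().__init__()
        self.fc = nn.Linear(d_model, 2)            # two classes: non-hateful / hateful

    def forward(self, h_cls):
        probs = self.fc(h_cls).softmax(dim=-1)     # (B, 2)
        return probs[:, 1]                         # y_hat = P(hateful)

bce = nn.BCELoss()                                 # Eq. 9 on predicted probabilities
h_cls = torch.randn(8, 768)                        # joint [CLS] features of 8 memes
labels = torch.randint(0, 2, (8,)).float()         # ground-truth labels y
loss = bce(MemeClassifier()(h_cls), labels)
```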

    4. Experiments

    4.1. Dataset and Preprocessing

    In our experiments, we use two datasets: MSCOCO [52] and the Hateful Memes Detection Challenge dataset provided by Facebook [36]. We describe the details of each dataset below.

    MSCOCO. MSCOCO is an image captioning dataset which has been widely used in the image captioning task [53, 54]. It contains 123,000 images, where each image has five reference captions. During training, we follow the setting of [55] by using 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. The best image captioner is selected based on the highest CIDEr score.

    The Hateful Memes Challenge Dataset. This dataset is collected by Facebook AI as the challenge set. The dataset includes 10,000 memes, each sample containing an image and the OCR sentence extracted from the image. For the purpose of this challenge, the memes have two types of labels, non-hateful and hateful. The dev and test sets consist of 5% and 10% of the data, respectively, and are fully balanced, while the remaining training set has 64% non-hateful memes and 36% hateful memes.

    Data Augmentation. We augment the sentences in the hateful memes dataset with a back-translation strategy. Specifically, we enrich the OCR sentences through two pretrained back-translators^3: English-German-English and English-Russian-English. We also apply different beam sizes (2, 5, and 10) during decoding to obtain diverse variants of each sentence.

    ^3 https://github.com/pytorch/fairseq
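    The back-translation step can be sketched with the pretrained WMT'19 translation models that fairseq exposes through torch.hub; whether these are the exact checkpoints we used is an implementation detail, so treat the model names and beam settings below as illustrative.

```python
import torch

# Pretrained WMT'19 translators from fairseq's torch.hub entry points
# (English-German-English round trip; an English-Russian-English pair works the same way).
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                       tokenizer='moses', bpe='fastbpe')
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model',
                       tokenizer='moses', bpe='fastbpe')

def back_translate(sentence, beams=(2, 5, 10)):
    """Return paraphrases of an OCR sentence via English -> German -> English."""
    variants = set()
    for beam in beams:                                  # different beam sizes yield
        german = en2de.translate(sentence, beam=beam)   # more diverse paraphrases
        variants.add(de2en.translate(german, beam=beam))
    return sorted(variants)

print(back_translate("i don't see you sitting in it"))
```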

    Table 1: Experimental results and comparison on Hateful Memes Detection dev and test splits (bold numbers are the best results). 'OCR Text (Back-Translation)' means we augment the OCR sentences in the training set through trained back-translators. 'Image Caption' means using captions for meme detection. AUROC in percentage (%) is reported.

    Model | Basic: Object Feature + OCR Text | Add.: Object Labels | Add.: OCR Text (Back-Translation) | Add.: Image Caption | AUROC (Dev)
    V+L   | ✓ | ✗ | ✗ | ✗ | 70.47
    V+L   | ✓ | ✗ | ✗ | ✓ | 72.97
    V+L   | ✓ | ✗ | ✓ | ✗ | 72.43
    V+L   | ✓ | ✗ | ✓ | ✓ | 73.93
    V+L   | ✓ | ✓ | ✗ | ✗ | 70.96
    V+L   | ✓ | ✓ | ✗ | ✓ | 72.66
    V+L   | ✓ | ✓ | ✓ | ✗ | 71.57
    V+L   | ✓ | ✓ | ✓ | ✓ | 72.15
    V&L   | ✓ | ✗ | ✗ | ✗ | 66.94
    V&L   | ✓ | ✗ | ✗ | ✓ | 71.11
    V&L   | ✓ | ✗ | ✓ | ✗ | 63.47
    V&L   | ✓ | ✗ | ✓ | ✓ | 67.94
    V&L   | ✓ | ✓ | ✗ | ✗ | 70.22
    V&L   | ✓ | ✓ | ✗ | ✓ | 70.46
    V&L   | ✓ | ✓ | ✓ | ✗ | 66.68
    V&L   | ✓ | ✓ | ✓ | ✓ | 69.85


    4.2. Implementation Details

    We present the hyperparameters related to our baselines and discuss those related to model training. For visual feature preprocessing, we extract the RoI features using the pretrained Faster R-CNN object detector [49]. We keep a fixed number of region proposals for each image and obtain the corresponding RoI feature, bounding box, and predicted labels. During image captioning training, we use a mini-batch size of 100. The initial learning rate is 1e-4 and is gradually decayed to 4e-5. We use Adam [56] as the optimizer. We use the pretrained ResNet101 [57] as the image encoder and randomly initialize the weights of the sentence decoder. During memes detection training, we set the dimension of the hidden state of the transformer to 768 and initialize the transformer in our model with the BERT models pretrained in MMF^4. We use the Adam [56] optimizer with an initial learning rate of 5e-5 and train for 10,000 steps. We report the performance of our models with the Area Under the Curve of the Receiver Operating Characteristic (AUROC). It measures how well the memes predictor discriminates between the classes as its decision threshold is varied. During online submission, we submit predicted probabilities for the test samples. The online (Phase 1 and Phase 2) rankings are decided based on the best submissions by AUROC.

    ^4 https://github.com/facebookresearch/mmf
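    AUROC can be computed directly from the submitted probabilities, e.g. with scikit-learn; the labels and scores below are made up and only illustrate the metric.

```python
from sklearn.metrics import roc_auc_score

# Ground-truth labels (1 = hateful) and predicted hateful probabilities for six memes.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.65]

print(f"AUROC: {100 * roc_auc_score(y_true, y_prob):.2f}%")
```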

    Table 2: Performance comparison with existing methods on the online test server.

    Inputs                 | Model                 | AUROC (Test)
    Image                  | Human [36]            | 82.65
    Image                  | Image-Grid [36]       | 52.63
    Image                  | Image-Region [36]     | 55.92
    Text                   | Text BERT [36]        | 65.08
    Image + Text           | Late Fusion           | 64.75
    Image + Text           | Concat BERT [36]      | 65.79
    Image + Text           | MMBT-Grid [36]        | 67.92
    Image + Text           | MMBT-Region [36]      | 70.73
    Image + Text           | ViLBERT [36]          | 70.45
    Image + Text           | Visual BERT [36]      | 71.33
    Image + Text           | ViLBERT CC [36]       | 70.03
    Image + Text           | Visual BERT COCO [36] | 71.41
    Image + Text + Caption | Ours (V+L)            | 73.42
    Image + Text + Caption | Ours (V&L)            | 74.80

    4.3. Results and Discussion

    We conduct the ablation study in Table 1. The baseline models can be divided into two categories: V+L and V&L. V+L represents the one-stream model: it takes the concatenated visual and textual features as input and produces contextually embedded features with a single BERT, so the parameters are shared between visual and textual encoding. V&L represents the two-stream model: it first adopts a separate stream for each modality and then models the cross-relationship with a cross-attention based transformer. In our experiments, we initialize the V+L models with pretrained Visual BERT [46] and the V&L models with pretrained ViLBERT [14]. The parameters of all models are finetuned on the meme detection task.


    Figure 3: Qualitative examples of hateful memes detection. We show the generated caption and object labels for each sample.

    Effectiveness of Image Captioning. In Table 1, we can see that V&L models with the image caption outperform other V&L models by a large margin on the dev set. These results support our motivation that image captioning helps hateful memes detection. The generated image descriptions provide more meaningful clues for detecting 'hateful' memes, since the captions describe the image's content and provide semantic information for cross-modality relationship reasoning. The performance boost brought by image captioning further indicates that, due to the rich and societal content in memes, only considering object detection and OCR recognition is insufficient. A practical solution should also explore additional information related to the meme.

    Effectiveness of Language Augmentation. We also verify the effectiveness of data augmentation in Table 1. We can see that back-translation brings some improvement to V&L models. However, for V+L models, back-translation does not show improvement. We think the reason for the ineffectiveness of back-translation on V+L models is that the one-stream models handle the multimodal embeddings with a shared BERT model. For hateful memes detection, the OCR sentences are not semantically aligned with the image content, which weakens the effectiveness of sentence augmentation. For V&L, introducing augmented sentences can improve the intra-modality modeling of language, since the model contains independent branches for visual and textual modeling.

    Effectiveness of Visual Labels. We also consider the predicted object labels as additional input features. Specifically, we treat the object labels as linguistic words and concatenate them with the OCR sentence and the image caption. We can see that the object labels improve both the V+L and V&L models. This is reasonable, since object labels can be seen as the 'anchor' between RoI features and textual features (OCR text and caption). A similar finding can also be found in [58].

    Comparisons with Existing Methods. Table 2 shows comparisons of our method with existing methods on the Hateful Memes challenge. We can see that our method achieves better performance, which demonstrates the advantage of our triplet-relation network. We also submit the results of our model to the online testing server, and our best model with ensembling (named "naoki") achieves the 13th position among 276 teams in the Phase 2 competition^5.

    Visualization Results. Fig. 3 shows some generated captions and predicted results. From these samples, we can see that our model can understand the implicit meaning behind the image and sentences with the help of the predicted image captions. For example, in the last image, although there is no explicit relationship between the image and the OCR text, our method can still predict the correct result by connecting the different modalities through the image caption.

    ^5 https://www.drivendata.org/competitions/70/hateful-memes-phase-2/leaderboard/


    5. Conclusion

    In this paper, we propose a novel multimodal learning method for hateful memes detection. Our proposed model exploits the combination of image captions and memes to enhance cross-modality relationship modeling for hateful memes detection. We envision that such a triplet-relation network can be extended to other frameworks that require additional features from multimodal signals. Our model achieves competitive results in the hateful memes detection challenge.

    References

    [1] Paula Fortuna and Sérgio Nunes, "A survey on automatic detection of hate speech in text," ACM Comput. Surv., vol. 51, no. 4, pp. 85:1–85:30, July 2018.

    [2] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al., "Recent advances in convolutional neural networks," Pattern Recognition, 2018.

    [3] Jiuxiang Gu, Jason Kuen, Shafiq Joty, Jianfei Cai, Vlad Morariu, Handong Zhao, and Tong Sun, "Self-supervised relationship probing," in NeurIPS, 2020.

    [4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh, "VQA: Visual question answering," in ICCV, 2015.

    [5] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering," in CVPR, 2017.

    [6] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.

    [7] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang, "Unpaired image captioning via scene graph alignments," in ICCV, 2019.

    [8] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang, "Unpaired image captioning by language pivoting," in ECCV, 2018.

    [9] Jiahui Gao, Yi Zhou, Philip L. H. Yu, and Jiuxiang Gu, "Unsupervised cross-lingual image captioning," arXiv preprint arXiv:2010.01288, 2020.

    [10] Jiuxiang Gu, Gang Wang, Jianfei Cai, and Tsuhan Chen, "An empirical study of language CNN for image captioning," in ICCV, 2017.

    [11] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in CVPR, 2018.

    [12] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li, "Dynamic fusion with intra- and inter-modality attention flow for visual question answering," in CVPR, 2019.

    [13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NeurIPS, 2017.

    [14] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in NeurIPS, 2019.

    [15] Hao Tan and Mohit Bansal, "LXMERT: Learning cross-modality encoder representations from transformers," in EMNLP, 2019.

    [16] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang, "Hadamard product for low-rank bilinear pooling," in ICLR, 2017.

    [17] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang, "Bilinear attention networks," in NeurIPS, 2018.

    [18] Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord, "Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection," in AAAI, 2019.

    [19] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao, "Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering," TNNLS, 2018.

    [20] M. H. Ribeiro, P. H. Calais, Y. A. Santos, V. A. Almeida, and W. Meira Jr, "Characterizing and detecting hateful users on Twitter," in ICWSM, 2018.

    [21] Z. Waseem, T. Davidson, D. Warmsley, and I. Weber, "Understanding abuse: A typology of abusive language detection subtasks," 2017.

    [22] Anna Schmidt and Michael Wiegand, "A survey on hate speech detection using natural language processing," in Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 2017.

    [23] Mainack Mondal, Leandro Araújo Silva, and Fabrício Benevenuto, "A measurement study of hate speech in social media," in HT, 2017.

    [24] Shervin Malmasi and Marcos Zampieri, "Detecting hate speech in social media," CoRR, vol. abs/1712.06427, 2017.

    [25] Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, and Marcos Zampieri, "Benchmarking aggression identification in social media," in TRAC, 2018.

    [26] Shervin Malmasi and Marcos Zampieri, "Challenges in discriminating profanity from hate speech," JETAI, 2018.

    [27] Zeerak Waseem, "Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter," in Proceedings of the First Workshop on NLP and Computational Social Science, 2016, pp. 138–142.

    [28] Zeerak Waseem and Dirk Hovy, "Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter," in NAACL Student Research Workshop, 2016.

    [29] Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis, "Large scale crowdsourcing and characterization of Twitter abusive behavior," in ICWSM, 2018.

    [30] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith, "The risk of racial bias in hate speech detection," in ACL, 2019.

    [31] Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber, "Racial bias in hate speech and abusive language detection datasets," in ACL, 2019.

    [32] Homa Hosseinmardi, Sabrina Arredondo Mattson, Rahat Ibn Rafiq, Richard Han, Qin Lv, and Shivakant Mishra, "Detection of cyberbullying incidents on the Instagram social network," arXiv preprint arXiv:1503.03909, 2015.

    [33] Haoti Zhong, Hao Li, Anna Squicciarini, Sarah Rajtmajer, Christopher Griffin, David Miller, and Cornelia Caragea, "Content-driven detection of cyberbullying on the Instagram social network," in IJCAI, 2016.

    [34] Fan Yang, Xiaochang Peng, Gargi Ghosh, Reshef Shilon, Hao Ma, Eider Moore, and Goran Predovic, "Exploring deep multimodal fusion of text and photo for hate speech classification," in Proceedings of the Third Workshop on Abusive Language Online, 2019.

    [35] Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimosthenis Karatzas, "Exploring hate speech detection in multimodal publications," in WACV, 2020.

    [36] Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine, "The hateful memes challenge: Detecting hate speech in multimodal memes," arXiv preprint arXiv:2005.04790, 2020.

    [37] Mengye Ren, Ryan Kiros, and Richard Zemel, "Exploring models and data for image question answering," in NeurIPS, 2015.

    [38] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," IJCV, 2017.

    [39] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach, "Multimodal compact bilinear pooling for visual question answering and visual grounding," in EMNLP, 2016.

    [40] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao, "Multi-modal factorized bilinear pooling with co-attention learning for visual question answering," in ICCV, 2017.

    [41] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei, "Visual7W: Grounded question answering in images," in CVPR, 2016.

    [42] Huijuan Xu and Kate Saenko, "Ask, attend and answer: Exploring question-guided spatial attention for visual question answering," in ECCV, 2016.

    [43] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola, "Stacked attention networks for image question answering," in CVPR, 2016.

    [44] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh, "Hierarchical question-image co-attention for visual question answering," in NeurIPS, 2016.

    [45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NeurIPS, 2017.

    [46] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang, "VisualBERT: A simple and performant baseline for vision and language," arXiv preprint arXiv:1908.03557, 2019.

    [47] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.

    [48] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.

    [49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NeurIPS, 2015.

    [50] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh, "CIDEr: Consensus-based image description evaluation," in CVPR, 2015.

    [51] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel, "Self-critical sequence training for image captioning," in CVPR, 2017.

    [52] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.

    [53] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.

    [54] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling, "Scene graph generation with external knowledge and image reconstruction," in CVPR, 2019.

    [55] Andrej Karpathy and Li Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in CVPR, 2015.

    [56] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.

    [57] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016.

    [58] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al., "Oscar: Object-semantics aligned pre-training for vision-language tasks," in ECCV, 2020.