
How Images Inspire Poems: Generating Classical Chinese Poetry from Images with Memory Networks

Linli Xu†, Liang Jiang†, Chuan Qin†, Zhe Wang‡, Dongfang Du†

†Anhui Province Key Laboratory of Big Data Analysis and Application,
School of Computer Science and Technology, University of Science and Technology of China
‡AI Department, Ant Financial Services Group

[email protected], [email protected], [email protected], wz143459@antfin.com, [email protected]

Abstract

With the recent advances in neural models and natural language processing, automatic generation of classical Chinese poetry has drawn significant attention due to its artistic and cultural value. Previous works mainly focus on generating poetry given keywords or other text information, while visual inspirations for poetry have rarely been explored. Generating poetry from images is much more challenging than generating poetry from text, since images contain very rich visual information which cannot be described completely using several keywords, and a good poem should convey the image accurately. In this paper, we propose a memory-based neural model which exploits images to generate poems. Specifically, an Encoder-Decoder model with a topic memory network is proposed to generate classical Chinese poetry from images. To the best of our knowledge, this is the first work attempting to generate classical Chinese poetry from images with neural networks. A comprehensive experimental investigation with both human evaluation and quantitative analysis demonstrates that the proposed model can generate poems which convey images accurately.

Introduction

Classical Chinese poetry is a priceless and important heritage of Chinese culture. Over a history of more than 2000 years in China, millions of classical Chinese poems have been written to praise heroic characters, beautiful scenery, love, etc. Classical Chinese poetry still fascinates us today with its concise structure, rhythmic beauty and rich emotions. There are distinct genres of classical Chinese poetry, including Tang poetry, Song iambics and Qing poetry, each of which has different structures and rules. Among them, the quatrain is the most popular, consisting of four lines with five or seven characters in each line. The lines of a quatrain follow specific rules including a regulated rhythmical pattern, where the last characters of the first (optional), second and fourth lines must belong to the same rhythm category. In addition, each Chinese character is associated with one tone which is either Ping (the level tone) or Ze (the downward tone), and quatrains are required to follow a pre-defined tonal pattern which regulates the tones of characters at various positions (Wang 2002). An example of

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Table 1: An example of a 7-char quatrain. The tone of each character is shown at the end of each line, where “P” represents Ping (the level tone), “Z” represents Ze (the downward tone), and “*” indicates the tone can be either. Rhyming characters are underlined.

a quatrain written by a very famous classical Chinese poet, Li Bai, is shown in Table 1.

The stringent rhyme and tone regulations of classical Chinese poetry pose a major challenge for generating Chinese poems automatically. In recent years, various attempts have been made at automatic generation of classical Chinese poems from text information such as keywords. Among them, rule-based approaches (Tosa, Obara, and Minoh 2008; Wu, Tosa, and Nakatsu 2009; Netzer et al. 2009; Oliveira 2009; 2012), genetic algorithms (Manurung 2004; Zhou, You, and Ding 2010; Manurung, Ritchie, and Thompson 2012) and statistical machine translation methods (Jiang and Zhou 2008; He, Zhou, and Jiang 2012) have been developed. More recently, with the significant advances in deep neural networks, a number of poetry generation algorithms based on neural networks have been proposed under the paradigm of sequence-to-sequence learning, where poems are generated line by line and each line is generated by taking the previous lines as input (Zhang and Lapata 2014; Yi, Li, and Sun 2016; Wang et al. 2016a; 2016b; Zhang et al. 2017). However, restrictions exist in previous works, including topic drift and semantic inconsistency, which are caused by only considering the writing intent of a user in the first line. In addition, a limited number of keywords in a proper order is usually required (Wang et al. 2016b), which limits the flexibility of

arXiv:1803.02994v1 [cs.CL] 8 Mar 2018


the process. On the other hand, visual inspirations are more natural and intuitive than text for poem writing. People can write poems to express their aesthetic appreciation or sentimental reflections at the sight of absorbing scenery, such as magnificent mountains and fast rivers. As a consequence, there usually exists a correspondence between a poem and an image, explicitly or implicitly, where a poem either describes a scene or leaves visual impressions on the readers. For example, the poem shown in Table 1 represents an image with a magnificent waterfall falling down from a very high mountain. Because of this inherent correlation, generating classical Chinese poems from images becomes an interesting research topic.

As far as we know, visual inspirations from images have rarely been explored in automatic generation of classical Chinese poetry. The task of generating poetry from images is much more challenging than generating poetry from keywords in general, since very rich visual information is contained in an image, which requires a sophisticated representation as a bridge to convey the essential visual features and semantic concepts of the image to the poem generator. In addition, to generate a poem that is consistent with the image and coherent in itself, the topic flow should be carefully manipulated in the generated sequence of characters.

In this paper, we propose an Encoder-Decoder framework with topic memory to generate classical Chinese poetry from images, which integrates direct visual information with semantic topic information in the form of keywords extracted from images. We consider poetry generation as a sequence-to-sequence learning problem, where a poem is generated line by line, and each line is generated by taking all previous lines into account. Moreover, we introduce a memory network to support an unlimited number of keywords extracted from images and to dynamically determine a latent topic for each character in the generated poem. This addresses the issues of topic drift and semantic inconsistency, while removing the restriction of requiring a limited number of keywords in a proper order in previous works. Meanwhile, to leverage the visual information of images not contained in the semantic concepts, we integrate direct visual information into the proposed Encoder-Decoder framework to ensure the correspondence between images and poems. The experimental results demonstrate that our model has great advantages in exploiting the visual and topic information in images, and it can generate high-quality poetry which conveys images accurately and consistently.

The main contributions of this paper are:

1. We consider a new research topic of generating classical Chinese poetry from images, which is useful not only for entertainment or education, but also as an exploration integrating natural language processing and computer vision.

2. We employ memory networks to tackle the problems of topic drift and semantic inconsistency, while removing the restriction of requiring a limited number of keywords in a proper order in previous works.

3. We integrate keywords summarizing semantic topics and direct visual information to capture the information conveyed in images when generating poems.

Related Work

Poetry generation has been a challenging task over the past decades. A variety of approaches have been proposed, most of which focus on generating poetry from text. Among them, phrase-search based approaches are proposed in (Tosa, Obara, and Minoh 2008; Wu, Tosa, and Nakatsu 2009) for Japanese poetry generation. Semantic and grammar templates are used in (Oliveira 2009), while genetic algorithms are employed in (Manurung 2004; Manurung, Ritchie, and Thompson 2012; Zhou, You, and Ding 2010). In the work of (Jiang and Zhou 2008; Zhou, You, and Ding 2010; He, Zhou, and Jiang 2012), poetry generation is treated as a statistical machine translation problem, where the next line is generated by translating the previous line. Another approach in (Yan et al. 2013) generates poetry by summarizing users’ queries.

More recently, deep neural networks have been applied to automatic poetry generation. An RNN-based framework is proposed in (Zhang and Lapata 2014) where each poem line is generated by taking the previously generated lines as input. In the work of (Yi, Li, and Sun 2016), the attention mechanism is introduced into poetry generation, where an attention-based Encoder-Decoder model is proposed to generate poem lines sequentially. A different genre of classical Chinese poetry is generated in (Wang et al. 2016a), which is the first work to generate Chinese Song iambics, where each line is of variable length. However, the neural methods introduced above share the limitation that the writing intent of a user is taken into consideration only in the first line, which causes topic drift in the following lines. To address that, a modified attention-based Encoder-Decoder model is presented in (Wang et al. 2016b) which assigns a sub-topic keyword to each poem line. As a consequence, the number of keywords is fixed to the number of poem lines, and the keywords have to be sorted in a proper order manually, which limits the flexibility of the approach.

In this work, we address the issues in previous works by introducing a memory network with unlimited keyword capacity which can dynamically determine a topic for each character. Memory networks are a class of neural networks proposed in (Weston, Chopra, and Bordes 2014) to augment recurrent neural networks with a memory component to model long-term memory. In the work of (Sukhbaatar et al. 2015), memory networks are extended with an end-to-end training mechanism, which makes the model more generally applicable. Zhang et al. (2017) introduce memory networks into automatic poetry generation for the first time to leverage the prior knowledge in existing poems while writing new poems. In the testing stage, by using a memory network, some existing poems are represented as external memory which provides prior knowledge for new poems. In contrast to (Zhang et al. 2017), we employ memory networks in a completely different way, where we represent keywords as memory entities, such that we can handle keywords without restrictions and dynamically determine the topic for each character.


Figure 1: An illustration of the pipeline to generate a poem from an image using the Memory-Based Image to Poem Generator.

Memory-Based Image to Poem Generator

Images convey rich information, from visual signals to semantic topics, that can inspire good poems. To build a natural correspondence between an image and a poem, we propose a framework which integrates semantic keywords summarizing the important topics of an image, along with direct visual information, to generate a poem. Specifically, given an image, keywords containing semantic topics are extracted as the outline when generating the poem, while visual information is exploited to embody the information not conveyed by the keywords.

Framework

Given an image I, we generate a poem P which consists of L poem lines {l1, l2, ..., lL}. In the framework illustrated in Figure 1, we first extract a set of keywords K = {k1, k2, ..., kN} and a set of visual feature vectors V = {v1, v2, ..., vB} from I with a keyword extractor and a convolutional neural network (CNN) based visual feature extractor, respectively. The poem is then generated line by line. Specifically, when generating the i-th line li, the previously generated lines, denoted by l1:i−1 (the concatenation of l1 to li−1), the keywords K and the visual feature vectors V are jointly exploited in the Memory-based Image to Poem Generator (MIPG) model, which is the key component of the framework.

As illustrated in Figure 2, the MIPG model is essentially an Encoder-Decoder model consisting of two modules: an Image-based Encoder (I-Enc) and a Memory-based Decoder (M-Dec). In I-Enc, visual features V are extracted from the image I with a CNN, while a bi-directional Gated Recurrent Unit (Bi-GRU) model (Cho et al. 2014) is employed to build semantic features H from the previously generated lines l1:i−1. In M-Dec, the i-th poem line, denoted by a sequence of characters {y1, ..., yG}, is generated based on the keywords K, the visual feature vectors V, as well as the semantic features H from the previous lines. To generate each character yt ∈ li, we first transform V and H into dynamic representations for yt with the attention mechanism, followed by a Topic Memory Network designed to dynamically determine a topic for yt. Finally, yt is predicted using a topic-bias probability which enhances the consistency between the image and the poem.
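At a high level, the line-by-line generation loop can be sketched as follows. This is a minimal Python sketch with stubbed-out components; `extract_keywords`, `extract_visual_features` and `mipg_generate_line` are hypothetical placeholders standing in for the Clarifai keyword extractor, the CNN, and the MIPG model, not the authors' implementation:

```python
def extract_keywords(image):
    # Placeholder for the keyword extractor (the paper uses Clarifai's
    # general-v1.3 model); returns a list of semantic concepts.
    return ["mountain", "waterfall", "sky"]

def extract_visual_features(image):
    # Placeholder for the CNN visual feature extractor; returns B local
    # feature vectors, one per image region.
    return [[0.0] * 512 for _ in range(196)]

def mipg_generate_line(previous_lines, keywords, visual_features):
    # Placeholder for the MIPG model: generates the next line conditioned
    # on all previously generated lines, the keywords K and features V.
    return "line %d" % (len(previous_lines) + 1)

def generate_poem(image, num_lines=4):
    K = extract_keywords(image)          # semantic topics
    V = extract_visual_features(image)   # local visual features
    lines = []
    for _ in range(num_lines):
        # Each line is generated from the concatenation of all
        # previous lines l_{1:i-1}, plus K and V.
        lines.append(mipg_generate_line(lines, K, V))
    return lines
```

The key point is that K and V are extracted once per image, while every line is conditioned on all lines generated so far.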

Image-based Encoder (I-Enc)

In I-Enc, as illustrated in the lower half of Figure 2, we first encode the image I into B local visual feature vectors V = {v1, v2, ..., vB}, each of which is a Dv-dimensional representation corresponding to a different part of the image. We use a CNN model as the visual feature extractor. Specifically, V is produced by fetching the output of a certain convolutional layer of the CNN feature extractor which takes I as input:

V = CNN(I).

Meanwhile, we use a Bi-GRU to encode the preceding lines of the generated poem l1:i−1, given as a sequence of character embeddings {x1, x2, ..., xC}, into the corresponding hidden vectors H = [h1, h2, ..., hC], where C denotes the length of l1:i−1 and hj is the concatenation of the forward hidden vector $\overrightarrow{h}_j$ and backward hidden vector $\overleftarrow{h}_j$ at the j-th step in the Bi-GRU. That is,

$$\overrightarrow{h}_j = \mathrm{GRU}(\overrightarrow{h}_{j-1}, x_j), \qquad \overleftarrow{h}_j = \mathrm{GRU}(\overleftarrow{h}_{j+1}, x_j), \qquad h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j].$$

The visual features V and semantic features H extracted in I-Enc are then exploited in M-Dec to dynamically determine which parts of the image and what content in the preceding lines it should focus on when generating each character.
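The bidirectional encoding can be illustrated with a minimal NumPy sketch of a standard GRU cell run in both directions (randomly initialized weights and illustrative dimensions; this is a generic Bi-GRU, not the trained encoder):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, p):
    """One standard GRU update with update gate z and reset gate r."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde

def bi_gru_encode(X, params_fwd, params_bwd, d):
    """Encode character embeddings X into H = [h_1, ..., h_C], where each
    h_j concatenates the forward and backward hidden states at step j."""
    C = len(X)
    fwd, bwd = [None] * C, [None] * C
    h = np.zeros(d)
    for j in range(C):                 # left-to-right pass
        h = gru_step(h, X[j], params_fwd)
        fwd[j] = h
    h = np.zeros(d)
    for j in reversed(range(C)):       # right-to-left pass
        h = gru_step(h, X[j], params_bwd)
        bwd[j] = h
    return [np.concatenate([fwd[j], bwd[j]]) for j in range(C)]

def random_gru_params(d, dx, rng):
    # W* project the input (d x dx); U* project the state (d x d).
    return {k: rng.uniform(-0.08, 0.08, (d, dx if k[0] == "W" else d))
            for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
```

Each h_j therefore has dimension 2d, carrying context from both directions of the preceding lines.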

Memory-based Decoder (M-Dec)

In M-Dec, as illustrated in the upper half of Figure 2, each line li = {y1, ..., yG} is generated character by character. Specifically, at the t-th step, we use another GRU which maintains an internal state st to predict yt. st is updated based on st−1, yt−1, H and V recurrently, which can be formulated as

$$s_t = f(s_{t-1}, y_{t-1}, h_t, v_t), \qquad h_t = \mathrm{att}_H(H), \qquad v_t = \mathrm{att}_V(V),$$

where f denotes the function that updates the internal state of the GRU, and attH and attV denote attention functions which transform H and V into dynamic representations ht and vt respectively, indicating which parts of the image and what


Figure 2: An illustration of the Memory-based Image to Poem Generator (MIPG).

content in the preceding lines the model should focus on when generating the next character. More formally,

$$h_t = \mathrm{att}_H(H) = \sum_{j=1}^{C} \alpha_{tj} h_j, \quad h_j \in H,$$

where αtj is the weight of the j-th character computed by the attention model:

$$\alpha_{tj} = \frac{\exp(a_{tj})}{\sum_{n=1}^{C} \exp(a_{tn})}, \qquad a_{tn} = u_a^{T} \tanh(W_a s_{t-1} + U_a h_n),$$

in which atj is the attention score on hj at the t-th step, and ua, Wa and Ua are parameters to learn.
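This additive attention step can be sketched in NumPy as follows (randomly initialized parameters; an illustrative sketch of the mechanism, not the trained model):

```python
import numpy as np

def additive_attention(s_prev, H, W, U, u):
    """Additive attention: scores a_{tn} = u^T tanh(W s_{t-1} + U h_n),
    weights alpha via softmax, context h_t = sum_j alpha_{tj} h_j."""
    scores = np.array([u @ np.tanh(W @ s_prev + U @ h_n) for h_n in H])
    scores -= scores.max()                       # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = sum(a * h_n for a, h_n in zip(alpha, H))
    return context, alpha
```

The same function yields the visual context vt when called with the visual parameters (Wb, Ub, ub) and the feature set V instead of H.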

Similarly, vt is computed using attV taking V as input:

$$v_t = \mathrm{att}_V(V) = \sum_{j=1}^{B} \beta_{tj} v_j, \quad v_j \in V,$$

$$\beta_{tj} = \frac{\exp(b_{tj})}{\sum_{n=1}^{B} \exp(b_{tn})}, \qquad b_{tn} = u_b^{T} \tanh(W_b s_{t-1} + U_b v_n).$$

Instead of predicting the t-th character yt by using st directly, we introduce a Topic Memory Network to dynamically determine a proper topic for yt. It takes st as input and outputs a topic-aware state vector ot which contains not only the information in the image and the preceding lines, but also the latent topic for generating yt. A multi-layer perceptron is then used to predict yt from ot.

Topic Memory Network. Considering the rich information conveyed by images, it is essentially impossible to fully describe an image with only a few keywords. In the meantime, the problem of topic drift would weaken the consistency between an image and its poem. To address these issues, as illustrated in Figure 3, we use a Topic Memory Network, in which each memory entity is a keyword extracted from the image, to dynamically determine a latent topic for each character by taking all keywords extracted from the image into consideration.

To achieve that, we use the general-v1.3 model provided by Clarifai¹ to extract a set of keywords K = {k1, ..., kN}, and encode each keyword kj in K into two memory vectors: an input memory vector qj, which is used to calculate the importance of kj for yt, and an output memory vector mj, which contains the semantic information of kj.

Specifically, for each keyword kj ∈ K with Cj characters, we encode kj into a semantic vector. To take the order of the Cj characters into account, we use a Bi-GRU to encode kj into a sequence of forward hidden state vectors $[\overrightarrow{q}_1, ..., \overrightarrow{q}_{C_j}]$ and a sequence of backward hidden state vectors $[\overleftarrow{q}_1, ..., \overleftarrow{q}_{C_j}]$. Then the input memory vector qj is computed by concatenating the last forward state and the first backward state, that is, $q_j = [\overrightarrow{q}_{C_j}; \overleftarrow{q}_1]$.

The output memory representation mj of each keyword kj is computed as the mean of the embedding vectors of all characters in kj,

$$m_j = \frac{1}{C_j} \sum_{n=1}^{C_j} e_{jn},$$

where ejn represents the word embedding of the n-th character in kj, learned during training.

With the hidden state vector st at the t-th step, we compute the importance zj of each keyword kj ∈ K based on the similarity between st and the input memory representation qj of the keyword. This can be formulated as follows:

$$z_j = \mathrm{softmax}(s_t^{T} q_j), \quad 1 \le j \le N.$$

¹ https://clarifai.com

Figure 3: An illustration of the Topic Memory Network.

Given the weights of the keywords [z1, ..., zN], a latent topic vector dt for yt is calculated as the weighted sum of the output memory representations of the keywords [m1, ..., mN]:

$$d_t = \sum_{j=1}^{N} z_j m_j.$$

Based on dt and st, we compute a topic-aware state vector ot = dt + st to integrate the latent topics, visual features and semantic information from the previously generated characters when predicting yt.
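One read of the topic memory (keyword weights z_j, latent topic d_t, topic-aware state o_t = d_t + s_t) can be sketched in NumPy as follows (illustrative dimensions and random inputs; a sketch of the mechanism, not the trained network):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topic_memory_read(s_t, Q, M):
    """One read of the Topic Memory Network.
    Q: input memory vectors q_j (one per keyword), used for addressing.
    M: output memory vectors m_j, carrying keyword semantics.
    Returns the topic-aware state o_t = d_t + s_t and the weights z."""
    z = softmax(np.array([s_t @ q_j for q_j in Q]))   # keyword weights z_j
    d_t = sum(z_j * m_j for z_j, m_j in zip(z, M))    # latent topic vector
    return d_t + s_t, z
```

Because z is a softmax over all N inner products, the network can attend over any number of keywords, which is what lifts the fixed-keyword-count restriction of earlier approaches.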

Finally, the t-th character yt is predicted based on ot, vt, ht and yt−1. Moreover, to enhance the topic consistency between poems and images, we encourage the decoder to generate characters that appear in the keywords extracted from the images by adding a bias probability to the characters in K. Specifically, we define a generic vocabulary EG which contains all possible characters, and a topic vocabulary ET which contains all characters in K, satisfying ET ⊆ EG. To predict yt, we calculate the topic character probability pT(yt) in addition to the generic character probability pG(yt) by

$$p_G(y_t = w) = g_G(o_t, v_t, h_t), \quad w \in E_G,$$

$$p_T(y_t = w) = \begin{cases} g_T(o_t, v_t, h_t), & w \in E_T, \\ 0, & w \in E_G \setminus E_T, \end{cases}$$

where gT and gG represent functions corresponding to multilayer perceptrons followed by softmax, used to compute pT and pG respectively. pT and pG are summed to compute the character probability p, and the character with the maximum probability is picked as the next character:

$$p(y_t = w) = \lambda\, p_T(y_t = w) + p_G(y_t = w), \qquad y_t = \operatorname*{argmax}_{w \in E_G} p(y_t = w),$$

where λ is a hyperparameter to balance the generic probability pG and the topic bias probability pT.
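The mixing of the two distributions can be sketched as follows, with the two MLP outputs replaced by raw score vectors for illustration (`logits_G`, `logits_T` and `topic_ids` are hypothetical stand-ins for gG, gT and the index set of E_T):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_char(logits_G, logits_T, topic_ids, lam=0.5):
    """Combine the generic distribution p_G over E_G with the topic-biased
    distribution p_T, which is non-zero only on E_T (characters occurring
    in the extracted keywords), and pick the argmax of p = lam*p_T + p_G."""
    p_G = softmax(logits_G)
    masked = np.full_like(logits_T, -np.inf)
    masked[topic_ids] = logits_T[topic_ids]
    p_T = softmax(masked)                # exactly zero outside E_T
    p = lam * p_T + p_G                  # unnormalized; argmax is unaffected
    return int(np.argmax(p)), p
```

Note that p is not renormalized: since the argmax is all that is needed to pick the next character, the unnormalized sum suffices.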

Experiments

Dataset

In this paper, we are interested in the task of generating quatrains given images. To investigate the performance of our proposed framework on this task, a dataset of image-poem pairs needs to be constructed where images and poems are matched. For the convenience of data preparation, we focus on generating quatrains with 4 lines and 7 characters in each line. The framework proposed in this paper can be easily generalized to generate other types of poetry.

To construct the dataset of image-poem pairs, we collect 68,715 images from the Internet and use the poem dataset provided in (Zhang and Lapata 2014), which contains 65,559 7-character quatrains. Given the large numbers of images and poems, it is impractical to match them manually. Instead, we exploit the key concepts of the images and poems and match them automatically. Specifically, for each image, we obtain several keywords (e.g., water, tree) with the general-v1.3 model provided by Clarifai; similarly, we extract key concepts from each line of the poems. Images and poem lines with common concepts can then be matched. In this way, a dataset with 2,311,359 samples is constructed, each of which consists of an image, the preceding poem lines and the next poem line. We randomly select 50,000 samples for validation, 1,000 for testing and the rest for training.

Training Details

We use the 6,000 most frequently used characters as the vocabulary EG. The number of recurrent hidden units is set to 512 for both the encoder and the decoder. The dimensions of the input and output memory in the topic memory network are also both set to 512. The hyperparameter λ to balance pT and pG is set to 0.5, tuned on the validation set over λ = 0.0, 0.1, ..., 1.0. All parameters are randomly initialized from a uniform distribution on [-0.08, 0.08]. The model is trained with the AdaDelta algorithm (Zeiler 2012) with the batch size set to 128, and the final model is selected according to the cross-entropy loss on the validation set. During training, we invert each line to be generated following (Yi, Li, and Sun 2016) to make it easier for the model to generate poems obeying rhythm rules. For the visual feature extractor, we choose a pre-trained VGG-19 (Simonyan and Zisserman 2014) model and use the output of the conv5_4 layer, which consists of 196 vectors of 512 dimensions, as the local visual features of an image. For the majority of images, 10-20 keywords are extracted.
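For quick reference, the hyperparameters above can be collected into a single configuration (values taken directly from the text; a plain summary, not the authors' code):

```python
# Training configuration reported in the paper, gathered for reference.
MIPG_CONFIG = {
    "vocab_size": 6000,            # most frequent characters in E_G
    "hidden_units": 512,           # encoder and decoder GRU size
    "memory_dim": 512,             # input/output topic memory vectors
    "lambda": 0.5,                 # topic-bias weight, tuned on validation
    "init_range": (-0.08, 0.08),   # uniform parameter initialization
    "optimizer": "AdaDelta",
    "batch_size": 128,
    "visual_features": {           # VGG-19 conv5_4 output
        "num_vectors": 196,
        "dim": 512,
    },
}
```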

Evaluation Metrics

For the general task of text generation in natural language processing, there exist various evaluation metrics including BLEU and ROUGE. However, it has been shown that overlap-based automatic evaluation metrics have little correlation with human evaluation (Liu et al. 2016). Thus, for automatic evaluation, we only calculate the recall rate of the key concepts in an image which are described in the generated poem, to evaluate whether our model can generate consistent poems given images. Considering the uniqueness of


Table 2: Human evaluation of all models. Bold values indicate the best performance.

Models               Poeticness  Fluency  Coherence  Meaning  Consistency  Average
SMT                    6.97       5.85      5.18      5.22       5.15       5.67
RNNPG-A                7.27       6.62      5.95      5.91       5.09       6.17
RNNPG-H                7.36       6.20      5.51      5.67       5.41       6.03
PPG-R                  6.47       5.27      4.91      4.82       4.96       5.29
PPG-H                  6.52       5.57      5.24      5.09       5.48       5.58
MIPG (full)            8.30       7.69      7.07      7.16       6.70       7.38
MIPG (w/o keywords)    7.21       6.78      6.13      6.15       4.10       6.08
MIPG (w/o visual)      7.05       4.81      4.76      5.21       3.98       5.16

the poem generation task in terms of text structure and literary creation, we evaluate the quality of the generated poems with a human study.

Following (Wang et al. 2016b), we use the metrics listed below to evaluate the quality of a generated poem:

• Poeticness. Does the poem follow the rhyme and tone regulations?

• Fluency. Does the poem read smoothly and fluently?

• Coherence. Is the poem coherent across lines?

• Meaning. Does the poem have a reasonable meaning and artistic conception?

In addition to these metrics, for our task of generating classical Chinese poems given images, we need to evaluate how well the generated poem conveys the input image. Here we introduce a metric, Consistency, to measure whether the topics of the generated poem and the given image match.
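The automatic recall metric can be sketched as a small helper (the paper does not detail the matching step, so simple substring matching against the poem text is assumed here):

```python
def concept_recall(concepts, poem_lines):
    """Recall rate: the fraction of an image's key concepts that appear
    somewhere in the generated poem. Substring matching is an assumption;
    the exact matching procedure is not specified in the paper."""
    poem_text = "".join(poem_lines)
    if not concepts:
        return 0.0
    covered = [c for c in concepts if c in poem_text]
    return len(covered) / len(concepts)
```

A higher recall indicates that more of the image's key concepts are expressed in the poem.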

Model Variants

In addition to the proposed MIPG framework, we evaluate two variants of the model to examine the influence of the visual and semantic topic information on the quality of the generated poems:

• MIPG (full). The proposed model, which integrates direct visual information and semantic topic information.

• MIPG (w/o keywords). Based on MIPG, the semantic topic information is removed by setting the input and output memory vectors of keywords to 0, such that the model only leverages visual information in poetry generation.

• MIPG (w/o visual). Based on MIPG, the visual information is removed by setting the visual feature vectors of the image to 0, such that the model only leverages semantic topic information in poetry generation.

Baselines

As far as we know, there is no prior work on generating classical Chinese poetry from images. Therefore, as baselines we implement several previously proposed keyword-based methods listed below:

SMT. A statistical machine translation model (He, Zhou, and Jiang 2012), which translates the preceding lines to generate the next line; the first line is generated from input keywords with a template-based method.

Table 3: Recall rate of the key concepts in an image described in the poems generated by all the models.

Models     Recall    Models   Recall
RNNPG-A    12.85%    PPG-R    33.5%
RNNPG-H    11.82%    PPG-H    33.7%
SMT        19.79%    MIPG     58.8%

RNNPG. A poetry generation model based on recurrent neural networks (Zhang and Lapata 2014), where the first line is generated with a template-based method taking keywords as input and the other three lines are generated sequentially. In our implementation, two schemes are used to select the keywords for the first line: using all the keywords (RNNPG-A) and using important keywords selected by humans (RNNPG-H). Specifically, for RNNPG-A, we use all the keywords extracted from an image (10-20 keywords), while for RNNPG-H, we invite 3 volunteers to vote for the most important keywords in the image (3-6 keywords).

PPG. An attention-based Encoder-Decoder framework with a sub-topic keyword assigned to each line (Wang et al. 2016b). Since PPG requires the number of keywords to equal the number of lines (4 for quatrains), we consider two schemes to select 4 keywords from the keyword set extracted from the image: random selection (PPG-R) and human selection (PPG-H). Specifically, for PPG-R, we randomly select 4 keywords from the keyword set and randomly sort them. For PPG-H, we invite 3 volunteers to vote for the 4 most important keywords in the image and sort them in order of relevance.

Human Evaluation

We invite eighteen volunteers, who are knowledgeable in classical Chinese poetry from reading to writing, to evaluate the results of the various methods. 45 images are randomly sampled as our testing set. Volunteers rate every generated poem with a score from 1 to 10 on 5 aspects: Poeticness, Fluency, Coherence, Meaning and Consistency. Table 2 summarizes the results.

Overall Performance. The results in Table 2 indicate that the proposed model MIPG outperforms the baselines on all metrics. It is worth mentioning that in terms of "Consistency", which measures how well the generated poem can describe the given image, MIPG achieves the best performance, demonstrating the effectiveness of the proposed model at capturing the visual information and semantic topic information in the generated poems. From the comparison of RNNPG-A and RNNPG-H, one can notice that, by picking important keywords manually, the poems generated by RNNPG-H are more consistent with the images, which implies the importance of keyword selection in poetry generation. Similarly, with important keywords selected manually, PPG-H outperforms PPG-R on all aspects, especially in "Coherence" and "Consistency". In comparison, in the proposed MIPG framework, the topic memory network makes it possible to dynamically determine a topic while generating each character; meanwhile, the encoder-decoder framework ensures the generation of fluent poems following the regulations. As a whole, the visual information and the topic memory network work together to generate poems consistent with the given images.

Table 4: Two sample poems generated from the corresponding images by MIPG.

Analysis of Model Variants. The results in the bottom rows of Table 2 correspond to the model variants including MIPG (full), MIPG (w/o keywords) and MIPG (w/o visual). One can observe that ignoring semantic keywords in MIPG (w/o keywords) or visual information in MIPG (w/o visual) degrades the performance of the proposed model significantly, especially in terms of "Consistency". This provides clear evidence justifying that the visual information and semantic topic information work together to generate poems consistent with the images.

Automatic Evaluation of Image-Poem Consistency

Different from keyword-based poetry generation models, for the task of poetry generation from images, image-poem consistency is a new and very important metric when evaluating the quality of the generated poems. Therefore, in addition to human evaluation, we conduct an automatic evaluation of "Consistency", which is achieved by computing the recall rate of the key concepts in an image that are described in the generated poem.
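The recall computation above can be sketched as follows. This is a simplified, hypothetical version: the paper does not specify its concept-matching procedure, so the substring test and the translated example concepts are assumptions for illustration only.

```python
# Sketch of the concept-recall metric: the fraction of an image's key
# concepts that appear in the generated poem. Matching by substring
# containment is a simplifying assumption for this example.

def concept_recall(image_concepts, poem):
    """Return |concepts mentioned in poem| / |concepts in image|."""
    if not image_concepts:
        return 0.0
    described = sum(1 for c in image_concepts if c in poem)
    return described / len(image_concepts)

# Toy example with hypothetical (translated) concepts:
concepts = ["moon", "mountain", "river"]
poem = "the moon rises over the river at dusk"
print(concept_recall(concepts, poem))  # 2 of 3 concepts appear
```

Averaging this quantity over the test images yields the per-model recall rates of the kind reported in Table 3.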

The results are shown in Table 3, where one can observe that the proposed MIPG framework outperforms all the other baselines by a large margin, indicating that the poems generated by MIPG can better describe the given images. One should also notice the difference between the subjective and quantitative evaluations of "Consistency" in Table 2 and Table 3. For instance, PPG-R and PPG-H achieve almost equal recall rates of keywords, but the "Consistency" score of PPG-R is significantly lower than that of PPG-H in Table 2. This is reasonable considering that both PPG-R and PPG-H select 4 keywords extracted from the given image to generate poems, which yields equal recall rates. Meanwhile, the keywords selected by humans describe the image better than randomly picked keywords; therefore PPG-H can generate poems more consistent with the images than PPG-R from the human perspective. In addition, the relatively low keyword recall rates of RNNPG-A and RNNPG-H are probably due to the fact that RNNPG often generates a first poem line that is semantically relevant to the given keywords while not containing any of them. As a consequence, RNNPG-A and RNNPG-H may generate poems with low keyword recall rates but high "Consistency" scores.

Examples

To further illustrate the quality of the poems generated by the proposed MIPG framework, we include two examples of poems generated from the corresponding images in Table 4. As shown in these examples, the poems generated by MIPG can capture the visual information and semantic concepts in the given images. More importantly, the poems nicely describe the images in a poetic manner, while following the strict regulations of classical Chinese poetry.


Conclusion

In this paper, we propose a memory-based neural network model for classical Chinese poetry generation from images (MIPG), where visual features as well as semantic topics of images are exploited when generating poems. Given an image, semantic keywords are extracted as the skeleton of a poem, and a topic memory network is proposed that can take as many keywords as possible and dynamically select the most relevant ones during poem generation. On this basis, visual features are integrated to embody the information missing from the keywords. Numerical and human evaluation of the quality of the poems from different perspectives justifies that the proposed model can generate poems that describe the given images accurately in a poetic manner, while following the strict regulations of classical Chinese poetry.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (No. 61375060, No. 61673364, No. 61727809 and No. 61325010), and the Fundamental Research Funds for the Central Universities (WK2150110008). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this work.

References

Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

He, J.; Zhou, M.; and Jiang, L. 2012. Generating Chinese classical poems with statistical machine translation models. In AAAI.

Jiang, L., and Zhou, M. 2008. Generating Chinese couplets using a statistical MT approach. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, 377-384. Association for Computational Linguistics.

Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Manurung, R.; Ritchie, G.; and Thompson, H. 2012. Using genetic algorithms to create meaningful poetic text. Journal of Experimental & Theoretical Artificial Intelligence 24(1):43-64.

Manurung, H. 2004. An evolutionary algorithm approach to poetry generation.

Netzer, Y.; Gabay, D.; Goldberg, Y.; and Elhadad, M. 2009. Gaiku: Generating haiku with word associations norms. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, 32-39. Association for Computational Linguistics.

Oliveira, H. 2009. Automatic generation of poetry: an overview. Universidade de Coimbra.

Oliveira, H. G. 2012. PoeTryMe: a versatile platform for poetry generation. Computational Creativity, Concept Invention, and General Intelligence 1:21.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, 2440-2448.

Tosa, N.; Obara, H.; and Minoh, M. 2008. Hitch Haiku: An interactive supporting system for composing haiku poem. In International Conference on Entertainment Computing, 209-216. Springer.

Wang, Q.; Luo, T.; Wang, D.; and Xing, C. 2016a. Chinese song iambics generation with neural attention-based model. arXiv preprint arXiv:1604.06274.

Wang, Z.; He, W.; Wu, H.; Wu, H.; Li, W.; Wang, H.; and Chen, E. 2016b. Chinese poetry generation with planning based neural network. arXiv preprint arXiv:1610.09889.

Wang, L. 2002. A summary of rhyming constraints of Chinese poems.

Weston, J.; Chopra, S.; and Bordes, A. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Wu, X.; Tosa, N.; and Nakatsu, R. 2009. New Hitch Haiku: An interactive renku poem composition supporting tool applied for sightseeing navigation system. In International Conference on Entertainment Computing, 191-196. Springer.

Yan, R.; Jiang, H.; Lapata, M.; Lin, S.-D.; Lv, X.; and Li, X. 2013. i, Poet: Automatic Chinese poetry composition through a generative summarization framework under constrained optimization. In IJCAI.

Yi, X.; Li, R.; and Sun, M. 2016. Generating Chinese classical poems with RNN encoder-decoder. arXiv preprint arXiv:1604.01537.

Zeiler, M. D. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, X., and Lapata, M. 2014. Chinese poetry generation with recurrent neural networks. In EMNLP, 670-680.

Zhang, J.; Feng, Y.; Wang, D.; Wang, Y.; Abel, A.; Zhang, S.; and Zhang, A. 2017. Flexible and creative Chinese poetry generation using neural memory. arXiv preprint arXiv:1705.03773.

Zhou, C.-L.; You, W.; and Ding, X. 2010. Genetic algorithm and its implementation of automatic generation of Chinese Songci. Journal of Software 21(3):427-437.