STEFANN: SCENE TEXT EDITOR USING FONT ADAPTIVE NEURAL NETWORK

A PREPRINT

Prasun Roy∗

CVPR Unit, Indian Statistical Institute Kolkata

Kolkata, India
[email protected]

Saumik Bhattacharya∗

CVPR Unit, Indian Statistical Institute Kolkata

Kolkata, India
[email protected]

Subhankar Ghosh∗

CVPR Unit, Indian Statistical Institute Kolkata

Kolkata, India
[email protected]

Umapada Pal
CVPR Unit, Indian Statistical Institute Kolkata

Kolkata, India

[email protected]

March 5, 2019

ABSTRACT

Textual information in a captured scene plays an important role in scene interpretation and decision making. Considerable research effort is devoted to detecting and recognizing textual data accurately in images. Though there exist methods that can successfully detect complex text regions present in a scene, to the best of our knowledge there is no work that modifies the textual information in an image. This paper presents a simple text editor that can edit/modify the textual part of an image. Apart from correcting errors in the text of an image, this work can directly increase the reusability of images drastically. In this work, we first focus on the problem of generating unobserved characters with a font and color similar to those of an observed text character present in a natural scene, with minimal user intervention. To generate the characters, we propose a multi-input neural network that adapts the font characteristics of a given character (source) and generates the desired characters (target) with similar font features. We also propose a network that transfers color from the source to the target character without any visible distortion. Next, we place the generated character in a word for its modification, maintaining visual consistency with the other characters in the word. The proposed method is a unified platform that can work like a simple text editor and edit texts in images. We tested our methodology on the popular ICDAR 2011 and ICDAR 2013 datasets, and the results are reported here.

Keywords Scene text editing · Font synthesis · Neural network

1 Introduction

Text is widely present in different design and scene images, and it carries important information for the readers. However, if any alteration is required in a text present in an image, it becomes extremely difficult for several reasons. On the one hand, a limited observation of character samples makes it difficult to predict the other characters that might be required for the editing; on the other hand, different real-world conditions, like scene lighting, perspective distortions, background information, font etc., make a direct replacement of known or unknown characters even harder. The main motivation of this work is to design an algorithm for editing text information present in images that works simply like conventional text editors, such as Notepad or MS Word.

∗These authors contributed equally to this work. The names are in alphabetical order.


Figure 1: Examples of text editing results using STEFANN: (a) Original images from the ICDAR 2013 dataset; (b) Edited images. It can be observed that STEFANN can edit multiple characters in a word (top row) as well as an entire word (bottom row) in a text region.

Earlier, researchers proposed font synthesis algorithms based on different geometrical features of fonts [1, 2, 3]. These geometrical models neither generalize the wide variety of available fonts, nor can they be applied directly to an image for character synthesis. Later, researchers addressed the problem of generating unobserved characters of a particular font from some defined or random set of observations using deep learning algorithms [4, 5, 6]. With the rise of generative adversarial networks (GANs), the problem of character synthesis has also been addressed using GAN-based algorithms [7, 8]. Though GAN-based font synthesis could be used to estimate the target character, several challenges make a direct application of font synthesis to scene images difficult. First of all, most GAN-based font synthesis models demand an explicit recognition of the source character. As recognition of text in scene images is itself a challenging problem, it is preferable if the target characters can be generated without a recognition step. Otherwise, any error in the recognition process would accumulate and make the entire text editing process unstable. Secondly, it is often observed that a particular word in an image may have a mixture of different font types, sizes, colors, widths etc. Even depending on the relative location of the camera and the texts in the scene, each character may experience a different amount of perspective distortion. Some GAN-based models [7, 8] require multiple observations of a font type to faithfully generate the other unobserved characters. A multiple-observation-based generation of target characters requires a rigorous distortion removal before the application of a generative algorithm. Thus, rather than a word-level generation, we follow a character-level generative model to accommodate maximum flexibility.


1.1 Main contributions

To the best of our knowledge, this is the first work that attempts to modify texts in scene images. For this purpose, we design a generative network that adapts the font features from a single character and generates the other necessary character(s). We also propose a model to transfer the color of the source character to the generated target character. The entire process works without any explicit recognition of the characters. Like conventional text editors, the method requires minimal user input to edit a text in the image. To restrict the complexity of our problem, we limit our discussion to scene texts with upper-case non-overlapping characters. However, we demonstrate that the proposed method can also be applied to lower-case characters.

2 Related Works

Because of its large potential, character synthesis from a few examples is a well-known problem. Previously, several pieces of work tried to address the problem using geometrical modeling of fonts [1, 2, 3]. Different synthesis models have also been proposed explicitly for Chinese font generation [8, 9]. Along with statistical models [2] and bilinear factorization [10], machine learning algorithms have been used to transfer font features. Recently, deep learning techniques have also become popular for the font synthesis problem. Supervised [6] and definite [4] sets of observations are used to generate the unknown samples using deep neural architectures. Recently, GAN-based generative models have been found to be effective in different image synthesis problems. GANs can be used in image style transfer [11], structure generation [12], or both [7]. Some of these algorithms achieve promising results in generating font structures [5, 8], whereas some exhibit the potential to generate complex fonts with color [7]. To the best of our knowledge, these generative algorithms work with text images that are produced using design software, and their applicability to editing real scene images is unknown. Moreover, most of the algorithms [7, 4] require explicit recognition of the source characters to generate the unseen character set. This may create difficulty in our problem, as text recognition in scene images is itself a challenging problem [13, 14, 15], and any error in the recognition step may ruin the entire generative process. Character generation from multiple observations is also challenging for scene images, as the observations may have different fonts, or may simply have different scaling and perspective distortions.

Convolutional neural networks (CNNs) have proved to be effective for style transfer in generative models [11, 16, 17]. Recently, CNN models have been used to generate style and structure with different visual features [18]. We propose a CNN-based character generation network that works without any explicit recognition of the source characters. For a natural-looking generation, it is important that the color of the source character is transferred faithfully to the generated character. Color transfer is a popular topic in image processing [19, 20, 21]. Though these traditional approaches are good for transferring colors between images, most of them are inappropriate for transferring colors between character images. Recently, GAN-based models have also been deployed for color transfer [7, 22]. In this work, we include a CNN-based color transfer model that takes the color information present in the source character and transfers it to the generated target character. The proposed color transfer model not only transfers solid colors from the source to the target character, it can also transfer gradient colors while keeping visual consistency.

3 Methodology

The flow of the proposed method can be broadly divided into the following steps: (1) selection of the character(s) to be replaced, (2) generation of the binary target character(s), (3) colorization, and (4) placement of the generated characters. In the first step, we select the text area that requires modification and detect the bounding boxes of each character in the selected region. Next, the user selects the bounding boxes around the characters to be modified. The user also specifies the target characters that need to be placed. According to the user inputs, the target characters are generated, colorized and placed in the inpainted regions.

3.1 Selection of source character

Let us assume that I is an image that has multiple text regions, and Ω is the domain of a text region that requires modification. The region Ω can be selected using any text detection algorithm [23, 24, 25]. Alternatively, a user can select the corner points of a polygon that bounds a word to define Ω. In this work, we use EAST [26] to tentatively mark the text regions, followed by a quadrilateral corner selection to define Ω. After selecting the text region, we apply the MSER algorithm [27] to detect the binary masks of the individual characters present in the region Ω. However, MSER alone cannot generate a sharp mask for most of the characters. Thus, we calculate the final binarized image Ic defined as


Figure 2: Architecture of STEFANN, consisting of FANnet and Colornet, to generate a colored target character from a colored source character. The figure shows how a source character (here ‘H’) is first used to generate a binary target character (here ‘N’), and how the color transfer takes place from the source character to the target character. Layer names in the figure are: conv (2D convolution), FC (fully connected), up-conv (upsampling + convolution).

$$
I_c(p) =
\begin{cases}
I_M(p) \odot I_B(p) & \text{if } p \in \Omega \\
0 & \text{otherwise}
\end{cases}
$$

where IM is the binarized output of the MSER algorithm [27] applied on I, IB is the binarized image of I, and ⊙ denotes the element-wise product of matrices. The image Ic contains the binarized characters in the selected region Ω. If the color of the source character is darker than its background, we apply image inversion on I before generating IB.
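The computation of Ic can be sketched with OpenCV as below. This is a hypothetical illustration of the step, not the authors' code; the default MSER parameters and the handling of the region mask for Ω are assumptions.

```python
# Sketch of the binarization step: I_c = (I_M ⊙ I_B) inside Omega, 0 elsewhere.
import cv2
import numpy as np

def binarize_region(image_bgr, region_mask):
    """image_bgr: scene image I; region_mask: uint8 (0/255) mask of the selected region Omega."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # I_M: binary mask built from MSER regions
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    I_M = np.zeros_like(gray)
    for pts in regions:                      # pts has shape (k, 2) with (x, y) coordinates
        I_M[pts[:, 1], pts[:, 0]] = 255

    # I_B: Otsu binarization of I (use THRESH_BINARY_INV if the text is darker
    # than the background, i.e. the image-inversion case described above)
    _, I_B = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    I_c = cv2.bitwise_and(I_M, I_B)          # element-wise product of binary masks
    return cv2.bitwise_and(I_c, region_mask) # keep only the selected region Omega
```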

Assuming the characters are non-overlapping, we apply a connected component analysis and compute the minimum bounding rectangle of each connected component. If there are N connected components present in a scene, Cn ⊆ Ω denotes the n-th connected area, where 0 < n ≤ N. The bounding boxes Bn carry the same indices as the connected areas they bound. The user specifies the indices of the characters they wish to edit. We define Θ as the set of indices that require modification, such that |Θ| ≤ N, where |.| denotes the cardinality of a set. The binarized images ICθ associated with the components Cθ, θ ∈ Θ, are the source characters, and with proper padding (discussed in the following section), they individually act as the input of the font generative network. Each ICθ has the same dimensions as the bounding box Bθ.
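A minimal sketch of the connected component analysis, assuming OpenCV's connectedComponentsWithStats; the helper name character_boxes is hypothetical.

```python
# Hypothetical helper: minimum bounding rectangles B_n of the connected
# components C_n in the binarized image I_c.
import cv2

def character_boxes(I_c):
    n, labels, stats, _ = cv2.connectedComponentsWithStats(I_c, connectivity=8)
    # stats[i] = [x, y, width, height, area]; label 0 is the background
    return [tuple(stats[i][:4]) for i in range(1, n)]
```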

3.2 Generation of target character

Conventionally, most neural networks take square images as input. However, as ICθ may have different aspect ratios depending on several factors like the source character, font type, font size etc., a direct resizing of ICθ would distort the actual font features present in the character. Rather, we pad ICθ, maintaining its aspect ratio, to generate a square binary image Iθ of size mθ × mθ such that mθ = max(hθ, wθ), where hθ and wθ are the height and width of the bounding box Bθ respectively, and max(.) returns the maximum value. We pad both sides of ICθ along the x and y axes with px and py respectively to generate Iθ, such that

$$
p_x = \left\lfloor \frac{m_\theta - w_\theta}{2} \right\rfloor, \qquad
p_y = \left\lfloor \frac{m_\theta - h_\theta}{2} \right\rfloor
$$
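The padding and resizing to the 64 × 64 network input follow directly from the equation above; pad_to_square is a hypothetical helper and the interpolation choice is an assumption.

```python
# Aspect-ratio-preserving padding of a character crop to an m x m square,
# then resizing to the 64 x 64 input of FANnet.
import cv2
import numpy as np

def pad_to_square(char_img):
    h, w = char_img.shape[:2]
    m = max(h, w)
    px, py = (m - w) // 2, (m - h) // 2        # p_x and p_y as in the equation above
    padded = np.zeros((m, m), dtype=char_img.dtype)
    padded[py:py + h, px:px + w] = char_img
    return cv2.resize(padded, (64, 64), interpolation=cv2.INTER_AREA)
```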

3.2.1 Font Adaptive Neural Network (FANnet)

Our generative font adaptive neural network (FANnet) takes two different inputs. The first is an input image of dimension 64 × 64. The second is a one-hot encoded vector v of size 26 in which the index of the target character is encoded. For example, if our target character is ‘B’, then v has the value 1 at the second position and zero at every other location. The input image passes through three convolution layers followed by a reshaping and a fully connected (FC) layer FC1. The encoded vector v is also connected to an FC layer FC2. The outputs of FC1 and FC2 give a higher-dimensional representation of the inputs. The outputs of FC1 and FC2 are concatenated and followed by two FC layers of 1024 neurons each. The expanding part of the network contains two ‘up-conv’ layers with 16 filters. Each ‘up-conv’ layer contains an upsampling followed by a 2D convolution. The final layer contains an up-conv layer with 1 filter. All the convolution layers have ReLU activation with kernel size 3 × 3. The architecture of FANnet is depicted in Fig. 2.


The network minimizes the mean absolute error (MAE) during training with the Adam optimizer [28] with momentum parameters β1 = 0.9, β2 = 0.999 and regularization parameter ε = 10⁻⁷.
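A minimal Keras sketch of the FANnet architecture as described above. The filter counts of the three encoder convolutions, the sizes of FC1 and FC2, the reshape dimensions and the output activation are not specified in the text and are assumptions here; only the overall layout (two inputs, concatenated FC representations, two 1024-neuron FC layers, two 16-filter up-conv layers and a final 1-filter up-conv, trained with MAE and Adam) follows the description.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

def build_fannet():
    img_in = layers.Input(shape=(64, 64, 1), name="source_image")   # padded binary source character
    vec_in = layers.Input(shape=(26,), name="target_onehot")        # one-hot target label

    # Three 3x3 convolution layers with ReLU (filter counts assumed)
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(img_in)
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
    x = layers.Flatten()(x)
    fc1 = layers.Dense(512, activation="relu", name="FC1")(x)        # size assumed
    fc2 = layers.Dense(512, activation="relu", name="FC2")(vec_in)   # size assumed

    h = layers.Concatenate()([fc1, fc2])
    h = layers.Dense(1024, activation="relu")(h)
    h = layers.Dense(1024, activation="relu")(h)

    # Expanding part: reshape, then two 16-filter up-conv layers and a final 1-filter up-conv
    h = layers.Reshape((8, 8, 16))(h)                                 # 8*8*16 = 1024
    h = layers.UpSampling2D(2)(h)
    h = layers.Conv2D(16, 3, activation="relu", padding="same")(h)    # up-conv 1
    h = layers.UpSampling2D(2)(h)
    h = layers.Conv2D(16, 3, activation="relu", padding="same")(h)    # up-conv 2
    h = layers.UpSampling2D(2)(h)
    out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(h)  # output activation assumed;
                                                                        # the grayscale output is later
                                                                        # binarized with Otsu thresholding
    return Model([img_in, vec_in], out, name="FANnet")

model = build_fannet()
model.compile(optimizer=Adam(beta_1=0.9, beta_2=0.999, epsilon=1e-7), loss="mae")
```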

We train FANnet on 1000 fonts with all 26 upper-case character images as source images and 26 different one-hot encoded vectors for each source image. This implies that, for 1000 fonts, we train the model to generate any of the 26 upper-case target character images from any of the 26 upper-case source character images depending on the user preference. Thus, our training dataset has a total of 0.676 million input tuples. The validation set contains 0.202 million input tuples generated from 300 fonts. We select all the fonts from the Google Fonts database (https://fonts.google.com/). We save the weights of the network layers where the MAE loss is minimum over the validation set. The output of FANnet is a grayscale image on which we apply Otsu thresholding to get the binary target character image.

3.3 Color transfer

A faithful transfer of color from the source character is important for a visually consistent generation of the target character. We propose a CNN-based architecture, named Colornet, that takes two images as inputs, a colored source character image and a binary target character image, and generates the target character image with the transferred color.

3.3.1 Colornet

Each input image goes through a 2D convolution layer with 64 filters of size 3 × 3. These convolution layers (Conv1_col and Conv2_col) have leaky-ReLU activation with α = 0.2. The outputs of Conv1_col and Conv2_col are batch-normalized and concatenated, which is followed by three block convolution layers with two max-pooling layers in between. The expanding part of Colornet contains one upsampling layer and one up-conv layer with 3 filters of size 3 × 3. The architecture of Colornet is shown in Fig. 2. The network minimizes the mean absolute error (MAE) during training with the Adam optimizer using the same parameter settings as FANnet. We train it with synthetically generated image pairs. The colored source image and the binary target image are both generated using the same font type, randomly selected from 1300 fonts. The source color images contain both solid and gradient colors so that the network can learn to transfer the wide variety of color variations present in source images. Our training dataset has a total of 0.6 million input tuples and the validation set contains 0.2 million input tuples. We save the weights of the network layers where the MAE loss is minimum over the validation set. We perform an element-wise multiplication of the output of Colornet with the binary target image to get the final colorized target character image.
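A Keras sketch of Colornet under the same caveat: the composition and filter counts of the block convolution layers, the upsampling factor and the output activation are not given in the text, so the values below are illustrative assumptions; only the 64-filter Conv1_col/Conv2_col branches with leaky-ReLU (α = 0.2), batch normalization, concatenation, three block convolutions with two max-pooling layers, and the final 3-filter up-conv follow the description.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Assumed structure of one "block convolution" layer: Conv + LeakyReLU + BatchNorm.
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    return layers.BatchNormalization()(x)

def build_colornet():
    color_src = layers.Input(shape=(64, 64, 3), name="color_source")    # colored source character
    binary_tgt = layers.Input(shape=(64, 64, 1), name="binary_target")  # binary target from FANnet

    # Conv1_col / Conv2_col: 64 filters, 3x3, leaky-ReLU (alpha = 0.2), then batch norm
    a = layers.Conv2D(64, 3, padding="same", name="Conv1_col")(color_src)
    a = layers.BatchNormalization()(layers.LeakyReLU(0.2)(a))
    b = layers.Conv2D(64, 3, padding="same", name="Conv2_col")(binary_tgt)
    b = layers.BatchNormalization()(layers.LeakyReLU(0.2)(b))
    h = layers.Concatenate()([a, b])

    # Three block convolution layers with two max-pooling layers in between (filter counts assumed)
    h = conv_block(h, 128)
    h = layers.MaxPooling2D(2)(h)
    h = conv_block(h, 128)
    h = layers.MaxPooling2D(2)(h)
    h = conv_block(h, 128)

    # Expanding part: one upsampling layer and one up-conv layer with 3 filters
    h = layers.UpSampling2D(4)(h)                 # factor assumed so that 16x16 -> 64x64
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(h)

    return Model([color_src, binary_tgt], out, name="Colornet")
```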

3.4 Character placement

Even after the generation of the target character, its placement requires several careful operations. First we need to remove the source character from I so that the generated target character can be placed. We use inpainting [29] with W(ICθ, ψ) as the mask to remove the character that needs to be edited, where W(Ib, ψ) is the dilation operation on a binary image Ib using the structural element ψ. In our experiments, we consider ψ = 3 × 3. To begin the target character placement, the output of Colornet is first resized to the dimensions of Iθ. We denote the resized color target character by Rθ with minimum rectangular bounding box BRθ. If BRθ is smaller (larger) than Bθ, then we need to remove (add) the region Bθ \ BRθ so that we have the space to position Rθ with proper inter-character spacing. We apply the content-aware seam carving technique [30] to manipulate the non-overlapping region. It is important to mention that if BRθ is smaller than Bθ, then after seam carving, the entire text region Ω shrinks to a region Ωs, and we also need to inpaint the region Ω \ Ωs for consistency. However, both the regions Bθ \ BRθ and Ω \ Ωs are considerably small, and are easy to inpaint for upper-case characters. Finally, we place the generated target character on the seam-carved image such that the centroid of BRθ overlaps with the centroid of Bθ.
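The source-character removal step can be sketched with OpenCV's fast-marching inpainting [29]; remove_source_character is a hypothetical helper, and the inpainting radius is an assumption.

```python
# Remove the source character: dilate its binary mask with a 3x3 structuring
# element (psi) and inpaint the dilated region.
import cv2
import numpy as np

def remove_source_character(image_bgr, char_mask):
    """char_mask: binary mask of the source character I_C_theta (uint8, 0/255)."""
    kernel = np.ones((3, 3), np.uint8)            # structuring element psi = 3x3
    inpaint_mask = cv2.dilate(char_mask, kernel)  # W(I_C_theta, psi)
    # Fast-marching (Telea) inpainting; radius of 3 pixels is an assumed choice
    return cv2.inpaint(image_bgr, inpaint_mask, 3, cv2.INPAINT_TELEA)
```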

4 Results

We tested our algorithm on the well-known ICDAR datasets. The images in the datasets are scene images with texts written in different unknown fonts. In Figs. 3 and 4, we show some of the images from the ICDAR datasets that are edited using STEFANN. In each image pair, the left image is the original image and the right image is the edited image. In some of the images, several characters are edited in a particular text region, whereas in some images, several text regions are edited in a particular image. It can be observed that not only are the font features and colors transferred successfully to the target characters, but also the inter-character spacing is maintained in most of the cases.


Though all the images are natural scene images and contain different lighting conditions, fonts, perspectives, backgrounds etc., in all the cases STEFANN edited the images with almost no visible distortion.

Figure 3: Text editing results on ICDAR images using STEFANN. In each image pair, the left image is the original image and the right image is the edited image [Best viewed with 300% zoom in the digital version].

To evaluate the performance of the proposed FANnet model, we provide a particular source image and generate all the other characters for different fonts. The outputs for some randomly selected font types are shown in Fig. 5. In this figure, we provide only the character image of ‘A’ and generate all the other characters successfully. It can be observed that the model is able to generate the characters while maintaining the features of the input font. To quantify the generation quality of FANnet, we give only one source character at a time and measure the average structural similarity index (SSIM) [31] of all 26 generated target characters over 300 test fonts. In Fig. 6, we show the average SSIM of the generated characters with respect to the index of the source character. It can be seen that, in the overall generation, some characters, like ‘I’ and ‘L’, are less informative, whereas characters like ‘B’ and ‘M’ are more informative in the generative process. To further analyze the robustness of the generative model, we build another model with the same architecture as described before and train it with lower-case character images of the same font database. The output of the model is shown in Fig. 7(a).
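The SSIM-based evaluation can be reproduced, for example, with scikit-image; average_ssim is a hypothetical helper that averages the SSIM over the 26 characters generated from one source character of a test font.

```python
# Average SSIM between the 26 generated characters and their ground-truth
# renderings for one (source character, font) pair.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def average_ssim(generated, ground_truth):
    """generated, ground_truth: lists of 26 grayscale 64x64 images (float in [0, 1])."""
    scores = [ssim(g, t, data_range=1.0) for g, t in zip(generated, ground_truth)]
    return float(np.mean(scores))
```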


Figure 4: More images that are edited using STEFANN. In each image pair, the left image is the original image and the right image is the edited image [Best viewed with 300% zoom in the digital version].

When only a lower-case character (character ‘a’) is given as the source image, all the lower-case characters are generated. We also observe that the model can transfer some font features even when a lower-case (upper-case) character is provided as the source character and we try to generate the upper-case (lower-case) characters.

The performance of Colornet is shown in Fig. 8 for both solid and gradient colors. It can be observed that in both cases, Colornet can faithfully transfer the color of the source characters to the respective generated target characters. The model works equally well for all alphanumeric characters, including lower-case texts. To understand the functionality of the layers in Colornet, we perform an ablation study and select the best model architecture that faithfully transfers the color.


Figure 5: Generation of target characters using FANnet. In each image block, the upper row shows the ground truth and the lower row shows the generated characters when the network has observed one particular source character (character ‘A’) in each case.

Along with the proposed Colornet architecture, we compare a modified architecture, Colornet-l, where we remove the Block conv3 layer along with one upsampling layer, and another architecture, Colornet-f, that has the same architecture as the proposed Colornet model but with 16 filters in both the Conv1_col and Conv2_col layers. The results of color transfer for some input tuples are shown in Fig. 9 for these three models. It can be observed that Colornet-l produces visible color distortion in the generated images, whereas some of the color information is not present in the images generated by Colornet-f.


Figure 6: Average SSIM with respect to the index of source character.

Figure 7: Additional generation results of target characters using FANnet. In each image block, the upper row shows the ground truth, and the lower row shows the generated characters when the network has observed only one particular (a) lower-case source character (character ‘a’); (b) lower-case source character (character ‘a’), and it tries to generate upper-case target characters; (c) upper-case source character (character ‘A’), and it tries to generate lower-case target characters.

Automatic seam carving of the text region is one of the important steps for performing perceptually consistent modifications while maintaining the inter-character distance. Seam carving is particularly required when the target character is ‘I’. It can be observed from Fig. 10 that the edited images with seam carving look visually more consistent than the edited images without seam carving.


Figure 8: Color transfer using Colornet: (a) binary target character image; (b) colored source character image; (c) ground truth; (d) color transferred image. It can be observed that Colornet can successfully transfer gradient color as well as flat color.

Figure 9: Colornet ablation study for different models: (a) and (b) input tuples; (c) ground truth; (d) output of the proposed Colornet model; (e) output of Colornet-l; (f) output of Colornet-f.

5 Discussion and Conclusion

To the best of our knowledge, this is the first attempt to develop a unified platform that can edit text in images like a normal text editor while keeping consistent font features. The proposed FANnet architecture can also be used to generate lower-case target characters. However, after the color transfer process, it is difficult to predict the size of the lower-case target character just from the lower-case source character. This is mainly because lower-case characters are placed in different ‘text zones’ [32], and the generated target character may not be placed directly if the source character and the target character are not from the same text zone. In Fig. 11, we show some images where we edit lower-case characters with STEFANN. In Fig. 12, we also show some cases where STEFANN fails to edit the text faithfully. The major reason behind the failure cases is the inappropriate generation of the target characters. In some cases, the generated characters are not consistent with the same characters present in the scene (Fig. 12(a)), whereas in other cases the font features are not transferred properly into the generated texts (Fig. 12(b)). In all the images that are edited and shown in this paper, the number of characters in a text region is not changed. The main limitation of the present methodology is that the font generative model FANnet generates images of dimension 64 × 64. While editing high-resolution text regions, a rigorous upsampling is often required to match the size of the source character. This may introduce severe distortion in the shape of the upsampled target character due to interpolation. In future, we plan to integrate super-resolution after generating the colored target character in order to produce very high-resolution character images that are necessary to edit many design files.

References

[1] N. D. Campbell and J. Kautz, “Learning a manifold of fonts,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, p. 91, 2014.

[2] H. Q. Phan, H. Fu, and A. B. Chan, “Flexyfont: Learning transferring rules for flexible typeface synthesis,” in Computer Graphics Forum, vol. 34, pp. 245–256, Wiley Online Library, 2015.

[3] R. Suveeranont and T. Igarashi, “Example-based automatic font generation,” in International Symposium on Smart Graphics, pp. 127–138, Springer, 2010.


Figure 10: Effectiveness of seam carving. In each block, the original image is at the top, the edited image without the seam carving operation is in the middle, and the edited image with the seam carving operation is at the bottom [Best viewed with 300% zoom in the digital version].

[4] S. Baluja, “Learning typographic style: from discrimination to synthesis,” Machine Vision and Applications, vol. 28, no. 5-6, pp. 551–568, 2017.

[5] J. Chang and Y. Gu, “Chinese typography transfer,” arXiv preprint arXiv:1707.04904, 2017.

[6] P. Upchurch, N. Snavely, and K. Bala, “From A to Z: supervised transfer of style and content using deep neural network generators,” arXiv preprint arXiv:1603.02003, 2016.

[7] S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Darrell, “Multi-content GAN for few-shot font style transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 11, p. 13, 2018.

[8] P. Lyu, X. Bai, C. Yao, Z. Zhu, T. Huang, and W. Liu, “Auto-encoder guided GAN for Chinese calligraphy synthesis,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1, pp. 1095–1100, IEEE, 2017.

[9] B. Zhou, W. Wang, and Z. Chen, “Easy generation of personal Chinese handwritten fonts,” in Multimedia and Expo (ICME), 2011 IEEE International Conference on, pp. 1–6, IEEE, 2011.

[10] J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, no. 6, pp. 1247–1283, 2000.


Figure 11: Some images where lower-case characters are edited using STEFANN [Best viewed with 300% zoom in the digital version].

Figure 12: Some images where STEFANN fails to edit text with visual consistency [Best viewed with 300% zoom in the digital version].

[11] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423, IEEE, 2016.

[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.

[13] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, “Edit probability for scene text recognition,” arXiv preprint arXiv:1805.03384, 2018.

[14] A. Gupta, A. Vedaldi, and A. Zisserman, “Learning to read by spelling: Towards unsupervised text recognition,” arXiv preprint arXiv:1809.08675, 2018.


[15] L. Neumann, Scene text localization and recognition in images and videos. PhD thesis, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, 2017.

[16] C. Li and M. Wand, “Combining Markov random fields and convolutional neural networks for image synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2479–2486, 2016.

[17] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang, “Visual attribute transfer through deep image analogy,” arXiv preprint arXiv:1705.01088, 2017.

[18] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox, “Learning to generate chairs, tables and cars with convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 692–705, 2017.

[19] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Computer Graphics and Applications, vol. 21, no. 5, pp. 34–41, 2001.

[20] Y.-W. Tai, J. Jia, and C.-K. Tang, “Local color transfer via probabilistic segmentation by expectation-maximization,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 747–754, IEEE, 2005.

[21] T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to greyscale images,” in ACM Transactions on Graphics (TOG), vol. 21, pp. 277–280, ACM, 2002.

[22] C. Li, J. Guo, and C. Guo, “Emerging from water: Underwater image color correction based on weakly supervised color transfer,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 323–327, 2018.

[23] M. Bušta, L. Neumann, and J. Matas, “Deep TextSpotter: An end-to-end trainable scene text localization and recognition framework,” in Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2223–2231, IEEE, 2017.

[24] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai, “Multi-oriented scene text detection via corner localization and region segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7553–7563, 2018.

[25] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, “Robust text detection in natural scene images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 970–983, 2014.

[26] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An efficient and accurate scene text detector,” in Proc. CVPR, pp. 2642–2651, 2017.

[27] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod, “Robust text detection in natural images with edge-enhanced maximally stable extremal regions,” in Image Processing (ICIP), 2011 18th IEEE International Conference on, pp. 2609–2612, IEEE, 2011.

[28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[29] A. Telea, “An image inpainting technique based on the fast marching method,” Journal of Graphics Tools, vol. 9, no. 1, pp. 23–34, 2004.

[30] S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” in ACM Transactions on Graphics (TOG), vol. 26, p. 10, ACM, 2007.

[31] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[32] U. Pal and B. Chaudhuri, “Automatic separation of machine-printed and hand-written text lines,” in Document Analysis and Recognition, 1999. ICDAR’99. Proceedings of the Fifth International Conference on, pp. 645–648, IEEE, 1999.
