arXiv:2005.01646v1 [cs.LG] 4 May 2020(Kingma and Welling,2014) capture the more visu-ally salient...

A Probabilistic Generative Model forTypographical Analysis of Early Modern Printing

Kartik Goyal1 Chris Dyer2 Christopher Warren3 Max G’Sell4 Taylor Berg-Kirkpatrick5

1Language Technologies Institute, Carnegie Mellon University 2Deepmind3Department of English, Carnegie Mellon University4Department of Statistics, Carnegie Mellon University

5Computer Science and Engineering, University of California, San Diego{kartikgo,cnwarren,mgsell}@andrew.cmu.edu [email protected] [email protected]

Abstract

We propose a deep and interpretable prob-abilistic generative model to analyze glyphshapes in printed Early Modern documents.We focus on clustering extracted glyph imagesinto underlying templates in the presence ofmultiple confounding sources of variance. Ourapproach introduces a neural editor model thatfirst generates well-understood printing phe-nomena like spatial perturbations from tem-plate parameters via interpertable latent vari-ables, and then modifies the result by generat-ing a non-interpretable latent vector responsi-ble for inking variations, jitter, noise from thearchiving process, and other unforeseen phe-nomena associated with Early Modern print-ing. Critically, by introducing an inferencenetwork whose input is restricted to the vi-sual residual between the observation and theinterpretably-modified template, we are ableto control and isolate what the vector-valuedlatent variable captures. We show that ourapproach outperforms rigid interpretable clus-tering baselines (Ocular) and overly-flexibledeep generative models (VAE) alike on thetask of completely unsupervised discovery oftypefaces in mixed-font documents.

1 Introduction

Scholars interested in understanding details relatedto production and provenance of historical docu-ments rely on methods of analysis ranging from thestudy of orthographic differences and stylometrics,to visual analysis of layout, font, and printed char-acters. Recently developed tools like Ocular (Berg-Kirkpatrick et al., 2013) for OCR of historical docu-ments have helped automate and scale some textualanalysis methods for tasks like compositor attri-bution (Ryskina et al., 2017) and digitization ofhistorical documents (Garrette et al., 2015). How-ever, researchers often find the need to go beyond

Figure 1: We desire a generative model that can be biased tocluster according to typeface characteristics (e.g. the length ofthe middle arm) rather than other more visually salient sourcesof variation like inking.

textual analysis for establishing provenance of his-torical documents. For example, Hinman (1963)’sstudy of typesetting in Shakespeare’s First Foliorelied on the discovery of pieces of damaged ordistinctive type through manual inspection of everyglyph in the document. More recently, Warren et al.(2020) examine pieces of distinctive types acrossseveral printers of the early modern period to positthe identity of clandestine printers of John Milton’sAreopagitica (1644). In such work, researchersfrequently aim to determine whether a book wasproduced by a single or multiple printers (Weiss(1992); Malcolm (2014); Takano (2016)). Hence,in order to aid these visual methods of analyses,we propose here a novel probabilistic generativemodel for analyzing extracted images of individ-ual printed characters in historical documents. Wedraw from work on both deep generative modelingand interpretable models of the printing press todevelop an approach that is both flexible and con-trollable – the later being a critical requirement forsuch analysis tools.

As depicted in Figure 1, we are interested in iden-tifying clusters of subtly distinctive glyph shapesas these correspond to distinct metal stamps inthe type-cases used by printers. However, other

arX

iv:2

005.

0164

6v1

[cs

.LG

] 4

May

202

0

sources of variation (inking, for example, as de-picted in Figure 1) are likely to dominate conven-tional clustering methods. For example, power-ful models like the variational autoencoder (VAE)(Kingma and Welling, 2014) capture the more visu-ally salient variance in inking rather than typeface,while more rigid models (e.g. the emission modelof Ocular (Berg-Kirkpatrick et al., 2013)), fail tofit the data. The goal of our approach is to accountfor these confounding sources of variance, whileisolating the variables pertinent to clustering.

Hence, we propose a generative clustering modelthat introduces a neural editing process to add ex-pressivity, but includes interpretable latent vari-ables that model well-understood variance in theprinting process: bi-axial translation, shear, androtation of canonical type shapes. In order to makeour model controllable and prevent deep latent vari-ables from explaining all variance in the data, weintroduce a restricted inference network. By onlyallowing the inference network to observe the vi-sual residual of the observation after interpretablemodifications have been applied, we bias the poste-rior approximation on the neural editor (and thusthe model itself) to capture residual sources of vari-ance in the editor – for example, inking levels, inkbleeds, and imaging noise. This approach is relatedto recently introduced neural editor models for textgeneration (Guu et al., 2018).

In experiments, we compare our model withrigid interpretable models (Ocular) and powerfulgenerative models (VAE) at the task of unsuper-vised clustering subtly distinct typeface in scannedimages early modern documents sourced fromEarly English Books Online (EEBO).

2 Model

Our model reasons about the printed appearancesof a symbol (say majuscule F) in a document via amixture model whose K components correspondto different metal stamps used by a printer for thedocument. During various stages of printing, ran-dom transformations result in varying printed man-ifestations of a metal cast on the paper. Figure 2depicts our model. We denote an observed imageof the extracted character by X . We denote choiceof typeface by latent variable c (the mixture com-ponent) with prior π. We represent the shape ofthe k-th stamp by template Tk, a square matrixof parameters. We denote the interpretable latentvariables corresponding to spatial adjustment of

Figure 2: Proposed generative model for clustering imagesof a symbol by typeface. Each mixture component c corre-sponds to a learnable template Tk. The λ variables warp(spatially adjust) the original template T to T . This warpedtemplate is then further transformed via the z variables to Tvia an expressive neural filter function parametrized by θ.

the metal stamp by λ, and the editor latent vari-able responsible for residual sources of variationby z. As illustrated in Fig. 2, after a cluster compo-nent c = k is selected, the corresponding templateTk undergoes a transformation to yield Tk. Thistransformation occurs in two stages: first, the inter-pretable spatial adjustment variables (λ) producean adjusted template (§2.1), Tk = warp(Tk, λ),and then the neural latent variable transforms theadjusted template (§2.2), Tk = filter(Tk, z). Themarginal probability under our model can be writ-ten as

p(X) =∑k

πk

∫p(X|λ, z;Tk)p(λ)p(z)dzdλ

where p(X|λ, z;Tk) refers to the distribution overthe binary pixels of X where each pixel has abernoulli distribution parametrized by the valueof the corresponding pixel-entry in Tk.

2.1 Interpretable spatial adjustmentEarly typesetting was noisy, and the metal pieceswere often arranged with slight variations whichresulted in the printed characters being positionedwith small amounts of offset, rotation and shear.These real-valued spatial adjustment variables aredenoted by λ = (r, o, s, a), where r represents therotation variable, o = (oh, ov) represents offsets

along the horizontal and vertical axes, s = (sh, sv)denotes shear along the two axes. A scale fac-tor, a = 1.0 + a, accounts for minor scale vari-ations arising due to the archiving and extractionprocesses. All variables in λ are generated from aGaussian prior with zero mean and fixed varianceas the transformations due to these variables tendto be subtle.

In order to incorporate these deterministic trans-formations in a differentiable manner, we map λ toa template sized attention map Hij for each outputpixel position (i, j) in T as depicted in Figure 3.The attention map for each output pixel is formedin order to attend to the corresponding shifted (orscaled or sheared) portion of the input templateand is shaped according to a Gaussian distributionwith mean determined by an affine transform. Thisapproach allows for strong inductive bias whichcontrasts with related work on spatial-VAE (Bepleret al., 2019) that learns arbitrary transformations.

Figure 3: Translation operation: The mode of the attentionmap is shifted by the offset values for every output pixel in T .Similar operations account for shear, rotation, and scale.

2.2 Residual sources of variations

Apart from spatial perturbations, other majorsources of deviation in early printing include ran-dom inking perturbations caused by inconsistentapplication of the stamps, unpredictable ink bleeds,and noise associated with digital archiving of thedocuments. Unlike in the case of spatial pertur-bations which could be handled by deterministicaffine transformation operators, it is not possible toanalytically define a transformation operator due tothese variables. Hence we propose to introduce anon-interpretable real-valued latent vector z, witha Gaussian prior N (0, I) , that transforms T intoa final template T via neurally-parametrized func-tion filter(T , z; θ) with neural network parametersθ. This function is a convolution over T whosekernel is parametrized by z, followed by non-linearoperations. Intuitively, parametrizing the filter byz results in the latent variable accounting for varia-tions like inking appropriately because convolutionfilters capture local variations in appearance. Sri-vatsan et al. (2019) also observed the effectiveness

cXObservation Template choice

TcWarped

template

Rc

Residual

z Inference parameters

ϕ

Posterior Approximation

q(z |X, c; ϕ)

Rc = X � Tc

z = InferNet(Rc, c)

Figure 4: Inference network for z conditions on the mixturecomponent and only the residual image left after subtractingthe λ-transformed template from the image. This encouragesz to model variance due to sources other than spatial adjust-ments.

of using z to define a deconvolutional kernel forfont generation.

2.3 Learning and Inference

Our aim is to maximize the log likelihood of theobserved data ({Xd | d ∈ N, d < n}) of n imageswrt. model parameters:

LL(T1,...,k, θ) = maxT,θ

∑d

log[∑

k

πk∫p(Xd|λd, zd;Tk, θ)p(λd)p(zd)dzddλd

]During training, we maximize the likelihood wrt.λ instead of marginalizing, which is an approxima-tion inspired by iterated conditional modes (Besag,1986):

maxT,θ

∑d

log∑k

maxγk,d

πk

∫p(Xd|λd = γk,d, zd;

Tk, θ)p(λd = γk,d)p(zd)dzd

However, marginalizing over z remains intractable.Therefore we perform amortized variational infer-ence to define and maximize a lower bound onthe above objective (Kingma and Welling, 2014).We use a convolutional inference neural networkparametrized by φ (Fig. 4), that takes as input, themixture component k, the residual image Rk =X − Tk, and produces mean and variance parame-ters for an isotropic gaussian proposal distributionq(z | Rk, k;φ). This results in the final training

objective:

maxT,θ,φ

∑d

log∑k

Eq(zd|Rd,k,k;φ)

[maxγk,d

(πk

p(Xd|λ = γk,d, zd;Tk, θ)p(λ = γk,d))]

−KL(q(zd|Rd,k, k;φ)||p(z)

)We use stochastic gradient ascent to maximize thisobjective with respect to T, γ, θ and φ.

3 Experiments

We train our models on printed occurrences of 10different uppercase character classes that schol-ars have found useful for bibliographic analysis(Warren et al., 2020) because of their distinctive-ness. As a preprocessing step, we ran Ocular (Berg-Kirkpatrick et al., 2013) on the grayscale scannedimages of historical books in EEBO dataset and ex-tracted the estimated image segments for the lettersof interest.

3.1 Quantitative analysisWe show that our model is superior to strong base-lines at clustering subtly distinct typefaces (usingrealistic synthetic data), as well as in terms of fit-ting the real data from historical books.

3.1.1 Baselines for comparisonOcular: Based on the emission model of Ocu-lar that uses discrete latent variables for the ver-tical/horizontal offset and inking variables, andhence has limited expressivity.λ-only: This model only has the interpretable con-tinuous latent variables pertaining to spatial adjust-ment.VAE-only: This model is expressive but doesn’thave any interpretable latent variables for explicitcontrol. It is an extension of Kingma et al. (2014)’smodel for semi-supervised learning with a continu-ous latent variable vector in which we obtain tighterbounds by marginalizing over the cluster identitiesexplicitly. For fair comparison, the encoder anddecoder convolutional architectures are the sameas the ones in our full model. The correspondingtraining objective for this baseline can be writtenas:

maxT,θ,φ

∑d

log∑k

Eq(zd|Xd,k;φ)

[πkp(Xd|zd;Tk, θ)

]−KL

(q(zd|Xd, k;φ)||p(z)

)No-residual: The only difference from the fullmodel is that the encoder for the inference network

conditions the variational distribution q(z) on theentire input image X instead of just the residualimage X − T .

3.1.2 Font discovery in Synthetic DataEarly modern books were frequently composedfrom two or more type cases, resulting in docu-ments with mixed fonts. We aim to learn the dif-ferent shapes of metal stamps that were used astemplates for each cluster component in our model.Data: In order to quantitatively evaluate ourmodel’s performance, we experiment with syntheti-cally generated realistic dataset for which we knowthe ground truth cluster identities in the follow-ing manner: For each character of interest, wepick three distinct images from scanned segmentedEEBO images, corresponding to three differentmetal casts. Then we randomly add spatial perur-bations related to scale, offset, rotation and shear.To incorporate varying inking levels and other dis-tortions, we randomly either perform erosion, di-lation, or a combination of these warpings usingOpenCV (Bradski, 2000) with randomly selectedkernel sizes. Finally, we add a small Gaussian noiseto the pixel intensities and generate 300 perturbedexamples per character class.Results: We report macro-averaged resultsacross all the character classes on three differ-ent clustering measures, V-measure (Rosenbergand Hirschberg, 2007), Mutual Information andFowlkes and Mallows Index (Fowlkes and Mal-lows, 1983). In Table 11, we see that our modelsignificantly outperforms all other baselines on ev-ery metric. Ocular and λ-only models fail becausethey lack expressiveness to explain the variationsdue to random jitters, erosions and dilations. TheVAE-only model, while very expressive, performspoorly because it lacks the inductive bias neededfor successful clustering. The No-residual modelperforms decently but our model’s superior perfor-mance emphasizes the importance of designing arestrictive inference network such that z doesn’texplain any variation due to the interpretable vari-ables.

3.1.3 Fitting Real Data from Historical BooksFor the analysis of real books, we selected threebooks from the EEBO dataset printed by differentprinters. We modeled each character class for eachbook separately and report the macro-aggregatedupper bounds on the negative log likelihood (NLL)in Table 11. We observe that adding a small amount

V-measure Mutual Info F&M NLL

Ocular 0.42 0.45 0.61 379.21λ-only 0.49 0.51 0.70 322.04VAE-only 0.22 0.29 0.38 263.45No-residual 0.54 0.58 0.73 264.27Our Model 0.73 0.74 0.85 257.92

Table 1: (a) Clustering results on synthetic data (V-measure,Mutual Info, F&M). (b) Test negative log likelihood (NLL)on real data from historical documents, or negative ELBObound for intractable models (NLL).

of expressiveness makes our λ-only model betterthan Ocular. The upper bounds of other inferencenetwork based models are much better than the(likely tight)1 bounds of both the interpretable mod-els. Our model has the lowest upper bound of allthe models while retaining interpretability and con-trol.

3.2 Qualitative analysis

We provide visual evidence of desirable behavior ofour model on collections of character extractionsfrom historical books with mixed fonts. Specif-ically, we discus the performance of our modelon the mysterious edition of Thomas Hobbes’Leviathan known as “the 25 Ornaments” edition.(Hobbes, 1651 [really 1700?]). The 25 OrnamentsLeviathan is an interesting test case for several rea-sons. While its title page indicates a publisherand year of publication, both are fabricated (Mal-colm, 2014). The identities of its printer(s) remainspeculative, and the actual year of publication isuncertain. Further, the 25 Ornaments exhibits twodistinct fonts.

3.2.1 Quality of learned templates

X

T

T Learned Template parameters

Transformed Templates

Observations

Figure 5: The learned templates for F and R and the trans-formed templates T for four examples of F are shown. Ourmodel is able to learn desirable templates based on underlyingglyph structure.

Our model is successful in discovering distinctlyshaped typefaces in the 25 Ornaments Leviathan.

1For Ocular and λ-only models, we report the upperbound obtained via maximization over the interpretable la-tent variables. Intuitively, these latent variables are likely tohave unimodal posterior distributions with low variance, hencethis approximation is likely tight.

We focus on the case study of majuscule letters Fand R, each of which have two different typefacesmixed in throughout. The two typefaces for F dif-fer in the length of the middle arm (Fig. 1), and thetwo typefaces for R have differently shaped legs. InFig. 5, we show that our model successfully learnsthe two desired templates T1 and T2 for both thecharacters which indicates that the clusters in ourmodel mainly focus on subtle differences in under-lying glyph shapes. We also illustrate how the la-tent variables transform the model templates T to Tfor four example F images. The model learns com-plex functions to transform the templates which gobeyond simple affine and morphological transfor-mations in order to account for inking differences,random jitter, contrast variations etc.

3.2.2 Interpretable variables (λ) and Control

1 2 3 Avg.

Unaligned raw Images

Aligned Images

Figure 6: Result of alignment on Leviathan extractionsusing the interpretable λ variables along with their pixelwiseaverage images. Aligned average image is much sharper thanthe unaligned average image.

Finally, we visualize the ability of our modelto separate responsibility of modelling variationamong the interpretable and non-interpretable vari-ables appropriately. We use the inferred values ofthe interpretable (λ) variable for each image in thedataset to adjust the corresponding image. Sincethe templates represent the canonical shape of theletters, the λ variables which shift the templates toexplain the images can be reverse applied to theinput images themselves in order to align them byaccounting for offset, rotation, shear and minor sizevariations. In Fig. 6, we see that the input images(top row) are uneven and vary by size and orienta-tion. By reverse applying the inferred λ values, weare able to project the images to a fixed size suchthat they are aligned and any remaining variationsin the data are caused by other sources of variation.Moreover, this alignment method would be cru-cial for automating certain aspects of bibliographicstudies that focus on comparing specific imprints.

4 Conclusion

Beyond applications to typeface clustering, the gen-eral approach we take might apply more broadly toother clustering problems, and the model we devel-

oped might be incorporated into OCR models forhistorical text.

5 Acknowledgements

This project is funded in part by the NSF undergrants 1618044 and 1936155, and by the NEHunder grant HAA256044-17.

ReferencesTristan Bepler, Ellen Zhong, Kotaro Kelley, Edward

Brignole, and Bonnie Berger. 2019. Explicitly disen-tangling image content from translation and rotationwith spatial-vae. In Advances in Neural InformationProcessing Systems, pages 15409–15419.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein.2013. Unsupervised transcription of historical docu-ments. In Proceedings of the 51st Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 207–217, Sofia, Bul-garia. Association for Computational Linguistics.

Julian Besag. 1986. On the statistical analysis of dirtypictures. Journal of the Royal Statistical Society: Se-ries B (Methodological), 48(3):259–279.

G. Bradski. 2000. The OpenCV Library. Dr. Dobb’sJournal of Software Tools.

Edward B Fowlkes and Colin L Mallows. 1983. Amethod for comparing two hierarchical clusterings.Journal of the American statistical association,78(383):553–569.

Dan Garrette, Hannah Alpert-Abrams, Taylor Berg-Kirkpatrick, and Dan Klein. 2015. Unsupervisedcode-switching for multilingual historical documenttranscription. In Proceedings of the 2015 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, pages 1036–1041, Denver, Col-orado. Association for Computational Linguistics.

Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren,and Percy Liang. 2018. Generating sentences byediting prototypes. Transactions of the Associationfor Computational Linguistics, 6:437–450.

Charlton Hinman. 1963. The printing and proof-reading of the first folio of Shakespeare, volume 1.Oxford: Clarendon Press.

Thomas Hobbes. 1651 [really 1700?]. Leviathan, or,the matter, form, and power of a common-wealth ec-clesiastical and civil. By Thomas Hobbes of Malmes-bury. Number R13935 in ESTC. [false imprint]printed for Andrew Crooke, at the Green Dragon inSt. Pauls Church-yard, London.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In 2nd InternationalConference on Learning Representations, ICLR

2014, Banff, AB, Canada, April 14-16, 2014, Con-ference Track Proceedings.

Durk P Kingma, Shakir Mohamed, Danilo JimenezRezende, and Max Welling. 2014. Semi-supervisedlearning with deep generative models. In Advancesin neural information processing systems, pages3581–3589.

Noel Malcolm. 2014. Editorial Introduction. InLeviathan, volume 1. Clarendon Press, Oxford.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external clus-ter evaluation measure. In Proceedings of the 2007joint conference on empirical methods in naturallanguage processing and computational natural lan-guage learning (EMNLP-CoNLL), pages 410–420.

Maria Ryskina, Hannah Alpert-Abrams, Dan Garrette,and Taylor Berg-Kirkpatrick. 2017. Automatic com-positor attribution in the first folio of shakespeare. InProceedings of the 55th Annual Meeting of the As-sociation for Computational Linguistics (Volume 2:Short Papers), pages 411–416, Vancouver, Canada.Association for Computational Linguistics.

Nikita Srivatsan, Jonathan Barron, Dan Klein, and Tay-lor Berg-Kirkpatrick. 2019. A deep factorizationof style and structure in fonts. In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 2195–2205, Hong Kong,China. Association for Computational Linguistics.

Akira Takano. 2016. Thomas Warren: A Printer ofLeviathan (head edition). Annals of Nagoya Univer-sity Library Studies, 13:1–17.

Christopher N. Warren, Pierce Williams, Shruti Rijh-wani, and Max G’Sell. 2020. Damaged type andAreopagitica’s clandestine printers. Milton Studies,62.1.

Adrian Weiss. 1992. Shared Printing, Printer’s Copy,and the Text(s) of Gascoigne’s ”A Hundreth SundrieFlowres”. Studies in Bibliography, 45:71–104.

A Character wise quantitative analysis

The quantitative experiments were performed onthe following character classes: A, B, E, F, G, H,M, N, R, W.


λ-only 0.77 0.82 0.89 264.90VAE-only 0.33 0.38 0.5 230.45No-residual 0.79 0.85 0.90 231.45Our Model 0.78 0.86 0.89 226.25

Table 2: Results for character A

https://www.aclweb.org/anthology/P13-1021

https://www.aclweb.org/anthology/P13-1021

https://doi.org/10.3115/v1/N15-1109

https://doi.org/10.3115/v1/N15-1109

https://doi.org/10.3115/v1/N15-1109

http://arxiv.org/abs/1312.6114

http://arxiv.org/abs/1312.6114

https://doi.org/10.18653/v1/P17-2065

https://doi.org/10.18653/v1/P17-2065

https://doi.org/10.18653/v1/D19-1225

https://doi.org/10.18653/v1/D19-1225

http://www.jstor.org/stable/40371958





Table 3: Results for character B



Table 4: Results for character E



Table 5: Results for character F



Table 6: Results for character G



Table 7: Results for character H



Table 8: Results for character M



Table 9: Results for character N



Table 10: Results for character R



Table 11: Results for character W

arXiv:2005.01646v1 [cs.LG] 4 May 2020(Kingma and Welling,2014) capture the more visu-ally salient...

Documents

Transcript of arXiv:2005.01646v1 [cs.LG] 4 May 2020(Kingma and Welling,2014) capture the more visu-ally salient...