Zanichelli XML-based Dictionaries Editing System

Slide 1

Zanichelli XML-basedDictionaries Editing SystemDaniele Fusi

In this speech Ill show you an overview of our XML-based dictionaries editing system.11 - System RequirementsMultiple presentations, legacy content, operating environmentLets start from system requirements.2One content, multiple presentations

datacd-rom / dvdweb sites or servicespaper bookse-books

3Today it is a fairly obvious assumption that in digital editions we must provide a single content once, and yet present it in several different forms and media, according to our target and to the different versions of our product. For instance we might want to prepare our dictionary once, and then publish it in a web site or as a web service, or print a traditional paper book, generate an e-book or a cd-rom. Also, selecting from the same content we could deliver products for different targets, from tourists to students up to scholars.Existing environment: requirements

authors accustomed toWYSIWYG editing in Word processors no technical training

IT point of view text as a database query and interactivity multiple media and forms

editors content validation and uniformation text-based tools simple content structure

designers DTP pagination flattened structure import / export

Also, typically a working environment is already well-established and each team member has his own requirements and tools. For instance, authors are accustomed to write in Word processor in a What-You-See-Is-What-You-Get environment, nor they want to change their habits or face a difficult technical training. Editors in turn use tools and procedures for validating and polishing contents which typically are text-based and demand a simple content structure. Designers go even further and usually ask for a completely flattened structure which fits well into DTP software; yet they require not only to import content but also to export it once the last-minute corrections have been done by themselves. Finally, IT persons require to consider a dictionary as a sort of database, providing interactive query capabilities and content transformations to comply with the multiple presentations of the unique content.4Existing content: conversion

word processordocuments

3rd partyformats

Finally, more often than not we must deal with existing digital content, either in Word processor or in third-party formats, which require to be converted into our generic dictionary format to be further processed.5Digital format requirementstext-based storage, both machine- and user-readableusing standard technologies (portable & durable)open to expansion and customizationeasy to manipulateeasy to transform for import/exportfocused on semantics: content rather than its presentation

Having said that, our digital format must comply with a number of requirements: it must be text-based, so that it is both machine- and user-readable and complies with several existing procedures made for text; it should use standard techonologies, which grant longevity and portability; it must be open to expansion and customization to fit to different works; it must be easy to manipulate and transform so that it can pass through all the production steps and fit to different media and publishing forms; and it should be focused on semantics, i.e. on the content itself rather than on any of its specific presentations.6Content and semantics: dictionary

...lemma:

In a digital dictionary semantics is essential: lets make a sample. Say we have a traditional Word processor text representing a Greek-Italian dictionary like in this picture. Here we just have a long string of characters, so that software has no competence about the semantics of the text itself. Should we want to look for the lemma stoma, we could just make a full-text search: this way we would not only find the unique lemma stoma in the dictionary, but also all the occurrences of this word in the text, e.g. in samples. Thus we would have to browse several results until we find the desired one. Things would get worse if we want to make finer searches, e.g. look for all the lemmata built with suffix -ma, a very common indoeuropean suffix. In this case we would find thousands of words which are not lemmata, and our search would be almost useless.7Marking semantics in text: fields

lemmamorphologyetymontranslationsampleworketc ...Clearly this is not the way to go. In a text like this all the characters are equal, while we need a way of marking some portions of the text as semantically defined: e.g. some words represent the lemma, others represent morphological indications, etymology, translations, samples, samples works with their authors, etc.8

Semantic markup: applicationslemmamorphologyetymontranslationsampleworkalphabetical lemmata list, normal or invertedlist of lemmata grouped by grammatical categorylist of lemmata grouped by etymon (roots dictionary)rudimentary bidirectional dictionarylook for quotationlist of quoted works and authorsetc ...complex searcheslemmamorphologyetymonworketc ...

If we mark the whole text this way, we will then be able to exploit the full potential of a true digital edition. For instance, we might be able to look to just the list of lemmata, in normal or inverse order (the latter is useful when studying e.g. word formation), or group them by grammatical category (substantives, adjectives, verbs, etc.), etymology (thus getting a sort of roots dictionary), or even generate a rudimentary bidirectional dictionary by collecting all the Italian translations and listing them in alphabetical order (whence a sort of Italian-Greek dictionary). We might also look for specific quotations among the samples, or for specific authors or works, etc. Of course we can also combine these searches into complex queries, looking e.g. for all the substantives based on a specific root and quoted in samples from Homer, etc.: possibilities are endless.92 Solution overviewXML-based implementationSo, these are our requirements and objectives. Lets now see an overview of our solution.10Implementation: XMLXMLDictionaryUnicode text fileswidely used standardbuilt for openness and transformation (XSLT)representation of any kind of data, independently from their presentationhierarchical modelwell-fit to hierarchical model: letter, lemma, fieldstypically stored as text for existing worksdictionaryletterlemmafieldOur solution is based onto an XML implementation, which provides the following benefits: a dictionary is just a bunch of standard Unicode text files, structured with a widely used standard built right for openness and transformation. Also, XML is built to represent any kind of data, independently of their presentation, and spots a hierarchical model which fits well a dictionary, which typically is composed of letters which contain lemmata which contain several fields (lemma, morphology, translations, samples, etc.). Finally, existing digital data are already stored as text.11Sample: lemma and fieldslemma = dizionriodate = 1965grammar = s.m.translation 1 = complesso dei lemmi di un dizionario e sim.separatortranslation 2 = lista dei lemmidizionrio [1965] s.m. complesso dei lemmi di un dizionario e sim. lista dei lemmiLets see a sample of this hierarchical structure: here we have a typical dictionary item, consisting of lemma, date of first attestation, grammatical indications, and a couple of translations divided by a separator. Each of these elements represent a content field in the item.12dictionarytranslationseparatortranslationgrammardateHierarchical structurelemmalemmalemmata...letter

letters...These fields are included in a lemma item, and several of these items are included in a letter; several letters are finally included in a dictionary.13Minimalist structureFlat, yet extensiblesmallest depth satisfies practical requirementsfields vary at will accor-ding to the dictionary language and typevariability of fields compensates for relatively flat hierarchyWe thus have a minimalist hierarchy which can be represented as dictionary letter lemma field. This is the core of our solution, whose depth is thus kept at a minimum right to satisfy all the practical requirements seen above. The hierarchy is deep enough to effectively represent any dictionary, and even if it is fixed we can provide a great deal of customization because field types can be added at will.14Structure and compromisesPractical devicesfields define lemma parts: etymon, translation, grammar, samples, ...formatting is automatically derived from semantic structure (lemma = bold, grammar = italic, author = smallcaps, ...)text escapes define specific formatting for portions of field values, whenever they are not considered as semantically relevantI came by cabFocus on semantics1 field (sample) in lemma: hierarchy needs notto be deeper, yet allow emphasis on byAs a consequence of our focus on semantics, fields just define lemma parts like etymon, translation, grammar, samples, etc, while formatting is automatically derived from semantics (e.g. we might want to make the lemma bold, grammatical information italic, authors smallcaps, etc.). Yet, like in the case of its minimal structure this solution is a compromise between practical and theorical requirements: some practical devices (text escapes) are used to define specific formattings wherever their semantic relevance is not considered important. For instance, in a sample like I came by cab we might want to emphasize the preposition by printing it in italic, but semantically we are just interested in marking the sample sentence as such as a whole. The sample is just one field in the lemma and hierarchy needs not to be deeper; yet we allow emphasis on the preposition by with the aid of some textual escapes in the XML element value.15Storage: dataXML files: one file per lettereach dictionary has its own alphabet and sorting schemelemmata: automatically inserted in the proper file and at the proper position according to their contentlemma ID overriding for special sortingXML files(letters)lemma ct (du)acoteABSabiesse10 minutestenminutesAs for storage, a dictionary is just a bunch of XML files, one for each letter. Each dictionary has its own alphabet and sorting scheme, so that lemmata can be automatically inserted in the proper file position according to their content. Yet, the software solution also gives the possibility of overriding lemmata identifiers for sorting them in special ways (e.g. the lemma 10 minutes in a movies titles dictionary must be filed under letter T, as if it were written tenminutes).16Storage: metadataself-descriptive dictionary: additional XML files define:fields list and types within each dictionaryalphabet and sort order for each dictionary, including diacritics sensitivityother support dictionary-specific resources (e.g. frequently typed symbols, preview styles)prelemmaetymonabbreviationphoneticstranslationvariantgrammarcategory (A, B, C...)section (1, 2, 3...)separator ( ...)...abcdefghijklmnoprstuvzcroatianAlso, a dictionary is self-descriptive: a set of special XML files define fields types, alphabet and sort order, and eventually dictionary-specific editing resources like frequently typed symbols, preview styles, etc.173 EditingAuthorsLets now try to briefly follow the main production steps in this system, starting from authors.18Visual Editingvisual UI:authors build lemmata visually by blocks, and are shielded from underlying XML codeXML code integrity is granted by softwaretypographical preview is provided for WYSIWYG accustomed authorsXMLdata

file = letterletterlemmatafieldsXMLmetadata

So in this system a dictionary is just a set of XML files: each represents a letter, including lemmata which include fields; additionally, other XML files define each dictionarys metadata. Yet, authors must know nothing of all this as we provide them with a special editing software. This allows them to build lemmata visually by blocks, thus being shielded from the complexity of the underlying XML code; also, it grants XML code integrity, and provides authors with a typographical preview so that they can have an immediate feeling of what they are doing.19Editing software: editing by blocks

lemmata listvisual editing: fields in lemmatypographicalpreviewletterselectorHere is a snapshot of this software. As you can see, users can select letters and get a list of lemmata, and then select a lemma to get a list of blocks, which can be edited and filled at will. Whenever users change a block a typographical preview (bottom pane) is updated, thus providing users with the immediate feel of their changes. Authors just edit text by blocks, and the software generates and manipulates the underlying XML code.20Editing in distributed scenariosWeb based visual editingThis scenario can also be replicated in a remote environment.21Web: distributed scenariodictionaries are stored centrally in a web serveran ASP.NET web site manages accesses and versioning for different authors and worksvisual editing implemented as a Silverlight RIA, running from authors own computer, yet inside a web page:desktop-class responsiveness for applicationtrue platform independence (Mac / PC, IE / Mozilla / Safari)no need for software distribution and installationcentralized software maintenance

In this case, dictionaries (XML files) are stored centrally in a web server. An ASP.NET application manages accesses and versioning for different authors and works, while the editing software is implemented with a Silverlight rich internet application. This provides a desktop-class responsive application, true platform independence and all the benefits of a centralized administration and maintenance.22Distributed editing

SQL

database formanaging accessASP.NET serverapplication managesusers and worksversionsSilverlight application runs onclient computer for visual editingXMLauthorspecialized authoreditorThis scenario also allows multiple authors to work on the same dictionary. Each author has his own role and privileges, reflected in the editing application.23Visual editing in your web browser

lemmata listvisual editing: fields in lemmatypographicalpreviewletterselector

Heres a screenshot of the web editing Silverlight application. As you can see, it is very similar to its desktop counterpart.244 - RevisionEditorsOnce authors have finished, editors come into play.25Content revisions and transformations

merging differentversions (multipleauthors scenarios)editors validationand uniformationDTP paginationfor printing

Using specialized software, the unique XML content can be automatically processed for merging different versions of the same work (in a multiple-author scenario), validating and polishing text and importing it in a traditional DTP application for printing.26Automated revision and correction

test selectiontest descriptionresultsA companion software application allows editors to perform a number of very specialized validation tests thus ensuring full content proofing. Here you can see a screenshot of this application, showing test selection, description and results.275 - PublicationEditorsFinally, contents can be published for the end user.28One content, multiple outputs

printcd/dvdmobile devices(Mobipocket)web sitesAs shown before, we can easily start from the unique XML-based content and automatically transform it into a variety of different outputs: printed texts, cdroms, e-books for mobile devices, web sites, etc.29Extending the modelSample: RTL languages and root-based dictionariesFinally we can show the extensibility of this model with a very complex dictionary involving the Arabic language.30Arabic-Italian dictionaryclashing RTL/LTR text flowsspecial alphabetical order:several letters share the same rankdifferent sorting according to levelroot-based dictionary:letterrootlemmafieldexisting dictionaries structure must be kept unchanged even if a deeper hierarchy would be required+0621...+0627+0628+062A123...roots are sorted accordingto predefined scheme,lemmata in roots are arbitrarilysorted by authorsAn Arabic-Italian dictionary implies lots of peculiarities: we must accommodate for clashing right-to-left and left-to-right text flows, define a very peculiar alphabetic ordering (where many letters being complementary share the same position, and sorting differs according to the hierarchical level), and above all fit into our model a deeper structure, as we are dealing with a root-based dictionary. Instead of having just letters, lemmata and fields here we have letters, roots, lemmata and fields. Yet we must strive to keep our existing structure unchanged so that we can reuse the rest of our production environment.31letterHierarchy depthsOther dictionariesArabic: roots...lemma...lemma.........item = rootitem = lemmaitem = lemma = set of fieldsitem = root = set of fields, some delimitinglemmata boundariesWe thus have two competing hierarchies: one for all other dictionaries, and one for Arabic. As we must fit both in the same scheme, a trick is used: in other dictionaries a letter contains a sequence of lemmata, while in Arabic a letter contains a sequence of roots and lemmata. Even if these are laid out on the same level, roots are hierarchically higher than lemmata and a special software is provided to deal with this special relevance.32Deeper hierarchy illusion: special editorArabic-X editorXML structure unchanged: each file is a letter containing items, each item contains fieldsitems are roots, not lemmataa special field defines lemmata boundaries whithin each rootuser sees letters, roots, lemmata in root, fields in lemmata; XML structure remains letter-items-fields...lemma...lemmaA special editor provides users with the illusion of a deeper hierarchy: the XML structure is unchanged: each file is a letter containing items, each item contains fields. Anyway, here items are not lemmata but roots, and a specially added field defines lemmata boundaries whithin each root. Thus the user sees letters, roots, lemmata in root, fields in lemmata; but the inner XML structure remains letters, items, and fields.33Specialized editor: Arabic

letterselectorroots in letterlemmata in rootvisual editing: fields in lemmatypographicalpreview, bidirectional flows

Heres the special editor in action. Of course it must also deal with LTR text flow and Arabic editing, but the WPF technology used here offers a great deal of support for complex typography.34Arabic editor trick: advantagesuser experience is almost unchanged (there are 2 lists instead of 1 to choose from for editing, roots and lemmata)XML structure unchanged: all the other editorial processes require no change so that the new dictionary fits into them easilyfields variability (already responsible for structure expandability) makes this trick possibleone model, several viewsThus, even when dealing with an Arabic roots dictionary the user experience is almost unchanged and XML structure is unchanged, so that all the other editorial processes require no change. Again, its right the fields variability, already responsible for structure expandability, which makes this trick possible. Once again: one model, several views.35Daniele Fusi

http://www.fusisoft.it

[email protected]

Zanichelli XML-based Dictionaries Editing System

Documents

Transcript of Zanichelli XML-based Dictionaries Editing System