Cross Media Publishing of MediaWiki Content
Source: worldcomp-proceedings.com/proc/p2012/EEE2841.pdf



Cross Media Publishing of MediaWiki Content

Thorsten Kisner

Management and Engineering
AHT GROUP AG
Essen, Germany

Email: [email protected]

Abstract: Wikis are excellent tools for distributed authoring, knowledge management and collaboration. Even for inexperienced users, the markup syntax is easy to understand and tends not to be an obstacle to participating in the authoring process. Widely used in universities, enterprises and communities, Wikis represent a web based content management system for collaborative working and authoring. In many situations users want to have the content of a Wiki or selected articles available in types of media other than the web based online version. This is the domain of cross media publishing systems, which are able to create several output formats from one single source. In this paper we present a module of the cross media publishing framework openFuXML that integrates Wiki content and renders it to different output formats.

I. INTRODUCTION

Wikis have become a widely used tool on the Internet and in universities, schools and enterprises since their invention in 1995, with a significant impact on society through Wikipedia. An impressive increase of Wikis as a platform for collaborative knowledge sharing [1], ranging from some loosely connected articles up to vast databases with hundreds of thousands of articles and dense connections between these, has taken place during the last ten years.

In the area of education, projects like Wikibooks or Wikiversity are taking advantage of the community to produce textbooks, course materials or curricula to support and organize educational activities.

The impact of Wikibooks on the textbook industry is described in [2] in the context of community processes. It is stated that communities working on Wikibooks are currently only loosely connected and mostly unorganized; nevertheless these communities are capable of producing reasonable results under these conditions.

Especially with hyperlinked articles and semantic links between these, Wikis are a significant step forward from collections of printed material, although traditional Wikis only provide a direct view on one article. Often a user wishes to combine several articles or include articles in his own content and produce a document with a consistent layout: a project manager wants to compile a dossier with several CVs available as individual articles, a tourist wants to create an individual sightseeing guide of several points of interest available in Wikipedia, or an author wants to include a Wiki article in his document or compile a document from several Wiki articles. In general there are several situations where users want to make a snapshot of a collection of articles with a specified revision, e.g. a release version of a software documentation or report, and provide this as a printout.

After deciding to create a document containing several Wiki articles, authors are confronted with the question of how to do this effectively. The simplest way of integration (“Copy & Paste”) is sufficient for single articles. For content structures based mainly on Wiki sources with dozens of articles, this becomes a cumbersome task, even more so if it must be repeated after future changes of content. This paper introduces a module of the cross media publishing framework openFuXML [3] that does this automatically by converting Wiki articles into XML and utilizing the framework to create high quality textbooks (PDF), web pages (HTML) or ebooks (EPUB) from the same content repository.

The remainder of this paper discusses the related work in Section II. Section III introduces the cross media publishing and authoring framework openFuXML and is followed by a discussion of the integration of Wiki content from a technical point of view in Section IV. Section V describes the content compilation, and Section VI presents results and ends with an overview of future work.

II. RELATED WORK

While we are not aware of other work extracting content from Wiki to XML with the objective to apply cross media publishing systems for ebook or high quality textbook production, several approaches exist to access and process content stored in Wikis.

General work in extracting information from unstructured content [4], extracting lexical knowledge [5] from Wiki articles as well as using natural language processing systems for Wiki content [6] has been carried out.

A wide range of bots or bot frameworks exists for automatic or semi-automatic content editing of Wiki text. This includes tasks such as spell checking, link checking or finding duplicates or disambiguation in the database.

Pluggable rendering engines are also related to our work. For example, the Wiki markup rendering engine Radeox is proposed in [7]; this project, written in Java, aims to provide a generic translation of Wiki markup. The engine VersoWiki is implemented in PHP and allows the transformation of Wiki markup to HTML and back. Common features of other PHP based engines like PmWiki or TextWiki are the usage of regular expressions and HTML as the primary output format.

Notable work was undertaken in [8] to create a data interchange format for page content of MediaWiki. A Document Type Definition (DTD) is proposed (and implemented) to describe the page content; however, this is only done for meta information like revisions, dates etc. The content itself is not transformed into XML.

Also notable are the built-in MediaWiki functions “Printable Version” and “Add Page to Book”. While the first function merely creates an HTML page without navigation elements, the second function allows users to create a PDF containing several articles. These generated PDF documents lack customized tables of contents and other editorial elements like lists of figures, tables or index registers. The image quality is equal to the screen resolution only and formulas are created (with poor quality) as images. While these shortcomings may not be recognized by inexperienced users, they will soon realize that the content cannot be modified, reordered or extended. These serious restrictions for the creation of different output formats are addressed in this paper.

III. THE CMP FRAMEWORK OPENFUXML

A. Cross media publishing

Cross media publishing (CMP) describes a publishing process that creates different types of media (e.g. books, CD-ROMs, web pages, e-books) from one single source (“single source publishing”). The content is managed and stored in a media independent manner; text elements are often managed using the Extensible Markup Language (XML). Images are often available in the source format; openFuXML maintains two different formats: (i) a scalable vector format (Encapsulated PostScript (EPS) or Scalable Vector Graphics (SVG)) optimized for printing or further processing and (ii) a raster image (Portable Network Graphics (PNG)) for screen output.

Using this concept of a single source, a cross media publishing process is able to render the content automatically for different types of target formats. This results in both a flexible and cost efficient way to store and maintain content repositories, since only a single repository has to be managed, thus avoiding redundancies of the same content for different media types.

B. History

The cross media publishing framework openFuXML became accessible in February 2007 and was initially presented at ED-MEDIA 2007 in Vancouver [9]. The development process started at the University of Hagen, Germany, in 2003, when a new development of a genuine cross media publishing system was agreed upon following extensive evaluation of existing systems. None of those was assessed as suitable for publishing learning material with enriched interactive components in the multimedia format HTML, or for providing high quality for the printing press under the economic constraint of authoring and publishing hundreds of courses cost efficiently. The development project received funding until 2005 and was then continued by the Department of Communication Systems and the Center for Development of Distance Teaching of the University of Hagen. The publishing system openFuXML is available as open source software under a GPL licence in the CampusSource Software Exchange and on SourceForge.net; the hosting of the sources migrated to SourceForge.net in 2008. The migration to SourceForge.net enables external contributions to the development process and allows users to apply the system with its latest enhancements and features [3].

C. Media Concept

The authoring framework openFuXML [9] offers structural and formatting elements as well as editorial elements (see Figure 1) which are well known from other word processing systems: common structural elements like sections, paragraphs or lists are available as well as visual formatting like bold, italic, verbatim or underline. Different markups like footnotes, marginalia or literature references are supported and belong to the editorial elements.

[Figure 1. Media Concept of openFuXML. Components shown: Structural Element, Didactical Element, Subject Element, Editorial Element, Formatting Element]

We worked together with professors from different faculties to be able to offer special subject environments for particular target groups such as mathematicians, lawyers, social scientists, engineers and computer scientists. The available didactical elements cover a wide range of concepts, e.g. examples, hints, exercises and solutions, author information, prerequisites and learning objectives.

IV. THE OPENFUXML WIKI ENGINE

A. Wiki Markup Transformation

Listing 1 shows a simple Wiki article only containing a few headers on different levels. Knowing the rules of applying different levels of headers with the symbol “=”, one directly understands the structure.

Listing 1. MediaWiki markup

    ==The openFuXML Wiki Engine==
    ===Wiki Markup Transformation===
    ====Wiki Example====
    ====openFuXML Example====

The transformation of this Wiki markup to XML is shown in Listing 2 and described in detail in [10]. The hierarchical and semantic representation in XML is much more complex, but its advantages can easily be demonstrated. If one wishes to include the Wiki article in Listing 1 into a pre-defined subsection, all header depth levels need to be modified manually. In XML (Listing 2) the section element can directly be inserted (or referenced) within another section.

Listing 2. MediaWiki markup transformed to XML

    <section>
      <title>The openFuXML Wiki Engine</title>
      <section>
        <title>Wiki Markup Transformation</title>
        <section>
          <title>Wiki Example</title>
        </section>
        <section>
          <title>openFuXML Example</title>
        </section>
      </section>
    </section>
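The step from Listing 1 to Listing 2 can be illustrated with a small sketch. It is not the openFuXML implementation (which is described in [10]); it only assumes that heading lines carry the same number of “=” on both sides and that body text may be ignored. The function name `headings_to_sections` and the wrapper element `root` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A heading line: leading '=' run, title text, trailing '=' run.
HEADING = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")


def headings_to_sections(markup: str) -> ET.Element:
    """Nest MediaWiki headings as <section><title>...</title></section>."""
    root = ET.Element("root")      # hypothetical wrapper element
    stack = [(0, root)]            # (heading level, element) pairs
    for line in markup.splitlines():
        match = HEADING.match(line)
        if not match or len(match.group(1)) != len(match.group(3)):
            continue               # body text is ignored in this sketch
        level = len(match.group(1))
        # Leave all open sections on the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section")
        ET.SubElement(section, "title").text = match.group(2)
        stack.append((level, section))
    return root


listing_1 = """==The openFuXML Wiki Engine ==
===Wiki Markup Transformation ===
====Wiki Example====
====openFuXML Example====
"""
print(ET.tostring(headings_to_sections(listing_1), encoding="unicode"))
```

Because the nesting depth is encoded by the element hierarchy rather than by the number of “=” characters, the resulting top-level section can be appended below any other section without touching its children.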

The Content Transformation process includes text and media objects. If source images are available as Scalable Vector Graphics (SVG), this format is losslessly converted into EPS and used for the PDF output rather than just copying low quality bitmap images. Mathematical expressions are extracted from the “alt” attribute of the formula images and are thus available in LaTeX notation for further processing. The output of this step is a set of different content objects, mostly articles (ofx:section) with text, images, links and references.
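How the LaTeX source of a formula might be recovered from rendered Wiki HTML is sketched below with Python's standard `html.parser`. The sketch assumes that formula images are `img` elements marked with a CSS class (here `tex`); that marker, the class name `MathAltExtractor` and the function `extract_formulas` are assumptions for illustration and not taken from openFuXML.

```python
from html.parser import HTMLParser


class MathAltExtractor(HTMLParser):
    """Collect the alt text of <img> elements marked as rendered formulas."""

    def __init__(self, css_class: str = "tex"):
        super().__init__()
        self.css_class = css_class       # assumed marker class
        self.formulas: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes and attributes.get("alt"):
            self.formulas.append(attributes["alt"])


def extract_formulas(html: str) -> list[str]:
    """Return the alt texts of all formula images in document order."""
    parser = MathAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.formulas


page = '<p><img class="tex" alt="E = mc^2" src="f.png"/> <img alt="logo" src="l.png"></p>'
print(extract_formulas(page))
```

Images without the marker class are skipped, so ordinary illustrations with an alt text are not mistaken for formulas.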

B. Template and Macro Transformation

The definition of templates and environments is a sophisticated way for users and operators of Wiki servers to customize the layout of the pages. On the one hand, this is the worst case for external rendering or transformation engines, because users are free to define whatever they want in these templates and the actual presentation is often done in conjunction with CSS. On the other hand, there is semantic information available in the form of key-value pairs. Obviously, the rendering engine needs to understand this information to be able to render it in the right way.

Besides a generic Template Transformator mapping the key-value pairs to a simple table, users are free to decide how a template will be processed. In the configuration file a unique transformation class can be chosen for each template. The outcome of this transformation class may be either a valid XML document with the namespace of the authoring framework or an intermediate XML format with an arbitrary namespace, which is translated in further processing to a valid openFuXML document.
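The idea of a generic fallback plus per-template transformation classes can be sketched as follows. The sketch only handles flat template calls of the form `{{Name|key=value}}`; the element names `table`, `row` and `cell`, the registry `TRANSFORMERS` and all function names are hypothetical and do not reflect the openFuXML schema or configuration format.

```python
import xml.etree.ElementTree as ET


def parse_template(markup: str) -> tuple[str, dict[str, str]]:
    """Split a flat '{{Name|key=value|...}}' call into name and key-value pairs."""
    inner = markup.strip()[2:-2]                 # drop the surrounding braces
    name, *parts = [part.strip() for part in inner.split("|")]
    pairs = dict(part.split("=", 1) for part in parts if "=" in part)
    return name, {key.strip(): value.strip() for key, value in pairs.items()}


class GenericTableTransformer:
    """Fallback transformation: one table row per key-value pair."""

    def transform(self, name: str, pairs: dict[str, str]) -> ET.Element:
        table = ET.Element("table", {"template": name})
        for key, value in pairs.items():
            row = ET.SubElement(table, "row")
            ET.SubElement(row, "cell").text = key
            ET.SubElement(row, "cell").text = value
        return table


# Hypothetical configuration: template name -> transformation class.
TRANSFORMERS: dict[str, type] = {}


def transform_template(markup: str) -> ET.Element:
    """Use the configured class for a template, else the generic table."""
    name, pairs = parse_template(markup)
    transformer = TRANSFORMERS.get(name, GenericTableTransformer)()
    return transformer.transform(name, pairs)


element = transform_template("{{Infobox|author=Kisner|year=2012}}")
print(ET.tostring(element, encoding="unicode"))
```

Registering a class under a template name in `TRANSFORMERS` replaces the table output for that template only, which mirrors the per-template choice described above.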

V. CONTENT COMPILATION

A. Structure Definition

All references to external Wiki content are defined inside the container element wiki:content. The most important elements are wiki:page and wiki:category:

• wiki:page This element represents an article (a Wiki page), which will be inserted at this position. The child element has to be an ofx:section; the optional attribute transparent="true" will insert the content of the article at this position instead of creating its own section with the complete article.

• wiki:category All articles tagged with the given category are summarized, either as individual sections or condensed into a synoptic table.

Document structures are defined in XML with the openFuXML authoring framework. Listing 3 (with partly omitted XML namespaces) shows an example of such a definition.

Listing 3. Example of XML structure definition

    <?xml version="1.0" encoding="UTF-8"?>
    <ofx:ofxdoc>
      <ofx:metadata>
        <title>HelloWorld</title>
      </ofx:metadata>
      <ofx:content>
        <section source="text/introduction.xml"/>
        <section>
          <title>Use Cases</title>
          <wiki:content>
            <page name="Category:Use_Case" depth="0"/>
          </wiki:content>
          <wiki:content>
            <category name="Use_Case">
              <section/>
            </category>
          </wiki:content>
        </section>
      </ofx:content>
    </ofx:ofxdoc>

The document has two top-level sections:

• The section Introduction demonstrates the inclusion of an XML document as an external source. The directive will be completely replaced by the content of the external document.

• The section Use Cases starts with the content of the Wiki page Category:Use_Case and is followed by all pages labeled with the category Use_Case as individual sub-sections.

B. Content Compilation

The structure definition document is the source for the process of Content Compilation. In a couple of steps, all external MediaWiki servers are contacted and the defined articles are locally stored for further processing.

1) Since all container elements can be defined as references (with external=true and a source attribute), the root document is parsed and all external content elements are merged into the root document.

2) The document at this stage contains the complete local content. In this step the document is parsed for external wiki:content elements (see Listing 3) and each element is replaced by a local (external) content element pointing to a file.

3) Depending on the configuration (never, always or if the Wiki page is updated), the Wiki page is requested from the server and saved as Wiki markup.

4) The Wiki markup is transformed to XML; this step is explained in Section IV-A.

5) The newly created external elements are merged into the root document.

6) References are processed with customizable strategies for internal and external links.
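Two of these steps lend themselves to a short sketch: the decision whether a page has to be requested again (step 3) and the merging of externally referenced elements into the root document (steps 1 and 5). The policy names `never`, `always` and `updated`, the revision numbers and both function names are assumptions for illustration; the sketch keys only on the source attribute and does not resolve nested references.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def needs_fetch(policy: str, local_rev: Optional[int], remote_rev: int) -> bool:
    """Decide whether a Wiki page is requested from the server again."""
    if policy == "always":
        return True
    if policy == "never":
        return False
    # "updated": fetch if nothing is stored yet or the server copy is newer.
    return local_rev is None or remote_rev > local_rev


def merge_external(root: ET.Element, load: Callable[[str], ET.Element]) -> None:
    """Replace every child carrying a 'source' attribute by the loaded element.

    Elements loaded here are not scanned again, so nested references
    inside them stay unresolved in this sketch.
    """
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            source = child.get("source")
            if source is not None:
                parent[index] = load(source)


# In-memory stand-in for files on disk (hypothetical content).
store = {"text/introduction.xml": ET.fromstring(
    "<section><title>Introduction</title></section>")}
document = ET.fromstring(
    '<content><section source="text/introduction.xml"/>'
    "<section><title>Use Cases</title></section></content>")
merge_external(document, store.__getitem__)
print([section.findtext("title") for section in document])
```

Passing the loader as a function keeps the merge step independent of where the referenced content comes from, whether a local file or a freshly transformed Wiki page.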


The resulting XML document contains all content information and can be processed in the rendering engines for cross media publishing. Depending on the target format and its configuration, editorial elements like tables of contents, figures or lists are available. Didactical and subject elements can be generated by individual Template Transformators. The process of content compilation and cross media publishing is outlined in Figure 2.

[Figure 2. Content Compilation with openFuXML. Components shown: Authoring Framework, Rendering Process, PDF, HTML]

C. Content Structuring

A general issue for all tools and algorithms trying to transform a collection of hypertext documents into a linear book is the ordering of the different pages in the book. Let us assume the root page R of the hypertext collection points to pages A, B and C, and they respectively point to pages A1, A2, B1, B2, B3, and C1. Should the pages be ordered in a depth first fashion (R, A, A1, A2, B, B1, B2, B3, C, C1) or in a breadth first fashion (R, A, B, C, A1, A2, B1, B2, B3, C1)?

Even with this simplified example it is easy to find arguments for and against each of the strategies; with additional links, e.g. from A1 to B2 or from A1 to B2 to C2 to A1, both strategies will fail.
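The two orderings of the example can be reproduced with a few lines of Python. This is only an illustration of the question raised above, not part of openFuXML; the dictionary `LINKS` encodes the example pages, and the visited sets are an addition so that the traversals terminate when cross links or cycles are present.

```python
from collections import deque

# Link structure of the example: page -> pages it points to.
LINKS = {
    "R": ["A", "B", "C"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1"],
}


def depth_first(links, root):
    """Order pages by following each link as deep as possible first."""
    order, seen = [], set()

    def visit(page):
        if page in seen:
            return
        seen.add(page)
        order.append(page)
        for target in links.get(page, []):
            visit(target)

    visit(root)
    return order


def breadth_first(links, root):
    """Order pages level by level, starting at the root."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(depth_first(LINKS, "R"))
print(breadth_first(LINKS, "R"))
```

With an extra link such as A1 to B2, the depth first order would place B2 below A1 and before B, which shows how a single cross link changes the resulting book structure.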

One solution implicitly used in Figure 2 is a manual ordering. This is implemented for the wiki:page and wiki:category directives.

A collection of hyperlinked Wiki articles can be described as a directed graph, an ordered pair $G = (V, E)$ with $V$ as a set of elements called vertices or nodes, and $E$ as a set of ordered pairs of nodes called directed edges.

[Figure 3. Tree structure of a textbook. Node labels by level: R; A, B, C; A.1, A.2, C.1, C.2; A.2.1, A.2.2]

A textbook can be interpreted as a special case of a graph $G = (V, E)$ with $|V| = n$ nodes and $|E| = m$ edges, called a tree. A node is equivalent to a section or subsection and an edge represents the relationship “is parent of”. The additional constraints that (i) there is exactly one path between any two nodes, (ii) $G$ is minimally connected with $m = n - 1$ and (iii) $G$ is maximally acyclic (i.e. there is no cycle at all) describe a tree [11], as shown in Figure 3.
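The tree conditions can be checked mechanically. The sketch below tests $m = n - 1$ together with connectivity, which for a finite graph is equivalent to the three constraints above; it assumes that every edge endpoint is contained in the node set. The node and edge sets are read off the labels of Figure 3 (the parent assignment is inferred from the numbering), and the function name `is_tree` is hypothetical.

```python
def is_tree(nodes: set[str], edges: set[tuple[str, str]]) -> bool:
    """Check m = n - 1 and connectivity for an 'is parent of' edge set."""
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    neighbours = {node: set() for node in nodes}
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)    # direction is irrelevant here
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for other in neighbours[stack.pop()] - seen:
            seen.add(other)
            stack.append(other)
    return seen == nodes


NODES = {"R", "A", "B", "C", "A.1", "A.2", "C.1", "C.2", "A.2.1", "A.2.2"}
EDGES = {
    ("R", "A"), ("R", "B"), ("R", "C"),
    ("A", "A.1"), ("A", "A.2"),
    ("C", "C.1"), ("C", "C.2"),
    ("A.2", "A.2.1"), ("A.2", "A.2.2"),
}
print(is_tree(NODES, EDGES))
```

Adding a single cross reference as an edge, for example from A.1 to B, violates $m = n - 1$ and the check fails, which is the situation hyperlinked Wiki articles are usually in.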

Table I. Nodes and edges of Wiki and textbook graphs

    Type        Nodes V         Edges E
    Wiki        Article         has a reference to
    Textbook    (sub)section    has subsection

These theoretical descriptions reflect our intuitive understanding of the structure of a textbook and of Wiki articles and are summarized in Table I.

The required structure¹ for a textbook and the graphical representation of a tree are outlined in Figure 3.

Figure 4 shows an example of linked hypertext documents. With the graph metrics in-degree and out-degree it is possible to identify the node A as a root node and F as a leaf. But this information does not determine whether F should be a child of A or of D. Cyclic links (shown with B, D and E) must be detected and resolved during the transformation to a tree-like structure.
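Both observations, the degree metrics and the cycle, can be sketched in Python. The edge set of Figure 4 is not recoverable from the text, so the dictionary `LINKS` below is a hypothetical link structure that merely matches the description (A as root, F as leaf reachable from A and D, a cycle through B, D and E); the function names are likewise hypothetical.

```python
# Hypothetical link structure matching the description of Figure 4.
LINKS = {
    "A": ["B", "C", "F"],
    "B": ["D"],
    "D": ["E", "F"],
    "E": ["B"],
}


def degrees(links):
    """Return (in_degree, out_degree) dictionaries for all nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    out_degree = {node: len(links.get(node, [])) for node in nodes}
    in_degree = {node: 0 for node in nodes}
    for targets in links.values():
        for target in targets:
            in_degree[target] += 1
    return in_degree, out_degree


def find_cycle(links):
    """Return one cycle as a list of nodes, or None if there is none."""
    done, path = set(), []

    def visit(node):
        if node in path:                 # back edge: cycle found
            return path[path.index(node):]
        if node in done:
            return None
        path.append(node)
        for target in links.get(node, []):
            cycle = visit(target)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for node in links:
        cycle = visit(node)
        if cycle:
            return cycle
    return None


in_degree, out_degree = degrees(LINKS)
print(in_degree["A"], out_degree["F"], in_degree["F"], find_cycle(LINKS))
```

An in-degree of 2 for F reflects the ambiguity mentioned above: the degree metrics identify F as a leaf but leave open which of its two linking pages becomes its parent.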

A couple of algorithms and heuristics are implemented to support users with the compilation of hyperlinked articles into textbooks. The directive wiki:page supports the attribute depth to specify the number of links which should be followed during content processing. Another metric is the distance $d_R$ to manually defined root elements. This is demonstrated in Figure 5: the node C might belong to A or R2, but the algorithm will attach it to R2 because $d_{R2,C} < d_{R1,C}$.

¹ Of course this structure does not restrict the usage of internal references by authors (“see section x.y”).

[Figure 4. Hypertext documents. Nodes shown: A, B, C, D, E, F]

[Figure 5. Minimization of root distance $d_R$. Nodes shown: R1, R2, A, B, C, A.1, C.1, C.2]
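A minimal sketch of the root distance heuristic follows. It measures $d_R$ as the number of links on the shortest path from a root and attaches every page to the nearest root. The link structure `LINKS` is a hypothetical reading of Figure 5 (C is linked from both A and R2), ties are resolved in favour of the root listed first, and the function names are not taken from openFuXML.

```python
from collections import deque

# Hypothetical link structure in the spirit of Figure 5.
LINKS = {
    "R1": ["A", "B"],
    "A": ["A.1", "C"],
    "R2": ["C"],
    "C": ["C.1", "C.2"],
}


def distances(links, root):
    """Shortest link distance from root to every reachable page (BFS)."""
    dist, queue = {root: 0}, deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def nearest_root(links, roots):
    """Map every reachable page to the root with the smallest distance."""
    best = {}
    for root in roots:
        for page, d in distances(links, root).items():
            if page not in best or d < best[page][1]:
                best[page] = (root, d)
    return {page: root for page, (root, _) in best.items()}


assignment = nearest_root(LINKS, ["R1", "R2"])
print(assignment["C"], assignment["A.1"])
```

Here C is two links away from R1 but only one link away from R2, so it is attached to R2 together with its children C.1 and C.2.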

Wiki pages can be labeled with meta-information like categories. If this information is available, a clustering algorithm can be applied which tries to cluster all pages with the same meta-information under the same parent. The user can specify an ordered list of keywords to give them a high priority in the algorithm.
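One simple reading of this heuristic is sketched below: every page is assigned to exactly one of its categories, and categories named earlier in the user's keyword list win. The actual clustering algorithm is not specified in the text, so the selection rule, the alphabetical tie-break and the function name `cluster_by_category` are assumptions.

```python
def cluster_by_category(page_categories, priority=()):
    """Group pages under one category each; earlier priority entries win."""
    rank = {category: index for index, category in enumerate(priority)}
    clusters = {}
    for page, categories in page_categories.items():
        if not categories:
            continue                     # pages without categories are left out
        # Prefer prioritized categories, then fall back to alphabetical order.
        chosen = min(categories, key=lambda c: (rank.get(c, len(rank)), c))
        clusters.setdefault(chosen, []).append(page)
    return clusters


# Hypothetical pages and their categories.
PAGES = {
    "Login": ["Use_Case", "Security"],
    "Export": ["Use_Case"],
    "Firewall": ["Security"],
    "Notes": [],
}
print(cluster_by_category(PAGES, priority=["Use_Case"]))
```

Changing the priority list to `["Security"]` moves the page Login from the Use_Case cluster to the Security cluster without touching the other pages.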

Without semantic information, the local clustering coefficient $c_i$ [12] can be used. The ratio of the existing connections among the neighbors of node $i$ to all possible connections among them is represented by $c_i$; a high $c_i$ for hyperlinked articles suggests that articles with common neighbors tend to be neighbors, too.
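For an undirected graph, the coefficient of a node with $k$ neighbors is the number of links among those neighbors divided by $k(k-1)/2$. The sketch below computes it under the assumption that hyperlinks are treated as undirected and that the adjacency sets are symmetric; nodes with fewer than two neighbors get the value 0. The example graph and the function name are hypothetical.

```python
from itertools import combinations


def local_clustering(neighbours, node):
    """c_i: existing links among the neighbours of node / possible links."""
    around = neighbours[node] - {node}
    k = len(around)
    if k < 2:
        return 0.0                       # no neighbour pair to compare
    linked = sum(1 for a, b in combinations(around, 2) if b in neighbours[a])
    return linked / (k * (k - 1) / 2)


# Hypothetical undirected graph: node -> set of adjacent nodes.
GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(local_clustering(GRAPH, "A"), local_clustering(GRAPH, "B"))
```

In the example, only one of the three possible links among the neighbors of A exists, giving a coefficient of one third, while the two neighbors of B are linked to each other and yield a coefficient of 1.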

VI. CONCLUSION AND FUTURE WORK

This approach combines the advantages of both environments: the simplicity of markup and authoring of Wikis and the flexibility of structured XML content objects for cross media publishing. The system has been successfully tested with an internal knowledge management Wiki, a Wiki for software development documentation and learning materials created in a distributed authoring process. It allows authors with basic IT knowledge to participate in the development process directly with the Wiki. Editors can work with the XML based authoring framework and directly include single articles, multiple articles or a selection of articles based on categories. The resulting document can be processed by the cross media publishing system openFuXML and rendered to HTML, PDF or EPUB. Since the content is based on XML, a transformation to other schemas like DocBook or to individual formats is possible.

Besides Wiki markup itself, the processing of custom templates is the most important step for reasonable results. Future work will focus on template processing, the creation of generic transformations and a repository for custom transformation classes, e.g. for templates used in Wikipedia or individual user generated templates.

REFERENCES

[1] A. Forte and A. Bruckman, “Constructing text: Wiki as a toolkit for (collaborative?) learning,” in WikiSym ’07: Proceedings of the 2007 international symposium on Wikis. New York, NY, USA: ACM, 2007, pp. 31–42.

[2] M.-F. G. Lin, S. Sajjapanroj, and C. Bonk, “Wikibooks and wikibookians: Loosely-coupled community or the future of the textbook industry?” in Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2009. Honolulu, HI, USA: AACE, June 2009, pp. 3689–3697.

[3] T. Kisner, H. Hemmer, and J. Rambow, “Recent developments and new fields of application of the cross-media-publishing framework openfuxml,” in Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2009. Honolulu, USA: AACE, June 2009.

[4] A. Doan, J. F. Naughton, R. Ramakrishnan, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. Derose, B. Gao, C. Gokhale, J. Huang, W. Shen, and B. Q. Vuong, “Information extraction challenges in managing unstructured data,” SIGMOD Rec., vol. 37, no. 4, pp. 14–20, 2008.

[5] T. Zesch, C. Muller, and I. Gurevych, “Extracting lexical semantic knowledge from wikipedia and wiktionary,” in Proceedings of the Conference on Language Resources and Evaluation (LREC), 2008.

[6] R. Witte and T. Gitzinger, “Connecting wikis and natural language processing systems,” in WikiSym ’07: Proceedings of the 2007 international symposium on Wikis. New York, NY, USA: ACM, 2007, pp. 165–176.

[7] M. L. Jugel and S. J. Schmidt, “The radeox wiki render engine,” in WikiSym ’06: Proceedings of the 2006 international symposium on Wikis. New York, NY, USA: ACM, 2006, pp. 33–36.

[8] J. Voss. (2007) Wikipedia DTD. (Last checked: 2011-04-03). [Online]. Available: http://meta.wikimedia.org/wiki/Wikipedia.DTD

[9] T. Kisner, J. Gellweiler, and F. Kaderali, “The cross-media-publishing framework openfuxml,” in Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2007, C. Montgomerie and J. Seale, Eds. Vancouver, Canada: AACE, June 2007, pp. 3137–3141.

[10] T. Kisner and J. Rambow, “The openFuXML Wiki Engine: Transforming Wiki Sources to Structured XML Content Objects,” in Proceedings of Second International Conference on Mobile, Hybrid, and On-line Learning (eL&mL 2010, Digital World 2010). St Maarten, Netherlands Antilles: IEEE Computer Society, February 2010.

[11] R. Diestel, Graph Theory. Springer, 2005.

[12] D. J. Watts and S. H. Strogatz, “Collective dynamics of small-world networks,” Nature, vol. 393, pp. 440–442, 1998.