SaariStory: A framework to represent the medieval history of Saarland Michael Barz, Jonas Hempel,...

SaariStory: A framework to represent the medieval history of Saarland

Michael Barz, Jonas Hempel, Cornelius Leidinger, Mainack Mondal Course supervisor: Dr. Caroline Sporleder

Text Mining for Historical DocumentsWS 2011/12

History of Saarland: Motivation

• Medieval History of Saarland

• Query: Which records talked about financial matters involving Peter in the year 600 to 1300

SaariStory

• Enabling keyword based search• Answering complex queries• Providing topic based search• Showing temporal changes in the number of

results for a time independent query

POS tagger

Noun extractor

DB of the text indexed with person,

location, dates and

topics

Topic extractor

Workflow of SaariStory

Search result annotated with the

topics

Preprocessing text

GraphicalUser Interface for querying

Tokenizer

Description of the data

• Two parts of the data• Data block: Chronologically sorted records of

events• Index block: index for all these data blocks with

alphabetically sorted keywords

Components of a data blocktimestamp

Characteristics of the data blocks

Number of words 200,646

Number of lines 15,021

Unique data blocks 1,490

Pages 612

Components of a index block

keywords

Dates connecting to index block

Characteristics of the index blocks

Number of words 86,485

Number of lines 10,803

Unique index blocks 934

Pages 277

POS tagger

Noun extractor

location, dates and

topics

Topic extractor

SaariStory

topics

Preprocessing text

Tokenizer

Preprocessing of the Data block

• Basic strategy:• Convert pdf to text using nitro pdf • Parse text to separate the data blocks

• Problem: How to separate data blocks?• New lines do not indicate starting of data blocks• Distinguish between start of a page and start of

a data block

• Solution: Regular expression• Each data block starts with a date: • yyyy-mm-dd, yyyy-mm, yyyy/mm, yyyy

• Use them to search “Regest” and “Druck” too

• Data structure to present the processed data

Preprocessing of the Index block

• Problem: How to separate index blocks?• The only way to separate them is to use the fact:

titles are in bold text• The bold annotation is lost in pdf to text

conversion

Preprocessing of the Index block

• Solution: pdf doc html • The bold annotations are preserved• Use regular expression to search for the

<b></b> tag• Took care of broken lines for line breaks

-0830-05- <next line> 10 -0830-05-10

Preprocessing of the index block

• Data structure to present the processed index data

POS tagger

Noun extractor

location, dates and

topics

Topic extractor

SaariStory

topics

Preprocessing text

Tokenizer

Tokenizer and POS tagger

• Need to process data from each data block• We use openNLP for this purpose• Open source• Easy to use• Pre-trained for German

Noun Extraction

• Straightforward to get from POS tagged text• ~6000 nouns from 1490 data blocks

• Problem: 22 minutes for only 6000 nouns

Noun Extraction

• Solution:• Run the POS tagger concurrently over data blocks• Feed only the tokens which could be nouns

we use the optimized version to have better quality of nouns

Method Time (seconds)

Speed up

Sequential 1322.6 -

Concurrent 12.4 100x

Concurrent with optimization 12.4 100x

POS tagger

Noun extractor

location, dates and

topics

Topic extractor

SaariStory

topics

Preprocessing text

Tokenizer

Topic extraction

• We use Latent Dirichlet allocation (LDA)• Proposed by Biel et al. in 2002• Assumption: Each document is a mixture of

small number of topics• Needed to set two parameters: Used trial and

error to get meaningful topics

Topic extraction

• Using LDA, try 1: Use blindly• Some very frequent no so meaningful words • E.g. Saarbrücken

• They are in every topic and every data block• Every topic is assigned to every data block

Topic extraction

• Using LDA, try 2: Remove some nouns• We remove the 15 most frequent nouns• Covered 24% of all nouns

• Problem: some of these nouns may be relevant to some documents

Topic extraction

• Using LDA, try 3: Only keep essential nouns• Calculate tf-idf score for word w in a data block d

tfw,d = #of occurrences of w in d

idfw =

tf-idfw,d = tfw,d x idfw

• If tf-idfw,d < 3.0 we remove w from d• Use LDA on the modified data block

Topic extraction

• We use LDA on the modified data blocks after using tf-idf score

• LDA identifies 7 meaningful topics with 10 words each• Bekanntmachung, Besitz, Finanzen, Vereinbarungen,

Familie, Schulden, Recht

POS tagger

Noun extractor

location, dates and

topics

Topic extractor

SaariStory

topics

Preprocessing text

Tokenizer

Putting the data into database

• Type of queries we want to answer• Enabling keyword based search• Answering complex queries• Providing topic based search• Showing temporal changes in the number of

results for a time independent query

Database schema

SELECT d.*, GROUP_CONCAT(DISTINCT t.topic_name) AS topic_names FROM (`data` AS d LEFT OUTER JOIN `data_topics` AS dt ON d.id = dt.data_id) LEFT OUTER JOIN `topics` AS t ON dt.topic_id = t.id WHERE d.startDate >= '0600-00-00' AND d.endDate <= '1600-00-00' AND dt.topic_id = 1 OR dt.topic_id = 6 AND d.data_block LIKE '%Keyword%' GROUP BY d.id ORDER BY d.startDate

Database settings

• We use MySQL database• First we used a web based sql provider• Not always live• Too slow in times of high load

• We set up a local database

Database # Rows filled Time to fill the DB (seconds)

Average time for query (seconds)

Remote 25,274 4,800 10

Local 25,527 300 3

POS tagger

Noun extractor

location, dates and

topics

Topic extractor

SaariStory

topics

Preprocessing text

Tokenizer

Graphical user interface

Evaluation

• Precision and recall is 100% for keyword based random queries• We compared manual results with results from our

system• The topic detection works! The following text is

labeled as “Familie” and “Finanzen”

Future work

• Improving our pre-processing step to include more corner cases

• Extract and save footnotes from the text• Possibility to add more data blocks to the

database• Taking care of line breaks in data blocks

POS tagger

Noun extractor

location, dates and

topics

Topic extractor

SaariStory

topics

Preprocessing text

Tokenizer

• Back up slides

SaariStory: Database design

User name

location date NAME_ID

Peter Saarbruecken 601 1

Paul Saarbruecken 699 2

DATA_ID date Data block from text

1 699 <string>

2 786 <string>

TOPIC_ID topic

1 CRIME

2 PUNISHMENT

NAME_ID DATA_ID

DATA_ID TOPIC_ID

Design challenge-1: stripping data from the pdf

• Text = data-block + index of places/names• How to parse data from data blocks in pdf• Convert the data into text using an online tool• Parse it to get data blocks using regular expression

• How to parse data from the index• The bold ones are the keywords• Convert the index into html and then parse

Design challenge-2: Extracting topics

SaariStory: A framework to represent the medieval history of Saarland Michael Barz, Jonas Hempel,...

Documents

Transcript of SaariStory: A framework to represent the medieval history of Saarland Michael Barz, Jonas Hempel,...

Administrator Update · Dr. Raj Shah and Ms. Annie Barz ... – NIA funds do not cover all that our ADCs want/need to do » Biomarker studies are expensive » Desire to supplement

Defending against large-scale crawls in online social networks Mainack Mondal Bimal Viswanath Allen Clement Peter Druschel Krishna Gummadi Alan Mislove.

VIC Development National Center of High-performance Computing Barz Hsu (barz@nchc.org.tw)barz@nchc.org.tw 15 January 2007.

China ABC Book!!!! :) Lucy Barz. A A is for Anyang. Anyang was one of the first cities that the Chinese built.

Online-Marketing für Bildungsinstitute · Online-Marketing für Bildungsinstitute - Eine erste Einführung Aufbauseminar Prof. Dr. Heiner Barz Klaudija Paunovic Donnerstag: 10:30

Home - Krieg & Fischer GmbH · Reference list (extract of more than 150 biogas plant reference) BLÜMEL 1994 Germany, bio-waste, 2 x 800 m³ digester volume, HPP 2 x 160 kWel BARZ

Mohsen Minaei*, Mainack Mondal, Patrick Loiseau, Krishna ...€¦ · Lethe:ConcealContentDeletionfromPersistentObservers 207 Sheeran’sdeletionofatweetfrom2011wasfoundand widely

Emosaic: Visualizing Affective Content of Text at Varying ...Emosaic: Visualizing Affective Content of Text at Varying Granularity Philipp Geuder, Marie Claire Leidinger, Martin von

Barz: Learning and Education in the Context of Social ... · Barz: Learning and Education in the Context of Social Milieus 6 Description of the Sinus-Milieus in Germany-West 2000

RV 2015: Shared-Use Mobility: Advancing Equitable Access in Low-Income and Disenfranchised Communities of Color by Sara Barz

Behind Barz July-Aug 2010 issue

Moving Beyond Set-It-And-Forget-It Privacy Settings on ...ckanich/papers/mondal2019moving.pdf · Mainack Mondal IIT Kharagpur / University of Chicago mainack@cse.iitkgp.ac.in Günce

Plant Research...Plant Research Editor-in-Chief Editorial Board E. Reinhard, Tubingen H. P. T. Ammon, Tiibingen Hippokrates Verlag Pharmazeutisches lnstitut W. Barz, Miinster Stuttgart

Behind Barz Mag Mar-Apr 2011

Behind Barz Jan-Feb 2012

Quantum processing by remote quantum control...Stefanie Barz-Towards a quantum internet Wolfgang Dür, Raphael Lamprecht and Stefan Heusler-Causal and causally separable processes

Journal Club MicroRNA In Vitro Diagnostics by Use of Immunoassay Analyzers A. Kappel, C. Backes, Y. Huang, S. Zafari, P. Leidinger, B. Meder, H. Schwarz,

Alien species on Sylt Seaweeds Maria Redeker, Carolin Barz, Olga Kolychalow.

Sassanian stucco decorations from the Ramavand (Barz ... fileSeymareh river. The excavations, although limited in extent and time, provided important information on architecture and

Torpe, B - Pace University Webspacewebpage.pace.edu/dnabirahni/rahnidocs/law802/A PERPETUAL... · Web viewIn fact, in Persian, the word “Barz” refers to crop seeds, and “Gar”