SaariStory: A framework to represent the medieval history of Saarland
Michael Barz, Jonas Hempel, Cornelius Leidinger, Mainack Mondal Course supervisor: Dr. Caroline Sporleder
Text Mining for Historical Documents, WS 2011/12
History of Saarland: Motivation
• Medieval History of Saarland
• Example query: Which records talk about financial matters involving Peter between the years 600 and 1300?
SaariStory
• Enabling keyword-based search
• Answering complex queries
• Providing topic-based search
• Showing temporal changes in the number of results for a time-independent query
Workflow of SaariStory
• Processing pipeline: Preprocessing text → Tokenizer → POS tagger → Noun extractor → Topic extractor → DB of the text indexed with person, location, dates and topics
• Query path: QUERY → Graphical user interface for querying → Search result annotated with the topics
Description of the data
• Two parts of the data
  • Data block: chronologically sorted records of events
  • Index block: index for all these data blocks with alphabetically sorted keywords
Components of a data block
• Timestamp
• Data
Characteristics of the data blocks
Number of words 200,646
Number of lines 15,021
Unique data blocks 1,490
Pages 612
Components of an index block
keywords
Dates connecting to index block
Characteristics of the index blocks
Number of words 86,485
Number of lines 10,803
Unique index blocks 934
Pages 277
Preprocessing of the Data block
• Basic strategy:
  • Convert PDF to text using Nitro PDF
  • Parse text to separate the data blocks
• Problem: How to separate data blocks?
  • New lines do not indicate the start of a data block
  • We need to distinguish between the start of a page and the start of a data block
Preprocessing of the Data block
• Solution: Regular expressions
  • Each data block starts with a date in one of the formats yyyy-mm-dd, yyyy-mm, yyyy/mm, yyyy
  • Use regular expressions to find “Regest” and “Druck” too
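The date-based splitting could look roughly like the following Python sketch. The pattern, the function name `split_data_blocks` and the sample text are illustrative assumptions, not the project's actual code; the sketch only relies on the stated fact that every data block begins with a date in one of the four formats.

```python
import re

# A line opens a new data block when it starts with yyyy-mm-dd, yyyy-mm,
# yyyy/mm or yyyy, followed by whitespace or the end of the line.
DATE_START = re.compile(r"^(\d{4}(?:-\d{2}-\d{2}|[-/]\d{2})?)(?=\s|$)")

def split_data_blocks(text):
    """Return (date, body) pairs; lines without a leading date extend the current block."""
    blocks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = DATE_START.match(line)
        if match:
            # Start a new block; keep whatever follows the date on the same line.
            blocks.append([match.group(1), [line[match.end():].strip()]])
        elif blocks:
            blocks[-1][1].append(line)
    return [(date, " ".join(part for part in parts if part)) for date, parts in blocks]

sample = "1147-03-12 Graf Simon schenkt Land\nRegest: Beispiel\n1150/06 Zweiter Eintrag\n1200\nNur ein Jahr"
print(split_data_blocks(sample))
```

A wrapped line that happens to begin with a four-digit number would be mistaken for a new block, so a real parser would need additional checks for such corner cases.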
Preprocessing of the Data block
• Data structure to present the processed data
Preprocessing of the Index block
• Problem: How to separate index blocks?
  • The only distinguishing feature is that titles are in bold text
  • The bold annotation is lost in PDF-to-text conversion
Preprocessing of the Index block
• Solution: PDF → DOC → HTML
  • The bold annotations are preserved
  • Use a regular expression to search for the <b></b> tag
  • Took care of dates broken across line breaks:
    -0830-05- <next line> 10 → -0830-05-10
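As a rough illustration of both steps, the Python sketch below first rejoins dates that were split across a line break and then cuts the HTML at every bold keyword. The patterns, function names and sample input are assumptions made for this example; the real converted HTML is likely messier.

```python
import re

# Bold keyword titles survive the conversion as <b>...</b>.
BOLD = re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL)
# A date cut after the month, e.g. "0830-05-" at a line end and "10" on the next line.
BROKEN_DATE = re.compile(r"(\d{4}-\d{2}-)\s*\n\s*(\d{2})")

def join_broken_dates(text):
    """Glue the day back onto a date that was split by a line break."""
    return BROKEN_DATE.sub(r"\1\2", text)

def split_index_blocks(html):
    """Return (keyword, entry text) pairs, one per bold keyword."""
    parts = BOLD.split(join_broken_dates(html))
    # parts = [text before first keyword, keyword 1, body 1, keyword 2, body 2, ...]
    return [(parts[i].strip(), " ".join(parts[i + 1].split()))
            for i in range(1, len(parts), 2)]

html = "<b>Saarbruecken</b> 1147-03-12, -0830-05-\n10 <b>Peter</b> 1200-01-01"
print(split_index_blocks(html))
```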
Preprocessing of the index block
• Data structure to present the processed index data
Tokenizer and POS tagger
• Need to process data from each data block
• We use OpenNLP for this purpose
  • Open source
  • Easy to use
  • Pre-trained for German
Noun Extraction
• Straightforward to get from POS-tagged text
• ~6000 nouns from 1490 data blocks
• Problem: extraction takes 22 minutes for only 6000 nouns
Noun Extraction
• Solution:
  • Run the POS tagger concurrently over data blocks
  • Feed only the tokens which could be nouns
• We use the optimized version to get better-quality nouns

| Method | Time (seconds) | Speed-up |
| --- | --- | --- |
| Sequential | 1322.6 | - |
| Concurrent | 12.4 | 100x |
| Concurrent with optimization | 12.4 | 100x |
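The two ideas, tagging data blocks concurrently and passing on only tokens that could be nouns, can be sketched in Python as follows. This is not the project's OpenNLP code: `tag_block` is a stand-in for a real tagger call, and the capitalisation test in `could_be_noun` is an assumed heuristic (German nouns are capitalised), since the slides do not say how candidate tokens were chosen.

```python
from concurrent.futures import ThreadPoolExecutor

def could_be_noun(token):
    # Assumed pre-filter: German nouns start with a capital letter.
    return token[:1].isupper()

def tag_block(block):
    # Placeholder for a POS tagger call on one data block; here it only
    # returns the candidate tokens with trailing punctuation stripped.
    return [tok.strip(".,;:") for tok in block.split() if could_be_noun(tok)]

def extract_nouns(blocks, workers=4):
    # Data blocks are independent of each other, so they can be processed
    # in parallel; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tag_block, blocks))

print(extract_nouns(["Graf Simon schenkt dem Kloster Land.", "der Abt verkauft Wein"]))
```

In CPython a thread pool mainly helps when the tagger releases the interpreter lock or runs in another process; for a pure-Python tagger a process pool would be the closer analogue.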
Topic extraction
• We use Latent Dirichlet Allocation (LDA)
  • Proposed by Blei et al. in 2003
  • Assumption: each document is a mixture of a small number of topics
  • Two parameters had to be set; we used trial and error to get meaningful topics
Topic extraction
• Using LDA, try 1: Use blindly
  • Some very frequent words are not very meaningful, e.g. Saarbrücken
  • They are in every topic and every data block
  • Every topic is assigned to every data block
Topic extraction
• Using LDA, try 2: Remove some nouns
  • We remove the 15 most frequent nouns
  • They cover 24% of all nouns
• Problem: some of these nouns may be relevant to some documents
Topic extraction
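• Using LDA, try 3: Only keep essential nouns
  • Calculate the tf-idf score for word $w$ in a data block $d$:
    $\mathrm{tf}_{w,d} = \text{number of occurrences of } w \text{ in } d$
    $\mathrm{idf}_{w} = \log \frac{N}{n_w}$ (standard definition, with $N$ the number of data blocks and $n_w$ the number of data blocks containing $w$)
    $\text{tf-idf}_{w,d} = \mathrm{tf}_{w,d} \times \mathrm{idf}_{w}$
  • If $\text{tf-idf}_{w,d} < 3.0$ we remove $w$ from $d$
  • Use LDA on the modified data blocks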
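A minimal Python sketch of this filtering step is given below. It assumes the common definition idf_w = log(N / n_w) with the natural logarithm, where N is the number of data blocks and n_w the number of blocks containing w; the slides do not preserve the exact formula or log base, so the 3.0 threshold may behave differently here. The function name and sample nouns are illustrative.

```python
import math
from collections import Counter

def tfidf_filter(blocks, threshold=3.0):
    """Drop every word whose tf-idf score within its data block is below the threshold.

    blocks is a list of noun lists, one list per data block.
    """
    n_blocks = len(blocks)
    # Document frequency: in how many data blocks does each word occur?
    doc_freq = Counter(word for block in blocks for word in set(block))
    filtered = []
    for block in blocks:
        term_freq = Counter(block)
        filtered.append([
            word for word in block
            if term_freq[word] * math.log(n_blocks / doc_freq[word]) >= threshold
        ])
    return filtered

blocks = [
    ["Saarbruecken", "Zins", "Zins", "Zins"],
    ["Saarbruecken", "Abt"],
    ["Saarbruecken", "Erbe"],
    ["Saarbruecken", "Abt"],
]
print(tfidf_filter(blocks))
```

A word that occurs in every data block, such as the place name in this sample, gets idf 0 and is always removed, which is the effect the third attempt is after.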
Topic extraction
• We use LDA on the modified data blocks after filtering by tf-idf score
• LDA identifies 7 meaningful topics with 10 words each
  • Bekanntmachung (announcement), Besitz (property), Finanzen (finances), Vereinbarungen (agreements), Familie (family), Schulden (debts), Recht (law)
Putting the data into database
• Types of queries we want to answer
  • Enabling keyword-based search
  • Answering complex queries
  • Providing topic-based search
  • Showing temporal changes in the number of results for a time-independent query
Database schema
SELECT d.*, GROUP_CONCAT(DISTINCT t.topic_name) AS topic_names
FROM (`data` AS d
  LEFT OUTER JOIN `data_topics` AS dt ON d.id = dt.data_id)
  LEFT OUTER JOIN `topics` AS t ON dt.topic_id = t.id
WHERE d.startDate >= '0600-00-00' AND d.endDate <= '1600-00-00'
  -- Parentheses are needed here because AND binds more tightly than OR
  AND (dt.topic_id = 1 OR dt.topic_id = 6)
  AND d.data_block LIKE '%Keyword%'
GROUP BY d.id
ORDER BY d.startDate
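To try out a query of this shape without a MySQL server, the following Python sketch uses the standard-library `sqlite3` module as a stand-in. The table and column names follow the query above; the column types, the sample rows and the mapping of topic IDs to names are assumptions made for the example. The topic condition is written as `IN (?, ?)`, which keeps the date and keyword filters applied to both topics, and plain inner joins are used because filtering on `dt.topic_id` discards unmatched rows anyway.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with the three tables used by the query."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE data (id INTEGER PRIMARY KEY, startDate TEXT, endDate TEXT, data_block TEXT);
        CREATE TABLE topics (id INTEGER PRIMARY KEY, topic_name TEXT);
        CREATE TABLE data_topics (data_id INTEGER, topic_id INTEGER);
        INSERT INTO data VALUES
            (1, '0699-01-01', '0699-01-01', 'Peter zahlt Zins'),
            (2, '0786-05-10', '0786-05-10', 'Paul erbt Land'),
            (3, '1700-01-01', '1700-01-01', 'Peter zahlt Zins');
        INSERT INTO topics VALUES (1, 'Finanzen'), (5, 'Familie'), (6, 'Schulden');
        INSERT INTO data_topics VALUES (1, 1), (1, 6), (2, 5), (3, 1);
    """)
    return con

QUERY = """
SELECT d.id, d.startDate, GROUP_CONCAT(DISTINCT t.topic_name) AS topic_names
FROM data AS d
JOIN data_topics AS dt ON d.id = dt.data_id
JOIN topics AS t ON dt.topic_id = t.id
WHERE d.startDate >= ? AND d.endDate <= ?
  AND dt.topic_id IN (?, ?)
  AND d.data_block LIKE ?
GROUP BY d.id
ORDER BY d.startDate
"""

def search(con, start, end, topic_a, topic_b, keyword):
    # Bound parameters avoid pasting user input from the GUI into the SQL string.
    params = (start, end, topic_a, topic_b, f"%{keyword}%")
    return con.execute(QUERY, params).fetchall()

con = build_demo_db()
print(search(con, "0600-00-00", "1600-00-00", 1, 6, "Peter"))
```

Comparing dates as text works here only because every date is zero-padded to the same yyyy-mm-dd width.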
Database settings
• We use a MySQL database
• First we used a web-based SQL provider
  • Not always live
  • Too slow in times of high load
• We set up a local database

| Database | # Rows filled | Time to fill the DB (seconds) | Average time for query (seconds) |
| --- | --- | --- | --- |
| Remote | 25,274 | 4,800 | 10 |
| Local | 25,527 | 300 | 3 |
Graphical user interface
Evaluation
• Precision and recall are 100% for random keyword-based queries
  • We compared manual results with results from our system
• The topic detection works! The following text is labeled as “Familie” and “Finanzen”
Future work
• Improving our preprocessing step to cover more corner cases
• Extracting and saving footnotes from the text
• Possibility to add more data blocks to the database
• Taking care of line breaks in data blocks
Backup slides
SaariStory: Database design
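| User name | location | date | NAME_ID |
| --- | --- | --- | --- |
| Peter | Saarbruecken | 601 | 1 |
| Paul | Saarbruecken | 699 | 2 |

| DATA_ID | date | Data block from text |
| --- | --- | --- |
| 1 | 699 | <string> |
| 2 | 786 | <string> |

| TOPIC_ID | topic |
| --- | --- |
| 1 | CRIME |
| 2 | PUNISHMENT |

| NAME_ID | DATA_ID |
| --- | --- |
| 1 | 23 |
| 2 | 39 |

| DATA_ID | TOPIC_ID |
| --- | --- |
| 1 | 2 |
| 2 | 3 |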
Design challenge-1: stripping data from the pdf
• Text = data blocks + index of places/names
• How to parse data from the data blocks in the PDF
  • Convert the data into text using an online tool
  • Parse it to get data blocks using regular expressions
• How to parse data from the index
  • The bold ones are the keywords
  • Convert the index into HTML and then parse
Design challenge-2: Extracting topics