1 M-CAST in libraries National library of the Czech Republic Marie Balíková@nkp.cz.


M-CAST in libraries

National Library of the Czech Republic

Marie Balíková@nkp.cz

2

Multilingual Content Aggregation System based on TRUST Search Engine

• European project

• the aim of the project is to develop a multilingual system

• the system will be applied to large digital collections of multilingual data
  – libraries
    • hybrid
    • digital (internet)
  – publishing houses
  – press agencies
  – scientific databases

• the system is tested by two libraries, for which a Multimedia Content Aggregation Portal (M-CAP) is created

• the portal allows users to find answers to natural language queries

5

Application, target user groups

• M-CAP portals are developed and tested in two libraries
  – Polish Internet Library (PBI)
  – Czech National Library (CNL)

• the goal is to make their digital resources available online for finding answers to natural language queries in multilingual digital collections

• the target users of the M-CAST system fall into 2 main classes
  – internet users
  – library users

6

Library users

• can search document metadata, entire documents, or parts of documents (zones and fields) by entering words or phrases in a database search

• one of the main objectives of the M-CAST system is to enable library and internet users to pose questions in natural language by offering them a QA method

• two kinds of libraries are distinguished in the current online environment, according to the prevalent information resources in their collections

– hybrid libraries

– digital libraries with a subcategory of internet libraries

7

Hybrid libraries

• 'new' electronic information resources and 'traditional' hardcopy resources co-exist

• brought together in an integrated information service

• accessed via electronic gateways available both on-site and remotely via the Internet or local computer networks

• users of hybrid libraries want to obtain information about a document, or a piece of information extracted from the metadata or from the document itself

• the share of digitized and born-digital documents in hybrid libraries is expected to grow, so the demand from library users to formulate natural language queries will increase as well

10

Digital Libraries

• organized collections of multimedia and other types of resources in electronic form

• acquisition, storage, preservation, and retrieval are carried out through the use of digital technology

• access to the entire collection is globally available directly or indirectly across a network

• a DL supports users in dealing with information objects and

• helps them in the organization and presentation of the objects via electronic/digital means

• Internet Libraries – a subcategory of Digital Libraries

11

Polish Internet Library

• intended to become a full presentation of Polish (and world) literature, containing works belonging to the sphere of fiction and non-fiction literature

• Polish Internet Library will constitute the basis for the creation of Polish educational and cultural resources on the Internet, whose lack creates one of the barriers to the development of an information society

13

Example of searching at PBI

15

Users’ requirements at PBI

• a user survey was performed to assess some users’ requirements and expectations for the M-CAST system

• significant survey results useful for the M-CAST system requirements:
  – 85 % of users are interested in receiving responses in foreign languages, among which
    • 94 % in English
    • 28 % in French
    • 18 % in Italian
    • 12 % in Czech
    • 6 % in Portuguese
  – 74 % of users would like to receive a simplified translation of responses into Polish

16

Users’ requirements at CNL

• a user survey conducted by the CNL assessed the following user requirements regarding foreign languages:
  – 70 % of library users are interested in performing searches in foreign languages, of which
    • 80 % in English
    • 25 % in French
    • 13 % in Polish (North Moravia and Silesia)
    • 10 % in Italian
    • 5 % in Portuguese

17

Survey results - conclusions

• current library and internet users prefer
  – to receive responses in different languages
  – to perform searches in foreign language literature
  – to receive a simplified translation of query results

• the M-CAST system will enable
  – choosing a response language
  – searching among different repositories using peer to peer communication
  – choosing to translate query results to the query language

18

Question answering (QA) system

• is a type of information retrieval based on sophisticated natural language processing (NLP) techniques

• provides direct answers to user questions posed in natural language by consulting its knowledge base

• three main components of an automated QA system
  – a retrieval/search engine that handles retrieval requests
  – a query formulation mechanism that translates natural-language questions into queries for the IR engine
    • in order to retrieve relevant documents from the collection
  – an answer extraction component
    • analyses these documents and extracts answers from them
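
The three components can be illustrated with a minimal Python sketch. It is not the M-CAST implementation: the function names, the stopword list, and the word-overlap scoring are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "to",
             "what", "which", "who", "how", "does", "do"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def formulate_query(question):
    """Query formulation: reduce a natural-language question to keywords."""
    return [t for t in tokenize(question) if t not in STOPWORDS]


def retrieve(query_terms, collection):
    """Retrieval engine: documents sharing a term with the query, best first."""
    hits = []
    for doc in collection:
        overlap = len(set(query_terms) & set(tokenize(doc)))
        if overlap:
            hits.append((overlap, doc))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [doc for _, doc in hits]


def extract_answer(query_terms, documents):
    """Answer extraction: the sentence with the largest query-term overlap."""
    best, best_score = None, 0
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            score = len(set(query_terms) & set(tokenize(sentence)))
            if score > best_score:
                best, best_score = sentence, score
    return best


def answer(question, collection):
    """Chain the three components: formulate, retrieve, extract."""
    terms = formulate_query(question)
    return extract_answer(terms, retrieve(terms, collection))
```

Calling `answer(question, collection)` with a list of plain-text documents returns the best-matching sentence, or `None` when no document shares a keyword with the question.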

19

QA system based on TRUST search engine

• the system searches a set of plain text documents on a local hard disk

• returns a ranked list of sentences containing the answer to a given natural language question
  – in future a unique exact answer from the retrieved sentence will be extracted

• the question is submitted

• it is categorized according to a special question typology

• through an internal query a set of potentially relevant documents is retrieved

• each document contains a list of sequences which are assigned to the same category as the question

• sentences are weighted according to their semantic relevance and similarity with the question

• through specific answer patterns the sentences are examined again and the parts containing possible answers are extracted and weighted

• a single answer is chosen from among all candidates
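
The processing steps listed on this slide can be sketched roughly as follows. This is an illustrative approximation, not the TRUST engine: the two-entry typology, the regular-expression answer patterns, and the word-overlap weight are assumptions standing in for the real typology and semantic weighting, which the slides do not specify.

```python
import re

# Hypothetical two-entry typology; the actual question types are not
# listed in the source.
QUESTION_TYPES = [
    ("QUANTITY", re.compile(r"^how (many|much)\b", re.IGNORECASE)),
    ("TIME", re.compile(r"^when\b", re.IGNORECASE)),
]

# Hypothetical answer patterns, one per category.
ANSWER_PATTERNS = {
    "QUANTITY": re.compile(r"\b\d+\b"),
    "TIME": re.compile(r"\b\d{1,2} ?(?:am|pm)\b", re.IGNORECASE),
}


def tokens(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def categorize(question):
    """Step 1: assign the question to a category, or None if unknown."""
    for name, pattern in QUESTION_TYPES:
        if pattern.search(question.strip()):
            return name
    return None


def sentence_weight(question, sentence):
    """Step 3: share of question words that also occur in the sentence."""
    q = tokens(question)
    return len(q & tokens(sentence)) / len(q) if q else 0.0


def answer(question, documents):
    """Run all steps and return (answer text, source sentence) or None."""
    category = categorize(question)
    if category is None:
        return None
    # Step 2: keep documents sharing at least one word with the question.
    relevant = [d for d in documents if tokens(question) & tokens(d)]
    candidates = []
    for doc in relevant:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            weight = sentence_weight(question, sentence)
            # Step 4: apply the category's answer pattern to the sentence.
            match = ANSWER_PATTERNS[category].search(sentence)
            if weight > 0 and match:
                candidates.append((weight, match.group(0), sentence))
    if not candidates:
        return None
    # Step 5: choose the single highest-weighted candidate.
    _, text, sentence = max(candidates, key=lambda c: c[0])
    return text, sentence
```

Questions that fall outside the toy typology return `None`; a fuller sketch would need one pattern set per question type.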

20

Questions

• factoid/factual/fact-based questions
  – fact-based, short-answer questions such as "How much does the camel cost?"
  – answer: typically a noun phrase

• opinionoid/opinion-oriented questions
  – involve opinions, evaluations, judgments, emotions, sentiments, or speculations

• specificity of questions
  – questions should be general enough to apply to more than one document on the topic
  – questions shouldn't be too specific, asking for details not likely to appear in other documents

• documents provided to the automatic system will be from the same period of time

• 86 types of questions

21

Formulating questions, answer string

• for each of the defined questions a set of answer strings/sentences is needed

• an answer string/sentence is a piece of text from a document that contains some words that correspond to the question

• each answer string/sentence should appear explicitly in the text and MUST be wholly contained in a single sentence

• "explicit" means that the answer string/sentence need not contain the same words as used in the question

• but it is NOT possible to bring in extra background knowledge to interpret the string as an answer

• there should be at least one document in the collection that contains an answer to each defined question

• the answer string can NOT be longer than a whole sentence

• for a single question, it is possible that there may be more than one answer string in the document collection
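
The containment rule for answer strings can be checked mechanically. The sketch below is an assumption-laden illustration: `is_valid_answer_string` is a hypothetical helper, and the sentence splitter is deliberately naive (it breaks on `.`, `!`, `?` followed by whitespace and would mishandle abbreviations).

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def is_valid_answer_string(answer, document):
    """True if the answer occurs verbatim and lies wholly inside one sentence."""
    answer = answer.strip()
    if not answer:
        return False
    return any(answer in sentence for sentence in split_sentences(document))
```

A string that spans a sentence boundary is rejected because it is not a substring of any single sentence, which also enforces the rule that an answer string cannot be longer than a whole sentence.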

22

M-CAST – searching scenario

simple search

• user enters a query/question in a simple search form

advanced search

• user enters a query/question in an advanced search form

settings

• user can define:
  – results list size – maximum number of query results displayed on a single page
  – query language – language of a query entered by a user
  – response language – language of resources in which M-CAST will perform a search and display results
  – repositories – repositories among which M-CAST will perform a search
  – full-text and metadata option – if full-text is checked, the search will be performed using the full text of resources; if not checked, the search will be performed using only the resources’ metadata
  – best answers – if checked, only best search results will be displayed
  – spell checking – if checked, M-CAST will perform spell checking of a query entered by a user
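
The settings listed above map naturally onto a small configuration object. The following is a sketch only: the class name, field names, default values, and language codes are assumptions, not part of the M-CAST portal.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSettings:
    """Hypothetical container for the user-definable search settings."""
    results_list_size: int = 10          # max results shown on a single page
    query_language: str = "cs"           # language of the entered query
    response_language: str = "en"        # language of the searched resources
    repositories: list = field(default_factory=list)
    full_text: bool = True               # False means metadata-only search
    best_answers: bool = False           # show only the best results
    spell_checking: bool = False         # spell-check the entered query

    def __post_init__(self):
        if self.results_list_size < 1:
            raise ValueError("results_list_size must be at least 1")

    def search_scope(self):
        """Return which part of the resources the search will use."""
        return "full-text" if self.full_text else "metadata"


def paginate(results, settings):
    """Split a result list into pages of at most results_list_size items."""
    size = settings.results_list_size
    return [results[i:i + size] for i in range(0, len(results), size)]
```

For example, `SearchSettings(results_list_size=2, full_text=False)` describes a metadata-only search whose results are shown two per page.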

23

Welcome page of M-CAST at CNL

Settings

Simple search

Search term

24

Simple search - settings

Response language

Results list size

Query language

Repositories

Highlight Keyword

Full text

25

Results list site

Question

Number of results

Ranked

Result list: author, title, fragment containing the answer

26

CNL resources

CNL makes the following resources available for M-CAST search purposes:

– ALEPH library catalogue system – about 100 000 catalogue records with full text abstracts of the documents and contents of documents – to be integrated in December

– Manuscriptorium – about 50 000 catalogue records with document metadata

– Kramerius – OCRed old monographs and periodicals

27

M-CAST at CNL - limits of testing

• few digitized and born-digital documents are freely available

• questions/types of questions are defined for retrieving information using contemporary language

• "modern" queries are less effective for retrieving documents containing historical terms (used in historical texts)

• when formulating questions we have to overcome difficulties with
  – spelling
  – complex syntax
  – historical vocabulary

• factoid questions

28

Kolik rodin se vrátilo do Chebu?
How many families returned to the city of Cheb?

29

Do Chebu se vrátilo 23 rodin
23 families returned to the city of Cheb

30

Kolik rodin se vrátilo do Lokte?
How many families returned to the city of Loket?

31

Do Lokte se vrátilo 8 rodin
Eight families returned to the city of Loket

32

How much does a camel cost?
How much can a camel carry?

33

A camel can carry about 8-10 centners (old unit of weight)
A camel costs 120 guilders (zlaty)

34

Kolik starostů má Cařihrad?
How many mayors are there in Constantinople?

35

…mají 4 starosty
There are four mayors in Constantinople

36

Kdy se mohou Turci ženit?
When can Turks get married?

37

Turci mohou se již ve třináctém roce ženiti
Turks can get married at the age of 13

38

Kolik let je králi?
How old is the king?

39

…jest mu přes 40 let
The king is over 40 years old

40

Co nosí bulharské vojsko?
What do Bulgarian troops wear?

41

Vojsko bulharské nosí přižloutlý oblek
The Bulgarian troops wear yellowish clothes

42

Kdo jel v čele průvodu?
Who rode at the head of the procession?

43

V čele průvodu jel Václav Vilém z Roupova
At the head of the procession rode Václav Vilém z Roupova

44

Kdy se bude konat pohřeb?
When will the funeral take place?

45

…o druhé hodině po polednách…
…at 2 pm

46

Kde je jeskyně?
Where is the cave?

47

Jeskyně se nachází na vrchu Eulinus
The cave is situated on the hill Eulinus

48

Kam chodí Smyrčané v neděli?
Where do the citizens of Smyrna go on Sundays?

49

Smyrčané chodí v neděli neb ve svátek na pivo
On Sundays or holidays, the citizens of Smyrna go for a beer

50

Čeho jsou Athény hlavním městem?
Which country is Athens the capital of?

51

Athény jsou hlavním městem Řecka
Athens is the capital of Greece

52

How much does the palace in the suburbs of Tofano cost?

53

The palace in the suburbs of Tofano costs 70 million

54

Co udělají Turci křesťanům?
What will Turks do for Christians?

55

Turci udělají křesťanům všechno, co mohou
Turks do everything they can for Christians

56

Kdy byla dobyta Roudnice?
When was the city of Roudnice captured?

57

Roudnice byla dobyta …
The city of Roudnice was captured on …

58

Kde se sjeli přívrženci Gustava Adolfa?
Where did the adherents of Gustav Adolf II meet?

59

Všichni přívrženci Gustava Adolfa se sjeli v Halle
All the adherents of Gustav Adolf II met in the city of Halle

60

Co píše kancléř Vilém Slavata?
What does chancellor Vilém Slavata write?

62

Which church have the Servites abandoned?

63

The Servites abandoned the church of Saint Michael

64

Kolik žen je povoleno Turkovi?
How many wives is a Turkish man allowed?

65

A Turkish man is allowed four wives

66

Kdy zemřel Jan Hus?
When did Jan Hus die?

68

Thank you for your attention!

Special thanks to my colleagues

Irena Pilíková

Narcisa Podhradská

Magdalena Servítová

Jana Vejražková