Implementing a Corpus for Sinhala Language


Upeksha W. D.

Wijayarathna D. G. C. D.

Siriwardena M. P.

Lasandun K. H. L.

Dr. Chinthana Wimalasuriya

Prof. Gihan Dias

Mr. N. H. N. D. De Silva

OUTLINE

• Introduction

• Resource Gathering

• Data Storage

• User Interface Design and Implementation

• Application Programming Interface

• Limitations

INTRODUCTION

What is a Language Corpus?

▪ A collection of authentic texts of the language

▪ Stored electronically

▪ Contains actual usage patterns of the language

▪ Covers a wide context of the language

▪ Can be used to discover information about language that may not have been noticed through intuition alone

ARCHITECTURE OF THE CORPUS

IDENTIFIED SINHALA RESOURCES

▪ News: News Paper, News Items

▪ Academic: Text books, Religious, Wikipedia

▪ Creative Writing: Fiction, Blogs, Magazine

▪ Spoken: Subtitle

▪ Gazette: Gazette

▪ Mahawansa

COMPOSITION OF CORPUS

SINHALA SOURCES

DATA REPRESENTATION

▪ Raw data is stored separately as XML-formatted files and converted to the necessary formats according to the information need

▪ Primary Metadata

▪ Article Name

▪ Author

▪ URL

▪ Date (Year, Month, Day)

▪ Genre
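
A raw record carrying the primary metadata above might be read as in the sketch below. The tag names (`post`, `title`, `author`, `link`, `date`, `genre`, `content`) and the sample values are assumptions made for illustration; the slides do not show the actual XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw record; tag names are illustrative, not the project's schema.
SAMPLE = """
<post>
  <title>Sample article</title>
  <author>Unknown</author>
  <link>http://example.com/article/1</link>
  <date><year>2014</year><month>5</month><day>12</day></date>
  <genre>News</genre>
  <content>Article text goes here.</content>
</post>
"""

def parse_article(xml_text):
    """Return the primary metadata and body text of one article as a dict."""
    root = ET.fromstring(xml_text.strip())
    date = root.find("date")
    return {
        "name": root.findtext("title"),
        "author": root.findtext("author"),
        "url": root.findtext("link"),
        "date": tuple(int(date.findtext(k)) for k in ("year", "month", "day")),
        "genre": root.findtext("genre"),
        "content": root.findtext("content"),
    }

article = parse_article(SAMPLE)
print(article["date"], article["genre"])
```

Keeping the raw text and metadata in one neutral format like this is what allows the same data to be converted later into whatever shape a particular storage system needs.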

DATA STORAGE SYSTEM

Considerations

▪ Relational Databases

▪ Oracle DB

▪ H2 Database

▪ Indexed File Systems

▪ Apache Solr

▪ Column Store Databases

▪ Cassandra

▪ Graph Databases

▪ Neo4j

EVALUATION CRITERIA

We considered the performance of inserting data and of retrieving data for 12 different information needs.
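
A comparison of this kind can be pictured as a small timing harness like the one below. It is only a sketch: `InMemoryStore` is a hypothetical stand-in for a real backend, and the method names `insert` and `word_frequency` are assumed, not the interfaces used in the actual evaluation.

```python
import time
from collections import Counter

class InMemoryStore:
    """Hypothetical stand-in for one storage backend under test."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, words):
        # Record every word occurrence.
        self.counts.update(words)

    def word_frequency(self, word):
        # One example of an information need: how often does a word occur?
        return self.counts[word]

def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

store = InMemoryStore()
words = ["a", "b", "a", "c"] * 1000  # toy data, not corpus text

_, insert_seconds = timed(store.insert, words)
freq, query_seconds = timed(store.word_frequency, "a")
print(f"insert: {insert_seconds:.6f}s, query: {query_seconds:.6f}s, freq={freq}")
```

Running the same harness against each candidate store, with growing input sizes, would give insertion and retrieval curves that can be compared side by side.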

Data Insertion time comparison

Information Retrieval Performance Comparison - Part 1

Information Retrieval Performance Comparison - Part 2

• Cassandra performed better than the others in most of the scenarios, and its insertion time increased linearly.

• So we chose it for implementing the corpus.

• Cassandra uses query-based data modeling.

• It uses a separate column family for each information need.
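
The idea of one column family per information need can be sketched with plain Python dictionaries standing in for Cassandra tables. The three table names and their keys are assumptions chosen for illustration, not the corpus's actual schema.

```python
from collections import defaultdict

# Each dict plays the role of one column family, keyed by what its query asks for.
word_frequency = defaultdict(int)        # word -> count
word_year_frequency = defaultdict(int)   # (word, year) -> count
bigram_frequency = defaultdict(int)      # (word1, word2) -> count

def insert_sentence(words, year):
    """Write one sentence into every table so each query is a single lookup."""
    for w in words:
        word_frequency[w] += 1
        word_year_frequency[(w, year)] += 1
    for w1, w2 in zip(words, words[1:]):
        bigram_frequency[(w1, w2)] += 1

# Toy tokens, not corpus text.
insert_sentence(["the", "cat", "sat"], 2013)
insert_sentence(["the", "cat", "ran"], 2014)

print(word_frequency["cat"])
print(word_year_frequency[("cat", 2014)])
print(bigram_frequency[("the", "cat")])
```

The same data is written several times, once per query shape, which makes reads cheap. The cost is that a query nobody planned for has no table to answer it until one is defined and filled by inserting the data again.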

User Interface Design and Implementation

● The web interface of Sinmin is designed for users who prefer a visualised and summarized view of Sinmin's statistical data.

● The visual design lets a user with no prior experience of the interface fulfill their information requirements with little effort.

Searching for the frequency of a word

Searching for the most probable words that come after a word

Comparing two words
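
The statistics behind these three views can be sketched over a toy tokenized corpus. The function names and the English placeholder tokens are assumptions for illustration; the actual interface reads precomputed values from the data store rather than counting on the fly.

```python
from collections import Counter, defaultdict

def build_bigrams(sentences):
    """Map each word to a Counter of the words that directly follow it."""
    following = defaultdict(Counter)
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            following[w1][w2] += 1
    return following

def most_probable_next(following, word, n=3):
    """Return up to n (next_word, probability) pairs, most probable first."""
    counts = following.get(word, Counter())
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(n)]

def compare(sentences, word_a, word_b):
    """Return the corpus frequency of two words side by side."""
    freq = Counter(w for words in sentences for w in words)
    return {word_a: freq[word_a], word_b: freq[word_b]}

# Placeholder tokens; Sinhala text would be tokenized into lists the same way.
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
following = build_bigrams(sentences)

print(most_probable_next(following, "the"))
print(compare(sentences, "cat", "dog"))
```

Here the probability of a following word is estimated as its bigram count divided by the total count of all words seen after the query word.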

APPLICATION PROGRAMMING INTERFACE (API)

• REST API to expose corpus services

• More complex and customizable data retrieval and filtering

• Interface for third-party applications to consume
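
A third-party client might build requests and decode responses roughly as follows. Everything specific here is hypothetical: the base URL, the `wordFrequency` service name, the `word`, `year` and `category` parameters, and the JSON response shape are assumptions, since the slides do not document the real endpoints.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/api"  # hypothetical; not the real service address

def build_request_url(service, **params):
    """Build the URL for one (assumed) corpus service with filter parameters."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{service}?{query}"

def parse_frequency_response(body):
    """Decode an assumed JSON response of the form {"word": ..., "frequency": ...}."""
    data = json.loads(body)
    return data["word"], int(data["frequency"])

# Filters such as year and category illustrate the customizable retrieval.
url = build_request_url("wordFrequency", word="example", year=2014, category="News")
print(url)

# A made-up response body, used only to exercise the parser.
sample_body = '{"word": "example", "frequency": 42}'
print(parse_frequency_response(sample_body))
```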

LIMITATIONS OF THE CORPUS

● Words are not annotated with their Part of Speech (POS) tags and lemmas.

● If a new information need arises, new column families may need to be created for it and the data has to be inserted again.

QUESTIONS?

THANK YOU!