CoLiOS - Corpus Linguistic Open Source
Alexandru-Lucian Gînscă (1), Adrian Iftene (1), Marius Corîci (2)
ConsILR Conference, 8-9 December, Bucharest, Romania, National Museum of Romanian Literature (MNLR)
(1) “Al. I. Cuza” University of Iași, Romania, Faculty of Computer Science
(2) Intelligentics, Cluj-Napoca, Romania
Motivation, Existing Sentiment Corpora, Files Sources, Annotations, Annotation Process, Corpus Statistics, Evaluation Metrics Proposal, Conclusions
Sentiment Analysis, or Opinion Mining, has been a hot topic for some time in the Web 2.0 era.
Building robust Sentiment Analysis systems requires resources for training and evaluating them.
Such a Sentiment Corpus is lacking for Romanian.
We intend to make our corpus publicly available, free of charge, for individual researchers and research centers.
Existing Sentiment Corpora: MPQA opinion corpus, Large Movie Review Dataset, SentiWordNet, The JDPA Sentiment Corpus, UMass Amherst Linguistics Sentiment Corpora
Languages: English, German, Italian, Chinese, Japanese
Romanian online publications:
◦ Online newspapers (MediaFax, Romania Libera, etc.)
◦ Blogs (Chinezu.eu, Zoso.ro, etc.)
◦ News portals (Realitatea.net, StirileProTv.ro, etc.)
Category: Telecommunications
Companies: Orange, Vodafone, Cosmote and so on.
<paragraph id=""></paragraph>
<sentimentGroup value="" id_group=""></sentimentGroup>
-4 <= value <= 4
<entity type="" sentiment="" id_entity="" id_group=""></entity>
-4 <= sentiment <= 4
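The annotation format above can be checked mechanically. The following is a minimal sketch, not the authors' tooling: it assumes a hypothetical `<document>` root element, integer scores, a required `sentiment` attribute on every entity, and that an empty `id_group` on an entity means "not linked". The sample text and ids are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <document> root, the ids and the text are
# assumptions; only the tag and attribute names come from the slides.
SAMPLE = """
<document>
  <paragraph id="p1">
    <entity type="Company" sentiment="2" id_entity="e1" id_group="g1">Orange</entity>
    <sentimentGroup value="3" id_group="g1">launched a great offer</sentimentGroup>
  </paragraph>
</document>
"""


def score_ok(raw):
    """Return True if raw parses as an integer in [-4, 4] (integers assumed)."""
    try:
        return -4 <= int(raw) <= 4
    except (TypeError, ValueError):
        return False


def validate(xml_text):
    """Return a list of problems found in one annotated file."""
    root = ET.fromstring(xml_text)
    problems = []

    # Collect sentiment groups and check their value range.
    group_ids = set()
    for group in root.iter("sentimentGroup"):
        group_ids.add(group.get("id_group"))
        if not score_ok(group.get("value")):
            problems.append(f"sentimentGroup {group.get('id_group')}: value out of range")

    # Check entity scores and that every link points to an existing group.
    for entity in root.iter("entity"):
        if not score_ok(entity.get("sentiment")):
            problems.append(f"entity {entity.get('id_entity')}: sentiment out of range")
        link = entity.get("id_group")
        if link and link not in group_ids:
            problems.append(f"entity {entity.get('id_entity')}: unknown group {link}")

    return problems


print(validate(SAMPLE))
```

A check of this kind could be run over each annotator's files before merging them, so that out-of-range scores and dangling entity links are caught early.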
Linking sentiment groups to entities
We consider the following major categories: City, Organization, Company, Country and Person; additionally, we consider categories such as Brand, Product and Publication.
For almost all major categories we consider subcategories:
◦ For Cities we consider Romanian, European, American and Other Cities
◦ For Organizations we consider Parties, Faculties, Universities, Ministries, etc.
◦ For People we consider Sportsmen, Politicians, Males, Females, etc.
11 annotators (1st year master's students in computational linguistics at FII, UAIC)
As the annotation tool we chose Serna (http://www.syntext.com/products/serna/): open source, flexible, easy to use, intuitive
Method 1: process the chosen files with our tools and automatically add annotations for named entities and for sentiments
Method 2: process only at paragraph level
◦ 11 annotators
◦ 1 week span
◦ 110 files
◦ 1988 paragraphs
◦ 2044 sentiment groups
◦ 4301 entities
◦ 1101 links between entities and sentiment groups
Sentiment group precision
Precision for named entities and sentiment group links
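The formulas for these two metrics did not survive in this transcript. As a hedged illustration only, the sketch below uses the standard definition of precision (correct system items divided by all system items), which may differ from the authors' exact proposal. Representing a sentiment group as a `(paragraph id, text)` pair and a link as an `(id_entity, id_group)` pair is an assumption made for the example.

```python
def precision(system_items, gold_items):
    """Standard precision: share of system items that also appear in the gold file."""
    system_set, gold_set = set(system_items), set(gold_items)
    if not system_set:
        return 0.0  # convention assumed here for an empty system output
    return len(system_set & gold_set) / len(system_set)


# Hypothetical sentiment groups, identified by (paragraph id, text).
system_groups = [("p1", "great offer"), ("p1", "poor coverage"),
                 ("p2", "fast network"), ("p3", "new tariff")]
gold_groups = [("p1", "great offer"), ("p1", "poor coverage"),
               ("p2", "fast network"), ("p4", "bad support")]

# Hypothetical entity-to-group links, identified by (id_entity, id_group).
system_links = [("e1", "g1"), ("e2", "g1")]
gold_links = [("e1", "g1"), ("e2", "g2")]

print(precision(system_groups, gold_groups))
print(precision(system_links, gold_links))
```

The same function serves both metrics because each reduces to comparing a set of system items with a set of gold items; only the item representation changes.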
Relaxed precision for sentiment group value
CG = the set of correctly identified sentiment groups
VF(SSG) = the value of the sentiment group as given by the system
VG(SSG) = the value of the sentiment group from the gold file
Average deviation for sentiment group value
CG = the set of correctly identified sentiment groups
VF(SSG) = the value of the sentiment group as given by the system
VG(SSG) = the value of the sentiment group from the gold file
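The formulas behind the two value metrics are missing from this transcript, so the following sketch is one plausible reading of the definitions of CG, VF and VG, not the authors' formula. It assumes that average deviation is the mean of |VF(SSG) - VG(SSG)| over CG, and that relaxed precision is the share of groups in CG whose system value lies within a tolerance of the gold value. The `tolerance` parameter and the dictionaries keyed by group id are hypothetical.

```python
def average_deviation(cg, vf, vg):
    """Mean absolute difference between system and gold values over CG (assumed form)."""
    if not cg:
        return 0.0
    return sum(abs(vf[g] - vg[g]) for g in cg) / len(cg)


def relaxed_precision(cg, vf, vg, tolerance=1):
    """Share of groups in CG whose system value is within `tolerance` of gold (assumed form)."""
    if not cg:
        return 0.0
    close = sum(1 for g in cg if abs(vf[g] - vg[g]) <= tolerance)
    return close / len(cg)


# Hypothetical data: group ids mapped to values in [-4, 4].
cg = ["g1", "g2", "g3"]
vf = {"g1": 3, "g2": -2, "g3": 0}   # values given by the system
vg = {"g1": 4, "g2": -2, "g3": 3}   # values from the gold file

print(average_deviation(cg, vf, vg))
print(relaxed_precision(cg, vf, vg))
```

Under these assumptions the two metrics complement each other: relaxed precision counts how often the system is close enough, while average deviation reports how far off it is on average.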
The importance of a Corpus for Sentiment Analysis for Romanian.
The annotation format and methodology.
Comparison between our proposal and existing Sentiment Corpora.