
A Web Corpus for eCare: Collection, Lay Annotation and Learning

Marina Santini ([email protected])

RISE SICS East

Sweden

3 September 2017, LTA'17 - FedCSIS 2017 - Prague


RISE & SICS

• RI.SE = Research Institutes of Sweden

• SICS = Swedish Institute of Computer Science

• SICS East Linköping

• Group: Language Technology and Intelligent Interaction Design

• Group Leader: Professor Arne Jönsson ([email protected])

Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.


Outline

• The eCare@home project

• The Language Technology contribution

• The eCare Swedish corpus

• Lay-Specialized Annotation

• The experiments: lay-specialized text classification

• Conclusions


eCare@home

• Website: http://ecareathome.se/

• Interdisciplinary project

• Funded by the Swedish Knowledge Foundation

1. Measure attributes about people and the environment.

2. Infer beyond that which we cannot measure.

3. Automatically configure devices and "things" to achieve a task.

4. Represent information in a human consumable way.



Language Technology and eCare@home


Generally speaking:

• The Internet of Things: Sensor Data = Numbers

• Language Technology to present information in a “human consumable way”

How to Build a Medical Corpus for eCare?


Kardiologi (cardiology) vs hjärtspecialist (heart specialist): specialized vs lay

Starting point: a Web Corpus for eCare


• Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as:

• the automatic extraction of lay synonyms (or paraphrases) of medical terms,

• text simplification

• etc.

Web Corpus

• Which textual sources?


• Using medical journals?

• Crawling user-generated texts?

• Relying on a specific web genre like blogs?

• Downloading medical websites?

• Web corpus!

Designing a corpus for eCare


1. having a publicly-available medical corpus annotated with lay-specialized labels that can be easily shared;

2. having a corpus with a design and a structure that allow for expansion with additional documents over time;

3. accounting for very specific medical terms.

Potential Issues


1. expanding the corpus over time may cause scalability issues

2. the web is noisy, so the corpus will be noisy, and noise disturbs LT applications

We made two assumptions:

1. Scalability: increasing the size of the corpus does not necessarily affect the performance of LT applications negatively

2. Noise: noisy texts do not necessarily affect the performance of LT applications negatively

In the next slides...


1. Lay vs Specialized sublanguage

2. The construction and annotation of the eCare Swedish web corpus

3. Experiments 1 and 2:

• Experiment 1: robustness to scalability issues (increasing the size of the corpus does not necessarily affect the performance of LT applications negatively)

• Experiment 2: resilience to noise (noisy texts do not necessarily affect the performance of LT applications negatively)

Lay vs Specialized Sublanguage


• Definition of sublanguage:

• A language variety used by a specific user group in certain communicative/situational contexts or domains

• Ex: the jargon used by the police, the military, politicians, etc.

• Medical domain: not only jargon but also medical terminology!

• Two user groups in contact:

• patients: lay sublanguage (ex: heart specialist)

• professional staff: specialized sublanguage (ex: cardiologist)

The Guardian, 2014

eCare Web Corpus: Construction


• Seed terms from SNOMED CT

• Chronic diseases

• The corpus was bootstrapped with 228 seed terms, and 801 documents were downloaded with BootCaT (Baroni & Bernardini, 2004)

            Initial seeds   Retrieved seeds   Downloaded documents   Number of words
Unigrams    13              13                112                    91,118
Bigrams     215             142               689                    618,491
Total       228             155               801                    709,609
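The figures in the table can be cross-checked and turned into two simple derived ratios. The sketch below is only an illustration: the names `seed_yield` and `words_per_doc` are made up here, and it assumes that "retrieved seeds" means seeds that returned at least one document, which the slide does not state.

```python
# Figures copied from the corpus table: (initial seeds, retrieved seeds, documents, words).
CORPUS_TABLE = {
    "unigrams": (13, 13, 112, 91_118),
    "bigrams": (215, 142, 689, 618_491),
}
REPORTED_TOTAL = (228, 155, 801, 709_609)


def column_totals(table):
    """Sum each column over the seed types."""
    return tuple(sum(col) for col in zip(*table.values()))


def derived_stats(row):
    """Hypothetical helper: share of seeds retrieved and mean words per document."""
    initial, retrieved, docs, words = row
    return {
        "seed_yield": retrieved / initial,
        "words_per_doc": words / docs,
    }


totals = column_totals(CORPUS_TABLE)
for name, row in {**CORPUS_TABLE, "total": totals}.items():
    stats = derived_stats(row)
    print(f"{name:9s} yield={stats['seed_yield']:.2f} words/doc={stats['words_per_doc']:.0f}")
```

The column sums agree with the reported Total row, so the table is internally consistent.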

Example of a long medical term (source: SNOMED CT)

Small Data (vs Big Data)


• Small data: data that is small enough for human comprehension.

• Use: for many problems and questions, small data in itself is enough.

• Small data (Wikipedia) = a new buzzword

https://en.wikipedia.org/wiki/Small_data

eCare Web Corpus: Lay Annotation


• Annotation by a native speaker who has little knowledge of medical terminology.

• The lay-specialized text classification experiments described later on are based on this lay annotation.

Inter-rater agreement: Lay vs Expert (sample)


• Interrater agreement

To be taken with a grain of salt, but normally:

• Poor agreement = 0.20 or less

• Fair agreement = 0.20 to 0.40

• Moderate agreement = 0.40 to 0.60

• Good agreement = 0.60 to 0.80

• Very good agreement = 0.80 to 1.00

User group bias: annotation may be biased by the annotator's domain expertise.
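As a rough illustration of how such an agreement score can be computed and read against the bands above, here is a minimal sketch. It assumes the measure is Cohen's kappa (the slide does not name it) and treats each band's upper bound as inclusive, since the listed ranges overlap at their edges. The labels are a made-up toy sample, not corpus data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map a kappa value onto the bands listed on the slide (upper bound inclusive)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical toy sample, not taken from the eCare corpus.
lay = ["lay", "lay", "spec", "spec", "lay", "spec", "lay", "spec"]
expert = ["lay", "spec", "spec", "spec", "lay", "spec", "lay", "lay"]
k = cohen_kappa(lay, expert)
print(f"kappa={k:.2f} ({agreement_band(k)})")
```

In this toy sample the annotators agree on 6 of 8 items and chance agreement is 0.5, so kappa is 0.5.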

Noise


• Apparently the web is full of automatically translated documents! Do we need to sort them out? It depends!

• Out of 801 web documents, 339 have received comments by the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans".

• Is it important to remove this noise for some LT tasks? Maybe not always.

• Computationally, the presence of noise might be irrelevant, so it might not be worth investing resources in cleaning a corpus.
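One way to derive a "cleaned" subset for comparison with the full noisy corpus is to drop every document that received an annotator comment. The sketch below assumes exactly that rule, which the slides do not confirm. The `AnnotatedDoc` record and its field names are hypothetical, not the corpus's actual format.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedDoc:
    # Hypothetical record layout; the real corpus format may differ.
    doc_id: str
    label: str          # "lay" or "specialized"
    comment: str = ""   # free-text remark by the annotator, empty if none


def split_noisy_cleaned(docs):
    """Return (noisy, cleaned): the full set and the subset without comments."""
    noisy = list(docs)
    # Assumption: any non-empty comment marks the document as noise.
    cleaned = [d for d in docs if not d.comment.strip()]
    return noisy, cleaned


docs = [
    AnnotatedDoc("d1", "lay"),
    AnnotatedDoc("d2", "specialized", "Machine Translated"),
    AnnotatedDoc("d3", "lay", "it is about animals and not about humans"),
    AnnotatedDoc("d4", "specialized"),
]
noisy, cleaned = split_noisy_cleaned(docs)
print(len(noisy), len(cleaned))
```

Training the same classifier on both subsets then shows whether the extra cleaning effort changes performance.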

Lay/Specialized Text Classification


Based on the annotation made by the lay annotator

• Experiment 1: focus on scalability

• Experiment 2: focus on the impact of noise on performance

Features


No text pre-processing

The texts were defined as “string”

Filter converting strings to word vectors
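The string-to-word-vector step can be pictured with a small bag-of-words sketch. This is a plain-Python analogue, not the filter used in the experiments. Presumably the filter is Weka's StringToWordVector, whose tokenizer and weighting options the slide does not specify. Whitespace tokenization and raw counts are assumptions made here for simplicity.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect every whitespace-separated token, sorted for a stable column order."""
    return sorted({token for text in texts for token in text.split()})


def to_word_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# Hypothetical mini-corpus; the real texts are Swedish web documents.
texts = ["kardiologi och kardiologi", "specialist och patient"]
vocab = build_vocabulary(texts)
vectors = [to_word_vector(t, vocab) for t in texts]
print(vocab)
print(vectors)
```

Each document becomes one row with one column per vocabulary word, the representation a classifier such as an SVM consumes.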

Experiment 1: Scalability


Experiment 2: Resilience to Noise


• Noisy vs Cleaned

New Experiment: Lay vs Specialized Annotation


• SVM (Weka implementation: SMO)

       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.49   78.6   0.78     0.78     0.78     0.74     0.78      0.29

801 documents labelled by the lay annotator

       k      Acc.   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.54   79.5   0.79     0.79     0.79     0.77     0.79      0.25

778 documents labelled by the expert annotator
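To make the column headers concrete, the sketch below computes accuracy and averaged precision, recall and F-measure from a confusion matrix. It assumes the "Avg." columns are averages weighted by class size, which is how such summaries are commonly reported but is not stated on the slide. The counts are invented for illustration and are not the eCare results.

```python
def weighted_metrics(confusion):
    """Support-weighted precision, recall and F1 from confusion[true][predicted]."""
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    avg_p = avg_r = avg_f = 0.0
    for c in classes:
        tp = confusion[c][c]
        support = sum(confusion[c].values())               # items truly in class c
        predicted = sum(confusion[t][c] for t in classes)  # items predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support / total
        avg_p += weight * precision
        avg_r += weight * recall
        avg_f += weight * f1
    accuracy = sum(confusion[c][c] for c in classes) / total
    return {"accuracy": accuracy, "precision": avg_p, "recall": avg_r, "f1": avg_f}


# Hypothetical counts for illustration only; not the eCare results.
confusion = {
    "lay": {"lay": 40, "specialized": 10},
    "specialized": {"lay": 5, "specialized": 45},
}
m = weighted_metrics(confusion)
print({name: round(value, 3) for name, value in m.items()})
```

Under this weighting, average recall always equals accuracy, which fits the tables above, where Avg. R and Avg. TP track Acc. closely in both rows.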

Conclusions: Findings


• We presented the construction of the eCare corpus

• We made two claims and we supported our claims with two experiments:

1. We can design an extensible/dynamic corpus without fearing scalability issues

2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance

Replicability


• We encourage the replication of the results presented in this paper, and welcome improvements and further discussion.

• Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.


Thanks for your attention!


Any questions?
