A Web Corpus for eCare: Collection, Lay Annotation and Learning
Marina Santini([email protected])
RISE SICS East
Sweden
3 September 2017, LTA'17 - FedCSIS 2017 - Prague
RISE & SICS
• RI.SE = Research Institutes of Sweden
• SICS = Swedish Institute of Computer Science
• SICS East Linköping
• Group: Language Technology and Intelligent Interaction Design
• Group Leader: Professor Arne Jönsson ([email protected])
Citation: Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) Web Corpus for eCare: Collection, Lay Annotation and Learning. First Results. Proceedings of LTA'17, FedCSIS 2017, Prague.
Outline
• The eCare@home project
• The Language Technology contribution
• The eCare Swedish corpus
• Lay-Specialized Annotation
• The experiments: lay-specialized text classification
• Conclusions
eCare@home
• Website: http://ecareathome.se/
• Interdisciplinary project
• Funded by Swedish Knowledge Foundation
1. Measure attributes about people and the environment.
2. Infer beyond that which we cannot measure.
3. Automatically configure devices and "things" to achieve a task.
4. Represent information in a human consumable way.
Language Technology and eCare@home
Generally speaking:
• The Internet of Things: sensor data = numbers
• Language Technology to present information in a “human consumable way”
How to Build a Medical Corpus for eCare?
Kardiologi (cardiology) vs hjärtspecialist (heart specialist): specialized vs lay
Starting point: a Web Corpus for eCare
• Creation of a concept-specific medical web corpus useful for eCare (and eHealth) LT applications, such as:
• the automatic extraction of lay synonyms (or paraphrases) of medical terms,
• text simplification
• etc.
Web Corpus
• Which textual sources?
• Using medical journals?
• Crawling user-generated texts?
• Relying on a specific web genre like blogs?
• Downloading medical websites?
• Web corpus!
Designing a corpus for eCare
1. having a publicly-available medical corpus annotated with lay-specialized labels that can be easily shared;
2. having a corpus with a design and a structure that allow for expansion with additional documents over time;
3. accounting for very specific medical terms.
Potential Issues
1. expanding the corpus over time may cause scalability issues
2. the web is noisy, so the corpus will be noisy: noise disturbs LT applications
We made two assumptions:
1. Scalability: increasing the size of the corpus does not necessarily affect the performance of LT applications negatively
2. Noise: noisy texts do not necessarily affect the performance of LT applications negatively
In the next slides...
1. Lay vs Specialized sublanguage
2. The construction and annotation of the eCare Swedish web corpus
3. Experiments 1 and 2:
• Experiment 1: robustness to scalability issues (increasing the size of the corpus does not necessarily affect the performance of LT applications negatively)
• Experiment 2: resilience to noise (noisy texts do not necessarily affect the performance of LT applications negatively)
Lay vs Specialized Sublanguage
• Definition of sublanguage:
• A language variety used by a specific user group in certain communicative/situational contexts or domains
• Ex: the jargon used by the police, the military, politicians, etc.
• Medical domain: not only jargon but also medical terminology!
• Two user groups in contact:
• patients: lay sublanguage (ex: heart specialist)
• professional staff: specialized sublanguage (ex: cardiologist)
The Guardian, 2014
eCare Web Corpus: Construction
• Seed terms from SNOMED CT
• Chronic diseases
• The corpus was bootstrapped with 228 seed terms, and 801 documents were downloaded with BootCaT (Baroni & Bernardini, 2004)
           Initial seeds   Retrieved seeds   Downloaded documents   Number of words
Unigrams   13              13                112                    91,118
Bigrams    215             142               689                    618,491
Total      228             155               801                    709,609
Example of long medical term (source: SNOMED CT)
Small Data (vs Big Data)
• Small data: data that has small enough size for human comprehension.
• Use: for many problems and questions, small data in itself is enough.
• Small data (wikipedia) = a new buzz word
eCare Web Corpus: Lay Annotation
• Annotation by a native speaker who has little knowledge of medical terminology.
• The lay-specialized text classification experiments described later on are based on this lay annotation.
Inter-rater agreement: Lay vs Expert (sample)
• Interrater agreement
To be taken with a grain of salt, but normally:
• Poor agreement = 0.20 or less
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
User group bias: annotation may be biased by the annotator’s domain expertise.
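The agreement bands above are usually applied to a chance-corrected statistic. The slide does not name the statistic, so the following is only a minimal sketch assuming Cohen's kappa for two annotators; the function names and the toy labels are invented for illustration and are not taken from the eCare corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_band(kappa):
    """Map a kappa value to the verbal bands listed on the slide.

    The slide's ranges share their end points; here a boundary value
    is assigned to the lower band (an arbitrary choice).
    """
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Invented toy labels, not taken from the eCare corpus.
lay = ["lay", "lay", "specialized", "lay",
       "specialized", "specialized", "lay", "lay"]
expert = ["lay", "specialized", "specialized", "lay",
          "specialized", "lay", "lay", "lay"]
kappa = cohen_kappa(lay, expert)
print(round(kappa, 2), agreement_band(kappa))
```

In this toy case the annotators agree on 6 of 8 items, but about half of that agreement is expected by chance, so the kappa lands in the "moderate" band rather than "good".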
Noise
• Apparently the web is full of automatically translated documents! Do we need to sort them out? It depends!
• Out of 801 web documents, 339 have received comments by the lay annotator, e.g. "Machine Translated" or "it is about animals and not about humans".
• Is it important to remove this noise for some LT tasks? Maybe not always.
• Computationally, the presence of noise might be irrelevant, so it might not be worth investing resources in cleaning a corpus.
Lay/Specialized Text Classification
Based on the annotation made by the lay annotator
• Experiment 1: focus on scalability
• Experiment 2: focus on the impact of noise on performance
Features
No text pre-processing
The texts were defined as “string”
Filter converting strings to word vectors
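The idea of converting raw strings to word vectors can be sketched in a few lines. This is not the filter used in the experiments; it is a minimal bag-of-words illustration that assumes whitespace tokenization and raw counts, with hypothetical function names and invented toy strings.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect every distinct token and give it a fixed column index."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}


def to_word_vector(document, vocabulary):
    """Turn one raw string into a list of token counts over the vocabulary."""
    counts = Counter(document.split())
    # Dicts keep insertion order, so columns follow the sorted vocabulary.
    return [counts.get(token, 0) for token in vocabulary]


# Invented toy strings; no pre-processing beyond splitting on whitespace.
docs = ["kardiologi kardiologi klinik", "hjärtspecialist klinik"]
vocab = build_vocabulary(docs)
vectors = [to_word_vector(doc, vocab) for doc in docs]
print(list(vocab))
print(vectors)
```

Each document becomes one row of counts, so a classifier sees only which words occur and how often, not their order.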
Experiment 1: Scalability
Experiment 2: Resilience to Noise
• Noisy vs Cleaned
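One way to obtain the two conditions is to treat every document that received a comment from the lay annotator as noise and drop it from the cleaned version. The slides do not spell out the cleaning rule, so this is only a sketch under that assumption; the record fields `text`, `label` and `comment` are hypothetical.

```python
def split_noisy_cleaned(documents):
    """Return (noisy, cleaned): the full corpus and the subset without comments."""
    noisy = list(documents)
    # Assumption: any non-empty annotator comment marks a document as noise.
    cleaned = [doc for doc in documents if not doc.get("comment")]
    return noisy, cleaned


# Invented records; only the two comment strings are quoted from the slides.
corpus = [
    {"text": "doc one", "label": "lay", "comment": ""},
    {"text": "doc two", "label": "specialized", "comment": "Machine Translated"},
    {"text": "doc three", "label": "lay",
     "comment": "it is about animals and not about humans"},
    {"text": "doc four", "label": "specialized", "comment": None},
]
noisy, cleaned = split_noisy_cleaned(corpus)
print(len(noisy), len(cleaned))
```

The same classifier can then be trained and evaluated once on `noisy` and once on `cleaned` to compare the two conditions.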
New Experiment: Lay vs Specialized Annotation
• SVM (Weka implementation: SMO)

801 documents labelled by the lay annotator:
       k      Acc. (%)   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.49   78.6       0.78     0.78     0.78     0.74     0.78      0.29

778 documents labelled by the expert annotator:
       k      Acc. (%)   Avg. P   Avg. R   Avg. F   ROC A.   Avg. TP   Avg. FP
SMO    0.54   79.5       0.79     0.79     0.79     0.77     0.79      0.25
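The accuracy and averaged precision, recall and F columns can all be derived from a confusion matrix. The sketch below assumes that "Avg." means an average weighted by class size, which the slides do not state; the function name and the counts are invented and are not the eCare results.

```python
def weighted_metrics(confusion):
    """Accuracy and class-size-weighted precision, recall and F1.

    confusion[actual][predicted] holds document counts.
    """
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[c].get(c, 0) for c in classes)
    precision = recall = f1 = 0.0
    for c in classes:
        true_pos = confusion[c].get(c, 0)
        support = sum(confusion[c].values())  # documents truly in class c
        predicted = sum(confusion[a].get(c, 0) for a in classes)  # assigned to c
        p = true_pos / predicted if predicted else 0.0
        r = true_pos / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts, not the eCare results.
confusion = {
    "lay": {"lay": 50, "specialized": 10},
    "specialized": {"lay": 10, "specialized": 30},
}
metrics = weighted_metrics(confusion)
print({name: round(value, 3) for name, value in metrics.items()})
```

With class-size weighting, the averaged recall equals accuracy by construction, which appears consistent with the tables, where Avg. R and Avg. TP track Acc.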
Conclusions: Findings
• We presented the construction of the eCare corpus
• We made two claims and we supported our claims with two experiments:
1. We can design an extensible/dynamic corpus without fearing scalability issues
2. We can use a noisy corpus to build (certain) LT applications without fearing bad performance
Replicability
• We encourage the replication of the results presented in this paper, and welcome improvements and further discussion.
• Corpus, scripts and the output of the classification models are available from the project website: http://ecareathome.se.
Thanks for your attention!
Any Questions?