© FIRST consortium Page 1 of 25
Project Acronym: FIRST
Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making
Project Number: 257928
Instrument: STREP
Thematic Priority: ICT-2009-4.3 Information and Communication Technology
D3.1 Semantic resources and data acquisition
Work Package: WP3 - Data acquisition and ontology infrastructure
Due Date: 30/09/2011
Submission Date: 30/09/2011
Start Date of Project: 01/10/2010
Duration of Project: 36 Months
Organisation Responsible for Deliverable: JSI
Version: 1.0
Status: Final
Author Name(s): Miha Grčar JSI
Tobias Häusser UHOH
Dominic Ressel UHOH
Reviewer(s): Achim Klein
Mateusz Radzimski UHOH
ATOS
Nature: R – Report, P – Prototype, D – Demonstrator, O – Other
Dissemination Level: PU – Public; CO – Confidential, only for members of the consortium (including the Commission); RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
Revision history
0.1  21/08/2011  Miha Grčar (JSI)  High-level TOC and first inputs.
0.2  09/09/2011  Miha Grčar (JSI)  Added content about Dacq and the dataset.
0.3  17/09/2011  Miha Grčar (JSI), Tobias Häusser (UHOH)  Added content about the FIRST ontology.
0.4  27/09/2011  Miha Grčar (JSI)  Revision according to the reviewers’ comments.
0.5  28/09/2011  Miha Grčar (JSI), Dominic Ressel (UHOH)  Included Dominic’s contribution on the sentiment corpus construction.
1.0  30/09/2011  Tomás Pariente  Final QA and preparation for submission.
Copyright © 2011, FIRST Consortium
The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute all or parts of this document, provided that the FIRST project and the document are properly referenced.
THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Executive summary
This report accompanies three "products" developed in WP3 in the first project year:
(1) The data acquisition software called DacqPipe (Dacq for short)
(2) The FIRST dataset of news and blog posts
(3) The FIRST ontology
In this report, we briefly describe each of these three project assets and provide Web addresses to a range of online demos related to the data acquisition pipeline. Furthermore, we provide download locations and usage instructions for the resources released in the context of D3.1.
In addition, we report the efforts in constructing a manually annotated sentiment corpus that will serve both for the evaluation of the sentiment extraction technology developed in WP4 and training machine learning (i.e., sentiment classification) models in WP6. Since the main purpose of this document is to describe the software prototypes and datasets released at M12, the efforts related to the sentiment corpus construction are reported in Annex 1.
Table of contents
Executive summary ....................................................................................................... 4
Abbreviations and acronyms ....................................................................................... 7
1 Introduction ............................................................................................................ 8
2 DacqPipe, the data acquisition pipeline ............................................................... 9
2.1 Introduction ....................................................................................................... 9
2.2 Availability ....................................................................................................... 10
2.3 Deployment and usage instructions ................................................................ 10
2.3.1 Deployment and configuration.................................................................. 10
2.3.2 Usage ....................................................................................................... 12
2.3.3 Data files .................................................................................................. 12
2.4 Pointers to online demos ................................................................................. 14
3 FIRST dataset of news and blog posts ............................................................... 15
3.1 Introduction ..................................................................................................... 15
3.2 Availability ....................................................................................................... 16
4 Semantic resources and the FIRST ontology .................................................... 18
4.1 Existing semantic resources ............................................................................ 18
4.2 The FIRST ontology ........................................................................................ 18
4.2.1 Ontology-based information extraction process in FIRST ........................ 19
4.2.2 Sentiment objects and gazetteers ............................................................ 20
4.2.3 Availability ................................................................................................ 21
References ................................................................................................................... 23
Annex 1. Annotated sentiment corpus construction .......................................... 24
Index of Figures
Figure 1: The current topology of the data acquisition pipeline (taken from FIRST D2.2). ........... 9
Figure 2: An example of the file with RSS sources. .................................................................. 11
Figure 3: Dacq screenshot. ...................................................................................................... 12
Figure 4: Annotated document corpus serialized into XML. ...................................................... 13
Figure 5: Annotated document serialized into HTML and displayed in a Web browser. ............ 14
Figure 6: An example of identified ontological instances interrelated through a CorrelationDefinition and a set of JAPE rules to provide a sentiment polarity classification. ..... 19
Figure 7: The ontology part defining sentiment objects and gazetteers is "grown" from a list of seed stock indices. ................................................................................................................... 20
Figure 8: Corpus construction process. .................................................................................... 25
Index of Tables
Table 1: Supported key-value pairs for configuring Dacq. ......................................................... 11
Table 2: Some basic statistics and types of annotations related to the acquired data (taken from FIRST D2.3). ............................................................................................................................ 15
Table 3: Part 1 of the FIRST dataset – basic statistics and comments. .................................... 16
Table 4: Part 2 of the FIRST dataset – basic statistics and comments. .................................... 17
Table 5: Approach to retrieving documents for the three use cases.......................................... 24
Abbreviations and acronyms
DacqPipe, Dacq Data acquisition pipeline
OBIE Ontology-based information extraction
JAPE Java Annotation Patterns Engine, a component of the GATE platform
GATE General Architecture for Text Engineering, a Java suite of tools for all sorts of natural language processing tasks, including information extraction in many languages (originally developed at the University of Sheffield)
OWL The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies
N3 Notation3 (N3), a serialization format for Resource Description Framework (RDF) graphs, closely related to Turtle (Terse RDF Triple Language)
RDF The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model
1 Introduction
This report accompanies three "products" developed in WP3 in the first project year. In this report, we briefly describe each of these three project assets and provide Web addresses to a range of online demos related to the data acquisition pipeline. Furthermore, we provide download locations and usage instructions for the resources released in the context of D3.1.
Specifically, this report covers the following three topics:
(1) The data acquisition software called DacqPipe or Dacq for short (see Section 2)
(2) The FIRST dataset of news and blog posts (see Section 3)
(3) The FIRST ontology (see Section 4)
In addition, we report the efforts in constructing a manually annotated sentiment corpus that will serve both for the evaluation of the sentiment extraction technology developed in WP4 and training machine learning (i.e., sentiment classification) models in WP6. Since the main purpose of this document is to describe the software prototypes and datasets released at M12, the efforts related to the sentiment corpus construction are reported in Annex 1.
Note that some of the content in this report was copied from other FIRST reports for the reader’s convenience.
2 DacqPipe, the data acquisition pipeline
2.1 Introduction
The data acquisition pipeline consists of several technologies that interoperate to achieve the desired goal, i.e., preparing the data for further analysis. It is responsible for acquiring unstructured data from several data sources, preparing it for the analysis, and brokering it to the appropriate analytical components (e.g., information extraction components developed in WP4). The data acquisition pipeline is running continuously (since April 21, 2011), polling the Web and proprietary APIs for recent content, turning it into a stream of preprocessed text documents.
When dealing with official news streams—such as those provided to the consortium by IDMS—a lot of preprocessing steps can be avoided. Official news are provided in a semi-structured fashion such that titles, publication dates, and other metadata are clearly indicated. Furthermore, named entities (i.e., company names and stock symbols) are identified in texts and article bodies are provided in a raw textual format without any boilerplate (i.e., undesired content such as advertisements, copyright notices, navigation elements, and recommendations).
Content from blogs, forums, and other Web content, however, is not immediately ready to be processed by the text analysis methods. Web pages contain a lot of "noise" that needs to be identified and removed before the content can be analysed. For this reason, we have developed DacqPipe (or Dacq), a data acquisition and preprocessing pipeline. Dacq consists of (i) data acquisition components, (ii) data cleaning components, (iii) natural-language preprocessing components, (iv) semantic annotation components, and (v) ZeroMQ emitter components. The current pipeline topology is shown in Figure 1. Note that the ZeroMQ emitter is actually part of the integration framework (WP7) but is tightly integrated into the data acquisition pipeline.
Figure 1: The current topology of the data acquisition pipeline (taken from FIRST D2.2).
The data acquisition components are mainly RSS readers that poll for data in parallel. One RSS reader is instantiated for each Web site of interest. The RSS sources, corresponding to a particular Web site, are polled one after another by the same RSS reader to prevent the servers from rejecting requests due to concurrency. An RSS reader, after it has collected a new set of documents from an RSS source, dispatches the data to one of several processing pipelines. The pipeline is chosen according to its current load size (load balancing). A processing pipeline consists of a boilerplate remover, language detector, duplicate detector, sentence splitter, tokenizer, part-of-speech tagger, lemmatizer, stop-word detector, semantic annotator, and ZeroMQ emitter.
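The load-balancing step described above can be sketched as follows. This is an illustrative stdlib-Python sketch, not the actual Dacq implementation (which is a .NET application); the class and function names are hypothetical.

```python
# Sketch: an RSS reader dispatches a freshly acquired batch of documents
# to the processing pipeline with the smallest current load.
from queue import Queue

class ProcessingPipeline:
    def __init__(self, name):
        self.name = name
        self.queue = Queue()  # documents waiting to be processed

    def load(self):
        return self.queue.qsize()

def dispatch(batch, pipelines):
    """Send a batch of documents to the least-loaded pipeline."""
    target = min(pipelines, key=lambda p: p.load())
    for doc in batch:
        target.queue.put(doc)
    return target

pipelines = [ProcessingPipeline(f"pipeline-{i}") for i in range(3)]
pipelines[0].queue.put("already queued doc")   # simulate existing load
chosen = dispatch(["doc A", "doc B"], pipelines)
```

With one document already queued on the first pipeline, the new batch goes to an idle pipeline instead.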
The majority of these components were already discussed in FIRST D2.1 (see FIRST D2.1 Section 2.1). The natural-language processing stages (i.e., sentence splitter, tokenizer, part-of-speech tagger, lemmatizer, and stop-word detector) were added because they are a prerequisite for the semantic annotation component and also for the information extraction tasks. Finally, the ZeroMQ emitters were added to establish a "messaging channel" between the data acquisition and preprocessing components (WP3) and the information extraction components (WP4). This enables us to run the two sets of components in two different processes (i.e., runtime environments) or even on two different machines.
2.2 Availability
Dacq is currently available for download as a configurable console-mode application. It is slightly different from the version currently running on the FIRST server. Most notably, it does not require a database connection to run. For this reason, it is extremely easy to deploy it on practically any Windows computer (potentially also under Mono1 on Linux and Mac OS but this setting has not been tested yet). Note, however, that Dacq requires running several threads concurrently when under heavy load in order to process the data in near-real time. The user is thus advised to deploy Dacq on a multi-core machine (e.g., 8 cores or more) or to keep the list of RSS sources appropriately short.
Dacq was successfully deployed on an 8-core machine with 8 GB RAM, acquiring data from more than 2,000 RSS sources from 80 different Web sites (such as CNN, BBC, and Seeking Alpha).
Dacq (the stand-alone console-mode utility) can be downloaded from the following location:
http://first.ijs.si/software/DacqPipeSep2011.zip
The source code is currently not publicly available. However, it will be released in accordance with the FIRST open-source strategy once that strategy is fully devised at M18 (i.e., end of March 2012).
2.3 Deployment and usage instructions
2.3.1 Deployment and configuration
Once you have downloaded DacqPipe, follow these steps to install and configure it:
1. Unzip the downloaded archive into a folder, for example C:\DacqPipe.
2. If .NET Framework (2.0 or later) is not yet installed on your computer, download it from http://www.microsoft.com/download/en/details.aspx?id=19 (32-bit version) or from http://www.microsoft.com/download/en/details.aspx?id=6523 (64-bit version2). Run the downloaded executable file and follow the setup instructions.
3. Dacq should now run with its default settings (see the table below for the default settings).
4. To configure Dacq, simply edit the file Dacq.exe.config (located in the target folder, e.g., C:\DacqPipe) in your favourite text editor.
The configuration file contains a set of key-value pairs in the form "<add key="…" value="…"/>". Table 1 lists the supported key-value pairs.
1 http://www.mono-project.com/Main_Page
2 It is most likely that you are running Dacq in a 64-bit environment.
Key | Description | Default value
logFileName (optional) | The location and name of the log file to which Dacq writes events important mainly for debugging. | Not set
xmlDataRoot (optional1) | The location to which the acquired data is stored in the XML format. | .\Data
htmlDataRoot (optional1) | The location to which the acquired data is stored in the HTML format appropriate for viewing. | .\DataHtml
dataSourcesFileName (mandatory) | The location and name of the file containing RSS sources to be polled for content. | .\RssSources.txt
Table 1: Supported key-value pairs for configuring Dacq.
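For illustration, a Dacq.exe.config with all four keys from Table 1 set explicitly might look as follows. The key names come from Table 1; the surrounding <configuration>/<appSettings> elements are the standard .NET application-configuration convention, and the values shown are illustrative.

```xml
<configuration>
  <appSettings>
    <add key="logFileName" value=".\Dacq.log"/>
    <add key="xmlDataRoot" value=".\Data"/>
    <add key="htmlDataRoot" value=".\DataHtml"/>
    <add key="dataSourcesFileName" value=".\RssSources.txt"/>
  </appSettings>
</configuration>
```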
Once installed and configured, Dacq is started simply by invoking Dacq.exe from the folder into which the archive was extracted (e.g., C:\DacqPipe\Dacq.exe).
Site: abcnews
# Site: http://abcnews.go.com/
# RSS list: http://abcnews.go.com/Site/page?id=3520115
http://feeds.abcnews.com/abcnews/topstories
http://feeds.abcnews.com/abcnews/internationalheadlines
http://feeds.abcnews.com/abcnews/usheadlines
http://feeds.abcnews.com/abcnews/politicsheadlines
http://feeds.abcnews.com/abcnews/blotterheadlines
http://feeds.abcnews.com/abcnews/moneyheadlines
http://feeds.abcnews.com/abcnews/technologyheadlines
http://feeds.abcnews.com/abcnews/healthheadlines
http://feeds.abcnews.com/abcnews/entertainmentheadlines
http://feeds.abcnews.com/abcnews/travelheadlines
http://feeds.abcnews.com/abcnews/sportsheadlines
http://feeds.abcnews.com/abcnews/worldnewsheadlines
Site: bbc
# Site: http://www.bbc.co.uk/news/
# RSS list: http://www.bbc.co.uk/news/10628494
http://feeds.bbci.co.uk/news/rss.xml
http://feeds.bbci.co.uk/news/world/rss.xml
http://feeds.bbci.co.uk/news/uk/rss.xml
http://feeds.bbci.co.uk/news/business/rss.xml
http://feeds.bbci.co.uk/news/politics/rss.xml
http://feeds.bbci.co.uk/news/health/rss.xml
http://feeds.bbci.co.uk/news/education/rss.xml
http://feeds.bbci.co.uk/news/science_and_environment/rss.xml
http://feeds.bbci.co.uk/news/technology/rss.xml
Figure 2: An example of the file with RSS sources.
Dacq requires a file with RSS sources to work. These sources are periodically polled for content. The location and name of the file with RSS sources is specified with the dataSourcesFileName configuration parameter. The file format is relatively simple and contains several lists of RSS sources, one for each Web site. An example is shown in Figure 2. Each RSS list starts with a site identifier (e.g., "Site: abcnews"). The URLs of RSS sources are listed after the site identifier, each in its own line. The list ends with the next site identifier (or with the end of file). Lines starting with "#" are comments and are ignored by Dacq.
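The file format just described can be parsed in a few lines. The following is a hypothetical stdlib-Python sketch (Dacq itself is a .NET application); it implements exactly the rules above: a "Site:" line starts a new list, "#" lines are comments, and every other non-empty line is a feed URL belonging to the current site.

```python
def parse_rss_sources(text):
    sites = {}            # site identifier -> list of feed URLs
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue      # skip blank lines and comments
        if line.lower().startswith("site:"):
            current = line[5:].strip()
            sites.setdefault(current, [])
        elif current is not None:
            sites[current].append(line)
    return sites

# A fragment of the file shown in Figure 2:
example = """\
Site: abcnews
# RSS list: http://abcnews.go.com/Site/page?id=3520115
http://feeds.abcnews.com/abcnews/topstories
Site: bbc
http://feeds.bbci.co.uk/news/rss.xml
http://feeds.bbci.co.uk/news/world/rss.xml
"""
parsed = parse_rss_sources(example)
```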
1 For the data acquisition to make sense, at least one of the two data locations (i.e., xmlDataRoot and htmlDataRoot) needs to be set.
2.3.2 Usage
Dacq starts as a console-mode application. The console displays the current activities of the data acquisition pipeline and potentially reports problems (see Figure 3). The same activity and error messages are written into a log file if logging is enabled (i.e., if logFileName is set; see Section 2.3.1).
Figure 3: Dacq screenshot.
Dacq is shut down by pressing Ctrl-C. The message "*** Ctrl-C command received. ***" will appear in the console. Note that Dacq needs some time to shut down properly as it needs to finalize the processing of the data contained in the component queues. If the shut-down process takes too long and if the finalization of processing is not crucial, the user can close the window by pressing Alt-F4 and thus terminate the application.
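The graceful-shutdown behaviour described above (finish the items already queued before exiting) can be sketched as follows. This is a hypothetical stdlib-Python illustration, not the actual .NET implementation.

```python
# Sketch: on a stop request, a pipeline stage drains its queue before exiting.
import queue
import threading

def run_stage(q, processed, stop_event):
    # Exit only once a stop was requested AND the queue is empty,
    # i.e., all in-flight work has been finalized.
    while not (stop_event.is_set() and q.empty()):
        try:
            item = q.get(timeout=0.05)
        except queue.Empty:
            continue
        processed.append(item)

q = queue.Queue()
for doc in ["doc1", "doc2", "doc3"]:
    q.put(doc)
processed = []
stop = threading.Event()
worker = threading.Thread(target=run_stage, args=(q, processed, stop))
worker.start()
stop.set()          # analogous to pressing Ctrl-C
worker.join()       # returns only after the queue has been drained
```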
2.3.3 Data files
A batch of documents (either news or blog posts), acquired and preprocessed with Dacq, is internally stored as an annotated document corpus object. The annotated document corpus (ADC) data structure is very similar to the one that GATE1 uses and is best described in the GATE user’s guide1. ADC can be serialized either into XML or into a set of HTML files. Figure 4 shows a toy example of an ADC serialized into XML. In short, a document corpus normally contains one or more documents and is described with features (i.e., a set of key-value pairs). A document is also described with features and in addition contains annotations. An annotation gives a special meaning to a text segment (e.g., token, sentence, named entity). Note that an annotation can also be described with features.
1 GATE is a Java suite of tools developed at the University of Sheffield used for all sorts of natural language processing tasks, including information extraction. It is freely available at http://gate.ac.uk/.
<DocumentCorpus xmlns="http://freekoders.org/latino">
<Features>
<Feature>
<Name>source</Name>
<Value>smh.com.au/technology</Value>
</Feature>
</Features>
<Documents>
<Document>
<Name>Steve Jobs quits as Apple CEO</Name>
<Text>Tech industry legend and one of the finest creative minds of a generation,
Steve Jobs, has resigned as chief executive of Apple.</Text>
<Annotations>
<Annotation>
<SpanStart>75</SpanStart>
<SpanEnd>84</SpanEnd>
<Type>named entity/person</Type>
<Features />
</Annotation>
<Annotation>
<SpanStart>122</SpanStart>
<SpanEnd>126</SpanEnd>
<Type>named entity/company</Type>
<Features>
<Feature>
<Name>stockSymbol</Name>
<Value>AAPL</Value>
</Feature>
</Features>
</Annotation>
</Annotations>
<Features>
<Feature>
<Name>URL</Name>
<Value>http://www.smh.com.au/technology/technology-news/steve-jobs-quits-as-
apple-ceo-20110825-1jat8.html</Value>
</Feature>
</Features>
</Document>
</Documents>
</DocumentCorpus>
Figure 4: Annotated document corpus serialized into XML.
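An ADC XML file like the one in Figure 4 can be read with standard XML tooling. The following stdlib-Python sketch is our own illustration (the namespace URI is taken from the example in Figure 4); note that, judging by the example, span boundaries are inclusive, so [75, 84] covers the ten characters of "Steve Jobs".

```python
import xml.etree.ElementTree as ET

NS = {"adc": "http://freekoders.org/latino"}

# A trimmed-down version of the corpus from Figure 4:
adc_xml = """\
<DocumentCorpus xmlns="http://freekoders.org/latino">
  <Documents>
    <Document>
      <Name>Steve Jobs quits as Apple CEO</Name>
      <Text>Tech industry legend and one of the finest creative minds of a generation, Steve Jobs, has resigned as chief executive of Apple.</Text>
      <Annotations>
        <Annotation>
          <SpanStart>75</SpanStart>
          <SpanEnd>84</SpanEnd>
          <Type>named entity/person</Type>
        </Annotation>
      </Annotations>
    </Document>
  </Documents>
</DocumentCorpus>
"""

root = ET.fromstring(adc_xml)
doc = root.find("adc:Documents/adc:Document", NS)
text = doc.find("adc:Text", NS).text
spans = []
for ann in doc.findall("adc:Annotations/adc:Annotation", NS):
    start = int(ann.find("adc:SpanStart", NS).text)
    end = int(ann.find("adc:SpanEnd", NS).text)
    # Inclusive span boundaries: [start, end] -> text[start : end + 1].
    spans.append((ann.find("adc:Type", NS).text, text[start:end + 1]))
```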
The annotated document, contained in the XML in Figure 4, serialized into HTML and displayed in a Web browser is shown in Figure 5.
1 Available at http://gate.ac.uk/sale/tao/split.html. See http://gate.ac.uk/sale/tao/splitch5.html#x8-910005.4.2 for some simple examples of annotated documents in GATE.
Figure 5: Annotated document serialized into HTML and displayed in a Web browser.
Dacq stores the acquired and annotated document corpora in files. It can be configured to store them as XML, as HTML, or both (see Section 2.3.1 on how to configure Dacq). Dacq creates a separate folder for each day (e.g., <xmlDataRoot>\2011\9\8\ would be created on September 8, 2011) and assigns unique names to data files. The name of a file consists of a time stamp and a random ID (e.g., <xmlDataRoot>\2011\9\8\14_29_33_c9bef21a1d4f4e4db0-c82624d5b741bb.xml). Note that the time stamp (the first 8 characters of the file name, i.e., hh_mm_ss) represents the acquisition time, not the publication time.
For storing HTMLs, Dacq similarly creates a separate folder for each day. However, for each document corpus, it then creates another folder with a unique name consisting of a time stamp and a random ID (e.g., <htmlDataRoot>\2011\9\8\14_29_33_c9bef21a1d4f4e4db0c82624d-5b741bb\). Each such folder contains several HTML files: the index file (i.e., Index.html) and one additional file for each document in the corpus. To view a document corpus, open the corresponding Index.html in a Web browser.
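The XML naming scheme described above can be reconstructed as follows. This is a hypothetical stdlib-Python sketch: the function name is ours, and the backslash-separated, non-zero-padded day folders mirror the examples in the text.

```python
import datetime
import uuid

def xml_corpus_path(xml_data_root, when, rand_id=None):
    """Build <xmlDataRoot>\\<year>\\<month>\\<day>\\hh_mm_ss_<randomId>.xml."""
    rand_id = rand_id or uuid.uuid4().hex        # random unique ID
    day_dir = f"{xml_data_root}\\{when.year}\\{when.month}\\{when.day}"
    stamp = when.strftime("%H_%M_%S")            # acquisition time
    return f"{day_dir}\\{stamp}_{rand_id}.xml"

when = datetime.datetime(2011, 9, 8, 14, 29, 33)
path = xml_corpus_path(".\\Data", when, rand_id="c9bef21a1d4f4e4d")
```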
2.4 Pointers to online demos
As already stated, the data acquisition pipeline consists of several components that work together. We prepared several simple online demos that demonstrate separate data acquisition components. The following is the list of online demos related to the data acquisition pipeline:
Boilerplate removal demo:
http://first.ijs.si/demos/boilerplateremovedemo/
Language and duplicate detection demo:
http://first.ijs.si/demos/duplicatedetectordemo/
Annotation pipeline demo:
http://first.ijs.si/demos/annotationpipelinedemo/
3 FIRST dataset of news and blog posts
3.1 Introduction
The FIRST dataset is currently available for download in two parts. The first part was acquired in the time period from April 21, 2011 (17:20+2), to June 29, 2011 (19:18+2). The acquisition of the second part is still in progress; the pipeline was started on June 29, 2011 (19:27+2). The most recent available data archive contains the August data.
Part 1 (Apr–Jun 2011, M7–M9):
- Scale: number of sites: 39; number of RSS feeds: 1,950 (~50 per site on average); avg. number of documents per site per day: 870; total new documents per day: 33,950. Part 1 of the FIRST dataset contains altogether approx. 2,375,000 documents.
- Annotations: N/A.
Part 2 (Jun–Sep 2011, M9–M12; currently being acquired):
- Scale: number of sites: 80; number of RSS feeds: 2,472 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 34,000. Part 2 of the FIRST dataset contains at the moment (Sep 7, 2011) 2,674,827 documents.
- Annotations: boilerplate remover annotations.
Part 3 (Sep 2011–Sep 2012, M12–M24):
- Scale: unchanged.
- Annotations: language detector, duplicate detector, sentence splitter, tokenizer, POS tagger, lemmatizer, stop-word detector and entity recognizer annotations.
Part 4 (Sep 2012–Sep 2013, M24–M36):
- Scale: number of sites: 160; number of RSS feeds: 4,800 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 68,000.
- Annotations: unchanged.
Table 2: Some basic statistics and types of annotations related to the acquired data (taken from FIRST D2.3).
Table 2 presents some basic statistics and types of annotations related to the data acquired in the first project year (Part 1 and Part 2), together with projections and plans for the second and third project years. Note that the average number of RSS feeds per site and the average number of acquired documents per site per day decrease from Part 1 to Part 2. This is mainly due to the fact that we included a lot of blogs in Part 2. A blog usually provides a single RSS feed and only a few posts per day or week, while a larger news Web site provides a range of RSS feeds and hundreds of news items per day. Another reason for the drop in the average number of documents per site per day, and consequently in the total number of new documents per day, is the new filtering policy. In Part 2, we only accept HTML and plain-text documents that are 10 MB or less in size. In Part 1, non-textual content (such as video, audio, PDF, and XML) was also included – even though it is not going to be analysed in FIRST – and its size was not limited in the acquisition process.
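The Part-2 filtering policy amounts to a simple predicate over a document's content type and size. The sketch below is illustrative stdlib Python; the _contentType values ("Html", "Text", "BinaryBase64") are taken from the dataset description, while treating the 10 MB limit as inclusive is our assumption.

```python
MAX_SIZE = 10 * 1024 * 1024  # 10 MB (inclusive threshold is an assumption)

def accept(content_type, content_length):
    """Part-2 policy: only HTML/plain-text documents of at most 10 MB."""
    return content_type in ("Html", "Text") and content_length <= MAX_SIZE

checks = [
    accept("Html", 5_000),           # small HTML page -> accepted
    accept("Text", MAX_SIZE),        # exactly 10 MB of text -> accepted
    accept("Html", MAX_SIZE + 1),    # oversized -> rejected
    accept("BinaryBase64", 1_000),   # video/audio/PDF etc. -> rejected
]
```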
3.2 Availability
The FIRST dataset is available at:
http://first.ijs.si/firstdataset/
Apart from the basic statistics and comments summarized in Table 3 and Table 4, the dataset download page also allows the user to view the associated RSS sources and explore the data (i.e., view annotated document corpora HTML representations).
Period: April 21, 2011 [17:20+2] – June 29, 2011 [19:18+2]
Web sites / RSS sources:
39 / 1,950
Corpora: 754,601
Documents: ~2,375,000
Annotations: N/A
Corpus features:
RSS channel features (see RSS Specification1): title, description, language, copyright, managingEditor, link, pubDate, category
Other features: _provider [type of the component that produced the corpus], _sourceUrl [link to the HTML file], _source [HTML source encoded as Base642], _timeBetweenPolls [defines polling frequency], _timeStart [corpus acquisition start time], _timeEnd [corpus acquisition end time]
Document features:
RSS item features (see RSS Specification1): title, description, link, pubDate, author, category, source
Other features: _mimeType, _contentType, _charSet, _contentLength, raw [content encoded as Base642], _guid, _time [document acquisition time]
Comments: - All content types accepted (_contentType values: BinaryBase64, Html, Text, Xml).
- No limit on content size.
- If _contentType is "Html", <Text> contains the corresponding HTML content (including tags and boilerplate). The Base64-encoded binary representation of the same HTML content is stored in the feature "raw".
- If _contentType is "Text" or "Xml", <Text> contains the corresponding content. The Base64-encoded binary representation of the same content is stored in the feature "raw".
- If _contentType is "BinaryBase64", <Text> contains the corresponding Base64-encoded binary data. In this case, the feature "raw" contains redundant data.
Size: 40.5 GB in 3 7ZIP files (compression ratio ~10%)
Table 3: Part 1 of the FIRST dataset – basic statistics and comments.
1 http://www.rssboard.org/rss-specification
2 http://en.wikipedia.org/wiki/Base64
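As Table 3 notes, the "raw" document feature holds the original content encoded as Base64. Recovering the bytes is straightforward with the standard library; the sample value below is made up for illustration.

```python
import base64

# A made-up "raw" feature value, as it might appear in a document corpus file:
raw_feature = base64.b64encode(
    "<html><body>Hello</body></html>".encode("utf-8")
).decode("ascii")

def decode_raw(value, charset="utf-8"):
    """Decode the Base64-encoded 'raw' feature back to text."""
    return base64.b64decode(value).decode(charset)

html = decode_raw(raw_feature)
```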
Period: Started on June 29, 2011 [19:27+2]
Web sites / RSS sources:
80 / 2,472
Corpora: 899,241 (and growing)
Documents: 2,674,827 (and growing)
Annotations: TextBlock/Boilerplate, TextBlock/Headline, TextBlock/FullText, TextBlock/Supplement, TextBlock/RelatedContent, TextBlock/UserComment
Corpus features:
RSS channel features (see RSS Specification): title, description, language, copyright, managingEditor, link, pubDate, category
Other features: _provider [type of the component that produced the corpus], _sourceUrl [link to the HTML file], _source [HTML source encoded as base64], _timeBetweenPolls [defines polling frequency], _timeStart [corpus acquisition start time], _timeEnd [corpus acquisition end time]
Document features:
RSS item features (see RSS Specification): title, description, link, pubDate, author, category, source
Other features: _mimeType, _contentType, _charSet, _contentLength, raw [content encoded as base64], _guid, _time [document acquisition time]
Comments: - Only "Html" and "Text" content types accepted. - Only contents that are 10 MB or less in size accepted.
Size: 41.3 GB in 3 7ZIP files (compression ratio ~10%)
Table 4: Part 2 of the FIRST dataset – basic statistics and comments.
4 Semantic resources and the FIRST ontology
4.1 Existing semantic resources
Semantic and lexical resources, potentially relevant to FIRST, include:
McDonald’s word list (http://www.nd.edu/~mcdonald/Word_Lists.html).
Ontology developed by UHOH, used in their preliminary sentiment analysis experiments (see (Klein, Altuntas, Kessler, & Häusser, 2011)).
Publicly available sentiment-labelled datasets designed for sentiment analysis experiments, e.g.:
o http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
o http://langtech.jrc.it/JRC_Resources.html#Sentiment-quotes
DMoz (http://www.dmoz.org/),
DBpedia (http://dbpedia.org/),
ResearchCyc (http://research.cyc.com/),
SentiWordNet (http://sentiwordnet.isti.cnr.it/),
WordNet (http://wordnet.princeton.edu/),
Financial glossaries (e.g., http://www.forbes.com/tools/glossary/index.jhtml),
Proprietary data available to the consortium via the IDMS API (contains companies, instruments, notations, countries, industrial sectors…).
After examining these resources, we decided to build the FIRST ontology on top of the UHOH ontology, extending it with sentiment objects (i.e., objects of interest, e.g., companies, stocks, countries) from the IDMS proprietary database and sentiment vocabularies from the McDonald’s word list and SentiWordNet.
4.2 The FIRST ontology
The FIRST ontology differs from the traditional notion of an ontology in several respects, most notably the following:
It contains only a shallow subsumption hierarchy of the domain.
It contains explicit lexical knowledge about the domain (i.e., it contains terminological gazetteers).
It contains many instances (companies, stocks, countries…), accompanied by the lexical information required to detect them in texts during the document annotation process and pass them on to the information extraction process.
The following is the list of the high-level triples (i.e., domain-relation-range triples) in the current version of the FIRST ontology:
CorrelationDefinition – correlationDefinitionIsInfluencedByIndicator – Indicator
o Inverse: Indicator – indicatorHasCorrelationDefinition – CorrelationDefinition
CorrelationDefinition – correlationDefinitionInfluencesFeature – ObjectFeature
o Inverse: ObjectFeature – featureOfCorrelationDefinition – CorrelationDefinition
CorrelationDefinition – correlationDefinitionInfluencesObject – SentimentObject
o Inverse: SentimentObject – objectOfCorrelationDefinition – CorrelationDefinition
The following sections give more information about these high-level entities.
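The high-level triples and their inverses listed above can be held in a simple lookup structure; the sketch below is a hypothetical illustration, not part of the ontology tooling.

```python
# Domain-relation-range triples of the FIRST ontology, including inverses.
TRIPLES = {
    ("CorrelationDefinition", "correlationDefinitionIsInfluencedByIndicator", "Indicator"),
    ("Indicator", "indicatorHasCorrelationDefinition", "CorrelationDefinition"),
    ("CorrelationDefinition", "correlationDefinitionInfluencesFeature", "ObjectFeature"),
    ("ObjectFeature", "featureOfCorrelationDefinition", "CorrelationDefinition"),
    ("CorrelationDefinition", "correlationDefinitionInfluencesObject", "SentimentObject"),
    ("SentimentObject", "objectOfCorrelationDefinition", "CorrelationDefinition"),
}

def ranges(domain: str) -> set:
    """All range classes reachable from a domain class in one hop."""
    return {r for d, _, r in TRIPLES if d == domain}
```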
4.2.1 Ontology-based information extraction process in FIRST
In this section, we briefly describe the ontology-based information extraction (OBIE) process employed in FIRST (WP4). This outline explains the need for the top-level concepts defined in the FIRST ontology. The goal of OBIE in FIRST is to classify the polarity of sentiment with respect to a sentiment object of interest (e.g., a specific stock index) and a certain feature (e.g., the price), based on news and blog posts as they are published (e.g., "I expect the price of NASDAQ-100 to drop"). This goal is envisioned to be achieved as follows:
1. The OBIE component inspects the documents for SentimentObjects (e.g., NASDAQ-100), Indicators (e.g., PriceTrend, which is identified by the term "price trend"), and OrientationPhrases (e.g., "unchanged"). An example is shown in Figure 6.
Figure 6: An example of identified ontological instances interrelated through a CorrelationDefinition and a set of JAPE rules to provide a sentiment polarity classification.
2. CorrelationDefinitions relate classes of sentiment objects (e.g., stock indices) and their features (e.g., future price change) to Indicators. For each detected Indicator, OBIE thus finds the corresponding CorrelationDefinitions. For example, for the price-trend indicator, the correlation definition indicatorHasCorrelationToStockPrice can be found in the ontology. Note that the correlations can be either positive or negative. CorrelationDefinitions are related to ObjectFeatures (through the correlationDefinitionInfluencesFeature relation) and "constrained" to certain classes of sentiment objects (through the correlationDefinitionInfluencesObject relation). For example, the indicatorHasCorrelationToStockPrice correlation definition influences the ObjectFeature ExpectedFuturePriceChange and the members of the class Stock_Index.
3. The identified instances, i.e., the OrientationPhrases, Indicators, CorrelationDefinitions, SentimentObjects, and ObjectFeatures, are passed to GATE's JAPE engine, which executes a set of rules to determine the sentiment at the sentence level.
More information about the information extraction process, the JAPE rules, and the sentiment aggregation model is given in FIRST D4.1 and in (Klein, Altuntas, Kessler, & Häusser, 2011).
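The three steps above can be sketched in a few lines. The toy lexicons, correlation sign, and polarity rule below are simplified placeholders for the actual gazetteers, ontology content, and JAPE rules, which are far richer.

```python
# Toy lexicons standing in for the ontology gazetteers (illustrative only).
SENTIMENT_OBJECTS = {"nasdaq-100": "NASDAQ_100"}
INDICATORS = {"price trend": "PriceTrend"}
ORIENTATIONS = {"unchanged": 0, "rise": +1, "drop": -1}
# Sign of the correlation definition linking an indicator to an object feature.
CORRELATIONS = {"PriceTrend": ("ExpectedFuturePriceChange", +1)}

def classify(sentence: str):
    """Very rough sentence-level polarity: orientation sign times correlation sign."""
    text = sentence.lower()
    obj = next((v for k, v in SENTIMENT_OBJECTS.items() if k in text), None)
    ind = next((v for k, v in INDICATORS.items() if k in text), None)
    ori = next((v for k, v in ORIENTATIONS.items() if k in text), None)
    if obj is None or ind is None or ori is None:
        return None  # sentence does not mention all three required instances
    feature, sign = CORRELATIONS[ind]
    return obj, feature, ori * sign
```

On the example sentence from Figure 6, this sketch yields a neutral polarity for the expected future price change of NASDAQ-100.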
4.2.2 Sentiment objects and gazetteers
Sentiment objects are simply objects towards which people express sentiment. The following excerpt from the subsumption hierarchy of the FIRST ontology shows the most important classes of sentiment objects:
SentimentObject
FinancialInstrument
Index
Stock_Index
Stock
Company
Country
A sentiment object is an instance of a SentimentObject descendant (e.g., Google as an instance of Company or GOOG as an instance of Stock) and is equipped with a gazetteer. Gazetteers are instances of the class Gazetteer. They define terms and stop words, and can import other gazetteers, in order to provide the lexical knowledge required to identify ontological instances in texts.
The part of the ontology defining sentiment objects and gazetteers is automatically induced from the IDMS database and data from MSN Money [1]. For the purpose of inducing the SentimentObject subsumption hierarchy and populating it with instances, we employ the RDBToOnto methodology [2] (Cerbah, 2009) developed in the European project TAO [3] (Transitioning Applications to Ontologies). We "follow" the IDMS data access API to "grow" the ontology from a list of seed stock indices, as illustrated in Figure 7.
Figure 7: The ontology part defining sentiment objects and gazetteers is “grown” from a list of seed stock indices.
[1] http://money.msn.com/
[2] http://www.tao-project.eu/researchanddevelopment/demosanddownloads/RDBToOnto.html
[3] STREP IST-2004-026460, http://www.tao-project.eu
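The "growing" procedure of Figure 7 amounts to following links from the seed indices to their constituents, issuing companies, and countries. The sketch below assumes three hypothetical lookup callables in place of the real IDMS data access API.

```python
def grow_ontology(seed_indices, constituents_of, company_of, country_of):
    """Expand from seed stock indices to stocks, companies, and countries.

    The three lookup callables stand in for the IDMS data access API
    (hypothetical signatures, for illustration only).
    """
    instances = {"Stock_Index": set(seed_indices), "Stock": set(),
                 "Company": set(), "Country": set()}
    for index in seed_indices:
        for stock in constituents_of(index):      # index -> member stocks
            instances["Stock"].add(stock)
            company = company_of(stock)           # stock -> issuing company
            instances["Company"].add(company)
            instances["Country"].add(country_of(company))  # company -> country
    return instances
```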
Let us look at a concrete example. One of the seed stock indices from which the FIRST ontology is "grown" is NASDAQ-100, which is defined as follows:
:NASDAQ_100 a :Stock_Index ;
rdfs:label "NASDAQ-100" .
The stock of the Microsoft Corporation is a member of the NASDAQ-100 stock index:
:MICROSOFT a :Stock ;
rdfs:label "MICROSOFT CORP COM USD0.00000625" ;
:memberOf :NASDAQ_100 .
The shares of the Microsoft stock are issued by the Microsoft Corporation. This is stated in the ontology as follows:
:MICROSOFT_CORP a :Company ;
rdfs:label "Microsoft Corp." ;
:issues :MICROSOFT .
Microsoft Corporation is located in the USA:
:USA a :Country ;
rdfs:label "USA" .
:MICROSOFT_CORP
:locatedIn :USA .
The MICROSOFT_CORP instance is linked to the gazetteer MICROSOFT_CORP_Gazetteer. This is defined as follows:
:MICROSOFT_CORP
:hasGazetteer :MICROSOFT_CORP_Gazetteer .
:MICROSOFT_CORP_Gazetteer
:hasTerm "Microsoft Corp" ;
:hasTerm "Microsoft Corporation" ;
:hasStopWord "CORP" ;
:hasStopWord "CORPORATION" ;
a :Gazetteer .
Gazetteer terms are acquired from several sources (in our case, from the IDMS database and data from MSN Money) and the corresponding stop words are computed automatically. Each term in a gazetteer is first represented as a set of words Ti (e.g., Ti = {"Microsoft", "Corp"}).
Then, the corresponding stop word set S is computed as S = ∪i Ti − ∩i Ti (i.e., the union of the gazetteer terms minus their intersection). In the above example,
T1 = {"Microsoft", "Corp"}, T2 = {"Microsoft", "Corporation"}, T1 ∪ T2 = {"Microsoft", "Corp",
"Corporation"}, T1 ∩ T2 = {"Microsoft"}, and finally, S = T1 ∪ T2 − T1 ∩ T2 = {"Corp", "Corporation"}.
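The stop-word computation is a direct application of set operations; in the sketch below each term is given as a set of words, exactly as in the example above.

```python
def stop_words(terms):
    """S = (union of the term word-sets) minus (their intersection)."""
    union = set().union(*terms)
    inter = set.intersection(*map(set, terms))
    return union - inter

# Worked example from the text:
t1 = {"Microsoft", "Corp"}
t2 = {"Microsoft", "Corporation"}
s = stop_words([t1, t2])
# s == {"Corp", "Corporation"}
```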
The gazetteers will be exploited by the semantic annotation component, which will be developed before M18 and released together with the M24 prototype. The goal of the semantic annotation component will be to examine the text and recognize each sequence of words that corresponds to a gazetteer term. In this process, the corresponding stop words will be removed from both sequences.
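The planned matching step can be approximated as follows: drop the gazetteer's stop words from both the term and the candidate word sequence, then compare what remains. This is a hypothetical sketch of the idea, not the actual annotation component; it assumes stop words are stored in upper case, as in the MICROSOFT_CORP_Gazetteer example above.

```python
def matches(term_words, text_words, stops):
    """True if two word sequences agree once stop words are removed (order kept).

    `stops` holds upper-case stop words, mirroring :hasStopWord values
    such as "CORP" and "CORPORATION" in the gazetteer example.
    """
    strip = lambda ws: [w for w in ws if w.upper() not in stops]
    return strip(term_words) == strip(text_words)
```

Under this rule, the text sequence ["Microsoft"] matches the gazetteer term ["Microsoft", "Corp"], because "Corp" is a stop word of that gazetteer.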
Last but not least, during the second project year, the ontology will be extended with additional sentiment objects such as people, industrial sectors, and topics.
4.2.3 Availability
The latest version of the FIRST ontology is available at:
Sentiment objects and gazetteers only:
http://first.ijs.si/firstontology/FIRSTOntology.n3
Consolidated ontology (also includes indicators, object features, correlation definitions, and sentiment-bearing phrases):
http://first.ijs.si/firstontology/ConsolidatedFIRSTOntology.n3 (OWL-N3 format)
http://first.ijs.si/firstontology/ConsolidatedFIRSTOntology.owl (OWL-RDF format)
References
Cerbah, F. (2009). RDBToOnto User Guide: From Relational Databases to Fine-Tuned Populated Ontologies. Retrieved September 11, 2011, from http://www.tao-project.eu/researchanddevelopment/demosanddownloads/RDBToOnto-Page/rdbtoontoguide.pdf
Klein, A., Altuntas, O., Kessler, W., & Häusser, T. (2011). Extracting Investor Sentiment from Weblog Texts. In Proceedings of the 13th IEEE Conference on Commerce and Enterprise Computing (CEC), Luxembourg.
Annex 1. Annotated sentiment corpus construction
As indicated in D1.3, Section 3, the construction of a manually annotated corpus is crucial for the evaluation of the information extraction performance and additionally serves as a training dataset for machine learning methods.
The experts in charge of annotating the documents, who are project partners in the consortium, also search the information sources listed in Annex 2 of D1.2 for suitable documents. Standard Web search engines such as Google are used for this task.
Table 5 summarizes the three approaches to retrieving documents for the three different use cases. The main requirements are that the documents contain sentiment related to the features of predefined financial objects specific to the use cases and that there are at least 30 documents available for each indicator/topic at the end.
Retrieval of corpus documents
Use case | Search engine(s) | Individual search on selected URLs | Google News keyword search
UC #1 X
UC #2 X X X
UC #3 X
Table 5: Approach to retrieving documents for the three use cases.
Manual annotation of the documents. The most time-consuming task is the annotation of all the documents by the experts. This is especially true for the FIRST corpus, as there are many annotation concepts and properties to be filled in by the annotators. Furthermore, the financial domain is particularly challenging and demands deeper expertise when it comes to classifying sentiment. The reason is the breadth of the financial domain and the amount of technical knowledge needed to fully understand blogs and news written by financial analysts, especially when specialised financial language is used or the sentiment is expressed vaguely or spread over larger parts of the text.
Corpus construction status. Figure 8 shows the overall corpus construction process. The final annotation phase was prepared in a stepwise approach covering over half a year of alignment sessions, test annotation rounds, and constant feedback from the use case experts. At the end of M12, the corpus contains 350 annotated documents and an additional 420 documents that have already been retrieved. The envisaged final number of annotated documents is around 1,500.
Figure 8: Corpus construction process.