D3.1 Semantic resources and data acquisition

Source: http://first.ijs.si/FirstShowcase/Content/reports/D3.1.pdf

Project Acronym: FIRST

Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making

Project Number: 257928

Instrument: STREP

Thematic Priority: ICT-2009-4.3 Information and Communication Technology

D3.1 Semantic resources and data acquisition

Work Package: WP3 - Data acquisition and ontology infrastructure

Due Date: 30/09/2011

Submission Date: 30/09/2011

Start Date of Project: 01/10/2010

Duration of Project: 36 Months

Organisation Responsible for Deliverable: JSI

Version: 1.0

Status: Final

Author Name(s): Miha Grčar (JSI), Tobias Häusser (UHOH), Dominic Ressel (UHOH)

Reviewer(s): Achim Klein (UHOH), Mateusz Radzimski (ATOS)

Nature: R – Report; P – Prototype; D – Demonstrator; O – Other

Dissemination Level: PU – Public; CO – Confidential, only for members of the consortium (including the Commission); RE – Restricted to a group specified by the consortium (including the Commission Services)

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)


Revision history

0.1 – 21/08/2011 – Miha Grčar (JSI) – High-level TOC and first inputs.

0.2 – 09/09/2011 – Miha Grčar (JSI) – Added content about Dacq and the dataset.

0.3 – 17/09/2011 – Miha Grčar (JSI), Tobias Häusser (UHOH) – Added content about the FIRST ontology.

0.4 – 27/09/2011 – Miha Grčar (JSI) – Revision according to the reviewers' comments.

0.5 – 28/09/2011 – Miha Grčar (JSI), Dominic Ressel (UHOH) – Included Dominic's contribution on the sentiment corpus construction.

1.0 – 30/09/2011 – Tomás Pariente – Final QA and preparation for submission.


Copyright © 2011, FIRST Consortium

The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute all or parts of this document, provided that the FIRST project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.



Executive summary

This report accompanies three "products" developed in WP3 in the first project year:

(1) The data acquisition software called DacqPipe (Dacq for short)

(2) The FIRST dataset of news and blog posts

(3) The FIRST ontology

In this report, we briefly describe each of these three project assets and provide the Web addresses of a range of online demos related to the data acquisition pipeline. Furthermore, we provide download locations and usage instructions for the resources released in the context of D3.1.

In addition, we report on the effort of constructing a manually annotated sentiment corpus that will serve both for the evaluation of the sentiment extraction technology developed in WP4 and for training machine learning (i.e., sentiment classification) models in WP6. Since the main purpose of this document is to describe the software prototypes and datasets released at M12, the work on sentiment corpus construction is reported in Annex 1.


Table of contents

Executive summary
Abbreviations and acronyms
1 Introduction
2 DacqPipe, the data acquisition pipeline
2.1 Introduction
2.2 Availability
2.3 Deployment and usage instructions
2.3.1 Deployment and configuration
2.3.2 Usage
2.3.3 Data files
2.4 Pointers to online demos
3 FIRST dataset of news and blog posts
3.1 Introduction
3.2 Availability
4 Semantic resources and the FIRST ontology
4.1 Existing semantic resources
4.2 The FIRST ontology
4.2.1 Ontology-based information extraction process in FIRST
4.2.2 Sentiment objects and gazetteers
4.2.3 Availability
References
Annex 1. Annotated sentiment corpus construction

Index of Figures

Figure 1: The current topology of the data acquisition pipeline (taken from FIRST D2.2).
Figure 2: An example of the file with RSS sources.
Figure 3: Dacq screenshot.
Figure 4: Annotated document corpus serialized into XML.
Figure 5: Annotated document serialized into HTML and displayed in a Web browser.
Figure 6: An example of identified ontological instances interrelated through a CorrelationDefinition and a set of JAPE rules to provide a sentiment polarity classification.
Figure 7: The ontology part defining sentiment objects and gazetteers is "grown" from a list of seed stock indices.
Figure 8: Corpus construction process.


Index of Tables

Table 1: Supported key-value pairs for configuring Dacq.
Table 2: Some basic statistics and types of annotations related to the acquired data (taken from FIRST D2.3).
Table 3: Part 1 of the FIRST dataset – basic statistics and comments.
Table 4: Part 2 of the FIRST dataset – basic statistics and comments.
Table 5: Approach to retrieving documents for the three use cases.


Abbreviations and acronyms

DacqPipe, Dacq – Data acquisition pipeline

OBIE – Ontology-based information extraction

JAPE – Java Annotation Patterns Engine, a component of the GATE platform

GATE – General Architecture for Text Engineering, a Java suite of tools for all sorts of natural language processing tasks, including information extraction in many languages (originally developed at the University of Sheffield)

OWL – The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies

N3 – Notation3 (N3) is a serialization format for Resource Description Framework (RDF) graphs; Turtle (Terse RDF Triple Language) is a widely used subset of N3

RDF – The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model


1 Introduction

This report accompanies three "products" developed in WP3 in the first project year. In this report, we briefly describe each of these three project assets and provide the Web addresses of a range of online demos related to the data acquisition pipeline. Furthermore, we provide download locations and usage instructions for the resources released in the context of D3.1.

Specifically, this report covers the following three topics:

(1) The data acquisition software called DacqPipe or Dacq for short (see Section 2)

(2) The FIRST dataset of news and blog posts (see Section 3)

(3) The FIRST ontology (see Section 4)

In addition, we report on the effort of constructing a manually annotated sentiment corpus that will serve both for the evaluation of the sentiment extraction technology developed in WP4 and for training machine learning (i.e., sentiment classification) models in WP6. Since the main purpose of this document is to describe the software prototypes and datasets released at M12, the work on sentiment corpus construction is reported in Annex 1.

Note that some of the content in this report was copied from other FIRST reports for the reader’s convenience.


2 DacqPipe, the data acquisition pipeline

2.1 Introduction

The data acquisition pipeline consists of several technologies that interoperate to achieve the desired goal, i.e., preparing the data for further analysis. It is responsible for acquiring unstructured data from several data sources, preparing it for analysis, and brokering it to the appropriate analytical components (e.g., the information extraction components developed in WP4). The data acquisition pipeline has been running continuously since April 21, 2011, polling the Web and proprietary APIs for recent content and turning it into a stream of preprocessed text documents.

When dealing with official news streams, such as those provided to the consortium by IDMS, many preprocessing steps can be avoided. Official news items are provided in a semi-structured form in which titles, publication dates, and other metadata are clearly indicated. Furthermore, named entities (i.e., company names and stock symbols) are already identified in the texts, and article bodies are provided as raw text without any boilerplate (i.e., undesired content such as advertisements, copyright notices, navigation elements, and recommendations).

Content from blogs, forums, and other Web sources, however, is not immediately ready to be processed by the text analysis methods. Web pages contain a lot of "noise" that needs to be identified and removed before the content can be analysed. For this reason, we have developed DacqPipe (or Dacq), a data acquisition and preprocessing pipeline. Dacq consists of (i) data acquisition components, (ii) data cleaning components, (iii) natural-language preprocessing components, (iv) semantic annotation components, and (v) ZeroMQ emitter components. The current pipeline topology is shown in Figure 1. Note that the ZeroMQ emitter is actually part of the integration framework (WP7) but is tightly integrated into the data acquisition pipeline.

Figure 1: The current topology of the data acquisition pipeline (taken from FIRST D2.2).

The data acquisition components are mainly RSS readers that poll for data in parallel. One RSS reader is instantiated for each Web site of interest. The RSS sources corresponding to a particular Web site are polled one after another by the same RSS reader to prevent the servers from rejecting requests due to concurrency. After an RSS reader has collected a new set of documents from an RSS source, it dispatches the data to one of several processing pipelines; the pipeline is chosen according to its current load (load balancing). A processing pipeline consists of a boilerplate remover, language detector, duplicate detector, sentence splitter, tokenizer, part-of-speech tagger, lemmatizer, stop-word detector, semantic annotator, and ZeroMQ emitter.
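To illustrate the dispatching step described above, here is a toy sketch of the load-balancing idea, written in Python for readability. Dacq itself is a .NET application, so this is illustrative only and all names are hypothetical.

import queue

class Pipeline:
    """Toy stand-in for one Dacq processing pipeline."""
    def __init__(self, name):
        self.name = name
        self.batches = queue.Queue()   # document batches waiting to be processed

    def enqueue(self, batch):
        self.batches.put(batch)

def dispatch(batch, pipelines):
    # Load balancing: pick the pipeline with the fewest queued batches.
    target = min(pipelines, key=lambda p: p.batches.qsize())
    target.enqueue(batch)
    return target.name

pipelines = [Pipeline(f"pipeline-{i}") for i in range(3)]
print(dispatch(["doc-1", "doc-2"], pipelines))   # -> "pipeline-0" (all queues empty)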



The majority of these components were already discussed in FIRST D2.1 (Section 2.1). The natural-language processing stages (i.e., sentence splitter, tokenizer, part-of-speech tagger, lemmatizer, and stop-word detector) were added because they are a prerequisite for the semantic annotation component and for the information extraction tasks. Finally, the ZeroMQ emitters were added to establish a "messaging channel" between the data acquisition and preprocessing components (WP3) and the information extraction components (WP4). This enables us to run the two sets of components in two different processes (i.e., runtime environments) or even on two different machines.

2.2 Availability

Dacq is currently available for download as a configurable console-mode application. It differs slightly from the version currently running on the FIRST server; most notably, it does not require a database connection to run. For this reason, it is easy to deploy on practically any Windows computer (potentially also under Mono1 on Linux and Mac OS, but this setting has not been tested yet). Note, however, that Dacq needs to run several threads concurrently under heavy load in order to process the data in near-real time. The user is thus advised to deploy Dacq on a multi-core machine (e.g., 8 cores or more) or to keep the list of RSS sources appropriately short.

Dacq was successfully deployed on an 8-core machine with 8 GB RAM, acquiring data from more than 2,000 RSS sources from 80 different Web sites (such as CNN, BBC, and Seeking Alpha).

Dacq (the stand-alone console-mode utility) can be downloaded from the following location:

http://first.ijs.si/software/DacqPipeSep2011.zip

The source code is currently not publicly available. However, it will be released in accordance with the FIRST open-source strategy once that strategy is fully devised at M18 (i.e., by the end of March 2012).

2.3 Deployment and usage instructions

2.3.1 Deployment and configuration

Once you have downloaded DacqPipe, follow these steps to install and configure it:

1. Unzip the downloaded archive into a folder, for example C:\DacqPipe.

2. If the .NET Framework (2.0 or later) is not yet installed on your computer, download it from http://www.microsoft.com/download/en/details.aspx?id=19 (32-bit version) or from http://www.microsoft.com/download/en/details.aspx?id=6523 (64-bit version2). Run the downloaded executable file and follow the setup instructions.

3. Dacq should now run with its default settings (see Table 1 below for the default settings).

4. To configure Dacq, edit the file Dacq.exe.config (located in the target folder, e.g., C:\DacqPipe) in your favourite text editor. The configuration file contains a set of key-value pairs in the form <add key="…" value="…"/>. Table 1 lists the supported key-value pairs.

1 http://www.mono-project.com/Main_Page

2 It is most likely that you are running Dacq in a 64-bit environment.


logFileName (optional): The location and name of the log file to which Dacq writes events, mainly important for debugging. Default value: not set.

xmlDataRoot (optional1): The location to which the acquired data is stored in the XML format. Default value: .\Data

htmlDataRoot (optional1): The location to which the acquired data is stored in the HTML format appropriate for viewing. Default value: .\DataHtml

dataSourcesFileName (mandatory): The location and name of the file containing RSS sources to be polled for content. Default value: .\RssSources.txt

Table 1: Supported key-value pairs for configuring Dacq.
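For illustration, a minimal Dacq.exe.config using the keys from Table 1 might look as follows. The <configuration>/<appSettings> wrapper is the standard .NET application-settings layout and is assumed here rather than taken from the deliverable, and the log file path is only an example; check the Dacq.exe.config shipped in the archive for the exact structure.

<?xml version="1.0"?>
<configuration>
  <appSettings>
    <!-- mandatory: the file listing the RSS sources to poll -->
    <add key="dataSourcesFileName" value=".\RssSources.txt"/>
    <!-- optional: output locations for the XML and HTML serializations -->
    <add key="xmlDataRoot" value=".\Data"/>
    <add key="htmlDataRoot" value=".\DataHtml"/>
    <!-- optional: log file (hypothetical path) -->
    <add key="logFileName" value=".\Dacq.log"/>
  </appSettings>
</configuration>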

Once installed and configured, Dacq is started simply by invoking Dacq.exe from the folder into which the archive was extracted (e.g., C:\DacqPipe\Dacq.exe).

Site: abcnews

# Site: http://abcnews.go.com/

# RSS list: http://abcnews.go.com/Site/page?id=3520115

http://feeds.abcnews.com/abcnews/topstories

http://feeds.abcnews.com/abcnews/internationalheadlines

http://feeds.abcnews.com/abcnews/usheadlines

http://feeds.abcnews.com/abcnews/politicsheadlines

http://feeds.abcnews.com/abcnews/blotterheadlines

http://feeds.abcnews.com/abcnews/moneyheadlines

http://feeds.abcnews.com/abcnews/technologyheadlines

http://feeds.abcnews.com/abcnews/healthheadlines

http://feeds.abcnews.com/abcnews/entertainmentheadlines

http://feeds.abcnews.com/abcnews/travelheadlines

http://feeds.abcnews.com/abcnews/sportsheadlines

http://feeds.abcnews.com/abcnews/worldnewsheadlines

Site: bbc

# Site: http://www.bbc.co.uk/news/

# RSS list: http://www.bbc.co.uk/news/10628494

http://feeds.bbci.co.uk/news/rss.xml

http://feeds.bbci.co.uk/news/world/rss.xml

http://feeds.bbci.co.uk/news/uk/rss.xml

http://feeds.bbci.co.uk/news/business/rss.xml

http://feeds.bbci.co.uk/news/politics/rss.xml

http://feeds.bbci.co.uk/news/health/rss.xml

http://feeds.bbci.co.uk/news/education/rss.xml

http://feeds.bbci.co.uk/news/science_and_environment/rss.xml

http://feeds.bbci.co.uk/news/technology/rss.xml

Figure 2: An example of the file with RSS sources.

Dacq requires a file with RSS sources to work. These sources are periodically polled for content. The location and name of the file with RSS sources is specified with the dataSourcesFileName configuration parameter. The file format is relatively simple: the file contains several lists of RSS sources, one for each Web site. An example is shown in Figure 2. Each RSS list starts with a site identifier (e.g., "Site: abcnews"). The URLs of the RSS sources are listed after the site identifier, each in its own line. The list ends at the next site identifier (or at the end of the file). Lines starting with "#" are comments and are ignored by Dacq.

1 For the data acquisition to make sense, at least one of the two data locations (i.e., xmlDataRoot and htmlDataRoot) needs to be set.
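As an illustration of this file format, the following sketch (Python; not part of the Dacq distribution) reads such a file into a mapping from site identifiers to feed URLs, using only the conventions stated above: a "Site:" line starts a new block, "#" lines are comments, and every other non-empty line is a feed URL.

def parse_rss_sources(path):
    sites = {}        # site identifier -> list of feed URLs
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                        # skip blank lines and comments
            if line.lower().startswith("site:"):
                current = line[5:].strip()      # e.g., "abcnews"
                sites.setdefault(current, [])
            elif current is not None:
                sites[current].append(line)     # a feed URL for the current site
    return sites

# Example: sources = parse_rss_sources("RssSources.txt"); print(sources["bbc"])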


2.3.2 Usage

Dacq starts as a console-mode application. The console displays the current activities of the data acquisition pipeline and potentially reports problems (see Figure 3). The same activity and error messages are written into a log file if logging is enabled (i.e., if logFileName is set; see Section 2.3.1).

Figure 3: Dacq screenshot.

Dacq is shut down by pressing Ctrl-C. The message "*** Ctrl-C command received. ***" will appear in the console. Note that Dacq needs some time to shut down properly, as it needs to finalize the processing of the data contained in the component queues. If the shut-down process takes too long and the finalization of processing is not crucial, the user can close the window by pressing Alt-F4 and thus terminate the application.

2.3.3 Data files

A batch of documents (either news or blog posts), acquired and preprocessed with Dacq, is internally stored as an annotated document corpus object. The annotated document corpus data structure (ADC) is very similar to the one that GATE1 uses and is best described in the GATE user's guide2. An ADC can be serialized either into XML or into a set of HTML files. Figure 4 shows a toy example of an ADC serialized into XML. In short, a document corpus normally contains one or more documents and is described with features (i.e., a set of key-value pairs). A document is also described with features and in addition contains annotations. An annotation gives a special meaning to a text segment (e.g., token, sentence, named entity). Note that an annotation can also be described with features.

1 GATE is a Java suite of tools developed at the University of Sheffield, used for all sorts of natural language processing tasks, including information extraction. It is freely available at http://gate.ac.uk/.

<DocumentCorpus xmlns="http://freekoders.org/latino">

<Features>

<Feature>

<Name>source</Name>

<Value>smh.com.au/technology</Value>

</Feature>

</Features>

<Documents>

<Document>

<Name>Steve Jobs quits as Apple CEO</Name>

<Text>Tech industry legend and one of the finest creative minds of a generation,

Steve Jobs, has resigned as chief executive of Apple.</Text>

<Annotations>

<Annotation>

<SpanStart>75</SpanStart>

<SpanEnd>84</SpanEnd>

<Type>named entity/person</Type>

<Features />

</Annotation>

<Annotation>

<SpanStart>122</SpanStart>

<SpanEnd>126</SpanEnd>

<Type>named entity/company</Type>

<Features>

<Feature>

<Name>stockSymbol</Name>

<Value>AAPL</Value>

</Feature>

</Features>

</Annotation>

</Annotations>

<Features>

<Feature>

<Name>URL</Name>

<Value>http://www.smh.com.au/technology/technology-news/steve-jobs-quits-as-

apple-ceo-20110825-1jat8.html</Value>

</Feature>

</Features>

</Document>

</Documents>

</DocumentCorpus>

Figure 4: Annotated document corpus serialized into XML.
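The XML serialization shown in Figure 4 can be consumed with any XML library. The following sketch (Python; not part of Dacq) prints each document's annotations together with the text they cover. The namespace URI is the one declared in Figure 4, and the span boundaries are assumed to be inclusive character offsets, which is consistent with the example above (offsets 75–84 cover "Steve Jobs").

import xml.etree.ElementTree as ET

NS = {"adc": "http://freekoders.org/latino"}   # default namespace from Figure 4

def print_annotations(path):
    root = ET.parse(path).getroot()
    for doc in root.findall("adc:Documents/adc:Document", NS):
        text = doc.findtext("adc:Text", default="", namespaces=NS)
        print(doc.findtext("adc:Name", default="", namespaces=NS))
        for ann in doc.findall("adc:Annotations/adc:Annotation", NS):
            start = int(ann.findtext("adc:SpanStart", namespaces=NS))
            end = int(ann.findtext("adc:SpanEnd", namespaces=NS))
            ann_type = ann.findtext("adc:Type", namespaces=NS)
            # SpanStart/SpanEnd are assumed to be inclusive character offsets.
            print(f"  {ann_type}: {text[start:end + 1]!r}")

print_annotations("corpus.xml")   # hypothetical file name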

Figure 5 shows the annotated document from Figure 4 serialized into HTML and displayed in a Web browser.

2 Available at http://gate.ac.uk/sale/tao/split.html. See http://gate.ac.uk/sale/tao/splitch5.html#x8-910005.4.2 for some simple examples of annotated documents in GATE.


Figure 5: Annotated document serialized into HTML and displayed in a Web browser.

Dacq stores the acquired and annotated document corpora into files. It can be configured to store them as XML, as HTML, or both (see Section 2.3.1 on how to configure Dacq). Dacq creates a separate folder for each day (e.g., <xmlDataRoot>\2011\9\8\ would be created on September 8, 2011) and assigns unique names to data files. The name of a file consists of a time stamp and a random ID (e.g., <xmlDataRoot>\2011\9\8\14_29_33_c9bef21a1d4f4e4db0-c82624d5b741bb.xml). Note that the time stamp (the first 8 characters in the file name, i.e., hh_mm_ss) represents the acquisition time and not the publication time.
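A short sketch (Python; illustrative only, not Dacq source code) of the XML file-naming scheme just described: a per-day folder under the data root and a file name composed of the acquisition time (hh_mm_ss) and a random identifier.

import os
import uuid
from datetime import datetime

def corpus_file_path(xml_data_root, acquired_at=None):
    t = acquired_at or datetime.now()
    day_folder = os.path.join(xml_data_root, str(t.year), str(t.month), str(t.day))
    file_name = f"{t:%H_%M_%S}_{uuid.uuid4().hex}.xml"   # acquisition time + random ID
    return os.path.join(day_folder, file_name)

print(corpus_file_path(r"C:\DacqPipe\Data", datetime(2011, 9, 8, 14, 29, 33)))
# -> C:\DacqPipe\Data\2011\9\8\14_29_33_<random>.xml  (on Windows)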

For storing HTMLs, Dacq similarly creates a separate folder for each day. However, for each document corpus, it then creates another folder with a unique name consisting of a time stamp and a random ID (e.g., <htmlDataRoot>\2011\9\8\14_29_33_c9bef21a1d4f4e4db0c82624d-5b741bb\). Each such folder contains several HTML files: the index file (i.e., Index.html) and one additional file for each document in the corpus. To view a document corpus, open the corresponding Index.html in a Web browser.

2.4 Pointers to online demos

As already stated, the data acquisition pipeline consists of several components that work together. We prepared several simple online demos that demonstrate separate data acquisition components. The following is the list of online demos related to the data acquisition pipeline:

Boilerplate removal demo:

http://first.ijs.si/demos/boilerplateremovedemo/

Language and duplicate detection demo:

http://first.ijs.si/demos/duplicatedetectordemo/

Annotation pipeline demo:

http://first.ijs.si/demos/annotationpipelinedemo/


3 FIRST dataset of news and blog posts

3.1 Introduction

The FIRST dataset is currently available for download in two parts. The first part was acquired in the time period from April 21, 2011 (17:20+2), to June 29, 2011 (19:18+2). The acquisition of the second part is still in progress; the pipeline was started on June 29, 2011 (19:27+2). The most recent available data archive contains the August data.

Part 1 (Apr–Jun 2011, M7–M9)

Scale: number of sites: 39; number of RSS feeds: 1,950 (~50 per site on average); avg. number of documents per site per day: 870; total new documents per day: 33,950. Part 1 of the FIRST dataset contains altogether approx. 2,375,000 documents.

Annotations: N/A

Part 2 (Jun–Sep 2011, M9–M12; current at the time of writing)

Scale: number of sites: 80; number of RSS feeds: 2,472 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 34,000. Part 2 of the FIRST dataset contains at the moment (Sep 7, 2011) 2,674,827 documents.

Annotations: boilerplate remover annotations

Part 3 (Sep 2011–Sep 2012, M12–M24)

Scale: unchanged

Annotations: language detector, duplicate detector, sentence splitter, tokenizer, POS tagger, lemmatizer, stop-word detector and entity recognizer annotations

Part 4 (Sep 2012–Sep 2013, M24–M36)

Scale: number of sites: 160; number of RSS feeds: 4,800 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 68,000

Annotations: unchanged

Table 2: Some basic statistics and types of annotations related to the acquired data (taken from FIRST D2.3).

Table 2 presents some basic statistics and the types of annotations related to the data acquired in the first project year (Part 1 and Part 2), together with projections and plans for the second and third project year. Note that the average number of RSS feeds per site and the average number of acquired documents per site per day decrease from Part 1 to Part 2. This is mainly because we included many blogs in Part 2. A blog usually provides a single RSS feed and only a few posts per day or week, while a larger news Web site provides a range of RSS feeds and hundreds of news items per day. Another reason for the drop in the average number of documents per site per day, and consequently in the total number of new documents per day, is the new filtering policy. In Part 2, we only accept HTML and plain-text documents that are 10 MB or less in size. In Part 1, non-textual content (such as video, audio, PDF, and XML) was also included, even though it is not going to be analysed in FIRST, and its size was not limited in the acquisition process.

3.2 Availability

The FIRST dataset is available at:

http://first.ijs.si/firstdataset/

Apart from the basic statistics and comments summarized in Table 3 and Table 4, the dataset download page also allows the user to view the associated RSS sources and explore the data (i.e., view the HTML representations of the annotated document corpora).

Period: April 21, 2011 [17:20+2] – June 29, 2011 [19:18+2]

Web sites / RSS sources: 39 / 1,950

Corpora: 754,601

Documents: ~2,375,000

Annotations: N/A

Corpus features: RSS channel features (see RSS Specification1): title, description, language, copyright, managingEditor, link, pubDate, category. Other features: _provider [type of the component that produced the corpus], _sourceUrl [link to the HTML file], _source [HTML source encoded as Base642], _timeBetweenPolls [defines polling frequency], _timeStart [corpus acquisition start time], _timeEnd [corpus acquisition end time]

Document features: RSS item features (see RSS Specification1): title, description, link, pubDate, author, category, source. Other features: _mimeType, _contentType, _charSet, _contentLength, raw [content encoded as Base642], _guid, _time [document acquisition time]

Comments:
- All content types accepted (_contentType values: BinaryBase64, Html, Text, Xml).
- No limit on content size.
- If _contentType is "Html", <Text> contains the corresponding HTML content (including tags and boilerplate). The Base64-encoded binary representation of the same HTML content is stored in the feature "raw".
- If _contentType is "Text" or "Xml", <Text> contains the corresponding content. The Base64-encoded binary representation of the same content is stored in the feature "raw".
- If _contentType is "BinaryBase64", <Text> contains the corresponding Base64-encoded binary data. In this case, the feature "raw" contains redundant data.

Size: 40.5 GB in 3 7ZIP files (compression ratio ~10%)

Table 3: Part 1 of the FIRST dataset – basic statistics and comments.

1 http://www.rssboard.org/rss-specification

2 http://en.wikipedia.org/wiki/Base64


Period: Started on June 29, 2011 [19:27+2]

Web sites / RSS sources: 80 / 2,472

Corpora: 899,241 (and growing)

Documents: 2,674,827 (and growing)

Annotations: TextBlock/Boilerplate, TextBlock/Headline, TextBlock/FullText, TextBlock/Supplement, TextBlock/RelatedContent, TextBlock/UserComment

Corpus features: RSS channel features (see RSS Specification): title, description, language, copyright, managingEditor, link, pubDate, category. Other features: _provider [type of the component that produced the corpus], _sourceUrl [link to the HTML file], _source [HTML source encoded as Base64], _timeBetweenPolls [defines polling frequency], _timeStart [corpus acquisition start time], _timeEnd [corpus acquisition end time]

Document features: RSS item features (see RSS Specification): title, description, link, pubDate, author, category, source. Other features: _mimeType, _contentType, _charSet, _contentLength, raw [content encoded as Base64], _guid, _time [document acquisition time]

Comments:
- Only "Html" and "Text" content types accepted.
- Only contents that are 10 MB or less in size accepted.

Size: 41.3 GB in 3 7ZIP files (compression ratio ~10%)

Table 4: Part 2 of the FIRST dataset – basic statistics and comments.
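Both parts of the dataset store the acquired content in the document feature "raw" as Base64-encoded bytes (see Tables 3 and 4). The following sketch (Python; not part of the release) decodes such a value back into text; in practice the character set should be taken from the document's _charSet feature.

import base64

def decode_raw(raw_value, charset="utf-8"):
    data = base64.b64decode(raw_value)               # the original content as bytes
    return data.decode(charset, errors="replace")    # decode using the _charSet feature

print(decode_raw("PGh0bWw+Li4uPC9odG1sPg=="))        # -> <html>...</html>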


4 Semantic resources and the FIRST ontology

4.1 Existing semantic resources

Semantic and lexical resources, potentially relevant to FIRST, include:

McDonald’s word list (http://www.nd.edu/~mcdonald/Word_Lists.html).

Ontology developed by UHOH, used in their preliminary sentiment analysis experiments (see (Klein, Altuntas, Kessler, & Häusser, 2011)).

Publicly available sentiment-labelled datasets designed for sentiment analysis experiments, e.g.:

o http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

o http://langtech.jrc.it/JRC_Resources.html#Sentiment-quotes

DMoz (http://www.dmoz.org/),

DBpedia (http://dbpedia.org/),

ResearchCyc (http://research.cyc.com/),

SentiWordNet (http://sentiwordnet.isti.cnr.it/),

WordNet (http://wordnet.princeton.edu/),

Financial glossaries (e.g., http://www.forbes.com/tools/glossary/index.jhtml),

Proprietary data available to the consortium via the IDMS API (contains companies, instruments, notations, countries, industrial sectors…).

After examining these resources, we decided to build the FIRST ontology on top of the UHOH ontology, extending it with sentiment objects (i.e., objects of interest, e.g., companies, stocks, countries) from the IDMS proprietary database and sentiment vocabularies from the McDonald’s word list and SentiWordNet.

4.2 The FIRST ontology

The FIRST ontology differs in several ways from the traditional notion of an ontology, most notably the following:

It contains only a shallow subsumption hierarchy of the domain.

It contains explicit lexical knowledge about the domain (i.e., it contains terminological gazetteers).

It contains many instances (companies, stocks, countries…), accompanied with the required lexical information, that can be detected in texts in the document annotation process and passed on to the information extraction process.

The following is the list of the high-level triples (i.e., domain-relation-range triples) in the current version of the FIRST ontology:

CorrelationDefinition – correlationDefinitionIsInfluencedByIndicator – Indicator

o Inverse: Indicator – indicatorHasCorrelationDefinition – CorrelationDefinition

CorrelationDefinition – correlationDefinitionInfluencesFeature – ObjectFeature

o Inverse: ObjectFeature – featureOfCorrelationDefinition – CorrelationDefinition


CorrelationDefinition – correlationDefinitionInfluencesObject – SentimentObject

o Inverse: SentimentObject – objectOfCorrelationDefinition – CorrelationDefinition

The following sections give more information about these high-level entities.

4.2.1 Ontology-based information extraction process in FIRST

In this section, we briefly describe the ontology-based information extraction (OBIE) process employed in FIRST (WP4). This outline should explain the need for the top-level concepts defined in the FIRST ontology. The goal of OBIE in FIRST is to classify the polarity of sentiment with respect to a sentiment object of interest (e.g., a specific stock index) and a certain feature (e.g., the price) based on news and blog posts currently being published (e.g., "I expect the price of NASDAQ-100 to drop"). This goal is envisioned to be achieved as follows:

1. The OBIE component inspects the documents for SentimentObjects (e.g., NASDAQ-100), Indicators (e.g., PriceTrend, which is identified by the term "price trend"), and OrientationPhrases (e.g., "unchanged"). An example is shown in Figure 6.

Figure 6: An example of identified ontological instances interrelated through a CorrelationDefinition and a set of JAPE rules to provide a sentiment polarity classification.

2. CorrelationDefinitions relate classes of sentiment objects (e.g., stock indices) and their features (e.g., future price change) to Indicators. For each detected Indicator, OBIE thus finds the corresponding CorrelationDefinitions. For example, for the price-trend indicator, the correlation definition indicatorHasCorrelationToStockPrice can be found in the ontology. Note that the correlations can be either positive or negative. CorrelationDefinitions are related to ObjectFeatures (through the correlationDefinitionInfluencesFeature relation) and "constrained" to certain classes of sentiment objects (through the correlationDefinitionInfluencesObject relation). For example, the indicatorHasCorrelationToStockPrice correlation definition influences the ObjectFeature ExpectedFuturePriceChange and the members of the class Stock_Index.

3. The identified instances, i.e., the OrientationPhrases, Indicators, CorrelationDefinitions, SentimentObjects, and ObjectFeatures, are passed to GATE's JAPE engine, which executes a set of rules to determine the sentiment at the sentence level.

More information about the information extraction process, the JAPE rules, and the sentiment aggregation model is given in FIRST D4.1 and in (Klein, Altuntas, Kessler, & Häusser, 2011).

[Figure 6 depicts the example sentence "The price trend of NASDAQ-100 is expected to remain unchanged." with the identified SentimentObject, Indicator, OrientationPhrase, and ObjectFeature instances linked through a CorrelationDefinition and passed to the JAPE rules.]
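To make the three steps above concrete, the following toy sketch (Python) mimics the lookup from a detected Indicator, the class of the detected SentimentObject, and an OrientationPhrase to a feature-level polarity. The actual FIRST implementation uses the ontology and GATE's JAPE rules as described above; all values below are illustrative only.

from dataclasses import dataclass

@dataclass
class CorrelationDefinition:
    indicator: str       # e.g., "PriceTrend"
    object_class: str    # class of sentiment objects it is constrained to, e.g., "Stock_Index"
    feature: str         # influenced ObjectFeature, e.g., "ExpectedFuturePriceChange"
    positive: bool       # positive or negative correlation

CORRELATIONS = [
    CorrelationDefinition("PriceTrend", "Stock_Index", "ExpectedFuturePriceChange", positive=True),
]

ORIENTATION = {"unchanged": 0, "rise": +1, "drop": -1}   # toy orientation phrases

def classify(indicator, sentiment_object_class, orientation_phrase):
    """Return (feature, polarity) pairs derived for one detected sentence."""
    results = []
    for cd in CORRELATIONS:
        if cd.indicator == indicator and cd.object_class == sentiment_object_class:
            polarity = ORIENTATION[orientation_phrase]
            if not cd.positive:
                polarity = -polarity                     # a negative correlation flips the sign
            results.append((cd.feature, polarity))
    return results

# "The price trend of NASDAQ-100 is expected to remain unchanged."
print(classify("PriceTrend", "Stock_Index", "unchanged"))
# -> [('ExpectedFuturePriceChange', 0)]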


4.2.2 Sentiment objects and gazetteers

Sentiment objects are simply objects towards which people express sentiment. The following excerpt from the subsumption hierarchy of the FIRST ontology shows the most important classes of sentiment objects:

SentimentObject
  FinancialInstrument
    Index
      Stock_Index
    Stock
  Company
  Country

A sentiment object is an instance of a SentimentObject descendant (e.g., Google as an instance of Company or GOOG as an instance of Stock) and is equipped with a gazetteer. Gazetteers are instances of the class Gazetteer. They define terms and stop words, and can import other gazetteers, in order to provide the lexical knowledge required to identify ontological instances in texts.

The part of the ontology defining sentiment objects and gazetteers is automatically induced from the IDMS database and data from MSN Money1. For the purpose of inducing the SentimentObject subsumption hierarchy and populating it with instances, we employ the RDBToOnto methodology2 (Cerbah, 2009) developed in the European project TAO3 (Transitioning Applications to Ontologies). We "follow" the IDMS data access API to "grow" the ontology from a list of seed stock indices, as illustrated in Figure 7.

Figure 7: The ontology part defining sentiment objects and gazetteers is “grown” from a list of seed stock indices.

1 http://money.msn.com/

2 http://www.tao-project.eu/researchanddevelopment/demosanddownloads/RDBToOnto.html

3 STREP IST-2004-026460, http://www.tao-project.eu

[Figure 7 depicts the induction chain: indices → constituents (stocks) → companies → countries.]


Let us look at a concrete example. One of the seed stock indices from which the FIRST ontology is "grown" is NASDAQ-100, which is defined as follows:

:NASDAQ_100 a :Stock_Index ;

rdfs:label "NASDAQ-100" .

The stock of the Microsoft Corporation is a member of the NASDAQ-100 stock index:

:MICROSOFT a :Stock ;

rdfs:label "MICROSOFT CORP COM USD0.00000625" ;

:memberOf :NASDAQ_100 .

The shares of the Microsoft stock are issued by the Microsoft Corporation. This is stated in the ontology as follows:

:MICROSOFT_CORP a :Company ;

rdfs:label "Microsoft Corp." ;

:issues :MICROSOFT .

Microsoft Corporation is located in the USA:

:USA a :Country ;

rdfs:label "USA" .

:MICROSOFT_CORP

:locatedIn :USA .

The MICROSOFT_CORP instance is linked to the gazetteer MICROSOFT_CORP_Gazetteer. This is defined as follows:

:MICROSOFT_CORP

:hasGazetteer :MICROSOFT_CORP_Gazetteer .

:MICROSOFT_CORP_Gazetteer

:hasTerm "Microsoft Corp" ;

:hasTerm "Microsoft Corporation" ;

:hasStopWord "CORP" ;

:hasStopWord "CORPORATION" ;

a :Gazetteer .

Gazetteer terms are acquired from several sources (in our case, from the IDMS database and data from MSN Money), and the corresponding stop words are computed automatically. Each term in a gazetteer is first represented as a set of words Ti (e.g., Ti = {"Microsoft", "Corp"}). Then, the corresponding stop-word set S is computed as S = ∪iTi − ∩iTi (i.e., the union of the gazetteer terms without the intersection of the gazetteer terms). In the above example, T1 = {"Microsoft", "Corp"}, T2 = {"Microsoft", "Corporation"}, T1 ∪ T2 = {"Microsoft", "Corp", "Corporation"}, T1 ∩ T2 = {"Microsoft"}, and finally, S = T1 ∪ T2 − T1 ∩ T2 = {"Corp", "Corporation"}.
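A minimal sketch (Python; illustrative only) of this stop-word computation, applied to the Microsoft example:

from functools import reduce

def stop_words(terms):
    word_sets = [set(term.split()) for term in terms]     # each term as a set of words
    union = set().union(*word_sets)
    intersection = reduce(set.intersection, word_sets)
    return union - intersection

print(stop_words(["Microsoft Corp", "Microsoft Corporation"]))
# -> {'Corp', 'Corporation'}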

The gazetteers will be exploited by the semantic annotation component which will be developed before M18 and released together with the M24 prototype. The goal of the semantic annotation component will be to examine the text and recognize each sequence of words that corresponds to the sequence of words in a gazetteer term. In this process, the corresponding stop words will be removed from both sequences.
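The matching rule just described can be illustrated with a toy sketch (Python; the actual semantic annotation component, as stated above, is yet to be developed): a word sequence from the text matches a gazetteer term if the two sequences are equal once the gazetteer's stop words have been removed from both.

def matches(text_words, term_words, stop_words):
    strip = lambda words: [w for w in words if w.upper() not in stop_words]
    return strip(text_words) == strip(term_words)

STOP = {"CORP", "CORPORATION"}                                               # from MICROSOFT_CORP_Gazetteer
print(matches(["Microsoft"], ["Microsoft", "Corp"], STOP))                   # True
print(matches(["Microsoft", "Corporation"], ["Microsoft", "Corp"], STOP))    # True
print(matches(["Micro", "Corp"], ["Microsoft", "Corp"], STOP))               # False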

Last but not least, during the second project year, the ontology will be extended with additional sentiment objects such as people, industrial sectors, and topics.

4.2.3 Availability

The latest version of the FIRST ontology is available at:

Sentiment objects and gazetteers only


http://first.ijs.si/firstontology/FIRSTOntology.n3

Consolidated ontology (also includes indicators, object features, correlation definitions, and sentiment-bearing phrases)

http://first.ijs.si/firstontology/ConsolidatedFIRSTOntology.n3 (OWL-N3 format)

http://first.ijs.si/firstontology/ConsolidatedFIRSTOntology.owl (OWL-RDF format)
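As a usage example, the N3 files can be loaded with any RDF toolkit. The sketch below (Python with the rdflib library, which is an assumption of this example rather than a project requirement) loads the sentiment-object ontology and lists the gazetteer terms attached to each sentiment object. The namespace URI is a placeholder; it must be replaced with the prefix declared in the downloaded N3 file.

from rdflib import Graph, Namespace

# Hypothetical namespace; take the real one from the @prefix declarations in the N3 file.
FIRST = Namespace("http://example.org/first#")

g = Graph()
g.parse("FIRSTOntology.n3", format="n3")

for obj, gazetteer in g.subject_objects(FIRST.hasGazetteer):
    terms = [str(term) for term in g.objects(gazetteer, FIRST.hasTerm)]
    print(obj, terms)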


References

Cerbah, F. (2009). RDBToOnto User Guide: From Relational Databases to Fine-Tuned Populated Ontologies. Retrieved September 11, 2011, from http://www.tao-project.eu/researchanddevelopment/demosanddownloads/RDBToOnto-Page/rdbtoontoguide.pdf

Klein, A., Altuntas, O., Kessler, W., & Häusser, T. (2011). Extracting Investor Sentiment from Weblog Texts. In Proceedings of the 13th IEEE Conference on Commerce and Enterprise Computing (CEC). Luxembourg.


Annex 1. Annotated sentiment corpus construction

As indicated in D1.3, Section 3, the construction of a manually annotated corpus is crucial for the evaluation of the information extraction performance; the corpus additionally serves as a training dataset for machine learning methods.

The experts in charge of annotating the documents, who are project partners in the consortium, are also the ones who search the information sources mentioned in Annex 2 of D1.2 for suitable documents. The tools applied for this task are standard Web search engines such as Google.

Table 5 summarizes the three approaches to retrieving documents for the three different use cases. The main requirements are that the documents contain sentiment related to the features of predefined financial objects specific to the use cases and that there are at least 30 documents available for each indicator/topic at the end.

Retrieval of corpus documents (columns: Search engine(s) | Individual search on selected URLs | Google News keyword search)

UC #1: X

UC #2: X X X

UC #3: X

Table 5: Approach to retrieving documents for the three use cases.

Manual annotation of the documents. The most time-consuming task is the annotation of all the documents by the experts. This is especially true for the FIRST corpus, as there are many annotation concepts and properties to be filled in by the annotators. Furthermore, the financial domain poses a particular challenge and demands deeper expertise when it comes to the classification of sentiments. The reason is the breadth of the financial domain and the amount of technical knowledge needed to really understand blogs and news written by financial analysts, especially when specialised financial language is used or the sentiment is expressed in a very fuzzy way or spread over larger parts of the text.

Corpus construction status. Figure 8 shows the overall corpus construction process. The final annotation phase was prepared in a stepwise approach covering a period of over half a year of alignment sessions, test annotation rounds, and constant feedback from the use case experts. At the end of M12, the corpus contains 350 annotated documents and an additional 420 documents that have already been retrieved. The envisaged final number of annotated documents is around 1,500.


Figure 8: Corpus construction process.