Monitoring the Status of Wild Birds Using Twitter


Registration number 100053203

2017

Monitoring the Status of Wild Birds Using Twitter

Supervised by Dr Wenjia Wang, Dr Simon Gillings

University of East Anglia

Faculty of Science

School of Computing Sciences

Abstract

Twitter is a huge resource of information. I will be harnessing it in order to collect data that will be used to improve the British Trust for Ornithology's ability to research and help conserve wild birds.

Twitter is an online news and social networking service that allows its users to post information, known as tweets, about any aspect of their lives. I hope to demonstrate that this data source can be used as an extra source of information relevant to the conservation and monitoring of wild bird species. This project will involve collecting this information from Twitter, storing it, manipulating it to convert it into a useful format, and then completing analysis and presentation. The project output will be reviewed to consider whether Twitter is a viable route that could be investigated further by significant authorities, such as the British Trust for Ornithology, to help maintain rare species.

Acknowledgements

This project could not have been completed without the valuable assistance of my supervisors, Dr. Wenjia Wang (UEA) and Dr. Simon Gillings (BTO).

CMP-6013Y

Contents

1. Introduction
   1.1. Overview
   1.2. Risks
2. Approach
3. Pre-project
   3.0.1. Gantt Chart
4. Feasibility and Foundations
5. Evolutionary Development
   5.1. Tweet Collector
      5.1.1. Description
      5.1.2. Reading
      5.1.3. Implementation
   5.2. Data Storage
      5.2.1. Description
      5.2.2. Reading
      5.2.3. Implementation
   5.3. Analysis and Presentation
      5.3.1. Description
      5.3.2. Reading
      5.3.3. Implementation
6. Areas of Knowledge Used and Gained
7. Outcome and Areas of Development
   7.1. Areas of Development
   7.2. Outcome
8. Conclusion
References
A. Rare Bird Network Hashtag Codes

List of Figures

1. The DSDM process (Consortium et al., 2014).
2. Project Gantt chart.
3. Main tweet streaming and storage method.
4. Geographical data before being cleaned.
5. Project Processes Foundation Design.
6. Object initialisation.
7. Initiator method.
8. Main tweet streaming and storage method.
9. Connect to Twitter API.
10. Scheduling script execution.
11. Selecting which scripts to run.
12. Code to connect to mongo database.
13. Code to query mongo database.
14. Gsub function to format geographical data.
15. Conditional statements to check for uneven format and reformat.
16. Split geographical data into Longitude and Latitude.
17. Static Map Created by "ggmap" package.
18. Code that creates the interactive map.
19. Interactive Map Created.
20. Date selection.
21. Basic Direct Comparison Table.
22. Map Shiny Webpage.

List of Tables

1. Project MoSCoW prioritisation
2. Comparison of SQL and NoSQL facilities
3. Project MoSCoW Results


1. Introduction

Twitter is a well-known social media website, built around the exchange of posts of up to 140 characters called "tweets". Within the ecological world it is a rich source of information about species being sighted all around the world. These pieces of information can be accessed by taking advantage of Twitter's streaming API. There are many potential uses for this information, and I am hoping to help the British Trust for Ornithology (BTO) by collecting large numbers of these tweets and performing analysis on them. The aim is to find useful data to help the BTO improve its current system for monitoring wild bird species.

1.1. Overview

The British Trust for Ornithology (BTO) is a research institute combining professional and citizen science. The BTO uses evidence of wildlife populations, particularly birds, to inform the public and opinion-formers and to guide environmental policy (British Trust for Ornithology, 1932b). The British Trust for Ornithology collects data on the abundance, distribution, habitat associations, movements, survival, behaviour and breeding performance of birds. This is achieved using networks of volunteers who make a huge contribution to the BTO's understanding of the status and ecology of birds in the United Kingdom. This project's aim is to supplement the data contributed by the BTO's scientists and volunteers with information about bird sightings derived from the Twitter social media platform. A proportion of the messages Twitter users publish contain geographic information (geo data) identifying their location; a tweet only carries this data if the user has specifically enabled that option within their account.

The high-level project aims are to:

• Create a process to monitor and collect relevant tweets, typically targeted to keywords such as a bird species.

• Store collected data in a database.

• Analyse the data to identify geographic data.

• Present analysed data in a user-friendly map format for web delivery.

The ultimate objective is to implement a website containing useful information for each species of bird monitored through the system, such as when in the year the first sightings occur, where the birds are being sighted, and how populations change.

1.2. Risks

A key requirement of the BTO's current data services is that "survey design should minimise biases within the data, so as not to undermine the scientific objectives of the scheme" (British Trust for Ornithology, 1932c).

The main risk with collecting information from the general public is that Twitter-derived data may not offer any form of information that is useful to the BTO. The BTO's current survey method gathers information sourced from committed bird watchers all over the country. Data collected from Twitter comes from people who may be less knowledgeable, and this could lead to errors such as species identification mistakes. The contingency plan if this issue is encountered would be to alter what the data is used for after collection. For example, if the data collected offers no information relevant to the project, it could be filtered with the goal of finding out where the general public lack knowledge. This information could in turn be used to identify ways that the BTO could help to fill in the educational gaps.

Other risks identified during the project include 'polluted' data: for example, a tweet about the Member of Parliament Angela Eagle could skew data about eagles. I have reduced this risk by removing irrelevant tweets based on context.

Data volume is a risk. The high volume of available data means that there is a risk that data analysis performance may be slow. This has been an issue during the project, and so I have used a subset of data to test the project. When the project is deployed it may be necessary to use higher-power systems and/or distributed computing to deliver faster real-time responses. Alternatively the system could perform analysis on a batch basis with daily updates. This would be similar to the BTO's current process: "Every night the BTO computer will summarise that day's records and produce up-to-date maps and graphs showing the latest in migration, movements and distribution" (British Trust for Ornithology, 1932a).

2. Approach

The broad approach has been to work in an iterative and incremental manner - "A style of development that involves the iterative application of a set of activities to evaluate a set of assertions, resolve a set of risks, accomplish a set of development objectives, and incrementally produce and refine an effective solution" (Bittner and Spence, 2006) - adopting the dynamic systems development method (DSDM). DSDM projects do just 'Enough Design Up Front' (EDUF) within a Foundations phase in order to understand and clarify the structure of the overall solution and to create an Agile plan for delivery of the project.


Figure 1: The DSDM process (Consortium et al., 2014).

From the outset the project has elected to use open source components. "Open Source software is software that can be freely accessed, used, changed, and shared (in modified or unmodified form) by anyone. Open source software is made by many people, and distributed under licenses that comply with the Open Source Definition" (Open Source Initiative, 1998). This means that the project can take advantage of a choice of technologies and leaves the opportunity for the system to be maintained and developed in the future without software licence restrictions. This is particularly relevant as the target user, the BTO, is a charitable research institute.


3. Pre-project

In order to establish a realistic scope for the project I first reviewed the BTO's web site to gain an insight into the current facilities in the area of Research and Data Services (British Trust for Ornithology, 1932d), and held discussions with Dr. Simon Gillings at the BTO, who provided clarity to the project's direction. In a DSDM project I needed to understand the relative importance of the possible work to be done in order to move forward and work to deadlines. The research and interviews with Dr. Gillings led to a set of functional candidates. I used MoSCoW as a prioritisation technique to identify key functions and manage priorities. The MoSCoW acronym stands for:

• Must Have - no point in delivering the project without this.

• Should Have - important but not vital; would prefer to deliver, but the solution is still viable without it.

• Could Have - desirable but less important ideas.

• Won't Have this time - good ideas that will not be delivered in the current project.

The project MoSCoW prioritised candidates for this project are:

Reference   Candidate                                         MoSCoW
1           Define relevant tweets (key words and phrases)    Must
2           Capture relevant tweets over a sustained period   Must
3           Store tweets for analysis                         Must
4           Identify the location source of tweets            Must
5           Present tweets and location to the user           Must
6           Present locations as a map                        Should
7           Interactive map                                   Should
8           View map for a specific time period               Could
9           Show bird migration over time                     Could
10          Publish as a public web site                      Won't

Table 1: Project MoSCoW prioritisation

In this pre-project a number of clear unknowns were identified to be addressed later in the project:

• Do Twitter users publish relevant tweets?

• What is the volume of data to be managed?

• What proportion of the captured data contains useful extractable information such as the user's geographic location?

• What is the data processing overhead?

• How to present information to the end user?

3.0.1. Gantt Chart

I also created an outline project timetable to support me in keeping track of progress within the available time. My timetable is represented in the Gantt chart below.

4. Feasibility and Foundations

In establishing the project's foundations I needed to address bulk capture of relevant tweets. Setting up the data collection requires access to the Twitter API stream, and I opted to build upon my existing Python skills and use an open source Python library. I selected the tweepy library from the set of available libraries as it provides access to all Twitter RESTful API methods and supports OAuth authentication - a protocol that allows an application to interact with another without giving away the end user's password. In particular, the Twitter search API and streaming API are at the heart of the project's capture processes and both are available in tweepy. An initial server process was developed to capture relevant tweets. The key code is shown below:


[Gantt chart showing the project schedule against revision week numbers and semester week numbers, covering: project proposal, literature review, data collection design, data collection running, data analysis, website implementation, testing, code delivery, final report writing, and inspection preparation.]

Figure 2: Project Gantt chart


Figure 3: Main tweet streaming and storage method.

This initial feasibility and foundation phase identified some challenges to be addressed in the subsequent phases. The volume of data to be captured is large and the structure of captured tweets is inconsistent. For example, the project is dependent upon identifying the tweeter's physical location (latitude and longitude). As can be seen from the examples below, the inconsistent information structure means that further processing of captured data would be required.


Figure 4: Geographical data before being cleaned.

The above figure demonstrates how cluttered the metadata of each tweet is. Review of this feasibility activity led to a foundation design for the project components.


Figure 5: Project Processes Foundation Design.

5. Evolutionary Development

The creation of this project can be broken down into three main sections: Tweet Collection, Data Storage, and Analysis & Presentation. Each section required extending my knowledge base and coding skills.

5.1. Tweet Collector

5.1.1. Description

In order to create a system that will present Twitter data in a user-friendly manner to the general public, the first problem to be overcome is collection of that data. There are many different methods of accomplishing this using multiple languages and packages.


5.1.2. Reading

Initially when starting this project, I was unaware of the different methods of collection that could be applied to a Twitter data scraping application. Upon reading Kumar et al. (2013), I quickly learnt that there are two types of API (Application Programming Interface) that can be used to access Twitter: REST APIs and streaming APIs. These differ in a few ways, but the main difference is that a REST API only collects data when it is specifically called, whereas a streaming API will collect a continuous stream of data once started. It was apparent that for this project the streaming API would be the most viable.

The next question that needed answering was: which language would be best to create this streaming application? Java and Python were the two main contenders initially, and I felt comfortable programming in either language. However, after researching both possibilities using the books Makice (2009) and Bonzanini (2016), I came to the conclusion that Python would be my preference. This decision was driven by the volume of research and articles published about using Python for this purpose, whereas comparatively little had been written about Java.

5.1.3. Implementation

Creating the Tweet Collector process took several steps: Obtain unique twitter tokens

allowing Oauth access to the twitter API. Provide a set of areas of interest using specific

key words (MoSCoW reference 1). Examples of keywords and hashtags that I used

are: "cuckoo" and "#rbnNFk" ("#rbnNFk" is a specific hashtag used by the Rare Birds

Network to identify below to see a sightings reported in a specific UK region - in this

case the Norwich area. Access to the Rare Bird Network’s hashtag codes is given in

Appendix A) The data collection is then set to run repeatedly for 15 minutes for several

months. The aim of this is to collect a large amount of data which varies as it continues

throughout different seasons of the year to provide information over periods of time. I

developed a Python script to automate the process.


Figure 6: Object initialisation.

Before accessing the Twitter API, you must register your application with the Twitter developers. After successful registration Twitter provides a set of unique keys.

1. The provided Twitter API access codes are initialised.

2. The objects that are required in order for the script to access the Twitter API are initialised.

3. To allow the Python code to interact with the Twitter API, packages from "tweepy" are imported.

The final object to be initialised here is the list of keywords that indicate to the system the tweets to be collected and stored in the database.
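The initialisation described above can be sketched as follows. This is a minimal sketch: the key values and the keyword list are placeholders, not the project's actual credentials or full keyword set, and the tweepy objects themselves are only noted in comments.

```python
# Sketch of the Tweet Collector initialisation (placeholder values,
# not the project's actual credentials or full keyword list).

# 1. Twitter API access codes, obtained by registering the application
#    with the Twitter developers.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_SECRET = "YOUR_ACCESS_SECRET"

# 2./3. In the real script, the tweepy objects needed to access the
# Twitter API (e.g. an OAuth handler and a stream) are initialised
# here using these keys.

# The final object: the keyword list telling the system which tweets
# to collect and store (illustrative subset).
KEYWORDS = ["cuckoo", "#rbnNFk", "nightingale"]
```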


Figure 7: Initiator method.

Figure 8: Main tweet streaming and storage method.

The above figures show the method used to connect to a database, stream tweets from the Twitter API and store each tweet within a locally hosted database. Initially, the code is encapsulated within a while loop whose condition remains true while the running time of the script remains lower than the time limit set by Twitter. Then, within a try/catch statement, a series of objects are created, each assigned to a specific level of the database server: first the server and port, then the database, and then the collection within the database.


This process allows the script to:

1. Connect to the correct position in the database to store each tweet.

2. Using the stream package within tweepy, assign each tweet to an object - a tweet in JSON (JavaScript Object Notation) format.

3. Store the tweet object.

4. If an error is encountered, print an error message to the console and put the while loop to sleep for 5 seconds before restarting.

5. Once the given time limit (900 seconds) has been reached, "exit()" breaks out of the loop and moves on to the next section.

Figure 9: Connect to Twitter API.

This final figure shows the code required for the system to connect to the Twitter API through tweepy. It uses the "OAuthHandler" package within tweepy, which is where the unique keys provided by Twitter for verification by the Twitter API are used. Also within this code, the period for which the script can be connected to Twitter is defined; this cannot be less than the time limit set for the script to run, and I have selected the same limit for both in order to minimise confusion. Finally, the process provides filter information. In this example I have specified that it must only track tweets containing any word within the keyword list (Figure 6), and discard any language that isn't English. In commercial use these settings can be changed.
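The streaming and storage loop described above can be sketched in outline. This is a minimal sketch rather than the project's actual script: the Twitter stream and the MongoDB collection are abstracted as a plain iterable and a callable, so that the control flow (the 900-second limit and the 5-second error back-off) is visible. In the real script the source would be tweepy's stream output filtered on the keyword list and the store would be a pymongo collection insert.

```python
import json
import time

TIME_LIMIT = 900  # seconds - the 15-minute window per run

def run_collector(tweet_source, store, time_limit=TIME_LIMIT,
                  clock=time.time, sleep=time.sleep):
    """Consume raw JSON tweets from tweet_source and pass each decoded
    tweet to store(), until time_limit seconds have elapsed.

    tweet_source: iterable of raw JSON strings (the tweepy stream in
                  the real script).
    store:        callable taking one decoded tweet dict (a MongoDB
                  collection insert in the real script).
    """
    start = clock()
    stored = 0
    for raw in tweet_source:
        if clock() - start >= time_limit:
            break  # the real script exits the loop here and moves on
        try:
            tweet = json.loads(raw)  # each tweet arrives as JSON
            store(tweet)
            stored += 1
        except Exception as err:
            # On error: report it and back off for 5 seconds before
            # continuing, as in Figure 8.
            print("error:", err)
            sleep(5)
    return stored
```

Passing the clock and sleep functions as parameters keeps the time-limit logic testable without waiting 15 minutes.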


The Tweet Collector process was set to run every 15 minutes, for 15 minutes, indefinitely, to maximise the volume of data collected. This method ensured that the script was running for as long as the Twitter API would allow. If the project were put into commercial use, I would schedule the running of the script by hosting it on a server and configuring a "CronJob" task. However, as I did not have access to a server, I scheduled the script to run using the base application "Task Scheduler" in Microsoft Windows.

Figure 10: Scheduling script execution.


Figure 11: Selecting which scripts to run.

The Tweet Collector was monitored manually every two weeks to check that everything was still running and that no issues had been encountered.


5.2. Data Storage

5.2.1. Description

Storage of the data collected using the script above was crucial to satisfy the requirement to compare sightings of different birds over a long time period. As the Tweet Collector was effectively running continuously, it acquired a vast amount of data to be stored. A further question was how a structured database might be designed when the structure of the tweet metadata is not completely known.

5.2.2. Reading

As described above, relevant tweets were collected using the tweepy library and Twitter API and, as stated, early experiments showed variability in the data received. At this stage, options were explored regarding the appropriate storage method and technology to offer data storage flexibility with the ability to undertake data analysis and provide a platform for data representation to the end user. It was clear that a database was needed, and a choice was made between a relational database management system (RDBMS) accessible via Structured Query Language (SQL) - such as Oracle MySQL - and a NoSQL storage structure.

From reading Cattell (2011) and looking at the unstructured content of the captured tweets, together with indications of the prospective data volume, a comparison of SQL versus NoSQL led to the selection of a NoSQL database.


SQL                                                     NoSQL
Stores data in tables                                   Stores data in name-value documents
Needs a schema that defines tables up front             Organic design - can store data without specifying a schema and can react to evolutionary development and review
Inherently supports powerful query tools for analysis   Uses JSON (JavaScript Object Notation) data objects for queries
Scaling to large data volumes needs pre-planning        Scales straightforwardly to store large data volumes

Table 2: Comparison of SQL and NoSQL facilities

I selected the open source MongoDB database as the platform to store the collected data, as MongoDB documents can vary in structure and my JSON formatted data integrates well with Mongo's storage system, remaining efficient to process even with large data volumes. Given the project aim of presenting information as an interactive map (MoSCoW reference 7), MongoDB is a good choice as it offers specific geospatial features. These features include support for GeoJSON, an open-source format for rich geospatial types (Butler et al., 2016).
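As a sketch of how these geospatial features come into play, the following shows a tweet's coordinates wrapped as a GeoJSON Point, the shape MongoDB's geospatial indexes operate on. The function and field names other than the GeoJSON "type"/"coordinates" pair are illustrative, not taken from the project's code.

```python
# Sketch: wrapping a tweet's coordinates as a GeoJSON Point, the
# geospatial format MongoDB indexes (such as "2dsphere") work with.
# The "text" and "location" field names are illustrative.
def tweet_to_geo_doc(text, latitude, longitude):
    return {
        "text": text,
        "location": {
            "type": "Point",
            # GeoJSON lists coordinates as [longitude, latitude]
            "coordinates": [longitude, latitude],
        },
    }

doc = tweet_to_geo_doc("Cuckoo heard this morning", 52.63, 1.30)
```

With pymongo, such documents could then be given a "2dsphere" index on the "location" field to support proximity queries over sightings.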

5.2.3. Implementation

An unexpected problem that arose was the sheer volume of data that the Tweet Collector stored. The Tweet Collector quickly captured over 500 million entries, and it was found that a large proportion were of no relevance. To resolve this problem I segmented a section of around five thousand entries. This segmentation gave a representative sample that could be read and analysed with more efficiency. The Tweet Collection process captured a large volume of tweets; however, further processing of captured data was required to disassemble the tweets into component parts. In particular, the geographical location information (latitude and longitude) needed to be extracted.

Early development, assembly and review showed that a JavaScript and Node.js implementation could be achieved but system performance would be slow. JavaScript was initially chosen as a method for data analysis as it appeared to have MongoDB connectivity while also being able to directly host a website server on which the data could be presented to the user and manipulated. However, it quickly became apparent that this was very inefficient. To begin with, the JavaScript server seemed unable to handle the sheer volume of tweets held within the database and consistently crashed. The few times it managed to complete the streaming of the data, it took a very long time to load, which is not viable if the website were to become available to the public. Secondly, for data analysis purposes, JavaScript did not contain the required inbuilt functionality, meaning that features such as searching through data would have needed to be specifically created.

This led to a review of the chosen technology and, in conjunction with my supervisor, I chose to use 'R' to manipulate the stored data. 'R' is a programming environment for scripted data manipulation, calculation and graphical display, and is available as Free Software under the terms of the Free Software Foundation's GNU General Public License. It is highly extensible and a large number of packages are freely available. This project used RMongo, a package designed to allow the user to execute Mongo queries through an R script (Chheng, 2011), to manipulate the data that had been collected in the Mongo database. I created an R script to connect to the locally hosted database.

Figure 12: Code to connect to mongo database.

Once connected to the database, an R script queries the database using the specified parameters.


Figure 13: Code to query mongo database.

This project requires geo-location data but, because of the way that Twitter stores its tweet data, extra characters surrounding the longitude and latitude had to be removed in order to access it. This was done using R's "gsub" function, which replaces occurrences of a pattern within a string. This was initially simple to do, as shown below.

Figure 14: Gsub function to format geographical data.

However, I soon discovered that not all of the geo-location data was formatted in the same way. I therefore had to extend the algorithm with a section of conditional statements, checking whether the first character was a comma, in order to bring the geo-location data into a standardised format for subsequent use.

Figure 15: Conditional statements to check for uneven format and reformat.
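The cleaning logic of Figures 14 and 15 can be sketched as follows. The sketch is in Python for clarity (the project's script does the equivalent in R with gsub and conditional checks), and the raw input formats shown are illustrative assumptions about the clutter surrounding the coordinates.

```python
import re

def clean_geo(raw):
    """Strip the clutter surrounding a tweet's coordinates, leaving a
    uniform "lat,lon" string. The raw formats handled here are
    illustrative - the project's R script does the equivalent with
    gsub and a chain of conditional statements.
    """
    # Remove everything except digits, signs, dots and commas
    # (brackets, quotes, whitespace and so on).
    cleaned = re.sub(r"[^0-9.,-]", "", raw)
    # Some records arrive with a stray leading comma; drop it so that
    # every record shares the same "lat,lon" shape.
    if cleaned.startswith(","):
        cleaned = cleaned[1:]
    return cleaned

clean_geo('[52.62, 1.29]')   # -> "52.62,1.29"
clean_geo(',52.62, 1.29')    # -> "52.62,1.29"
```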


5.3. Analysis and Presentation

5.3.1. Description

The Data Storage process mapped my data into JSON (JavaScript Object Notation). JSON had been selected with the idea that the analysis and presentation process would be built as a website JavaScript implementation. However, it became clear that the use of 'R' for data manipulation could be extended to support the presentation of data as a map (MoSCoW references 6 and 7).

5.3.2. Reading

Learning to use the programming language R beyond simple data manipulation was a risk, as I had never used it prior to this project. The language is often used by scientists in many research fields, including ecology. It is a powerful open source data manipulation language and therefore seemed appropriate for the large amount of data manipulation required. R also has a number of virtues:

• It is an open source language, with all of its features available free of charge.

• It is the standard for use among professional statisticians.

• It is a general purpose programming language and incorporates other useful functionality such as map graphics.

Despite my initial lack of knowledge and understanding of the language, there are many forums and articles that teach 'R' functionality. One of the main sources of this information was Matloff (2011). I used my newly-found 'R' skills to manipulate my database of records as preparation for the data presentation element of the project.

The final stage of the process was implementing a subsystem to present and publish the data to both the BTO and the general public. As the aim is to make the information easily available and interactive, the obvious answer was to create a website to access and present the information.

However some questions arose:


• Should I be perpetually running the R scripts on a server and then streaming the information onto the website?

• Is it possible to automate the script on the website itself and post its output directly?

• Which language is best for combining the website and the data analysis?

The answer to these questions surprised me. Initially I believed that the best approach would be to host a server using Node.js - an open-source, cross-platform JavaScript runtime environment for executing JavaScript code server-side, using the Chrome V8 JavaScript engine. However, after reading Chaniotis et al. (2015), it became apparent that not only would this involve a large amount of processing overhead on the system, running two servers and perpetually streaming data between the two, it would also be inefficient given the fixed project time-scales and the amount of coding required. Whilst researching alternative languages such as PHP, I learnt about the R package "Shiny" (Chang et al., 2015). This is a package developed by the same team that created the commonly used graphical user interface (GUI) "RStudio" (Team, 2015). Shiny is a web application framework that supports the creation and hosting of website applications that have the R script embedded.

5.3.3. Implementation

Map representation (MoSCoW reference 6) required further data manipulation. In the initial implementation each tweet's geo-data is stored within one column in a single R dataframe. To be able to plot each point on a map, the longitudes and latitudes must be split between two columns, creating a spatial object. To accomplish this, I split the data using the "str_split" function within the R package "stringr" (Wickham, 2010). As the name implies, the "str_split" function splits up a string into pieces. I gave the function the delimiting character of a comma, which allowed the function to split the geo-data column at that character and then re-assign each side of the split into two new columns in a new dataframe containing latitude and longitude.


Figure 16: Split geographical data into Longitude and Latitude.
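The split step can be sketched as follows, in Python for clarity (the project performs the equivalent in R); the sample geo strings are illustrative.

```python
# Sketch of the split step: each cleaned geo string holds both values
# separated by a comma, so splitting on that delimiter yields separate
# latitude and longitude columns.
geo_column = ["52.62,1.29", "51.50,-0.12"]

latitudes = []
longitudes = []
for geo in geo_column:
    lat, lon = geo.split(",")  # delimit on the comma between the two values
    latitudes.append(float(lat))
    longitudes.append(float(lon))
```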

Initially when plotting the map I used a package called "ggmap" (Kahle and Wickham, 2015), a collection of functions for visualising spatial data. It did plot the points on a map, but it required a large amount of code to create and produced a static image that didn't fulfil the 'should be interactive' aim of MoSCoW reference 7.

Figure 17: Static Map Created by "ggmap" package.


This was not ideal to meet the project objectives. I opted to change from ggmap to the newer package "leaflet" (Cheng et al., 2017). Leaflet is used for creating interactive maps in which the user can zoom in to street level, with functionality similar to that of Google Maps (under the hood, the R package wraps the open-source Leaflet JavaScript library, which draws its map tiles from providers such as OpenStreetMap rather than from the Google Maps API). This interactivity is far more useful for the user to be able to accurately pinpoint and analyse bird location and activity.

Figure 18: Code that creates the interactive map.


Figure 19: Interactive Map Created.

There are multiple benefits to using an interactive map:

• Labels can be hidden and revealed by clicking on them.

• The ability to zoom in to street level lets the user accurately pinpoint the location from which the tweet emanated.


The interactive maps are presented within a website that holds and presents the data. I have used Twitter Bootstrap, an open-source framework originally developed at Twitter in 2010. Bootstrap provides a good set of website design tools that help create professional-looking, intuitive websites.

In combination with the interactive map, I created a date selection feature to let the user

interact with data from a chosen date range (MoSCoW reference 8). The dates available

for the user to choose are taken from the tweet data and directly linked to the tweet. This

prevents the user from selecting dates in which there were no tweets and generating an

empty map - a map with no labels populated.
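The selection logic described above can be sketched in Python; the tweet records, field names and dates are illustrative assumptions:

```python
from datetime import date

# Hypothetical tweets with posting dates (invented data).
tweets = [
    {"species": "osprey", "posted": date(2017, 3, 1)},
    {"species": "bittern", "posted": date(2017, 3, 15)},
    {"species": "osprey", "posted": date(2017, 4, 2)},
]

# Only dates on which tweets actually exist are offered to the
# user, which prevents an empty map for a tweet-less range.
available_dates = sorted({t["posted"] for t in tweets})

def in_range(tweet, start, end):
    # Keep a tweet only if it was posted within [start, end].
    return start <= tweet["posted"] <= end

selected = [t for t in tweets
            if in_range(t, date(2017, 3, 1), date(2017, 3, 31))]
print(len(selected))  # 2 tweets fall within March 2017
```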

Figure 20: Date selection.

The date selector also delivers MoSCoW reference 9 and gives the user the ability to estimate possible bird migration patterns. For example, if a tweet mentioning a specific species emanated from Scotland, the user could alter the start and end dates to a month later and see that another tweet about the same species was posted from Ireland. The user could infer that the bird had migrated between the two regions. In reality this would not be reliable, as there is no method of verifying that the bird tweeted about is the same one.

A key question identified in the pre-project stage was "What proportion of the captured

data contains useful extractable information such as the user’s geographic location?".


I found that fewer than 1% (0.66%) of tweets contained geo-location metadata. This felt like a waste of the large volume of other tweets that may contain useful information. To reduce this waste, another feature I added to the website was a page to view the entire collection of tweets. This page includes key information such as the user's details, the text of the tweet, when it was posted, and any media included in the tweet, such as pictures, which can be viewed by copying the URL into a browser.
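The proportion reported above comes down to a simple count; a minimal Python sketch, where the records and the "geo" field name are invented for illustration:

```python
# Count what fraction of captured tweets carry geo-location
# metadata (field name "geo" is hypothetical, data invented).
tweets = [
    {"text": "Red kite over the A11", "geo": "52.48,0.97"},
    {"text": "Anyone seen the waxwings?", "geo": None},
    {"text": "RT Lovely photo of a robin", "geo": None},
    {"text": "Avocets at Titchwell", "geo": None},
]

geo_tagged = sum(1 for t in tweets if t["geo"] is not None)
proportion = geo_tagged / len(tweets)
print(f"{proportion:.2%} of tweets are geo-tagged")  # 25.00% here
```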

Figure 21: Basic Direct Comparison Table.

The final part of presenting the data was to use the package "Shiny" to embed the R script into a website and to design it so that accessing the data remains user-friendly. To create the website I built the user interface (UI) file and the server-side file. They interact with each other: the UI file controls the appearance of the page, while the server file handles the script.

At this point the Shiny server is hosted locally (localhost/127.0.0.1, as illustrated in this report's screenshots); in a public or commercial setting it would be hosted on an external server, allowing open access to the system.


The website follows a client-server design. Data processing is handled as a server process, separate from the data presentation process, which reduces page loading time.

Figure 22: Map Shiny Webpage.

6. Areas of Knowledge Used and Gained

To complete the project of complementing the BTO's current bird-monitoring system, knowledge of the following technologies, in addition to general programming and analysis skills, has been gained and used:

• Twitter streaming API and the tweepy package for capturing raw data

• Python coding to interface with tweepy


• R data analysis and scripting, and the use of CRAN (the Comprehensive R Archive Network) to identify and source additional packages

• Website design to present analysis of the results

• Bird ecology to validate that the results are sensible in the ’real world’

7. Outcome and Areas of Development

This section summarises what the project has achieved and identifies areas where it could be taken forward to enhance its validity. The project aims were to:

• Create a process to monitor and collect relevant tweets, typically targeted at keywords such as a bird species.

• Store collected data in a database.

• Analyse the data to identify geographic information.

• Present analysed data in a user-friendly, map format for web delivery.

In turn, these project aims were converted into a set of MoSCoW prioritised candidates.

The table below restates the MoSCoW prioritised candidates together with a statement

of whether the project achieved the aim.


Reference  Candidate                                        MoSCoW  Result
1          Define relevant tweets (key words and phrases)   Must    Achieved (1)
2          Capture relevant tweets over a sustained period  Must    Achieved
3          Store tweets for analysis                        Must    Achieved (2)
4          Identify the location source of tweets           Must    Achieved
5          Present tweets and location to the user          Must    Achieved
6          Present locations as a map                       Should  Achieved
7          Interactive map                                  Should  Achieved
8          View map for a specific time period              Could   Achieved (3)
9          Show bird migration over time                    Could   Achieved (4)
10         Publish as a public web site                     Won't   Not in scope

Table 3: Project MoSCoW Results

As can be seen from Table 3, all Must, Should and Could aims have been achieved. However, as noted below, there are a number of areas where the achievement could be improved:

1. In the current implementation the keywords and phrases are hard-coded. It would be useful if these items could be configured at the user interface.

2. The current project limited the volume of stored tweets.

3. The presentation of the time period at the user interface could be more dynamic, e.g. using sliders.

4. Migration over time would be better presented as an animation.

7.1. Areas Of Development

This project has demonstrated that it is possible to capture relevant data from Twitter and transform it into valuable information that can be presented interactively on a website. A clear area for development is the user interface, to make the information available to the public. However, a challenge that this project has not fully addressed is the issue


of context. A key requirement of the BTO's current data services is that "survey design should minimise biases within the data, so as not to undermine the scientific objectives of the scheme" (British Trust for Ornithology, 1932c). Twitter's unstructured nature means that it is possible for the data to become 'polluted' by duplicate and irrelevant data. For example, any tweet whose text begins with the two characters "RT" is simply a retweet of someone else's information. Retweets must be removed, as they would skew the data and make it appear that more sightings of particular species had occurred than actually happened. Whilst it is programmatically straightforward to exclude retweets, it is a more difficult challenge to determine context. For example, if a tweet mentioned the politician "Angela Eagle", the Tweet Collector would collect it, but it is clearly not relevant to the actual bird species of eagle. Reviewing the literature shows that this is a difficult challenge, but it could be a useful area for a further project.
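A minimal Python sketch of the retweet exclusion, together with a deliberately naive context heuristic (the example tweets, the species keyword and the person-name check are all illustrative assumptions; real context detection is far harder):

```python
import re

# Invented tweets mentioning the keyword "eagle".
tweets = [
    "Just saw a golden eagle near Aviemore!",
    "RT Just saw a golden eagle near Aviemore!",
    "Angela Eagle speaks in Parliament today",
]

# Exclude retweets: text starting with "RT" is a repeat of
# someone else's sighting and would inflate species counts.
originals = [t for t in tweets if not t.startswith("RT")]

def looks_like_person(text, species="eagle"):
    # Hypothetical heuristic: flag the species word when it is
    # capitalised and preceded by a capitalised name, as in
    # "Angela Eagle". This catches only the simplest cases.
    return re.search(r"[A-Z][a-z]+ " + species.capitalize(), text) is not None

relevant = [t for t in originals if not looks_like_person(t)]
print(len(originals), len(relevant))  # 2 1
```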

7.2. Outcome

The results produced by this system show that data can be successfully taken from Twitter, manipulated, analysed and used to populate a user-friendly system that would allow both the general public and more authoritative bodies, such as the British Trust for Ornithology, to monitor bird activity and the opinions and thoughts of the public towards wild birds.

However, this project has also demonstrated that data collected from Twitter is unreliable. Issues with context are apparent; users not enabling geo-location on their accounts removes the ability to plot each tweet on the map; and there is no system in place to verify the information that users claim. For non-experts, the difference between two species of bird may be unclear, and this can lead to skewed information due to the public's lack of knowledge.

8. Conclusion

This project demonstrated that it is feasible to use Twitter as a source of information

regarding wildlife, birds in particular. This information can be presented to a user in an

interactive and friendly manner.


A key aspect of this project is the identification of the geographic source of 'tweets'. During the Tweet Collector phase it became apparent that only a small proportion of Twitter users share their precise latitude and longitude; the location-sharing feature defaults to OFF in the Twitter user interface. This meant that fewer than 1% of the tweets collected contained data of interest, and hence the validity of the project results could be questioned.

However, the project has illustrated the collection, storage and analysis of large volumes of Twitter data. This methodology could be applied in other contexts and with other, more general Twitter API objects. For example, a business's customer service operation could be interested in identifying and extracting tweets that mention its name and that receive interest from other users in the form of 'retweets' or 'likes'.
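Such a filter could be sketched as follows; the tweets are invented, the field names echo the Twitter API's retweet_count and favorite_count, and the brand handle and engagement threshold are arbitrary assumptions:

```python
# Hypothetical captured tweets with engagement counts.
tweets = [
    {"text": "Great service from @SomeAirline", "retweet_count": 12, "favorite_count": 40},
    {"text": "@SomeAirline lost my bag", "retweet_count": 2, "favorite_count": 1},
    {"text": "Nothing to do with airlines", "retweet_count": 50, "favorite_count": 90},
]

MENTION = "@SomeAirline"     # brand handle (assumed)
MIN_ENGAGEMENT = 10          # arbitrary interest threshold

# Keep tweets that mention the brand and attract engagement.
notable = [t for t in tweets
           if MENTION in t["text"]
           and t["retweet_count"] + t["favorite_count"] >= MIN_ENGAGEMENT]
print(len(notable))  # 1
```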

Twitter remains a large source of information on many different topics; however, harnessing it effectively also reveals problems. For example, the inability to verify the accuracy of each tweet means that scientific research must treat the general public's lack of knowledge as a limiting factor. Within this project, I feel that I have gained an understanding of how to begin harnessing and manipulating social media information. I have created a flexible system that could be replicated, and similar algorithms could be applied to any topic.

References

Bittner, K. and Spence, I. (2006). Managing Iterative Software Development Projects. Addison-Wesley Professional.

Bonzanini, M. (2016). Mastering Social Media Mining with Python. Packt Publishing.

British Trust for Ornithology (1932a). About BirdTrack. https://www.bto.org/volunteer-surveys/birdtrack/about. Accessed March 18, 2017.


British Trust for Ornithology (1932b). About BTO. https://www.bto.org/about-bto. Accessed March 17, 2017.

British Trust for Ornithology (1932c). Quality of BTO data. https://www.bto.org/research-data-services/data-services/data-quality. Accessed March 19, 2017.

British Trust for Ornithology (1932d). Research and data services. https://www.bto.org/research-data-services. Accessed March 18, 2017.

Butler, H., Daly, M., Doyle, A., Gillies, S., Hagen, S., and Schaub, T. (2016). The GeoJSON format. Technical report, Internet Engineering Task Force.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4):12–27.

Chang, W., Cheng, J., Allaire, J., Xie, Y., and McPherson, J. (2015). Shiny: web application framework for R. R package version 0.11.1.

Chaniotis, I. K., Kyriakou, K.-I. D., and Tselikas, N. D. (2015). Is Node.js a viable option for building modern web applications? A performance evaluation study. Computing, 97(10):1023–1044.

Cheng, J., Karambelkar, B., and Xie, Y. (2017). leaflet: Create interactive web maps with the JavaScript 'Leaflet' library. R package.

Chheng, T. (2011). RMongo: MongoDB client for R.

DSDM Consortium (2014). The DSDM Agile Project Framework. DSDM Consortium ebook: https://www.dsdm.org/resources/dsdm-handbooks/the-dsdm-agile-project-framework-2014-onwards.

Kahle, D. and Wickham, H. (2015). ggmap: A package for spatial visualization with Google Maps and OpenStreetMap. R package version 2.

Kumar, S., Morstatter, F., and Liu, H. (2013). Twitter Data Analytics. Springer Science & Business Media.

Makice, K. (2009). Twitter API: Up and Running: Learn How to Build Applications with the Twitter API. O'Reilly Media, Inc.

Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.


Open Source Initiative (1998). Open Source Initiative: Frequently asked questions. https://opensource.org/faq/#osd. Accessed March 18, 2017.

RStudio Team (2015). RStudio: Integrated development environment for R. RStudio, Inc., Boston, MA. http://www.rstudio.com.

Wickham, H. (2010). stringr: modern, consistent string processing. The R Journal, 2(2):38–40.

A. Rare Bird Network Hashtag Codes

The Rare Bird Network (RBN) uses Twitter to allow bird watchers to publish sighting information. RBN uses pre-defined hashtags for defined regions of the United Kingdom. The hashtag codes are available at http://www.rarebirdnetwork.co.uk/p/hashtag-codes.html with a visual representation available via Google Maps at https://www.google.com/maps/d/viewer?mid=1Aewtew3_0oKzsBFtTGed2qvZ-pU&msa=0&ie=UTF8&t=m&ll=55.56592200000001%2C-3.6254880000000184&spn=11.201913%2C15.402832&z=6
