Monitoring the Status of Wild Birds Using Twitter
Registration number 100053203
2017
Monitoring the Status of Wild Birds Using Twitter
Supervised by Dr Wenjia Wang, Dr Simon Gillings
University of East Anglia
Faculty of Science
School of Computing Sciences
Abstract
Twitter is a huge resource of information. I will harness it to collect data that will be
used to improve the British Trust for Ornithology’s ability to research and help conserve
wild birds.
Twitter is an online news and social networking service that allows its users to post
information, known as tweets, about any aspect of their lives. I hope to demonstrate
that this data can serve as an extra source of information relevant to the conservation
and monitoring of wild bird species. This project involves: collecting this information
from Twitter, storing it, manipulating it to convert it into a useful format, and then
completing analysis and presentation. The project output will be reviewed to consider
whether Twitter is a viable route that could be investigated further by significant
authorities, such as the British Trust for Ornithology, to help maintain rare species.
Acknowledgements
This project could not have been completed without the valuable assistance of my su-
pervisors, Dr. Wenjia Wang (UEA) and Dr. Simon Gillings (BTO).
CMP-6013Y
Contents
1. Introduction 6
1.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2. Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2. Approach 8
3. Pre-project 9
3.0.1. Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4. Feasibility and Foundations 12
5. Evolutionary Development 14
5.1. Tweet Collector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1.1. Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1.2. Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.1.3. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.2. Data Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2.1. Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2.2. Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2.3. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3. Analysis and Presentation . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3.1. Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3.2. Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3.3. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6. Areas of Knowledge Used and Gained 32
7. Outcome and Areas of Development 33
7.1. Areas Of Development . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2. Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
8. Conclusion 35
Reg: 100053203 iii
List of Figures
1. The DSDM process (Consortium et al., 2014). . . . . . . . . . . . . . . 8
2. Project Gantt chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3. Main tweet streaming and storage method. . . . . . . . . . . . . . . . . 12
4. Geographical data before being cleaned. . . . . . . . . . . . . . . . . . 13
5. Project Processes Foundation Design. . . . . . . . . . . . . . . . . . . 14
6. Object initialisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
7. Initiator method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8. Main tweet streaming and storage method. . . . . . . . . . . . . . . . . 17
9. Connect to twitter API. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
10. Scheduling script execution. . . . . . . . . . . . . . . . . . . . . . . . 19
11. Selecting which scripts to run. . . . . . . . . . . . . . . . . . . . . . . 20
12. Code to connect to mongo database. . . . . . . . . . . . . . . . . . . . 23
13. Code to query mongo database. . . . . . . . . . . . . . . . . . . . . . . 24
14. Gsub function to format geographical data. . . . . . . . . . . . . . . . 24
15. Conditional statements to check for uneven format and reformat. . . . . 24
16. Split geographical data into Longitude and Latitude. . . . . . . . . . . . 27
17. Static Map Created by "ggmap" package. . . . . . . . . . . . . . . . . 27
18. Code that creates the interactive map. . . . . . . . . . . . . . . . . . . 28
19. Interactive Map Created. . . . . . . . . . . . . . . . . . . . . . . . . . 29
20. Date selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
21. Basic Direct Comparison Table. . . . . . . . . . . . . . . . . . . . . . 31
22. Map Shiny Webpage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
List of Tables
1. Project MoSCow prioritisation . . . . . . . . . . . . . . . . . . . . . . 10
2. Comparison of SQL and NOSQL facilities . . . . . . . . . . . . . . . . 22
3. Project MoSCoW Results . . . . . . . . . . . . . . . . . . . . . . . . . 34
1. Introduction
Twitter is a well-known social media website, known for the exchange of posts of up
to 140 characters called "tweets". Within the ecological world, it is a goldmine of
shared information about species being sighted all around the world.
These pieces of information can be accessed by taking advantage of Twitter’s streaming
API. There are many potential uses for this information and I am hoping to help the
British Trust for Ornithology (BTO), by collecting large numbers of these tweets and
performing analysis on them. The aim is to find useful data to help the BTO improve
the current system for monitoring wild bird species.
1.1. Overview
The British Trust for Ornithology (BTO) is a research institute combining professional
and citizen science. The BTO uses evidence about wildlife populations, particularly birds,
to inform the public and opinion-formers, and to guide environmental policy (British Trust
for Ornithology, 1932b). The British Trust for Ornithology collects data on the abun-
dance, distribution, habitat associations, movements, survival, behaviour and breeding
performance of birds. This is achieved using networks of volunteers who make a huge
contribution to the BTO’s understanding of the status and ecology of birds in the United
Kingdom. This project’s aim is to supplement the data contributed by the BTO’s sci-
entists and volunteers with information about bird sightings derived from the Twitter
social media platform. When Twitter users publish messages, a proportion of those
tweets contain geographic information (geo data) identifying their location. A tweet
only contains geographical location data if the Twitter user has specifically enabled that
option within their account.
The high level project aims are to:
• Create a process to monitor and collect relevant tweets, typically targeted to keywords
such as a bird species name.
• Store collected data in a database.
• Analyse the data to identify geographic data.
• Present analysed data in a user-friendly, map format for web delivery.
The ultimate objective is to implement a website containing useful information for each
species of bird monitored through the system such as: when in the year the first sightings
occur, where the birds are being sighted and the population changes.
1.2. Risks
A key requirement of the BTO’s current data services is that "survey design should
minimise biases within the data, so as not to undermine the scientific objectives of the
scheme." (British Trust for Ornithology, 1932c)
The main risk with collecting information from the general public is that Twitter-derived
data may not offer any form of information that is useful to the BTO. The BTO’s current
survey method gathers information sourced from committed bird watchers all over the
country. Data collected from Twitter comes from people who may be less knowledge-
able and this could lead to errors such species identification mistakes. The contingency
plan if this issue is encountered would be to alter what the data is used for after collec-
tion. For example: if the data collected offers no information relevant to the project, the
data could be filtered with the goal to find out where the general public lack knowledge.
This information could in turn be used to identify ways that the BTO could help to fill
in the educational gaps.
Other risks identified during the project include ’polluted’ data: for example, a tweet
about the Member of Parliament Angela Eagle could skew data about eagles. I have
reduced this risk by removing irrelevant tweets based on context.
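As an illustration of this context-based filtering, a minimal Python sketch follows. The function name, blocklist terms and example tweets are hypothetical, not the project's actual code:

```python
def is_polluted(tweet_text, species, blocklist):
    """Flag a tweet as irrelevant when the species keyword only appears
    alongside a known confounding term (hypothetical example: the MP
    'Angela Eagle' polluting data about eagles)."""
    text = tweet_text.lower()
    if species.lower() not in text:
        return False  # tweet is not about the species at all
    return any(term.lower() in text for term in blocklist)

# Hypothetical confounders for the keyword "eagle".
blocklist = ["angela eagle", "eagles fc", "philadelphia eagles"]
```

In use, tweets flagged by `is_polluted` would simply be discarded before storage or analysis.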
Data volume is a risk. The high volume of available data means that there is a risk that
data analysis performance may be slow. This has been an issue during the project and
so I have used a subset of data to test the project. When the project is deployed it may
be necessary to use higher-powered systems and/or distributed computing to deliver
faster real-time responses. Alternatively the system could perform analysis on a batch basis
with daily updates. This would be similar to the BTO’s current process "Every night
the BTO computer will summarise that day’s records and produce up-to-date maps and
graphs showing the latest in migration, movements and distribution" (British Trust for
Ornithology, 1932a).
2. Approach
The broad approach has been to work in an iterative and incremental manner - "A style
of development that involves the iterative application of a set of activities to evaluate
a set of assertions, resolve a set of risks, accomplish a set of development objectives,
and incrementally produce and refine an effective solution" (Bittner and Spence, 2006)
- and to adopt the dynamic systems development method (DSDM). DSDM projects do just
’Enough Design Up Front’ (EDUF) within a Foundations phase in order to understand
and clarify the structure of the overall solution and to create an Agile plan for delivery
of the project.
Figure 1: The DSDM process (Consortium et al., 2014).
From the outset the project has elected to use open source components. "Open Source
software is software that can be freely accessed, used, changed, and shared (in modified
or unmodified form) by anyone. Open source software is made by many people, and
distributed under licenses that comply with the Open Source Definition" (Open Source
Initiative, 1998). This means that the project can take advantage of a choice of tech-
nologies and leaves the opportunity for the system to be maintained and developed in
the future without software licence restrictions. This is particularly relevant as the target
user, the BTO, is a charitable research institute.
3. Pre-project
In order to establish a realistic scope for the project, I first reviewed the BTO’s web
site to gain an insight into its current facilities in the area of Research and Data Services
(British Trust for Ornithology, 1932d) and held discussions with Dr. Simon Gillings at
the BTO, who provided clarity on the project’s direction. In a DSDM project I needed to
understand the relative importance of the possible work to be done in order to move
forward and work to deadlines. The research and interviews with Dr. Gillings led to a
set of functional candidates. I used MoSCoW as a prioritisation technique to identify
key functions and manage priorities. The MoSCoW acronym stands for:
• Must Have - no point in delivering the project without this.
• Should Have - important but not vital; would prefer to deliver, but the solution is
still viable without it.
• Could Have - desirable but less important ideas.
• Won’t Have this time - good ideas that will not be delivered in the current project.
The project’s MoSCoW-prioritised candidates are:
Reference  Candidate                                          MoSCoW
1          Define relevant tweets (key words and phrases)     Must
2          Capture relevant tweets over a sustained period    Must
3          Store tweets for analysis                          Must
4          Identify the location source of tweets             Must
5          Present tweets and location to the user            Must
6          Present locations as a map                         Should
7          Interactive map                                    Should
8          View map for a specific time period                Could
9          Show bird migration over time                      Could
10         Publish as a public web site                       Won’t
Table 1: Project MoSCoW prioritisation
In this pre-project a number of clear unknowns were identified to be addressed later
in the project:
• Do Twitter users publish relevant tweets?
• What is the volume of data to be managed?
• What proportion of the captured data contains useful extractable information such as
the user’s geographic location?
• What is the data processing overhead?
• How should information be presented to the end user?
3.0.1. Gantt Chart
I also created an outline project timetable to support me in keeping track of progress
within the available time. My timetable is represented in the Gantt chart below.
4. Feasibility and Foundations
In establishing the project’s foundations I needed to address bulk capture of relevant
tweets. Setting up the data collection requires access to the Twitter API stream and I
opted to build upon my existing Python skills and use an open source Python library.
I selected the tweepy library from the set of available libraries as it provides access
to all Twitter RESTful API methods and supports OAuth authentication - a protocol
that allows an application to interact with another without giving away the end user
password. In particular the Twitter search API and streaming API are at the heart of the
project’s capture processes and are both available in tweepy. An initial server process
was developed to capture relevant tweets. The key code is shown below:
Figure 2: Project Gantt chart. The chart plots the project schedule (project proposal, literature review, data collection design, data collection running, data analysis, website implementation, testing, code delivery, final report writing and inspection preparation) against pre-revision week numbers and semester week numbers.
Figure 3: Main tweet streaming and storage method.
This initial feasibility and foundation phase identified some challenges to be addressed
in the subsequent phases: the volume of data to be captured is large, and the structure
of captured tweets is inconsistent. For example, the project is dependent upon
identifying the tweeter’s physical location (latitude and longitude). As can be seen
from the examples below, the inconsistent information structure means that further
processing of captured data would be required.
Figure 4: Geographical data before being cleaned.
The above figure demonstrates how cluttered the metadata of each tweet is.
Review of this feasibility activity led to a foundation design for the project components.
Figure 5: Project Processes Foundation Design.
5. Evolutionary Development
The creation of this project can be broken down into three main sections: Tweet Collec-
tion, Data Storage, and Analysis & Presentation. Each section required extending my
knowledge base and coding skills.
5.1. Tweet Collector
5.1.1. Description
In order to create a system that will present Twitter data in a user friendly manner to
the general public, the first problem to be overcome is collection of that data. There are
many different methods of accomplishing this using multiple languages and packages.
5.1.2. Reading
When starting this project, I was unaware of the different methods of collection
that could be applied to a Twitter data-scraping application. Upon reading Kumar et al.
(2013), I quickly learnt that there are two types of API (Application Programming Interface)
that can be used to access Twitter - REST APIs and streaming APIs. These differ
in a few ways, but the main difference is that a REST API only collects data when it
is specifically called, whereas a streaming API will collect a continuous stream of data
once started. It was apparent that for this project using the streaming API would be the
most viable approach.
The next question that needed answering was: which language would be best for
building this streaming application?
Java and Python were the two main contenders initially, and I felt comfortable programming
in either language. However, after researching both possibilities using the books
Makice (2009) and Bonzanini (2016), I came to the conclusion that Python would be my
preference. This decision was driven by the volume of research and articles published
about using Python for this purpose, whereas very little had been written about Java.
5.1.3. Implementation
Creating the Tweet Collector process took several steps:
1. Obtain unique Twitter tokens allowing OAuth access to the Twitter API.
2. Provide a set of areas of interest using specific key words (MoSCoW reference 1).
Examples of keywords and hashtags that I used are: "cuckoo" and "#rbnNFk"
("#rbnNFk" is a specific hashtag used by the Rare Birds Network to identify
sightings reported in a specific UK region - in this case the Norwich area. Access
to the Rare Bird Network’s hashtag codes is given in Appendix A).
3. Set the data collection to run repeatedly, in 15-minute sessions, for several months.
The aim is to collect a large amount of data that varies as collection continues
through different seasons of the year, providing information over extended periods
of time.
I developed a Python script to automate the process.
Figure 6: Object initialisation.
Before accessing the Twitter API, you must register your application with Twitter’s
developer programme. After successful registration, Twitter provides a set of unique keys.
1. The provided Twitter API access codes are initialised.
2. The objects that are required in order for the script to access the Twitter API are
initialised.
3. To allow the Python code to interact with the Twitter API, packages from "tweepy"
are imported.
The final object to be initialised here is the list of keywords that indicate to the system
the tweets to be collected and stored in the database.
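Figure 6 is reproduced only as an image, so the following is a minimal Python sketch of the same initialisation; the key values are placeholders (Twitter issues the real values on registration) and the keyword examples are taken from the report:

```python
# Placeholder credentials - Twitter provides real values on app registration.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_SECRET = "YOUR_ACCESS_SECRET"

# Keywords and hashtags that mark a tweet as relevant (examples from the report).
KEYWORDS = ["cuckoo", "#rbnNFk"]
```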
Figure 7: Initiator method.
Figure 8: Main tweet streaming and storage method.
The above figures show the method used to connect to a database, stream tweets
from the Twitter API and store each tweet within a locally hosted database.
Initially, the code is encapsulated within a while loop - the condition remains true while
the running time of the script remains lower than the time limit set by Twitter. Then,
within a try/except statement, a series of objects are created and assigned a specific level
of the database server, starting first at the port and server, then the database and then the
collection within the database.
This process allows the script to:
1. Connect to the correct position in the database to store each tweet.
2. Using the stream package within tweepy, assign each tweet to an object in JSON
(JavaScript Object Notation) format.
3. Store the tweet object.
4. If an error is encountered, print an error message to the console and put the while
loop to sleep for 5 seconds before restarting.
5. Once the given time limit (900 seconds) has been reached, break out of the loop
with "exit()" and move on to the next section.
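Since Figure 8 is reproduced only as an image, the steps above can be sketched in Python. This is an illustration, not the project's actual code: it assumes pymongo and a local MongoDB, the database and collection names are my own placeholders, and `fetch_tweets` is a caller-supplied stand-in for the tweepy stream. The time-limited retry logic is factored into a helper so it can be exercised on its own:

```python
import time

def run_with_retries(stream_once, time_limit=900, retry_delay=5,
                     sleep=time.sleep, clock=time.time):
    """Call stream_once repeatedly until time_limit seconds have elapsed.
    On an exception, print the error and pause retry_delay seconds before
    continuing - mirroring the try/except loop described above."""
    start = clock()
    runs = 0
    while clock() - start < time_limit:
        try:
            stream_once()
            runs += 1
        except Exception as err:
            print("stream error:", err)
            sleep(retry_delay)
    return runs

def store_stream(fetch_tweets, time_limit=900):
    """Sketch only: needs pymongo and a local MongoDB. fetch_tweets is a
    hypothetical callable yielding tweet JSON dicts, standing in for the
    tweepy stream shown in Figure 8."""
    from pymongo import MongoClient  # lazy import: server -> db -> collection
    collection = MongoClient("localhost", 27017)["twitterdb"]["tweets"]

    def stream_once():
        for tweet_json in fetch_tweets():
            collection.insert_one(tweet_json)  # store each JSON tweet document

    run_with_retries(stream_once, time_limit)
```

Injecting the clock and sleep functions keeps the loop deterministic to test without waiting 900 real seconds.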
Figure 9: Connect to Twitter API.
This final figure shows the code required for the system to connect to the tweepy and
Twitter APIs. It uses the "OAuthHandler" package within tweepy, which is where the
unique keys provided by Twitter for verification are applied. This code also defines the
period for which the script may stay connected to Twitter; this cannot be less than the
script’s own running time limit, so I have used the same limit for both in order to
minimise confusion. Finally the process provides filter information: in this example I
have specified that it must only track tweets containing any word within the keyword
list (Figure 6), and discard any tweet whose language is not English. In commercial use
these filters can be changed.
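Figure 9 is an image only, so here is a hedged Python sketch of the equivalent tweepy calls (3.x-era API assumed); the key-dictionary layout and function names are my own placeholders. The filter arguments - the tracked keyword list and the English-only restriction - are plain data and can be built separately:

```python
def build_filter_args(keywords):
    """Filter arguments: track the keyword list, English-language tweets only."""
    return {"track": keywords, "languages": ["en"]}

def open_stream(keys, keywords, listener):
    """Sketch only: requires the tweepy package (3.x-era API) and valid keys."""
    import tweepy  # lazy import so the pure helper above works without tweepy
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], keys["access_secret"])
    stream = tweepy.Stream(auth=auth, listener=listener)
    stream.filter(**build_filter_args(keywords))
```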
The Tweet Collector process was set to run every 15 minutes, for 15 minutes at a time,
indefinitely, to maximise the volume of data collected. This method ensured that the
script was running for as long as the Twitter API would allow. If the project were put
into commercial use, I would schedule the running of the script by hosting it on a server
and configuring a "CronJob" task. However, as I did not have access to a server, I
scheduled the script to run using the base application "Task Scheduler" in Microsoft
Windows.
Figure 10: Scheduling script execution.
Figure 11: Selecting which scripts to run.
The Tweet Collector was monitored manually every 2 weeks to check that everything
was still running and that no issues had been encountered.
5.2. Data Storage
5.2.1. Description
Storage of the data collected using the script above was crucial to satisfy the requirement
to compare sightings of different birds over a large-scale time period. As the Tweet
Collector was effectively running continuously, it acquired a vast amount of data to be
stored. A further question arose: how might a structured database be designed when the
tweet metadata structure is not completely known in advance?
5.2.2. Reading
As described above relevant tweets were collected using the tweepy library and Twitter
API and, as stated, early experiments showed variability in the data received. At this
stage options were explored regarding the appropriate storage method and technology
to offer data storage flexibility with the ability to undertake data analysis and provide a
platform for data representation to the end user. It was clear that a database was needed,
and a choice was made between a relational database management system (RDBMS)
accessible via Structured Query Language (SQL) - such as Oracle MySQL - and a
NoSQL storage structure.
From reading Cattell (2011), and looking at the unstructured content of the captured
tweets together with indications of the prospective data volume, a comparison of SQL
versus NoSQL led to the selection of a NoSQL database.
SQL                                          NoSQL
Stores data in tables                        Stores data in name-value documents
Needs a schema that defines tables           Organic design - can store data without
up front                                     specifying a schema and can react to
                                             evolutionary development and review
Inherently supports powerful query           Uses JSON (JavaScript Object Notation)
tools for analysis                           data objects for queries
Scaling to large data volumes needs          Scales straightforwardly to store large
pre-planning                                 data volumes
Table 2: Comparison of SQL and NoSQL facilities
I selected the open source MongoDB database as the platform to store the collected
data, as MongoDB documents can vary in structure and my JSON-formatted data
integrates well with Mongo’s storage system while remaining efficient to process even
with large data volumes. Given the project aim of presenting information as an interactive
map (MoSCoW reference 7), MongoDB is a good choice as it offers specific geospatial
features. These include support for GeoJSON, an open format for rich geospatial
types (Butler et al., 2016).
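For illustration, this is roughly what a GeoJSON point would look like stored in MongoDB; the field names and coordinate values here are invented, and note that GeoJSON orders coordinates longitude first:

```python
# A GeoJSON Point document as MongoDB's 2dsphere geospatial index expects it.
sighting = {
    "species": "cuckoo",                   # illustrative field name
    "location": {
        "type": "Point",
        "coordinates": [1.2974, 52.6309],  # [longitude, latitude] - invented
    },
}
```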
5.2.3. Implementation
An unexpected problem that arose was the sheer volume of data that the Tweet Collector
stored. It quickly captured over 500 million entries, and a large proportion were found
to be of no relevance. To resolve this problem I extracted a sample of around five
thousand entries, giving a representative subset that could be read and analysed more
efficiently. The Tweet Collection process captured a large volume of tweets; however,
further processing was required to disassemble the tweets into component parts. In
particular, the geographical location information (latitude and longitude) needed to be
extracted. Early development, assembly and review of a JavaScript implementation
showed that a JavaScript and Node.js implementation could be achieved, but system
performance would be slow. JavaScript
was initially chosen as the method for data analysis as it appeared to offer MongoDB
connectivity while also being able to directly host a website server through which the
data could be presented to the user and manipulated. However, it quickly became
apparent that this was very inefficient. To begin with, the JavaScript server seemed
unable to handle the sheer volume of tweets held within the database and consistently
crashed. On the few occasions it managed to complete streaming the data, it took a
very long time to load, which is not viable if the website were to become available to
the public. Secondly, for data analysis purposes, JavaScript did not provide the
necessary inbuilt functionality, meaning that features such as searching through data
would have needed to be specifically created.
This led to a review of the chosen technology and, in conjunction with my supervi-
sor, I chose to use ’R’ to manipulate the stored data. ’R’ is a programming environment
for scripted data manipulation, calculation and graphical display and is available as Free
Software under the terms of the Free Software Foundation’s GNU General Public Li-
cense. It is highly extensible and a large number of packages are freely available. This
project used RMongo, a package designed to allow the user to execute Mongo queries
through an R script (Chheng, 2011), to manipulate the data that had been collected
in the Mongo database. I created an R script to connect to the locally hosted
database.
Figure 12: Code to connect to mongo database.
Once connected to the database, an R script queries it using the specified parameters.
Figure 13: Code to query mongo database.
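The R scripts in Figures 12 and 13 appear only as images; an equivalent connect-and-query step in Python with pymongo would look roughly like this. The database, collection and field names are placeholders, and the query-building helper is separated out because the filter document is plain data:

```python
def keyword_query(keyword):
    """Mongo filter: tweets whose text contains the keyword, case-insensitively."""
    return {"text": {"$regex": keyword, "$options": "i"}}

def fetch_matching(keyword, limit=100):
    """Sketch only: requires pymongo and a locally hosted MongoDB."""
    from pymongo import MongoClient  # lazy import; names below are placeholders
    collection = MongoClient("localhost", 27017)["twitterdb"]["tweets"]
    return list(collection.find(keyword_query(keyword)).limit(limit))
```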
This project requires geo-location data but, because of the way that Twitter stores its
tweet data, extra characters surrounding the longitude and latitude had to be removed
before the geo-location could be accessed.
This was done using R’s "gsub" function, which replaces occurrences of a pattern within
a string. This was initially simple to do, as shown below.
Figure 14: Gsub function to format geographical data.
However, I soon discovered that not all of the geo-location data was formatted in
the same way. I therefore extended the algorithm with a section of conditional
statements, checking whether the first character was a comma, in order to reformat
the geo-location into a standardised format for subsequent use.
Figure 15: Conditional statements to check for uneven format and reformat.
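The gsub-based cleaning in Figures 14 and 15 is shown only as images; the same idea can be sketched in Python. The raw strings below are invented examples of the two observed formats, not actual captured data:

```python
import re

def clean_geo(raw):
    """Strip everything except digits, signs, decimal points and the separating
    comma, then drop any leading comma left behind - mirroring the report's
    gsub call plus its conditional fix-up for the 'uneven' format."""
    cleaned = re.sub(r"[^0-9.,-]", "", raw)
    if cleaned.startswith(","):  # the uneven format began with a comma
        cleaned = cleaned[1:]
    return cleaned

# Invented examples of the two formats:
print(clean_geo("[52.6309, 1.2974]"))   # -> 52.6309,1.2974
print(clean_geo(",52.6309, 1.2974"))    # -> 52.6309,1.2974
```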
5.3. Analysis and Presentation
5.3.1. Description
The Data Storage process mapped my data into JSON (JavaScript Object Notation).
JSON had been selected with the idea that the analysis and presentation process would
be built as a website JavaScript implementation. However, it became clear that the use
of ’R’ for data manipulation could be extended to support the presentation of data as a
map (MoSCoW references 6 and 7).
5.3.2. Reading
Learning to use the programming language R beyond simple data manipulation was a
risk as I had never used it prior to this project. The language is often used by scientists,
in many research fields including ecology. It is a powerful open source data manipulation
language and therefore seemed appropriate for the large amount of data manipulation
required. R also has a number of virtues:
• It is an open source language, with all of its features available free of charge.
• It is the standard for use among professional statisticians.
• It is a general purpose programming language and incorporates other useful function-
ality such as map graphics.
Despite my fears about my lack of knowledge and understanding of the language, there
are many forums and articles that teach ’R’ functionality. One of the main sources
of this information was Matloff (2011). I used my newly found ’R’ skills to manipulate
my database of records in preparation for the data presentation element of the project.
The final stage of the process was implementing a subsystem to present and publish
the data to both the BTO and the general public. As the aim is to make the information
easily available and interactive the obvious answer was to create a website to access and
present the information.
However some questions arose:
• Should I be perpetually running the R scripts on a server and then streaming the
information onto the website?
• Is it possible to automate the script on the website itself and post its output directly?
• Which language is best for combining the website and the data analysis?
The answer to these questions surprised me. Initially I believed that the best approach
would be to host a server using Node.js - an open-source, cross-platform JavaScript
runtime environment for executing JavaScript code server-side, built on the Chrome V8
JavaScript engine. However, after reading Chaniotis et al. (2015), it became apparent
that not only would this involve a large amount of processing overhead on the system -
running two servers and perpetually streaming data between the two - it would also be
inefficient given the fixed project time-scales and the amount of coding required. Whilst
researching alternative languages such as PHP, I learnt about the R package "Shiny"
(Chang et al., 2015). This is a package developed by the same team that created the
commonly used graphical user interface (GUI) RStudio (Team, 2015). Shiny is a web
application framework that supports the creation and hosting of web applications with
R scripts embedded.
5.3.3. Implementation
Map representation (MoSCoW reference 6) required further data manipulation. In the
initial implementation each tweet’s geo-data is stored within one column of a single R
dataframe. To be able to plot each point on a map, the longitudes and latitudes must be
split into two columns, creating a spatial object. To accomplish this, I split the data
using the "strsplit" function alongside the R package "stringr" (Wickham, 2010). As
the name implies, "strsplit" splits a string into pieces. I gave the function the delimiting
character of a comma, which allowed it to split the geo-data column at that character
and then reassign each side of the split into two new columns, latitude and longitude,
in a new dataframe.
Figure 16: Split geographical data into Longitude and Latitude.
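Since the strsplit call in Figure 16 appears only as an image, the equivalent split in Python is shown below. The input string is an invented example, and the latitude-first ordering of the cleaned string is an assumption on my part:

```python
def split_geo(cleaned):
    """Split a cleaned 'lat,long' string into a (latitude, longitude) pair of
    floats. Assumes latitude comes first in the cleaned string."""
    lat, lon = cleaned.split(",")
    return float(lat), float(lon)
```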
Initially when plotting the map I used a package called "ggmap" (Kahle and Wickham,
2015), a collection of functions to visualise spatial data. It did plot the points on
a map, but it required a large amount of code to create and produced a static image that
did not fulfil the ’should be interactive’ aim of MoSCoW reference 7.
Figure 17: Static Map Created by "ggmap" package.
This was not ideal for meeting the project objectives, so I opted to change from ggmap
to the newer package "leaflet" (Cheng et al., 2017). Leaflet is used for creating
interactive maps on which the user can zoom in to street level, with functionality similar
to that of Google Maps (Leaflet itself draws its map imagery from open tile providers
such as OpenStreetMap). This interactivity makes it far easier for the user to accurately
pinpoint and analyse bird location and activity.
Figure 18: Code that creates the interactive map.
Figure 19: Interactive Map Created.
There are multiple benefits gained by using an interactive map:
• Labels can be hidden and revealed by clicking on them.
• The ability to zoom to street level lets the user accurately pinpoint
the location from which the tweet emanated.
The interactive maps are presented within a website that holds and presents the data.
I have used Twitter Bootstrap, an open-source framework originally developed by
Twitter in 2010. Bootstrap provides website design tools that help create
professional-looking, intuitive websites.
In combination with the interactive map, I created a date selection feature to let the user
interact with data from a chosen date range (MoSCoW reference 8). The dates available
for selection are taken directly from the tweet data. This prevents the user from
selecting dates on which there were no tweets and generating an empty map, i.e. a map
with no labels populated.
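The core of this behaviour can be sketched in Python (the project implemented it with Shiny's date inputs; the `"date"` field name on each tweet record is an assumption for illustration):

```python
from datetime import date

def available_dates(tweets):
    """Return the sorted set of dates on which tweets actually exist,
    so the selector never offers a date that would produce an empty map."""
    return sorted({t["date"] for t in tweets})

def tweets_in_range(tweets, start, end):
    """Keep only tweets whose date falls inside the chosen range (inclusive)."""
    return [t for t in tweets if start <= t["date"] <= end]
```

Deriving the selectable dates from the data itself, rather than offering a free calendar, is what guarantees every selection maps to at least one tweet.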
Figure 20: Date selection.
The date selector also delivers MoSCoW reference 9, giving the user the ability to
estimate possible bird migration patterns. For example, if a tweet mentioning a
specific species emanated from Scotland, the user could shift the start and end dates
a month later and see that another tweet about the same species was posted from
Ireland, and infer that the bird had migrated between the two regions. In reality this
inference would not be reliable, as there is no method of verifying that the tweets
refer to the same bird.
A key question identified in the pre-project stage was "What proportion of the captured
data contains useful extractable information such as the user’s geographic location?".
I found that fewer than 1% (0.66%) of tweets contained geo-location metadata. This felt
like a waste of the large volume of other tweets that may contain useful information. To
reduce this waste, I added another page to the website for viewing the entire
collection of tweets. This page includes key information such as the user's details,
the text of the tweet, when it was posted, and any media included in the tweet, such as
pictures, which can be viewed by copying the URL into a browser.
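The geo-tagged proportion reported above can be computed with a simple count; a minimal sketch, assuming each captured tweet is a dict whose `"coordinates"` field is `None` or absent when the user has not enabled location sharing:

```python
def geo_tagged_fraction(tweets):
    """Fraction of tweets carrying usable geo-location metadata."""
    if not tweets:
        return 0.0
    # A tweet counts as geo-tagged only if its coordinates field is present
    # and not None (Twitter omits or nulls it when sharing is disabled).
    tagged = sum(1 for t in tweets if t.get("coordinates") is not None)
    return tagged / len(tweets)
```

Applied to the full captured collection, this is the calculation behind the 0.66% figure.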
Figure 21: Basic Direct Comparison Table.
The final part of presenting the data was to use the package "Shiny" to embed the R
script into a website, designed so that accessing the data remains user-friendly.
To create the website I built the user interface (UI) file and the server-side file. They
interact with each other: the UI file controls the appearance of the page while the server
file handles the script.
At this point the "shiny" server is hosted locally (localhost/127.0.0.1, as illustrated in
this report's screenshots); in a public or commercial setting it would be hosted on an
external server, allowing open access to the system.
The website follows a client-server design: data processing is handled as a server process,
separate from the data presentation process, which reduces website loading time.
Figure 22: Map Shiny Webpage.
6. Areas of Knowledge Used and Gained
To complete the project of complementing the BTO's current bird-monitoring system,
knowledge of the following technologies has been gained and used, in addition to
general programming and analysis skills:
• Twitter streaming API and the tweepy package for capturing raw data
• Python coding to interface with tweepy
• R data analysis, scripting and the use of CRAN (the Comprehensive R Archive Network)
to identify and source additional packages
• Website design to present analysis of the results
• Bird ecology to validate that the results are sensible in the ’real world’
7. Outcome and Areas of Development
This section summarises the project outcomes, describing what has been achieved and
identifying areas where the project could be taken forward to enhance its validity. The
project aims were to:
• Create a process to monitor and collect relevant tweets, typically targeted at keywords
such as a bird species.
• Store collected data in a database.
• Analyse the data to identify geographic information.
• Present analysed data in a user-friendly, map format for web delivery.
In turn these project aims were converted into a set of MoSCoW prioritised candidates.
The table below restates the prioritised candidates together with a statement of
whether the project achieved each aim.
Reference Candidate MoSCoW Result
1 Define relevant tweets (key words and phrases) Must Achieved1
2 Capture relevant tweets over a sustained period Must Achieved
3 Store tweets for analysis Must Achieved2
4 Identify the location source of tweets Must Achieved
5 Present tweets and location to the user Must Achieved
6 Present locations as a map Should Achieved
7 Interactive map Should Achieved
8 View map for a specific time period Could Achieved3
9 Show bird migration over time Could Achieved4
10 Publish as a public web site Won’t Not in scope
Table 3: Project MoSCoW Results
As can be seen from Table 3, all Must, Should and Could aims have been achieved.
However, as noted below, there are a number of areas where the achievement could be
improved:
1. In the current implementation the keywords and phrases are hard-coded. It would
be useful if these items could be configured at the user interface
2. The current project limited the volume of stored tweets
3. The presentation of the time period at the user interface could be more dynamic, e.g.
using sliders
4. Migration over time would be better presented as an animation
7.1. Areas Of Development
This project has demonstrated that it is possible to capture relevant data from Twitter
and transform it into valuable information that can be presented interactively on a
website. A clear area for development is the user interface, to make the information
available to the public. However, a challenge that this project has not fully addressed is the issue
of context. A key requirement of the BTO's current data services is that "survey design
should minimise biases within the data, so as not to undermine the scientific objectives
of the scheme". Twitter's unstructured nature means that the data can become
'polluted' by duplicate and irrelevant entries. For example, any tweet whose text
begins with "RT" is simply a retweet of someone else's information. Retweets must be
removed as they would skew the data, making it appear that more sightings of
particular species had occurred than had actually happened. While it is
programmatically straightforward to exclude retweets, it is a more difficult challenge
to determine context. For example, if a tweet mentioned the politician "Angela Eagle",
the Tweet Collector would collect it, but it is clearly not relevant to the actual bird
species of eagle. Reviewing the literature shows that this is a difficult challenge, but
it could be a useful area for a further project.
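The retweet exclusion described above is a simple prefix test; a minimal sketch, assuming each tweet is a dict with a `"text"` field (the field name is an assumption for illustration):

```python
def is_retweet(text):
    """A tweet whose text begins with "RT" is a retweet of someone
    else's information and would inflate apparent sighting counts."""
    return text.startswith("RT")

def exclude_retweets(tweets):
    """Drop retweets so each sighting is counted at most once."""
    return [t for t in tweets if not is_retweet(t["text"])]
```

Note that this removes duplicates but does nothing for the harder context problem, such as the "Angela Eagle" example.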
7.2. Outcome
The results produced by this system show that data can be successfully taken from
Twitter, manipulated, analysed and used to populate a user-friendly system that would
allow both the general public and more authoritative bodies, such as the British Trust
for Ornithology, to monitor bird activity and the opinions and thoughts of the public
towards wild birds.
However, this project has also demonstrated that data collected from Twitter is
unreliable. Issues with context are apparent; users who do not enable geo-location on
their accounts prevent their tweets from being plotted on the map; and there is no
system in place to verify the information that users claim. For non-experts, the
difference between two species of bird may be unclear, and this can lead to skewed
information due to the public's lack of knowledge.
8. Conclusion
This project demonstrated that it is feasible to use Twitter as a source of information
regarding wildlife, birds in particular. This information can be presented to a user in an
interactive and friendly manner.
A key aspect of this project is the identification of the geographic source of tweets.
During the Tweet Collector phase it became apparent that only a small proportion of
Twitter users share their precise latitude and longitude; the location-sharing feature
defaults to OFF in the Twitter user interface. This meant that fewer than 1% of the
collected tweets contained data of interest, and hence the validity of the project
results could be questioned.
However, the project has illustrated the collection, storage and analysis of large
volumes of Twitter data. This methodology could be applied in other contexts and with
other, more general Twitter API objects. For example, a business's customer-service
operation could be interested in identifying and extracting tweets that mention its
name and that receive interest from other users through 'retweets' or 'likes'.
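Such a filter might be sketched as follows; the `retweet_count` and `favorite_count` field names mirror the Twitter API's tweet object, while the engagement threshold and the record shape are assumptions for illustration:

```python
def popular_mentions(tweets, name, min_engagement=10):
    """Select tweets that mention the given name and have attracted
    interest (retweets + likes) above a threshold."""
    selected = []
    for t in tweets:
        mentions = name.lower() in t["text"].lower()  # case-insensitive match
        engagement = t.get("retweet_count", 0) + t.get("favorite_count", 0)
        if mentions and engagement >= min_engagement:
            selected.append(t)
    return selected
```

The same collect-store-analyse pipeline built for bird sightings would feed this filter unchanged; only the keyword and the selection rule differ.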
Twitter remains a large source of information on many different topics; however,
harnessing it effectively also reveals problems. For example, the inability to verify
the accuracy of each tweet means that scientific research must treat the general
public's lack of knowledge as a factor. Within this project, I feel that I have gained
an understanding of how to begin harnessing and manipulating social media
information. I have created a flexible system that could be replicated, and similar
algorithms could be applied to any topic.
References
Bittner, K. and Spence, I. (2006). Managing iterative software development projects.
Addison-Wesley Professional.
Bonzanini, M. (2016). Mastering Social Media Mining with Python. Packt Publishing.
British Trust for Ornithology (1932a). About BirdTrack. https://www.bto.org/volunteer-surveys/birdtrack/about. Accessed March 18, 2017.
British Trust for Ornithology (1932b). About BTO. https://www.bto.org/about-bto. Accessed March 17, 2017.
British Trust for Ornithology (1932c). Quality of BTO data. https://www.bto.org/research-data-services/data-services/data-quality. Accessed March 19, 2017.
British Trust for Ornithology (1932d). Research and data services. https://www.bto.org/research-data-services. Accessed March 18, 2017.
Butler, H., Daly, M., Doyle, A., Gillies, S., Hagen, S., and Schaub, T. (2016). The
GeoJSON format. Technical report, Internet Engineering Task Force.
Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4):12–27.
Chang, W., Cheng, J., Allaire, J., Xie, Y., and McPherson, J. (2015). Shiny: web
application framework for R. R package version 0.11.1.
Chaniotis, I. K., Kyriakou, K.-I. D., and Tselikas, N. D. (2015). Is Node.js a viable
option for building modern web applications? A performance evaluation study.
Computing, 97(10):1023–1044.
Chheng, T. (2011). RMongo: MongoDB client for R.
DSDM Consortium (2014). The DSDM Agile Project Framework. DSDM Consortium ebook:
https://www.dsdm.org/resources/dsdm-handbooks/the-dsdm-agile-project-framework-2014-onwards.
Kahle, D. and Wickham, H. (2015). ggmap: a package for spatial visualization with
Google Maps and OpenStreetMap. R package version 2.
Kumar, S., Morstatter, F., and Liu, H. (2013). Twitter data analytics. Springer Science
& Business Media.
Makice, K. (2009). Twitter API: Up and Running: Learn how to build applications with
the Twitter API. O'Reilly Media, Inc.
Matloff, N. (2011). The art of R programming: A tour of statistical software design. No
Starch Press.
Open Source Initiative (1998). Open Source Initiative - frequently asked questions.
https://opensource.org/faq/#osd. Accessed March 18, 2017.
RStudio Team (2015). RStudio: integrated development environment for R. RStudio, Inc.,
Boston, MA. URL http://www.rstudio.com.
Wickham, H. (2010). stringr: modern, consistent string processing. The R Journal,
2(2):38–40.
A. Rare Bird Network Hashtag Codes
The Rare Bird Network (RBN) uses Twitter to allow bird watchers to publish sighting
information. RBN uses pre-defined hashtags for defined regions of the United Kingdom. The
hashtag codes are available at http://www.rarebirdnetwork.co.uk/p/hashtag-codes.html
with a visual representation available via Google Maps at https://www.google.com/maps/d/viewer?mid=1Aewtew3_0oKzsBFtTGed2qvZ-pU&msa=0&ie=UTF8&t=m&ll=55.56592200000001%2C-3.6254880000000184&spn=11.201913%2C15.402832&z=6