Mapping Online Publics
Axel Bruns / Jean Burgess
ARC Centre of Excellence for Creative Industries and Innovation, Queensland University of Technology
a.bruns@qut.edu.au – @snurb_dot_info / je.burgess@qut.edu.au – @jeanburgess
http://mappingonlinepublics.net – http://cci.edu.au/
Gathering Data
• Keyword / #hashtag archives
– Twapperkeeper.com
• No longer fully functional
– yourTwapperkeeper
• Open source solution
• Runs on your own server
• Use our modified version to enable CSV / TSV export
• Uses Twitter streaming API to track keywords
– Including #hashtags, @mentions
Twapperkeeper / yourTwapperkeeper data
• Typical data format (#ausvotes):
Processing Data
• Gawk:
– Command-line tool for processing CSV / TSV data
– Can use ready-made scripts for complex processing
– Vol. 1 of our scripts collection now online at MOP
• Regular expressions (regex):
– Key tool for working with Gawk
– Powerful way of expressing search patterns
– E.g.: @[A-Za-z0-9_]+ = any @username
– See online regex primers...
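As a quick check, the @username pattern above can be tried out against a sample tweet with any regex-capable tool; here a hedged illustration using grep:

```shell
# extract every @username from a sample tweet using the pattern @[A-Za-z0-9_]+
# prints each match on its own line: @one, then @Two_3
echo "thanks @one and @Two_3 for the link" | grep -oE '@[A-Za-z0-9_]+'
```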
# atreplyfromtoonly.awk - Extract @replies for network visualisation
#
# this script takes a Twapperkeeper CSV/TSV archive of tweets, and reworks it into simple network data for visualisation
# the output format for this script is always CSV, to enable import into Gephi and other visualisation tools
#
# expected data format:
# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
#
# output format:
# from,to
#
# the script extracts @replies from tweets, and creates duplicates where multiple @replies are
# present in the same tweet - e.g. the tweet "@one @two hello" from user @user results in
# @user,@one and @user,@two
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au
BEGIN {
    print "from,to"
}
/@([A-Za-z0-9_]+)/ {
    a = 0
    do {
        match(substr($1, a), /@([A-Za-z0-9_]+)?/, atArray)
        a = a + atArray[1, "start"] + atArray[1, "length"]
        if (atArray[1] != 0) print tolower($3) "," tolower(atArray[1])
    } while (atArray[1, "start"] != 0)
}
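The same extraction logic can be sketched in portable POSIX awk (gawk's three-argument match() is a GNU extension); this is an illustrative reimplementation, not the MOP script itself, assuming the field layout above with the tweet text in $1 and the sender in $3:

```shell
# sample input row: text <TAB> to_user_id <TAB> from_user
printf '@one @two hello\t0\tSomeUser\n' | awk -F '\t' '
BEGIN { print "from,to" }
{
  s = $1
  # walk through the text, emitting one from,to edge per @mention
  while (match(s, /@[A-Za-z0-9_]+/)) {
    print tolower($3) "," tolower(substr(s, RSTART + 1, RLENGTH - 1))
    s = substr(s, RSTART + RLENGTH)
  }
}'
# prints: from,to / someuser,one / someuser,two
```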
Running Gawk Scripts
• Gawk command line execution:
– Open terminal window
– Run command:
#> gawk -F "\t" -f scripts\explodetime.awk input.tsv >output.tsv
– Arguments:
• -F "\t" = field separator is a TAB (quote it so the shell keeps the backslash; use -F , for CSV)
• -f scripts\explodetime.awk = run the explodetime.awk script
(adjust scripts path as required)
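The field-separator argument is worth testing in isolation; a minimal check with plain awk (used here in place of gawk for portability):

```shell
# the quotes matter: on Unix shells an unquoted \t is stripped to a plain t
# prints the third tab-separated field: someuser
printf 'some tweet text\t12345\tsomeuser\n' | awk -F "\t" '{ print $3 }'
```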
Basic #hashtag data: most active users
• Pivot table in Excel – ‘from_user’ against ‘count of text’
Identifying Time-Based Patterns
#> gawk -F "\t" -f scripts\explodetime.awk input.tsv >output.tsv
• Output:
– Additional time data:
• Original format + year,month,day,hour,minute
– Uses:
• Time series per year, month, day, hour, minute
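explodetime.awk itself is in the MOP scripts collection; as a rough sketch of the idea, the timestamp can be split into its components with plain awk. This assumes an ISO-style 'YYYY-MM-DD HH:MM:SS' value in the last field, which may differ from the archive's actual time format:

```shell
printf 'some tweet\t2011-01-01 13:45:22\n' | awk -F '\t' '{
  # split the timestamp on any run of non-digits: year, month, day, hour, minute, second
  split($NF, t, /[^0-9]+/)
  # append year,month,day,hour,minute to the original row
  print $0 "\t" t[1] "\t" t[2] "\t" t[3] "\t" t[4] "\t" t[5]
}'
```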
Basic #hashtag data: activity over time
• Pivot table – ‘day’ against ‘count of text’
Identifying @reply Networks
#> gawk -F "\t" -f scripts\atreplyfromtoonly.awk input.tsv >output.tsv
• Output:
– Basic network information:
• from,to
– Uses:
• Key @reply recipients
• Network visualisation
Basic #hashtag data: @replies received
• Pivot table – ‘to’ against ‘from’
Basic @reply Network Visualisation
• Gephi:
– Open source network visualisation tool – Gephi.org
– Frequently updated, growing number of plugins
– Load CSV into Gephi
– Run ‘Average Degree’ network metric
– Filter for minimum degree / indegree / outdegree
– Adjust node size and node colour settings:
• E.g. colour = outdegree, size = indegree
– Run network visualisation:
• E.g. ForceAtlas – play with settings as appropriate
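The CSV loaded into Gephi in the first step is simply the from,to edge list produced by the @reply script; for example (made-up usernames):

```csv
from,to
someuser,one
someuser,two
one,someuser
```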
Basic @reply Network Visualisation
• Degree = 100+, colour = outdegree, size = indegree
Tracking Themes (and More) over Time
#> gawk -F "\t" -f multifilter.awk search="term1,term2,..." input.tsv >output.tsv
term examples: (julia|gillard),(tony|abbott)
.?,@[A-Za-z0-9_]+,RT @[A-Za-z0-9_]+,http
• Output:
– Basic network information:
• Original format + term1 match, term2 match, ...
– Uses:
• Use on output from explodetime.awk
• Graph occurrences of terms per time period (hour, day, ...)
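multifilter.awk is also in the MOP collection; as a minimal reimplementation of the idea (not the actual script), each search term becomes one extra column, filled with the term when the tweet matches it:

```shell
printf 'RT @one check http://example.com\n' | awk -v search='(julia|gillard),RT @[A-Za-z0-9_]+,http' '
BEGIN { n = split(search, terms, ",") }
{
  out = $0
  # append one column per term: the term itself if the tweet matches, empty otherwise
  for (i = 1; i <= n; i++)
    out = out "\t" ($0 ~ terms[i] ? terms[i] : "")
  print out
}'
```

The per-term columns can then be pivoted per time period, as in the slide that follows.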
Tracking Themes over Time
• Pivot table – ‘day’ against keyword bundles, normalised to 100%
Dynamic @reply Network Visualisation
• Multi-step process:
– Make sure tweets are in ascending chronological order
– Use timeframe.awk to select period to visualise:
#> gawk -F , -f timeframe.awk start="2011 01 01 00 00 00" end="2011 01 01 23 59 59" tweets.csv >tweets-1Jan.csv
• start / end = start and end of period to select (YYYY MM DD HH MM SS)
– Use preparegexfattimeintervals.awk to prepare data:
#> gawk -F , -f preparegexfattimeintervals.awk tweets-1Jan.csv >tweets-1Jan-prep.csv
– Use gexfattimeintervals.awk to convert to Gephi GEXF format:
#> gawk -F , -f gexfattimeintervals.awk decaytime="1800" tweets-1Jan-prep.csv >tweets-1Jan.gexf
• decaytime = time in seconds that an @reply remains ‘active’, once made
• This may take some time...
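As a simplified stand-in for the timeframe.awk step, a plain awk filter can select the rows whose timestamp falls inside the period; this sketch assumes a Unix timestamp in the last column, whereas timeframe.awk takes the YYYY MM DD HH MM SS form shown above:

```shell
# keep only rows whose last field (assumed Unix timestamp) is inside the window
# 1293840000-1293926399 covers 2011-01-01 00:00:00 to 23:59:59 UTC
printf 'tweet a,1293840100\ntweet b,1293927000\n' | \
  awk -F , -v start=1293840000 -v end=1293926399 '$NF >= start && $NF <= end'
```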
http://mappingonlinepublics.net
@snurb_dot_info / @jeanburgess
Image by campoalto