Artificial Intelligence for Policing


Oxford Internet Institute 23rd November 2017


Presenting: Miriam Fernandez, Knowledge Media Institute

Lots of other faces behind this work! @miriam_fs

fernandezmiriam

@miriamfs


What do you think when you hear “Artificial Intelligence for Policing”?


https://www.youtube.com/watch?v=lG7DGMgfOb8 (2002 movie)


https://www.facebook.com/financialtimes/videos/vb.8860325749/10155507438890750/?type=2&theater Financial Times 2017


Drones!

Autonomous Weapons!

Surveillance!


Washington Post (October 2016)

Three lines of work presented in this talk

•  Policing Engagement via Social Media

•  Detecting Grooming Behaviour on Social Media

•  Radicalisation detection on Social Media

Policing Engagement via Social Media

Miriam Fernandez, Tom Dickinson, and Harith Alani. "An analysis of UK policing engagement via social media." International Conference on Social Informatics. Springer International Publishing, 2017.

Miriam Fernandez, A. Elizabeth Cano, and Harith Alani. "Policing engagement via social media." International Conference on Social Informatics. Springer International Publishing, 2014.


•  Policing organisations use social media to spread the word on crime, severe weather, missing people, …

•  Many forces have staff dedicated to this purpose and to improve the spreading of key messages to wider social media communities

•  Research shows that exchanges between police and citizens are infrequent

Goal

•  Understand what attracts citizens to social media policing content
   –  What are the characteristics of the content that generates higher attention levels?
      •  Writing style
      •  Time of posting
      •  Topics
   –  Help police forces to identify actions and recommendations to increase public engagement

Context: UK Policing

Corporate | Non-corporate

Understanding Engagement

•  Social media engagement has been studied
   –  Through multiple lenses (marketing, social sciences, computer science)
   –  In multiple scenarios (product selling, elections, campaigns, etc.)
•  Study the literature of social media engagement
   –  [Ariely] Very clear message with a very concrete action
      •  Patrol, missing persons, incidents, emergencies, local authorities? What can/should I do?
   –  [Vaynerchuk] Need to differentiate each social medium (context)
      •  What happens in the world? To whom is the message targeted?
•  Study the literature of social media police engagement
   –  Works mainly focus on studying the different social media strategies that police forces use to interact with the public
      •  [Denef] UK Riots 2011. Instrumental vs. expressive approach

Barriers of Social Media Police Engagement (I)

•  Legitimacy: the police need the trust and confidence of the communities they serve

Barriers of Social Media Police Engagement (II)

•  Reputation

•  Official communication channels (911)

•  Surveillance

•  Variety of topics

•  Budget

Approach (I)

•  Data Collection
   –  154,679 posts from 48 corporate Twitter accounts
   –  1,300,070 posts from 2,450 non-corporate Twitter accounts
   –  January 2017
•  Engagement Indicators
   –  Retweets
      •  % of tweets retweeted
      •  Average number of retweets per tweet
   –  Favourites (likes)
      •  % of tweets favourited (liked)
      •  Average number of likes per tweet
   –  Replies
      •  At the time of analysis, the Twitter API did not allow replies to be collected per tweet
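The retweet and like indicators listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the `Tweet` record, its field names and the sample accounts are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tweet:
    account: str   # hypothetical field names, not the Twitter API schema
    retweets: int
    likes: int


def engagement_indicators(tweets):
    """Per account: share of tweets retweeted/liked and mean counts per tweet."""
    by_account = defaultdict(list)
    for tweet in tweets:
        by_account[tweet.account].append(tweet)

    indicators = {}
    for account, items in by_account.items():
        n = len(items)
        indicators[account] = {
            "pct_retweeted": sum(t.retweets > 0 for t in items) / n,
            "avg_retweets": sum(t.retweets for t in items) / n,
            "pct_liked": sum(t.likes > 0 for t in items) / n,
            "avg_likes": sum(t.likes for t in items) / n,
        }
    return indicators


# Made-up sample data for two hypothetical accounts.
sample = [Tweet("force_a", 0, 1), Tweet("force_a", 4, 0), Tweet("force_b", 2, 2)]
result = engagement_indicators(sample)
```

Grouping by account first keeps the four indicators per account in one pass, which matches how the per-account charts in the following slides are organised.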

Just for some fun! :) How am I doing?

Engagement Indicators (I)

•  Most accounts have more than 60% of tweets retweeted
   –  Top 5: MET, Nottinghamshire, Northumbria, Northamptonshire, Cumbria

[Figure: bar chart of the % of tweets retweeted per corporate account (y-axis ticks 0 to 1.2). Accounts on the x-axis: northumbriapol, nottspolice, JerseyPolice, DurhamPolice, NYorksPolice, Cumbriapolice, swpolice, policescotland, SuffolkPolice, DC_Police, CityPolice, NWPolice, StaffsPolice, HertsPolice, NCA_UK, ClevelandPolice, HantsPolice, Humberbeat, kent_police, DyfedPowys, gwentpolice, CambsCops, LancsPolice, leicspolice, WMerciaPolice, cheshirepolice, sussex_police, warkspolice, WMPolice, PoliceServiceNI, EssexPoliceUK, ThamesVP, NorthantsPolice, bedspolice, metpoliceuk, NorfolkPolice, Glos_Police, ASPolice, dorsetpolice, wiltshirepolice, WestYorksPolice, lincspolice, MerseyPolice, SurreyPolice, gmpolice, iompolice, syptweet, DerbysPolice]

Engagement Indicators (II)

•  Most accounts receive on average 10 retweets per tweet
   –  Top 5: MET, Jersey, National Crime Agency, West Midlands, Scotland

[Figure: bar chart of the average number of retweets per tweet for each corporate account (y-axis ticks 0 to 70); the x-axis lists the corporate police Twitter accounts, from northumbriapol to DerbysPolice]

Engagement Indicators (III)

•  Some organisations retweet from others rather than originating discussions
   –  Northumbria, Nottinghamshire, Jersey, Durham, North Yorkshire

[Figure: bar chart of the ratio of non-original tweets for each corporate account (y-axis ticks 0 to 0.7); the x-axis lists the corporate police Twitter accounts, from northumbriapol to DerbysPolice]

Non-Corporate accounts (I)

•  50% of the accounts have more than 60% of tweets retweeted

•  Top 47 accounts have a higher ratio of retweets than corporate organisations (around 80%)

[Figure: % of tweets retweeted per non-corporate account (y-axis ticks 0 to 1)]

Approach (II)

•  Feature Extractors
   –  Describe tweets in terms of their characteristics
   –  Content Features
      •  Length / Readability / Informativeness / Complexity / Sentiment
      •  Media / mentions / hashtags / URLs
      •  Time in the day
   –  User Features
      •  Network: In-degree / out-degree
      •  Activity: Post count / post rate / age in the system
   –  Semantic Features
      •  Use knowledge bases to extract entities and concepts
         –  Persons / Organisations / Locations
•  Use Machine Learning techniques to determine the characteristic patterns of those tweets receiving higher engagement levels
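As a rough illustration of the content features listed above, the sketch below extracts a few of the simpler ones (length, mentions, hashtags, URLs, media flag, hour of posting) from a single tweet. It is a minimal sketch under stated assumptions: the function name, the feature names and the regular expressions are invented for the example, and readability, informativeness, complexity and sentiment are deliberately left out because the slides do not say how they were computed.

```python
import re
from datetime import datetime

# Simple patterns; assumed for illustration, not taken from the study.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def content_features(text, posted_at, has_media=False):
    """Describe one tweet by a handful of surface-level content features."""
    words = URL_RE.sub("", text).split()  # ignore URLs when counting words
    return {
        "length_chars": len(text),
        "length_words": len(words),
        "n_mentions": len(MENTION_RE.findall(text)),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_urls": len(URL_RE.findall(text)),
        "has_media": int(has_media),
        "hour_of_day": posted_at.hour,
    }


# Made-up example tweet.
features = content_features(
    "Road closed on A5 near #MiltonKeynes, please avoid. Thanks @mkcouncil https://example.org/x",
    datetime(2017, 1, 12, 8, 30),
    has_media=True,
)
```

Feature dictionaries of this kind can then be turned into vectors and passed to any supervised learner to look for patterns that separate high-engagement from low-engagement tweets.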

Results (I)

•  Tweets receiving higher engagement are:
   –  Longer, easier to read, more informative, lower in complexity (avoid complex terms), and include media items (images, videos).
   –  In terms of user features, they tend to be posted by accounts with a high number of followers (corporate) or with a high post rate and a high in-out degree ratio (non-corporate).

[Figure: four box plots comparing the neg and pos tweet classes on length, readability, informativeness and polarity]

Results (II)

•  Tweets receiving higher engagement
   –  Talk about weather / roads and infrastructures / events / missing persons
   –  Raise awareness (domestic abuse, hate crime, modern slavery)
   –  Tend to mention locations
•  Tweets receiving lower engagement talk about
   –  Crime updates, such as burglary, assault or driving under the influence of alcohol
   –  Following requests (#ff)
   –  Advice on staying safe

Results (III)

•  Non-corporate accounts generate on average higher engagement

–  Offer help, ask for help, advise on local issues, reassure safety, etc. (#wearehereforyou)

•  Three additional ingredients

–  They retweet messages about relevant events and popular users

–  They engage more closely with the communities (direct messages and mentions to citizens)

–  They are fun!

Engagement Guidelines

•  Focus
   –  Consider the key goal to achieve and the audience to engage (general public, local communities, teenagers), and provide a clear message with a concrete set of actions associated with it
•  Be clear
   –  Complex messages with police jargon are difficult to understand. Messages should be simple, informative and useful. Use images/videos and humour to enhance dissemination
•  Interact
   –  Engage with the communities rather than only broadcast. Identify highly engaging police staff members and community leaders and involve them
•  Stay active
   –  Engagement is a long-term commitment. Accounts that have been active for longer receive higher engagement.
•  Be respectful
   –  Reputation and legitimacy are extremely important. Post polite, safe and respectful content

Detecting Grooming Behaviour on Social Media

Cano, A. E.; Fernandez, M.; and Alani, H. (2014). Detecting child grooming behaviour patterns on social media. The 6th International Conference on Social Informatics (SocInfo), Barcelona, Spain.

Slides provided by Harith Alani (Professor of Web Science, Knowledge Media Institute) @halani



Child Grooming

Premeditated behaviour intending to secure the trust of a minor as a first step towards future engagement in sexual conduct.

Choo, K-K R. Responding to online child sexual grooming: an industry perspective, Trends & issues in crime and criminal justice, no. 379. July 2009



Claire Lilley, Ruth Ball, Heather Vernon, The experiences of 11-16 year olds on social networking sites, NSPCC 2014

“findings show that approximately 190,000 UK children (1 in 58) will suffer contact sexual abuse by a non-related adult before turning 18, with approximately 10,000 new child victims of contact sexual abuse being reported in the UK each year.”



“50% of all 11 and 12 year-olds in the UK use a social networking site, according to our research. This is because it's easy for children to access sites intended for older users.”

https://www.nspcc.org.uk/preventing-abuse/keeping-children-safe/share-aware/



https://www.statista.com/statistics/271348/facebook-users-in-the-united-kingdom-uk-by-age/



Children’s use of mobile phones - A special report 2014. http://www.gsma.com/publicpolicy/wp-content/uploads/2012/03/GSMA_Childrens_use_of_mobile_phones_2014.pdf



https://www.thinkuknow.co.uk/parents/articles/Online-grooming/

Online Grooming



https://www.thinkuknow.co.uk/14_plus/Need-advice/Online-grooming/

Signs of Online Grooming



Predator: hey whats up?… Predator: I like your pic, very cute Predator: so you're in san diego? 13-yr-old-girl: not far Predator: ok, you like older guys? 13-yr-old-girl: thers nice or bad ppl all ages Predator: have some pics if you want to see Predator: do your parents look on your computer? Predator: so are you by yourself or is someone else there with you? Predator: so it should just be us, our little secret Predator: so have you ever snuck out? 13-yr-old-girl: not rlly lol Predator: yeah, what about tonight? Predator: think you could sneak out tonight? Predator: well if the wrong person found out then I'd be screwed 13-yr-old-girl: im not a teller lol Predator: I know, just wouldn't want your dad to find out Predator: if you are still up why not sneak out for a few minutes Predator: but that's the fun of it 13-yr-old-girl: fun to sneak? Predator: yes Predator: so your dad doesn't know Predator: would take a nap but I leave for bible study around 6:30 Predator: I know I'm bad, going to bible study and talking about sex with you Predator: yeah, there's nothing wrong with us being friends, we have the same lord remember ;) Predator: would take me like an hour and a half to get there Predator: see you in a little while

~700 messages

Over a 5 month period

Grooming in Action



Olson, L. N., Daggs, J. L., Ellevold, B. L. and Rogers, T. K. K. (2007), Entrapping the Innocent: Toward a Theory of Child Sexual Predators’ Luring Communication. Communication Theory, 17: 231–251

Olson’s Luring Communication Theory (LCT)




Approach

Grooming

Trust Development

Isolation

Physical Approach

Physical Approach


Can we automatically identify these stages?



“think you could sneak out tonight?”

Grooming | Trust Development | Physical Approach | other

Automatic Classifiers

Yes No No No

Identifying Grooming Stages



Dataset

•  50 transcripts of conversations between convicted predators and volunteers who posed as minors

•  Conversations range from 83 to 12K lines.

•  Each predator line was manually labelled by two annotators.

•  Annotation labels: 1) Trust development, 2) Grooming, 3) Seek physical approach, 4) Other.

Trust Dev.   Grooming   Phys. Approach   Other
1225         3304       2700             3304    sentences



Processing Chat Text

•  Challenges in processing chat-room conversations
   –  Use of irregular and ill-formed words
   –  Use of chat slang and teen-lingo
   –  Use of emoticons

Generated a list of over 1K terms and definitions:

Chat term    Translation                                      Emoticon    Translation
ASLP         Age, sex, location, picture                      :’-(        I’m crying
AWGTHTHTTA   Are we going to have to go through this again?   o/\o        High five
BRB          Be right back                                    @_@         I’m tired, trying to stay awake
CWOT         Complete waste of time                           ( ‘}{‘ )    kiss
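A term list like the one above lends itself to a simple dictionary-based normalisation step before feature extraction. The sketch below is an assumption about how such a lookup could be applied, not the authors' pipeline: the function name is hypothetical, only entries shown on the slide are included, and the emoticons are written with ASCII apostrophes.

```python
import re

# A few entries from the slide; the full list has over 1K terms.
CHAT_TERMS = {
    "aslp": "age, sex, location, picture",
    "awgththtta": "are we going to have to go through this again?",
    "brb": "be right back",
    "cwot": "complete waste of time",
}
EMOTICONS = {
    ":'-(": "I'm crying",
    "o/\\o": "high five",
    "@_@": "I'm tired, trying to stay awake",
}


def normalise(message):
    """Expand known emoticons and chat abbreviations into plain English."""
    # Emoticons first: they are punctuation runs that word matching would miss.
    for emoticon, meaning in EMOTICONS.items():
        message = message.replace(emoticon, f" {meaning} ")

    def expand(match):
        word = match.group(0)
        return CHAT_TERMS.get(word.lower(), word)  # unknown words stay as-is

    expanded = re.sub(r"[A-Za-z]+", expand, message)
    return " ".join(expanded.split())  # collapse repeated whitespace


example = normalise("brb @_@")
```

Unknown slang is passed through unchanged, so the step can be run safely on every line before n-gram or sentiment features are computed.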



Analysis Features and Results

Feature                      Description
N-gram                       word combinations extracted from text (N=1,2,3)
Part-of-speech tagging       noun, verb, adjective, plural, etc.
Sentiment                    average sentiment of terms in sentence
Length                       number of words in sentence
Psycho-linguistic patterns   62 psycho-linguistic patterns in English (swearing, sexual, agreement, etc.)
Semantic frames              type of event, relation, or entity in text, e.g., secrecy, desirability, emotion, kinship

Results - with all features:

            Trust Development   Grooming   Phys. Approach   Average
Precision   79.2%               87.6%      87.2%            84.7%
Recall      82.3%               88.8%      88.7%            86.6%
F1          80.7%               88.2%      87.9%            85.6%
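For readers less familiar with the metrics, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes F1 from the precision and recall values on the slide; the function and variable names are made up for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, F1) in percent, as reported on the slide.
reported = {
    "Trust Development": (79.2, 82.3, 80.7),
    "Grooming": (87.6, 88.8, 88.2),
    "Phys. Approach": (87.2, 88.7, 87.9),
    "Average": (84.7, 86.6, 85.6),
}

for stage, (p, r, f) in reported.items():
    print(f"{stage}: recomputed F1 = {f1(p, r):.1f}, reported F1 = {f}")
```

The recomputed values agree with the reported F1 column to one decimal place.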

Radicalisation detection on Social Media

Saif H., Fernandez M., Dickinson T., Kastler L. & Alani H. A Semantic Graph-based Approach for Radicalisation Detection on Social Media. ESWC 2017.

Saif H., Fernandez M., Rowe M. & Alani H. On the Role of Semantics for Detecting pro-ISIS stances on social media. ISWC 2016.

Rowe M. & Saif H. Mining Pro-ISIS Radicalisation Signals from Social Media Users. ICWSM 2016. Nominated for best paper award!

Slides by Hassan Saif



Online Radicalisation

•  The process by which individuals are introduced to ideological messages and belief systems that encourage movement from mainstream beliefs toward extreme views, primarily through the use of online media [International Assoc of Chiefs of Police and United States of America]



Islamic State in Iraq and Syria (ISIS)

Social Media Propaganda & Recruiting



ISIS on Social Media



Research Questions and Objectives

•  RQ1: How can we detect when a user has adopted a pro-ISIS stance?

•  RQ2: What happens to Twitter users before and after they exhibit radicalised behaviour?

•  RQ3: What influences users to adopt pro-ISIS language?

•  RQ4: Can we automatically identify users that have adopted pro- vs. anti-ISIS stances?



Data Collection and Analysis

Kurdish

Jihadist

Pro-Assad

Secular/Moderate

Fig. 1: Syrian account network (652 nodes, 3,260 edges). Four major categories: Jihadist (gold, right), Kurdish (red, top), Pro-Assad (purple, left), and Secular/Moderate opposition (blue, center). Black nodes are members of multiple communities. Visualization was performed with the OpenOrd layout in Gephi.

contrast with the polarization analyzed in certain studies of mainstream political activism [3], [10], the three communities selected consist of two polar opposites, jihadist and secular revolutionary, with the third community considerably moderate in comparison. The analysis process includes the generation of rankings of the preferred YouTube channels for each community, where these channels and corresponding Freebase topics assigned by YouTube are used to assist interpretation while also providing a certain level of validation². We also consider online activity surrounding “real world” events, such as YouTube video responses to the Ghouta chemical weapon attack on 21 August 2013 [11]. The insights revealed in this study confirm that alternative analytical approaches can play a key role in studies of online activity where prior knowledge may be scarce or unreliable.

ANALYZING ONLINE POLITICAL ACTIVISM

In this paper, we consider online activity associated with the Syria conflict within the context of other studies of online political activism that have focused upon relatively static, often mainstream groupings about which a considerable level of prior knowledge is available. This includes situations featuring a polarization effect, or others where multiple groupings are in existence. For example, the study of US liberal and conservative blogs by Adamic and Glance [3] found clear separation between both communities, with noticeable behavioral differences in terms of network density based on links between blogs, blog content itself, and interaction with mainstream media. They did not focus on “other” blogs, such as those of a libertarian, independent or moderate nature (and found few references to these from the liberal and conservative blogs), but suggested that they could be considered in future analysis. Progressive and conservative polarization on Twitter was investigated by Conover et al., where hashtags were used to gather data leading to two network representations based on Twitter retweets and mentions [10]. By specifically requesting the detection of exactly two communities, polarization was clearly observable in the retweet network. This was not the case with analogous two-community detection within the corresponding mentions network, where the authors suggested that this feature may foster cross-ideological interactions of some nature. In both cases, increasing the number of target communities beyond two revealed smaller politically heterogeneous communities rather than those of a more fine-grained ideological structure.

² http://www.freebase.com/

Mustafaraj et al. analyzed the vocal minority (prolific tweeters) and silent majority (accounts that tweeted only once) within US Democrat and Republican Twitter supporters, gathering data by searching for tweets containing the names of two Massachusetts senate candidates [12]. They also found similar polarized retweet communities in the vocal minority, while at the same time, the activity of both of these communities was consistently different to the silent majority at the opposite end of the spectrum. The machine learning framework proposed by Pennacchiotti and Popescu for the classification of Twitter accounts was evaluated using three gold standard data sets, including one associated with political affiliation that was generated from lists of users who classified themselves as either Democrat or Republican in the Twitter directories WeFollow and Twellow [13]. Similar political affiliation on Twitter was studied by Wong et al., where they proposed a method to quantify US political leaning that focused on tweets

O’Callaghan et al. 2014

625 Users

2.4M Users

154K EU Users

104M Tweets

English 43%

Arabic 41%

Others 16%



Identifying Signals of Radicalisation

Lexicon- and Network-based Approach

H1 – Sharing Incitement Material
H2 – Using Extremist Language

Radicalization Lexicon (example terms): دولة الخلافة (“Caliphate State”), ISIS, Shirk, Caliphate, Islamic State, ارهاب (“terrorism”)

25.5K Suspended ISIS Accounts



Activation Points (RQ1)

•  Increase in users activated between May 2014 and November 2014 coincides with execution of 6 hostages by ISIS and the videos of these executions posted via social media

•  The majority of users share content from pro-ISIS accounts before going on to post pro-ISIS terms themselves
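The notion of an activation point under H1 (first share from a pro-ISIS account) and H2 (first use of a lexicon term) can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (lists of dates and `(date, text)` pairs), the function names and the two-term stand-in lexicon are hypothetical, and simple substring matching is used in place of whatever matching the original study applied.

```python
from datetime import date

# Stand-in lexicon with two terms shown on the slides; the real one is far larger.
LEXICON = {"caliphate", "islamic state"}


def h1_activation(share_dates):
    """H1: date of the first share (retweet) from a known pro-ISIS account."""
    return min(share_dates) if share_dates else None


def h2_activation(tweets, lexicon=LEXICON):
    """H2: date of the first own tweet that contains a lexicon term."""
    hits = [day for day, text in tweets
            if any(term in text.lower() for term in lexicon)]
    return min(hits) if hits else None


def activation_gap_days(share_dates, tweets):
    """Days from H1 activation to H2 activation; positive if sharing came first."""
    a_h1 = h1_activation(share_dates)
    a_h2 = h2_activation(tweets)
    if a_h1 is None or a_h2 is None:
        return None  # user is not in the intersection of the H1 and H2 sets
    return (a_h2 - a_h1).days


# Made-up timeline for one user.
shares = [date(2014, 9, 3), date(2014, 6, 10)]
posts = [(date(2014, 5, 1), "traffic is bad today"),
         (date(2014, 7, 1), "news about the Caliphate")]
gap = activation_gap_days(shares, posts)
```

Collecting this gap over all users who are activated under both hypotheses gives the kind of distribution used to ask which behaviour tends to come first.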

Table 2: Significant events involving ISIS/ISIL and the West.

Date         Description
08-04-2013   ISIS expand into Syria
04-01-2014   Fallujah captured by ISIS
15-01-2014   ISIL retake Ar-Raqqah
01-05-2014   ISIS carry out public executions in Ar-Raqqah
09-06-2014   Mosul falls under ISIS control
02-09-2014   Hostage Steven Sotloff executed
13-09-2014   Hostage David Haines executed
22-09-2014   Hostage Samira Salih al-Nuaimi executed
03-10-2014   Hostage Alan Henning executed
07-10-2014   Abu Bakr al-Baghdadi injured in US air strike
16-10-2014   Hostage Peter Kassig executed
14-01-2015   Christopher Lee Cornell arrested for bomb plot
25-01-2015   Hostage Haruna Yukawa executed
31-01-2015   Hostage Kenji Goto executed
06-02-2015   Hostage Kayla Mueller killed in air strike
26-02-2015   Jihadi John is identified as Mohammed Emwazi
18-03-2015   ISIS responsible for Tunisia museum attack
15-05-2015   Abu Sayyaf killed by US special forces
30-06-2015   Alaa Saadeh arrested for attempts to aid ISIS
11-07-2015   Maher Meshaal killed in coalition air strike

ses. Figure 2(a) and Figure 2(b) show the number of users who are activated on each day according to each hypothesis. We note that the span of activations of H1 users is shorter than H2 users - as the former requires sharing content from banned or pro-ISIS accounts, while the latter looks at the use of pro-ISIS terms. One thing that is immediately apparent from the plots is that there is a large surge in activity from May 2014 onwards - for both H1 and H2 activations. To investigate why this surge occurs, we identified a series of key events related to ISIS/ISIL from 2013 onwards - these are shown in Table 2. As noted, the increase in activations between May 2014 and November 2014 coincides with execution of 6 hostages by ISIS and the videos of these executions posted via social media. Although we cannot discern causation (of activation) from correlation here, there does appear to be an association between such information appearing in the public domain (of executions) and users either sharing pro-ISIS content (Figure 2(a)) or adopting pro-ISIS language (Figure 2(b)).

In order to examine whether there was a link between users sharing content from pro-ISIS accounts (via retweeting) and then posting pro-ISIS content themselves, we derived the $\Delta(a_{h1} - a_{h2})$-distribution using all users that fall within the intersection of the H1 and H2 users’ sets. For each user in this intersection set ($u \in U_{H1} \cap U_{H2}$) we measured the difference (in days) between their H2 activation point ($a_{h2}$) - i.e. when they first post pro-ISIS rhetoric themselves - and their H1 activation point ($a_{h1}$) - i.e. when they first shared content from pro-ISIS accounts. Figure 2(c) presents the distribution of $\Delta(a_{h2} - a_{h1})$. We note that this distribution has a right skew indicating that the majority of users post pro-ISIS terms before then going on to share content from pro-ISIS accounts - note that we only have 64 users within intersection of H1 and H2 users.

Detecting Behaviour Divergence

Having detected the activation points of users within both the H1 and H2 hypotheses’ sets, we then moved on to examine what happens once users have become activated: RQ2: What happens to Twitter users before they exhibit radicalised behaviour, and also after such exhibition? As behaviour is a fairly abstract concept, we operationalise its measurement through three dimensions: (i) the lexical terms used by a user (i.e. non-stop word terms published in his/her tweets), (ii) the users whose content the user has shared (i.e. propagated through his network), and (iii) the users that the user has mentioned. Each dimension, which we refer to as lexical, sharing, and interactions respectively, in essence forms a discrete probability distribution that we can derive from a given half-closed time interval (i.e. $[t, t') : t < t'$). Each distribution is then derived from the relative frequency distribution of the user’s behaviour within the allotted time window: for instance, the lexical dimension’s distribution ($P^{L}_{[t,t')}$) is the relative frequency distribution of terms used within the user’s tweets within the time window.⁵ As we are dealing with both Arabic and English tweets, we ran a process of transliteration on the former to convert Arabic script to English unicode characters, thereby allowing for both languages to be handled using the same base language.

In order to examine whether a user’s behaviour has changed once activated we computed the relative entropy (aka. Kullback-Leibler/KL divergence) over three time windows. Each time window has a midpoint ($m$), this midpoint then forms the boundary from which a given behaviour dimension has two probability distributions computed (one before the midpoint, and one after the midpoint). Let $P_{[t,m)}$ denote the distribution prior to $m$, and $Q_{[m,t')}$ denote the distribution on and after $m$, then the relative entropy is computed using $P$ and $Q$ as follows:

$$H(Q \,\|\, P) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (1)$$

As mentioned above, we measured the relative entropy over three windows, these were as follows:

1. Activation Window: the midpoint ($m$) of the window is the given user’s activation point (i.e. $a_{h1}$ or $a_{h2}$), and we set the bounds of the window by going back $k$ days from $m$.

2. Pre-Control Window: the midpoint of the window is $2k$ days back from the activation point of the user, and the bounds are set to $[a - 3k, a - k)$.

3. Post-Control Window: the midpoint of the window is $2k$ days forward from the activation point of the user, and the bounds of the window are set to $[a + k, a + 3k)$.

Hence, our experimental setting provides three non-overlapping time windows over which we could compute the relative entropy of user behaviour (lexical, sharing, interactions). For users labelled as pro-ISIS by H1 and H2 we computed their three relative entropy values over the three

⁵ The sharing and interactions distributions are computed in the same manner, using the relative frequencies of users whose content is shared and users mentioned respectively.
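The relative entropy of Eq. (1) and the three time windows can be illustrated with a short sketch. Several details here are assumptions rather than the paper's method: add-one (Laplace) smoothing over a shared vocabulary is introduced so that no Q(i) is zero, the activation window is read as [a - k, a + k) so that it sits between the two control windows, days are represented as plain integers, and all names are invented for the example.

```python
import math
from collections import Counter


def distribution(tokens, vocabulary, alpha=1.0):
    """Relative-frequency distribution with add-alpha smoothing (an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def relative_entropy(p, q):
    """Eq. (1): sum over i of P(i) * log(P(i) / Q(i))."""
    return sum(p[i] * math.log(p[i] / q[i]) for i in p)


def windows(a, k):
    """Half-open day ranges around activation day a (integer day index)."""
    return {
        "pre_control": (a - 3 * k, a - k),
        "activation": (a - k, a + k),   # assumed reading of the activation window
        "post_control": (a + k, a + 3 * k),
    }


# Made-up terms used before and after a window's midpoint.
before = ["syria", "egypt", "syria"]
after = ["allah", "quran", "allah"]
vocab = set(before) | set(after)
p = distribution(before, vocab)
q = distribution(after, vocab)
divergence = relative_entropy(p, q)
```

A larger value for the activation window than for the two control windows would indicate that the user's language shifted around the activation point rather than drifting steadily.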



Behaviour Before/After Activation (RQ2)

•  Users exhibit a large divergence in their language once activated
   –  Before activation the majority of topics users discuss focus on politics, where words like Syria, Israel and Egypt are mentioned in a negative context and with high frequency
   –  After activation religious words (e.g. Allah, muslims, quran) become more popular.

Pre-Activation Activation Post-Activation



Influencing Pro-ISIS Term Adoption (RQ3)

•  We study the effect of
   –  Lexical Homophily: similarity in language
   –  Sharing Homophily: diffusion of information from the same accounts
   –  Interaction Homophily: common communications

Social dynamics play a strong role in term uptake. Subcommunities act as bridges between the radicalised user and the future adopter

pro-ISIS User Potential Adopter



Objective: detect sub-communities of users from whom radicalised content is shared

Detecting Pro-ISIS Subcommunities


625 Users

2.4M Users

154K EU Users

104M Tweets

Sharing Incitement Material

Using Extremist Language

566 pro-ISIS users 566 anti-ISIS users

Pro and anti-ISIS Stances (RQ4)

Tweets -> Conceptual Semantics Extraction -> Semantic Graph Representation -> Frequent Semantic Subgraph Mining -> Classifier Training

Knowledge source: DBpedia

Pipeline of detecting pro-ISIS stances using semantic sub-graph mining-based feature extraction

•  Extract and use the semantic interdependencies and relations between words to learn patterns of radicalisation.

Entities: ISIS, Syria

Concepts: Jihadist Group (ISIS), Country (Syria)

Semantic Relations: (Military Intervention Against ISIL, place, Syria)

Semantic Graph-based Approach for Pro-ISIS Stance Detection
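The idea of turning semantic relations into classifier features can be illustrated with a minimal sketch. It is a simplified stand-in, not the method from the slides: instead of full frequent subgraph mining it only counts single-edge patterns, i.e. (subject, relation, object) triples, that recur across users. The user names, the `type` relation, the helper names and the `min_support` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each user maps to (subject, relation, object) triples
# extracted from their tweets. The "type" relation is an assumption made here
# to link an entity to its concept; it does not come from the slides.
user_triples = {
    "user_a": [("ISIS", "type", "Jihadist Group"),
               ("Syria", "type", "Country"),
               ("Military Intervention Against ISIL", "place", "Syria")],
    "user_b": [("ISIS", "type", "Jihadist Group"),
               ("Syria", "type", "Country")],
    "user_c": [("Syria", "type", "Country")],
}

def frequent_edge_patterns(triples_by_user, min_support):
    """Return the triples that appear in at least min_support users' graphs."""
    support = Counter()
    for triples in triples_by_user.values():
        support.update(set(triples))  # count each pattern once per user
    return sorted(t for t, n in support.items() if n >= min_support)

def to_feature_vector(triples, patterns):
    """Binary vector: 1 if the user's graph contains the pattern, else 0."""
    present = set(triples)
    return [1 if p in present else 0 for p in patterns]

patterns = frequent_edge_patterns(user_triples, min_support=2)
features = {u: to_feature_vector(t, patterns) for u, t in user_triples.items()}
```

Each user ends up with one binary feature per frequent pattern, which could then be passed to any supervised classifier. Mining larger subgraphs would need a dedicated algorithm rather than this per-edge count.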

Per-stance classification performance of the five feature sets (bar chart)

Feature sets: Unigrams, Sentiment, Topics, Network, Semantics

Series: anti-ISIS, pro-ISIS

Value axis: 80 to 94

Data labels as extracted: 86.3, 86.3, 84.8, 86, 91.7, 84.4, 84.4, 81, 87.1, 92.8

radicalisation classification, i.e., classifying users in our dataset according to their stance as pro-ISIS or anti-ISIS. Hence, our experimental setup requires the selection of (i) an annotated dataset of Twitter users (pro-ISIS and anti-ISIS) together with their timelines, (ii) baseline features for cross-comparison and (iii) a supervised classification method. These elements are explained in the following subsections.

4.1 Dataset of pro-ISIS and anti-ISIS Twitter users

Our approach relies on a training dataset of 1,132 European Twitter users (together with their timelines) collected in our previous work [14]. In that work the pro-ISIS stance of 727 Twitter users was determined based on their sharing of incitement material from known pro-ISIS accounts and on their use of extremist language. By the time of conducting this research, 161 of these Twitter accounts had been suspended or had changed their privacy setting to protected, preventing us from accessing their profile information. As such, we removed them from the original set, resulting in 566 pro-ISIS users in total. To balance our dataset, we added 566 anti-ISIS users, whose stance is determined by the use of anti-ISIS rhetoric. Table 2 shows the total number and distribution of tweets and words for each user group. As we can observe, both the number of tweets and the number of words for anti-ISIS users are significantly higher than those for pro-ISIS users. We refer the reader to the body of our work [14] for more details about the construction and annotation of this dataset.

                                    pro-ISIS Users   anti-ISIS Users
Total number of Tweets              602,511          1,368,827
Average Number of Tweets per User   1,065            2,418
Total number of Words               3,945,815        9,375,841
Average Number of Words per User    6,971            16,570

Table 2: Statistics of the Twitter dataset used for evaluation
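The dataset sizes quoted in Section 4.1 can be checked with a few lines of arithmetic. The variable names below are illustrative only; the numbers are taken from the text.

```python
initial_pro_isis = 727   # users labelled pro-ISIS in the earlier work [14]
inaccessible = 161       # accounts suspended or switched to protected
pro_isis = initial_pro_isis - inaccessible
anti_isis = pro_isis     # the dataset is balanced with an equal number of anti-ISIS users
total_users = pro_isis + anti_isis
```

This reproduces the 566 users per group and the 1,132 users overall stated in the text.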

4.2 Baseline Features

Unigrams Features: Word unigrams are features traditionally used for various classification tasks on tweet data. For example, in the context of a sentiment analysis task, models trained from word unigrams were shown to outperform random classifiers by 20% [1]. We generate the user's unigram vector $t^{u}_{unig}$ as the vector $t^{u}_{unig} = (w_1, w_2, ..., w_m)$ of the words in the user's timeline. Note that stopwords, non-English words and special characters are removed from the timeline prior to building $t^{u}_{unig}$ in order to reduce its dimensionality.
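A minimal sketch of how such a unigram vector might be built is given below. It is an illustration under stated assumptions rather than the authors' implementation: the stopword list is a tiny hypothetical sample, special characters are dropped by keeping only runs of ASCII letters, and the removal of non-English words is not implemented.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

def unigram_vector(timeline):
    """Return the cleaned word tokens (w1, ..., wm) of one user's timeline."""
    words = []
    for tweet in timeline:
        # Keep runs of ASCII letters only, which drops digits, punctuation
        # and symbols such as '#'. Non-English word filtering is omitted.
        for token in re.findall(r"[a-z]+", tweet.lower()):
            if token not in STOPWORDS:
                words.append(token)
    return words

timeline = ["The news is in: #Syria talks resume!",
            "Reports of talks and more talks"]
vector = unigram_vector(timeline)
```

In practice the token list would usually be mapped to counts over a shared vocabulary so that all users have vectors of the same length.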

Sentiment Features: Sentiment features denote the sentiment orientation (positive, negative, neutral) of users in our dataset. The rationale behind using these features is that the sentiment conveyed by the users' posts may help discriminate between pro- and anti-ISIS stances. To extract these features for a given user u, we first extracted the sentiment orientation of each tweet in the user's timeline. To this end, we used SentiStrength [17], a lexicon-based sentiment detection method for the social web. To construct the sentiment vector $t^{u}_{sentiment}$ for user u, we augment the unigrams feature vector $t^{u}_{unig}$ with the extracted sentiment orientation of tweets as: $t^{u}_{sentiment} = (w_1, w_2, ..., w_m, p_{pos}, p_{neg}, p_{neu})$, where $p_{pos}$, $p_{neg}$ and $p_{neu}$ are the
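The excerpt is cut off before p_pos, p_neg and p_neu are defined. The sketch below assumes, purely for illustration, that they are the fractions of a user's tweets labelled positive, negative and neutral. It also replaces SentiStrength with a hypothetical toy lexicon, so it shows only the shape of the augmented vector, not the actual sentiment scoring.

```python
# Hypothetical toy lexicon standing in for a real sentiment tool.
POSITIVE = {"good", "great", "peace"}
NEGATIVE = {"bad", "attack", "war"}

def tweet_orientation(tweet):
    """Label one tweet as 'pos', 'neg' or 'neu' by counting lexicon hits."""
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neu"

def sentiment_vector(unigrams, timeline):
    """Append (p_pos, p_neg, p_neu) to the unigram features.

    Assumption: each p is the fraction of tweets with that orientation.
    """
    labels = [tweet_orientation(t) for t in timeline]
    n = len(labels) or 1  # avoid division by zero for an empty timeline
    proportions = [labels.count(k) / n for k in ("pos", "neg", "neu")]
    return list(unigrams) + proportions

timeline = ["peace talks are good", "another attack reported",
            "meeting at noon", "war is bad"]
vec = sentiment_vector(["peace", "talks"], timeline)
```

Under this assumption the three appended values always sum to 1 for a non-empty timeline.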

Results

Questions?
