Forensic linguistics with Apache Spark

Kostas Perifanos @k_perifanos

Idiolect, sociolect, intertextuality

- Idiolect: individual’s distinctive and unique use of language

- Sociolect: variety of language associated with a social group (socioeconomic, ethnic, age)

- Intertextuality: the shaping of a text’s meaning by another text

Forensic Linguistics

“Forensic linguistics, legal linguistics, or language and the law, is the application of linguistic knowledge, methods and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of applied linguistics.” [Wikipedia]

- Authorship Attribution

- Authorship Identification

- Gender/age classification, etc.

Dataset

- 8M tweets between 18/06/2015 and 06/08/2015

- 92m words (white space tokenized)

- 190K users

- Key events during this period:

  - Referendum announcement

  - Capital controls

  - Referendum vote

Toolset

- Apache Spark 1.6.1

- DataFrames / Spark SQL

- Word2vec, KMeans

- Apache Zeppelin

- Gephi

Basic Data Exploration - Counting

Check for trends:

- Lowercase vs Uppercase ratios

- Relative frequencies of important (propaganda) words

- Average text length (per day)

- Average word length (per day)
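The counting checks listed above can be sketched in a few lines. This is a plain-Python illustration rather than the Spark DataFrame code used for the talk; the function name `daily_stats`, the `(day, text)` input shape, the `watch_words` parameter and the sample tweets are all assumptions made for the example.

```python
from collections import defaultdict

def daily_stats(tweets, watch_words=()):
    """Aggregate per-day counting statistics from (day, text) pairs."""
    acc = defaultdict(lambda: {"tweets": 0, "chars": 0, "words": 0,
                               "word_chars": 0, "lower": 0, "upper": 0,
                               "watched": 0})
    watch = {w.lower() for w in watch_words}
    for day, text in tweets:
        tokens = text.split()  # whitespace tokenization
        a = acc[day]
        a["tweets"] += 1
        a["chars"] += len(text)
        a["words"] += len(tokens)
        a["word_chars"] += sum(len(t) for t in tokens)
        a["lower"] += sum(c.islower() for c in text)
        a["upper"] += sum(c.isupper() for c in text)
        a["watched"] += sum(t.lower() in watch for t in tokens)

    stats = {}
    for day, a in acc.items():
        stats[day] = {
            # lowercase characters per uppercase character
            "lower_upper_ratio": a["lower"] / a["upper"] if a["upper"] else float("inf"),
            # average tweet length in characters
            "avg_text_length": a["chars"] / a["tweets"],
            # average token length in characters
            "avg_word_length": a["word_chars"] / a["words"] if a["words"] else 0.0,
            # share of tokens that match one of the watched words
            "watched_rel_freq": a["watched"] / a["words"] if a["words"] else 0.0,
        }
    return stats

# Made-up sample data, for illustration only.
sample = [
    ("2015-06-27", "Referendum ANNOUNCED tonight"),
    ("2015-06-27", "what now"),
    ("2015-06-29", "BANKS closed today"),
]
stats = daily_stats(sample, watch_words=["referendum"])
```

Plotting each of these values per day is then enough to spot a shift around a key event.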

Counting - lowercase / uppercase ratio

Counting - Propaganda

Similarities & user interactions

- Build a word2vec model, treating @mentions as vocabulary words

- Find the top-N “synonyms” of each seed account and keep those starting with “@”:

  - @handle1: @handle2, @handle3, ...

  - @handle32: @handle5, @handle3, ...

- Visualize the resulting graph
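The step from seed accounts to graph edges can be sketched as follows. The sketch assumes the embedding vectors already exist (the talk trained them with word2vec); the toy vectors, the function names and the use of cosine similarity for the “synonym” ranking are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mention_neighbours(vectors, seed, top_n):
    """Take the top-N tokens most similar to `seed`, then keep only @handles."""
    scored = [(cosine(vectors[seed], vec), tok)
              for tok, vec in vectors.items() if tok != seed]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tok for _, tok in scored[:top_n] if tok.startswith("@")]

def build_edges(vectors, seeds, top_n):
    """One (seed, neighbour) edge per retained @handle."""
    return [(seed, other)
            for seed in seeds
            for other in mention_neighbours(vectors, seed, top_n)]

# Made-up 2-d vectors; a real model would have hundreds of dimensions.
toy_vectors = {
    "@handle1": [1.0, 0.0],
    "@handle2": [0.9, 0.1],
    "@handle3": [0.8, 0.3],
    "@handle5": [0.0, 1.0],
    "banks": [1.0, 0.05],
}
edges = build_edges(toy_vectors, ["@handle1"], top_n=3)
```

Here the ordinary word `banks` is among the three nearest neighbours of `@handle1` but is dropped by the “@” filter, leaving two edges. The edge list can then be loaded into a graph tool for layout and community detection.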

Similarities & interactions graph [Gephi]

Gephi: Modularity analysis, 9 communities detected

Communities:

- “Yes”, black

- “No”, magenta

- media, red

- celebrities, dark green

- “Romantic twitter”, orange

- ....

2. Idiolect: Style signatures

- Choose the top N most frequent words [1]

- Build frequency vectors over those words for all users

- Compare user signatures (e.g. cosine similarity)

- Identified a double-account user among 180K candidates (so much for anonymity)

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694980/
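A minimal sketch of the signature comparison, in plain Python rather than Spark: each user is reduced to the relative frequencies of the corpus-wide top-N words, and user pairs are ranked by cosine similarity. The function names and the toy texts are assumptions made for the example.

```python
import math
from collections import Counter

def top_words(texts_by_user, n):
    """The n most frequent tokens across all users (ties broken alphabetically)."""
    counts = Counter()
    for texts in texts_by_user.values():
        for text in texts:
            counts.update(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

def signature(texts, vocab):
    """Relative frequency of each vocabulary word in one user's texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ranked_pairs(texts_by_user, n_words):
    """All user pairs, most similar signatures first."""
    vocab = top_words(texts_by_user, n_words)
    sigs = {user: signature(texts, vocab) for user, texts in texts_by_user.items()}
    users = sorted(sigs)
    pairs = [(cosine(sigs[a], sigs[b]), a, b)
             for i, a in enumerate(users) for b in users[i + 1:]]
    return sorted(pairs, reverse=True)

# Made-up texts; "user_a_alt" plays the role of a second account.
toy = {
    "user_a": ["the banks are closed and the queue is long"],
    "user_a_alt": ["the queue is long and the banks are closed"],
    "user_b": ["vote no no no"],
}
pairs = ranked_pairs(toy, n_words=5)
```

Comparing every pair is quadratic in the number of users, so at the scale of the dataset the comparison would need to be distributed or restricted to candidate subsets.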

Sociolect: Clustering

- Apply clustering on signature vectors

- KMeans on signatures

- KMeans on word2vec vectors:

  - Transform words to vectors, sum and average

- Also works very well for metaphor detection
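The “sum and average” step followed by KMeans can be sketched as below. The talk used Spark's Word2vec and KMeans; this stand-in is plain Python with a basic Lloyd's iteration and a deterministic initialisation (the first k points), both chosen only to keep the example small and repeatable. The toy word vectors and all names are assumptions.

```python
def average_vector(tokens, word_vectors):
    """Sum the vectors of the known tokens and divide by how many were found."""
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    if not found:
        return None  # no known token, nothing to average
    return [sum(col) / len(found) for col in zip(*found)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up 2-d word vectors and per-user token lists.
word_vectors = {
    "banks": [1.0, 0.0], "queue": [0.9, 0.1],
    "love": [0.0, 1.0], "moon": [0.1, 0.9],
}
user_tokens = [
    ["banks", "queue"],
    ["love", "moon"],
    ["queue", "banks", "banks"],
    ["moon", "love", "unknown"],
]
points = [average_vector(tokens, word_vectors) for tokens in user_tokens]
assignment, centroids = kmeans(points, k=2)
```

On this toy input the first and third users share one cluster and the second and fourth share the other.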

Intertextuality: LDA + signatures

- A user generates texts by sampling from a number of topics

- “Similar” users will tend to have similar topic distributions

- Given a subset of similar users, identify the most influential one, e.g. the user who enforces the writing style. [But that’s another presentation :)]

Challenges

“Random events”

Opinion shifting: People change their opinions and their writing styles accordingly. Social media tends to amplify this behaviour [one more presentation :)]
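Once a topic model has produced a topic distribution per user, “similar topic distributions” can be made concrete with a distance between probability vectors. The slides do not name a measure; the sketch below uses the Hellinger distance as one reasonable choice, and the per-user distributions are made-up numbers rather than LDA output.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def nearest_users(topic_dists, user, top_n=1):
    """The top_n users whose topic distributions are closest to `user`'s."""
    others = [(hellinger(topic_dists[user], dist), name)
              for name, dist in topic_dists.items() if name != user]
    return [name for _, name in sorted(others)[:top_n]]

# Hypothetical per-user topic distributions over three topics.
topic_dists = {
    "user_a": [0.7, 0.2, 0.1],
    "user_b": [0.6, 0.3, 0.1],
    "user_c": [0.1, 0.1, 0.8],
}
closest = nearest_users(topic_dists, "user_a")
```

The resulting neighbour sets give the “subset of similar users” within which an influential account could then be sought.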

Next steps

- User - Topic classification

- Gender classification

- Personality, stress, anxiety, etc.

- Try Deep Learning approaches

Thank you!

Questions?

@k_perifanos - http://github.com/kperi
